
Data Tier

All Fine Tuning Studio metadata is persisted in a SQLite database at .app/state.db (configurable via FINE_TUNING_STUDIO_SQLITE_DB). The ORM layer uses SQLAlchemy declarative models defined in ft/db/model.py. Access is managed through FineTuningStudioDao in ft/db/dao.py.
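The lookup can be sketched as a one-line helper. The real `get_sqlite_db_location` lives in the project, so the function below is only an illustration of the documented behavior; the env var name and default path come from the paragraph above.

```python
import os

# Plausible sketch of the lookup; the real get_sqlite_db_location lives in the
# project, but the env var name and default path are documented above.
def get_sqlite_db_location() -> str:
    return os.environ.get("FINE_TUNING_STUDIO_SQLITE_DB", ".app/state.db")

os.environ["FINE_TUNING_STUDIO_SQLITE_DB"] = "/tmp/custom/state.db"
print(get_sqlite_db_location())  # /tmp/custom/state.db
```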

Schema Topology

The seven application tables form a hub-and-spoke graph: fine_tuning_jobs and evaluation_jobs reference models, datasets, prompts, adapters, and configs via String foreign keys, while adapters and prompts in turn point back at the entities they were derived from. The per-table schemas below list every relationship.

Table Schemas

All primary keys are String type (UUIDs assigned by domain logic). All columns are nullable except id. ORM classes are defined in ft/db/model.py.

models

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, cml) |
| framework | String | | Model framework identifier |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_model_name | String | | HuggingFace Hub model ID |
| location | String | | Local filesystem path |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |
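As a concrete illustration of the schema above, here is a minimal SQLAlchemy declarative mapping of the models table. Treat it as a sketch: the real class in ft/db/model.py also inherits the serialization mix-ins covered in the ORM Mix-ins section.

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

# Illustrative mapping of the `models` table; column names follow the table
# above, but the real class lives in ft/db/model.py.
class Model(Base):
    __tablename__ = "models"
    id = Column(String, primary_key=True)  # UUID assigned by domain logic
    type = Column(String)                  # e.g. "huggingface", "cml"
    framework = Column(String)
    name = Column(String)
    description = Column(String)
    huggingface_model_name = Column(String)
    location = Column(String)
    cml_registered_model_id = Column(String)
    mlflow_experiment_id = Column(String)
    mlflow_run_id = Column(String)

# Quick round-trip against an in-memory SQLite database:
engine = create_engine("sqlite+pysqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Model(id="abc-123", name="llama-2", type="huggingface"))
session.commit()
print(session.query(Model).one().name)  # llama-2
```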

datasets

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, local) |
| name | String | | Display name |
| description | Text | | Long-form description |
| huggingface_name | String | | HuggingFace Hub dataset ID |
| location | Text | | Local filesystem path |
| features | Text | | JSON string of dataset feature names |

adapters

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_name | String | | HuggingFace Hub adapter ID |
| model_id | String | FK -> models.id | Base model this adapter targets |
| location | Text | | Local filesystem path to adapter weights |
| fine_tuning_job_id | String | FK -> fine_tuning_jobs.id | Job that produced this adapter |
| prompt_id | String | FK -> prompts.id | Prompt template used during training |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |

prompts

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| type | String | | Prompt type |
| name | String | | Display name |
| description | String | | Human-readable description |
| dataset_id | String | FK -> datasets.id | Dataset this prompt is designed for |
| prompt_template | String | | Full prompt format string |
| input_template | String | | Input portion template |
| completion_template | String | | Completion portion template |

fine_tuning_jobs

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| base_model_id | String | FK -> models.id | Base model to fine-tune |
| dataset_id | String | FK -> datasets.id | Training dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| cml_job_id | String | | CML Job ID for tracking |
| adapter_id | String | FK -> adapters.id | Resulting adapter |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| num_epochs | Integer | | Training epochs |
| learning_rate | Double | | Learning rate |
| out_dir | String | | Output directory for adapter weights |
| training_arguments_config_id | String | FK -> configs.id | Training arguments config |
| model_bnb_config_id | String | FK -> configs.id | Model BitsAndBytes quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BitsAndBytes quantization config |
| lora_config_id | String | FK -> configs.id | LoRA hyperparameters config |
| training_arguments_config | String | | Serialized training arguments (snapshot) |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| lora_config | String | | Serialized LoRA config (snapshot) |
| dataset_fraction | Double | | Fraction of dataset to use |
| train_test_split | Double | | Train/test split ratio |
| user_script | String | | Custom user training script path |
| user_config_id | String | FK -> configs.id | Custom user config |
| framework_type | String | | Training framework (legacy, axolotl, etc.) |
| axolotl_config_id | String | FK -> configs.id | Axolotl YAML config |
| gpu_label_id | Integer | | GPU label selector |
| adapter_name | String | | Name assigned to the output adapter |

The fine_tuning_jobs table stores both config ID references (foreign keys to configs) and serialized config snapshots (plain string columns). This allows job records to remain self-describing even if the referenced config is later deleted.
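The dual-storage pattern can be illustrated with plain dictionaries. Field names follow the table above; everything else here is hypothetical.

```python
import json
import uuid

# Hypothetical illustration of the dual-storage pattern: the job row keeps
# both the FK to the configs row and a serialized copy of its content.
lora_config_row = {
    "id": str(uuid.uuid4()),
    "config": json.dumps({"r": 16, "lora_alpha": 32}),
}

job_row = {
    "id": str(uuid.uuid4()),
    "lora_config_id": lora_config_row["id"],   # FK; may dangle after deletion
    "lora_config": lora_config_row["config"],  # self-describing snapshot
}

# Even if the referenced configs row is later deleted, the snapshot remains:
del lora_config_row
print(json.loads(job_row["lora_config"])["r"])  # 16
```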

evaluation_jobs

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| type | String | | Evaluation type |
| cml_job_id | String | | CML Job ID for tracking |
| parent_job_id | String | | Parent fine-tuning job (if derived) |
| base_model_id | String | FK -> models.id | Model under evaluation |
| dataset_id | String | FK -> datasets.id | Evaluation dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| adapter_id | String | FK -> adapters.id | Adapter under evaluation |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| evaluation_dir | String | | Output directory for evaluation artifacts |
| model_bnb_config_id | String | FK -> configs.id | Model BnB quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BnB quantization config |
| generation_config_id | String | FK -> configs.id | Generation config for inference |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| generation_config | String | | Serialized generation config (snapshot) |

configs

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | String | PK, NOT NULL | UUID |
| type | String | | Config type (training_arguments, bnb, lora, generation, axolotl) |
| description | String | | Human-readable description |
| config | Text | | JSON or YAML content stored as string |
| model_family | String | | Model family this config targets |
| is_default | Integer | | 1 = shipped default, 0 = user-created |

ORM Mix-ins

All ORM model classes inherit from three bases: Base (SQLAlchemy declarative base), MappedProtobuf, and MappedDict. These mix-ins provide bidirectional serialization.

MappedProtobuf

Converts between protobuf messages and ORM instances.

# Protobuf message -> ORM instance
adapter_orm = Adapter.from_message(adapter_proto_msg)

# ORM instance -> Protobuf message
adapter_proto = adapter_orm.to_protobuf(AdapterMetadata)

from_message() uses ListFields() (protobuf >= 3.15) to extract only fields that were explicitly set in the message, avoiding default-value contamination. to_protobuf() iterates the ORM instance’s non-null columns and sets matching fields on a new protobuf message.
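A standalone sketch of the from_message pattern, using a stub class in place of a real protobuf message. The stub only mimics ListFields(), which yields (field_descriptor, value) pairs for explicitly set fields; the real mix-in operates on generated protobuf types.

```python
# Stub standing in for a generated protobuf message and its field descriptors.
class StubFieldDescriptor:
    def __init__(self, name: str):
        self.name = name

class StubMessage:
    def __init__(self, **fields):
        self._fields = fields

    def ListFields(self):
        # Like protobuf, yield only fields that were explicitly set.
        return [(StubFieldDescriptor(k), v) for k, v in self._fields.items()]

class MappedProtobuf:
    @classmethod
    def from_message(cls, msg):
        obj = cls()
        # Copy only explicitly-set fields, avoiding default-value contamination.
        for descriptor, value in msg.ListFields():
            setattr(obj, descriptor.name, value)
        return obj

# Hypothetical stand-in for the real Adapter ORM class:
class Adapter(MappedProtobuf):
    pass

adapter_orm = Adapter.from_message(StubMessage(id="a1", name="qlora-demo"))
print(adapter_orm.name)  # qlora-demo
```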

MappedDict

Converts between Python dictionaries and ORM instances.

# Dict -> ORM instance
model_orm = Model.from_dict({"id": "abc", "name": "llama-2"})

# ORM instance -> Dict (non-null fields only)
model_dict = model_orm.to_dict()
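A minimal sketch of the MappedDict idea using plain attributes; the real mix-in in ft/db/model.py inspects the SQLAlchemy-mapped columns instead, so this is only illustrative.

```python
# Illustrative mix-in: dict round-trip via plain instance attributes.
class MappedDict:
    @classmethod
    def from_dict(cls, data: dict):
        obj = cls()
        for key, value in data.items():
            setattr(obj, key, value)
        return obj

    def to_dict(self) -> dict:
        # Mirror the documented behavior: emit non-null fields only.
        return {k: v for k, v in vars(self).items() if v is not None}

# Hypothetical stand-in for the real ORM class:
class Model(MappedDict):
    pass

model_orm = Model.from_dict({"id": "abc", "name": "llama-2", "location": None})
print(model_orm.to_dict())  # {'id': 'abc', 'name': 'llama-2'}
```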

Table-Model Registry

ft/db/model.py exports two lookup dictionaries for programmatic table access:

TABLE_TO_MODEL_REGISTRY = {
    'datasets': Dataset,
    'models': Model,
    'prompts': Prompt,
    'adapters': Adapter,
    'fine_tuning_jobs': FineTuningJob,
    'evaluation_jobs': EvaluationJob,
    'configs': Config
}

MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}

These are used by the database import/export logic to iterate all application tables.

DAO

FineTuningStudioDao in ft/db/dao.py manages SQLAlchemy engine and session lifecycle.

Constructor

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

class FineTuningStudioDao:
    def __init__(self, engine_url=None, echo=False, engine_args={}):
        if engine_url is None:
            engine_url = f"sqlite+pysqlite:///{get_sqlite_db_location()}"
        self.engine = create_engine(engine_url, echo=echo, **engine_args)
        self.Session = sessionmaker(bind=self.engine, autoflush=True, autocommit=False)
        Base.metadata.create_all(self.engine)

The servicer instantiates the DAO with connection pool parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| pool_size | 5 | Persistent connections in the pool |
| max_overflow | 10 | Additional connections beyond pool_size |
| pool_timeout | 30 | Seconds to wait for a connection |
| pool_recycle | 1800 | Seconds before a connection is recycled |

Tables are auto-created on first initialization via Base.metadata.create_all(engine).
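These pool parameters map directly onto SQLAlchemy's create_engine keyword arguments. A minimal sketch follows; QueuePool is forced explicitly here because SQLite's default pool class varies across SQLAlchemy versions and may reject these options.

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Sketch: the servicer's pool settings, passed through as engine arguments.
engine_args = {
    "poolclass": QueuePool,
    "pool_size": 5,        # persistent connections in the pool
    "max_overflow": 10,    # extra connections beyond pool_size
    "pool_timeout": 30,    # seconds to wait for a free connection
    "pool_recycle": 1800,  # recycle a connection after 30 minutes
}
engine = create_engine("sqlite+pysqlite:///:memory:", **engine_args)
print(engine.pool.size())  # 5
```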

Session Context Manager

All domain functions access the database through dao.get_session():

@contextmanager
def get_session(self):
    session = self.Session()
    try:
        yield session
        session.commit()
    except Exception as e:
        session.rollback()
        raise e
    finally:
        session.close()

Usage in domain code:

def list_datasets(request, cml, dao):
    with dao.get_session() as session:
        datasets = session.query(Dataset).all()
        # ... convert and return

The context manager guarantees: commit on success, rollback on exception, close in all cases.
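The whole pattern can be exercised end-to-end against an in-memory database. This standalone sketch mirrors the context manager above; the Dataset mapping is abbreviated for illustration.

```python
from contextlib import contextmanager

from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

# Abbreviated stand-in for the real Dataset mapping in ft/db/model.py:
class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(String, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite+pysqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine, autoflush=True)

@contextmanager
def get_session():
    # Same shape as FineTuningStudioDao.get_session():
    # commit on success, rollback on exception, close in all cases.
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

# Writes are committed when the block exits without raising:
with get_session() as session:
    session.add(Dataset(id="d1", name="squad"))

with get_session() as session:
    print(session.query(Dataset).one().name)  # squad
```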

Database Export and Import

ft/db/db_import_export.py provides DatabaseJsonConverter for full database serialization.

Export

export_to_json(output_path=None) iterates all non-system tables (excluding sqlite_* internal tables), captures the CREATE TABLE schema and all row data, and returns a JSON string:

{
  "models": {
    "schema": "CREATE TABLE IF NOT EXISTS models (...)",
    "data": [
      {"id": "abc-123", "name": "llama-2", "type": "huggingface", ...}
    ]
  },
  "datasets": { ... },
  ...
}

If output_path is provided, the JSON is also written to that file.

Import

import_from_json(json_path) reads a JSON file in the export format, executes each table’s CREATE TABLE IF NOT EXISTS statement, and inserts all rows. Rows that fail to insert (e.g., due to duplicate primary keys) are logged but do not abort the import.
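The skip-on-failure behavior can be sketched with the standard-library sqlite3 module. The export payload below is fabricated and only mirrors the documented format.

```python
import sqlite3

# Fabricated export payload in the documented {table: {schema, data}} shape.
export = {
    "models": {
        "schema": "CREATE TABLE IF NOT EXISTS models (id TEXT PRIMARY KEY, name TEXT)",
        "data": [
            {"id": "abc-123", "name": "llama-2"},
            {"id": "abc-123", "name": "dup"},  # duplicate PK: logged, then skipped
        ],
    }
}

conn = sqlite3.connect(":memory:")
for table, payload in export.items():
    conn.execute(payload["schema"])
    for row in payload["data"]:
        columns = ", ".join(row)
        placeholders = ", ".join("?" for _ in row)
        try:
            conn.execute(
                f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
                tuple(row.values()),
            )
        except sqlite3.IntegrityError as exc:
            # Log and continue rather than aborting the whole import.
            print(f"skipping row in {table}: {exc}")
conn.commit()

print(conn.execute("SELECT name FROM models").fetchall())  # [('llama-2',)]
```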

Alembic Migrations

Schema migrations are managed by Alembic. Configuration is at alembic.ini with migration scripts in db_migrations/. When adding or modifying columns, generate a new migration with:

alembic revision --autogenerate -m "description of change"
alembic upgrade head

The DAO’s create_all() call handles initial table creation, but column additions and type changes on existing databases require Alembic migrations.

Cross-References