
Introduction

Fine Tuning Studio is a Cloudera AMP (Applied ML Prototype) for managing, fine-tuning, and evaluating large language models within Cloudera Machine Learning (CML). It provides a Streamlit UI backed by a gRPC API, a SQLite metadata store, and job dispatch to CML workloads for training and evaluation. Models, datasets, PEFT adapters, and prompt templates are managed as first-class resources that flow through the import-train-evaluate-deploy lifecycle.

This guide serves two audiences:

| If you are… | Start here |
|---|---|
| Building a training harness or extending the platform (custom gRPC clients, new dataset types, training scripts, Axolotl integrations) | Architecture Reference |
| Building a validation SDK or CI/CD pipeline for fine-tuning artifacts (config validation, adapter packaging, model export) | Resource Specifications and Validation Rules |

Terminology

| Term | Definition |
|---|---|
| Dataset | A reference to a HuggingFace Hub dataset or a local file (CSV, JSON, JSONL) registered in the Studio’s metadata store. Features are auto-extracted on import. |
| Model | A base foundation model registered from HuggingFace Hub or the CML Model Registry. Serves as the starting point for fine-tuning. |
| Adapter | A PEFT LoRA adapter – either produced by a fine-tuning job, imported from a local directory, or fetched from HuggingFace Hub. Applied on top of a base model. |
| Prompt Template | A format-string template that maps dataset feature columns into training input. Contains prompt_template, input_template, and completion_template fields. |
| Config | A named configuration blob – training arguments, BitsAndBytes quantization, LoRA hyperparameters, generation config, or Axolotl YAML. Configs are deduplicated by content. |
| Fine-Tuning Job | A CML Job that trains a PEFT adapter. Dispatched via the gRPC API, tracked in the metadata store, executed as a CML workload with configurable CPU/GPU/memory. |
| Evaluation Job | A CML Job that runs MLflow evaluation against one or more model+adapter combinations. Results are tracked in MLflow experiments. |
| gRPC Service | The Fine Tuning Service (FTS) – a stateless gRPC server on port 50051 that hosts all application logic. Accessed via FineTuningStudioClient. |
| DAO | Data Access Object – FineTuningStudioDao manages SQLAlchemy sessions and connection pooling against the SQLite database. |
| CML Workload | A Cloudera ML Job, session, or model endpoint. Fine-tuning and evaluation are dispatched as CML Jobs via the cmlapi SDK. |

Resource Lifecycle

The lifecycle begins with importing resources (datasets from HuggingFace or local files, base models, prompt templates) and ends with deploying trained adapters to the CML Model Registry or as CML Model endpoints. The gRPC API drives every step — the Streamlit UI is a client of this API, not the source of truth.

System Overview

Fine Tuning Studio is a three-layer application running inside a single CML Application pod. A Streamlit frontend communicates with a gRPC backend over localhost; the backend persists metadata to SQLite and dispatches CML Jobs for training and evaluation workloads.

Component Topology

Layer Summary

Presentation Layer

Entry point: main.py. Page modules live in pgs/. Two navigation modes are controlled by the IS_COMPOSABLE environment variable:

  • Composable mode (IS_COMPOSABLE set): Horizontal navbar with dropdown menus for Home, Database, Resources, Experiments, AI Workbench, Examples, and Feedback.
  • Standard mode (default): Sidebar navigation with section headers and Material Design icons.

Pages obtain shared gRPC and CML client instances through @st.cache_resource decorators defined in pgs/streamlit_utils.py. See Streamlit Presentation Layer for full details.

Application Layer

A gRPC server runs on port 50051, started by bin/start-grpc-server.py as a background subprocess. The service class FineTuningStudioApp in ft/service.py implements FineTuningStudioServicer (generated from protobuf). It is a pure router – each RPC method delegates to a domain function in the corresponding module:

| Module | Domain |
|---|---|
| ft/datasets.py | Dataset import, listing, removal |
| ft/models.py | Model registration, export |
| ft/adapters.py | Adapter management, dataset split lookup |
| ft/prompts.py | Prompt template CRUD |
| ft/jobs.py | Fine-tuning job dispatch and tracking |
| ft/evaluation.py | Evaluation job dispatch and tracking |
| ft/configs.py | Configuration blob management |
| ft/databse_ops.py | Database export/import operations |

The servicer holds a cmlapi.default_client() and a FineTuningStudioDao instance, passing both to every domain function call. See gRPC Service Design for the full API surface.

Data Layer

SQLite at .app/state.db via SQLAlchemy ORM. Seven tables: models, datasets, adapters, prompts, fine_tuning_jobs, evaluation_jobs, configs. The DAO manages sessions with connection pooling (pool_size=5, max_overflow=10, pool_timeout=30, pool_recycle=1800). See Data Tier for schemas and the DAO API.

Initialization Sequence

The startup sequence is defined in .project-metadata.yaml and executed by bin/start-app-script.sh:

  1. Install dependencies: bin/install-dependencies-uv.py installs from requirements.txt and performs pip install -e . to install the ft package in dev mode.
  2. Create template CML Jobs: Accel_Finetuning_Base_Job and Mlflow_Evaluation_Base_Job are created as reusable job templates for fine-tuning and evaluation dispatch.
  3. Initialize project defaults: bin/initialize-project-defaults-uv.py populates default datasets, prompts, models, and adapters from data/project_defaults.json.
  4. Start gRPC server: bin/start-grpc-server.py launches as a background process (&), binds to port 50051 with a ThreadPoolExecutor(max_workers=10), and sets FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT as CML project environment variables via cmlapi.
  5. Start Streamlit: uv run -m streamlit run main.py --server.port $CDSW_APP_PORT --server.address 127.0.0.1.

Both processes (gRPC server and Streamlit) run in the same pod. The gRPC server is the subprocess; Streamlit is the foreground process that keeps the CML Application alive.

Environment Variables

| Variable | Purpose | Default |
|---|---|---|
| FINE_TUNING_SERVICE_IP | gRPC server IP address | Set at startup from CDSW_IP_ADDRESS |
| FINE_TUNING_SERVICE_PORT | gRPC server port | 50051 |
| FINE_TUNING_STUDIO_SQLITE_DB | SQLite database file path | .app/state.db |
| CDSW_PROJECT_ID | CML project identifier | Set by CML runtime |
| CDSW_APP_PORT | Streamlit server port | Set by CML runtime |
| HUGGINGFACE_ACCESS_TOKEN | HuggingFace Hub token for gated models | Optional (empty string) |
| IS_COMPOSABLE | Enable horizontal navbar mode | Optional (unset = sidebar) |
| CUSTOM_LORA_ADAPTERS_DIR | Directory for custom LoRA adapters | data/adapters/ |
| FINE_TUNING_STUDIO_PROJECT_DEFAULTS | Path to project defaults JSON | data/project_defaults.json |

Key Takeaway for Harness Builders

The gRPC API is the sole interface to application logic. The Streamlit UI is one client of this API, not the source of truth. Any external harness, CLI tool, or automation script should instantiate a FineTuningStudioClient (or use the generated gRPC stub directly) and interact through the protobuf contract. The database is an implementation detail behind the DAO – never access .app/state.db directly from external code.

To build a custom training harness:

  1. Import FineTuningStudioClient from ft.client.
  2. Register resources (datasets, models, prompts) via Add* RPCs.
  3. Dispatch training via StartFineTuningJob with the desired resource IDs and compute configuration.
  4. Poll job status via GetFineTuningJob or ListFineTuningJobs.
  5. Evaluate results via StartEvaluationJob.

All resource IDs are UUIDs assigned by the service. Pass them by value between RPCs.

gRPC Service Design

The Fine Tuning Studio API is defined as a single gRPC service in ft/proto/fine_tuning_studio.proto. The service exposes 29 RPCs organized by resource domain. A generated Python stub provides the transport layer; FineTuningStudioClient wraps it with error handling and convenience methods.

Service Architecture

RPC Catalog

Every domain follows the same pattern: List, Get, Add (or Start for jobs), and Remove. Request and response types use the naming convention {Action}{Domain}Request / {Action}{Domain}Response.
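The naming convention above is mechanical, which makes it easy to derive type names programmatically. A tiny illustrative helper (hypothetical, not part of the Studio codebase):

```python
def rpc_type_names(action: str, domain: str) -> tuple:
    """Return the (request, response) protobuf type names for an RPC,
    following the {Action}{Domain}Request / {Action}{Domain}Response
    convention used by the Fine Tuning Studio proto definition."""
    return f"{action}{domain}Request", f"{action}{domain}Response"
```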

Dataset RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListDatasets | ListDatasetsRequest | ListDatasetsResponse | Return all registered datasets |
| GetDataset | GetDatasetRequest | GetDatasetResponse | Return a single dataset by ID |
| AddDataset | AddDatasetRequest | AddDatasetResponse | Register a HuggingFace or local dataset |
| RemoveDataset | RemoveDatasetRequest | RemoveDatasetResponse | Delete a dataset registration |
| GetDatasetSplitByAdapter | GetDatasetSplitByAdapterRequest | GetDatasetSplitByAdapterResponse | Get dataset split info for a specific adapter |

Model RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListModels | ListModelsRequest | ListModelsResponse | Return all registered models |
| GetModel | GetModelRequest | GetModelResponse | Return a single model by ID |
| AddModel | AddModelRequest | AddModelResponse | Register a HuggingFace or CML model |
| ExportModel | ExportModelRequest | ExportModelResponse | Export a model to CML Model Registry |
| RemoveModel | RemoveModelRequest | RemoveModelResponse | Delete a model registration |

Adapter RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListAdapters | ListAdaptersRequest | ListAdaptersResponse | Return all registered adapters |
| GetAdapter | GetAdapterRequest | GetAdapterResponse | Return a single adapter by ID |
| AddAdapter | AddAdapterRequest | AddAdapterResponse | Register a local or HuggingFace adapter |
| RemoveAdapter | RemoveAdapterRequest | RemoveAdapterResponse | Delete an adapter registration |

Prompt RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListPrompts | ListPromptsRequest | ListPromptsResponse | Return all prompt templates |
| GetPrompt | GetPromptRequest | GetPromptResponse | Return a single prompt by ID |
| AddPrompt | AddPromptRequest | AddPromptResponse | Create a new prompt template |
| RemovePrompt | RemovePromptRequest | RemovePromptResponse | Delete a prompt template |

Fine-Tuning RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListFineTuningJobs | ListFineTuningJobsRequest | ListFineTuningJobsResponse | Return all fine-tuning jobs |
| GetFineTuningJob | GetFineTuningJobRequest | GetFineTuningJobResponse | Return a single job by ID |
| StartFineTuningJob | StartFineTuningJobRequest | StartFineTuningJobResponse | Dispatch a new fine-tuning CML Job |
| RemoveFineTuningJob | RemoveFineTuningJobRequest | RemoveFineTuningJobResponse | Delete a fine-tuning job record |

Evaluation RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListEvaluationJobs | ListEvaluationJobsRequest | ListEvaluationJobsResponse | Return all evaluation jobs |
| GetEvaluationJob | GetEvaluationJobRequest | GetEvaluationJobResponse | Return a single evaluation job by ID |
| StartEvaluationJob | StartEvaluationJobRequest | StartEvaluationJobResponse | Dispatch a new evaluation CML Job |
| RemoveEvaluationJob | RemoveEvaluationJobRequest | RemoveEvaluationJobResponse | Delete an evaluation job record |

Config RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListConfigs | ListConfigsRequest | ListConfigsResponse | Return all configuration blobs |
| GetConfig | GetConfigRequest | GetConfigResponse | Return a single config by ID |
| AddConfig | AddConfigRequest | AddConfigResponse | Create a new configuration |
| RemoveConfig | RemoveConfigRequest | RemoveConfigResponse | Delete a configuration |

Database RPCs

| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ExportDatabase | ExportDatabaseRequest | ExportDatabaseResponse | Export entire database as JSON |
| ImportDatabase | ImportDatabaseRequest | ImportDatabaseResponse | Import database from JSON file |

Servicer Implementation

FineTuningStudioApp in ft/service.py extends the generated FineTuningStudioServicer. It holds two shared resources initialized in __init__:

class FineTuningStudioApp(FineTuningStudioServicer):
    def __init__(self):
        self.cml = cmlapi.default_client()
        self.dao = FineTuningStudioDao(engine_args={
            "pool_size": 5,
            "max_overflow": 10,
            "pool_timeout": 30,
            "pool_recycle": 1800,
        })
        self.project_id = os.getenv("CDSW_PROJECT_ID")

Every RPC method is a one-line delegation to the corresponding domain function, passing (request, self.cml, self.dao):

def ListDatasets(self, request, context):
    return list_datasets(request, self.cml, self.dao)

def StartFineTuningJob(self, request, context):
    return start_fine_tuning_job(request, self.cml, dao=self.dao)

Config and database RPCs omit the cml parameter since they operate on local data only.

Client Wrapper

FineTuningStudioClient in ft/client.py wraps the generated stub with automatic error handling. On construction, it introspects all callable methods on the stub and wraps each one to convert grpc.RpcError into ValueError with cleaned messages.

class FineTuningStudioClient:
    def __init__(self, server_ip=None, server_port=None):
        if not server_ip:
            server_ip = os.getenv("FINE_TUNING_SERVICE_IP")
        if not server_port:
            server_port = os.getenv("FINE_TUNING_SERVICE_PORT")
        self.channel = grpc.insecure_channel(f"{server_ip}:{server_port}")
        self.stub = FineTuningStudioStub(self.channel)

        # Auto-wrap all stub methods with error handling
        for attr in dir(self.stub):
            if not attr.startswith('_') and callable(getattr(self.stub, attr)):
                setattr(self, attr, self._grpc_error_handler(getattr(self.stub, attr)))

Convenience Methods

The client provides shorthand accessors that construct the request internally:

| Method | Returns | Equivalent RPC |
|---|---|---|
| get_datasets() | List[DatasetMetadata] | ListDatasets(ListDatasetsRequest()).datasets |
| get_models() | List[ModelMetadata] | ListModels(ListModelsRequest()).models |
| get_adapters() | List[AdapterMetadata] | ListAdapters(ListAdaptersRequest()).adapters |
| get_prompts() | List[PromptMetadata] | ListPrompts(ListPromptsRequest()).prompts |
| get_fine_tuning_jobs() | List[FineTuningJobMetadata] | ListFineTuningJobs(ListFineTuningJobsRequest()).fine_tuning_jobs |
| get_evaluation_jobs() | List[EvaluationJobMetadata] | ListEvaluationJobs(ListEvaluationJobsRequest()).evaluation_jobs |

Usage Example

from ft.client import FineTuningStudioClient
from ft.api import *

client = FineTuningStudioClient()

# List all datasets
datasets = client.get_datasets()

# Add a HuggingFace dataset
client.AddDataset(AddDatasetRequest(
    type="huggingface",
    huggingface_name="tatsu-lab/alpaca",
    name="Alpaca"
))

# Start a fine-tuning job
client.StartFineTuningJob(StartFineTuningJobRequest(
    base_model_id="model-uuid",
    dataset_id="dataset-uuid",
    prompt_id="prompt-uuid",
    adapter_name="my-adapter",
    num_cpu=2,
    num_gpu=1,
    num_memory=16,
    framework_type="legacy"
))

All request and response types are importable from ft.api, which re-exports the generated protobuf classes.

Protobuf Regeneration

After modifying ft/proto/fine_tuning_studio.proto, regenerate the Python bindings:

./bin/generate-proto-python.sh

This produces ft/proto/fine_tuning_studio_pb2.py (message classes) and ft/proto/fine_tuning_studio_pb2_grpc.py (stub and servicer base class). Both are checked into the repository. Do not edit them by hand.

Server Startup

The gRPC server is started by bin/start-grpc-server.py:

  1. Creates a grpc.server with ThreadPoolExecutor(max_workers=10).
  2. Registers FineTuningStudioApp() as the servicer.
  3. Binds to [::]:50051 (all interfaces).
  4. Updates CML project environment variables (FINE_TUNING_SERVICE_IP, FINE_TUNING_SERVICE_PORT) via cmlapi so that any workload in the project can locate the server.
  5. Blocks on server.wait_for_termination().

The server process is launched as a background subprocess by bin/start-app-script.sh before Streamlit starts. See System Overview for the full initialization sequence.

Data Tier

All Fine Tuning Studio metadata is persisted in a SQLite database at .app/state.db (configurable via FINE_TUNING_STUDIO_SQLITE_DB). The ORM layer uses SQLAlchemy declarative models defined in ft/db/model.py. Access is managed through FineTuningStudioDao in ft/db/dao.py.

Schema Topology

Table Schemas

All primary keys are String type (UUIDs assigned by domain logic). All columns are nullable except id. ORM classes are defined in ft/db/model.py.

models

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, cml) |
| framework | String | | Model framework identifier |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_model_name | String | | HuggingFace Hub model ID |
| location | String | | Local filesystem path |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |

datasets

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, local) |
| name | String | | Display name |
| description | Text | | Long-form description |
| huggingface_name | String | | HuggingFace Hub dataset ID |
| location | Text | | Local filesystem path |
| features | Text | | JSON string of dataset feature names |

adapters

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_name | String | | HuggingFace Hub adapter ID |
| model_id | String | FK -> models.id | Base model this adapter targets |
| location | Text | | Local filesystem path to adapter weights |
| fine_tuning_job_id | String | FK -> fine_tuning_jobs.id | Job that produced this adapter |
| prompt_id | String | FK -> prompts.id | Prompt template used during training |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |

prompts

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Prompt type |
| name | String | | Display name |
| description | String | | Human-readable description |
| dataset_id | String | FK -> datasets.id | Dataset this prompt is designed for |
| prompt_template | String | | Full prompt format string |
| input_template | String | | Input portion template |
| completion_template | String | | Completion portion template |
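Since prompt templates are plain format strings over dataset feature names, rendering reduces to str.format. A sketch with an illustrative template and row (neither is a shipped default):

```python
# Hypothetical prompt_template value; placeholders must match the dataset's
# feature column names stored in datasets.features.
prompt_template = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

# One illustrative dataset row keyed by feature name.
row = {"instruction": "Summarize the text.", "output": "A short summary."}

# Render the training text by substituting feature values into the template.
rendered = prompt_template.format(**row)
```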

fine_tuning_jobs

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| base_model_id | String | FK -> models.id | Base model to fine-tune |
| dataset_id | String | FK -> datasets.id | Training dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| cml_job_id | String | | CML Job ID for tracking |
| adapter_id | String | FK -> adapters.id | Resulting adapter |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| num_epochs | Integer | | Training epochs |
| learning_rate | Double | | Learning rate |
| out_dir | String | | Output directory for adapter weights |
| training_arguments_config_id | String | FK -> configs.id | Training arguments config |
| model_bnb_config_id | String | FK -> configs.id | Model BitsAndBytes quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BitsAndBytes quantization config |
| lora_config_id | String | FK -> configs.id | LoRA hyperparameters config |
| training_arguments_config | String | | Serialized training arguments (snapshot) |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| lora_config | String | | Serialized LoRA config (snapshot) |
| dataset_fraction | Double | | Fraction of dataset to use |
| train_test_split | Double | | Train/test split ratio |
| user_script | String | | Custom user training script path |
| user_config_id | String | FK -> configs.id | Custom user config |
| framework_type | String | | Training framework (legacy, axolotl, etc.) |
| axolotl_config_id | String | FK -> configs.id | Axolotl YAML config |
| gpu_label_id | Integer | | GPU label selector |
| adapter_name | String | | Name assigned to the output adapter |

The fine_tuning_jobs table stores both config ID references (foreign keys to configs) and serialized config snapshots (plain string columns). This allows job records to remain self-describing even if the referenced config is later deleted.
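The dual storage can be sketched as a lookup that prefers the live config row and falls back to the snapshot. The function and field access below are illustrative (plain dicts standing in for ORM rows), not the Studio's actual resolution code:

```python
import json

def resolve_lora_config(job: dict, configs_by_id: dict) -> dict:
    """Return the LoRA config for a job record: prefer the referenced configs
    row if it still exists, otherwise fall back to the serialized snapshot
    stored on the job itself."""
    live = configs_by_id.get(job.get("lora_config_id"))
    if live is not None:
        return json.loads(live["config"])   # live config row still present
    return json.loads(job["lora_config"])   # config deleted: use the snapshot
```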

evaluation_jobs

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Evaluation type |
| cml_job_id | String | | CML Job ID for tracking |
| parent_job_id | String | | Parent fine-tuning job (if derived) |
| base_model_id | String | FK -> models.id | Model under evaluation |
| dataset_id | String | FK -> datasets.id | Evaluation dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| adapter_id | String | FK -> adapters.id | Adapter under evaluation |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| evaluation_dir | String | | Output directory for evaluation artifacts |
| model_bnb_config_id | String | FK -> configs.id | Model BnB quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BnB quantization config |
| generation_config_id | String | FK -> configs.id | Generation config for inference |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| generation_config | String | | Serialized generation config (snapshot) |

configs

| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Config type (training_arguments, bnb, lora, generation, axolotl) |
| description | String | | Human-readable description |
| config | Text | | JSON or YAML content stored as string |
| model_family | String | | Model family this config targets |
| is_default | Integer | | 1 = shipped default, 0 = user-created |
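The Terminology section notes that configs are deduplicated by content. One way to implement that for JSON configs is hashing a canonical serialization; this helper is a sketch under that assumption, not the Studio's actual mechanism:

```python
import hashlib
import json

def config_content_key(config_text: str) -> str:
    """Content-addressable key for a JSON config blob: two configs with the
    same keys/values (in any order) hash to the same key."""
    canonical = json.dumps(json.loads(config_text), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```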

ORM Mix-ins

All ORM model classes inherit from three bases: Base (SQLAlchemy declarative base), MappedProtobuf, and MappedDict. These mix-ins provide bidirectional serialization.

MappedProtobuf

Converts between protobuf messages and ORM instances.

# Protobuf message -> ORM instance
adapter_orm = Adapter.from_message(adapter_proto_msg)

# ORM instance -> Protobuf message
adapter_proto = adapter_orm.to_protobuf(AdapterMetadata)

from_message() uses ListFields() (protobuf >= 3.15) to extract only fields that were explicitly set in the message, avoiding default-value contamination. to_protobuf() iterates the ORM instance’s non-null columns and sets matching fields on a new protobuf message.

MappedDict

Converts between Python dictionaries and ORM instances.

# Dict -> ORM instance
model_orm = Model.from_dict({"id": "abc", "name": "llama-2"})

# ORM instance -> Dict (non-null fields only)
model_dict = model_orm.to_dict()

Table-Model Registry

ft/db/model.py exports two lookup dictionaries for programmatic table access:

TABLE_TO_MODEL_REGISTRY = {
    'datasets': Dataset,
    'models': Model,
    'prompts': Prompt,
    'adapters': Adapter,
    'fine_tuning_jobs': FineTuningJob,
    'evaluation_jobs': EvaluationJob,
    'configs': Config
}

MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}

These are used by the database import/export logic to iterate all application tables.

DAO

FineTuningStudioDao in ft/db/dao.py manages SQLAlchemy engine and session lifecycle.

Constructor

class FineTuningStudioDao:
    def __init__(self, engine_url=None, echo=False, engine_args={}):
        if engine_url is None:
            engine_url = f"sqlite+pysqlite:///{get_sqlite_db_location()}"
        self.engine = create_engine(engine_url, echo=echo, **engine_args)
        self.Session = sessionmaker(bind=self.engine, autoflush=True, autocommit=False)
        Base.metadata.create_all(self.engine)

The servicer instantiates the DAO with connection pool parameters:

| Parameter | Value | Description |
|---|---|---|
| pool_size | 5 | Persistent connections in the pool |
| max_overflow | 10 | Additional connections beyond pool_size |
| pool_timeout | 30 | Seconds to wait for a connection |
| pool_recycle | 1800 | Seconds before a connection is recycled |

Tables are auto-created on first initialization via Base.metadata.create_all(engine).

Session Context Manager

All domain functions access the database through dao.get_session():

@contextmanager
def get_session(self):
    session = self.Session()
    try:
        yield session
        session.commit()
    except Exception as e:
        session.rollback()
        raise e
    finally:
        session.close()

Usage in domain code:

def list_datasets(request, cml, dao):
    with dao.get_session() as session:
        datasets = session.query(Dataset).all()
        # ... convert and return

The context manager guarantees: commit on success, rollback on exception, close in all cases.

Database Export and Import

ft/db/db_import_export.py provides DatabaseJsonConverter for full database serialization.

Export

export_to_json(output_path=None) iterates all non-system tables (excluding sqlite_* internal tables), captures the CREATE TABLE schema and all row data, and returns a JSON string:

{
  "models": {
    "schema": "CREATE TABLE IF NOT EXISTS models (...)",
    "data": [
      {"id": "abc-123", "name": "llama-2", "type": "huggingface", ...}
    ]
  },
  "datasets": { ... },
  ...
}

If output_path is provided, the JSON is also written to that file.
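The export shape can be reproduced with the stdlib sqlite3 module. This sketch mirrors the output format described above (schema from sqlite_master plus rows as dicts), not DatabaseJsonConverter's implementation:

```python
import json
import sqlite3

def export_to_json(conn: sqlite3.Connection) -> str:
    """Dump every non-system table as {"schema": CREATE TABLE ..., "data": [...]}."""
    conn.row_factory = sqlite3.Row
    out = {}
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for name, schema in tables:
        rows = conn.execute(f"SELECT * FROM {name}").fetchall()
        out[name] = {"schema": schema, "data": [dict(r) for r in rows]}
    return json.dumps(out)
```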

Import

import_from_json(json_path) reads a JSON file in the export format, executes each table’s CREATE TABLE IF NOT EXISTS statement, and inserts all rows. Rows that fail to insert (e.g., due to duplicate primary keys) are logged but do not abort the import.
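The tolerant per-row insertion can be sketched with sqlite3: each row is inserted in its own try block so a duplicate primary key skips that row instead of aborting. Helper name and error handling are illustrative:

```python
import sqlite3

def insert_rows(conn: sqlite3.Connection, table: str, rows: list) -> int:
    """Insert rows (as dicts) into table, skipping integrity violations.
    Returns the number of rows actually inserted."""
    inserted = 0
    for row in rows:
        cols = ", ".join(row)
        params = ", ".join("?" for _ in row)
        try:
            conn.execute(
                f"INSERT INTO {table} ({cols}) VALUES ({params})",
                list(row.values()),
            )
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # e.g., duplicate primary key: the real code logs and continues
    return inserted
```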

Alembic Migrations

Schema migrations are managed by Alembic. Configuration is at alembic.ini with migration scripts in db_migrations/. When adding or modifying columns, generate a new migration with:

alembic revision --autogenerate -m "description of change"
alembic upgrade head

The DAO’s create_all() call handles initial table creation, but column additions and type changes on existing databases require Alembic migrations.

Cross-References

Streamlit Presentation Layer

The UI is a multi-page Streamlit application defined in main.py. It renders resource management forms, job dispatch controls, and evaluation dashboards. All data operations go through the gRPC client – the Streamlit layer has no direct database access.

Entry Point

main.py sets the page configuration and selects a navigation mode based on the IS_COMPOSABLE environment variable:

st.set_page_config(
    page_title="Fine Tuning Studio",
    page_icon=IconPaths.FineTuningStudio.FINE_TUNING_STUDIO,
    layout="wide"
)

The layout is always "wide". The page icon is loaded from the resources/images/ directory via ft.consts.IconPaths.

Composable Mode

Activated when IS_COMPOSABLE is set to any non-empty value. Uses streamlit_navigation_bar (st_navbar) combined with custom HTML/CSS for dropdown menus. Navigation groups:

| Group | Pages |
|---|---|
| Home | Home |
| Database Import Export | Database Import and Export |
| Resources | Import Datasets, View Datasets, Import Base Models, View Base Models, Create Prompts, View Prompts |
| Experiment | Train a New Adapter, Monitor Training Jobs, Local Adapter Comparison, Run MLFlow Evaluation, View MLflow Runs |
| AI Workbench | Export And Deploy Model |
| Examples | Ticketing Agent App |
| Feedback | Provide Feedback |

The navbar is rendered as a fixed-position HTML <nav> element with CSS dropdown menus. Links use target="_self" to navigate within the Streamlit app. All pages are registered with st.navigation(position="hidden") so that Streamlit handles routing internally while the custom navbar provides the visible UI.

Standard Mode (Default)

When IS_COMPOSABLE is not set, the sidebar renders section headers and page links with Material Design icons:

with st.sidebar:
    st.image("./resources/images/ft-logo.png")
    st.markdown("Navigation")
    st.page_link("pgs/home.py", label="Home", icon=":material/home:")
    st.page_link("pgs/database.py", label="Database Import and Export", icon=":material/database:")

    st.markdown("Resources")
    st.page_link("pgs/datasets.py", label="Import Datasets", icon=":material/publish:")
    st.page_link("pgs/view_datasets.py", label="View Datasets", icon=":material/data_object:")
    # ... remaining pages

Sidebar sections: Navigation, Resources, Experiments, AI Workbench, Examples, Feedback. The sidebar footer displays the current project owner and a link to the CML domain.

Page Inventory

All page modules live in the pgs/ directory:

| File | Title | Section |
|---|---|---|
| pgs/home.py | Home | Navigation |
| pgs/database.py | Database Import and Export | Database |
| pgs/datasets.py | Import Datasets | Resources |
| pgs/view_datasets.py | View Datasets | Resources |
| pgs/models.py | Import Base Models | Resources |
| pgs/view_models.py | View Base Models | Resources |
| pgs/prompts.py | Create Prompts | Resources |
| pgs/view_prompts.py | View Prompts | Resources |
| pgs/train_adapter.py | Train a New Adapter | Experiments |
| pgs/jobs.py | Training Job Tracking | Experiments |
| pgs/evaluate.py | Local Adapter Comparison | Experiments |
| pgs/mlflow.py | Run MLFlow Evaluation | Experiments |
| pgs/mlflow_jobs.py | View MLflow Runs | Experiments |
| pgs/export.py | Export And Deploy Model | AI Workbench |
| pgs/sample_ticketing_agent_app_embed.py | Sample Ticketing Agent App | Examples |
| pgs/feedback.py | Feedback | Feedback |

Client Caching

Shared client instances are cached at the Streamlit server level using @st.cache_resource. This avoids creating a new gRPC channel or CML API client on every page render. Both helpers are defined in pgs/streamlit_utils.py:

@st.cache_resource
def get_fine_tuning_studio_client() -> FineTuningStudioClient:
    client = FineTuningStudioClient()
    return client

@st.cache_resource
def get_cml_client() -> CMLServiceApi:
    client = default_client()
    return client

@st.cache_resource ensures a single instance per Streamlit server process. The gRPC client connects to the address specified by FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT environment variables. The CML client uses cmlapi.default_client(), which reads CML connection parameters from the pod environment.

Data Flow

Every user interaction follows this path: Streamlit widget event triggers a page callback, the page calls the cached client, the client sends a gRPC request, the server delegates to domain logic, and the domain function uses the DAO to read or write SQLite.

How to Add a New Page

  1. Create the page module at pgs/my_page.py:
import streamlit as st
from pgs.streamlit_utils import get_fine_tuning_studio_client

st.header("My New Page")

client = get_fine_tuning_studio_client()

# Use the client to interact with the gRPC service
models = client.get_models()
for model in models:
    st.write(model.name)
  2. Register the page in both navigation modes in main.py:

In the composable mode setup_navigation() function, add:

st.Page("pgs/my_page.py", title="My New Page"),

In the composable mode HTML navbar, add a link in the appropriate dropdown:

<a href="/my_page" target="_self"><span class="material-icons">icon_name</span> My New Page</a>

In the standard mode setup_navigation_sidebar() function, add:

st.Page("pgs/my_page.py", title="My New Page"),

In the standard mode setup_sidebar() function, add under the appropriate section:

st.page_link("pgs/my_page.py", label="My New Page", icon=":material/icon_name:")
  3. If the page requires a new RPC, add it to the protobuf definition, regenerate, implement the servicer method, and add the domain function. See gRPC Service Design.

Custom CSS

Both navigation modes inject custom CSS to control typography and layout:

  • Heading sizes (h3 reduced to 1.1rem)
  • Tab label font sizes (0.9rem)
  • Sidebar theming (dark background #16262c, white text) in standard mode
  • Navbar positioning and dropdown behavior in composable mode

CSS is injected via st.markdown(css, unsafe_allow_html=True).

Cross-References

Resource Concepts

Fine Tuning Studio manages seven resource types. All use UUID string primary keys generated via uuid4(). Resources are metadata entries stored in SQLite – the actual artifacts (model weights, dataset files, adapter checkpoints) live on the filesystem, HuggingFace Hub, or the CML Model Registry.

Resource Types

| Resource | Table | Purpose |
|---|---|---|
| Dataset | datasets | Reference to a HuggingFace Hub dataset or local file (CSV, JSON, JSONL) |
| Model | models | Base foundation model from HuggingFace Hub or CML Model Registry |
| Adapter | adapters | PEFT LoRA adapter – produced by training, imported from disk, or fetched from Hub |
| Prompt | prompts | Format-string template mapping dataset features into training input |
| Config | configs | Named configuration blob (training args, BnB, LoRA, generation, Axolotl YAML) |
| FineTuningJob | fine_tuning_jobs | CML Job that trains a PEFT adapter |
| EvaluationJob | evaluation_jobs | CML Job that runs MLflow evaluation against model+adapter combinations |

Entity Relationships

Type Enums

All type enums are defined in ft/api/types.py as str, Enum subclasses.

| Enum | Values |
|---|---|
| DatasetType | huggingface, project, project_csv, project_json, project_jsonl |
| ModelType | huggingface, project, model_registry |
| AdapterType | project, huggingface, model_registry |
| PromptType | in_place |
| ConfigType | training_arguments, bitsandbytes_config, generation_config, lora_config, custom, axolotl, axolotl_dataset_formats |
| FineTuningFrameworkType | legacy, axolotl |
| ModelExportType | model_registry, cml_model |
| EvaluationJobType | mlflow |
| ModelFrameworkType | pytorch, tensorflow, onnx |

ORM Layer

All ORM models inherit from sqlalchemy.orm.declarative_base() plus two mixins defined in ft/db/model.py:

MappedProtobuf – bidirectional protobuf conversion:

  • from_message(message) – class method. Extracts set fields from a protobuf message via ListFields() and passes them as kwargs to the ORM constructor.
  • to_protobuf(protobuf_cls) – instance method. Converts non-null ORM columns into a protobuf message by matching field names.

MappedDict – bidirectional dict conversion:

  • from_dict(d) – class method. Constructs an ORM instance from a plain dictionary.
  • to_dict() – instance method. Returns a dictionary of all non-null column values via SQLAlchemy inspect().

The serialization chain for any resource:

Protobuf message  <-->  ORM model  <-->  Python dict
     from_message() / to_protobuf()   from_dict() / to_dict()
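The dict side of this chain can be sketched with a minimal stand-in mixin. Names and shapes here are illustrative only – the real mixins in ft/db/model.py operate on SQLAlchemy columns and protobuf messages rather than plain attributes:

```python
# Hypothetical sketch of the MappedDict round trip (illustrative, not the
# Studio's actual SQLAlchemy implementation).
class MappedDictMixin:
    @classmethod
    def from_dict(cls, d):
        # Construct an instance from a plain dict, skipping null values.
        return cls(**{k: v for k, v in d.items() if v is not None})

    def to_dict(self):
        # Return all non-null attribute values.
        return {k: v for k, v in vars(self).items() if v is not None}

class Dataset(MappedDictMixin):
    def __init__(self, id=None, name=None, features=None):
        self.id, self.name, self.features = id, name, features

ds = Dataset.from_dict({"id": "abc", "name": "squad", "features": None})
assert ds.to_dict() == {"id": "abc", "name": "squad"}
```

The key property – null columns are dropped on both conversions – is what lets partially populated protobuf messages and ORM rows round-trip without spurious empty fields.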

Table Registry

ft/db/model.py maintains two registries used by the database import/export subsystem:

TABLE_TO_MODEL_REGISTRY = {
    'datasets': Dataset,
    'models': Model,
    'prompts': Prompt,
    'adapters': Adapter,
    'fine_tuning_jobs': FineTuningJob,
    'evaluation_jobs': EvaluationJob,
    'configs': Config,
}

MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}

Any new resource type must be added to TABLE_TO_MODEL_REGISTRY for database import/export to function correctly.

Dataset Specification

A Dataset resource is a metadata reference to a data source. The actual data lives on HuggingFace Hub or the local filesystem. On import, Studio extracts feature column names and stores them as a JSON string, enabling downstream prompt template construction without reloading the data.

Source: ft/datasets.py, ft/db/model.py

Supported Types

| Type | Source | Identifier Field | Feature Extraction Method |
|---|---|---|---|
| huggingface | HuggingFace Hub | huggingface_name | load_dataset_builder() -> info.features.keys() |
| project | Local HF-compatible directory | location | Not extracted |
| project_csv | Local CSV file | location | Read header row via csv.reader |
| project_json | Local JSON file | location | Read first object keys via json.load |
| project_jsonl | Local JSONL file | location | Read first line keys via json.loads |

ORM Schema

class Dataset(Base, MappedProtobuf, MappedDict):
    __tablename__ = "datasets"
    id = Column(String, primary_key=True)    # UUID
    type = Column(String)                     # DatasetType enum value
    name = Column(String)                     # Display name
    description = Column(Text)                # Auto-populated for HF datasets
    huggingface_name = Column(String)         # HF Hub identifier (HF type only)
    location = Column(Text)                   # Filesystem path (project types only)
    features = Column(Text)                   # JSON-serialized list of column names

Import Validation

add_dataset() dispatches to type-specific validators before creating a record:

All types:

  • type field is required.
  • Duplicate detection by name (local types) or huggingface_name (HF type).

HuggingFace (_validate_huggingface_dataset_request):

  • huggingface_name field required and non-blank.
  • Validates dataset exists on Hub via load_dataset_builder().
  • Extracts dataset_info.features.keys() for feature list.
  • Stores dataset_info.description as the description.

CSV (_validate_local_csv_dataset_request):

  • location field required, must end with .csv.
  • name field required and non-blank.
  • Reads header row with csv.reader(file) / next(reader) for features.

JSON (_validate_local_json_dataset_request):

  • location field required, must end with .json.
  • Reads first object in the JSON array for feature keys.

JSONL (_validate_local_jsonl_dataset_request):

  • location field required, must end with .jsonl.
  • Reads first line, parses as JSON, extracts keys for features.

Feature Extraction Functions

extract_features_from_csv(location)   # csv.reader -> next(reader)
extract_features_from_json(location)  # json.load -> next(iter(data)).keys()
extract_features_from_jsonl(location) # json.loads(first_line).keys()

Features are stored as json.dumps(features) in the features column. Downstream consumers (prompt templates, training scripts) parse this back with json.loads().
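A minimal sketch of the three extractors – the signatures mirror ft/datasets.py, but the bodies here are illustrative:

```python
import csv
import json

def extract_features_from_csv(location):
    # Header row of the CSV file gives the feature column names.
    with open(location) as f:
        return next(csv.reader(f))

def extract_features_from_json(location):
    # Keys of the first object in the top-level JSON array.
    with open(location) as f:
        return list(next(iter(json.load(f))).keys())

def extract_features_from_jsonl(location):
    # Keys of the first line, parsed as a standalone JSON object.
    with open(location) as f:
        return list(json.loads(f.readline()).keys())
```

Each returns a plain list of column names, ready to be json.dumps()-ed into the features column.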

Loading into Memory

load_dataset_into_memory(dataset: DatasetMetadata) normalizes all dataset types into a HuggingFace DatasetDict with at minimum a train key:

| Type | Load Method | Wrapping |
|---|---|---|
| huggingface | datasets.load_dataset(huggingface_name) | Already a DatasetDict |
| project_csv | datasets.load_dataset('csv', data_files=location) | Already a DatasetDict |
| project_json | datasets.Dataset.from_json(location) | Wrapped in DatasetDict({'train': ds}) |
| project_jsonl | datasets.Dataset.from_json(location) | Wrapped in DatasetDict({'train': ds}) |

If the loaded object is a Dataset (not DatasetDict), it is wrapped: DatasetDict({'train': ds}).

Removal

remove_dataset() deletes the Dataset record. If request.remove_prompts is set, also deletes all Prompt records with matching dataset_id via cascading delete.

Protobuf Message

DatasetMetadata fields: id, type, name, description, huggingface_name, location, features.

Model Specification

A Model resource represents a base foundation model registered in the Studio’s metadata store. Models serve as the starting point for fine-tuning and evaluation. The actual model weights are never stored by Studio – they are downloaded at training time from HuggingFace Hub or resolved from the CML Model Registry.

Source: ft/models.py, ft/db/model.py, ft/config/model_configs/config_loader.py

Supported Types

| Type | Source | Required Fields | Validation |
|---|---|---|---|
| huggingface | HuggingFace Hub | huggingface_model_name | HfApi().model_info() must succeed |
| model_registry | CML Model Registry | model_registry_id (request) | Fetches RegisteredModel via cmlapi |
| project | Local directory | location | Not yet fully implemented |

ORM Schema

class Model(Base, MappedProtobuf, MappedDict):
    __tablename__ = "models"
    id = Column(String, primary_key=True)            # UUID
    type = Column(String)                             # ModelType enum value
    framework = Column(String)                        # ModelFrameworkType (pytorch, tensorflow, onnx)
    name = Column(String)                             # Display name
    description = Column(String)
    huggingface_model_name = Column(String)           # HF Hub model identifier
    location = Column(String)                         # Local path (project type)
    cml_registered_model_id = Column(String)          # CML Registry model ID
    mlflow_experiment_id = Column(String)             # MLflow experiment (registry type)
    mlflow_run_id = Column(String)                    # MLflow run (registry type)

Import Flow

add_model() validates and creates a Model record based on type:

HuggingFace:

  1. Validate huggingface_name is non-empty and not already registered (duplicate check by huggingface_model_name).
  2. Call HfApi().model_info(name) to confirm model exists on Hub.
  3. Create Model with type=HUGGINGFACE, name and huggingface_model_name set to the stripped input.

Model Registry:

  1. model_registry_id must be provided on the request.
  2. Fetch RegisteredModel via cml.get_registered_model(id).
  3. Extract the first version’s metadata: registered_model.model_versions[0].model_version_metadata.mlflow_metadata.
  4. Create Model with type=MODEL_REGISTRY, name from registered_model.name, and populate cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.

Model Family Detection

ft/config/model_configs/config_loader.py provides ModelMetadataFinder:

class ModelMetadataFinder:
    def __init__(self, model_name_or_path):
        self.model_name_or_path = model_name_or_path

    def fetch_model_family_from_config(self):
        config = AutoConfig.from_pretrained(self.model_name_or_path)
        return config.architectures[0]  # e.g., "LlamaForCausalLM"

This is used in two places:

  • Config filtering: list_configs() filters default configs to those matching the model’s architecture family.
  • Config creation: add_config() with a description field uses transform_name_to_family() to resolve the model family for deduplication scoping.

Additional static methods:

  • fetch_bos_token_id_from_config(model_name_or_path) – returns config.bos_token_id (default: 1).
  • fetch_eos_token_id_from_config(model_name_or_path) – returns config.eos_token_id (default: 2).

Export Routes

export_model() dispatches based on ModelExportType:

| Export Type | Handler | Target |
|---|---|---|
| model_registry | export_model_registry_model() | MLflow model registry |
| cml_model | deploy_cml_model() | CML Model endpoint |

Both handlers are defined in ft/export.py.

Protobuf Message

ModelMetadata fields: id, type, framework, name, huggingface_model_name, location, cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.

Adapter Specification

An Adapter resource represents a PEFT LoRA adapter. Adapters are produced by fine-tuning jobs, imported from a local directory, or fetched from HuggingFace Hub. Each adapter is linked to a base model and optionally to the fine-tuning job and prompt template that produced it.

Source: ft/adapters.py, ft/db/model.py

Supported Types

| Type | Source | Required Fields |
|---|---|---|
| project | Local directory | location (must exist as a directory) |
| huggingface | HuggingFace Hub | huggingface_name |
| model_registry | CML Model Registry | cml_registered_model_id |

ORM Schema

class Adapter(Base, MappedProtobuf, MappedDict):
    __tablename__ = "adapters"
    id = Column(String, primary_key=True)                                 # UUID
    type = Column(String)                                                  # AdapterType enum value
    name = Column(String)                                                  # Display name (unique)
    description = Column(String)
    huggingface_name = Column(String)                                      # HF Hub adapter identifier
    model_id = Column(String, ForeignKey('models.id'))                     # Base model FK
    location = Column(Text)                                                # Local path to adapter dir
    fine_tuning_job_id = Column(String, ForeignKey('fine_tuning_jobs.id')) # Producing job FK
    prompt_id = Column(String, ForeignKey('prompts.id'))                   # Training prompt FK
    cml_registered_model_id = Column(String)                               # CML Registry model ID
    mlflow_experiment_id = Column(String)                                  # MLflow experiment
    mlflow_run_id = Column(String)                                         # MLflow run

Key Relationships

| FK Column | Target | Required |
|---|---|---|
| model_id | models.id | Yes – the base model this adapter applies to |
| fine_tuning_job_id | fine_tuning_jobs.id | No – only set for Studio-trained adapters |
| prompt_id | prompts.id | No – only set for Studio-trained adapters |

Import Validation

_validate_add_adapter_request() enforces:

  1. Required fields: name, model_id, and location must all be present and non-blank.
  2. Directory existence: os.path.isdir(request.location) must return True.
  3. Model FK: model_id must reference an existing Model record.
  4. Unique name: No existing adapter may share the same name.
  5. Optional FK checks: If fine_tuning_job_id is provided, it must exist in fine_tuning_jobs. If prompt_id is provided, it must exist in prompts.

Adapter Creation

add_adapter() validates the request, then creates an Adapter record with all provided fields mapped directly from the request.

Dataset Split Tracking

get_dataset_split_by_adapter() retrieves the dataset fraction and train/test split used during training for a given adapter:

  1. Joins FineTuningJob to Adapter on adapter_name.
  2. If a matching job is found, returns its dataset_fraction and train_test_split.
  3. If no matching job exists (imported adapter), returns defaults:
| Parameter | Default | Source |
|---|---|---|
| dataset_fraction | 1.0 | TRAINING_DEFAULT_DATASET_FRACTION |
| train_test_split | 0.9 | TRAINING_DEFAULT_TRAIN_TEST_SPLIT |

These defaults are defined in ft/consts.py.

Protobuf Message

AdapterMetadata fields: id, type, name, description, huggingface_name, model_id, location, fine_tuning_job_id, prompt_id, cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.

Prompt Template Specification

A Prompt resource defines a format-string template that maps dataset feature columns into structured training input. Prompts bind a dataset’s column names to positional slots in the training text, controlling how raw data is presented to the model during fine-tuning and evaluation.

Source: ft/prompts.py, ft/utils.py, ft/jobs.py, ft/db/model.py

Template Fields

| Field | Purpose | Example |
|---|---|---|
| prompt_template | Full prompt format string used during training | "Instruction: {instruction}\nInput: {input}\nOutput: {output}" |
| input_template | Input portion (informational, used in evaluation) | "Instruction: {instruction}\nInput: {input}" |
| completion_template | Expected output portion (informational, used in evaluation) | "Output: {output}" |

Placeholders use Python format-string syntax: {feature_name}. Each placeholder must correspond to a column name in the linked dataset’s features JSON array.
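Since templates use plain Python format-string syntax, applying one to a dataset row is a single str.format() call (the template and row below are illustrative):

```python
# A prompt template's {placeholders} map 1:1 to dataset feature columns.
prompt_template = "Instruction: {instruction}\nInput: {input}\nOutput: {output}"

row = {"instruction": "Translate to French", "input": "hello", "output": "bonjour"}

# Format the row into training text by unpacking feature columns as kwargs.
text = prompt_template.format(**row)
assert text == "Instruction: Translate to French\nInput: hello\nOutput: bonjour"
```

A KeyError from format() at this point indicates a placeholder that does not exist in the dataset's features list.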

ORM Schema

class Prompt(Base, MappedProtobuf, MappedDict):
    __tablename__ = "prompts"
    id = Column(String, primary_key=True)               # UUID
    type = Column(String)                                # PromptType enum value
    name = Column(String)                                # Display name (unique)
    description = Column(String)
    dataset_id = Column(String, ForeignKey('datasets.id'))  # Linked dataset FK
    prompt_template = Column(String)                     # Full template
    input_template = Column(String)                      # Input portion
    completion_template = Column(String)                  # Output portion

Import Validation

_validate_add_prompt_request() enforces:

  1. Required fields: id, name, dataset_id, prompt_template, input_template, completion_template must all be present on the PromptMetadata message.
  2. Non-blank name: name.strip() must be non-empty.
  3. Unique name: No existing prompt may share the same name.

The prompt is created via Prompt.from_message(request.prompt), which uses the MappedProtobuf.from_message() method to map protobuf fields directly to ORM columns.

Auto-Generation from Dataset Columns

ft/utils.py::generate_templates(columns) produces default templates from a list of dataset column names:

  1. Output column detection: Compares column names against a ranked list of 500 common output column names (e.g., answer, response, output, label, target). The column matching the highest-ranked name becomes the output column. If no match, the last column is used.

  2. Input columns: All columns except the identified output column.

  3. Prompt template: Generated as:

    You are an LLM responsible for generating a response. Please provide a response given the user input below.
    
    <Column1>: {column1}
    <Column2>: {column2}
    <Output>:
    
  4. Completion template: {output_column}\n

Returns (prompt_template, completion_template).
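The detection-and-generation logic can be sketched as follows. The ranked name list here is a tiny stand-in for the much longer list in ft/utils.py, and the exact template wording is illustrative:

```python
# Illustrative sketch of generate_templates(); the real ranked list of
# output column names in ft/utils.py is far longer than this stand-in.
RANKED_OUTPUT_NAMES = ["answer", "response", "output", "label", "target"]

def generate_templates(columns):
    # Highest-ranked matching column becomes the output; else the last column.
    output_col = next((n for n in RANKED_OUTPUT_NAMES if n in columns), columns[-1])
    input_cols = [c for c in columns if c != output_col]

    prompt = ("You are an LLM responsible for generating a response. "
              "Please provide a response given the user input below.\n\n")
    prompt += "".join(f"<{c.capitalize()}>: {{{c}}}\n" for c in input_cols)
    prompt += "<Output>:\n"
    completion = f"{{{output_col}}}\n"
    return prompt, completion

p, c = generate_templates(["question", "context", "answer"])
```

Here "answer" wins the ranked lookup, so the completion template is "{answer}\n" and the prompt template enumerates the remaining columns.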

Axolotl Auto-Prompt

ft/jobs.py::_add_prompt_for_dataset() generates a prompt automatically when using the Axolotl framework and no prompt is provided:

  1. Load the Axolotl config from the database by axolotl_config_id.
  2. Parse the YAML config and extract the dataset type from config['datasets'][0]['type'].
  3. Query the Config table for a matching axolotl_dataset_formats config by description == dataset_type.
  4. Parse the format config JSON to extract feature column names.
  5. Build a template: <Feature>: {feature}\n for each feature.
  6. Check for an existing prompt with the same dataset_id and prompt_template. If found, return its ID.
  7. Otherwise, create a new Prompt named "AXOLOTL_AUTOGENERATED : {dataset_type}_{dataset_name}".

Removal

remove_prompt() deletes the Prompt record by ID. Note that prompts are also cascade-deleted when their parent dataset is removed with remove_prompts=True.

Protobuf Message

PromptMetadata fields: id, type, name, description, dataset_id, prompt_template, input_template, completion_template.

Configuration Specification

A Config resource stores a named configuration blob – JSON or YAML – that parameterizes training, quantization, inference, or the Axolotl framework. Configs are content-deduplicated: adding a config with identical content and type to an existing one returns the existing config’s ID rather than creating a duplicate.

Source: ft/configs.py, ft/consts.py, ft/db/model.py

Config Types

| Type | Format | Purpose | Default Provided |
|---|---|---|---|
| training_arguments | JSON | Training hyperparameters (epochs, optimizer, batch size, learning rate) | Yes |
| bitsandbytes_config | JSON | 4-bit quantization settings | Yes |
| lora_config | JSON | LoRA hyperparameters | Yes |
| generation_config | JSON | Inference generation settings | Yes |
| custom | JSON | User-defined configuration blob | No |
| axolotl | YAML | Axolotl training configuration file | Template provided |
| axolotl_dataset_formats | JSON | Axolotl dataset format schemas | Yes (multiple) |

ORM Schema

class Config(Base, MappedProtobuf, MappedDict):
    __tablename__ = "configs"
    id = Column(String, primary_key=True)       # UUID
    type = Column(String)                        # ConfigType enum value
    description = Column(String)                 # Model name (for family resolution) or format name
    config = Column(Text)                        # Serialized JSON or YAML string
    model_family = Column(String)                # Architecture family (e.g., "LlamaForCausalLM")
    is_default = Column(Integer, default=1)      # 1 = system/default, 0 = user-created

is_default Semantics

| Value | Constant | Meaning |
|---|---|---|
| 1 | DEFAULT_CONFIGS | System-provided default configuration |
| 0 | USER_CONFIGS | User-created configuration |

User-created configs always have is_default=0. The add_config() function sets this automatically.

Default Config Values

Defined in ft/consts.py:

DEFAULT_TRAINING_ARGUMENTS

{
    "num_train_epochs": 1,
    "optim": "paged_adamw_32bit",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "warmup_ratio": 0.03,
    "max_grad_norm": 0.3,
    "learning_rate": 0.0002,
    "fp16": true,
    "logging_steps": 1,
    "lr_scheduler_type": "constant",
    "disable_tqdm": true,
    "report_to": "mlflow",
    "ddp_find_unused_parameters": false
}

DEFAULT_BNB_CONFIG

{
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_use_double_quant": true,
    "quant_method": "bitsandbytes"
}

DEFAULT_LORA_CONFIG

{
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}

DEFAULT_GENERATIONAL_CONFIG

{
    "do_sample": true,
    "temperature": 0.8,
    "max_new_tokens": 60,
    "top_p": 1,
    "top_k": 50,
    "num_beams": 1,
    "repetition_penalty": 1.1,
    "max_length": null
}

Config Deduplication

add_config() implements content-addressed caching:

  1. Parse the incoming config string: yaml.safe_load() for axolotl type, json.loads() for all others.
  2. Re-serialize to a canonical form (yaml.dump() or json.dumps()).
  3. Query existing configs of the same type (and same model_family if description is provided).
  4. Compare parsed content of each existing config against the parsed request content.
  5. If an identical config exists, return it. At most one duplicate is expected (asserted).
  6. If no match, create a new Config with is_default=USER_CONFIGS (0).

When description is provided, it is interpreted as a model name: transform_name_to_family(description) resolves the HuggingFace architecture (e.g., "LlamaForCausalLM") and scopes the deduplication query to that family.
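The dedup steps above amount to comparing parsed content rather than raw strings, so whitespace and key order differences do not create duplicates. A minimal sketch (in-memory store and ID scheme are illustrative):

```python
import json

# Sketch of content-addressed config dedup over an in-memory list of
# config records (the real store is the SQLite configs table).
def find_or_add_config(existing, new_config_str, config_type):
    new_parsed = json.loads(new_config_str)
    matches = [c for c in existing
               if c["type"] == config_type and json.loads(c["config"]) == new_parsed]
    assert len(matches) <= 1, "at most one duplicate expected"
    if matches:
        return matches[0]["id"]          # reuse the existing config
    new_id = f"cfg-{len(existing)}"      # stand-in for uuid4()
    existing.append({"id": new_id, "type": config_type,
                     "config": json.dumps(new_parsed)})
    return new_id

store = []
a = find_or_add_config(store, '{"r": 16, "lora_alpha": 32}', "lora_config")
# Same content with different key order and spacing: deduplicated.
b = find_or_add_config(store, '{"lora_alpha": 32,  "r": 16}', "lora_config")
assert a == b and len(store) == 1
```

Axolotl configs follow the same pattern with yaml.safe_load() in place of json.loads().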

Model-Family-Specific Filtering

list_configs() applies model-aware filtering when model_id is present in the request:

  1. Optionally filter by type if specified.
  2. If model_id is provided, call get_configs_for_model_id():
    • Fetch the Model record and resolve huggingface_model_name.
    • Instantiate ModelMetadataFinder(model_hf_name) and call fetch_model_family_from_config().
    • Filter configs where model_family matches and is_default == 1.
    • If no model-specific defaults exist, fall back to returning all configs.
  3. User configs (is_default=0) are not filtered by model family in get_configs_for_model_id() – they are returned when no model-specific defaults are found (fallback behavior).

Axolotl Config Template

The Axolotl config template is loaded from ft/config/axolotl/training_config/template.yaml via get_axolotl_training_config_template_yaml_str(). Axolotl dataset format configs are stored in ft/config/axolotl/dataset_formats/.

Protobuf Message

ConfigMetadata fields: id, type, description, config (serialized JSON/YAML string), model_family, is_default.

Fine-Tuning Job Lifecycle

A fine-tuning job trains a PEFT LoRA adapter on a base model using a configured dataset and prompt template. Jobs are dispatched as CML Jobs via the cmlapi SDK. The entry point is ft/jobs.py::start_fine_tuning_job(), which validates the request, prepares the execution environment, and creates the CML workload.

Job Dispatch Flow

  1. Validate Request – _validate_fine_tuning_request() checks all fields against the rules below. Any violation raises a ValueError that propagates as a gRPC error.
  2. Create Job Directory – A UUID job_id is generated. The directory .app/job_runs/{job_id} is created to hold training artifacts.
  3. Find Template CML Job – The dispatcher locates the Accel_Finetuning_Base_Job template in the CML project. This template defines the runtime environment and script path.
  4. Build Argument List – All training parameters are serialized into a --key value string passed as JOB_ARGUMENTS.
  5. Create CML Job + JobRun – A CML Job and its first JobRun are created via cmlapi, with the specified CPU, GPU, and memory resources.
  6. Store Job Record – A FineTuningJob record is inserted into the fine_tuning_jobs table with all metadata for tracking.

Validation Rules

Validation is performed by ft/jobs.py::_validate_fine_tuning_request() before any side effects occur.

| Field | Rule | Error |
|---|---|---|
| framework_type | Must be legacy or axolotl | “framework_type must be either legacy or axolotl” |
| adapter_name | Alphanumeric + hyphens only (^[a-zA-Z0-9-]+$) | “adapter_name must be alphanumeric” |
| out_dir | Must exist as directory | “output_dir does not exist” |
| num_cpu | > 0 | “cpu must be greater than 0” |
| num_gpu | >= 0 | “gpu must be at least 0” |
| num_memory | > 0 | “memory must be greater than 0” |
| num_workers | > 0 | “num_workers must be greater than 0” |
| num_epochs | > 0 | “Number of epochs must be greater than 0” |
| learning_rate | > 0 | “Learning rate must be greater than 0” |
| dataset_fraction | (0, 1] | “dataset_fraction must be between 0 and 1” |
| train_test_split | (0, 1] | “train_test_split must be between 0 and 1” |
| axolotl_config_id | Required when framework=axolotl | “axolotl framework requires axolotl_config_id” |
| base_model_id | Must exist in DB | “Model not found” |
| dataset_id | Must exist in DB | “Dataset not found” |
| prompt_id | Must exist in DB (legacy only) | “Prompt not found” |
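A few of these rules, sketched as plain-dict checks (the real validator operates on the protobuf request and the database; field access here is illustrative):

```python
import re

# Sketch of a subset of _validate_fine_tuning_request() rules.
def validate(req):
    if req["framework_type"] not in ("legacy", "axolotl"):
        raise ValueError("framework_type must be either legacy or axolotl")
    # Adapter names are restricted to alphanumerics and hyphens.
    if not re.fullmatch(r"[a-zA-Z0-9-]+", req["adapter_name"]):
        raise ValueError("adapter_name must be alphanumeric")
    # Fraction must fall in the half-open interval (0, 1].
    if not 0 < req["dataset_fraction"] <= 1:
        raise ValueError("dataset_fraction must be between 0 and 1")
    if req["framework_type"] == "axolotl" and not req.get("axolotl_config_id"):
        raise ValueError("axolotl framework requires axolotl_config_id")

validate({"framework_type": "legacy", "adapter_name": "my-adapter-1",
          "dataset_fraction": 0.5})
```

Because validation runs before any side effects, a rejected request leaves no job directory, CML Job, or database record behind.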

Framework Types

Legacy

Uses HuggingFace Accelerate with the TRL SFTTrainer. The user provides each configuration component separately:

  • prompt_id – A prompt template that maps dataset features to the training text format.
  • LoRA config – PEFT LoRA hyperparameters (rank, alpha, dropout, target modules).
  • BnB config – BitsAndBytes quantization settings (4-bit NF4 quantization).
  • Training arguments – Standard HuggingFace TrainingArguments fields (epochs, learning rate, batch size, etc.).

For distributed training, worker resources are specified independently via dist_cpu, dist_gpu, and dist_mem fields.

Axolotl

Uses the Axolotl training framework. The user provides a single YAML configuration file (referenced by axolotl_config_id) that bundles all training parameters, LoRA settings, and dataset handling into one document. If no prompt_id is provided, the system auto-generates a prompt from the dataset format definition. See Axolotl Integration for details.

Resource Specification

Each job requires explicit compute resource allocation:

| Field | Description |
|---|---|
| num_cpu | CPU cores for the primary training worker |
| num_gpu | GPU count for the primary training worker |
| num_memory | Memory in GB for the primary training worker |
| num_workers | Number of training workers (Accelerate distributed training) |

For legacy distributed training, additional fields specify per-worker resources:

| Field | Description |
|---|---|
| dist_cpu | CPU cores per distributed worker |
| dist_gpu | GPU count per distributed worker |
| dist_mem | Memory in GB per distributed worker |

Argument List Schema

Arguments are passed as the JOB_ARGUMENTS environment variable to the CML Job. The value is a space-delimited string of --key value pairs.
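Serialization into that string is a straightforward join of --key value pairs; a sketch (function name and omit-empty behavior are illustrative of the dispatcher, not its exact code):

```python
# Sketch of building the JOB_ARGUMENTS string from training parameters.
def build_job_arguments(params):
    # Only non-empty values are included, as "--key value" pairs.
    return " ".join(f"--{k} {v}" for k, v in params.items() if v not in (None, ""))

args = build_job_arguments({
    "base_model_id": "m-123",
    "dataset_id": "d-456",
    "framework_type": "legacy",
    "hf_token": "",            # empty -> omitted
})
assert args == "--base_model_id m-123 --dataset_id d-456 --framework_type legacy"
```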

Core arguments (always present):

| Key | Source |
|---|---|
| base_model_id | Request field |
| dataset_id | Request field |
| experimentid | Generated UUID (same as job_id) |
| out_dir | Request field |
| train_out_dir | Constructed path for training output |
| adapter_name | Request field |
| framework_type | Request field (legacy or axolotl) |

Optional arguments (included when non-empty):

| Key | Description |
|---|---|
| prompt_id | Prompt template ID (required for legacy, optional for axolotl) |
| bnb_config | BitsAndBytes config ID |
| lora_config | LoRA config ID |
| training_arguments_config | Training arguments config ID |
| hf_token | HuggingFace access token |
| axolotl_config_id | Axolotl YAML config ID |
| gpu_label_id | GPU label config ID |

Legacy distributed training arguments:

| Key | Description |
|---|---|
| dist_num | Number of distributed workers |
| dist_cpu | CPU per worker |
| dist_mem | Memory per worker |
| dist_gpu | GPU per worker |

Protobuf Messages

The job lifecycle uses two primary protobuf messages:

  • StartFineTuningJobRequest – Contains all fields listed above. Sent by the client to initiate training.
  • StartFineTuningJobResponse – Returns the created job metadata including the generated job_id and CML job identifiers.
  • FineTuningJobMetadata – The full job record stored in the database and returned by GetFineTuningJob and ListFineTuningJobs RPCs.

See gRPC Service Design for the complete RPC catalog.

Training Script Architecture

The training script is the code that runs inside a CML Job after dispatch. It receives configuration via environment variables, loads and preprocesses data, trains a PEFT LoRA adapter, and saves the result.

Entry Point

ft/scripts/accel_fine_tune_base_script.py

The script is executed as a CML Job. Arguments are received via the JOB_ARGUMENTS environment variable as a space-delimited string with --key value pairs, parsed into an argparse namespace at startup.
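On the receiving side, the startup parse can be sketched like this (argument names are from the schema above; the exact argparse setup in the script may differ):

```python
import argparse
import os
import shlex

# Simulate the environment a CML Job would provide.
os.environ["JOB_ARGUMENTS"] = (
    "--base_model_id m-123 --dataset_id d-456 --framework_type legacy"
)

# Parse the space-delimited "--key value" string into a namespace.
parser = argparse.ArgumentParser()
parser.add_argument("--base_model_id")
parser.add_argument("--dataset_id")
parser.add_argument("--framework_type", default="legacy")
args = parser.parse_args(shlex.split(os.environ["JOB_ARGUMENTS"]))

assert args.base_model_id == "m-123"
```

Passing an explicit argv to parse_args() keeps the script testable outside a CML workload.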

Execution Flow

  1. Parse JOB_ARGUMENTS – The JOB_ARGUMENTS environment variable is split and parsed via argparse into a namespace containing all training parameters.
  2. Load base model – The HuggingFace model is loaded with optional BitsAndBytesConfig for 4-bit NF4 quantization. The model ID is resolved from the Studio database using base_model_id.
  3. Configure tokenizer padding – The tokenizer is inspected for a suitable pad token. The function find_padding_token_candidate() searches the vocabulary for tokens containing “pad” or “reserved”.
  4. Apply PEFT LoRA adapter – A LoraConfig is constructed from the config blob stored in the database, and the model is wrapped with get_peft_model().
  5. Load and preprocess dataset:
    • load_dataset_into_memory() reads the dataset into a HuggingFace DatasetDict.
    • map_dataset_with_prompt_template() formats each row using the prompt template, appending the EOS token.
    • sample_and_split_dataset() downsamples by the configured fraction and splits into train/test sets (seed=42).
  6. Initialize SFTTrainer – A TRL SFTTrainer is created with the processed dataset, model, tokenizer, and training arguments.
  7. Train – trainer.train() executes the training loop.
  8. Save adapter weights – The trained LoRA adapter is saved to the output directory.
  9. Auto-register adapter – If auto_add_adapter=true, the adapter is registered in the Studio database automatically after training completes.

Dataset Preprocessing Chain

| Step | Function | Input | Output |
|---|---|---|---|
| Load | load_dataset_into_memory() | Dataset metadata (type, path, HF name) | HF DatasetDict |
| Format | map_dataset_with_prompt_template() | DatasetDict + prompt template | DatasetDict with prediction column |
| Sample/Split | sample_and_split_dataset() | DatasetDict + fraction + split ratio | Train/test DatasetDict |

The prediction column contains the fully formatted training text for each row – the prompt template applied to dataset features with the EOS token appended. This column name is defined by TRAINING_DATA_TEXT_FIELD.

Key Training Utilities

All utilities are defined in ft/training/utils.py.

get_model_parameters(model)

Returns a tuple of (total_params, trainable_params) for the model. Used for logging the parameter count before and after applying the LoRA adapter.

map_dataset_with_prompt_template(dataset, template)

Applies the prompt template to each row in the dataset. The template contains prompt_template, input_template, and completion_template fields that are formatted with the dataset’s feature columns. The EOS token is appended to the prediction field to signal sequence boundaries during training.

sample_and_split_dataset(ds, fraction, split)

Downsamples the dataset to the specified fraction (e.g., 0.5 = 50% of rows), then splits into train and test sets at the given ratio. Uses TRAINING_DATASET_SEED = 42 for reproducible splits across runs.
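The sample-then-split behavior can be sketched with plain lists standing in for a HuggingFace DatasetDict (the real function uses the datasets library's own sampling/splitting APIs):

```python
import random

TRAINING_DATASET_SEED = 42

# Sketch of sample_and_split_dataset() over a plain list of rows.
def sample_and_split(rows, fraction, split):
    rng = random.Random(TRAINING_DATASET_SEED)   # fixed seed -> reproducible
    rows = rng.sample(rows, int(len(rows) * fraction))  # downsample
    cut = int(len(rows) * split)                 # e.g. 0.9 -> 90% train
    return rows[:cut], rows[cut:]

train, test = sample_and_split(list(range(100)), fraction=0.5, split=0.9)
assert len(train) == 45 and len(test) == 5
```

Because the seed is fixed, repeated runs over the same dataset produce identical splits, which keeps evaluation comparable across training runs.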

find_padding_token_candidate(tokenizer)

Searches the tokenizer vocabulary for tokens containing “pad” or “reserved” as substrings. Returns the first match found, or None if no candidate exists.

configure_tokenizer_padding(tokenizer, pad_token)

Sets the tokenizer’s padding token using a fallback chain:

  1. Use the tokenizer’s existing pad_token if already set.
  2. Use the provided pad_token argument if given.
  3. Use the tokenizer’s unk_token if available.
  4. Search for reserved token candidates via find_padding_token_candidate().

This ensures every tokenizer has a valid pad token regardless of the base model’s configuration.
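The fallback chain can be sketched with a stand-in tokenizer object (the Tok class and attribute shapes are illustrative; real tokenizers come from HuggingFace transformers):

```python
# Sketch of the pad-token fallback chain from ft/training/utils.py.
def find_padding_token_candidate(tokenizer):
    # First vocabulary token containing "pad" or "reserved", else None.
    return next((t for t in tokenizer.vocab
                 if "pad" in t or "reserved" in t), None)

def configure_tokenizer_padding(tokenizer, pad_token=None):
    if tokenizer.pad_token:                       # 1. already set
        return tokenizer.pad_token
    if pad_token:                                 # 2. explicit argument
        tokenizer.pad_token = pad_token
    elif tokenizer.unk_token:                     # 3. unk_token fallback
        tokenizer.pad_token = tokenizer.unk_token
    else:                                         # 4. search the vocabulary
        tokenizer.pad_token = find_padding_token_candidate(tokenizer)
    return tokenizer.pad_token

class Tok:  # minimal stand-in for a HF tokenizer
    def __init__(self, vocab, pad=None, unk=None):
        self.vocab, self.pad_token, self.unk_token = vocab, pad, unk

assert configure_tokenizer_padding(Tok(["<s>", "<pad>"])) == "<pad>"
```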

Training Constants

Defined in ft/consts.py:

| Constant | Value | Purpose |
|---|---|---|
| TRAINING_DATA_TEXT_FIELD | "prediction" | Column name for the formatted training text in the preprocessed dataset |
| TRAINING_DEFAULT_TRAIN_TEST_SPLIT | 0.9 | Default train/test split ratio (90% train, 10% test) |
| TRAINING_DEFAULT_DATASET_FRACTION | 1.0 | Default dataset fraction (use full dataset) |
| TRAINING_DATASET_SEED | 42 | Random seed for reproducible dataset splitting and sampling |

Relationship to Job Lifecycle

The training script is the execution payload created by the Fine-Tuning Job Lifecycle. The job dispatch process builds the JOB_ARGUMENTS string, creates the CML Job pointing to this script, and starts a JobRun. The script runs independently inside the CML workload – it reads its configuration from the environment, accesses the Studio database directly for resource metadata (model paths, dataset locations, config blobs), and writes adapter weights to the output directory.

Axolotl Integration

Axolotl is an alternative training framework supported as a first-class framework_type alongside the legacy HuggingFace Accelerate + TRL path. It replaces the separate LoRA, BitsAndBytes, and training argument configs with a single YAML configuration file that defines the entire training run.

Config Structure

Axolotl configurations are stored in the configs table with ConfigType.axolotl. A template YAML is provided at:

ft/config/axolotl/training_config/template.yaml

This template defines the baseline Axolotl training configuration. Users can create custom configs by modifying the template values. The YAML file specifies model loading, LoRA parameters, quantization, dataset handling, training hyperparameters, and output settings in a single document.

Dataset Format Configs

Dataset format definitions are stored as ConfigType.axolotl_dataset_formats in the configs table. The source files live in:

ft/config/axolotl/dataset_formats/

Each JSON file defines the expected column structure for a specific Axolotl dataset type (e.g., alpaca, completion, sharegpt). These files are loaded into the database during initialization by:

ft/initialize_db.py::InitializeDB.initialize_axolotl_dataset_type_configs()

Pydantic Models

The dataset format structure is defined by two Pydantic models in ft/api/types.py:

DatasetFormatInfo:

FieldTypeDescription
namestrHuman-readable name of the dataset format
descriptionstrThe Axolotl dataset type identifier (e.g., alpaca, completion)
formatDict[str, Any]Map of feature column names to their expected types or descriptions

DatasetFormatsCollection:

FieldTypeDescription
dataset_formatsDict[str, DatasetFormatInfo]Map of format names to their definitions
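A sketch of these two models, reconstructed from the field tables above — the actual definitions in ft/api/types.py may carry additional validation or defaults:

```python
from typing import Any, Dict

from pydantic import BaseModel


class DatasetFormatInfo(BaseModel):
    name: str               # human-readable name of the dataset format
    description: str        # Axolotl dataset type identifier, e.g. "alpaca"
    format: Dict[str, Any]  # feature column name -> expected type/description


class DatasetFormatsCollection(BaseModel):
    dataset_formats: Dict[str, DatasetFormatInfo]  # format name -> definition
```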

Auto-Prompt Generation

When a fine-tuning job uses the axolotl framework and no prompt_id is provided, the system automatically generates a prompt template from the dataset format definition. This is handled by ft/jobs.py::_add_prompt_for_dataset().

Generation steps:

  1. Load the Axolotl YAML config from the database using axolotl_config_id.
  2. Extract the type field from the dataset section of the YAML config. This identifies the expected dataset format (e.g., alpaca, completion).
  3. Query the database for a config of type axolotl_dataset_formats whose description field matches the extracted type.
  4. Parse the dataset format config to extract the feature column names from the format dictionary.
  5. Generate a default prompt template by concatenating "Feature: {feature}\n" for each feature column.
  6. Check whether an identical prompt already exists for this dataset to avoid duplicates.
  7. Create and return a new prompt record if no duplicate is found.

This mechanism ensures that Axolotl jobs always have a valid prompt template, even when the user does not explicitly create one.
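Steps 4–5 reduce to a simple string build over the format dictionary. A minimal sketch (the function name is assumed; the real logic lives in _add_prompt_for_dataset()):

```python
def default_prompt_template(dataset_format: dict) -> str:
    """Concatenate one 'Feature: {column}' line per feature column, producing
    a Python format-string template over the dataset's columns (sketch)."""
    return "".join("Feature: {%s}\n" % column for column in dataset_format)
```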

Legacy vs. Axolotl Comparison

AspectLegacyAxolotl
Config formatSeparate JSON blobs (LoRA, BnB, training args)Single YAML file
Prompt handlingUser must create and select a prompt templateAuto-generated from dataset format if not provided
Required configsprompt_id + lora_config + bnb_config + training_arguments_configaxolotl_config_id only
Training engineHuggingFace Accelerate + TRL SFTTrainerAxolotl framework
Distributed trainingSupported via dist_* fieldsManaged by Axolotl config
Validationprompt_id requiredaxolotl_config_id required; prompt_id optional

Workflow

To use Axolotl for fine-tuning:

  1. Register a base model and dataset via the standard AddModel and AddDataset RPCs.
  2. Create or use an existing Axolotl YAML config (stored as ConfigType.axolotl).
  3. Call StartFineTuningJob with framework_type = "axolotl" and axolotl_config_id set to the config ID.
  4. Omit prompt_id to use auto-generation, or provide one to override.
  5. The job dispatcher passes axolotl_config_id in the JOB_ARGUMENTS to the training script, which loads and executes the Axolotl training pipeline.
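The field set for step 3 can be sketched as a plain dictionary — the real call constructs a StartFineTuningJobRequest protobuf and submits it through the gRPC client, but the required fields follow the comparison table above:

```python
def axolotl_job_fields(base_model_id: str, dataset_id: str,
                       axolotl_config_id: str, prompt_id: str = "") -> dict:
    """Minimal field set for an Axolotl fine-tuning job (sketch)."""
    if not axolotl_config_id:
        # Mirrors the validation rule: axolotl jobs must reference a YAML config.
        raise ValueError("axolotl_config_id is required when framework_type=axolotl")
    return {
        "framework_type": "axolotl",
        "base_model_id": base_model_id,
        "dataset_id": dataset_id,
        "axolotl_config_id": axolotl_config_id,
        "prompt_id": prompt_id,  # empty -> auto-generated from the dataset format
    }
```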

See Fine-Tuning Job Lifecycle for the full dispatch flow and Training Script Architecture for execution details.

Evaluation Job Lifecycle

Evaluation jobs run MLflow evaluation against model+adapter combinations. A single evaluation request can compare multiple adapters against a baseline, with each combination dispatched as a separate CML Job linked by a shared parent_job_id.

Dispatch Architecture

A single StartEvaluationJob request specifies N model+adapter combinations. The dispatcher fans out into N independent CML Jobs, each running its own MLflow evaluation. All jobs share a parent_job_id that groups them for result comparison in the UI.

Validation

Validation is performed by ft/evaluation.py::_validate_start_evaluation_job_request() before any jobs are created.

Required fields:

FieldRule
model_adapter_combinationsNon-empty list of model+adapter pairs
dataset_idMust exist in DB
prompt_idMust exist in DB
cpuValid resource specification
gpuValid resource specification
memoryValid resource specification

Per-combination validation:

  • Each base_model_id in the combinations list must exist in the database.
  • Each adapter_id must exist in the database, or be an empty string to evaluate the base model without an adapter.
  • The referenced dataset and prompt must exist.

Multi-Adapter Dispatch

For each model+adapter combination in the request, the dispatcher executes the following sequence:

  1. Generate IDs – A UUID job_id is generated for each individual evaluation run. A shared parent_job_id is generated once for the entire batch.
  2. Create directories – A result directory and job directory are created for each run.
  3. Find template CML Job – The dispatcher locates the Mlflow_Evaluation_Base_Job template in the CML project.
  4. Build argument list – Each run receives its own argument string containing:
ArgumentDescription
base_model_idThe model to evaluate
adapter_idThe adapter to apply (empty string for base model only)
dataset_idThe evaluation dataset
prompt_idThe prompt template for formatting
result_dirDirectory for evaluation output
configsEvaluation-specific configuration
selected_featuresDataset features to include
eval_dataset_fractionFraction of dataset to evaluate on
comparison_adapter_idThe first adapter in the batch, used as the baseline
job_idThis run’s unique identifier
run_numberOrdinal position in the batch (1-indexed)
  5. Create CML Job and JobRun – A CML Job is created via cmlapi with the specified compute resources.
  6. Store EvaluationJob record – An EvaluationJob record is inserted into the evaluation_jobs table with the parent_job_id for grouping.

Parent Job Grouping

All evaluation runs within a batch share the same parent_job_id. This enables:

  • UI grouping – The Streamlit UI displays evaluation runs grouped by parent, showing all adapter comparisons in a single view.
  • Baseline comparison – The first adapter in the model_adapter_combinations list is designated as the baseline (comparison_adapter_id). All other runs compare their metrics against this baseline.
  • Batch status tracking – The overall status of an evaluation batch can be determined by aggregating the statuses of all child jobs sharing the same parent_job_id.
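Grouping and batch-status aggregation can be sketched over plain job records. The status names here are assumptions for illustration; the Studio's actual status values live in the job metadata:

```python
from collections import defaultdict


def group_by_parent(jobs: list) -> dict:
    """Bucket evaluation job records by their shared parent_job_id."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["parent_job_id"]].append(job)
    return dict(groups)


def batch_status(children: list) -> str:
    """Aggregate child statuses into one batch status (sketch)."""
    statuses = {job["status"] for job in children}
    if "failed" in statuses:
        return "failed"
    if statuses == {"succeeded"}:
        return "succeeded"
    return "running"
```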

Evaluation Script

The evaluation logic runs inside ft/scripts/mlflow_evaluation_base_script.py:

  1. Load model and adapter – The base HuggingFace model is loaded, and the optional PEFT adapter is applied via load_adapted_hf_generation_pipeline(). This produces a text-generation pipeline.
  2. Load and preprocess dataset – The evaluation dataset is loaded, the prompt template is applied to format inputs, and the dataset is sampled to the configured eval_dataset_fraction.
  3. Run MLflow evaluation – MLflow’s evaluation framework is invoked with the configured metrics. Results (metric values and artifacts) are logged to an MLflow experiment.
  4. Log results – Evaluation metrics, predictions, and comparison data are persisted in the MLflow tracking store for retrieval by the UI.

Protobuf Messages

StartEvaluationJobRequest:

FieldDescription
model_adapter_combinationsList of model+adapter pairs to evaluate
dataset_idEvaluation dataset reference
prompt_idPrompt template for input formatting
cpu, gpu, memoryCompute resources per evaluation job
configsEvaluation configuration (metrics, generation settings)

EvaluationJobMetadata:

FieldDescription
idUnique evaluation job identifier
typeJob type identifier
cml_job_idCML Job identifier
parent_job_idShared batch identifier
base_model_idEvaluated model
dataset_idEvaluation dataset
adapter_idApplied adapter (empty for base model)
cpu, gpu, memoryAllocated resources
configsEvaluation configuration
evaluation_dirPath to evaluation results

See gRPC Service Design for the complete evaluation RPC catalog.

Model Export & Registry

Trained adapters can be exported through two routes, determined by ModelExportType. Both routes merge a base model with a PEFT adapter into a deployable artifact, but target different deployment backends.

Export Routes

Both routes require non-empty base_model_id, adapter_id, and model_name fields. The choice between them depends on the target deployment environment and the adapter source type.

MLflow Model Registry

Function: export_model_registry_model()

This route logs the merged model to the MLflow Model Registry as a registered model. It supports any adapter type (PROJECT, HuggingFace).

Steps:

  1. Load pipeline – fetch_pipeline() creates a HuggingFace text-generation pipeline by loading the base model and applying the PEFT adapter.
  2. Quantized loading – If a BitsAndBytesConfig is specified, the base model is loaded with 4-bit quantization before adapter application.
  3. Infer signature – An MLflow model signature is inferred from example input/output pairs. This defines the expected request and response schema for the registered model.
  4. Log model – mlflow.transformers.log_model() logs the pipeline to MLflow as a registered model with the specified model_name.

Requirements:

RequirementDetail
Base modelHuggingFace model registered in Studio
AdapterAny adapter type (PROJECT or HuggingFace)
MLflow trackingMust be configured in the CML environment

CML Model Endpoint

Function: deploy_cml_model()

This route creates a CML Model endpoint that serves the model+adapter combination as a REST API. It is restricted to PROJECT adapters (file-based, local weights).

Steps:

  1. Validate adapter type – Only PROJECT adapters (local file-based weights) are supported. HuggingFace adapters must be downloaded locally first.
  2. Create CML Model – A CML Model object is created via cmlapi.
  3. Create ModelBuild – A build is created pointing to the predict script at ft/scripts/cml_model_predict_script.py. Environment variables are injected:
VariableDescription
FINE_TUNING_STUDIO_BASE_MODEL_HF_NAMEHuggingFace identifier for the base model
ADAPTER_LOCATIONFile path to the adapter weights directory
GEN_CONFIG_STRINGSerialized generation config (JSON string)
  4. Deploy – A ModelDeployment is created with default resources:
ResourceDefault
CPU2 cores
Memory8 GB
GPU1
  5. Resolve runtime – The runtime identifier is inherited from the template Finetuning_Base_Job, ensuring the model endpoint uses the same environment as training workloads.

Requirements:

RequirementDetail
Base modelHuggingFace model registered in Studio
AdapterPROJECT type only (local file weights)
Adapter weightsMust be accessible on the local filesystem

Validation

Both export routes perform the following validation before proceeding:

  • base_model_id must be non-empty and reference an existing model in the database.
  • adapter_id must be non-empty and reference an existing adapter in the database.
  • model_name must be non-empty.

Additional route-specific validation:

  • CML Model: The adapter must be of type PROJECT. Model Registry adapters require the MLflow Registry export path instead.
  • MLflow Registry: The MLflow tracking server must be accessible.

Choosing an Export Route

CriterionMLflow RegistryCML Model Endpoint
Adapter sourceAny (PROJECT, HuggingFace)PROJECT only
Output formatMLflow registered modelREST API endpoint
Serving infrastructureMLflow serving or downstream consumptionCML Model serving
Resource customizationManaged by MLflowDefault 2 CPU / 8 GB / 1 GPU (adjustable post-deploy)
Use caseModel versioning, experiment tracking, CI/CD pipelinesReal-time inference endpoint

See CML Model Serving for details on the predict script and endpoint behavior.

CML Model Serving

A CML Model endpoint serves a fine-tuned model+adapter combination as a REST API. The endpoint is created by deploy_cml_model() (see Model Export & Registry) and runs a predict script that loads the model, applies the adapter, and handles inference requests.

Predict Script

Path: ft/scripts/cml_model_predict_script.py

The predict script runs inside a CML Model endpoint container. It is specified as the build script during deploy_cml_model() and executes in the runtime environment inherited from the template fine-tuning job.

Initialization

On startup, the script:

  1. Reads environment variables:
VariablePurpose
FINE_TUNING_STUDIO_BASE_MODEL_HF_NAMEHuggingFace model identifier to load
ADAPTER_LOCATIONPath to the PEFT adapter weights directory
GEN_CONFIG_STRINGSerialized generation configuration (JSON)
  2. Loads the base model – The HuggingFace model is loaded from the Hub or cache using the identifier in FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME.
  3. Applies the PEFT adapter – The LoRA adapter weights at ADAPTER_LOCATION are loaded and applied to the base model.

Request Handling

The predict script exposes a predict() function that CML invokes for each incoming request.

Request format:

{
  "request": {
    "prompt": "Your input text here"
  }
}

The prompt field contains the raw input text. The predict function:

  1. Extracts the prompt from the request payload.
  2. Tokenizes the input using the model’s tokenizer.
  3. Generates output using the model with the applied generation config.
  4. Decodes and returns the generated text.
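The handler shape can be sketched with the generation pipeline abstracted behind a callable. The response key name is an assumption; in the real script, `generate` is the tokenize → model.generate → decode path on the adapted model:

```python
def predict(args: dict, generate) -> dict:
    """CML-style predict entry point (sketch).

    `args` follows the documented request format; `generate` stands in for
    tokenization, generation, and decoding on the loaded model+adapter.
    """
    prompt = args["request"]["prompt"]   # 1. extract the prompt
    completion = generate(prompt)        # 2-4. tokenize, generate, decode
    return {"response": completion}      # response key is an assumption
```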

Endpoint Creation Flow

The full endpoint creation sequence, initiated by deploy_cml_model():

  1. Create CML Model – A new Model object is created in the CML project via cmlapi. This registers the model name and description.
  2. Create ModelBuild – A build is created with:
    • The predict script path (ft/scripts/cml_model_predict_script.py).
    • Environment variables (FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME, ADAPTER_LOCATION, GEN_CONFIG_STRING).
    • The runtime identifier from the template fine-tuning job.
  3. Create ModelDeployment – A deployment is created with default resource allocation:
ResourceDefault Value
CPU2 cores
Memory8 GB
GPU1
  4. Runtime resolution – The runtime is inherited from the template Finetuning_Base_Job. This ensures the model endpoint has the same Python packages, CUDA version, and system libraries as the training environment.

Limitations

  • PROJECT adapters only – Only adapters stored as local files (PROJECT type) are supported for CML Model deployment. HuggingFace Hub adapters must be downloaded to the project filesystem before they can be used with a CML Model endpoint.
  • Model Registry adapters – Adapters registered through the MLflow Model Registry cannot be deployed as CML Models directly. Use the MLflow Registry export path instead (see Model Export & Registry).
  • Fixed default resources – The deployment is created with 1 GPU, 2 CPU cores, and 8 GB memory. To adjust resource allocation after deployment, modify the CML Model settings through the CML UI or cmlapi.
  • Single adapter – Each CML Model endpoint serves exactly one base model + adapter combination. To serve multiple adapters, create multiple endpoints.

Post-Deployment

After deployment completes:

  • The endpoint URL is available in the CML Model UI and via cmlapi.
  • Requests are sent as HTTP POST with the JSON format shown above.
  • The endpoint auto-scales based on CML’s Model serving configuration.
  • Logs and metrics are available through CML’s standard monitoring interface.
  • Resource allocation can be modified via the CML Model settings without rebuilding.

Validation Rules Reference

The Studio validates resources at multiple points: on import (datasets, models, adapters, prompts, configs), on job submission (fine-tuning, evaluation), and on export (model deployment). This chapter catalogs all validation rules extracted from the source code.

Source: ft/jobs.py, ft/evaluation.py, ft/datasets.py, ft/models.py, ft/adapters.py, ft/prompts.py, ft/configs.py, ft/service.py

Rule ID Convention

Rule IDs follow the format {Domain}-{Number} where Domain is one of:

DomainScope
FTFine-tuning job parameters
EVEvaluation job parameters
DSDataset import
MDModel import
ADAdapter import
PRPrompt template
CFConfiguration blob
EXModel export / deployment

All rules with severity ERROR abort the operation and return a gRPC error. Rules with severity INFO are advisory and do not block the operation.

Fine-Tuning Job Validation

Validated in ft/jobs.py when StartFineTuningJob is called.

Rule IDFieldConstraintSeverity
FT-001framework_typeMust be legacy or axolotlERROR
FT-002adapter_nameMust match ^[a-zA-Z0-9-]+$ (alphanumeric + hyphens, no spaces)ERROR
FT-003out_dirMust exist as a directoryERROR
FT-004num_cpuMust be > 0ERROR
FT-005num_gpuMust be >= 0ERROR
FT-006num_memoryMust be > 0ERROR
FT-007num_workersMust be > 0ERROR
FT-008num_epochsMust be > 0ERROR
FT-009learning_rateMust be > 0ERROR
FT-010dataset_fractionMust be in (0, 1]ERROR
FT-011train_test_splitMust be in (0, 1]ERROR
FT-012axolotl_config_idRequired when framework_type=axolotlERROR
FT-013base_model_idMust exist in models tableERROR
FT-014dataset_idMust exist in datasets tableERROR
FT-015prompt_idMust exist in prompts table (legacy framework only)ERROR
FT-016axolotl_config_idMust exist in configs table (when provided)ERROR

FT-001 through FT-011 are local validations that require no database access. FT-012 is a cross-field consistency check. FT-013 through FT-016 are foreign-key validations resolved against the DAO.

Evaluation Job Validation

Validated in ft/evaluation.py when StartEvaluationJob is called.

Rule IDFieldConstraintSeverity
EV-001model_adapter_combinationsMust be non-emptyERROR
EV-002dataset_idMust be non-emptyERROR
EV-003prompt_idMust be non-emptyERROR
EV-004num_cpu, num_gpu, num_memoryMust be providedERROR
EV-005model IDs in combinationsEach must exist in models tableERROR
EV-006adapter IDs in combinationsEach must exist in adapters table (or empty for base model)ERROR
EV-007dataset_idMust exist in datasets tableERROR
EV-008prompt_idMust exist in prompts tableERROR

Evaluation jobs accept multiple model-adapter pairs in a single request. EV-005 and EV-006 are validated per combination entry.
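The per-combination checks can be sketched as follows. The helper name is assumed, and the ID sets would come from the DAO or gRPC client in practice:

```python
def validate_combinations(combinations, model_ids, adapter_ids):
    """EV-005 / EV-006 sketch: each pair must reference a known model, and a
    known adapter unless adapter_id is empty (base-model-only evaluation)."""
    errors = []
    for n, combo in enumerate(combinations, start=1):
        if combo["base_model_id"] not in model_ids:
            errors.append(f"EV-005: combination {n}: model {combo['base_model_id']} not found")
        adapter_id = combo["adapter_id"]
        if adapter_id and adapter_id not in adapter_ids:
            errors.append(f"EV-006: combination {n}: adapter {adapter_id} not found")
    return errors
```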

Dataset Import Validation

Validated in ft/datasets.py when AddDataset is called.

Rule IDFieldConstraintSeverity
DS-001typeMust be one of: huggingface, project, project_csv, project_json, project_jsonlERROR
DS-002huggingface_nameMust resolve via HfApi.dataset_info() (huggingface type)ERROR
DS-003locationFile must exist (project_csv, project_json, project_jsonl)ERROR

DS-002 makes a network call to the HuggingFace Hub. If HUGGINGFACE_ACCESS_TOKEN is set, it is used for gated dataset access. DS-003 validates the local filesystem path and checks the file extension matches the declared type.

Model Import Validation

Validated in ft/models.py when AddModel is called.

Rule IDFieldConstraintSeverity
MD-001typeMust be one of: huggingface, project, model_registryERROR
MD-002huggingface_model_nameMust resolve via HfApi.model_info() (huggingface type)ERROR
MD-003cml_registered_model_idMust resolve via cmlapi (model_registry type)ERROR

MD-002 contacts the HuggingFace Hub. MD-003 queries the CML Model Registry through the cmlapi SDK.

Adapter Import Validation

Validated in ft/adapters.py when AddAdapter is called.

Rule IDFieldConstraintSeverity
AD-001nameRequired, non-emptyERROR
AD-002model_idRequired, must exist in models tableERROR
AD-003locationMust exist as directory (project type)ERROR
AD-004fine_tuning_job_idMust exist in fine_tuning_jobs table (if provided)ERROR
AD-005prompt_idMust exist in prompts table (if provided)ERROR

AD-004 and AD-005 are optional foreign-key references. When provided, they link the adapter back to the job and prompt that produced it.

Prompt Validation

Validated in ft/prompts.py when AddPrompt is called.

Rule IDFieldConstraintSeverity
PR-001nameRequired, uniqueERROR
PR-002dataset_idRequiredERROR
PR-003prompt_templateRequired, non-emptyERROR
PR-004input_templateRequiredERROR
PR-005completion_templateRequiredERROR

PR-001 enforces uniqueness at the application level before insert. The prompt_template, input_template, and completion_template fields use Python format-string syntax referencing dataset feature column names.
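Because the templates are plain Python format strings over feature column names, rendering one against a dataset row is a single format call (sketch; the helper name is assumed):

```python
def render_template(template: str, row: dict) -> str:
    """Substitute a dataset row's feature columns into a prompt, input, or
    completion template (Python format-string syntax)."""
    return template.format(**row)
```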

Config Validation

Validated in ft/configs.py when AddConfig is called.

Rule IDFieldConstraintSeverity
CF-001typeMust be one of ConfigType enum valuesERROR
CF-002configMust be valid JSON (non-axolotl types)ERROR
CF-003configMust be valid YAML (axolotl type)ERROR
CF-004configDeduplicated – returns existing ID if identical content existsINFO

CF-004 is a deduplication check, not an error. When a config with identical content already exists, the existing record’s ID is returned instead of creating a duplicate. The caller receives a successful response in either case.
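The dedup-by-content behavior can be sketched over an in-memory store. The real check runs against the configs table; the names here are assumptions:

```python
import uuid


def add_config(store: dict, config_type: str, content: str) -> str:
    """CF-004 sketch: return the existing config ID when identical content is
    already registered for this type; otherwise insert a new record."""
    for config_id, record in store.items():
        if record == (config_type, content):
            return config_id
    new_id = str(uuid.uuid4())
    store[new_id] = (config_type, content)
    return new_id
```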

Export Validation

Validated in ft/models.py when ExportModel or RegisterModel is called.

Rule IDFieldConstraintSeverity
EX-001base_model_idRequired, non-emptyERROR
EX-002adapter_idRequired, non-emptyERROR
EX-003model_nameRequired, non-emptyERROR
EX-004adapter typeMust be PROJECT for CML Model deploymentERROR
EX-005model typeMust be huggingface for CML Model deploymentERROR

EX-004 and EX-005 enforce deployment constraints. Only project-local adapters (those with weights on disk) can be deployed as CML Model endpoints, and only HuggingFace-sourced base models are supported for the export merge workflow.

Building a Validation SDK

This chapter provides guidance for building a validation SDK that validates resources and job parameters before submitting them to the Fine Tuning Studio gRPC API. Pre-submission validation catches errors locally, avoiding round-trips to the server and failed CML Job launches.

Source: ft/client.py, ft/proto/fine_tuning_studio_pb2.py, ft/api/types.py

Install the ft Package

The Fine Tuning Studio ships as a pip-installable package. Install it to get access to protobuf definitions, API types, and the gRPC client:

pip install -e /path/to/CML_AMP_LLM_Fine_Tuning_Studio

This provides the ft package in development mode. All protobuf-generated classes and enum types are available for import.

Import Protobuf Types

from ft.api import *

# Or import specific types:
from ft.proto.fine_tuning_studio_pb2 import (
    StartFineTuningJobRequest,
    AddDatasetRequest,
    AddModelRequest,
    AddConfigRequest,
)
from ft.api.types import (
    DatasetType,
    ConfigType,
    FineTuningFrameworkType,
)

These types define the exact field names, types, and enum values accepted by the gRPC API.

Validation Architecture

Validation rules fall into two categories:

CategoryRequiresExamples
Local validationNo external dependenciesRegex checks, numeric bounds, enum membership, cross-field consistency
DB-dependent validationFineTuningStudioClient connectionForeign-key existence (model, dataset, prompt, adapter)

A validation SDK should implement local validation in pure functions and DB-dependent validation through the client.

Local Validation

Replicate the rules from the Validation Rules Reference that require no database access:

import re

def validate_adapter_name(name: str) -> list[str]:
    """FT-002: adapter_name must be alphanumeric + hyphens."""
    errors = []
    if not re.match(r'^[a-zA-Z0-9-]+$', name):
        errors.append("FT-002: adapter_name must match ^[a-zA-Z0-9-]+$")
    return errors

def validate_resource_allocation(num_cpu: int, num_gpu: int, num_memory: int) -> list[str]:
    """FT-004, FT-005, FT-006: resource bounds."""
    errors = []
    if num_cpu <= 0:
        errors.append("FT-004: num_cpu must be > 0")
    if num_gpu < 0:
        errors.append("FT-005: num_gpu must be >= 0")
    if num_memory <= 0:
        errors.append("FT-006: num_memory must be > 0")
    return errors

def validate_training_params(num_epochs: int, learning_rate: float,
                             dataset_fraction: float, train_test_split: float) -> list[str]:
    """FT-008 through FT-011: training parameter ranges."""
    errors = []
    if num_epochs <= 0:
        errors.append("FT-008: num_epochs must be > 0")
    if learning_rate <= 0:
        errors.append("FT-009: learning_rate must be > 0")
    if not (0 < dataset_fraction <= 1):
        errors.append("FT-010: dataset_fraction must be in (0, 1]")
    if not (0 < train_test_split <= 1):
        errors.append("FT-011: train_test_split must be in (0, 1]")
    return errors

def validate_framework_config(framework_type: str, axolotl_config_id: str) -> list[str]:
    """FT-001, FT-012: framework type and config consistency."""
    errors = []
    if framework_type not in ("legacy", "axolotl"):
        errors.append("FT-001: framework_type must be legacy or axolotl")
    if framework_type == "axolotl" and not axolotl_config_id:
        errors.append("FT-012: axolotl_config_id required when framework_type=axolotl")
    return errors

DB-Dependent Validation

Some rules require database lookups. Use FineTuningStudioClient for these:

from ft.client import FineTuningStudioClient

client = FineTuningStudioClient()

# Check if model exists (FT-013)
models = client.get_models()
model_ids = {m.id for m in models}
assert model_id in model_ids, f"FT-013: Model {model_id} not found"

# Check if dataset exists (FT-014)
datasets = client.get_datasets()
dataset_ids = {d.id for d in datasets}
assert dataset_id in dataset_ids, f"FT-014: Dataset {dataset_id} not found"

# Check if prompt exists (FT-015)
prompts = client.get_prompts()
prompt_ids = {p.id for p in prompts}
assert prompt_id in prompt_ids, f"FT-015: Prompt {prompt_id} not found"

The client connects to the gRPC server at the address given by FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT. These environment variables are set during Studio initialization.

Config Content Validation

Validate config content before submitting via AddConfig:

import json
import yaml

def validate_config(config_type: str, config_content: str) -> list[str]:
    """CF-002, CF-003: config content must parse correctly."""
    errors = []
    try:
        if config_type == "axolotl":
            yaml.safe_load(config_content)  # CF-003
        else:
            json.loads(config_content)  # CF-002
    except (yaml.YAMLError, json.JSONDecodeError) as e:
        errors.append(f"Config parse error: {e}")
    return errors

Composing a Validation Pipeline

Combine local and DB-dependent validators into a single function that returns all errors at once:

def validate_fine_tuning_request(
    request: StartFineTuningJobRequest,
    client: FineTuningStudioClient,
) -> list[str]:
    """Validate a fine-tuning request against all applicable rules.

    Returns a list of error strings. An empty list means the request is valid.
    """
    errors = []

    # Local validation
    errors.extend(validate_framework_config(
        request.framework_type, request.axolotl_config_id))
    errors.extend(validate_adapter_name(request.adapter_name))
    errors.extend(validate_resource_allocation(
        request.num_cpu, request.num_gpu, request.num_memory))
    errors.extend(validate_training_params(
        request.num_epochs, request.learning_rate,
        request.dataset_fraction, request.train_test_split))

    # DB-dependent validation
    models = {m.id for m in client.get_models()}
    if request.base_model_id not in models:
        errors.append(f"FT-013: Model {request.base_model_id} not found")

    datasets = {d.id for d in client.get_datasets()}
    if request.dataset_id not in datasets:
        errors.append(f"FT-014: Dataset {request.dataset_id} not found")

    if request.framework_type == "legacy":
        prompts = {p.id for p in client.get_prompts()}
        if request.prompt_id not in prompts:
            errors.append(f"FT-015: Prompt {request.prompt_id} not found")

    if request.axolotl_config_id:
        configs = {c.id for c in client.get_configs()}
        if request.axolotl_config_id not in configs:
            errors.append(f"FT-016: Config {request.axolotl_config_id} not found")

    return errors

Usage Pattern

Call the validation pipeline before submitting any request:

from ft.client import FineTuningStudioClient
from ft.proto.fine_tuning_studio_pb2 import StartFineTuningJobRequest

client = FineTuningStudioClient()

request = StartFineTuningJobRequest(
    framework_type="legacy",
    adapter_name="my-adapter",
    base_model_id="abc-123",
    dataset_id="def-456",
    prompt_id="ghi-789",
    num_cpu=2,
    num_gpu=1,
    num_memory=8,
    num_epochs=3,
    learning_rate=2e-5,
    dataset_fraction=1.0,
    train_test_split=0.8,
)

errors = validate_fine_tuning_request(request, client)
if errors:
    for e in errors:
        print(f"  {e}")
    raise ValueError(f"Validation failed with {len(errors)} error(s)")

# Safe to submit
client.start_fine_tuning_job(request)

This pattern ensures that invalid requests never reach the gRPC server, providing immediate feedback and avoiding wasted CML Job compute.

GitHub Actions Integration

This chapter documents the existing CI/CD configuration and provides patterns for extending it with config validation and formatting checks.

Source: .github/workflows/run-tests.yaml, .github/workflows/docs.yml, bin/run-tests.sh

Existing CI Workflow

The primary CI workflow is .github/workflows/run-tests.yaml:

SettingValue
TriggerPushes and PRs to main and dev branches
Runnerubuntu-latest
Python version3.11
Dependenciesrequirements.txt
Test commandpytest -v --cov=ft --cov-report=html --cov-report=xml -s tests/
Coverage threshold>10%

The workflow installs all dependencies, runs the full test suite with coverage collection, and generates both HTML and XML coverage reports.

Running Tests Locally

# Full test suite with coverage
./bin/run-tests.sh

# Single test file
pytest -v -s tests/test_datasets.py

# Single test method
pytest -v -s tests/test_datasets.py::TestDatasets::test_add_dataset

The bin/run-tests.sh script mirrors the CI configuration. Run it before pushing to catch failures early.

Adding Config Validation to CI

Axolotl YAML configs and dataset format JSON files live under ft/config/. A dedicated workflow validates these files on any PR that modifies them:

name: Validate Configs

on:
  pull_request:
    paths:
      - 'ft/config/**'
      - 'data/project_defaults.json'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install pyyaml pydantic

      - name: Validate Axolotl configs
        run: |
          python -c "
          import yaml, json, glob

          # Validate YAML configs
          for f in glob.glob('ft/config/axolotl/training_config/*.yaml'):
              with open(f) as fh:
                  yaml.safe_load(fh)
              print(f'OK: {f}')

          # Validate dataset format JSON configs
          for f in glob.glob('ft/config/axolotl/dataset_formats/*.json'):
              with open(f) as fh:
                  json.load(fh)
              print(f'OK: {f}')

          # Validate project defaults
          with open('data/project_defaults.json') as fh:
              json.load(fh)
          print('OK: data/project_defaults.json')
          "

The paths filter ensures this workflow only runs when config files change. Any parse error causes the step to fail with a traceback identifying the malformed file.

Pre-Commit Formatting Check

The project uses autoflake and autopep8 for code formatting. Add a CI step to verify formatting compliance:

- name: Check formatting
  run: |
    pip install autoflake autopep8
    autoflake --check --remove-all-unused-imports \
      --ignore-init-module-imports --recursive ft/ tests/ pgs/
    autopep8 --diff --max-line-length 120 \
      --aggressive --aggressive --aggressive \
      --recursive ft/ tests/ pgs/ | head -20

autoflake --check exits non-zero if any unused imports are found, failing the step. autopep8 --diff prints the diff that would be applied; piping through head -20 keeps the output concise, but --diff alone exits zero even when changes are needed. To make formatting drift fail the step, use autopep8's --exit-code flag (without the head pipe) or fail when the diff output is non-empty.

Documentation Deployment

The documentation workflow is .github/workflows/docs.yml. It builds the mdbook with D2 diagram support and deploys to GitHub Pages. This workflow is independent of the test CI and triggers on documentation changes. See the System Overview for the full project architecture.

Extending the CI Pipeline

When adding new validation workflows, follow these conventions:

ConventionGuideline
Path filteringUse paths: to scope workflows to relevant directories
Python versionPin to 3.11 to match production
Dependency isolationInstall only what the validation step needs, not the full requirements.txt
Exit codesRely on tool exit codes for pass/fail – avoid custom success checks
Artifact uploadsUse actions/upload-artifact@v4 for coverage reports or validation logs
Branch protectionConfigure required status checks on main to enforce green CI before merge