Introduction
Fine Tuning Studio is a Cloudera AMP (Applied ML Prototype) for managing, fine-tuning, and evaluating large language models within Cloudera Machine Learning (CML). It provides a Streamlit UI backed by a gRPC API, a SQLite metadata store, and job dispatch to CML workloads for training and evaluation. Models, datasets, PEFT adapters, and prompt templates are managed as first-class resources that flow through the import-train-evaluate-deploy lifecycle.
This guide serves two audiences:
| If you are… | Start here |
|---|---|
| Building a training harness or extending the platform (custom gRPC clients, new dataset types, training scripts, Axolotl integrations) | Architecture Reference |
| Building a validation SDK or CI/CD pipeline for fine-tuning artifacts (config validation, adapter packaging, model export) | Resource Specifications and Validation Rules |
Terminology
| Term | Definition |
|---|---|
| Dataset | A reference to a HuggingFace Hub dataset or a local file (CSV, JSON, JSONL) registered in the Studio’s metadata store. Features are auto-extracted on import. |
| Model | A base foundation model registered from HuggingFace Hub or the CML Model Registry. Serves as the starting point for fine-tuning. |
| Adapter | A PEFT LoRA adapter — either produced by a fine-tuning job, imported from a local directory, or fetched from HuggingFace Hub. Applied on top of a base model. |
| Prompt Template | A format-string template that maps dataset feature columns into training input. Contains prompt_template, input_template, and completion_template fields. |
| Config | A named configuration blob — training arguments, BitsAndBytes quantization, LoRA hyperparameters, generation config, or Axolotl YAML. Configs are deduplicated by content. |
| Fine-Tuning Job | A CML Job that trains a PEFT adapter. Dispatched via the gRPC API, tracked in the metadata store, executed as a CML workload with configurable CPU/GPU/memory. |
| Evaluation Job | A CML Job that runs MLflow evaluation against one or more model+adapter combinations. Results are tracked in MLflow experiments. |
| gRPC Service | The Fine Tuning Service (FTS) — a stateless gRPC server on port 50051 that hosts all application logic. Accessed via FineTuningStudioClient. |
| DAO | Data Access Object — FineTuningStudioDao manages SQLAlchemy sessions and connection pooling against the SQLite database. |
| CML Workload | A Cloudera ML Job, session, or model endpoint. Fine-tuning and evaluation are dispatched as CML Jobs via the cmlapi SDK. |
Resource Lifecycle
The lifecycle begins with importing resources (datasets from HuggingFace or local files, base models, prompt templates) and ends with deploying trained adapters to the CML Model Registry or as CML Model endpoints. The gRPC API drives every step — the Streamlit UI is a client of this API, not the source of truth.
System Overview
Fine Tuning Studio is a three-layer application running inside a single CML Application pod. A Streamlit frontend communicates with a gRPC backend over localhost; the backend persists metadata to SQLite and dispatches CML Jobs for training and evaluation workloads.
Component Topology
Layer Summary
Presentation Layer
Entry point: main.py. Page modules live in pgs/. Two navigation modes are controlled by the IS_COMPOSABLE environment variable:
- Composable mode (IS_COMPOSABLE set): Horizontal navbar with dropdown menus for Home, Database, Resources, Experiments, AI Workbench, Examples, and Feedback.
- Standard mode (default): Sidebar navigation with section headers and Material Design icons.
Pages obtain shared gRPC and CML client instances through @st.cache_resource decorators defined in pgs/streamlit_utils.py. See Streamlit Presentation Layer for full details.
Application Layer
A gRPC server runs on port 50051, started by bin/start-grpc-server.py as a background subprocess. The service class FineTuningStudioApp in ft/service.py implements FineTuningStudioServicer (generated from protobuf). It is a pure router – each RPC method delegates to a domain function in the corresponding module:
| Module | Domain |
|---|---|
| ft/datasets.py | Dataset import, listing, removal |
| ft/models.py | Model registration, export |
| ft/adapters.py | Adapter management, dataset split lookup |
| ft/prompts.py | Prompt template CRUD |
| ft/jobs.py | Fine-tuning job dispatch and tracking |
| ft/evaluation.py | Evaluation job dispatch and tracking |
| ft/configs.py | Configuration blob management |
| ft/database_ops.py | Database export/import operations |
The servicer holds a cmlapi.default_client() and a FineTuningStudioDao instance, passing both to every domain function call. See gRPC Service Design for the full API surface.
Data Layer
SQLite at .app/state.db via SQLAlchemy ORM. Seven tables: models, datasets, adapters, prompts, fine_tuning_jobs, evaluation_jobs, configs. The DAO manages sessions with connection pooling (pool_size=5, max_overflow=10, pool_timeout=30, pool_recycle=1800). See Data Tier for schemas and the DAO API.
Initialization Sequence
The startup sequence is defined in .project-metadata.yaml and executed by bin/start-app-script.sh:
- Install dependencies – bin/install-dependencies-uv.py installs from requirements.txt and performs pip install -e . to install the ft package in dev mode.
- Create template CML Jobs – Accel_Finetuning_Base_Job and Mlflow_Evaluation_Base_Job are created as reusable job templates for fine-tuning and evaluation dispatch.
- Initialize project defaults – bin/initialize-project-defaults-uv.py populates default datasets, prompts, models, and adapters from data/project_defaults.json.
- Start gRPC server – bin/start-grpc-server.py launches as a background process (&), binds to port 50051 with a ThreadPoolExecutor(max_workers=10), and sets FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT as CML project environment variables via cmlapi.
- Start Streamlit – uv run -m streamlit run main.py --server.port $CDSW_APP_PORT --server.address 127.0.0.1.
Both processes (gRPC server and Streamlit) run in the same pod. The gRPC server is the subprocess; Streamlit is the foreground process that keeps the CML Application alive.
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| FINE_TUNING_SERVICE_IP | gRPC server IP address | Set at startup from CDSW_IP_ADDRESS |
| FINE_TUNING_SERVICE_PORT | gRPC server port | 50051 |
| FINE_TUNING_STUDIO_SQLITE_DB | SQLite database file path | .app/state.db |
| CDSW_PROJECT_ID | CML project identifier | Set by CML runtime |
| CDSW_APP_PORT | Streamlit server port | Set by CML runtime |
| HUGGINGFACE_ACCESS_TOKEN | HuggingFace Hub token for gated models | Optional (empty string) |
| IS_COMPOSABLE | Enable horizontal navbar mode | Optional (unset = sidebar) |
| CUSTOM_LORA_ADAPTERS_DIR | Directory for custom LoRA adapters | data/adapters/ |
| FINE_TUNING_STUDIO_PROJECT_DEFAULTS | Path to project defaults JSON | data/project_defaults.json |
Key Takeaway for Harness Builders
The gRPC API is the sole interface to application logic. The Streamlit UI is one client of this API, not the source of truth. Any external harness, CLI tool, or automation script should instantiate a FineTuningStudioClient (or use the generated gRPC stub directly) and interact through the protobuf contract. The database is an implementation detail behind the DAO – never access .app/state.db directly from external code.
To build a custom training harness:
- Import FineTuningStudioClient from ft.client.
- Register resources (datasets, models, prompts) via Add* RPCs.
- Dispatch training via StartFineTuningJob with the desired resource IDs and compute configuration.
- Poll job status via GetFineTuningJob or ListFineTuningJobs.
- Evaluate results via StartEvaluationJob.
All resource IDs are UUIDs assigned by the service. Pass them by value between RPCs.
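The status-polling step can be sketched as a simple loop. The helper below is illustrative only: FakeClient is a stand-in so the sketch runs offline, and a real harness would instead call GetFineTuningJob and inspect the CML workload state behind the returned cml_job_id.

```python
import time

class FakeClient:
    """Stand-in for FineTuningStudioClient so the sketch runs offline."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self, job_id):
        # Real code: derive this from a GetFineTuningJob response.
        return next(self._statuses)

def wait_for_job(client, job_id, poll_seconds=0.0, max_polls=100):
    """Poll until the job reaches a terminal state; raise on timeout."""
    for _ in range(max_polls):
        status = client.get_status(job_id)
        if status in ("succeeded", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

client = FakeClient(["scheduling", "running", "succeeded"])
```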
gRPC Service Design
The Fine Tuning Studio API is defined as a single gRPC service in ft/proto/fine_tuning_studio.proto. The service exposes 32 RPCs organized by resource domain (the catalog below lists every one). A generated Python stub provides the transport layer; FineTuningStudioClient wraps it with error handling and convenience methods.
Service Architecture
RPC Catalog
Most domains follow the same pattern – List, Get, Add (or Start for jobs), and Remove – with a few domain-specific extras such as ExportModel and GetDatasetSplitByAdapter. Request and response types use the naming convention {Action}{Domain}Request / {Action}{Domain}Response.
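The naming convention is mechanical enough to derive programmatically, which is convenient when generating client helpers or validation tables. The helper below is illustrative, not part of the Studio codebase:

```python
def rpc_message_names(action: str, domain: str) -> tuple[str, str]:
    """Derive request/response type names from the {Action}{Domain} convention."""
    return f"{action}{domain}Request", f"{action}{domain}Response"
```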
Dataset RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListDatasets | ListDatasetsRequest | ListDatasetsResponse | Return all registered datasets |
| GetDataset | GetDatasetRequest | GetDatasetResponse | Return a single dataset by ID |
| AddDataset | AddDatasetRequest | AddDatasetResponse | Register a HuggingFace or local dataset |
| RemoveDataset | RemoveDatasetRequest | RemoveDatasetResponse | Delete a dataset registration |
| GetDatasetSplitByAdapter | GetDatasetSplitByAdapterRequest | GetDatasetSplitByAdapterResponse | Get dataset split info for a specific adapter |
Model RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListModels | ListModelsRequest | ListModelsResponse | Return all registered models |
| GetModel | GetModelRequest | GetModelResponse | Return a single model by ID |
| AddModel | AddModelRequest | AddModelResponse | Register a HuggingFace or CML model |
| ExportModel | ExportModelRequest | ExportModelResponse | Export a model to CML Model Registry |
| RemoveModel | RemoveModelRequest | RemoveModelResponse | Delete a model registration |
Adapter RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListAdapters | ListAdaptersRequest | ListAdaptersResponse | Return all registered adapters |
| GetAdapter | GetAdapterRequest | GetAdapterResponse | Return a single adapter by ID |
| AddAdapter | AddAdapterRequest | AddAdapterResponse | Register a local or HuggingFace adapter |
| RemoveAdapter | RemoveAdapterRequest | RemoveAdapterResponse | Delete an adapter registration |
Prompt RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListPrompts | ListPromptsRequest | ListPromptsResponse | Return all prompt templates |
| GetPrompt | GetPromptRequest | GetPromptResponse | Return a single prompt by ID |
| AddPrompt | AddPromptRequest | AddPromptResponse | Create a new prompt template |
| RemovePrompt | RemovePromptRequest | RemovePromptResponse | Delete a prompt template |
Fine-Tuning RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListFineTuningJobs | ListFineTuningJobsRequest | ListFineTuningJobsResponse | Return all fine-tuning jobs |
| GetFineTuningJob | GetFineTuningJobRequest | GetFineTuningJobResponse | Return a single job by ID |
| StartFineTuningJob | StartFineTuningJobRequest | StartFineTuningJobResponse | Dispatch a new fine-tuning CML Job |
| RemoveFineTuningJob | RemoveFineTuningJobRequest | RemoveFineTuningJobResponse | Delete a fine-tuning job record |
Evaluation RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListEvaluationJobs | ListEvaluationJobsRequest | ListEvaluationJobsResponse | Return all evaluation jobs |
| GetEvaluationJob | GetEvaluationJobRequest | GetEvaluationJobResponse | Return a single evaluation job by ID |
| StartEvaluationJob | StartEvaluationJobRequest | StartEvaluationJobResponse | Dispatch a new evaluation CML Job |
| RemoveEvaluationJob | RemoveEvaluationJobRequest | RemoveEvaluationJobResponse | Delete an evaluation job record |
Config RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListConfigs | ListConfigsRequest | ListConfigsResponse | Return all configuration blobs |
| GetConfig | GetConfigRequest | GetConfigResponse | Return a single config by ID |
| AddConfig | AddConfigRequest | AddConfigResponse | Create a new configuration |
| RemoveConfig | RemoveConfigRequest | RemoveConfigResponse | Delete a configuration |
Database RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ExportDatabase | ExportDatabaseRequest | ExportDatabaseResponse | Export entire database as JSON |
| ImportDatabase | ImportDatabaseRequest | ImportDatabaseResponse | Import database from JSON file |
Servicer Implementation
FineTuningStudioApp in ft/service.py extends the generated FineTuningStudioServicer. It holds two shared resources initialized in __init__:
class FineTuningStudioApp(FineTuningStudioServicer):
def __init__(self):
self.cml = cmlapi.default_client()
self.dao = FineTuningStudioDao(engine_args={
"pool_size": 5,
"max_overflow": 10,
"pool_timeout": 30,
"pool_recycle": 1800,
})
self.project_id = os.getenv("CDSW_PROJECT_ID")
Every RPC method is a one-line delegation to the corresponding domain function, passing (request, self.cml, self.dao):
def ListDatasets(self, request, context):
return list_datasets(request, self.cml, self.dao)
def StartFineTuningJob(self, request, context):
return start_fine_tuning_job(request, self.cml, dao=self.dao)
Config and database RPCs omit the cml parameter since they operate on local data only.
Client Wrapper
FineTuningStudioClient in ft/client.py wraps the generated stub with automatic error handling. On construction, it introspects all callable methods on the stub and wraps each one to convert grpc.RpcError into ValueError with cleaned messages.
class FineTuningStudioClient:
def __init__(self, server_ip=None, server_port=None):
if not server_ip:
server_ip = os.getenv("FINE_TUNING_SERVICE_IP")
if not server_port:
server_port = os.getenv("FINE_TUNING_SERVICE_PORT")
self.channel = grpc.insecure_channel(f"{server_ip}:{server_port}")
self.stub = FineTuningStudioStub(self.channel)
# Auto-wrap all stub methods with error handling
for attr in dir(self.stub):
if not attr.startswith('_') and callable(getattr(self.stub, attr)):
setattr(self, attr, self._grpc_error_handler(getattr(self.stub, attr)))
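The wrapper itself is not shown in the excerpt above. A minimal sketch of the pattern follows, using a stand-in RpcError class so it runs without grpc installed; the real implementation catches grpc.RpcError and may clean the message further:

```python
class RpcError(Exception):
    """Stand-in for grpc.RpcError; real code catches the grpc class."""
    def details(self):
        return str(self.args[0]) if self.args else ""

def grpc_error_handler(func):
    """Re-raise transport errors as ValueError carrying only the server message."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RpcError as e:
            raise ValueError(e.details()) from e
    return wrapper

@grpc_error_handler
def flaky_rpc():
    # Simulate a failing stub call.
    raise RpcError("dataset not found")
```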
Convenience Methods
The client provides shorthand accessors that construct the request internally:
| Method | Returns | Equivalent RPC |
|---|---|---|
| get_datasets() | List[DatasetMetadata] | ListDatasets(ListDatasetsRequest()).datasets |
| get_models() | List[ModelMetadata] | ListModels(ListModelsRequest()).models |
| get_adapters() | List[AdapterMetadata] | ListAdapters(ListAdaptersRequest()).adapters |
| get_prompts() | List[PromptMetadata] | ListPrompts(ListPromptsRequest()).prompts |
| get_fine_tuning_jobs() | List[FineTuningJobMetadata] | ListFineTuningJobs(ListFineTuningJobsRequest()).fine_tuning_jobs |
| get_evaluation_jobs() | List[EvaluationJobMetadata] | ListEvaluationJobs(ListEvaluationJobsRequest()).evaluation_jobs |
Usage Example
from ft.client import FineTuningStudioClient
from ft.api import *
client = FineTuningStudioClient()
# List all datasets
datasets = client.get_datasets()
# Add a HuggingFace dataset
client.AddDataset(AddDatasetRequest(
type="huggingface",
huggingface_name="tatsu-lab/alpaca",
name="Alpaca"
))
# Start a fine-tuning job
client.StartFineTuningJob(StartFineTuningJobRequest(
base_model_id="model-uuid",
dataset_id="dataset-uuid",
prompt_id="prompt-uuid",
adapter_name="my-adapter",
num_cpu=2,
num_gpu=1,
num_memory=16,
framework_type="legacy"
))
All request and response types are importable from ft.api, which re-exports the generated protobuf classes.
Protobuf Regeneration
After modifying ft/proto/fine_tuning_studio.proto, regenerate the Python bindings:
./bin/generate-proto-python.sh
This produces ft/proto/fine_tuning_studio_pb2.py (message classes) and ft/proto/fine_tuning_studio_pb2_grpc.py (stub and servicer base class). Both are checked into the repository. Do not edit them by hand.
Server Startup
The gRPC server is started by bin/start-grpc-server.py:
- Creates a grpc.server with ThreadPoolExecutor(max_workers=10).
- Registers FineTuningStudioApp() as the servicer.
- Binds to [::]:50051 (all interfaces).
- Updates CML project environment variables (FINE_TUNING_SERVICE_IP, FINE_TUNING_SERVICE_PORT) via cmlapi so that any workload in the project can locate the server.
- Blocks on server.wait_for_termination().
The server process is launched as a background subprocess by bin/start-app-script.sh before Streamlit starts. See System Overview for the full initialization sequence.
Data Tier
All Fine Tuning Studio metadata is persisted in a SQLite database at .app/state.db (configurable via FINE_TUNING_STUDIO_SQLITE_DB). The ORM layer uses SQLAlchemy declarative models defined in ft/db/model.py. Access is managed through FineTuningStudioDao in ft/db/dao.py.
Schema Topology
Table Schemas
All primary keys are String type (UUIDs assigned by domain logic). All columns are nullable except id. ORM classes are defined in ft/db/model.py.
models
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, cml) |
| framework | String | | Model framework identifier |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_model_name | String | | HuggingFace Hub model ID |
| location | String | | Local filesystem path |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |
datasets
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, local) |
| name | String | | Display name |
| description | Text | | Long-form description |
| huggingface_name | String | | HuggingFace Hub dataset ID |
| location | Text | | Local filesystem path |
| features | Text | | JSON string of dataset feature names |
adapters
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_name | String | | HuggingFace Hub adapter ID |
| model_id | String | FK -> models.id | Base model this adapter targets |
| location | Text | | Local filesystem path to adapter weights |
| fine_tuning_job_id | String | FK -> fine_tuning_jobs.id | Job that produced this adapter |
| prompt_id | String | FK -> prompts.id | Prompt template used during training |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |
prompts
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Prompt type |
| name | String | | Display name |
| description | String | | Human-readable description |
| dataset_id | String | FK -> datasets.id | Dataset this prompt is designed for |
| prompt_template | String | | Full prompt format string |
| input_template | String | | Input portion template |
| completion_template | String | | Completion portion template |
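The template columns above map dataset features into training text. A minimal illustration with Python format-string placeholders follows; the Alpaca-style field names and the {feature} placeholder syntax are assumptions for illustration, not the Studio's shipped defaults:

```python
# Hypothetical Alpaca-style templates; placeholders match dataset feature names.
prompt_template = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
completion_template = "{output}"

row = {"instruction": "Translate to French", "input": "hello", "output": "bonjour"}

# Render the prompt and completion halves against one dataset row.
training_text = prompt_template.format(**row) + completion_template.format(**row)
```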
fine_tuning_jobs
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| base_model_id | String | FK -> models.id | Base model to fine-tune |
| dataset_id | String | FK -> datasets.id | Training dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| cml_job_id | String | | CML Job ID for tracking |
| adapter_id | String | FK -> adapters.id | Resulting adapter |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| num_epochs | Integer | | Training epochs |
| learning_rate | Double | | Learning rate |
| out_dir | String | | Output directory for adapter weights |
| training_arguments_config_id | String | FK -> configs.id | Training arguments config |
| model_bnb_config_id | String | FK -> configs.id | Model BitsAndBytes quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BitsAndBytes quantization config |
| lora_config_id | String | FK -> configs.id | LoRA hyperparameters config |
| training_arguments_config | String | | Serialized training arguments (snapshot) |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| lora_config | String | | Serialized LoRA config (snapshot) |
| dataset_fraction | Double | | Fraction of dataset to use |
| train_test_split | Double | | Train/test split ratio |
| user_script | String | | Custom user training script path |
| user_config_id | String | FK -> configs.id | Custom user config |
| framework_type | String | | Training framework (legacy, axolotl, etc.) |
| axolotl_config_id | String | FK -> configs.id | Axolotl YAML config |
| gpu_label_id | Integer | | GPU label selector |
| adapter_name | String | | Name assigned to the output adapter |
The fine_tuning_jobs table stores both config ID references (foreign keys to configs) and serialized config snapshots (plain string columns). This allows job records to remain self-describing even if the referenced config is later deleted.
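This dual storage suggests a resolution order for harness builders: prefer the live configs row, fall back to the snapshot on the job record. A sketch of that logic follows; resolve_training_args and the dict shapes are illustrative, not Studio code:

```python
import json

def resolve_training_args(job_row: dict, configs: dict) -> dict:
    """Prefer the referenced configs row; fall back to the job's snapshot."""
    cfg = configs.get(job_row.get("training_arguments_config_id"))
    if cfg is not None:
        return json.loads(cfg["config"])
    # Config was deleted: the job record remains self-describing.
    return json.loads(job_row["training_arguments_config"])

job = {
    "training_arguments_config_id": "cfg-1",
    "training_arguments_config": '{"num_train_epochs": 3}',
}
```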
evaluation_jobs
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Evaluation type |
| cml_job_id | String | | CML Job ID for tracking |
| parent_job_id | String | | Parent fine-tuning job (if derived) |
| base_model_id | String | FK -> models.id | Model under evaluation |
| dataset_id | String | FK -> datasets.id | Evaluation dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| adapter_id | String | FK -> adapters.id | Adapter under evaluation |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| evaluation_dir | String | | Output directory for evaluation artifacts |
| model_bnb_config_id | String | FK -> configs.id | Model BnB quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BnB quantization config |
| generation_config_id | String | FK -> configs.id | Generation config for inference |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| generation_config | String | | Serialized generation config (snapshot) |
configs
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Config type (training_arguments, bnb, lora, generation, axolotl) |
| description | String | | Human-readable description |
| config | Text | | JSON or YAML content stored as string |
| model_family | String | | Model family this config targets |
| is_default | Integer | | 1 = shipped default, 0 = user-created |
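The Terminology section notes that configs are deduplicated by content. One way to implement such a check – purely illustrative here, since the Studio's actual comparison may be a plain string match – is to hash a canonicalized rendering of the JSON so key order and whitespace don't produce spurious "new" configs:

```python
import hashlib
import json

def canonical_config_key(config_text: str) -> str:
    """Hash a canonical JSON rendering of a config blob for dedup lookups."""
    canonical = json.dumps(json.loads(config_text), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```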
ORM Mix-ins
All ORM model classes inherit from three bases: Base (SQLAlchemy declarative base), MappedProtobuf, and MappedDict. These mix-ins provide bidirectional serialization.
MappedProtobuf
Converts between protobuf messages and ORM instances.
# Protobuf message -> ORM instance
adapter_orm = Adapter.from_message(adapter_proto_msg)
# ORM instance -> Protobuf message
adapter_proto = adapter_orm.to_protobuf(AdapterMetadata)
from_message() uses ListFields() (protobuf >= 3.15) to extract only fields that were explicitly set in the message, avoiding default-value contamination. to_protobuf() iterates the ORM instance’s non-null columns and sets matching fields on a new protobuf message.
MappedDict
Converts between Python dictionaries and ORM instances.
# Dict -> ORM instance
model_orm = Model.from_dict({"id": "abc", "name": "llama-2"})
# ORM instance -> Dict (non-null fields only)
model_dict = model_orm.to_dict()
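A minimal, self-contained sketch of the MappedDict contract described above – the real mix-in operates over SQLAlchemy columns, whereas this stand-in uses plain attributes purely to show the non-null filtering:

```python
class MappedDictSketch:
    """Illustrative stand-in for the MappedDict mix-in."""

    @classmethod
    def from_dict(cls, d: dict):
        obj = cls()
        for key, value in d.items():
            setattr(obj, key, value)
        return obj

    def to_dict(self) -> dict:
        # Emit only non-null fields, mirroring "non-null columns only".
        return {k: v for k, v in vars(self).items() if v is not None}

class Model(MappedDictSketch):
    pass
```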
Table-Model Registry
ft/db/model.py exports two lookup dictionaries for programmatic table access:
TABLE_TO_MODEL_REGISTRY = {
'datasets': Dataset,
'models': Model,
'prompts': Prompt,
'adapters': Adapter,
'fine_tuning_jobs': FineTuningJob,
'evaluation_jobs': EvaluationJob,
'configs': Config
}
MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}
These are used by the database import/export logic to iterate all application tables.
DAO
FineTuningStudioDao in ft/db/dao.py manages SQLAlchemy engine and session lifecycle.
Constructor
class FineTuningStudioDao:
def __init__(self, engine_url=None, echo=False, engine_args={}):
if engine_url is None:
engine_url = f"sqlite+pysqlite:///{get_sqlite_db_location()}"
self.engine = create_engine(engine_url, echo=echo, **engine_args)
self.Session = sessionmaker(bind=self.engine, autoflush=True, autocommit=False)
Base.metadata.create_all(self.engine)
The servicer instantiates the DAO with connection pool parameters:
| Parameter | Value | Description |
|---|---|---|
| pool_size | 5 | Persistent connections in the pool |
| max_overflow | 10 | Additional connections beyond pool_size |
| pool_timeout | 30 | Seconds to wait for a connection |
| pool_recycle | 1800 | Seconds before a connection is recycled |
Tables are auto-created on first initialization via Base.metadata.create_all(engine).
Session Context Manager
All domain functions access the database through dao.get_session():
@contextmanager
def get_session(self):
session = self.Session()
try:
yield session
session.commit()
except Exception as e:
session.rollback()
raise e
finally:
session.close()
Usage in domain code:
def list_datasets(request, cml, dao):
with dao.get_session() as session:
datasets = session.query(Dataset).all()
# ... convert and return
The context manager guarantees: commit on success, rollback on exception, close in all cases.
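The same guarantees can be demonstrated with a stdlib-only analogue – this sketch uses sqlite3 connections rather than SQLAlchemy sessions, but the commit-on-success, rollback-on-exception, close-always contract is identical:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def sqlite_session(db_path):
    """Commit on success, roll back on exception, close in all cases."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

A failed block leaves the database untouched; a successful block is durably committed.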
Database Export and Import
ft/db/db_import_export.py provides DatabaseJsonConverter for full database serialization.
Export
export_to_json(output_path=None) iterates all non-system tables (excluding sqlite_* internal tables), captures the CREATE TABLE schema and all row data, and returns a JSON string:
{
"models": {
"schema": "CREATE TABLE IF NOT EXISTS models (...)",
"data": [
{"id": "abc-123", "name": "llama-2", "type": "huggingface", ...}
]
},
"datasets": { ... },
...
}
If output_path is provided, the JSON is also written to that file.
Import
import_from_json(json_path) reads a JSON file in the export format, executes each table’s CREATE TABLE IF NOT EXISTS statement, and inserts all rows. Rows that fail to insert (e.g., due to duplicate primary keys) are logged but do not abort the import.
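The import contract – create tables if missing, insert row-by-row, log rather than abort on per-row failures – can be sketched with stdlib sqlite3. import_rows_sketch is illustrative; the real converter is DatabaseJsonConverter:

```python
import json
import logging
import sqlite3

def import_rows_sketch(conn, export_json: str):
    """Replay an export payload, skipping rows that violate constraints."""
    for table, body in json.loads(export_json).items():
        conn.execute(body["schema"])  # CREATE TABLE IF NOT EXISTS ...
        for row in body["data"]:
            cols = ", ".join(row)
            marks = ", ".join("?" for _ in row)
            try:
                conn.execute(
                    f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                    list(row.values()),
                )
            except sqlite3.IntegrityError as exc:
                # Duplicate primary key etc.: log and continue.
                logging.warning("Skipping row in %s: %s", table, exc)
```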
Alembic Migrations
Schema migrations are managed by Alembic. Configuration is at alembic.ini with migration scripts in db_migrations/. When adding or modifying columns, generate a new migration with:
alembic revision --autogenerate -m "description of change"
alembic upgrade head
The DAO’s create_all() call handles initial table creation, but column additions and type changes on existing databases require Alembic migrations.
Cross-References
- System Overview – initialization sequence and environment variables
- gRPC Service Design – how domain functions receive the DAO
- Configuration Specification – config type taxonomy and validation
Streamlit Presentation Layer
The UI is a multi-page Streamlit application defined in main.py. It renders resource management forms, job dispatch controls, and evaluation dashboards. All data operations go through the gRPC client – the Streamlit layer has no direct database access.
Entry Point
main.py sets the page configuration and selects a navigation mode based on the IS_COMPOSABLE environment variable:
st.set_page_config(
page_title="Fine Tuning Studio",
page_icon=IconPaths.FineTuningStudio.FINE_TUNING_STUDIO,
layout="wide"
)
The layout is always "wide". The page icon is loaded from the resources/images/ directory via ft.consts.IconPaths.
Navigation Modes
Composable Mode
Activated when IS_COMPOSABLE is set to any non-empty value. Uses streamlit_navigation_bar (st_navbar) combined with custom HTML/CSS for dropdown menus. Navigation groups:
| Group | Pages |
|---|---|
| Home | Home |
| Database Import Export | Database Import and Export |
| Resources | Import Datasets, View Datasets, Import Base Models, View Base Models, Create Prompts, View Prompts |
| Experiment | Train a New Adapter, Monitor Training Jobs, Local Adapter Comparison, Run MLFlow Evaluation, View MLflow Runs |
| AI Workbench | Export And Deploy Model |
| Examples | Ticketing Agent App |
| Feedback | Provide Feedback |
The navbar is rendered as a fixed-position HTML <nav> element with CSS dropdown menus. Links use target="_self" to navigate within the Streamlit app. All pages are registered with st.navigation(position="hidden") so that Streamlit handles routing internally while the custom navbar provides the visible UI.
Standard Mode (Default)
When IS_COMPOSABLE is not set, the sidebar renders section headers and page links with Material Design icons:
with st.sidebar:
st.image("./resources/images/ft-logo.png")
st.markdown("Navigation")
st.page_link("pgs/home.py", label="Home", icon=":material/home:")
st.page_link("pgs/database.py", label="Database Import and Export", icon=":material/database:")
st.markdown("Resources")
st.page_link("pgs/datasets.py", label="Import Datasets", icon=":material/publish:")
st.page_link("pgs/view_datasets.py", label="View Datasets", icon=":material/data_object:")
# ... remaining pages
Sidebar sections: Navigation, Resources, Experiments, AI Workbench, Examples, Feedback. The sidebar footer displays the current project owner and a link to the CML domain.
Page Inventory
All page modules live in the pgs/ directory:
| File | Title | Section |
|---|---|---|
| pgs/home.py | Home | Navigation |
| pgs/database.py | Database Import and Export | Database |
| pgs/datasets.py | Import Datasets | Resources |
| pgs/view_datasets.py | View Datasets | Resources |
| pgs/models.py | Import Base Models | Resources |
| pgs/view_models.py | View Base Models | Resources |
| pgs/prompts.py | Create Prompts | Resources |
| pgs/view_prompts.py | View Prompts | Resources |
| pgs/train_adapter.py | Train a New Adapter | Experiments |
| pgs/jobs.py | Training Job Tracking | Experiments |
| pgs/evaluate.py | Local Adapter Comparison | Experiments |
| pgs/mlflow.py | Run MLFlow Evaluation | Experiments |
| pgs/mlflow_jobs.py | View MLflow Runs | Experiments |
| pgs/export.py | Export And Deploy Model | AI Workbench |
| pgs/sample_ticketing_agent_app_embed.py | Sample Ticketing Agent App | Examples |
| pgs/feedback.py | Feedback | Feedback |
Client Caching
Shared client instances are cached at the Streamlit server level using @st.cache_resource. This avoids creating a new gRPC channel or CML API client on every page render. Both helpers are defined in pgs/streamlit_utils.py:
@st.cache_resource
def get_fine_tuning_studio_client() -> FineTuningStudioClient:
client = FineTuningStudioClient()
return client
@st.cache_resource
def get_cml_client() -> CMLServiceApi:
client = default_client()
return client
@st.cache_resource ensures a single instance per Streamlit server process. The gRPC client connects to the address specified by FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT environment variables. The CML client uses cmlapi.default_client(), which reads CML connection parameters from the pod environment.
Data Flow
Every user interaction follows this path: Streamlit widget event triggers a page callback, the page calls the cached client, the client sends a gRPC request, the server delegates to domain logic, and the domain function uses the DAO to read or write SQLite.
How to Add a New Page
- Create the page module at pgs/my_page.py:
import streamlit as st
from pgs.streamlit_utils import get_fine_tuning_studio_client
st.header("My New Page")
client = get_fine_tuning_studio_client()
# Use the client to interact with the gRPC service
models = client.get_models()
for model in models:
st.write(model.name)
- Register the page in both navigation modes in main.py:
In the composable mode setup_navigation() function, add:
st.Page("pgs/my_page.py", title="My New Page"),
In the composable mode HTML navbar, add a link in the appropriate dropdown:
<a href="/my_page" target="_self"><span class="material-icons">icon_name</span> My New Page</a>
In the standard mode setup_navigation_sidebar() function, add:
st.Page("pgs/my_page.py", title="My New Page"),
In the standard mode setup_sidebar() function, add under the appropriate section:
st.page_link("pgs/my_page.py", label="My New Page", icon=":material/icon_name:")
- If the page requires a new RPC, add it to the protobuf definition, regenerate, implement the servicer method, and add the domain function. See gRPC Service Design.
Custom CSS
Both navigation modes inject custom CSS to control typography and layout:
- Heading sizes (h3 reduced to 1.1rem)
- Tab label font sizes (0.9rem)
- Sidebar theming (dark background #16262c, white text) in standard mode
- Navbar positioning and dropdown behavior in composable mode
CSS is injected via st.markdown(css, unsafe_allow_html=True).
Cross-References
- System Overview – startup sequence and environment variables
- gRPC Service Design – client wrapper and API surface
- Data Tier – database schema backing the resources displayed in the UI
Resource Concepts
Fine Tuning Studio manages seven resource types. All use UUID string primary keys generated via uuid4(). Resources are metadata entries stored in SQLite – the actual artifacts (model weights, dataset files, adapter checkpoints) live on the filesystem, HuggingFace Hub, or the CML Model Registry.
Resource Types
| Resource | Table | Purpose |
|---|---|---|
| Dataset | datasets | Reference to a HuggingFace Hub dataset or local file (CSV, JSON, JSONL) |
| Model | models | Base foundation model from HuggingFace Hub or CML Model Registry |
| Adapter | adapters | PEFT LoRA adapter – produced by training, imported from disk, or fetched from Hub |
| Prompt | prompts | Format-string template mapping dataset features into training input |
| Config | configs | Named configuration blob (training args, BnB, LoRA, generation, Axolotl YAML) |
| FineTuningJob | fine_tuning_jobs | CML Job that trains a PEFT adapter |
| EvaluationJob | evaluation_jobs | CML Job that runs MLflow evaluation against model+adapter combinations |
Entity Relationships
Type Enums
All type enums are defined in ft/api/types.py as str, Enum subclasses.
| Enum | Values |
|---|---|
| DatasetType | huggingface, project, project_csv, project_json, project_jsonl |
| ModelType | huggingface, project, model_registry |
| AdapterType | project, huggingface, model_registry |
| PromptType | in_place |
| ConfigType | training_arguments, bitsandbytes_config, generation_config, lora_config, custom, axolotl, axolotl_dataset_formats |
| FineTuningFrameworkType | legacy, axolotl |
| ModelExportType | model_registry, cml_model |
| EvaluationJobType | mlflow |
| ModelFrameworkType | pytorch, tensorflow, onnx |
ORM Layer
All ORM models inherit from the shared declarative base (created via sqlalchemy.orm.declarative_base()) plus two mixins defined in ft/db/model.py:
MappedProtobuf – bidirectional protobuf conversion:
- from_message(message) – class method. Extracts set fields from a protobuf message via ListFields() and passes them as kwargs to the ORM constructor.
- to_protobuf(protobuf_cls) – instance method. Converts non-null ORM columns into a protobuf message by matching field names.
MappedDict – bidirectional dict conversion:
- from_dict(d) – class method. Constructs an ORM instance from a plain dictionary.
- to_dict() – instance method. Returns a dictionary of all non-null column values via SQLAlchemy inspect().
The serialization chain for any resource:
Protobuf message <--> ORM model <--> Python dict
from_message() / to_protobuf() from_dict() / to_dict()
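The dict side of this chain can be sketched without SQLAlchemy. The class below mirrors the non-null filtering contract of MappedDict but walks the instance __dict__ instead of mapped columns; the names are illustrative, not the real mixin:

```python
class MappedDictSketch:
    """Illustrative stand-in for the MappedDict mixin.

    The real mixin enumerates SQLAlchemy columns via inspect(); this sketch
    walks the instance __dict__ but keeps the same non-null filtering rule.
    """

    @classmethod
    def from_dict(cls, d):
        obj = cls()
        for key, value in d.items():
            setattr(obj, key, value)
        return obj

    def to_dict(self):
        # Only non-null values survive the round trip, mirroring the ORM mixin.
        return {k: v for k, v in vars(self).items() if v is not None}


class Dataset(MappedDictSketch):
    pass
```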
Table Registry
ft/db/model.py maintains two registries used by the database import/export subsystem:
TABLE_TO_MODEL_REGISTRY = {
'datasets': Dataset,
'models': Model,
'prompts': Prompt,
'adapters': Adapter,
'fine_tuning_jobs': FineTuningJob,
'evaluation_jobs': EvaluationJob,
'configs': Config,
}
MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}
Any new resource type must be added to TABLE_TO_MODEL_REGISTRY for database import/export to function correctly.
Dataset Specification
A Dataset resource is a metadata reference to a data source. The actual data lives on HuggingFace Hub or the local filesystem. On import, Studio extracts feature column names and stores them as a JSON string, enabling downstream prompt template construction without reloading the data.
Source: ft/datasets.py, ft/db/model.py
Supported Types
| Type | Source | Identifier Field | Feature Extraction Method |
|---|---|---|---|
| huggingface | HuggingFace Hub | huggingface_name | load_dataset_builder() -> info.features.keys() |
| project | Local HF-compatible directory | location | Not extracted |
| project_csv | Local CSV file | location | Read header row via csv.reader |
| project_json | Local JSON file | location | Read first object keys via json.load |
| project_jsonl | Local JSONL file | location | Read first line keys via json.loads |
ORM Schema
class Dataset(Base, MappedProtobuf, MappedDict):
__tablename__ = "datasets"
id = Column(String, primary_key=True) # UUID
type = Column(String) # DatasetType enum value
name = Column(String) # Display name
description = Column(Text) # Auto-populated for HF datasets
huggingface_name = Column(String) # HF Hub identifier (HF type only)
location = Column(Text) # Filesystem path (project types only)
features = Column(Text) # JSON-serialized list of column names
Import Validation
add_dataset() dispatches to type-specific validators before creating a record:
All types:
- type field is required.
- Duplicate detection by name (local types) or huggingface_name (HF type).
HuggingFace (_validate_huggingface_dataset_request):
- huggingface_name field required and non-blank.
- Validates dataset exists on Hub via load_dataset_builder().
- Extracts dataset_info.features.keys() for the feature list.
- Stores dataset_info.description as the description.
CSV (_validate_local_csv_dataset_request):
- location field required, must end with .csv.
- name field required and non-blank.
- Reads header row with csv.reader(file) / next(reader) for features.
JSON (_validate_local_json_dataset_request):
- location field required, must end with .json.
- Reads first object in the JSON array for feature keys.
JSONL (_validate_local_jsonl_dataset_request):
- location field required, must end with .jsonl.
- Reads first line, parses as JSON, extracts keys for features.
Feature Extraction Functions
extract_features_from_csv(location) # csv.reader -> next(reader)
extract_features_from_json(location) # json.load -> next(iter(data)).keys()
extract_features_from_jsonl(location) # json.loads(first_line).keys()
Features are stored as json.dumps(features) in the features column. Downstream consumers (prompt templates, training scripts) parse this back with json.loads().
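The three extractors reduce to a few lines of stdlib code. A sketch approximating the functions above (the real implementations in ft/datasets.py may differ in error handling):

```python
import csv
import json

def extract_features_from_csv(location):
    # The CSV header row becomes the feature list.
    with open(location, newline="") as f:
        return next(csv.reader(f))

def extract_features_from_json(location):
    # Keys of the first object in the top-level JSON array.
    with open(location) as f:
        data = json.load(f)
    return list(next(iter(data)).keys())

def extract_features_from_jsonl(location):
    # Keys of the first JSON object, i.e. the first line of the file.
    with open(location) as f:
        return list(json.loads(f.readline()).keys())
```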
Loading into Memory
load_dataset_into_memory(dataset: DatasetMetadata) normalizes all dataset types into a HuggingFace DatasetDict with at minimum a train key:
| Type | Load Method | Wrapping |
|---|---|---|
| huggingface | datasets.load_dataset(huggingface_name) | Already a DatasetDict |
| project_csv | datasets.load_dataset('csv', data_files=location) | Already a DatasetDict |
| project_json | datasets.Dataset.from_json(location) | Wrapped in DatasetDict({'train': ds}) |
| project_jsonl | datasets.Dataset.from_json(location) | Wrapped in DatasetDict({'train': ds}) |
If the loaded object is a Dataset (not DatasetDict), it is wrapped: DatasetDict({'train': ds}).
Removal
remove_dataset() deletes the Dataset record. If request.remove_prompts is set, also deletes all Prompt records with matching dataset_id via cascading delete.
Protobuf Message
DatasetMetadata fields: id, type, name, description, huggingface_name, location, features.
Model Specification
A Model resource represents a base foundation model registered in the Studio’s metadata store. Models serve as the starting point for fine-tuning and evaluation. The actual model weights are never stored by Studio – they are downloaded at training time from HuggingFace Hub or resolved from the CML Model Registry.
Source: ft/models.py, ft/db/model.py, ft/config/model_configs/config_loader.py
Supported Types
| Type | Source | Required Fields | Validation |
|---|---|---|---|
| huggingface | HuggingFace Hub | huggingface_model_name | HfApi().model_info() must succeed |
| model_registry | CML Model Registry | model_registry_id (request) | Fetches RegisteredModel via cmlapi |
| project | Local directory | location | Not yet fully implemented |
ORM Schema
class Model(Base, MappedProtobuf, MappedDict):
__tablename__ = "models"
id = Column(String, primary_key=True) # UUID
type = Column(String) # ModelType enum value
framework = Column(String) # ModelFrameworkType (pytorch, tensorflow, onnx)
name = Column(String) # Display name
description = Column(String)
huggingface_model_name = Column(String) # HF Hub model identifier
location = Column(String) # Local path (project type)
cml_registered_model_id = Column(String) # CML Registry model ID
mlflow_experiment_id = Column(String) # MLflow experiment (registry type)
mlflow_run_id = Column(String) # MLflow run (registry type)
Import Flow
add_model() validates and creates a Model record based on type:
HuggingFace:
- Validate huggingface_name is non-empty and not already registered (duplicate check by huggingface_model_name).
- Call HfApi().model_info(name) to confirm the model exists on Hub.
- Create the Model with type=HUGGINGFACE, name and huggingface_model_name set to the stripped input.
Model Registry:
- model_registry_id must be provided on the request.
- Fetch the RegisteredModel via cml.get_registered_model(id).
- Extract the first version’s metadata: registered_model.model_versions[0].model_version_metadata.mlflow_metadata.
- Create the Model with type=MODEL_REGISTRY, name from registered_model.name, and populate cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.
Model Family Detection
ft/config/model_configs/config_loader.py provides ModelMetadataFinder:
from transformers import AutoConfig

class ModelMetadataFinder:
def __init__(self, model_name_or_path):
self.model_name_or_path = model_name_or_path
def fetch_model_family_from_config(self):
config = AutoConfig.from_pretrained(self.model_name_or_path)
return config.architectures[0] # e.g., "LlamaForCausalLM"
This is used in two places:
- Config filtering: list_configs() filters default configs to those matching the model’s architecture family.
- Config creation: add_config() with a description field uses transform_name_to_family() to resolve the model family for deduplication scoping.
Additional static methods:
- fetch_bos_token_id_from_config(model_name_or_path) – returns config.bos_token_id (default: 1).
- fetch_eos_token_id_from_config(model_name_or_path) – returns config.eos_token_id (default: 2).
Export Routes
export_model() dispatches based on ModelExportType:
| Export Type | Handler | Target |
|---|---|---|
| model_registry | export_model_registry_model() | MLflow model registry |
| cml_model | deploy_cml_model() | CML Model endpoint |
Both handlers are defined in ft/export.py.
Protobuf Message
ModelMetadata fields: id, type, framework, name, huggingface_model_name, location, cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.
Adapter Specification
An Adapter resource represents a PEFT LoRA adapter. Adapters are produced by fine-tuning jobs, imported from a local directory, or fetched from HuggingFace Hub. Each adapter is linked to a base model and optionally to the fine-tuning job and prompt template that produced it.
Source: ft/adapters.py, ft/db/model.py
Supported Types
| Type | Source | Required Fields |
|---|---|---|
| project | Local directory | location (must exist as a directory) |
| huggingface | HuggingFace Hub | huggingface_name |
| model_registry | CML Model Registry | cml_registered_model_id |
ORM Schema
class Adapter(Base, MappedProtobuf, MappedDict):
__tablename__ = "adapters"
id = Column(String, primary_key=True) # UUID
type = Column(String) # AdapterType enum value
name = Column(String) # Display name (unique)
description = Column(String)
huggingface_name = Column(String) # HF Hub adapter identifier
model_id = Column(String, ForeignKey('models.id')) # Base model FK
location = Column(Text) # Local path to adapter dir
fine_tuning_job_id = Column(String, ForeignKey('fine_tuning_jobs.id')) # Producing job FK
prompt_id = Column(String, ForeignKey('prompts.id')) # Training prompt FK
cml_registered_model_id = Column(String) # CML Registry model ID
mlflow_experiment_id = Column(String) # MLflow experiment
mlflow_run_id = Column(String) # MLflow run
Key Relationships
| FK Column | Target | Required |
|---|---|---|
| model_id | models.id | Yes – the base model this adapter applies to |
| fine_tuning_job_id | fine_tuning_jobs.id | No – only set for Studio-trained adapters |
| prompt_id | prompts.id | No – only set for Studio-trained adapters |
Import Validation
_validate_add_adapter_request() enforces:
- Required fields: name, model_id, and location must all be present and non-blank.
- Directory existence: os.path.isdir(request.location) must return True.
- Model FK: model_id must reference an existing Model record.
- Unique name: No existing adapter may share the same name.
- Optional FK checks: If fine_tuning_job_id is provided, it must exist in fine_tuning_jobs. If prompt_id is provided, it must exist in prompts.
Adapter Creation
add_adapter() validates the request, then creates an Adapter record with all provided fields mapped directly from the request.
Dataset Split Tracking
get_dataset_split_by_adapter() retrieves the dataset fraction and train/test split used during training for a given adapter:
- Joins FineTuningJob to Adapter on adapter_name.
- If a matching job is found, returns its dataset_fraction and train_test_split.
- If no matching job exists (imported adapter), returns defaults:
| Parameter | Default | Source |
|---|---|---|
| dataset_fraction | 1.0 | TRAINING_DEFAULT_DATASET_FRACTION |
| train_test_split | 0.9 | TRAINING_DEFAULT_TRAIN_TEST_SPLIT |
These defaults are defined in ft/consts.py.
Protobuf Message
AdapterMetadata fields: id, type, name, description, huggingface_name, model_id, location, fine_tuning_job_id, prompt_id, cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.
Prompt Template Specification
A Prompt resource defines a format-string template that maps dataset feature columns into structured training input. Prompts bind a dataset’s column names to positional slots in the training text, controlling how raw data is presented to the model during fine-tuning and evaluation.
Source: ft/prompts.py, ft/utils.py, ft/jobs.py, ft/db/model.py
Template Fields
| Field | Purpose | Example |
|---|---|---|
| prompt_template | Full prompt format string used during training | "Instruction: {instruction}\nInput: {input}\nOutput: {output}" |
| input_template | Input portion (informational, used in evaluation) | "Instruction: {instruction}\nInput: {input}" |
| completion_template | Expected output portion (informational, used in evaluation) | "Output: {output}" |
Placeholders use Python format-string syntax: {feature_name}. Each placeholder must correspond to a column name in the linked dataset’s features JSON array.
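A sketch of how such a template can be checked against the stored features and applied to a row. render_prompt is a hypothetical helper for illustration, not a Studio function:

```python
import json
import string

def render_prompt(prompt_template: str, features_json: str, row: dict) -> str:
    """Format one dataset row with a prompt template.

    Placeholders are validated against the dataset's stored features column
    (a JSON-serialized list) before formatting, so a template referencing a
    missing column fails fast.
    """
    features = set(json.loads(features_json))
    placeholders = {
        field for _, field, _, _ in string.Formatter().parse(prompt_template)
        if field
    }
    unknown = placeholders - features
    if unknown:
        raise ValueError(f"template references unknown columns: {sorted(unknown)}")
    return prompt_template.format(**row)
```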
ORM Schema
class Prompt(Base, MappedProtobuf, MappedDict):
__tablename__ = "prompts"
id = Column(String, primary_key=True) # UUID
type = Column(String) # PromptType enum value
name = Column(String) # Display name (unique)
description = Column(String)
dataset_id = Column(String, ForeignKey('datasets.id')) # Linked dataset FK
prompt_template = Column(String) # Full template
input_template = Column(String) # Input portion
completion_template = Column(String) # Output portion
Import Validation
_validate_add_prompt_request() enforces:
- Required fields: id, name, dataset_id, prompt_template, input_template, completion_template must all be present on the PromptMetadata message.
- Non-blank name: name.strip() must be non-empty.
- Unique name: No existing prompt may share the same name.
The prompt is created via Prompt.from_message(request.prompt), which uses the MappedProtobuf.from_message() method to map protobuf fields directly to ORM columns.
Auto-Generation from Dataset Columns
ft/utils.py::generate_templates(columns) produces default templates from a list of dataset column names:
- Output column detection: Compares column names against a ranked list of 500 common output column names (e.g., answer, response, output, label, target). The column matching the highest-ranked name becomes the output column. If no match, the last column is used.
- Input columns: All columns except the identified output column.
- Prompt template: Generated as:

  You are an LLM responsible for generating a response. Please provide a response given the user input below. <Column1>: {column1} <Column2>: {column2} <Output>:

- Completion template: {output_column}\n

Returns (prompt_template, completion_template).
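The detection logic can be sketched as follows. The ranked list here is a short illustrative subset of the real one, and the exact template wording may differ from what ft/utils.py emits:

```python
def generate_templates(columns):
    """Sketch of default-template generation from dataset column names.

    The real implementation ranks candidates against a much longer list of
    common output column names; this uses a short illustrative subset.
    """
    ranked_output_names = ["answer", "response", "output", "label", "target"]
    output_column = next(
        (name for name in ranked_output_names if name in columns),
        columns[-1],  # fall back to the last column when nothing matches
    )
    input_columns = [c for c in columns if c != output_column]
    prompt_template = (
        "You are an LLM responsible for generating a response. "
        "Please provide a response given the user input below.\n"
        + "".join(f"<{c}>: {{{c}}}\n" for c in input_columns)
        + f"<{output_column}>:"
    )
    completion_template = f"{{{output_column}}}\n"
    return prompt_template, completion_template
```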
Axolotl Auto-Prompt
ft/jobs.py::_add_prompt_for_dataset() generates a prompt automatically when using the Axolotl framework and no prompt is provided:
- Load the Axolotl config from the database by axolotl_config_id.
- Parse the YAML config and extract the dataset type from config['datasets'][0]['type'].
- Query the Config table for a matching axolotl_dataset_formats config by description == dataset_type.
- Parse the format config JSON to extract feature column names.
- Build a template: <Feature>: {feature}\n for each feature.
- Check for an existing prompt with the same dataset_id and prompt_template. If found, return its ID.
- Otherwise, create a new Prompt named "AXOLOTL_AUTOGENERATED : {dataset_type}_{dataset_name}".
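The steps above can be condensed into a sketch. Here existing_prompts stands in for a SQLite query already filtered to the same dataset_id, and the return shape is illustrative rather than the real function signature:

```python
import json

def auto_prompt_for_dataset(format_config_json, dataset_type, dataset_name, existing_prompts):
    """Sketch of the Axolotl auto-prompt flow.

    existing_prompts: list of dicts standing in for Prompt rows, assumed
    already filtered to the target dataset_id. Returns
    (prompt_name, template, reused_id_or_None).
    """
    # Feature names come from the keys of the dataset format config JSON.
    features = list(json.loads(format_config_json).keys())
    template = "".join(f"<{f}>: {{{f}}}\n" for f in features)
    # Reuse an existing prompt with an identical template.
    for p in existing_prompts:
        if p["prompt_template"] == template:
            return p["name"], template, p["id"]
    name = f"AXOLOTL_AUTOGENERATED : {dataset_type}_{dataset_name}"
    return name, template, None
```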
Removal
remove_prompt() deletes the Prompt record by ID. Note that prompts are also cascade-deleted when their parent dataset is removed with remove_prompts=True.
Protobuf Message
PromptMetadata fields: id, type, name, description, dataset_id, prompt_template, input_template, completion_template.
Configuration Specification
A Config resource stores a named configuration blob – JSON or YAML – that parameterizes training, quantization, inference, or the Axolotl framework. Configs are content-deduplicated: adding a config with identical content and type to an existing one returns the existing config’s ID rather than creating a duplicate.
Source: ft/configs.py, ft/consts.py, ft/db/model.py
Config Types
| Type | Format | Purpose | Default Provided |
|---|---|---|---|
| training_arguments | JSON | Training hyperparameters (epochs, optimizer, batch size, learning rate) | Yes |
| bitsandbytes_config | JSON | 4-bit quantization settings | Yes |
| lora_config | JSON | LoRA hyperparameters | Yes |
| generation_config | JSON | Inference generation settings | Yes |
| custom | JSON | User-defined configuration blob | No |
| axolotl | YAML | Axolotl training configuration file | Template provided |
| axolotl_dataset_formats | JSON | Axolotl dataset format schemas | Yes (multiple) |
ORM Schema
class Config(Base, MappedProtobuf, MappedDict):
__tablename__ = "configs"
id = Column(String, primary_key=True) # UUID
type = Column(String) # ConfigType enum value
description = Column(String) # Model name (for family resolution) or format name
config = Column(Text) # Serialized JSON or YAML string
model_family = Column(String) # Architecture family (e.g., "LlamaForCausalLM")
is_default = Column(Integer, default=1) # 1 = system/default, 0 = user-created
is_default Semantics
| Value | Constant | Meaning |
|---|---|---|
| 1 | DEFAULT_CONFIGS | System-provided default configuration |
| 0 | USER_CONFIGS | User-created configuration |
User-created configs always have is_default=0. The add_config() function sets this automatically.
Default Config Values
Defined in ft/consts.py:
DEFAULT_TRAINING_ARGUMENTS
{
"num_train_epochs": 1,
"optim": "paged_adamw_32bit",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 4,
"warmup_ratio": 0.03,
"max_grad_norm": 0.3,
"learning_rate": 0.0002,
"fp16": true,
"logging_steps": 1,
"lr_scheduler_type": "constant",
"disable_tqdm": true,
"report_to": "mlflow",
"ddp_find_unused_parameters": false
}
DEFAULT_BNB_CONFIG
{
"load_in_4bit": true,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_compute_dtype": "float16",
"bnb_4bit_use_double_quant": true,
"quant_method": "bitsandbytes"
}
DEFAULT_LORA_CONFIG
{
"r": 16,
"lora_alpha": 32,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM"
}
DEFAULT_GENERATIONAL_CONFIG
{
"do_sample": true,
"temperature": 0.8,
"max_new_tokens": 60,
"top_p": 1,
"top_k": 50,
"num_beams": 1,
"repetition_penalty": 1.1,
"max_length": null
}
Config Deduplication
add_config() implements content-addressed caching:
- Parse the incoming config string: yaml.safe_load() for the axolotl type, json.loads() for all others.
- Re-serialize to a canonical form (yaml.dump() or json.dumps()).
- Query existing configs of the same type (and same model_family if description is provided).
- Compare parsed content of each existing config against the parsed request content.
- If an identical config exists, return it. At most one duplicate is expected (asserted).
- If no match, create a new Config with is_default=USER_CONFIGS (0).
When description is provided, it is interpreted as a model name: transform_name_to_family(description) resolves the HuggingFace architecture (e.g., "LlamaForCausalLM") and scopes the deduplication query to that family.
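A minimal sketch of the deduplication contract, using an in-memory dict in place of the configs table. add_config here is a stand-in; the real code generates IDs with uuid4() and swaps json.loads for yaml.safe_load on Axolotl configs:

```python
import json

def add_config(config_str, config_type, existing, parse=json.loads):
    """Content-addressed config caching sketch.

    existing maps config id -> (type, raw string). Returns (id, created),
    where created is False when an identical config already exists.
    """
    parsed = parse(config_str)
    matches = [
        cid for cid, (ctype, raw) in existing.items()
        if ctype == config_type and parse(raw) == parsed
    ]
    assert len(matches) <= 1, "at most one duplicate expected"
    if matches:
        return matches[0], False
    new_id = f"cfg-{len(existing) + 1}"  # real code uses uuid4()
    existing[new_id] = (config_type, json.dumps(parsed))
    return new_id, True
```

Because comparison happens on parsed content, formatting differences such as key order or whitespace do not defeat deduplication.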
Model-Family-Specific Filtering
list_configs() applies model-aware filtering when model_id is present in the request:
- Optionally filter by type if specified.
- If model_id is provided, call get_configs_for_model_id():
  - Fetch the Model record and resolve huggingface_model_name.
  - Instantiate ModelMetadataFinder(model_hf_name) and call fetch_model_family_from_config().
  - Filter configs where model_family matches and is_default == 1.
  - If no model-specific defaults exist, fall back to returning all configs.
- User configs (is_default=0) are not filtered by model family in get_configs_for_model_id() – they are returned when no model-specific defaults are found (fallback behavior).
Axolotl Config Template
The Axolotl config template is loaded from ft/config/axolotl/training_config/template.yaml via get_axolotl_training_config_template_yaml_str(). Axolotl dataset format configs are stored in ft/config/axolotl/dataset_formats/.
Protobuf Message
ConfigMetadata fields: id, type, description, config (serialized JSON/YAML string), model_family, is_default.
Fine-Tuning Job Lifecycle
A fine-tuning job trains a PEFT LoRA adapter on a base model using a configured dataset and prompt template. Jobs are dispatched as CML Jobs via the cmlapi SDK. The entry point is ft/jobs.py::start_fine_tuning_job(), which validates the request, prepares the execution environment, and creates the CML workload.
Job Dispatch Flow
- Validate Request – _validate_fine_tuning_request() checks all fields against the rules below. Any violation raises a ValueError that propagates as a gRPC error.
- Create Job Directory – A UUID job_id is generated. The directory .app/job_runs/{job_id} is created to hold training artifacts.
- Find Template CML Job – The dispatcher locates the Accel_Finetuning_Base_Job template in the CML project. This template defines the runtime environment and script path.
- Build Argument List – All training parameters are serialized into a --key value string passed as JOB_ARGUMENTS.
- Create CML Job + JobRun – A CML Job and its first JobRun are created via cmlapi, with the specified CPU, GPU, and memory resources.
- Store Job Record – A FineTuningJob record is inserted into the fine_tuning_jobs table with all metadata for tracking.
Validation Rules
Validation is performed by ft/jobs.py::_validate_fine_tuning_request() before any side effects occur.
| Field | Rule | Error |
|---|---|---|
| framework_type | Must be legacy or axolotl | “framework_type must be either legacy or axolotl” |
| adapter_name | Alphanumeric + hyphens only (^[a-zA-Z0-9-]+$) | “adapter_name must be alphanumeric” |
| out_dir | Must exist as directory | “output_dir does not exist” |
| num_cpu | > 0 | “cpu must be greater than 0” |
| num_gpu | >= 0 | “gpu must be at least 0” |
| num_memory | > 0 | “memory must be greater than 0” |
| num_workers | > 0 | “num_workers must be greater than 0” |
| num_epochs | > 0 | “Number of epochs must be greater than 0” |
| learning_rate | > 0 | “Learning rate must be greater than 0” |
| dataset_fraction | (0, 1] | “dataset_fraction must be between 0 and 1” |
| train_test_split | (0, 1] | “train_test_split must be between 0 and 1” |
| axolotl_config_id | Required when framework=axolotl | “axolotl framework requires axolotl_config_id” |
| base_model_id | Must exist in DB | “Model not found” |
| dataset_id | Must exist in DB | “Dataset not found” |
| prompt_id | Must exist in DB (legacy only) | “Prompt not found” |
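A client-side pre-check can mirror a few of these rules before sending the request. This sketch covers a representative subset and uses a plain dict in place of the protobuf message; the real validator in ft/jobs.py checks every field:

```python
import re

def validate_fine_tuning_request(req: dict) -> None:
    """Raise ValueError on the first violated rule, like the server-side check."""
    if req.get("framework_type") not in ("legacy", "axolotl"):
        raise ValueError("framework_type must be either legacy or axolotl")
    # Alphanumeric + hyphens only, per the ^[a-zA-Z0-9-]+$ rule.
    if not re.fullmatch(r"[a-zA-Z0-9-]+", req.get("adapter_name", "")):
        raise ValueError("adapter_name must be alphanumeric")
    if not 0 < req.get("dataset_fraction", 1.0) <= 1:
        raise ValueError("dataset_fraction must be between 0 and 1")
    if req.get("framework_type") == "axolotl" and not req.get("axolotl_config_id"):
        raise ValueError("axolotl framework requires axolotl_config_id")
```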
Framework Types
Legacy
Uses HuggingFace Accelerate with the TRL SFTTrainer. The user provides each configuration component separately:
- prompt_id – A prompt template that maps dataset features to the training text format.
- LoRA config – PEFT LoRA hyperparameters (rank, alpha, dropout, target modules).
- BnB config – BitsAndBytes quantization settings (4-bit NF4 quantization).
- Training arguments – Standard HuggingFace TrainingArguments fields (epochs, learning rate, batch size, etc.).
For distributed training, worker resources are specified independently via dist_cpu, dist_gpu, and dist_mem fields.
Axolotl
Uses the Axolotl training framework. The user provides a single YAML configuration file (referenced by axolotl_config_id) that bundles all training parameters, LoRA settings, and dataset handling into one document. If no prompt_id is provided, the system auto-generates a prompt from the dataset format definition. See Axolotl Integration for details.
Resource Specification
Each job requires explicit compute resource allocation:
| Field | Description |
|---|---|
| num_cpu | CPU cores for the primary training worker |
| num_gpu | GPU count for the primary training worker |
| num_memory | Memory in GB for the primary training worker |
| num_workers | Number of training workers (Accelerate distributed training) |
For legacy distributed training, additional fields specify per-worker resources:
| Field | Description |
|---|---|
| dist_cpu | CPU cores per distributed worker |
| dist_gpu | GPU count per distributed worker |
| dist_mem | Memory in GB per distributed worker |
Argument List Schema
Arguments are passed as the JOB_ARGUMENTS environment variable to the CML Job. The value is a space-delimited string of --key value pairs.
Core arguments (always present):
| Key | Source |
|---|---|
| base_model_id | Request field |
| dataset_id | Request field |
| experimentid | Generated UUID (same as job_id) |
| out_dir | Request field |
| train_out_dir | Constructed path for training output |
| adapter_name | Request field |
| framework_type | Request field (legacy or axolotl) |
Optional arguments (included when non-empty):
| Key | Description |
|---|---|
| prompt_id | Prompt template ID (required for legacy, optional for axolotl) |
| bnb_config | BitsAndBytes config ID |
| lora_config | LoRA config ID |
| training_arguments_config | Training arguments config ID |
| hf_token | HuggingFace access token |
| axolotl_config_id | Axolotl YAML config ID |
| gpu_label_id | GPU label config ID |
Legacy distributed training arguments:
| Key | Description |
|---|---|
| dist_num | Number of distributed workers |
| dist_cpu | CPU per worker |
| dist_mem | Memory per worker |
| dist_gpu | GPU per worker |
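Assembly of the JOB_ARGUMENTS string can be sketched as follows. build_job_arguments is a hypothetical helper mirroring the include-when-non-empty rule for optional keys:

```python
def build_job_arguments(core: dict, optional: dict) -> str:
    """Serialize training parameters into a space-delimited --key value string.

    Optional arguments are included only when non-empty, matching the
    dispatcher's behavior. Keys follow the argument tables above.
    """
    args = dict(core)
    args.update({k: v for k, v in optional.items() if v})
    return " ".join(f"--{k} {v}" for k, v in args.items())
```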
Protobuf Messages
The job lifecycle uses three primary protobuf messages:
- StartFineTuningJobRequest – Contains all fields listed above. Sent by the client to initiate training.
- StartFineTuningJobResponse – Returns the created job metadata including the generated job_id and CML job identifiers.
- FineTuningJobMetadata – The full job record stored in the database and returned by GetFineTuningJob and ListFineTuningJobs RPCs.
See gRPC Service Design for the complete RPC catalog.
Training Script Architecture
The training script is the code that runs inside a CML Job after dispatch. It receives configuration via environment variables, loads and preprocesses data, trains a PEFT LoRA adapter, and saves the result.
Entry Point
ft/scripts/accel_fine_tune_base_script.py
The script is executed as a CML Job. Arguments are received via the JOB_ARGUMENTS environment variable as a space-delimited string with --key value pairs, parsed into an argparse namespace at startup.
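The startup parse can be sketched with argparse's parse_known_args, so flags the sketch does not declare are tolerated. Only a few representative flags appear here; the real script declares every key listed in the argument tables:

```python
import argparse
import os

def parse_job_arguments(env=os.environ):
    """Split JOB_ARGUMENTS and parse it into an argparse namespace."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model_id", required=True)
    parser.add_argument("--dataset_id", required=True)
    parser.add_argument("--adapter_name", required=True)
    parser.add_argument("--framework_type", default="legacy")
    # parse_known_args ignores flags this sketch does not declare.
    args, _unknown = parser.parse_known_args(env.get("JOB_ARGUMENTS", "").split())
    return args
```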
Execution Flow
- Parse JOB_ARGUMENTS – The JOB_ARGUMENTS environment variable is split and parsed via argparse into a namespace containing all training parameters.
- Load base model – The HuggingFace model is loaded with optional BitsAndBytesConfig for 4-bit NF4 quantization. The model ID is resolved from the Studio database using base_model_id.
- Configure tokenizer padding – The tokenizer is inspected for a suitable pad token. The function find_padding_token_candidate() searches the vocabulary for tokens containing “pad” or “reserved”.
- Apply PEFT LoRA adapter – A LoraConfig is constructed from the config blob stored in the database, and the model is wrapped with get_peft_model().
- Load and preprocess dataset:
  - load_dataset_into_memory() reads the dataset into a HuggingFace DatasetDict.
  - map_dataset_with_prompt_template() formats each row using the prompt template, appending the EOS token.
  - sample_and_split_dataset() downsamples by the configured fraction and splits into train/test sets (seed=42).
- Initialize SFTTrainer – A TRL SFTTrainer is created with the processed dataset, model, tokenizer, and training arguments.
- Train – trainer.train() executes the training loop.
- Save adapter weights – The trained LoRA adapter is saved to the output directory.
- Auto-register adapter – If auto_add_adapter=true, the adapter is registered in the Studio database automatically after training completes.
Dataset Preprocessing Chain
| Step | Function | Input | Output |
|---|---|---|---|
| Load | load_dataset_into_memory() | Dataset metadata (type, path, HF name) | HF DatasetDict |
| Format | map_dataset_with_prompt_template() | DatasetDict + prompt template | DatasetDict with prediction column |
| Sample/Split | sample_and_split_dataset() | DatasetDict + fraction + split ratio | Train/test DatasetDict |
The prediction column contains the fully formatted training text for each row – the prompt template applied to dataset features with the EOS token appended. This column name is defined by TRAINING_DATA_TEXT_FIELD.
Key Training Utilities
All utilities are defined in ft/training/utils.py.
get_model_parameters(model)
Returns a tuple of (total_params, trainable_params) for the model. Used for logging the parameter count before and after applying the LoRA adapter.
map_dataset_with_prompt_template(dataset, template)
Applies the prompt template to each row in the dataset. The template contains prompt_template, input_template, and completion_template fields that are formatted with the dataset’s feature columns. The EOS token is appended to the prediction field to signal sequence boundaries during training.
sample_and_split_dataset(ds, fraction, split)
Downsamples the dataset to the specified fraction (e.g., 0.5 = 50% of rows), then splits into train and test sets at the given ratio. Uses TRAINING_DATASET_SEED = 42 for reproducible splits across runs.
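A stdlib sketch of the downsample-then-split behavior with a fixed seed. The real function operates on a HuggingFace DatasetDict (via its train_test_split) rather than plain lists:

```python
import random

def sample_and_split(rows, fraction, split, seed=42):
    """Downsample to `fraction` of the rows, then split train/test at `split`.

    The fixed seed (TRAINING_DATASET_SEED in the real code) makes repeated
    runs produce identical splits.
    """
    rng = random.Random(seed)
    sampled = rng.sample(rows, int(len(rows) * fraction))
    cut = int(len(sampled) * split)
    return sampled[:cut], sampled[cut:]
```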
find_padding_token_candidate(tokenizer)
Searches the tokenizer vocabulary for tokens containing “pad” or “reserved” as substrings. Returns the first match found, or None if no candidate exists.
configure_tokenizer_padding(tokenizer, pad_token)
Sets the tokenizer’s padding token using a fallback chain:
- Use the tokenizer’s existing pad_token if already set.
- Use the provided pad_token argument if given.
- Use the tokenizer’s unk_token if available.
- Search for reserved token candidates via find_padding_token_candidate().
This ensures every tokenizer has a valid pad token regardless of the base model’s configuration.
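The chain can be sketched on plain values rather than a real tokenizer object. Function names mirror the utilities above, but the signatures are simplified for illustration:

```python
def find_padding_token_candidate(vocab):
    # First vocabulary token containing "pad" or "reserved" as a substring.
    for token in vocab:
        if "pad" in token or "reserved" in token:
            return token
    return None

def configure_padding(pad_token, unk_token, vocab, explicit=None):
    """Return the pad token the fallback chain would select."""
    if pad_token is not None:          # 1. tokenizer already has one
        return pad_token
    if explicit is not None:           # 2. caller-provided pad token
        return explicit
    if unk_token is not None:          # 3. fall back to unk_token
        return unk_token
    return find_padding_token_candidate(vocab)  # 4. search the vocabulary
```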
Training Constants
Defined in ft/consts.py:
| Constant | Value | Purpose |
|---|---|---|
| TRAINING_DATA_TEXT_FIELD | "prediction" | Column name for the formatted training text in the preprocessed dataset |
| TRAINING_DEFAULT_TRAIN_TEST_SPLIT | 0.9 | Default train/test split ratio (90% train, 10% test) |
| TRAINING_DEFAULT_DATASET_FRACTION | 1.0 | Default dataset fraction (use full dataset) |
| TRAINING_DATASET_SEED | 42 | Random seed for reproducible dataset splitting and sampling |
Relationship to Job Lifecycle
The training script is the execution payload created by the Fine-Tuning Job Lifecycle. The job dispatch process builds the JOB_ARGUMENTS string, creates the CML Job pointing to this script, and starts a JobRun. The script runs independently inside the CML workload – it reads its configuration from the environment, accesses the Studio database directly for resource metadata (model paths, dataset locations, config blobs), and writes adapter weights to the output directory.
Axolotl Integration
Axolotl is an alternative training framework supported as a first-class framework_type alongside the legacy HuggingFace Accelerate + TRL path. It replaces the separate LoRA, BitsAndBytes, and training argument configs with a single YAML configuration file that defines the entire training run.
Config Structure
Axolotl configurations are stored in the configs table with ConfigType.axolotl. A template YAML is provided at:
`ft/config/axolotl/training_config/template.yaml`
This template defines the baseline Axolotl training configuration. Users can create custom configs by modifying the template values. The YAML file specifies model loading, LoRA parameters, quantization, dataset handling, training hyperparameters, and output settings in a single document.
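For orientation, a trimmed, hypothetical fragment in the spirit of such a config (keys follow Axolotl's config schema; the actual baseline values live in the template file, not here):

```yaml
# Hypothetical excerpt of an Axolotl training config -- illustrative
# values only; see template.yaml for the Studio's actual baseline.
base_model: NousResearch/Llama-2-7b-hf   # model loading
load_in_4bit: true                       # quantization
adapter: lora                            # LoRA parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
datasets:                                # dataset handling
  - path: dataset.jsonl
    type: alpaca
num_epochs: 3                            # training hyperparameters
learning_rate: 0.0002
output_dir: ./outputs                    # output settings
```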
Dataset Format Configs
Dataset format definitions are stored as ConfigType.axolotl_dataset_formats in the configs table. The source files live in:
`ft/config/axolotl/dataset_formats/`
Each JSON file defines the expected column structure for a specific Axolotl dataset type (e.g., alpaca, completion, sharegpt). These files are loaded into the database during initialization by:
`ft/initialize_db.py::InitializeDB.initialize_axolotl_dataset_type_configs()`
Pydantic Models
The dataset format structure is defined by two Pydantic models in ft/api/types.py:
DatasetFormatInfo:
| Field | Type | Description |
|---|---|---|
| `name` | `str` | Human-readable name of the dataset format |
| `description` | `str` | The Axolotl dataset type identifier (e.g., `alpaca`, `completion`) |
| `format` | `Dict[str, Any]` | Map of feature column names to their expected types or descriptions |
DatasetFormatsCollection:
| Field | Type | Description |
|---|---|---|
| `dataset_formats` | `Dict[str, DatasetFormatInfo]` | Map of format names to their definitions |
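The two models might be sketched roughly as follows (field names taken from the tables above; any defaults or validators in `ft/api/types.py` may differ):

```python
# Sketch of the two Pydantic models from ft/api/types.py; the example
# "alpaca" entry below is illustrative, not copied from the source files.
from typing import Any, Dict
from pydantic import BaseModel

class DatasetFormatInfo(BaseModel):
    name: str               # human-readable format name
    description: str        # Axolotl dataset type identifier, e.g. "alpaca"
    format: Dict[str, Any]  # feature columns -> expected types/descriptions

class DatasetFormatsCollection(BaseModel):
    dataset_formats: Dict[str, DatasetFormatInfo]

collection = DatasetFormatsCollection(dataset_formats={
    "alpaca": DatasetFormatInfo(
        name="Alpaca",
        description="alpaca",
        format={"instruction": "str", "input": "str", "output": "str"},
    )
})
print(collection.dataset_formats["alpaca"].description)  # alpaca
```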
Auto-Prompt Generation
When a fine-tuning job uses the axolotl framework and no prompt_id is provided, the system automatically generates a prompt template from the dataset format definition. This is handled by ft/jobs.py::_add_prompt_for_dataset().
Generation steps:
- Load the Axolotl YAML config from the database using `axolotl_config_id`.
- Extract the `type` field from the dataset section of the YAML config. This identifies the expected dataset format (e.g., `alpaca`, `completion`).
- Query the database for a config of type `axolotl_dataset_formats` whose `description` field matches the extracted type.
- Parse the dataset format config to extract the feature column names from the `format` dictionary.
- Generate a default prompt template by concatenating `"Feature: {feature}\n"` for each feature column.
- Check whether an identical prompt already exists for this dataset to avoid duplicates.
- Create and return a new prompt record if no duplicate is found.
This mechanism ensures that Axolotl jobs always have a valid prompt template, even when the user does not explicitly create one.
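The template-building step can be sketched as follows, under one plausible reading of the quoted pattern where each feature name becomes a format-string placeholder (the real logic lives in `ft/jobs.py::_add_prompt_for_dataset()`; this helper name is hypothetical):

```python
def generate_default_prompt(format_columns: dict) -> str:
    """Sketch: build a prompt template from a dataset-format definition's
    feature columns, emitting one "Feature: {column}" line per column."""
    return "".join(f"Feature: {{{feature}}}\n" for feature in format_columns)

alpaca_format = {"instruction": "str", "input": "str", "output": "str"}
print(generate_default_prompt(alpaca_format))
```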
Legacy vs. Axolotl Comparison
| Aspect | Legacy | Axolotl |
|---|---|---|
| Config format | Separate JSON blobs (LoRA, BnB, training args) | Single YAML file |
| Prompt handling | User must create and select a prompt template | Auto-generated from dataset format if not provided |
| Required configs | prompt_id + lora_config + bnb_config + training_arguments_config | axolotl_config_id only |
| Training engine | HuggingFace Accelerate + TRL SFTTrainer | Axolotl framework |
| Distributed training | Supported via dist_* fields | Managed by Axolotl config |
| Validation | prompt_id required | axolotl_config_id required; prompt_id optional |
Workflow
To use Axolotl for fine-tuning:
- Register a base model and dataset via the standard `AddModel` and `AddDataset` RPCs.
- Create or use an existing Axolotl YAML config (stored as `ConfigType.axolotl`).
- Call `StartFineTuningJob` with `framework_type = "axolotl"` and `axolotl_config_id` set to the config ID.
- Omit `prompt_id` to use auto-generation, or provide one to override.
- The job dispatcher passes `axolotl_config_id` in the `JOB_ARGUMENTS` to the training script, which loads and executes the Axolotl training pipeline.
See Fine-Tuning Job Lifecycle for the full dispatch flow and Training Script Architecture for execution details.
Evaluation Job Lifecycle
Evaluation jobs run MLflow evaluation against model+adapter combinations. A single evaluation request can compare multiple adapters against a baseline, with each combination dispatched as a separate CML Job linked by a shared parent_job_id.
Dispatch Architecture
A single StartEvaluationJob request specifies N model+adapter combinations. The dispatcher fans out into N independent CML Jobs, each running its own MLflow evaluation. All jobs share a parent_job_id that groups them for result comparison in the UI.
Validation
Validation is performed by ft/evaluation.py::_validate_start_evaluation_job_request() before any jobs are created.
Required fields:
| Field | Rule |
|---|---|
| `model_adapter_combinations` | Non-empty list of model+adapter pairs |
| `dataset_id` | Must exist in DB |
| `prompt_id` | Must exist in DB |
| `cpu` | Valid resource specification |
| `gpu` | Valid resource specification |
| `memory` | Valid resource specification |
Per-combination validation:
- Each `base_model_id` in the combinations list must exist in the database.
- Each `adapter_id` must exist in the database, or be an empty string to evaluate the base model without an adapter.
- The referenced dataset and prompt must exist.
Multi-Adapter Dispatch
For each model+adapter combination in the request, the dispatcher executes the following sequence:
- Generate IDs – A UUID `job_id` is generated for each individual evaluation run. A shared `parent_job_id` is generated once for the entire batch.
- Create directories – A result directory and job directory are created for each run.
- Find template CML Job – The dispatcher locates the `Mlflow_Evaluation_Base_Job` template in the CML project.
- Build argument list – Each run receives its own argument string containing:
| Argument | Description |
|---|---|
| `base_model_id` | The model to evaluate |
| `adapter_id` | The adapter to apply (empty string for base model only) |
| `dataset_id` | The evaluation dataset |
| `prompt_id` | The prompt template for formatting |
| `result_dir` | Directory for evaluation output |
| `configs` | Evaluation-specific configuration |
| `selected_features` | Dataset features to include |
| `eval_dataset_fraction` | Fraction of dataset to evaluate on |
| `comparison_adapter_id` | The first adapter in the batch, used as the baseline |
| `job_id` | This run’s unique identifier |
| `run_number` | Ordinal position in the batch (1-indexed) |
- Create CML Job and JobRun – A CML Job is created via `cmlapi` with the specified compute resources.
- Store EvaluationJob record – An `EvaluationJob` record is inserted into the `evaluation_jobs` table with the `parent_job_id` for grouping.
Parent Job Grouping
All evaluation runs within a batch share the same parent_job_id. This enables:
- UI grouping – The Streamlit UI displays evaluation runs grouped by parent, showing all adapter comparisons in a single view.
- Baseline comparison – The first adapter in the `model_adapter_combinations` list is designated as the baseline (`comparison_adapter_id`). All other runs compare their metrics against this baseline.
- Batch status tracking – The overall status of an evaluation batch can be determined by aggregating the statuses of all child jobs sharing the same `parent_job_id`.
Evaluation Script
The evaluation logic runs inside ft/scripts/mlflow_evaluation_base_script.py:
- Load model and adapter – The base HuggingFace model is loaded, and the optional PEFT adapter is applied via `load_adapted_hf_generation_pipeline()`. This produces a text-generation pipeline.
- Load and preprocess dataset – The evaluation dataset is loaded, the prompt template is applied to format inputs, and the dataset is sampled to the configured `eval_dataset_fraction`.
eval_dataset_fraction. - Run MLflow evaluation – MLflow’s evaluation framework is invoked with the configured metrics. Results (metric values and artifacts) are logged to an MLflow experiment.
- Log results – Evaluation metrics, predictions, and comparison data are persisted in the MLflow tracking store for retrieval by the UI.
Protobuf Messages
StartEvaluationJobRequest:
| Field | Description |
|---|---|
| `model_adapter_combinations` | List of model+adapter pairs to evaluate |
| `dataset_id` | Evaluation dataset reference |
| `prompt_id` | Prompt template for input formatting |
| `cpu`, `gpu`, `memory` | Compute resources per evaluation job |
| `configs` | Evaluation configuration (metrics, generation settings) |
EvaluationJobMetadata:
| Field | Description |
|---|---|
| `id` | Unique evaluation job identifier |
| `type` | Job type identifier |
| `cml_job_id` | CML Job identifier |
| `parent_job_id` | Shared batch identifier |
| `base_model_id` | Evaluated model |
| `dataset_id` | Evaluation dataset |
| `adapter_id` | Applied adapter (empty for base model) |
| `cpu`, `gpu`, `memory` | Allocated resources |
| `configs` | Evaluation configuration |
| `evaluation_dir` | Path to evaluation results |
See gRPC Service Design for the complete evaluation RPC catalog.
Model Export & Registry
Trained adapters can be exported through two routes, determined by ModelExportType. Both routes merge a base model with a PEFT adapter into a deployable artifact, but target different deployment backends.
Export Routes
Both routes require non-empty base_model_id, adapter_id, and model_name fields. The choice between them depends on the target deployment environment and the adapter source type.
MLflow Model Registry
Function: export_model_registry_model()
This route logs the merged model to the MLflow Model Registry as a registered model. It supports any adapter type (PROJECT, HuggingFace).
Steps:
- Load pipeline – `fetch_pipeline()` creates a HuggingFace text-generation pipeline by loading the base model and applying the PEFT adapter.
- Quantized loading – If a `BitsAndBytesConfig` is specified, the base model is loaded with 4-bit quantization before adapter application.
- Infer signature – An MLflow model signature is inferred from example input/output pairs. This defines the expected request and response schema for the registered model.
- Log model – `mlflow.transformers.log_model()` logs the pipeline to MLflow as a registered model with the specified `model_name`.
Requirements:
| Requirement | Detail |
|---|---|
| Base model | HuggingFace model registered in Studio |
| Adapter | Any adapter type (PROJECT or HuggingFace) |
| MLflow tracking | Must be configured in the CML environment |
CML Model Endpoint
Function: deploy_cml_model()
This route creates a CML Model endpoint that serves the model+adapter combination as a REST API. It is restricted to PROJECT adapters (file-based, local weights).
Steps:
- Validate adapter type – Only PROJECT adapters (local file-based weights) are supported. HuggingFace adapters must be downloaded locally first.
- Create CML Model – A CML Model object is created via `cmlapi`.
- Create ModelBuild – A build is created pointing to the predict script at `ft/scripts/cml_model_predict_script.py`. Environment variables are injected:
| Variable | Description |
|---|---|
| `FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME` | HuggingFace identifier for the base model |
| `ADAPTER_LOCATION` | File path to the adapter weights directory |
| `GEN_CONFIG_STRING` | Serialized generation config (JSON string) |
- Deploy – A ModelDeployment is created with default resources:
| Resource | Default |
|---|---|
| CPU | 2 cores |
| Memory | 8 GB |
| GPU | 1 |
- Resolve runtime – The runtime identifier is inherited from the template `Finetuning_Base_Job`, ensuring the model endpoint uses the same environment as training workloads.
Requirements:
| Requirement | Detail |
|---|---|
| Base model | HuggingFace model registered in Studio |
| Adapter | PROJECT type only (local file weights) |
| Adapter weights | Must be accessible on the local filesystem |
Validation
Both export routes perform the following validation before proceeding:
- `base_model_id` must be non-empty and reference an existing model in the database.
- `adapter_id` must be non-empty and reference an existing adapter in the database.
- `model_name` must be non-empty.
Additional route-specific validation:
- CML Model: The adapter must be of type PROJECT. Model Registry adapters require the MLflow Registry export path instead.
- MLflow Registry: The MLflow tracking server must be accessible.
Choosing an Export Route
| Criterion | MLflow Registry | CML Model Endpoint |
|---|---|---|
| Adapter source | Any (PROJECT, HuggingFace) | PROJECT only |
| Output format | MLflow registered model | REST API endpoint |
| Serving infrastructure | MLflow serving or downstream consumption | CML Model serving |
| Resource customization | Managed by MLflow | Default 2 CPU / 8 GB / 1 GPU (adjustable post-deploy) |
| Use case | Model versioning, experiment tracking, CI/CD pipelines | Real-time inference endpoint |
See CML Model Serving for details on the predict script and endpoint behavior.
CML Model Serving
A CML Model endpoint serves a fine-tuned model+adapter combination as a REST API. The endpoint is created by deploy_cml_model() (see Model Export & Registry) and runs a predict script that loads the model, applies the adapter, and handles inference requests.
Predict Script
Path: ft/scripts/cml_model_predict_script.py
The predict script runs inside a CML Model endpoint container. It is specified as the build script during deploy_cml_model() and executes in the runtime environment inherited from the template fine-tuning job.
Initialization
On startup, the script:
- Reads environment variables:
| Variable | Purpose |
|---|---|
| `FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME` | HuggingFace model identifier to load |
| `ADAPTER_LOCATION` | Path to the PEFT adapter weights directory |
| `GEN_CONFIG_STRING` | Serialized generation configuration (JSON) |
- Loads the base model – The HuggingFace model is loaded from the Hub or cache using the identifier in `FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME`.
- Applies the PEFT adapter – The LoRA adapter weights at `ADAPTER_LOCATION` are loaded and applied to the base model.
Request Handling
The predict script exposes a `predict()` function that CML invokes for each incoming request.
Request format:
```json
{
  "request": {
    "prompt": "Your input text here"
  }
}
```
The `prompt` field contains the raw input text. The `predict()` function:
- Extracts the prompt from the request payload.
- Tokenizes the input using the model’s tokenizer.
- Generates output using the model with the applied generation config.
- Decodes and returns the generated text.
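The four steps above can be sketched with a stubbed generation stage (the response shape and `fake_generate` helper are illustrative assumptions; the real script tokenizes and generates with the loaded model and generation config):

```python
def fake_generate(prompt: str) -> str:
    """Stand-in for tokenize -> model.generate -> decode (steps 2-4)."""
    return prompt.upper()

def predict(args: dict) -> dict:
    """Sketch of the CML Model predict entry point."""
    prompt = args["request"]["prompt"]   # 1. extract the prompt
    generated = fake_generate(prompt)    # 2-3. tokenize + generate (stubbed)
    return {"response": generated}       # 4. return the decoded text

print(predict({"request": {"prompt": "hello"}}))  # {'response': 'HELLO'}
```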
Endpoint Creation Flow
The full endpoint creation sequence, initiated by deploy_cml_model():
- Create CML Model – A new Model object is created in the CML project via `cmlapi`. This registers the model name and description.
- Create ModelBuild – A build is created with:
  - The predict script path (`ft/scripts/cml_model_predict_script.py`).
  - Environment variables (`FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME`, `ADAPTER_LOCATION`, `GEN_CONFIG_STRING`).
  - The runtime identifier from the template fine-tuning job.
- Create ModelDeployment – A deployment is created with default resource allocation:
| Resource | Default Value |
|---|---|
| CPU | 2 cores |
| Memory | 8 GB |
| GPU | 1 |
- Runtime resolution – The runtime is inherited from the template `Finetuning_Base_Job`. This ensures the model endpoint has the same Python packages, CUDA version, and system libraries as the training environment.
Limitations
- PROJECT adapters only – Only adapters stored as local files (PROJECT type) are supported for CML Model deployment. HuggingFace Hub adapters must be downloaded to the project filesystem before they can be used with a CML Model endpoint.
- Model Registry adapters – Adapters registered through the MLflow Model Registry cannot be deployed as CML Models directly. Use the MLflow Registry export path instead (see Model Export & Registry).
- Fixed default resources – The deployment is created with 1 GPU, 2 CPU cores, and 8 GB memory. To adjust resource allocation after deployment, modify the CML Model settings through the CML UI or `cmlapi`.
- Single adapter – Each CML Model endpoint serves exactly one base model + adapter combination. To serve multiple adapters, create multiple endpoints.
Post-Deployment
After deployment completes:
- The endpoint URL is available in the CML Model UI and via `cmlapi`.
- Requests are sent as HTTP POST with the JSON format shown above.
- The endpoint auto-scales based on CML’s Model serving configuration.
- Logs and metrics are available through CML’s standard monitoring interface.
- Resource allocation can be modified via the CML Model settings without rebuilding.
Validation Rules Reference
The Studio validates resources at multiple points: on import (datasets, models, adapters, prompts, configs), on job submission (fine-tuning, evaluation), and on export (model deployment). This chapter catalogs all validation rules extracted from the source code.
Source: ft/jobs.py, ft/evaluation.py, ft/datasets.py, ft/models.py, ft/adapters.py, ft/prompts.py, ft/configs.py, ft/service.py
Rule ID Convention
Rule IDs follow the format {Domain}-{Number} where Domain is one of:
| Domain | Scope |
|---|---|
| FT | Fine-tuning job parameters |
| EV | Evaluation job parameters |
| DS | Dataset import |
| MD | Model import |
| AD | Adapter import |
| PR | Prompt template |
| CF | Configuration blob |
| EX | Model export / deployment |
All rules with severity ERROR abort the operation and return a gRPC error. Rules with severity INFO are advisory and do not block the operation.
Fine-Tuning Job Validation
Validated in ft/jobs.py when StartFineTuningJob is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| FT-001 | framework_type | Must be legacy or axolotl | ERROR |
| FT-002 | adapter_name | Must match ^[a-zA-Z0-9-]+$ (alphanumeric + hyphens, no spaces) | ERROR |
| FT-003 | out_dir | Must exist as a directory | ERROR |
| FT-004 | num_cpu | Must be > 0 | ERROR |
| FT-005 | num_gpu | Must be >= 0 | ERROR |
| FT-006 | num_memory | Must be > 0 | ERROR |
| FT-007 | num_workers | Must be > 0 | ERROR |
| FT-008 | num_epochs | Must be > 0 | ERROR |
| FT-009 | learning_rate | Must be > 0 | ERROR |
| FT-010 | dataset_fraction | Must be in (0, 1] | ERROR |
| FT-011 | train_test_split | Must be in (0, 1] | ERROR |
| FT-012 | axolotl_config_id | Required when framework_type=axolotl | ERROR |
| FT-013 | base_model_id | Must exist in models table | ERROR |
| FT-014 | dataset_id | Must exist in datasets table | ERROR |
| FT-015 | prompt_id | Must exist in prompts table (legacy framework only) | ERROR |
| FT-016 | axolotl_config_id | Must exist in configs table (when provided) | ERROR |
FT-001 through FT-011 are local validations that require no database access. FT-012 is a cross-field consistency check. FT-013 through FT-016 are foreign-key validations resolved against the DAO.
Evaluation Job Validation
Validated in ft/evaluation.py when StartEvaluationJob is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| EV-001 | model_adapter_combinations | Must be non-empty | ERROR |
| EV-002 | dataset_id | Must be non-empty | ERROR |
| EV-003 | prompt_id | Must be non-empty | ERROR |
| EV-004 | num_cpu, num_gpu, num_memory | Must be provided | ERROR |
| EV-005 | model IDs in combinations | Each must exist in models table | ERROR |
| EV-006 | adapter IDs in combinations | Each must exist in adapters table (or empty for base model) | ERROR |
| EV-007 | dataset_id | Must exist in datasets table | ERROR |
| EV-008 | prompt_id | Must exist in prompts table | ERROR |
Evaluation jobs accept multiple model-adapter pairs in a single request. EV-005 and EV-006 are validated per combination entry.
Dataset Import Validation
Validated in ft/datasets.py when AddDataset is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| DS-001 | type | Must be one of: huggingface, project, project_csv, project_json, project_jsonl | ERROR |
| DS-002 | huggingface_name | Must resolve via HfApi.dataset_info() (huggingface type) | ERROR |
| DS-003 | location | File must exist (project_csv, project_json, project_jsonl) | ERROR |
DS-002 makes a network call to the HuggingFace Hub. If HUGGINGFACE_ACCESS_TOKEN is set, it is used for gated dataset access. DS-003 validates the local filesystem path and checks the file extension matches the declared type.
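The DS-003 check can be sketched as follows (the extension mapping and helper name are illustrative assumptions; the real check lives in `ft/datasets.py`):

```python
from pathlib import Path

# Assumed mapping of declared dataset type to the allowed file extension.
EXTENSIONS = {"project_csv": ".csv", "project_json": ".json", "project_jsonl": ".jsonl"}

def check_local_dataset(ds_type: str, location: str) -> list[str]:
    """Sketch of DS-003: the file must exist and its extension must
    match the declared dataset type."""
    errors = []
    path = Path(location)
    if not path.is_file():
        errors.append(f"DS-003: file not found: {location}")
    if path.suffix != EXTENSIONS.get(ds_type, ""):
        errors.append(f"DS-003: extension {path.suffix!r} does not match type {ds_type!r}")
    return errors

print(check_local_dataset("project_csv", "data/train.jsonl"))
```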
Model Import Validation
Validated in ft/models.py when AddModel is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| MD-001 | type | Must be one of: huggingface, project, model_registry | ERROR |
| MD-002 | huggingface_model_name | Must resolve via HfApi.model_info() (huggingface type) | ERROR |
| MD-003 | cml_registered_model_id | Must resolve via cmlapi (model_registry type) | ERROR |
MD-002 contacts the HuggingFace Hub. MD-003 queries the CML Model Registry through the cmlapi SDK.
Adapter Import Validation
Validated in ft/adapters.py when AddAdapter is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| AD-001 | name | Required, non-empty | ERROR |
| AD-002 | model_id | Required, must exist in models table | ERROR |
| AD-003 | location | Must exist as directory (project type) | ERROR |
| AD-004 | fine_tuning_job_id | Must exist in fine_tuning_jobs table (if provided) | ERROR |
| AD-005 | prompt_id | Must exist in prompts table (if provided) | ERROR |
AD-004 and AD-005 are optional foreign-key references. When provided, they link the adapter back to the job and prompt that produced it.
Prompt Validation
Validated in ft/prompts.py when AddPrompt is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| PR-001 | name | Required, unique | ERROR |
| PR-002 | dataset_id | Required | ERROR |
| PR-003 | prompt_template | Required, non-empty | ERROR |
| PR-004 | input_template | Required | ERROR |
| PR-005 | completion_template | Required | ERROR |
PR-001 enforces uniqueness at the application level before insert. The prompt_template, input_template, and completion_template fields use Python format-string syntax referencing dataset feature column names.
Config Validation
Validated in ft/configs.py when AddConfig is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| CF-001 | type | Must be one of ConfigType enum values | ERROR |
| CF-002 | config | Must be valid JSON (non-axolotl types) | ERROR |
| CF-003 | config | Must be valid YAML (axolotl type) | ERROR |
| CF-004 | config | Deduplicated – returns existing ID if identical content exists | INFO |
CF-004 is a deduplication check, not an error. When a config with identical content already exists, the existing record’s ID is returned instead of creating a duplicate. The caller receives a successful response in either case.
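The deduplication behavior can be sketched with an in-memory store (content hashing and the `add_config` helper are assumptions for illustration; the real check compares against rows in the SQLite `configs` table):

```python
import hashlib

# In-memory stand-in for the configs table: content hash -> config ID.
_store: dict[str, str] = {}

def add_config(content: str) -> str:
    """Sketch of CF-004: identical config content returns the existing
    record's ID instead of creating a duplicate."""
    key = hashlib.sha256(content.encode()).hexdigest()
    if key in _store:
        return _store[key]                 # dedup hit: reuse existing ID
    config_id = f"cfg-{len(_store) + 1}"   # illustrative ID scheme
    _store[key] = config_id
    return config_id

a = add_config('{"lora_r": 8}')
b = add_config('{"lora_r": 8}')   # identical content
print(a == b)  # True
```

Either way the caller gets a valid config ID back, which is why CF-004 is advisory rather than an error.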
Export Validation
Validated in ft/models.py when ExportModel or RegisterModel is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| EX-001 | base_model_id | Required, non-empty | ERROR |
| EX-002 | adapter_id | Required, non-empty | ERROR |
| EX-003 | model_name | Required, non-empty | ERROR |
| EX-004 | adapter type | Must be PROJECT for CML Model deployment | ERROR |
| EX-005 | model type | Must be huggingface for CML Model deployment | ERROR |
EX-004 and EX-005 enforce deployment constraints. Only project-local adapters (those with files on disk) can be deployed as CML Model endpoints, and only HuggingFace-sourced base models are supported for the export merge workflow.
Building a Validation SDK
This chapter provides guidance for building a validation SDK that validates resources and job parameters before submitting them to the Fine Tuning Studio gRPC API. Pre-submission validation catches errors locally, avoiding round-trips to the server and failed CML Job launches.
Source: ft/client.py, ft/proto/fine_tuning_studio_pb2.py, ft/api/types.py
Install the ft Package
The Fine Tuning Studio ships as a pip-installable package. Install it to get access to protobuf definitions, API types, and the gRPC client:
```shell
pip install -e /path/to/CML_AMP_LLM_Fine_Tuning_Studio
```
This provides the ft package in development mode. All protobuf-generated classes and enum types are available for import.
Import Protobuf Types
```python
from ft.api import *

# Or import specific types:
from ft.proto.fine_tuning_studio_pb2 import (
    StartFineTuningJobRequest,
    AddDatasetRequest,
    AddModelRequest,
    AddConfigRequest,
)
from ft.api.types import (
    DatasetType,
    ConfigType,
    FineTuningFrameworkType,
)
```
These types define the exact field names, types, and enum values accepted by the gRPC API.
Validation Architecture
Validation rules fall into two categories:
| Category | Requires | Examples |
|---|---|---|
| Local validation | No external dependencies | Regex checks, numeric bounds, enum membership, cross-field consistency |
| DB-dependent validation | FineTuningStudioClient connection | Foreign-key existence (model, dataset, prompt, adapter) |
A validation SDK should implement local validation in pure functions and DB-dependent validation through the client.
Local Validation
Replicate the rules from the Validation Rules Reference that require no database access:
```python
import re

def validate_adapter_name(name: str) -> list[str]:
    """FT-002: adapter_name must be alphanumeric + hyphens."""
    errors = []
    if not re.match(r'^[a-zA-Z0-9-]+$', name):
        errors.append("FT-002: adapter_name must match ^[a-zA-Z0-9-]+$")
    return errors

def validate_resource_allocation(num_cpu: int, num_gpu: int, num_memory: int) -> list[str]:
    """FT-004, FT-005, FT-006: resource bounds."""
    errors = []
    if num_cpu <= 0:
        errors.append("FT-004: num_cpu must be > 0")
    if num_gpu < 0:
        errors.append("FT-005: num_gpu must be >= 0")
    if num_memory <= 0:
        errors.append("FT-006: num_memory must be > 0")
    return errors

def validate_training_params(num_epochs: int, learning_rate: float,
                             dataset_fraction: float, train_test_split: float) -> list[str]:
    """FT-008 through FT-011: training parameter ranges."""
    errors = []
    if num_epochs <= 0:
        errors.append("FT-008: num_epochs must be > 0")
    if learning_rate <= 0:
        errors.append("FT-009: learning_rate must be > 0")
    if not (0 < dataset_fraction <= 1):
        errors.append("FT-010: dataset_fraction must be in (0, 1]")
    if not (0 < train_test_split <= 1):
        errors.append("FT-011: train_test_split must be in (0, 1]")
    return errors

def validate_framework_config(framework_type: str, axolotl_config_id: str) -> list[str]:
    """FT-001, FT-012: framework type and config consistency."""
    errors = []
    if framework_type not in ("legacy", "axolotl"):
        errors.append("FT-001: framework_type must be legacy or axolotl")
    if framework_type == "axolotl" and not axolotl_config_id:
        errors.append("FT-012: axolotl_config_id required when framework_type=axolotl")
    return errors
```
DB-Dependent Validation
Some rules require database lookups. Use FineTuningStudioClient for these:
```python
from ft.client import FineTuningStudioClient

client = FineTuningStudioClient()

# Check if model exists (FT-013)
models = client.get_models()
model_ids = {m.id for m in models}
assert model_id in model_ids, f"FT-013: Model {model_id} not found"

# Check if dataset exists (FT-014)
datasets = client.get_datasets()
dataset_ids = {d.id for d in datasets}
assert dataset_id in dataset_ids, f"FT-014: Dataset {dataset_id} not found"

# Check if prompt exists (FT-015)
prompts = client.get_prompts()
prompt_ids = {p.id for p in prompts}
assert prompt_id in prompt_ids, f"FT-015: Prompt {prompt_id} not found"
```
The client connects to the gRPC server over FINE_TUNING_SERVICE_IP:FINE_TUNING_SERVICE_PORT. These environment variables are set during Studio initialization.
Config Content Validation
Validate config content before submitting via AddConfig:
```python
import json
import yaml

def validate_config(config_type: str, config_content: str) -> list[str]:
    """CF-002, CF-003: config content must parse correctly."""
    errors = []
    try:
        if config_type == "axolotl":
            yaml.safe_load(config_content)  # CF-003
        else:
            json.loads(config_content)      # CF-002
    except (yaml.YAMLError, json.JSONDecodeError) as e:
        errors.append(f"Config parse error: {e}")
    return errors
```
Composing a Validation Pipeline
Combine local and DB-dependent validators into a single function that returns all errors at once:
```python
def validate_fine_tuning_request(
    request: StartFineTuningJobRequest,
    client: FineTuningStudioClient,
) -> list[str]:
    """Validate a fine-tuning request against all applicable rules.

    Returns a list of error strings. An empty list means the request is valid.
    """
    errors = []

    # Local validation
    errors.extend(validate_framework_config(
        request.framework_type, request.axolotl_config_id))
    errors.extend(validate_adapter_name(request.adapter_name))
    errors.extend(validate_resource_allocation(
        request.num_cpu, request.num_gpu, request.num_memory))
    errors.extend(validate_training_params(
        request.num_epochs, request.learning_rate,
        request.dataset_fraction, request.train_test_split))

    # DB-dependent validation
    models = {m.id for m in client.get_models()}
    if request.base_model_id not in models:
        errors.append(f"FT-013: Model {request.base_model_id} not found")

    datasets = {d.id for d in client.get_datasets()}
    if request.dataset_id not in datasets:
        errors.append(f"FT-014: Dataset {request.dataset_id} not found")

    if request.framework_type == "legacy":
        prompts = {p.id for p in client.get_prompts()}
        if request.prompt_id not in prompts:
            errors.append(f"FT-015: Prompt {request.prompt_id} not found")

    if request.axolotl_config_id:
        configs = {c.id for c in client.get_configs()}
        if request.axolotl_config_id not in configs:
            errors.append(f"FT-016: Config {request.axolotl_config_id} not found")

    return errors
```
Usage Pattern
Call the validation pipeline before submitting any request:
```python
from ft.client import FineTuningStudioClient
from ft.proto.fine_tuning_studio_pb2 import StartFineTuningJobRequest

client = FineTuningStudioClient()

request = StartFineTuningJobRequest(
    framework_type="legacy",
    adapter_name="my-adapter",
    base_model_id="abc-123",
    dataset_id="def-456",
    prompt_id="ghi-789",
    num_cpu=2,
    num_gpu=1,
    num_memory=8,
    num_epochs=3,
    learning_rate=2e-5,
    dataset_fraction=1.0,
    train_test_split=0.8,
)

errors = validate_fine_tuning_request(request, client)
if errors:
    for e in errors:
        print(f"  {e}")
    raise ValueError(f"Validation failed with {len(errors)} error(s)")

# Safe to submit
client.start_fine_tuning_job(request)
```
This pattern ensures that invalid requests never reach the gRPC server, providing immediate feedback and avoiding wasted CML Job compute.
GitHub Actions Integration
This chapter documents the existing CI/CD configuration and provides patterns for extending it with config validation and formatting checks.
Source: .github/workflows/run-tests.yaml, .github/workflows/docs.yml, bin/run-tests.sh
Existing CI Workflow
The primary CI workflow is .github/workflows/run-tests.yaml:
| Setting | Value |
|---|---|
| Trigger | Pushes and PRs to main and dev branches |
| Runner | ubuntu-latest |
| Python version | 3.11 |
| Dependencies | requirements.txt |
| Test command | pytest -v --cov=ft --cov-report=html --cov-report=xml -s tests/ |
| Coverage threshold | >10% |
The workflow installs all dependencies, runs the full test suite with coverage collection, and generates both HTML and XML coverage reports.
Running Tests Locally
```shell
# Full test suite with coverage
./bin/run-tests.sh

# Single test file
pytest -v -s tests/test_datasets.py

# Single test method
pytest -v -s tests/test_datasets.py::TestDatasets::test_add_dataset
```
The bin/run-tests.sh script mirrors the CI configuration. Run it before pushing to catch failures early.
Adding Config Validation to CI
Axolotl YAML configs and dataset format JSON files live under ft/config/. A dedicated workflow validates these files on any PR that modifies them:
```yaml
name: Validate Configs

on:
  pull_request:
    paths:
      - 'ft/config/**'
      - 'data/project_defaults.json'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install pyyaml pydantic
      - name: Validate Axolotl configs
        run: |
          python -c "
          import yaml, json, glob

          # Validate YAML configs
          for f in glob.glob('ft/config/axolotl/training_config/*.yaml'):
              with open(f) as fh:
                  yaml.safe_load(fh)
              print(f'OK: {f}')

          # Validate dataset format JSON configs
          for f in glob.glob('ft/config/axolotl/dataset_formats/*.json'):
              with open(f) as fh:
                  json.load(fh)
              print(f'OK: {f}')

          # Validate project defaults
          with open('data/project_defaults.json') as fh:
              json.load(fh)
          print('OK: data/project_defaults.json')
          "
```
The paths filter ensures this workflow only runs when config files change. Any parse error causes the step to fail with a traceback identifying the malformed file.
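The same checks can also live in a standalone module so they are runnable locally as well as from CI. The sketch below is an assumption, not existing Studio code — the file name (e.g. a hypothetical `scripts/validate_configs.py`) and function names are illustrative; parse errors propagate or are collected, matching the fail-on-traceback behavior of the inline step above:

```python
import json
from pathlib import Path


def validate_config_file(path: str) -> None:
    """Parse one config file, raising on malformed content."""
    text = Path(path).read_text()
    if path.endswith((".yaml", ".yml")):
        import yaml  # PyYAML, imported lazily so JSON-only runs don't need it
        yaml.safe_load(text)
    elif path.endswith(".json"):
        json.loads(text)
    else:
        raise ValueError(f"unsupported config type: {path}")


def validate_tree(root: str) -> list[tuple[str, str]]:
    """Walk `root` and return (path, error) pairs for files that fail to parse."""
    failures = []
    for p in sorted(Path(root).rglob("*")):
        if p.suffix in {".yaml", ".yml", ".json"}:
            try:
                validate_config_file(str(p))
            except Exception as exc:
                failures.append((str(p), str(exc)))
    return failures
```

A CI step would then reduce to `python scripts/validate_configs.py` and an exit code derived from whether `validate_tree` returned any failures.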
Pre-Commit Formatting Check
The project uses autoflake and autopep8 for code formatting. Add a CI step to verify formatting compliance:
- name: Check formatting
  run: |
    set -o pipefail
    pip install autoflake autopep8
    autoflake --check --remove-all-unused-imports \
      --ignore-init-module-imports --recursive ft/ tests/ pgs/
    autopep8 --exit-code --diff --max-line-length 120 \
      --aggressive --aggressive --aggressive \
      --recursive ft/ tests/ pgs/ | head -20
autoflake --check exits non-zero if any unused imports are found. autopep8 --diff prints the diff that would be applied without modifying files; --exit-code makes it also exit non-zero when a diff exists, and set -o pipefail propagates that failure through the head -20 pipe that keeps output concise. If either tool reports issues, the step fails.
Documentation Deployment
The documentation workflow is .github/workflows/docs.yml. It builds the mdbook with D2 diagram support and deploys to GitHub Pages. This workflow is independent of the test CI and triggers on documentation changes. See the System Overview for the full project architecture.
Extending the CI Pipeline
When adding new validation workflows, follow these conventions:
| Convention | Guideline |
|---|---|
| Path filtering | Use paths: to scope workflows to relevant directories |
| Python version | Pin to 3.11 to match production |
| Dependency isolation | Install only what the validation step needs, not the full requirements.txt |
| Exit codes | Rely on tool exit codes for pass/fail – avoid custom success checks |
| Artifact uploads | Use actions/upload-artifact@v4 for coverage reports or validation logs |
| Branch protection | Configure required status checks on main to enforce green CI before merge |
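Applied together, a new validation workflow following these conventions might look like the fragment below. The workflow, script, and artifact names are illustrative, not existing files:

```yaml
name: Validate Adapters
on:
  pull_request:
    paths:
      - 'ft/**'                     # path filtering: run only on relevant changes

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'    # pinned to match production
      - run: pip install pydantic   # only what this step needs, not requirements.txt
      - run: python scripts/validate_adapters.py   # tool exit code decides pass/fail
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validation-logs
          path: validation.log
```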