Introduction
Fine Tuning Studio is a Cloudera AMP (Applied ML Prototype) for managing, fine-tuning, and evaluating large language models within Cloudera Machine Learning (CML). It provides a Streamlit UI backed by a gRPC API, a SQLite metadata store, and job dispatch to CML workloads for training and evaluation. Models, datasets, PEFT adapters, and prompt templates are managed as first-class resources that flow through the import-train-evaluate-deploy lifecycle.
This guide serves two audiences:
| If you are… | Start here |
|---|---|
| Building a training harness or extending the platform (custom gRPC clients, new dataset types, training scripts, Axolotl integrations) | Architecture Reference |
| Building a validation SDK or CI/CD pipeline for fine-tuning artifacts (config validation, adapter packaging, model export) | Resource Specifications and Validation Rules |
Terminology
| Term | Definition |
|---|---|
| Dataset | A reference to a HuggingFace Hub dataset or a local file (CSV, JSON, JSONL) registered in the Studio’s metadata store. Features are auto-extracted on import. |
| Model | A base foundation model registered from HuggingFace Hub or the CML Model Registry. Serves as the starting point for fine-tuning. |
| Adapter | A PEFT LoRA adapter — either produced by a fine-tuning job, imported from a local directory, or fetched from HuggingFace Hub. Applied on top of a base model. |
| Prompt Template | A format-string template that maps dataset feature columns into training input. Contains prompt_template, input_template, and completion_template fields. |
| Config | A named configuration blob — training arguments, BitsAndBytes quantization, LoRA hyperparameters, generation config, or Axolotl YAML. Configs are deduplicated by content. |
| Fine-Tuning Job | A CML Job that trains a PEFT adapter. Dispatched via the gRPC API, tracked in the metadata store, executed as a CML workload with configurable CPU/GPU/memory. |
| Evaluation Job | A CML Job that runs MLflow evaluation against one or more model+adapter combinations. Results are tracked in MLflow experiments. |
| gRPC Service | The Fine Tuning Service (FTS) — a stateless gRPC server on port 50051 that hosts all application logic. Accessed via FineTuningStudioClient. |
| DAO | Data Access Object — FineTuningStudioDao manages SQLAlchemy sessions and connection pooling against the SQLite database. |
| CML Workload | A Cloudera ML Job, session, or model endpoint. Fine-tuning and evaluation are dispatched as CML Jobs via the cmlapi SDK. |
Resource Lifecycle
The lifecycle begins with importing resources (datasets from HuggingFace or local files, base models, prompt templates) and ends with deploying trained adapters to the CML Model Registry or as CML Model endpoints. The gRPC API drives every step — the Streamlit UI is a client of this API, not the source of truth.
System Overview
Fine Tuning Studio is a three-layer application running inside a single CML Application pod. A Streamlit frontend communicates with a gRPC backend over localhost; the backend persists metadata to SQLite and dispatches CML Jobs for training and evaluation workloads.
Component Topology
Layer Summary
Presentation Layer
Entry point: main.py. Page modules live in pgs/. Two navigation modes are controlled by the IS_COMPOSABLE environment variable:
- Composable mode (IS_COMPOSABLE set): Horizontal navbar with dropdown menus for Home, Database, Resources, Experiments, AI Workbench, Examples, and Feedback.
- Standard mode (default): Sidebar navigation with section headers and Material Design icons.
Pages obtain shared gRPC and CML client instances through @st.cache_resource decorators defined in pgs/streamlit_utils.py. See Streamlit Presentation Layer for full details.
Application Layer
A gRPC server runs on port 50051, started by bin/start-grpc-server.py as a background subprocess. The service class FineTuningStudioApp in ft/service.py implements FineTuningStudioServicer (generated from protobuf). It is a pure router – each RPC method delegates to a domain function in the corresponding module:
| Module | Domain |
|---|---|
| ft/datasets.py | Dataset import, listing, removal |
| ft/models.py | Model registration, export |
| ft/adapters.py | Adapter management, dataset split lookup |
| ft/prompts.py | Prompt template CRUD |
| ft/jobs.py | Fine-tuning job dispatch and tracking |
| ft/evaluation.py | Evaluation job dispatch and tracking |
| ft/configs.py | Configuration blob management |
| ft/database_ops.py | Database export/import operations |
The servicer holds a cmlapi.default_client() and a FineTuningStudioDao instance, passing both to every domain function call. See gRPC Service Design for the full API surface.
Data Layer
SQLite at .app/state.db via SQLAlchemy ORM. Seven tables: models, datasets, adapters, prompts, fine_tuning_jobs, evaluation_jobs, configs. The DAO manages sessions with connection pooling (pool_size=5, max_overflow=10, pool_timeout=30, pool_recycle=1800). See Data Tier for schemas and the DAO API.
Initialization Sequence
The startup sequence is defined in .project-metadata.yaml and executed by bin/start-app-script.sh:
- Install dependencies – bin/install-dependencies-uv.py installs from requirements.txt and performs pip install -e . to install the ft package in dev mode.
- Create template CML Jobs – Accel_Finetuning_Base_Job and Mlflow_Evaluation_Base_Job are created as reusable job templates for fine-tuning and evaluation dispatch.
- Initialize project defaults – bin/initialize-project-defaults-uv.py populates default datasets, prompts, models, and adapters from data/project_defaults.json.
- Start gRPC server – bin/start-grpc-server.py launches as a background process (&), binds to port 50051 with a ThreadPoolExecutor(max_workers=10), and sets FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT as CML project environment variables via cmlapi.
- Start Streamlit – uv run -m streamlit run main.py --server.port $CDSW_APP_PORT --server.address 127.0.0.1.
Both processes (gRPC server and Streamlit) run in the same pod. The gRPC server is the subprocess; Streamlit is the foreground process that keeps the CML Application alive.
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| FINE_TUNING_SERVICE_IP | gRPC server IP address | Set at startup from CDSW_IP_ADDRESS |
| FINE_TUNING_SERVICE_PORT | gRPC server port | 50051 |
| FINE_TUNING_STUDIO_SQLITE_DB | SQLite database file path | .app/state.db |
| CDSW_PROJECT_ID | CML project identifier | Set by CML runtime |
| CDSW_APP_PORT | Streamlit server port | Set by CML runtime |
| HUGGINGFACE_ACCESS_TOKEN | HuggingFace Hub token for gated models | Optional (empty string) |
| IS_COMPOSABLE | Enable horizontal navbar mode | Optional (unset = sidebar) |
| CUSTOM_LORA_ADAPTERS_DIR | Directory for custom LoRA adapters | data/adapters/ |
| FINE_TUNING_STUDIO_PROJECT_DEFAULTS | Path to project defaults JSON | data/project_defaults.json |
Key Takeaway for Harness Builders
The gRPC API is the sole interface to application logic. The Streamlit UI is one client of this API, not the source of truth. Any external harness, CLI tool, or automation script should instantiate a FineTuningStudioClient (or use the generated gRPC stub directly) and interact through the protobuf contract. The database is an implementation detail behind the DAO – never access .app/state.db directly from external code.
To build a custom training harness:
- Import FineTuningStudioClient from ft.client.
- Register resources (datasets, models, prompts) via Add* RPCs.
- Dispatch training via StartFineTuningJob with the desired resource IDs and compute configuration.
- Poll job status via GetFineTuningJob or ListFineTuningJobs.
- Evaluate results via StartEvaluationJob.
All resource IDs are UUIDs assigned by the service. Pass them by value between RPCs.
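The status-polling step can be sketched as a simple loop. The helper below is illustrative only: FakeClient is a stand-in so the sketch runs offline, and a real harness would instead call GetFineTuningJob and inspect the CML workload state behind the returned cml_job_id.

```python
import time

class FakeClient:
    """Stand-in for FineTuningStudioClient so the sketch runs offline."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self, job_id):
        # Real code: derive this from a GetFineTuningJob response.
        return next(self._statuses)

def wait_for_job(client, job_id, poll_seconds=0.0, max_polls=100):
    """Poll until the job reaches a terminal state; raise on timeout."""
    for _ in range(max_polls):
        status = client.get_status(job_id)
        if status in ("succeeded", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

client = FakeClient(["scheduling", "running", "succeeded"])
```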
gRPC Service Design
The Fine Tuning Studio API is defined as a single gRPC service in ft/proto/fine_tuning_studio.proto. The service exposes 32 RPCs organized by resource domain (the catalog below lists every one). A generated Python stub provides the transport layer; FineTuningStudioClient wraps it with error handling and convenience methods.
Service Architecture
RPC Catalog
Most domains follow the same pattern – List, Get, Add (or Start for jobs), and Remove – with a few domain-specific extras such as ExportModel and GetDatasetSplitByAdapter. Request and response types use the naming convention {Action}{Domain}Request / {Action}{Domain}Response.
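The naming convention is mechanical enough to derive programmatically, which is convenient when generating client helpers or validation tables. The helper below is illustrative, not part of the Studio codebase:

```python
def rpc_message_names(action: str, domain: str) -> tuple[str, str]:
    """Derive request/response type names from the {Action}{Domain} convention."""
    return f"{action}{domain}Request", f"{action}{domain}Response"
```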
Dataset RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListDatasets | ListDatasetsRequest | ListDatasetsResponse | Return all registered datasets |
| GetDataset | GetDatasetRequest | GetDatasetResponse | Return a single dataset by ID |
| AddDataset | AddDatasetRequest | AddDatasetResponse | Register a HuggingFace or local dataset |
| RemoveDataset | RemoveDatasetRequest | RemoveDatasetResponse | Delete a dataset registration |
| GetDatasetSplitByAdapter | GetDatasetSplitByAdapterRequest | GetDatasetSplitByAdapterResponse | Get dataset split info for a specific adapter |
Model RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListModels | ListModelsRequest | ListModelsResponse | Return all registered models |
| GetModel | GetModelRequest | GetModelResponse | Return a single model by ID |
| AddModel | AddModelRequest | AddModelResponse | Register a HuggingFace or CML model |
| ExportModel | ExportModelRequest | ExportModelResponse | Export a model to CML Model Registry |
| RemoveModel | RemoveModelRequest | RemoveModelResponse | Delete a model registration |
Adapter RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListAdapters | ListAdaptersRequest | ListAdaptersResponse | Return all registered adapters |
| GetAdapter | GetAdapterRequest | GetAdapterResponse | Return a single adapter by ID |
| AddAdapter | AddAdapterRequest | AddAdapterResponse | Register a local or HuggingFace adapter |
| RemoveAdapter | RemoveAdapterRequest | RemoveAdapterResponse | Delete an adapter registration |
Prompt RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListPrompts | ListPromptsRequest | ListPromptsResponse | Return all prompt templates |
| GetPrompt | GetPromptRequest | GetPromptResponse | Return a single prompt by ID |
| AddPrompt | AddPromptRequest | AddPromptResponse | Create a new prompt template |
| RemovePrompt | RemovePromptRequest | RemovePromptResponse | Delete a prompt template |
Fine-Tuning RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListFineTuningJobs | ListFineTuningJobsRequest | ListFineTuningJobsResponse | Return all fine-tuning jobs |
| GetFineTuningJob | GetFineTuningJobRequest | GetFineTuningJobResponse | Return a single job by ID |
| StartFineTuningJob | StartFineTuningJobRequest | StartFineTuningJobResponse | Dispatch a new fine-tuning CML Job |
| RemoveFineTuningJob | RemoveFineTuningJobRequest | RemoveFineTuningJobResponse | Delete a fine-tuning job record |
Evaluation RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListEvaluationJobs | ListEvaluationJobsRequest | ListEvaluationJobsResponse | Return all evaluation jobs |
| GetEvaluationJob | GetEvaluationJobRequest | GetEvaluationJobResponse | Return a single evaluation job by ID |
| StartEvaluationJob | StartEvaluationJobRequest | StartEvaluationJobResponse | Dispatch a new evaluation CML Job |
| RemoveEvaluationJob | RemoveEvaluationJobRequest | RemoveEvaluationJobResponse | Delete an evaluation job record |
Config RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ListConfigs | ListConfigsRequest | ListConfigsResponse | Return all configuration blobs |
| GetConfig | GetConfigRequest | GetConfigResponse | Return a single config by ID |
| AddConfig | AddConfigRequest | AddConfigResponse | Create a new configuration |
| RemoveConfig | RemoveConfigRequest | RemoveConfigResponse | Delete a configuration |
Database RPCs
| RPC | Request Type | Response Type | Description |
|---|---|---|---|
| ExportDatabase | ExportDatabaseRequest | ExportDatabaseResponse | Export entire database as JSON |
| ImportDatabase | ImportDatabaseRequest | ImportDatabaseResponse | Import database from JSON file |
Servicer Implementation
FineTuningStudioApp in ft/service.py extends the generated FineTuningStudioServicer. It holds two shared resources initialized in __init__:
class FineTuningStudioApp(FineTuningStudioServicer):
def __init__(self):
self.cml = cmlapi.default_client()
self.dao = FineTuningStudioDao(engine_args={
"pool_size": 5,
"max_overflow": 10,
"pool_timeout": 30,
"pool_recycle": 1800,
})
self.project_id = os.getenv("CDSW_PROJECT_ID")
Every RPC method is a one-line delegation to the corresponding domain function, passing (request, self.cml, self.dao):
def ListDatasets(self, request, context):
return list_datasets(request, self.cml, self.dao)
def StartFineTuningJob(self, request, context):
return start_fine_tuning_job(request, self.cml, dao=self.dao)
Config and database RPCs omit the cml parameter since they operate on local data only.
Client Wrapper
FineTuningStudioClient in ft/client.py wraps the generated stub with automatic error handling. On construction, it introspects all callable methods on the stub and wraps each one to convert grpc.RpcError into ValueError with cleaned messages.
class FineTuningStudioClient:
def __init__(self, server_ip=None, server_port=None):
if not server_ip:
server_ip = os.getenv("FINE_TUNING_SERVICE_IP")
if not server_port:
server_port = os.getenv("FINE_TUNING_SERVICE_PORT")
self.channel = grpc.insecure_channel(f"{server_ip}:{server_port}")
self.stub = FineTuningStudioStub(self.channel)
# Auto-wrap all stub methods with error handling
for attr in dir(self.stub):
if not attr.startswith('_') and callable(getattr(self.stub, attr)):
setattr(self, attr, self._grpc_error_handler(getattr(self.stub, attr)))
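The wrapper itself is not shown in the excerpt above. A minimal sketch of the pattern follows, using a stand-in RpcError class so it runs without grpc installed; the real implementation catches grpc.RpcError and may clean the message further:

```python
class RpcError(Exception):
    """Stand-in for grpc.RpcError; real code catches the grpc class."""
    def details(self):
        return str(self.args[0]) if self.args else ""

def grpc_error_handler(func):
    """Re-raise transport errors as ValueError carrying only the server message."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RpcError as e:
            raise ValueError(e.details()) from e
    return wrapper

@grpc_error_handler
def flaky_rpc():
    # Simulate a failing stub call.
    raise RpcError("dataset not found")
```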
Convenience Methods
The client provides shorthand accessors that construct the request internally:
| Method | Returns | Equivalent RPC |
|---|---|---|
| get_datasets() | List[DatasetMetadata] | ListDatasets(ListDatasetsRequest()).datasets |
| get_models() | List[ModelMetadata] | ListModels(ListModelsRequest()).models |
| get_adapters() | List[AdapterMetadata] | ListAdapters(ListAdaptersRequest()).adapters |
| get_prompts() | List[PromptMetadata] | ListPrompts(ListPromptsRequest()).prompts |
| get_fine_tuning_jobs() | List[FineTuningJobMetadata] | ListFineTuningJobs(ListFineTuningJobsRequest()).fine_tuning_jobs |
| get_evaluation_jobs() | List[EvaluationJobMetadata] | ListEvaluationJobs(ListEvaluationJobsRequest()).evaluation_jobs |
Usage Example
from ft.client import FineTuningStudioClient
from ft.api import *
client = FineTuningStudioClient()
# List all datasets
datasets = client.get_datasets()
# Add a HuggingFace dataset
client.AddDataset(AddDatasetRequest(
type="huggingface",
huggingface_name="tatsu-lab/alpaca",
name="Alpaca"
))
# Start a fine-tuning job
client.StartFineTuningJob(StartFineTuningJobRequest(
base_model_id="model-uuid",
dataset_id="dataset-uuid",
prompt_id="prompt-uuid",
adapter_name="my-adapter",
num_cpu=2,
num_gpu=1,
num_memory=16,
framework_type="legacy"
))
All request and response types are importable from ft.api, which re-exports the generated protobuf classes.
Protobuf Regeneration
After modifying ft/proto/fine_tuning_studio.proto, regenerate the Python bindings:
./bin/generate-proto-python.sh
This produces ft/proto/fine_tuning_studio_pb2.py (message classes) and ft/proto/fine_tuning_studio_pb2_grpc.py (stub and servicer base class). Both are checked into the repository. Do not edit them by hand.
Server Startup
The gRPC server is started by bin/start-grpc-server.py:
- Creates a grpc.server with ThreadPoolExecutor(max_workers=10).
- Registers FineTuningStudioApp() as the servicer.
- Binds to [::]:50051 (all interfaces).
- Updates CML project environment variables (FINE_TUNING_SERVICE_IP, FINE_TUNING_SERVICE_PORT) via cmlapi so that any workload in the project can locate the server.
- Blocks on server.wait_for_termination().
The server process is launched as a background subprocess by bin/start-app-script.sh before Streamlit starts. See System Overview for the full initialization sequence.
Data Tier
All Fine Tuning Studio metadata is persisted in a SQLite database at .app/state.db (configurable via FINE_TUNING_STUDIO_SQLITE_DB). The ORM layer uses SQLAlchemy declarative models defined in ft/db/model.py. Access is managed through FineTuningStudioDao in ft/db/dao.py.
Schema Topology
Table Schemas
All primary keys are String type (UUIDs assigned by domain logic). All columns are nullable except id. ORM classes are defined in ft/db/model.py.
models
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, cml) |
| framework | String | | Model framework identifier |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_model_name | String | | HuggingFace Hub model ID |
| location | String | | Local filesystem path |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |
datasets
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type (e.g., huggingface, local) |
| name | String | | Display name |
| description | Text | | Long-form description |
| huggingface_name | String | | HuggingFace Hub dataset ID |
| location | Text | | Local filesystem path |
| features | Text | | JSON string of dataset feature names |
adapters
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Source type |
| name | String | | Display name |
| description | String | | Human-readable description |
| huggingface_name | String | | HuggingFace Hub adapter ID |
| model_id | String | FK -> models.id | Base model this adapter targets |
| location | Text | | Local filesystem path to adapter weights |
| fine_tuning_job_id | String | FK -> fine_tuning_jobs.id | Job that produced this adapter |
| prompt_id | String | FK -> prompts.id | Prompt template used during training |
| cml_registered_model_id | String | | CML Model Registry ID |
| mlflow_experiment_id | String | | Associated MLflow experiment |
| mlflow_run_id | String | | Associated MLflow run |
prompts
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Prompt type |
| name | String | | Display name |
| description | String | | Human-readable description |
| dataset_id | String | FK -> datasets.id | Dataset this prompt is designed for |
| prompt_template | String | | Full prompt format string |
| input_template | String | | Input portion template |
| completion_template | String | | Completion portion template |
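The template columns above map dataset features into training text. A minimal illustration with Python format-string placeholders follows; the Alpaca-style field names and the {feature} placeholder syntax are assumptions for illustration, not the Studio's shipped defaults:

```python
# Hypothetical Alpaca-style templates; placeholders match dataset feature names.
prompt_template = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
completion_template = "{output}"

row = {"instruction": "Translate to French", "input": "hello", "output": "bonjour"}

# Render the prompt and completion halves against one dataset row.
training_text = prompt_template.format(**row) + completion_template.format(**row)
```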
fine_tuning_jobs
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| base_model_id | String | FK -> models.id | Base model to fine-tune |
| dataset_id | String | FK -> datasets.id | Training dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| cml_job_id | String | | CML Job ID for tracking |
| adapter_id | String | FK -> adapters.id | Resulting adapter |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| num_epochs | Integer | | Training epochs |
| learning_rate | Double | | Learning rate |
| out_dir | String | | Output directory for adapter weights |
| training_arguments_config_id | String | FK -> configs.id | Training arguments config |
| model_bnb_config_id | String | FK -> configs.id | Model BitsAndBytes quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BitsAndBytes quantization config |
| lora_config_id | String | FK -> configs.id | LoRA hyperparameters config |
| training_arguments_config | String | | Serialized training arguments (snapshot) |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| lora_config | String | | Serialized LoRA config (snapshot) |
| dataset_fraction | Double | | Fraction of dataset to use |
| train_test_split | Double | | Train/test split ratio |
| user_script | String | | Custom user training script path |
| user_config_id | String | FK -> configs.id | Custom user config |
| framework_type | String | | Training framework (legacy, axolotl, etc.) |
| axolotl_config_id | String | FK -> configs.id | Axolotl YAML config |
| gpu_label_id | Integer | | GPU label selector |
| adapter_name | String | | Name assigned to the output adapter |
The fine_tuning_jobs table stores both config ID references (foreign keys to configs) and serialized config snapshots (plain string columns). This allows job records to remain self-describing even if the referenced config is later deleted.
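This dual storage suggests a resolution order for harness builders: prefer the live configs row, fall back to the snapshot on the job record. A sketch of that logic follows; resolve_training_args and the dict shapes are illustrative, not Studio code:

```python
import json

def resolve_training_args(job_row: dict, configs: dict) -> dict:
    """Prefer the referenced configs row; fall back to the job's snapshot."""
    cfg = configs.get(job_row.get("training_arguments_config_id"))
    if cfg is not None:
        return json.loads(cfg["config"])
    # Config was deleted: the job record remains self-describing.
    return json.loads(job_row["training_arguments_config"])

job = {
    "training_arguments_config_id": "cfg-1",
    "training_arguments_config": '{"num_train_epochs": 3}',
}
```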
evaluation_jobs
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Evaluation type |
| cml_job_id | String | | CML Job ID for tracking |
| parent_job_id | String | | Parent fine-tuning job (if derived) |
| base_model_id | String | FK -> models.id | Model under evaluation |
| dataset_id | String | FK -> datasets.id | Evaluation dataset |
| prompt_id | String | FK -> prompts.id | Prompt template |
| num_workers | Integer | | Number of worker processes |
| adapter_id | String | FK -> adapters.id | Adapter under evaluation |
| num_cpu | Integer | | CPU allocation |
| num_gpu | Integer | | GPU allocation |
| num_memory | Integer | | Memory allocation (GB) |
| evaluation_dir | String | | Output directory for evaluation artifacts |
| model_bnb_config_id | String | FK -> configs.id | Model BnB quantization config |
| adapter_bnb_config_id | String | FK -> configs.id | Adapter BnB quantization config |
| generation_config_id | String | FK -> configs.id | Generation config for inference |
| model_bnb_config | String | | Serialized model BnB config (snapshot) |
| adapter_bnb_config | String | | Serialized adapter BnB config (snapshot) |
| generation_config | String | | Serialized generation config (snapshot) |
configs
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | String | PK, NOT NULL | UUID |
| type | String | | Config type (training_arguments, bnb, lora, generation, axolotl) |
| description | String | | Human-readable description |
| config | Text | | JSON or YAML content stored as string |
| model_family | String | | Model family this config targets |
| is_default | Integer | | 1 = shipped default, 0 = user-created |
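The Terminology section notes that configs are deduplicated by content. One way to implement such a check – purely illustrative here, since the Studio's actual comparison may be a plain string match – is to hash a canonicalized rendering of the JSON so key order and whitespace don't produce spurious "new" configs:

```python
import hashlib
import json

def canonical_config_key(config_text: str) -> str:
    """Hash a canonical JSON rendering of a config blob for dedup lookups."""
    canonical = json.dumps(json.loads(config_text), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```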
ORM Mix-ins
All ORM model classes inherit from three bases: Base (SQLAlchemy declarative base), MappedProtobuf, and MappedDict. These mix-ins provide bidirectional serialization.
MappedProtobuf
Converts between protobuf messages and ORM instances.
# Protobuf message -> ORM instance
adapter_orm = Adapter.from_message(adapter_proto_msg)
# ORM instance -> Protobuf message
adapter_proto = adapter_orm.to_protobuf(AdapterMetadata)
from_message() uses ListFields() (protobuf >= 3.15) to extract only fields that were explicitly set in the message, avoiding default-value contamination. to_protobuf() iterates the ORM instance’s non-null columns and sets matching fields on a new protobuf message.
MappedDict
Converts between Python dictionaries and ORM instances.
# Dict -> ORM instance
model_orm = Model.from_dict({"id": "abc", "name": "llama-2"})
# ORM instance -> Dict (non-null fields only)
model_dict = model_orm.to_dict()
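A minimal, self-contained sketch of the MappedDict contract described above – the real mix-in operates over SQLAlchemy columns, whereas this stand-in uses plain attributes purely to show the non-null filtering:

```python
class MappedDictSketch:
    """Illustrative stand-in for the MappedDict mix-in."""

    @classmethod
    def from_dict(cls, d: dict):
        obj = cls()
        for key, value in d.items():
            setattr(obj, key, value)
        return obj

    def to_dict(self) -> dict:
        # Emit only non-null fields, mirroring "non-null columns only".
        return {k: v for k, v in vars(self).items() if v is not None}

class Model(MappedDictSketch):
    pass
```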
Table-Model Registry
ft/db/model.py exports two lookup dictionaries for programmatic table access:
TABLE_TO_MODEL_REGISTRY = {
'datasets': Dataset,
'models': Model,
'prompts': Prompt,
'adapters': Adapter,
'fine_tuning_jobs': FineTuningJob,
'evaluation_jobs': EvaluationJob,
'configs': Config
}
MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}
These are used by the database import/export logic to iterate all application tables.
DAO
FineTuningStudioDao in ft/db/dao.py manages SQLAlchemy engine and session lifecycle.
Constructor
class FineTuningStudioDao:
def __init__(self, engine_url=None, echo=False, engine_args={}):
if engine_url is None:
engine_url = f"sqlite+pysqlite:///{get_sqlite_db_location()}"
self.engine = create_engine(engine_url, echo=echo, **engine_args)
self.Session = sessionmaker(bind=self.engine, autoflush=True, autocommit=False)
Base.metadata.create_all(self.engine)
The servicer instantiates the DAO with connection pool parameters:
| Parameter | Value | Description |
|---|---|---|
| pool_size | 5 | Persistent connections in the pool |
| max_overflow | 10 | Additional connections beyond pool_size |
| pool_timeout | 30 | Seconds to wait for a connection |
| pool_recycle | 1800 | Seconds before a connection is recycled |
Tables are auto-created on first initialization via Base.metadata.create_all(engine).
Session Context Manager
All domain functions access the database through dao.get_session():
@contextmanager
def get_session(self):
session = self.Session()
try:
yield session
session.commit()
except Exception as e:
session.rollback()
raise e
finally:
session.close()
Usage in domain code:
def list_datasets(request, cml, dao):
with dao.get_session() as session:
datasets = session.query(Dataset).all()
# ... convert and return
The context manager guarantees: commit on success, rollback on exception, close in all cases.
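The same guarantees can be demonstrated with a stdlib-only analogue – this sketch uses sqlite3 connections rather than SQLAlchemy sessions, but the commit-on-success, rollback-on-exception, close-always contract is identical:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def sqlite_session(db_path):
    """Commit on success, roll back on exception, close in all cases."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

A failed block leaves the database untouched; a successful block is durably committed.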
Database Export and Import
ft/db/db_import_export.py provides DatabaseJsonConverter for full database serialization.
Export
export_to_json(output_path=None) iterates all non-system tables (excluding sqlite_* internal tables), captures the CREATE TABLE schema and all row data, and returns a JSON string:
{
"models": {
"schema": "CREATE TABLE IF NOT EXISTS models (...)",
"data": [
{"id": "abc-123", "name": "llama-2", "type": "huggingface", ...}
]
},
"datasets": { ... },
...
}
If output_path is provided, the JSON is also written to that file.
Import
import_from_json(json_path) reads a JSON file in the export format, executes each table’s CREATE TABLE IF NOT EXISTS statement, and inserts all rows. Rows that fail to insert (e.g., due to duplicate primary keys) are logged but do not abort the import.
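The import contract – create tables if missing, insert row-by-row, log rather than abort on per-row failures – can be sketched with stdlib sqlite3. import_rows_sketch is illustrative; the real converter is DatabaseJsonConverter:

```python
import json
import logging
import sqlite3

def import_rows_sketch(conn, export_json: str):
    """Replay an export payload, skipping rows that violate constraints."""
    for table, body in json.loads(export_json).items():
        conn.execute(body["schema"])  # CREATE TABLE IF NOT EXISTS ...
        for row in body["data"]:
            cols = ", ".join(row)
            marks = ", ".join("?" for _ in row)
            try:
                conn.execute(
                    f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                    list(row.values()),
                )
            except sqlite3.IntegrityError as exc:
                # Duplicate primary key etc.: log and continue.
                logging.warning("Skipping row in %s: %s", table, exc)
```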
Alembic Migrations
Schema migrations are managed by Alembic. Configuration is at alembic.ini with migration scripts in db_migrations/. When adding or modifying columns, generate a new migration with:
alembic revision --autogenerate -m "description of change"
alembic upgrade head
The DAO’s create_all() call handles initial table creation, but column additions and type changes on existing databases require Alembic migrations.
Cross-References
- System Overview – initialization sequence and environment variables
- gRPC Service Design – how domain functions receive the DAO
- Configuration Specification – config type taxonomy and validation
Streamlit Presentation Layer
The UI is a multi-page Streamlit application defined in main.py. It renders resource management forms, job dispatch controls, and evaluation dashboards. All data operations go through the gRPC client – the Streamlit layer has no direct database access.
Entry Point
main.py sets the page configuration and selects a navigation mode based on the IS_COMPOSABLE environment variable:
st.set_page_config(
page_title="Fine Tuning Studio",
page_icon=IconPaths.FineTuningStudio.FINE_TUNING_STUDIO,
layout="wide"
)
The layout is always "wide". The page icon is loaded from the resources/images/ directory via ft.consts.IconPaths.
Navigation Modes
Composable Mode
Activated when IS_COMPOSABLE is set to any non-empty value. Uses streamlit_navigation_bar (st_navbar) combined with custom HTML/CSS for dropdown menus. Navigation groups:
| Group | Pages |
|---|---|
| Home | Home |
| Database Import Export | Database Import and Export |
| Resources | Import Datasets, View Datasets, Import Base Models, View Base Models, Create Prompts, View Prompts |
| Experiment | Train a New Adapter, Monitor Training Jobs, Local Adapter Comparison, Run MLFlow Evaluation, View MLflow Runs |
| AI Workbench | Export And Deploy Model |
| Examples | Ticketing Agent App |
| Feedback | Provide Feedback |
The navbar is rendered as a fixed-position HTML <nav> element with CSS dropdown menus. Links use target="_self" to navigate within the Streamlit app. All pages are registered with st.navigation(position="hidden") so that Streamlit handles routing internally while the custom navbar provides the visible UI.
Standard Mode (Default)
When IS_COMPOSABLE is not set, the sidebar renders section headers and page links with Material Design icons:
with st.sidebar:
st.image("./resources/images/ft-logo.png")
st.markdown("Navigation")
st.page_link("pgs/home.py", label="Home", icon=":material/home:")
st.page_link("pgs/database.py", label="Database Import and Export", icon=":material/database:")
st.markdown("Resources")
st.page_link("pgs/datasets.py", label="Import Datasets", icon=":material/publish:")
st.page_link("pgs/view_datasets.py", label="View Datasets", icon=":material/data_object:")
# ... remaining pages
Sidebar sections: Navigation, Resources, Experiments, AI Workbench, Examples, Feedback. The sidebar footer displays the current project owner and a link to the CML domain.
Page Inventory
All page modules live in the pgs/ directory:
| File | Title | Section |
|---|---|---|
| pgs/home.py | Home | Navigation |
| pgs/database.py | Database Import and Export | Database |
| pgs/datasets.py | Import Datasets | Resources |
| pgs/view_datasets.py | View Datasets | Resources |
| pgs/models.py | Import Base Models | Resources |
| pgs/view_models.py | View Base Models | Resources |
| pgs/prompts.py | Create Prompts | Resources |
| pgs/view_prompts.py | View Prompts | Resources |
| pgs/train_adapter.py | Train a New Adapter | Experiments |
| pgs/jobs.py | Training Job Tracking | Experiments |
| pgs/evaluate.py | Local Adapter Comparison | Experiments |
| pgs/mlflow.py | Run MLFlow Evaluation | Experiments |
| pgs/mlflow_jobs.py | View MLflow Runs | Experiments |
| pgs/export.py | Export And Deploy Model | AI Workbench |
| pgs/sample_ticketing_agent_app_embed.py | Sample Ticketing Agent App | Examples |
| pgs/feedback.py | Feedback | Feedback |
Client Caching
Shared client instances are cached at the Streamlit server level using @st.cache_resource. This avoids creating a new gRPC channel or CML API client on every page render. Both helpers are defined in pgs/streamlit_utils.py:
@st.cache_resource
def get_fine_tuning_studio_client() -> FineTuningStudioClient:
client = FineTuningStudioClient()
return client
@st.cache_resource
def get_cml_client() -> CMLServiceApi:
client = default_client()
return client
@st.cache_resource ensures a single instance per Streamlit server process. The gRPC client connects to the address specified by FINE_TUNING_SERVICE_IP and FINE_TUNING_SERVICE_PORT environment variables. The CML client uses cmlapi.default_client(), which reads CML connection parameters from the pod environment.
Data Flow
Every user interaction follows this path: Streamlit widget event triggers a page callback, the page calls the cached client, the client sends a gRPC request, the server delegates to domain logic, and the domain function uses the DAO to read or write SQLite.
How to Add a New Page
- Create the page module at pgs/my_page.py:
import streamlit as st
from pgs.streamlit_utils import get_fine_tuning_studio_client
st.header("My New Page")
client = get_fine_tuning_studio_client()
# Use the client to interact with the gRPC service
models = client.get_models()
for model in models:
st.write(model.name)
- Register the page in both navigation modes in main.py:
In the composable mode setup_navigation() function, add:
st.Page("pgs/my_page.py", title="My New Page"),
In the composable mode HTML navbar, add a link in the appropriate dropdown:
<a href="/my_page" target="_self"><span class="material-icons">icon_name</span> My New Page</a>
In the standard mode setup_navigation_sidebar() function, add:
st.Page("pgs/my_page.py", title="My New Page"),
In the standard mode setup_sidebar() function, add under the appropriate section:
st.page_link("pgs/my_page.py", label="My New Page", icon=":material/icon_name:")
- If the page requires a new RPC, add it to the protobuf definition, regenerate, implement the servicer method, and add the domain function. See gRPC Service Design.
Custom CSS
Both navigation modes inject custom CSS to control typography and layout:
- Heading sizes (h3 reduced to 1.1rem)
- Tab label font sizes (0.9rem)
- Sidebar theming (dark background #16262c, white text) in standard mode
- Navbar positioning and dropdown behavior in composable mode
CSS is injected via st.markdown(css, unsafe_allow_html=True).
Cross-References
- System Overview – startup sequence and environment variables
- gRPC Service Design – client wrapper and API surface
- Data Tier – database schema backing the resources displayed in the UI
Resource Concepts
Fine Tuning Studio manages seven resource types. All use UUID string primary keys generated via uuid4(). Resources are metadata entries stored in SQLite – the actual artifacts (model weights, dataset files, adapter checkpoints) live on the filesystem, HuggingFace Hub, or the CML Model Registry.
Resource Types
| Resource | Table | Purpose |
|---|---|---|
| Dataset | datasets | Reference to a HuggingFace Hub dataset or local file (CSV, JSON, JSONL) |
| Model | models | Base foundation model from HuggingFace Hub or CML Model Registry |
| Adapter | adapters | PEFT LoRA adapter – produced by training, imported from disk, or fetched from Hub |
| Prompt | prompts | Format-string template mapping dataset features into training input |
| Config | configs | Named configuration blob (training args, BnB, LoRA, generation, Axolotl YAML) |
| FineTuningJob | fine_tuning_jobs | CML Job that trains a PEFT adapter |
| EvaluationJob | evaluation_jobs | CML Job that runs MLflow evaluation against model+adapter combinations |
Entity Relationships
Type Enums
All type enums are defined in ft/api/types.py as str, Enum subclasses.
| Enum | Values |
|---|---|
| DatasetType | huggingface, project, project_csv, project_json, project_jsonl |
| ModelType | huggingface, project, model_registry |
| AdapterType | project, huggingface, model_registry |
| PromptType | in_place |
| ConfigType | training_arguments, bitsandbytes_config, generation_config, lora_config, custom, axolotl, axolotl_dataset_formats |
| FineTuningFrameworkType | legacy, axolotl |
| ModelExportType | model_registry, cml_model |
| EvaluationJobType | mlflow |
| ModelFrameworkType | pytorch, tensorflow, onnx |
ORM Layer
All ORM models inherit from the shared declarative base (created via sqlalchemy.orm.declarative_base()) plus two mixins defined in ft/db/model.py:
MappedProtobuf – bidirectional protobuf conversion:
- from_message(message) – class method. Extracts set fields from a protobuf message via ListFields() and passes them as kwargs to the ORM constructor.
- to_protobuf(protobuf_cls) – instance method. Converts non-null ORM columns into a protobuf message by matching field names.
MappedDict – bidirectional dict conversion:
- from_dict(d) – class method. Constructs an ORM instance from a plain dictionary.
- to_dict() – instance method. Returns a dictionary of all non-null column values via SQLAlchemy inspect().
The serialization chain for any resource:
Protobuf message <--> ORM model <--> Python dict
from_message() / to_protobuf() from_dict() / to_dict()
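The dict side of this chain can be sketched without SQLAlchemy. The class below mirrors the non-null filtering contract of MappedDict but walks the instance __dict__ instead of mapped columns; the names are illustrative, not the real mixin:

```python
class MappedDictSketch:
    """Illustrative stand-in for the MappedDict mixin.

    The real mixin enumerates SQLAlchemy columns via inspect(); this sketch
    walks the instance __dict__ but keeps the same non-null filtering rule.
    """

    @classmethod
    def from_dict(cls, d):
        obj = cls()
        for key, value in d.items():
            setattr(obj, key, value)
        return obj

    def to_dict(self):
        # Only non-null values survive the round trip, mirroring the ORM mixin.
        return {k: v for k, v in vars(self).items() if v is not None}


class Dataset(MappedDictSketch):
    pass
```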
Table Registry
ft/db/model.py maintains two registries used by the database import/export subsystem:
TABLE_TO_MODEL_REGISTRY = {
'datasets': Dataset,
'models': Model,
'prompts': Prompt,
'adapters': Adapter,
'fine_tuning_jobs': FineTuningJob,
'evaluation_jobs': EvaluationJob,
'configs': Config,
}
MODEL_TO_TABLE_REGISTRY = {v: k for k, v in TABLE_TO_MODEL_REGISTRY.items()}
Any new resource type must be added to TABLE_TO_MODEL_REGISTRY for database import/export to function correctly.
Dataset Specification
A Dataset resource is a metadata reference to a data source. The actual data lives on HuggingFace Hub or the local filesystem. On import, Studio extracts feature column names and stores them as a JSON string, enabling downstream prompt template construction without reloading the data.
Source: ft/datasets.py, ft/db/model.py
Supported Types
| Type | Source | Identifier Field | Feature Extraction Method |
|---|---|---|---|
| huggingface | HuggingFace Hub | huggingface_name | load_dataset_builder() -> info.features.keys() |
| project | Local HF-compatible directory | location | Not extracted |
| project_csv | Local CSV file | location | Read header row via csv.reader |
| project_json | Local JSON file | location | Read first object keys via json.load |
| project_jsonl | Local JSONL file | location | Read first line keys via json.loads |
ORM Schema
class Dataset(Base, MappedProtobuf, MappedDict):
__tablename__ = "datasets"
id = Column(String, primary_key=True) # UUID
type = Column(String) # DatasetType enum value
name = Column(String) # Display name
description = Column(Text) # Auto-populated for HF datasets
huggingface_name = Column(String) # HF Hub identifier (HF type only)
location = Column(Text) # Filesystem path (project types only)
features = Column(Text) # JSON-serialized list of column names
Import Validation
add_dataset() dispatches to type-specific validators before creating a record:
All types:
- type field is required.
- Duplicate detection by name (local types) or huggingface_name (HF type).
HuggingFace (_validate_huggingface_dataset_request):
- huggingface_name field required and non-blank.
- Validates dataset exists on Hub via load_dataset_builder().
- Extracts dataset_info.features.keys() for the feature list.
- Stores dataset_info.description as the description.
CSV (_validate_local_csv_dataset_request):
- location field required, must end with .csv.
- name field required and non-blank.
- Reads header row with csv.reader(file) / next(reader) for features.
JSON (_validate_local_json_dataset_request):
- location field required, must end with .json.
- Reads first object in the JSON array for feature keys.
JSONL (_validate_local_jsonl_dataset_request):
- location field required, must end with .jsonl.
- Reads first line, parses as JSON, extracts keys for features.
Feature Extraction Functions
extract_features_from_csv(location) # csv.reader -> next(reader)
extract_features_from_json(location) # json.load -> next(iter(data)).keys()
extract_features_from_jsonl(location) # json.loads(first_line).keys()
Features are stored as json.dumps(features) in the features column. Downstream consumers (prompt templates, training scripts) parse this back with json.loads().
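The three extractors reduce to a few lines of stdlib code. A sketch approximating the functions above (the real implementations in ft/datasets.py may differ in error handling):

```python
import csv
import json

def extract_features_from_csv(location):
    # The CSV header row becomes the feature list.
    with open(location, newline="") as f:
        return next(csv.reader(f))

def extract_features_from_json(location):
    # Keys of the first object in the top-level JSON array.
    with open(location) as f:
        data = json.load(f)
    return list(next(iter(data)).keys())

def extract_features_from_jsonl(location):
    # Keys of the first JSON object, i.e. the first line of the file.
    with open(location) as f:
        return list(json.loads(f.readline()).keys())
```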
Loading into Memory
load_dataset_into_memory(dataset: DatasetMetadata) normalizes all dataset types into a HuggingFace DatasetDict with at minimum a train key:
| Type | Load Method | Wrapping |
|---|---|---|
| huggingface | datasets.load_dataset(huggingface_name) | Already a DatasetDict |
| project_csv | datasets.load_dataset('csv', data_files=location) | Already a DatasetDict |
| project_json | datasets.Dataset.from_json(location) | Wrapped in DatasetDict({'train': ds}) |
| project_jsonl | datasets.Dataset.from_json(location) | Wrapped in DatasetDict({'train': ds}) |
If the loaded object is a Dataset (not DatasetDict), it is wrapped: DatasetDict({'train': ds}).
Removal
remove_dataset() deletes the Dataset record. If request.remove_prompts is set, also deletes all Prompt records with matching dataset_id via cascading delete.
Protobuf Message
DatasetMetadata fields: id, type, name, description, huggingface_name, location, features.
Model Specification
A Model resource represents a base foundation model registered in the Studio’s metadata store. Models serve as the starting point for fine-tuning and evaluation. The actual model weights are never stored by Studio – they are downloaded at training time from HuggingFace Hub or resolved from the CML Model Registry.
Source: ft/models.py, ft/db/model.py, ft/config/model_configs/config_loader.py
Supported Types
| Type | Source | Required Fields | Validation |
|---|---|---|---|
| huggingface | HuggingFace Hub | huggingface_model_name | HfApi().model_info() must succeed |
| model_registry | CML Model Registry | model_registry_id (request) | Fetches RegisteredModel via cmlapi |
| project | Local directory | location | Not yet fully implemented |
ORM Schema
class Model(Base, MappedProtobuf, MappedDict):
__tablename__ = "models"
id = Column(String, primary_key=True) # UUID
type = Column(String) # ModelType enum value
framework = Column(String) # ModelFrameworkType (pytorch, tensorflow, onnx)
name = Column(String) # Display name
description = Column(String)
huggingface_model_name = Column(String) # HF Hub model identifier
location = Column(String) # Local path (project type)
cml_registered_model_id = Column(String) # CML Registry model ID
mlflow_experiment_id = Column(String) # MLflow experiment (registry type)
mlflow_run_id = Column(String) # MLflow run (registry type)
Import Flow
add_model() validates and creates a Model record based on type:
HuggingFace:
- Validate huggingface_name is non-empty and not already registered (duplicate check by huggingface_model_name).
- Call HfApi().model_info(name) to confirm the model exists on Hub.
- Create the Model with type=HUGGINGFACE, name and huggingface_model_name set to the stripped input.
Model Registry:
- model_registry_id must be provided on the request.
- Fetch the RegisteredModel via cml.get_registered_model(id).
- Extract the first version’s metadata: registered_model.model_versions[0].model_version_metadata.mlflow_metadata.
- Create the Model with type=MODEL_REGISTRY, name from registered_model.name, and populate cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.
Model Family Detection
ft/config/model_configs/config_loader.py provides ModelMetadataFinder:
from transformers import AutoConfig

class ModelMetadataFinder:
def __init__(self, model_name_or_path):
self.model_name_or_path = model_name_or_path
def fetch_model_family_from_config(self):
config = AutoConfig.from_pretrained(self.model_name_or_path)
return config.architectures[0] # e.g., "LlamaForCausalLM"
This is used in two places:
- Config filtering: list_configs() filters default configs to those matching the model’s architecture family.
- Config creation: add_config() with a description field uses transform_name_to_family() to resolve the model family for deduplication scoping.
Additional static methods:
- fetch_bos_token_id_from_config(model_name_or_path) – returns config.bos_token_id (default: 1).
- fetch_eos_token_id_from_config(model_name_or_path) – returns config.eos_token_id (default: 2).
Export Routes
export_model() dispatches based on ModelExportType:
| Export Type | Handler | Target |
|---|---|---|
| model_registry | export_model_registry_model() | MLflow model registry |
| cml_model | deploy_cml_model() | CML Model endpoint |
Both handlers are defined in ft/export.py.
Protobuf Message
ModelMetadata fields: id, type, framework, name, huggingface_model_name, location, cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.
Adapter Specification
An Adapter resource represents a PEFT LoRA adapter. Adapters are produced by fine-tuning jobs, imported from a local directory, or fetched from HuggingFace Hub. Each adapter is linked to a base model and optionally to the fine-tuning job and prompt template that produced it.
Source: ft/adapters.py, ft/db/model.py
Supported Types
| Type | Source | Required Fields |
|---|---|---|
| project | Local directory | location (must exist as a directory) |
| huggingface | HuggingFace Hub | huggingface_name |
| model_registry | CML Model Registry | cml_registered_model_id |
ORM Schema
class Adapter(Base, MappedProtobuf, MappedDict):
__tablename__ = "adapters"
id = Column(String, primary_key=True) # UUID
type = Column(String) # AdapterType enum value
name = Column(String) # Display name (unique)
description = Column(String)
huggingface_name = Column(String) # HF Hub adapter identifier
model_id = Column(String, ForeignKey('models.id')) # Base model FK
location = Column(Text) # Local path to adapter dir
fine_tuning_job_id = Column(String, ForeignKey('fine_tuning_jobs.id')) # Producing job FK
prompt_id = Column(String, ForeignKey('prompts.id')) # Training prompt FK
cml_registered_model_id = Column(String) # CML Registry model ID
mlflow_experiment_id = Column(String) # MLflow experiment
mlflow_run_id = Column(String) # MLflow run
Key Relationships
| FK Column | Target | Required |
|---|---|---|
| model_id | models.id | Yes – the base model this adapter applies to |
| fine_tuning_job_id | fine_tuning_jobs.id | No – only set for Studio-trained adapters |
| prompt_id | prompts.id | No – only set for Studio-trained adapters |
Import Validation
_validate_add_adapter_request() enforces:
- Required fields: name, model_id, and location must all be present and non-blank.
- Directory existence: os.path.isdir(request.location) must return True.
- Model FK: model_id must reference an existing Model record.
- Unique name: No existing adapter may share the same name.
- Optional FK checks: If fine_tuning_job_id is provided, it must exist in fine_tuning_jobs. If prompt_id is provided, it must exist in prompts.
Adapter Creation
add_adapter() validates the request, then creates an Adapter record with all provided fields mapped directly from the request.
Dataset Split Tracking
get_dataset_split_by_adapter() retrieves the dataset fraction and train/test split used during training for a given adapter:
- Joins FineTuningJob to Adapter on adapter_name.
- If a matching job is found, returns its dataset_fraction and train_test_split.
- If no matching job exists (imported adapter), returns defaults:
| Parameter | Default | Source |
|---|---|---|
| dataset_fraction | 1.0 | TRAINING_DEFAULT_DATASET_FRACTION |
| train_test_split | 0.9 | TRAINING_DEFAULT_TRAIN_TEST_SPLIT |
These defaults are defined in ft/consts.py.
Protobuf Message
AdapterMetadata fields: id, type, name, description, huggingface_name, model_id, location, fine_tuning_job_id, prompt_id, cml_registered_model_id, mlflow_experiment_id, mlflow_run_id.
Prompt Template Specification
A Prompt resource defines a format-string template that maps dataset feature columns into structured training input. Prompts bind a dataset’s column names to positional slots in the training text, controlling how raw data is presented to the model during fine-tuning and evaluation.
Source: ft/prompts.py, ft/utils.py, ft/jobs.py, ft/db/model.py
Template Fields
| Field | Purpose | Example |
|---|---|---|
| prompt_template | Full prompt format string used during training | "Instruction: {instruction}\nInput: {input}\nOutput: {output}" |
| input_template | Input portion (informational, used in evaluation) | "Instruction: {instruction}\nInput: {input}" |
| completion_template | Expected output portion (informational, used in evaluation) | "Output: {output}" |
Placeholders use Python format-string syntax: {feature_name}. Each placeholder must correspond to a column name in the linked dataset’s features JSON array.
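A sketch of how such a template can be checked against the stored features and applied to a row. render_prompt is a hypothetical helper for illustration, not a Studio function:

```python
import json
import string

def render_prompt(prompt_template: str, features_json: str, row: dict) -> str:
    """Format one dataset row with a prompt template.

    Placeholders are validated against the dataset's stored features column
    (a JSON-serialized list) before formatting, so a template referencing a
    missing column fails fast.
    """
    features = set(json.loads(features_json))
    placeholders = {
        field for _, field, _, _ in string.Formatter().parse(prompt_template)
        if field
    }
    unknown = placeholders - features
    if unknown:
        raise ValueError(f"template references unknown columns: {sorted(unknown)}")
    return prompt_template.format(**row)
```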
ORM Schema
class Prompt(Base, MappedProtobuf, MappedDict):
__tablename__ = "prompts"
id = Column(String, primary_key=True) # UUID
type = Column(String) # PromptType enum value
name = Column(String) # Display name (unique)
description = Column(String)
dataset_id = Column(String, ForeignKey('datasets.id')) # Linked dataset FK
prompt_template = Column(String) # Full template
input_template = Column(String) # Input portion
completion_template = Column(String) # Output portion
Import Validation
_validate_add_prompt_request() enforces:
- Required fields: id, name, dataset_id, prompt_template, input_template, completion_template must all be present on the PromptMetadata message.
- Non-blank name: name.strip() must be non-empty.
- Unique name: No existing prompt may share the same name.
The prompt is created via Prompt.from_message(request.prompt), which uses the MappedProtobuf.from_message() method to map protobuf fields directly to ORM columns.
Auto-Generation from Dataset Columns
ft/utils.py::generate_templates(columns) produces default templates from a list of dataset column names:
- Output column detection: Compares column names against a ranked list of 500 common output column names (e.g., answer, response, output, label, target). The column matching the highest-ranked name becomes the output column. If no match, the last column is used.
- Input columns: All columns except the identified output column.
- Prompt template: Generated as:

  You are an LLM responsible for generating a response. Please provide a response given the user input below. <Column1>: {column1} <Column2>: {column2} <Output>:

- Completion template: {output_column}\n

Returns (prompt_template, completion_template).
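The detection logic can be sketched as follows. The ranked list here is a short illustrative subset of the real one, and the exact template wording may differ from what ft/utils.py emits:

```python
def generate_templates(columns):
    """Sketch of default-template generation from dataset column names.

    The real implementation ranks candidates against a much longer list of
    common output column names; this uses a short illustrative subset.
    """
    ranked_output_names = ["answer", "response", "output", "label", "target"]
    output_column = next(
        (name for name in ranked_output_names if name in columns),
        columns[-1],  # fall back to the last column when nothing matches
    )
    input_columns = [c for c in columns if c != output_column]
    prompt_template = (
        "You are an LLM responsible for generating a response. "
        "Please provide a response given the user input below.\n"
        + "".join(f"<{c}>: {{{c}}}\n" for c in input_columns)
        + f"<{output_column}>:"
    )
    completion_template = f"{{{output_column}}}\n"
    return prompt_template, completion_template
```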
Axolotl Auto-Prompt
ft/jobs.py::_add_prompt_for_dataset() generates a prompt automatically when using the Axolotl framework and no prompt is provided:
- Load the Axolotl config from the database by axolotl_config_id.
- Parse the YAML config and extract the dataset type from config['datasets'][0]['type'].
- Query the Config table for a matching axolotl_dataset_formats config by description == dataset_type.
- Parse the format config JSON to extract feature column names.
- Build a template: <Feature>: {feature}\n for each feature.
- Check for an existing prompt with the same dataset_id and prompt_template. If found, return its ID.
- Otherwise, create a new Prompt named "AXOLOTL_AUTOGENERATED : {dataset_type}_{dataset_name}".
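The steps above can be condensed into a sketch. Here existing_prompts stands in for a SQLite query already filtered to the same dataset_id, and the return shape is illustrative rather than the real function signature:

```python
import json

def auto_prompt_for_dataset(format_config_json, dataset_type, dataset_name, existing_prompts):
    """Sketch of the Axolotl auto-prompt flow.

    existing_prompts: list of dicts standing in for Prompt rows, assumed
    already filtered to the target dataset_id. Returns
    (prompt_name, template, reused_id_or_None).
    """
    # Feature names come from the keys of the dataset format config JSON.
    features = list(json.loads(format_config_json).keys())
    template = "".join(f"<{f}>: {{{f}}}\n" for f in features)
    # Reuse an existing prompt with an identical template.
    for p in existing_prompts:
        if p["prompt_template"] == template:
            return p["name"], template, p["id"]
    name = f"AXOLOTL_AUTOGENERATED : {dataset_type}_{dataset_name}"
    return name, template, None
```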
Removal
remove_prompt() deletes the Prompt record by ID. Note that prompts are also cascade-deleted when their parent dataset is removed with remove_prompts=True.
Protobuf Message
PromptMetadata fields: id, type, name, description, dataset_id, prompt_template, input_template, completion_template.
Configuration Specification
A Config resource stores a named configuration blob – JSON or YAML – that parameterizes training, quantization, inference, or the Axolotl framework. Configs are content-deduplicated: adding a config with identical content and type to an existing one returns the existing config’s ID rather than creating a duplicate.
Source: ft/configs.py, ft/consts.py, ft/db/model.py
Config Types
| Type | Format | Purpose | Default Provided |
|---|---|---|---|
| training_arguments | JSON | Training hyperparameters (epochs, optimizer, batch size, learning rate) | Yes |
| bitsandbytes_config | JSON | 4-bit quantization settings | Yes |
| lora_config | JSON | LoRA hyperparameters | Yes |
| generation_config | JSON | Inference generation settings | Yes |
| custom | JSON | User-defined configuration blob | No |
| axolotl | YAML | Axolotl training configuration file | Template provided |
| axolotl_dataset_formats | JSON | Axolotl dataset format schemas | Yes (multiple) |
ORM Schema
class Config(Base, MappedProtobuf, MappedDict):
__tablename__ = "configs"
id = Column(String, primary_key=True) # UUID
type = Column(String) # ConfigType enum value
description = Column(String) # Model name (for family resolution) or format name
config = Column(Text) # Serialized JSON or YAML string
model_family = Column(String) # Architecture family (e.g., "LlamaForCausalLM")
is_default = Column(Integer, default=1) # 1 = system/default, 0 = user-created
is_default Semantics
| Value | Constant | Meaning |
|---|---|---|
| 1 | DEFAULT_CONFIGS | System-provided default configuration |
| 0 | USER_CONFIGS | User-created configuration |
User-created configs always have is_default=0. The add_config() function sets this automatically.
Default Config Values
Defined in ft/consts.py:
DEFAULT_TRAINING_ARGUMENTS
{
"num_train_epochs": 1,
"optim": "paged_adamw_32bit",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 4,
"warmup_ratio": 0.03,
"max_grad_norm": 0.3,
"learning_rate": 0.0002,
"fp16": true,
"logging_steps": 1,
"lr_scheduler_type": "constant",
"disable_tqdm": true,
"report_to": "mlflow",
"ddp_find_unused_parameters": false
}
DEFAULT_BNB_CONFIG
{
"load_in_4bit": true,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_compute_dtype": "float16",
"bnb_4bit_use_double_quant": true,
"quant_method": "bitsandbytes"
}
DEFAULT_LORA_CONFIG
{
"r": 16,
"lora_alpha": 32,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM"
}
DEFAULT_GENERATIONAL_CONFIG
{
"do_sample": true,
"temperature": 0.8,
"max_new_tokens": 60,
"top_p": 1,
"top_k": 50,
"num_beams": 1,
"repetition_penalty": 1.1,
"max_length": null
}
Config Deduplication
add_config() implements content-addressed caching:
- Parse the incoming config string: yaml.safe_load() for the axolotl type, json.loads() for all others.
- Re-serialize to a canonical form (yaml.dump() or json.dumps()).
- Query existing configs of the same type (and same model_family if description is provided).
- Compare parsed content of each existing config against the parsed request content.
- If an identical config exists, return it. At most one duplicate is expected (asserted).
- If no match, create a new Config with is_default=USER_CONFIGS (0).
When description is provided, it is interpreted as a model name: transform_name_to_family(description) resolves the HuggingFace architecture (e.g., "LlamaForCausalLM") and scopes the deduplication query to that family.
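A minimal sketch of the deduplication contract, using an in-memory dict in place of the configs table. add_config here is a stand-in; the real code generates IDs with uuid4() and swaps json.loads for yaml.safe_load on Axolotl configs:

```python
import json

def add_config(config_str, config_type, existing, parse=json.loads):
    """Content-addressed config caching sketch.

    existing maps config id -> (type, raw string). Returns (id, created),
    where created is False when an identical config already exists.
    """
    parsed = parse(config_str)
    matches = [
        cid for cid, (ctype, raw) in existing.items()
        if ctype == config_type and parse(raw) == parsed
    ]
    assert len(matches) <= 1, "at most one duplicate expected"
    if matches:
        return matches[0], False
    new_id = f"cfg-{len(existing) + 1}"  # real code uses uuid4()
    existing[new_id] = (config_type, json.dumps(parsed))
    return new_id, True
```

Because comparison happens on parsed content, formatting differences such as key order or whitespace do not defeat deduplication.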
Model-Family-Specific Filtering
list_configs() applies model-aware filtering when model_id is present in the request:
- Optionally filter by type if specified.
- If model_id is provided, call get_configs_for_model_id():
  - Fetch the Model record and resolve huggingface_model_name.
  - Instantiate ModelMetadataFinder(model_hf_name) and call fetch_model_family_from_config().
  - Filter configs where model_family matches and is_default == 1.
  - If no model-specific defaults exist, fall back to returning all configs.
- User configs (is_default=0) are not filtered by model family in get_configs_for_model_id() – they are returned when no model-specific defaults are found (fallback behavior).
Axolotl Config Template
The Axolotl config template is loaded from ft/config/axolotl/training_config/template.yaml via get_axolotl_training_config_template_yaml_str(). Axolotl dataset format configs are stored in ft/config/axolotl/dataset_formats/.
Protobuf Message
ConfigMetadata fields: id, type, description, config (serialized JSON/YAML string), model_family, is_default.
Fine-Tuning Job Lifecycle
A fine-tuning job trains a PEFT LoRA adapter on a base model using a configured dataset and prompt template. Jobs are dispatched as CML Jobs via the cmlapi SDK. The entry point is ft/jobs.py::start_fine_tuning_job(), which validates the request, prepares the execution environment, and creates the CML workload.
Job Dispatch Flow
- Validate Request – _validate_fine_tuning_request() checks all fields against the rules below. Any violation raises a ValueError that propagates as a gRPC error.
- Create Job Directory – A UUID job_id is generated. The directory .app/job_runs/{job_id} is created to hold training artifacts.
- Find Template CML Job – The dispatcher locates the Accel_Finetuning_Base_Job template in the CML project. This template defines the runtime environment and script path.
- Build Argument List – All training parameters are serialized into a --key value string passed as JOB_ARGUMENTS.
- Create CML Job + JobRun – A CML Job and its first JobRun are created via cmlapi, with the specified CPU, GPU, and memory resources.
- Store Job Record – A FineTuningJob record is inserted into the fine_tuning_jobs table with all metadata for tracking.
Validation Rules
Validation is performed by ft/jobs.py::_validate_fine_tuning_request() before any side effects occur.
| Field | Rule | Error |
|---|---|---|
| framework_type | Must be legacy or axolotl | “framework_type must be either legacy or axolotl” |
| adapter_name | Alphanumeric + hyphens only (^[a-zA-Z0-9-]+$) | “adapter_name must be alphanumeric” |
| out_dir | Must exist as directory | “output_dir does not exist” |
| num_cpu | > 0 | “cpu must be greater than 0” |
| num_gpu | >= 0 | “gpu must be at least 0” |
| num_memory | > 0 | “memory must be greater than 0” |
| num_workers | > 0 | “num_workers must be greater than 0” |
| num_epochs | > 0 | “Number of epochs must be greater than 0” |
| learning_rate | > 0 | “Learning rate must be greater than 0” |
| dataset_fraction | (0, 1] | “dataset_fraction must be between 0 and 1” |
| train_test_split | (0, 1] | “train_test_split must be between 0 and 1” |
| axolotl_config_id | Required when framework=axolotl | “axolotl framework requires axolotl_config_id” |
| base_model_id | Must exist in DB | “Model not found” |
| dataset_id | Must exist in DB | “Dataset not found” |
| prompt_id | Must exist in DB (legacy only) | “Prompt not found” |
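A client-side pre-check can mirror a few of these rules before sending the request. This sketch covers a representative subset and uses a plain dict in place of the protobuf message; the real validator in ft/jobs.py checks every field:

```python
import re

def validate_fine_tuning_request(req: dict) -> None:
    """Raise ValueError on the first violated rule, like the server-side check."""
    if req.get("framework_type") not in ("legacy", "axolotl"):
        raise ValueError("framework_type must be either legacy or axolotl")
    # Alphanumeric + hyphens only, per the ^[a-zA-Z0-9-]+$ rule.
    if not re.fullmatch(r"[a-zA-Z0-9-]+", req.get("adapter_name", "")):
        raise ValueError("adapter_name must be alphanumeric")
    if not 0 < req.get("dataset_fraction", 1.0) <= 1:
        raise ValueError("dataset_fraction must be between 0 and 1")
    if req.get("framework_type") == "axolotl" and not req.get("axolotl_config_id"):
        raise ValueError("axolotl framework requires axolotl_config_id")
```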
Framework Types
Legacy
Uses HuggingFace Accelerate with the TRL SFTTrainer. The user provides each configuration component separately:
- prompt_id – A prompt template that maps dataset features to the training text format.
- LoRA config – PEFT LoRA hyperparameters (rank, alpha, dropout, target modules).
- BnB config – BitsAndBytes quantization settings (4-bit NF4 quantization).
- Training arguments – Standard HuggingFace TrainingArguments fields (epochs, learning rate, batch size, etc.).
For distributed training, worker resources are specified independently via dist_cpu, dist_gpu, and dist_mem fields.
Axolotl
Uses the Axolotl training framework. The user provides a single YAML configuration file (referenced by axolotl_config_id) that bundles all training parameters, LoRA settings, and dataset handling into one document. If no prompt_id is provided, the system auto-generates a prompt from the dataset format definition. See Axolotl Integration for details.
Resource Specification
Each job requires explicit compute resource allocation:
| Field | Description |
|---|---|
| num_cpu | CPU cores for the primary training worker |
| num_gpu | GPU count for the primary training worker |
| num_memory | Memory in GB for the primary training worker |
| num_workers | Number of training workers (Accelerate distributed training) |
For legacy distributed training, additional fields specify per-worker resources:
| Field | Description |
|---|---|
| dist_cpu | CPU cores per distributed worker |
| dist_gpu | GPU count per distributed worker |
| dist_mem | Memory in GB per distributed worker |
Argument List Schema
Arguments are passed as the JOB_ARGUMENTS environment variable to the CML Job. The value is a space-delimited string of --key value pairs.
Core arguments (always present):
| Key | Source |
|---|---|
| base_model_id | Request field |
| dataset_id | Request field |
| experimentid | Generated UUID (same as job_id) |
| out_dir | Request field |
| train_out_dir | Constructed path for training output |
| adapter_name | Request field |
| framework_type | Request field (legacy or axolotl) |
Optional arguments (included when non-empty):
| Key | Description |
|---|---|
| prompt_id | Prompt template ID (required for legacy, optional for axolotl) |
| bnb_config | BitsAndBytes config ID |
| lora_config | LoRA config ID |
| training_arguments_config | Training arguments config ID |
| hf_token | HuggingFace access token |
| axolotl_config_id | Axolotl YAML config ID |
| gpu_label_id | GPU label config ID |
Legacy distributed training arguments:
| Key | Description |
|---|---|
| dist_num | Number of distributed workers |
| dist_cpu | CPU per worker |
| dist_mem | Memory per worker |
| dist_gpu | GPU per worker |
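Assembly of the JOB_ARGUMENTS string can be sketched as follows. build_job_arguments is a hypothetical helper mirroring the include-when-non-empty rule for optional keys:

```python
def build_job_arguments(core: dict, optional: dict) -> str:
    """Serialize training parameters into a space-delimited --key value string.

    Optional arguments are included only when non-empty, matching the
    dispatcher's behavior. Keys follow the argument tables above.
    """
    args = dict(core)
    args.update({k: v for k, v in optional.items() if v})
    return " ".join(f"--{k} {v}" for k, v in args.items())
```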
Protobuf Messages
The job lifecycle uses three primary protobuf messages:
- StartFineTuningJobRequest – Contains all fields listed above. Sent by the client to initiate training.
- StartFineTuningJobResponse – Returns the created job metadata including the generated job_id and CML job identifiers.
- FineTuningJobMetadata – The full job record stored in the database and returned by GetFineTuningJob and ListFineTuningJobs RPCs.
See gRPC Service Design for the complete RPC catalog.
Training Script Architecture
The training script is the code that runs inside a CML Job after dispatch. It receives configuration via environment variables, loads and preprocesses data, trains a PEFT LoRA adapter, and saves the result.
Entry Point
ft/scripts/accel_fine_tune_base_script.py
The script is executed as a CML Job. Arguments are received via the JOB_ARGUMENTS environment variable as a space-delimited string with --key value pairs, parsed into an argparse namespace at startup.
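The startup parse can be sketched with argparse's parse_known_args, so flags the sketch does not declare are tolerated. Only a few representative flags appear here; the real script declares every key listed in the argument tables:

```python
import argparse
import os

def parse_job_arguments(env=os.environ):
    """Split JOB_ARGUMENTS and parse it into an argparse namespace."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model_id", required=True)
    parser.add_argument("--dataset_id", required=True)
    parser.add_argument("--adapter_name", required=True)
    parser.add_argument("--framework_type", default="legacy")
    # parse_known_args ignores flags this sketch does not declare.
    args, _unknown = parser.parse_known_args(env.get("JOB_ARGUMENTS", "").split())
    return args
```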
Execution Flow
- Parse JOB_ARGUMENTS – The JOB_ARGUMENTS environment variable is split and parsed via argparse into a namespace containing all training parameters.
- Load base model – The HuggingFace model is loaded with optional BitsAndBytesConfig for 4-bit NF4 quantization. The model ID is resolved from the Studio database using base_model_id.
- Configure tokenizer padding – The tokenizer is inspected for a suitable pad token. The function find_padding_token_candidate() searches the vocabulary for tokens containing “pad” or “reserved”.
- Apply PEFT LoRA adapter – A LoraConfig is constructed from the config blob stored in the database, and the model is wrapped with get_peft_model().
- Load and preprocess dataset:
  - load_dataset_into_memory() reads the dataset into a HuggingFace DatasetDict.
  - map_dataset_with_prompt_template() formats each row using the prompt template, appending the EOS token.
  - sample_and_split_dataset() downsamples by the configured fraction and splits into train/test sets (seed=42).
- Initialize SFTTrainer – A TRL SFTTrainer is created with the processed dataset, model, tokenizer, and training arguments.
- Train – trainer.train() executes the training loop.
- Save adapter weights – The trained LoRA adapter is saved to the output directory.
- Auto-register adapter – If auto_add_adapter=true, the adapter is registered in the Studio database automatically after training completes.
Dataset Preprocessing Chain
| Step | Function | Input | Output |
|---|---|---|---|
| Load | load_dataset_into_memory() | Dataset metadata (type, path, HF name) | HF DatasetDict |
| Format | map_dataset_with_prompt_template() | DatasetDict + prompt template | DatasetDict with prediction column |
| Sample/Split | sample_and_split_dataset() | DatasetDict + fraction + split ratio | Train/test DatasetDict |
The prediction column contains the fully formatted training text for each row – the prompt template applied to dataset features with the EOS token appended. This column name is defined by TRAINING_DATA_TEXT_FIELD.
Key Training Utilities
All utilities are defined in ft/training/utils.py.
get_model_parameters(model)
Returns a tuple of (total_params, trainable_params) for the model. Used for logging the parameter count before and after applying the LoRA adapter.
map_dataset_with_prompt_template(dataset, template)
Applies the prompt template to each row in the dataset. The template contains prompt_template, input_template, and completion_template fields that are formatted with the dataset’s feature columns. The EOS token is appended to the prediction field to signal sequence boundaries during training.
sample_and_split_dataset(ds, fraction, split)
Downsamples the dataset to the specified fraction (e.g., 0.5 = 50% of rows), then splits into train and test sets at the given ratio. Uses TRAINING_DATASET_SEED = 42 for reproducible splits across runs.
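A stdlib sketch of the downsample-then-split behavior with a fixed seed. The real function operates on a HuggingFace DatasetDict (via its train_test_split) rather than plain lists:

```python
import random

def sample_and_split(rows, fraction, split, seed=42):
    """Downsample to `fraction` of the rows, then split train/test at `split`.

    The fixed seed (TRAINING_DATASET_SEED in the real code) makes repeated
    runs produce identical splits.
    """
    rng = random.Random(seed)
    sampled = rng.sample(rows, int(len(rows) * fraction))
    cut = int(len(sampled) * split)
    return sampled[:cut], sampled[cut:]
```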
find_padding_token_candidate(tokenizer)
Searches the tokenizer vocabulary for tokens containing “pad” or “reserved” as substrings. Returns the first match found, or None if no candidate exists.
configure_tokenizer_padding(tokenizer, pad_token)
Sets the tokenizer’s padding token using a fallback chain:
- Use the tokenizer’s existing pad_token if already set.
- Use the provided pad_token argument if given.
- Use the tokenizer’s unk_token if available.
- Search for reserved token candidates via find_padding_token_candidate().
This ensures every tokenizer has a valid pad token regardless of the base model’s configuration.
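The chain can be sketched on plain values rather than a real tokenizer object. Function names mirror the utilities above, but the signatures are simplified for illustration:

```python
def find_padding_token_candidate(vocab):
    # First vocabulary token containing "pad" or "reserved" as a substring.
    for token in vocab:
        if "pad" in token or "reserved" in token:
            return token
    return None

def configure_padding(pad_token, unk_token, vocab, explicit=None):
    """Return the pad token the fallback chain would select."""
    if pad_token is not None:          # 1. tokenizer already has one
        return pad_token
    if explicit is not None:           # 2. caller-provided pad token
        return explicit
    if unk_token is not None:          # 3. fall back to unk_token
        return unk_token
    return find_padding_token_candidate(vocab)  # 4. search the vocabulary
```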
Training Constants
Defined in ft/consts.py:
| Constant | Value | Purpose |
|---|---|---|
| TRAINING_DATA_TEXT_FIELD | "prediction" | Column name for the formatted training text in the preprocessed dataset |
| TRAINING_DEFAULT_TRAIN_TEST_SPLIT | 0.9 | Default train/test split ratio (90% train, 10% test) |
| TRAINING_DEFAULT_DATASET_FRACTION | 1.0 | Default dataset fraction (use full dataset) |
| TRAINING_DATASET_SEED | 42 | Random seed for reproducible dataset splitting and sampling |
Relationship to Job Lifecycle
The training script is the execution payload created by the Fine-Tuning Job Lifecycle. The job dispatch process builds the JOB_ARGUMENTS string, creates the CML Job pointing to this script, and starts a JobRun. The script runs independently inside the CML workload – it reads its configuration from the environment, accesses the Studio database directly for resource metadata (model paths, dataset locations, config blobs), and writes adapter weights to the output directory.
Axolotl Integration
Axolotl is an alternative training framework supported as a first-class framework_type alongside the legacy HuggingFace Accelerate + TRL path. It replaces the separate LoRA, BitsAndBytes, and training argument configs with a single YAML configuration file that defines the entire training run.
Config Structure
Axolotl configurations are stored in the configs table with ConfigType.axolotl. A template YAML is provided at:
`ft/config/axolotl/training_config/template.yaml`
This template defines the baseline Axolotl training configuration. Users can create custom configs by modifying the template values. The YAML file specifies model loading, LoRA parameters, quantization, dataset handling, training hyperparameters, and output settings in a single document.
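For orientation, a trimmed, hypothetical fragment in the spirit of such a config (keys follow Axolotl's config schema; the actual baseline values live in the template file, not here):

```yaml
# Hypothetical excerpt of an Axolotl training config -- illustrative
# values only; see template.yaml for the Studio's actual baseline.
base_model: NousResearch/Llama-2-7b-hf   # model loading
load_in_4bit: true                       # quantization
adapter: lora                            # LoRA parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
datasets:                                # dataset handling
  - path: dataset.jsonl
    type: alpaca
num_epochs: 3                            # training hyperparameters
learning_rate: 0.0002
output_dir: ./outputs                    # output settings
```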
Dataset Format Configs
Dataset format definitions are stored as ConfigType.axolotl_dataset_formats in the configs table. The source files live in:
`ft/config/axolotl/dataset_formats/`
Each JSON file defines the expected column structure for a specific Axolotl dataset type (e.g., alpaca, completion, sharegpt). These files are loaded into the database during initialization by:
`ft/initialize_db.py::InitializeDB.initialize_axolotl_dataset_type_configs()`
Pydantic Models
The dataset format structure is defined by two Pydantic models in ft/api/types.py:
DatasetFormatInfo:
| Field | Type | Description |
|---|---|---|
| `name` | `str` | Human-readable name of the dataset format |
| `description` | `str` | The Axolotl dataset type identifier (e.g., `alpaca`, `completion`) |
| `format` | `Dict[str, Any]` | Map of feature column names to their expected types or descriptions |
DatasetFormatsCollection:
| Field | Type | Description |
|---|---|---|
| `dataset_formats` | `Dict[str, DatasetFormatInfo]` | Map of format names to their definitions |
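The two models might be sketched roughly as follows (field names taken from the tables above; any defaults or validators in `ft/api/types.py` may differ):

```python
# Sketch of the two Pydantic models from ft/api/types.py; the example
# "alpaca" entry below is illustrative, not copied from the source files.
from typing import Any, Dict
from pydantic import BaseModel

class DatasetFormatInfo(BaseModel):
    name: str               # human-readable format name
    description: str        # Axolotl dataset type identifier, e.g. "alpaca"
    format: Dict[str, Any]  # feature columns -> expected types/descriptions

class DatasetFormatsCollection(BaseModel):
    dataset_formats: Dict[str, DatasetFormatInfo]

collection = DatasetFormatsCollection(dataset_formats={
    "alpaca": DatasetFormatInfo(
        name="Alpaca",
        description="alpaca",
        format={"instruction": "str", "input": "str", "output": "str"},
    )
})
print(collection.dataset_formats["alpaca"].description)  # alpaca
```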
Auto-Prompt Generation
When a fine-tuning job uses the axolotl framework and no prompt_id is provided, the system automatically generates a prompt template from the dataset format definition. This is handled by ft/jobs.py::_add_prompt_for_dataset().
Generation steps:
- Load the Axolotl YAML config from the database using `axolotl_config_id`.
- Extract the `type` field from the dataset section of the YAML config. This identifies the expected dataset format (e.g., `alpaca`, `completion`).
- Query the database for a config of type `axolotl_dataset_formats` whose `description` field matches the extracted type.
- Parse the dataset format config to extract the feature column names from the `format` dictionary.
- Generate a default prompt template by concatenating `"Feature: {feature}\n"` for each feature column.
- Check whether an identical prompt already exists for this dataset to avoid duplicates.
- Create and return a new prompt record if no duplicate is found.
This mechanism ensures that Axolotl jobs always have a valid prompt template, even when the user does not explicitly create one.
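The template-building step can be sketched as follows, under one plausible reading of the quoted pattern where each feature name becomes a format-string placeholder (the real logic lives in `ft/jobs.py::_add_prompt_for_dataset()`; this helper name is hypothetical):

```python
def generate_default_prompt(format_columns: dict) -> str:
    """Sketch: build a prompt template from a dataset-format definition's
    feature columns, emitting one "Feature: {column}" line per column."""
    return "".join(f"Feature: {{{feature}}}\n" for feature in format_columns)

alpaca_format = {"instruction": "str", "input": "str", "output": "str"}
print(generate_default_prompt(alpaca_format))
```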
Legacy vs. Axolotl Comparison
| Aspect | Legacy | Axolotl |
|---|---|---|
| Config format | Separate JSON blobs (LoRA, BnB, training args) | Single YAML file |
| Prompt handling | User must create and select a prompt template | Auto-generated from dataset format if not provided |
| Required configs | prompt_id + lora_config + bnb_config + training_arguments_config | axolotl_config_id only |
| Training engine | HuggingFace Accelerate + TRL SFTTrainer | Axolotl framework |
| Distributed training | Supported via dist_* fields | Managed by Axolotl config |
| Validation | prompt_id required | axolotl_config_id required; prompt_id optional |
Workflow
To use Axolotl for fine-tuning:
- Register a base model and dataset via the standard `AddModel` and `AddDataset` RPCs.
- Create or use an existing Axolotl YAML config (stored as `ConfigType.axolotl`).
- Call `StartFineTuningJob` with `framework_type = "axolotl"` and `axolotl_config_id` set to the config ID.
- Omit `prompt_id` to use auto-generation, or provide one to override.
- The job dispatcher passes `axolotl_config_id` in the `JOB_ARGUMENTS` to the training script, which loads and executes the Axolotl training pipeline.
See Fine-Tuning Job Lifecycle for the full dispatch flow and Training Script Architecture for execution details.
Evaluation Job Lifecycle
Evaluation jobs run MLflow evaluation against model+adapter combinations. A single evaluation request can compare multiple adapters against a baseline, with each combination dispatched as a separate CML Job linked by a shared parent_job_id.
Dispatch Architecture
A single StartEvaluationJob request specifies N model+adapter combinations. The dispatcher fans out into N independent CML Jobs, each running its own MLflow evaluation. All jobs share a parent_job_id that groups them for result comparison in the UI.
Validation
Validation is performed by ft/evaluation.py::_validate_start_evaluation_job_request() before any jobs are created.
Required fields:
| Field | Rule |
|---|---|
| `model_adapter_combinations` | Non-empty list of model+adapter pairs |
| `dataset_id` | Must exist in DB |
| `prompt_id` | Must exist in DB |
| `cpu` | Valid resource specification |
| `gpu` | Valid resource specification |
| `memory` | Valid resource specification |
Per-combination validation:
- Each `base_model_id` in the combinations list must exist in the database.
- Each `adapter_id` must exist in the database, or be an empty string to evaluate the base model without an adapter.
- The referenced dataset and prompt must exist.
Multi-Adapter Dispatch
For each model+adapter combination in the request, the dispatcher executes the following sequence:
- Generate IDs – A UUID `job_id` is generated for each individual evaluation run. A shared `parent_job_id` is generated once for the entire batch.
- Create directories – A result directory and job directory are created for each run.
- Find template CML Job – The dispatcher locates the `Mlflow_Evaluation_Base_Job` template in the CML project.
- Build argument list – Each run receives its own argument string containing:
| Argument | Description |
|---|---|
| `base_model_id` | The model to evaluate |
| `adapter_id` | The adapter to apply (empty string for base model only) |
| `dataset_id` | The evaluation dataset |
| `prompt_id` | The prompt template for formatting |
| `result_dir` | Directory for evaluation output |
| `configs` | Evaluation-specific configuration |
| `selected_features` | Dataset features to include |
| `eval_dataset_fraction` | Fraction of dataset to evaluate on |
| `comparison_adapter_id` | The first adapter in the batch, used as the baseline |
| `job_id` | This run’s unique identifier |
| `run_number` | Ordinal position in the batch (1-indexed) |
- Create CML Job and JobRun – A CML Job is created via `cmlapi` with the specified compute resources.
- Store EvaluationJob record – An `EvaluationJob` record is inserted into the `evaluation_jobs` table with the `parent_job_id` for grouping.
Parent Job Grouping
All evaluation runs within a batch share the same parent_job_id. This enables:
- UI grouping – The Streamlit UI displays evaluation runs grouped by parent, showing all adapter comparisons in a single view.
- Baseline comparison – The first adapter in the `model_adapter_combinations` list is designated as the baseline (`comparison_adapter_id`). All other runs compare their metrics against this baseline.
- Batch status tracking – The overall status of an evaluation batch can be determined by aggregating the statuses of all child jobs sharing the same `parent_job_id`.
Evaluation Script
The evaluation logic runs inside ft/scripts/mlflow_evaluation_base_script.py:
- Load model and adapter – The base HuggingFace model is loaded, and the optional PEFT adapter is applied via `load_adapted_hf_generation_pipeline()`. This produces a text-generation pipeline.
- Load and preprocess dataset – The evaluation dataset is loaded, the prompt template is applied to format inputs, and the dataset is sampled to the configured `eval_dataset_fraction`.
eval_dataset_fraction. - Run MLflow evaluation – MLflow’s evaluation framework is invoked with the configured metrics. Results (metric values and artifacts) are logged to an MLflow experiment.
- Log results – Evaluation metrics, predictions, and comparison data are persisted in the MLflow tracking store for retrieval by the UI.
Protobuf Messages
StartEvaluationJobRequest:
| Field | Description |
|---|---|
| `model_adapter_combinations` | List of model+adapter pairs to evaluate |
| `dataset_id` | Evaluation dataset reference |
| `prompt_id` | Prompt template for input formatting |
| `cpu`, `gpu`, `memory` | Compute resources per evaluation job |
| `configs` | Evaluation configuration (metrics, generation settings) |
EvaluationJobMetadata:
| Field | Description |
|---|---|
| `id` | Unique evaluation job identifier |
| `type` | Job type identifier |
| `cml_job_id` | CML Job identifier |
| `parent_job_id` | Shared batch identifier |
| `base_model_id` | Evaluated model |
| `dataset_id` | Evaluation dataset |
| `adapter_id` | Applied adapter (empty for base model) |
| `cpu`, `gpu`, `memory` | Allocated resources |
| `configs` | Evaluation configuration |
| `evaluation_dir` | Path to evaluation results |
See gRPC Service Design for the complete evaluation RPC catalog.
Model Export & Registry
Trained adapters can be exported through two routes, determined by ModelExportType. Both routes merge a base model with a PEFT adapter into a deployable artifact, but target different deployment backends.
Export Routes
Both routes require non-empty base_model_id, adapter_id, and model_name fields. The choice between them depends on the target deployment environment and the adapter source type.
MLflow Model Registry
Function: export_model_registry_model()
This route logs the merged model to the MLflow Model Registry as a registered model. It supports any adapter type (PROJECT, HuggingFace).
Steps:
- Load pipeline – `fetch_pipeline()` creates a HuggingFace text-generation pipeline by loading the base model and applying the PEFT adapter.
- Quantized loading – If a `BitsAndBytesConfig` is specified, the base model is loaded with 4-bit quantization before adapter application.
- Infer signature – An MLflow model signature is inferred from example input/output pairs. This defines the expected request and response schema for the registered model.
- Log model – `mlflow.transformers.log_model()` logs the pipeline to MLflow as a registered model with the specified `model_name`.
Requirements:
| Requirement | Detail |
|---|---|
| Base model | HuggingFace model registered in Studio |
| Adapter | Any adapter type (PROJECT or HuggingFace) |
| MLflow tracking | Must be configured in the CML environment |
CML Model Endpoint
Function: deploy_cml_model()
This route creates a CML Model endpoint that serves the model+adapter combination as a REST API. It is restricted to PROJECT adapters (file-based, local weights).
Steps:
- Validate adapter type – Only PROJECT adapters (local file-based weights) are supported. HuggingFace adapters must be downloaded locally first.
- Create CML Model – A CML Model object is created via `cmlapi`.
- Create ModelBuild – A build is created pointing to the predict script at `ft/scripts/cml_model_predict_script.py`. Environment variables are injected:
| Variable | Description |
|---|---|
| `FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME` | HuggingFace identifier for the base model |
| `ADAPTER_LOCATION` | File path to the adapter weights directory |
| `GEN_CONFIG_STRING` | Serialized generation config (JSON string) |
- Deploy – A ModelDeployment is created with default resources:
| Resource | Default |
|---|---|
| CPU | 2 cores |
| Memory | 8 GB |
| GPU | 1 |
- Resolve runtime – The runtime identifier is inherited from the template `Finetuning_Base_Job`, ensuring the model endpoint uses the same environment as training workloads.
Requirements:
| Requirement | Detail |
|---|---|
| Base model | HuggingFace model registered in Studio |
| Adapter | PROJECT type only (local file weights) |
| Adapter weights | Must be accessible on the local filesystem |
Validation
Both export routes perform the following validation before proceeding:
- `base_model_id` must be non-empty and reference an existing model in the database.
- `adapter_id` must be non-empty and reference an existing adapter in the database.
- `model_name` must be non-empty.
Additional route-specific validation:
- CML Model: The adapter must be of type PROJECT. Model Registry adapters require the MLflow Registry export path instead.
- MLflow Registry: The MLflow tracking server must be accessible.
Choosing an Export Route
| Criterion | MLflow Registry | CML Model Endpoint |
|---|---|---|
| Adapter source | Any (PROJECT, HuggingFace) | PROJECT only |
| Output format | MLflow registered model | REST API endpoint |
| Serving infrastructure | MLflow serving or downstream consumption | CML Model serving |
| Resource customization | Managed by MLflow | Default 2 CPU / 8 GB / 1 GPU (adjustable post-deploy) |
| Use case | Model versioning, experiment tracking, CI/CD pipelines | Real-time inference endpoint |
See CML Model Serving for details on the predict script and endpoint behavior.
CML Model Serving
A CML Model endpoint serves a fine-tuned model+adapter combination as a REST API. The endpoint is created by deploy_cml_model() (see Model Export & Registry) and runs a predict script that loads the model, applies the adapter, and handles inference requests.
Predict Script
Path: ft/scripts/cml_model_predict_script.py
The predict script runs inside a CML Model endpoint container. It is specified as the build script during deploy_cml_model() and executes in the runtime environment inherited from the template fine-tuning job.
Initialization
On startup, the script:
- Reads environment variables:
| Variable | Purpose |
|---|---|
| `FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME` | HuggingFace model identifier to load |
| `ADAPTER_LOCATION` | Path to the PEFT adapter weights directory |
| `GEN_CONFIG_STRING` | Serialized generation configuration (JSON) |
- Loads the base model – The HuggingFace model is loaded from the Hub or cache using the identifier in `FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME`.
- Applies the PEFT adapter – The LoRA adapter weights at `ADAPTER_LOCATION` are loaded and applied to the base model.
Request Handling
The predict script exposes a `predict()` function that CML invokes for each incoming request.
Request format:
```json
{
  "request": {
    "prompt": "Your input text here"
  }
}
```
The `prompt` field contains the raw input text. The `predict()` function:
- Extracts the prompt from the request payload.
- Tokenizes the input using the model’s tokenizer.
- Generates output using the model with the applied generation config.
- Decodes and returns the generated text.
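The four steps above can be sketched with a stubbed generation stage (the response shape and `fake_generate` helper are illustrative assumptions; the real script tokenizes and generates with the loaded model and generation config):

```python
def fake_generate(prompt: str) -> str:
    """Stand-in for tokenize -> model.generate -> decode (steps 2-4)."""
    return prompt.upper()

def predict(args: dict) -> dict:
    """Sketch of the CML Model predict entry point."""
    prompt = args["request"]["prompt"]   # 1. extract the prompt
    generated = fake_generate(prompt)    # 2-3. tokenize + generate (stubbed)
    return {"response": generated}       # 4. return the decoded text

print(predict({"request": {"prompt": "hello"}}))  # {'response': 'HELLO'}
```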
Endpoint Creation Flow
The full endpoint creation sequence, initiated by deploy_cml_model():
- Create CML Model – A new Model object is created in the CML project via `cmlapi`. This registers the model name and description.
- Create ModelBuild – A build is created with:
  - The predict script path (`ft/scripts/cml_model_predict_script.py`).
  - Environment variables (`FINE_TUNING_STUDIO_BASE_MODEL_HF_NAME`, `ADAPTER_LOCATION`, `GEN_CONFIG_STRING`).
  - The runtime identifier from the template fine-tuning job.
- Create ModelDeployment – A deployment is created with default resource allocation:
| Resource | Default Value |
|---|---|
| CPU | 2 cores |
| Memory | 8 GB |
| GPU | 1 |
- Runtime resolution – The runtime is inherited from the template `Finetuning_Base_Job`. This ensures the model endpoint has the same Python packages, CUDA version, and system libraries as the training environment.
Limitations
- PROJECT adapters only – Only adapters stored as local files (PROJECT type) are supported for CML Model deployment. HuggingFace Hub adapters must be downloaded to the project filesystem before they can be used with a CML Model endpoint.
- Model Registry adapters – Adapters registered through the MLflow Model Registry cannot be deployed as CML Models directly. Use the MLflow Registry export path instead (see Model Export & Registry).
- Fixed default resources – The deployment is created with 1 GPU, 2 CPU cores, and 8 GB memory. To adjust resource allocation after deployment, modify the CML Model settings through the CML UI or `cmlapi`.
- Single adapter – Each CML Model endpoint serves exactly one base model + adapter combination. To serve multiple adapters, create multiple endpoints.
Post-Deployment
After deployment completes:
- The endpoint URL is available in the CML Model UI and via `cmlapi`.
- Requests are sent as HTTP POST with the JSON format shown above.
- The endpoint auto-scales based on CML’s Model serving configuration.
- Logs and metrics are available through CML’s standard monitoring interface.
- Resource allocation can be modified via the CML Model settings without rebuilding.
Validation Rules Reference
The Studio validates resources at multiple points: on import (datasets, models, adapters, prompts, configs), on job submission (fine-tuning, evaluation), and on export (model deployment). This chapter catalogs all validation rules extracted from the source code.
Source: ft/jobs.py, ft/evaluation.py, ft/datasets.py, ft/models.py, ft/adapters.py, ft/prompts.py, ft/configs.py, ft/service.py
Rule ID Convention
Rule IDs follow the format {Domain}-{Number} where Domain is one of:
| Domain | Scope |
|---|---|
| FT | Fine-tuning job parameters |
| EV | Evaluation job parameters |
| DS | Dataset import |
| MD | Model import |
| AD | Adapter import |
| PR | Prompt template |
| CF | Configuration blob |
| EX | Model export / deployment |
All rules with severity ERROR abort the operation and return a gRPC error. Rules with severity INFO are advisory and do not block the operation.
Fine-Tuning Job Validation
Validated in ft/jobs.py when StartFineTuningJob is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| FT-001 | framework_type | Must be legacy or axolotl | ERROR |
| FT-002 | adapter_name | Must match ^[a-zA-Z0-9-]+$ (alphanumeric + hyphens, no spaces) | ERROR |
| FT-003 | out_dir | Must exist as a directory | ERROR |
| FT-004 | num_cpu | Must be > 0 | ERROR |
| FT-005 | num_gpu | Must be >= 0 | ERROR |
| FT-006 | num_memory | Must be > 0 | ERROR |
| FT-007 | num_workers | Must be > 0 | ERROR |
| FT-008 | num_epochs | Must be > 0 | ERROR |
| FT-009 | learning_rate | Must be > 0 | ERROR |
| FT-010 | dataset_fraction | Must be in (0, 1] | ERROR |
| FT-011 | train_test_split | Must be in (0, 1] | ERROR |
| FT-012 | axolotl_config_id | Required when framework_type=axolotl | ERROR |
| FT-013 | base_model_id | Must exist in models table | ERROR |
| FT-014 | dataset_id | Must exist in datasets table | ERROR |
| FT-015 | prompt_id | Must exist in prompts table (legacy framework only) | ERROR |
| FT-016 | axolotl_config_id | Must exist in configs table (when provided) | ERROR |
FT-001 through FT-011 are local validations that require no database access. FT-012 is a cross-field consistency check. FT-013 through FT-016 are foreign-key validations resolved against the DAO.
Evaluation Job Validation
Validated in ft/evaluation.py when StartEvaluationJob is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| EV-001 | model_adapter_combinations | Must be non-empty | ERROR |
| EV-002 | dataset_id | Must be non-empty | ERROR |
| EV-003 | prompt_id | Must be non-empty | ERROR |
| EV-004 | num_cpu, num_gpu, num_memory | Must be provided | ERROR |
| EV-005 | model IDs in combinations | Each must exist in models table | ERROR |
| EV-006 | adapter IDs in combinations | Each must exist in adapters table (or empty for base model) | ERROR |
| EV-007 | dataset_id | Must exist in datasets table | ERROR |
| EV-008 | prompt_id | Must exist in prompts table | ERROR |
Evaluation jobs accept multiple model-adapter pairs in a single request. EV-005 and EV-006 are validated per combination entry.
Dataset Import Validation
Validated in ft/datasets.py when AddDataset is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| DS-001 | type | Must be one of: huggingface, project, project_csv, project_json, project_jsonl | ERROR |
| DS-002 | huggingface_name | Must resolve via HfApi.dataset_info() (huggingface type) | ERROR |
| DS-003 | location | File must exist (project_csv, project_json, project_jsonl) | ERROR |
DS-002 makes a network call to the HuggingFace Hub. If HUGGINGFACE_ACCESS_TOKEN is set, it is used for gated dataset access. DS-003 validates the local filesystem path and checks the file extension matches the declared type.
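The DS-003 check can be sketched as follows (the extension mapping and helper name are illustrative assumptions; the real check lives in `ft/datasets.py`):

```python
from pathlib import Path

# Assumed mapping of declared dataset type to the allowed file extension.
EXTENSIONS = {"project_csv": ".csv", "project_json": ".json", "project_jsonl": ".jsonl"}

def check_local_dataset(ds_type: str, location: str) -> list[str]:
    """Sketch of DS-003: the file must exist and its extension must
    match the declared dataset type."""
    errors = []
    path = Path(location)
    if not path.is_file():
        errors.append(f"DS-003: file not found: {location}")
    if path.suffix != EXTENSIONS.get(ds_type, ""):
        errors.append(f"DS-003: extension {path.suffix!r} does not match type {ds_type!r}")
    return errors

print(check_local_dataset("project_csv", "data/train.jsonl"))
```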
Model Import Validation
Validated in ft/models.py when AddModel is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| MD-001 | type | Must be one of: huggingface, project, model_registry | ERROR |
| MD-002 | huggingface_model_name | Must resolve via HfApi.model_info() (huggingface type) | ERROR |
| MD-003 | cml_registered_model_id | Must resolve via cmlapi (model_registry type) | ERROR |
MD-002 contacts the HuggingFace Hub. MD-003 queries the CML Model Registry through the cmlapi SDK.
Adapter Import Validation
Validated in ft/adapters.py when AddAdapter is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| AD-001 | name | Required, non-empty | ERROR |
| AD-002 | model_id | Required, must exist in models table | ERROR |
| AD-003 | location | Must exist as directory (project type) | ERROR |
| AD-004 | fine_tuning_job_id | Must exist in fine_tuning_jobs table (if provided) | ERROR |
| AD-005 | prompt_id | Must exist in prompts table (if provided) | ERROR |
AD-004 and AD-005 are optional foreign-key references. When provided, they link the adapter back to the job and prompt that produced it.
Prompt Validation
Validated in ft/prompts.py when AddPrompt is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| PR-001 | name | Required, unique | ERROR |
| PR-002 | dataset_id | Required | ERROR |
| PR-003 | prompt_template | Required, non-empty | ERROR |
| PR-004 | input_template | Required | ERROR |
| PR-005 | completion_template | Required | ERROR |
PR-001 enforces uniqueness at the application level before insert. The prompt_template, input_template, and completion_template fields use Python format-string syntax referencing dataset feature column names.
Config Validation
Validated in ft/configs.py when AddConfig is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| CF-001 | type | Must be one of ConfigType enum values | ERROR |
| CF-002 | config | Must be valid JSON (non-axolotl types) | ERROR |
| CF-003 | config | Must be valid YAML (axolotl type) | ERROR |
| CF-004 | config | Deduplicated – returns existing ID if identical content exists | INFO |
CF-004 is a deduplication check, not an error. When a config with identical content already exists, the existing record’s ID is returned instead of creating a duplicate. The caller receives a successful response in either case.
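The deduplication behavior can be sketched with an in-memory store (content hashing and the `add_config` helper are assumptions for illustration; the real check compares against rows in the SQLite `configs` table):

```python
import hashlib

# In-memory stand-in for the configs table: content hash -> config ID.
_store: dict[str, str] = {}

def add_config(content: str) -> str:
    """Sketch of CF-004: identical config content returns the existing
    record's ID instead of creating a duplicate."""
    key = hashlib.sha256(content.encode()).hexdigest()
    if key in _store:
        return _store[key]                 # dedup hit: reuse existing ID
    config_id = f"cfg-{len(_store) + 1}"   # illustrative ID scheme
    _store[key] = config_id
    return config_id

a = add_config('{"lora_r": 8}')
b = add_config('{"lora_r": 8}')   # identical content
print(a == b)  # True
```

Either way the caller gets a valid config ID back, which is why CF-004 is advisory rather than an error.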
Export Validation
Validated in ft/models.py when ExportModel or RegisterModel is called.
| Rule ID | Field | Constraint | Severity |
|---|---|---|---|
| EX-001 | base_model_id | Required, non-empty | ERROR |
| EX-002 | adapter_id | Required, non-empty | ERROR |
| EX-003 | model_name | Required, non-empty | ERROR |
| EX-004 | adapter type | Must be PROJECT for CML Model deployment | ERROR |
| EX-005 | model type | Must be huggingface for CML Model deployment | ERROR |
EX-004 and EX-005 enforce deployment constraints. Only project-local adapters (those with files on disk) can be deployed as CML Model endpoints, and only HuggingFace-sourced base models are supported for the export merge workflow.
Building a Validation SDK
This chapter provides guidance for building a validation SDK that validates resources and job parameters before submitting them to the Fine Tuning Studio gRPC API. Pre-submission validation catches errors locally, avoiding round-trips to the server and failed CML Job launches.
Source: ft/client.py, ft/proto/fine_tuning_studio_pb2.py, ft/api/types.py
Install the ft Package
The Fine Tuning Studio ships as a pip-installable package. Install it to get access to protobuf definitions, API types, and the gRPC client:
```shell
pip install -e /path/to/CML_AMP_LLM_Fine_Tuning_Studio
```
This provides the ft package in development mode. All protobuf-generated classes and enum types are available for import.
Import Protobuf Types
```python
from ft.api import *

# Or import specific types:
from ft.proto.fine_tuning_studio_pb2 import (
    StartFineTuningJobRequest,
    AddDatasetRequest,
    AddModelRequest,
    AddConfigRequest,
)
from ft.api.types import (
    DatasetType,
    ConfigType,
    FineTuningFrameworkType,
)
```
These types define the exact field names, types, and enum values accepted by the gRPC API.
Validation Architecture
Validation rules fall into two categories:
| Category | Requires | Examples |
|---|---|---|
| Local validation | No external dependencies | Regex checks, numeric bounds, enum membership, cross-field consistency |
| DB-dependent validation | FineTuningStudioClient connection | Foreign-key existence (model, dataset, prompt, adapter) |
A validation SDK should implement local validation in pure functions and DB-dependent validation through the client.
Local Validation
Replicate the rules from the Validation Rules Reference that require no database access:
```python
import re

def validate_adapter_name(name: str) -> list[str]:
    """FT-002: adapter_name must be alphanumeric + hyphens."""
    errors = []
    if not re.match(r'^[a-zA-Z0-9-]+$', name):
        errors.append("FT-002: adapter_name must match ^[a-zA-Z0-9-]+$")
    return errors

def validate_resource_allocation(num_cpu: int, num_gpu: int, num_memory: int) -> list[str]:
    """FT-004, FT-005, FT-006: resource bounds."""
    errors = []
    if num_cpu <= 0:
        errors.append("FT-004: num_cpu must be > 0")
    if num_gpu < 0:
        errors.append("FT-005: num_gpu must be >= 0")
    if num_memory <= 0:
        errors.append("FT-006: num_memory must be > 0")
    return errors

def validate_training_params(num_epochs: int, learning_rate: float,
                             dataset_fraction: float, train_test_split: float) -> list[str]:
    """FT-008 through FT-011: training parameter ranges."""
    errors = []
    if num_epochs <= 0:
        errors.append("FT-008: num_epochs must be > 0")
    if learning_rate <= 0:
        errors.append("FT-009: learning_rate must be > 0")
    if not (0 < dataset_fraction <= 1):
        errors.append("FT-010: dataset_fraction must be in (0, 1]")
    if not (0 < train_test_split <= 1):
        errors.append("FT-011: train_test_split must be in (0, 1]")
    return errors

def validate_framework_config(framework_type: str, axolotl_config_id: str) -> list[str]:
    """FT-001, FT-012: framework type and config consistency."""
    errors = []
    if framework_type not in ("legacy", "axolotl"):
        errors.append("FT-001: framework_type must be legacy or axolotl")
    if framework_type == "axolotl" and not axolotl_config_id:
        errors.append("FT-012: axolotl_config_id required when framework_type=axolotl")
    return errors
```
DB-Dependent Validation
Some rules require database lookups. Use FineTuningStudioClient for these:
```python
from ft.client import FineTuningStudioClient

client = FineTuningStudioClient()

# Check if model exists (FT-013)
models = client.get_models()
model_ids = {m.id for m in models}
assert model_id in model_ids, f"FT-013: Model {model_id} not found"

# Check if dataset exists (FT-014)
datasets = client.get_datasets()
dataset_ids = {d.id for d in datasets}
assert dataset_id in dataset_ids, f"FT-014: Dataset {dataset_id} not found"

# Check if prompt exists (FT-015)
prompts = client.get_prompts()
prompt_ids = {p.id for p in prompts}
assert prompt_id in prompt_ids, f"FT-015: Prompt {prompt_id} not found"
```
The client connects to the gRPC server over FINE_TUNING_SERVICE_IP:FINE_TUNING_SERVICE_PORT. These environment variables are set during Studio initialization.
Config Content Validation
Validate config content before submitting via AddConfig:
```python
import json
import yaml

def validate_config(config_type: str, config_content: str) -> list[str]:
    """CF-002, CF-003: config content must parse correctly."""
    errors = []
    try:
        if config_type == "axolotl":
            yaml.safe_load(config_content)  # CF-003
        else:
            json.loads(config_content)      # CF-002
    except (yaml.YAMLError, json.JSONDecodeError) as e:
        errors.append(f"Config parse error: {e}")
    return errors
```
Composing a Validation Pipeline
Combine local and DB-dependent validators into a single function that returns all errors at once:
```python
def validate_fine_tuning_request(
    request: StartFineTuningJobRequest,
    client: FineTuningStudioClient,
) -> list[str]:
    """Validate a fine-tuning request against all applicable rules.

    Returns a list of error strings. An empty list means the request is valid.
    """
    errors = []

    # Local validation
    errors.extend(validate_framework_config(
        request.framework_type, request.axolotl_config_id))
    errors.extend(validate_adapter_name(request.adapter_name))
    errors.extend(validate_resource_allocation(
        request.num_cpu, request.num_gpu, request.num_memory))
    errors.extend(validate_training_params(
        request.num_epochs, request.learning_rate,
        request.dataset_fraction, request.train_test_split))

    # DB-dependent validation
    models = {m.id for m in client.get_models()}
    if request.base_model_id not in models:
        errors.append(f"FT-013: Model {request.base_model_id} not found")

    datasets = {d.id for d in client.get_datasets()}
    if request.dataset_id not in datasets:
        errors.append(f"FT-014: Dataset {request.dataset_id} not found")

    if request.framework_type == "legacy":
        prompts = {p.id for p in client.get_prompts()}
        if request.prompt_id not in prompts:
            errors.append(f"FT-015: Prompt {request.prompt_id} not found")

    if request.axolotl_config_id:
        configs = {c.id for c in client.get_configs()}
        if request.axolotl_config_id not in configs:
            errors.append(f"FT-016: Config {request.axolotl_config_id} not found")

    return errors
```
Usage Pattern
Call the validation pipeline before submitting any request:
```python
from ft.client import FineTuningStudioClient
from ft.proto.fine_tuning_studio_pb2 import StartFineTuningJobRequest

client = FineTuningStudioClient()

request = StartFineTuningJobRequest(
    framework_type="legacy",
    adapter_name="my-adapter",
    base_model_id="abc-123",
    dataset_id="def-456",
    prompt_id="ghi-789",
    num_cpu=2,
    num_gpu=1,
    num_memory=8,
    num_epochs=3,
    learning_rate=2e-5,
    dataset_fraction=1.0,
    train_test_split=0.8,
)

errors = validate_fine_tuning_request(request, client)
if errors:
    for e in errors:
        print(f"  {e}")
    raise ValueError(f"Validation failed with {len(errors)} error(s)")

# Safe to submit
client.start_fine_tuning_job(request)
```
This pattern ensures that invalid requests never reach the gRPC server, providing immediate feedback and avoiding wasted CML Job compute.
GitHub Actions Integration
This chapter documents the existing CI/CD configuration and provides patterns for extending it with config validation and formatting checks.
Source: .github/workflows/run-tests.yaml, .github/workflows/docs.yml, bin/run-tests.sh
Existing CI Workflow
The primary CI workflow is .github/workflows/run-tests.yaml:
| Setting | Value |
|---|---|
| Trigger | Pushes and PRs to main and dev branches |
| Runner | ubuntu-latest |
| Python version | 3.11 |
| Dependencies | requirements.txt |
| Test command | pytest -v --cov=ft --cov-report=html --cov-report=xml -s tests/ |
| Coverage threshold | >10% |
The workflow installs all dependencies, runs the full test suite with coverage collection, and generates both HTML and XML coverage reports.
Running Tests Locally
```shell
# Full test suite with coverage
./bin/run-tests.sh

# Single test file
pytest -v -s tests/test_datasets.py

# Single test method
pytest -v -s tests/test_datasets.py::TestDatasets::test_add_dataset
```
The bin/run-tests.sh script mirrors the CI configuration. Run it before pushing to catch failures early.
Adding Config Validation to CI
Axolotl YAML configs and dataset format JSON files live under ft/config/. A dedicated workflow validates these files on any PR that modifies them:
```yaml
name: Validate Configs

on:
  pull_request:
    paths:
      - 'ft/config/**'
      - 'data/project_defaults.json'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install pyyaml pydantic
      - name: Validate Axolotl configs
        run: |
          python -c "
          import yaml, json, glob

          # Validate YAML configs
          for f in glob.glob('ft/config/axolotl/training_config/*.yaml'):
              with open(f) as fh:
                  yaml.safe_load(fh)
              print(f'OK: {f}')

          # Validate dataset format JSON configs
          for f in glob.glob('ft/config/axolotl/dataset_formats/*.json'):
              with open(f) as fh:
                  json.load(fh)
              print(f'OK: {f}')

          # Validate project defaults
          with open('data/project_defaults.json') as fh:
              json.load(fh)
          print('OK: data/project_defaults.json')
          "
```
The paths filter ensures this workflow only runs when config files change. Any parse error causes the step to fail with a traceback identifying the malformed file.
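The same checks can also live in a standalone module so they are runnable locally as well as from CI. The sketch below is an assumption, not existing Studio code — the file name (e.g. a hypothetical `scripts/validate_configs.py`) and function names are illustrative; parse errors propagate or are collected, matching the fail-on-traceback behavior of the inline step above:

```python
import json
from pathlib import Path


def validate_config_file(path: str) -> None:
    """Parse one config file, raising on malformed content."""
    text = Path(path).read_text()
    if path.endswith((".yaml", ".yml")):
        import yaml  # PyYAML, imported lazily so JSON-only runs don't need it
        yaml.safe_load(text)
    elif path.endswith(".json"):
        json.loads(text)
    else:
        raise ValueError(f"unsupported config type: {path}")


def validate_tree(root: str) -> list[tuple[str, str]]:
    """Walk `root` and return (path, error) pairs for files that fail to parse."""
    failures = []
    for p in sorted(Path(root).rglob("*")):
        if p.suffix in {".yaml", ".yml", ".json"}:
            try:
                validate_config_file(str(p))
            except Exception as exc:
                failures.append((str(p), str(exc)))
    return failures
```

A CI step would then reduce to `python scripts/validate_configs.py` and an exit code derived from whether `validate_tree` returned any failures.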
Pre-Commit Formatting Check
The project uses autoflake and autopep8 for code formatting. Add a CI step to verify formatting compliance:
- name: Check formatting
  run: |
    set -o pipefail
    pip install autoflake autopep8
    autoflake --check --remove-all-unused-imports \
      --ignore-init-module-imports --recursive ft/ tests/ pgs/
    autopep8 --exit-code --diff --max-line-length 120 \
      --aggressive --aggressive --aggressive \
      --recursive ft/ tests/ pgs/ | head -20
autoflake --check exits non-zero if any unused imports are found. autopep8 --diff prints the diff that would be applied without modifying files; --exit-code makes it also exit non-zero when a diff exists, and set -o pipefail propagates that failure through the head -20 pipe that keeps output concise. If either tool reports issues, the step fails.
Documentation Deployment
The documentation workflow is .github/workflows/docs.yml. It builds the mdbook with D2 diagram support and deploys to GitHub Pages. This workflow is independent of the test CI and triggers on documentation changes. See the System Overview for the full project architecture.
Extending the CI Pipeline
When adding new validation workflows, follow these conventions:
| Convention | Guideline |
|---|---|
| Path filtering | Use paths: to scope workflows to relevant directories |
| Python version | Pin to 3.11 to match production |
| Dependency isolation | Install only what the validation step needs, not the full requirements.txt |
| Exit codes | Rely on tool exit codes for pass/fail – avoid custom success checks |
| Artifact uploads | Use actions/upload-artifact@v4 for coverage reports or validation logs |
| Branch protection | Configure required status checks on main to enforce green CI before merge |
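Applied together, a new validation workflow following these conventions might look like the fragment below. The workflow, script, and artifact names are illustrative, not existing files:

```yaml
name: Validate Adapters
on:
  pull_request:
    paths:
      - 'ft/**'                     # path filtering: run only on relevant changes

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'    # pinned to match production
      - run: pip install pydantic   # only what this step needs, not requirements.txt
      - run: python scripts/validate_adapters.py   # tool exit code decides pass/fail
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validation-logs
          path: validation.log
```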