Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Fine-Tuning Job Lifecycle

A fine-tuning job trains a PEFT LoRA adapter on a base model using a configured dataset and prompt template. Jobs are dispatched as CML Jobs via the cmlapi SDK. The entry point is ft/jobs.py::start_fine_tuning_job(), which validates the request, prepares the execution environment, and creates the CML workload.

Job Dispatch Flow

  1. Validate Request_validate_fine_tuning_request() checks all fields against the rules below. Any violation raises a ValueError that propagates as a gRPC error.
  2. Create Job Directory – A UUID job_id is generated. The directory .app/job_runs/{job_id} is created to hold training artifacts.
  3. Find Template CML Job – The dispatcher locates the Accel_Finetuning_Base_Job template in the CML project. This template defines the runtime environment and script path.
  4. Build Argument List – All training parameters are serialized into a --key value string passed as JOB_ARGUMENTS.
  5. Create CML Job + JobRun – A CML Job and its first JobRun are created via cmlapi, with the specified CPU, GPU, and memory resources.
  6. Store Job Record – A FineTuningJob record is inserted into the fine_tuning_jobs table with all metadata for tracking.

Validation Rules

Validation is performed by ft/jobs.py::_validate_fine_tuning_request() before any side effects occur.

FieldRuleError
framework_typeMust be legacy or axolotl“framework_type must be either legacy or axolotl”
adapter_nameAlphanumeric + hyphens only (^[a-zA-Z0-9-]+$)“adapter_name must be alphanumeric”
out_dirMust exist as directory“output_dir does not exist”
num_cpu> 0“cpu must be greater than 0”
num_gpu>= 0“gpu must be at least 0”
num_memory> 0“memory must be greater than 0”
num_workers> 0“num_workers must be greater than 0”
num_epochs> 0“Number of epochs must be greater than 0”
learning_rate> 0“Learning rate must be greater than 0”
dataset_fraction(0, 1]“dataset_fraction must be between 0 and 1”
train_test_split(0, 1]“train_test_split must be between 0 and 1”
axolotl_config_idRequired when framework=axolotl“axolotl framework requires axolotl_config_id”
base_model_idMust exist in DB“Model not found”
dataset_idMust exist in DB“Dataset not found”
prompt_idMust exist in DB (legacy only)“Prompt not found”

Framework Types

Legacy

Uses HuggingFace Accelerate with the TRL SFTTrainer. The user provides each configuration component separately:

  • prompt_id – A prompt template that maps dataset features to the training text format.
  • LoRA config – PEFT LoRA hyperparameters (rank, alpha, dropout, target modules).
  • BnB config – BitsAndBytes quantization settings (4-bit NF4 quantization).
  • Training arguments – Standard HuggingFace TrainingArguments fields (epochs, learning rate, batch size, etc.).

For distributed training, worker resources are specified independently via dist_cpu, dist_gpu, and dist_mem fields.

Axolotl

Uses the Axolotl training framework. The user provides a single YAML configuration file (referenced by axolotl_config_id) that bundles all training parameters, LoRA settings, and dataset handling into one document. If no prompt_id is provided, the system auto-generates a prompt from the dataset format definition. See Axolotl Integration for details.

Resource Specification

Each job requires explicit compute resource allocation:

FieldDescription
num_cpuCPU cores for the primary training worker
num_gpuGPU count for the primary training worker
num_memoryMemory in GB for the primary training worker
num_workersNumber of training workers (Accelerate distributed training)

For legacy distributed training, additional fields specify per-worker resources:

FieldDescription
dist_cpuCPU cores per distributed worker
dist_gpuGPU count per distributed worker
dist_memMemory in GB per distributed worker

Argument List Schema

Arguments are passed as the JOB_ARGUMENTS environment variable to the CML Job. The value is a space-delimited string of --key value pairs.

Core arguments (always present):

KeySource
base_model_idRequest field
dataset_idRequest field
experimentidGenerated UUID (same as job_id)
out_dirRequest field
train_out_dirConstructed path for training output
adapter_nameRequest field
framework_typeRequest field (legacy or axolotl)

Optional arguments (included when non-empty):

KeyDescription
prompt_idPrompt template ID (required for legacy, optional for axolotl)
bnb_configBitsAndBytes config ID
lora_configLoRA config ID
training_arguments_configTraining arguments config ID
hf_tokenHuggingFace access token
axolotl_config_idAxolotl YAML config ID
gpu_label_idGPU label config ID

Legacy distributed training arguments:

KeyDescription
dist_numNumber of distributed workers
dist_cpuCPU per worker
dist_memMemory per worker
dist_gpuGPU per worker

Protobuf Messages

The job lifecycle uses two primary protobuf messages:

  • StartFineTuningJobRequest – Contains all fields listed above. Sent by the client to initiate training.
  • StartFineTuningJobResponse – Returns the created job metadata including the generated job_id and CML job identifiers.
  • FineTuningJobMetadata – The full job record stored in the database and returned by GetFineTuningJob and ListFineTuningJobs RPCs.

See gRPC Service Design for the complete RPC catalog.