Training Script Architecture
The training script is the code that runs inside a CML Job after dispatch. It receives configuration via environment variables, loads and preprocesses data, trains a PEFT LoRA adapter, and saves the result.
Entry Point
ft/scripts/accel_fine_tune_base_script.py
The script is executed as a CML Job. Arguments are received via the JOB_ARGUMENTS environment variable as a space-delimited string with --key value pairs, parsed into an argparse namespace at startup.
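The parsing step can be sketched as follows. This is a minimal illustration, not the script's actual parser: the function name `parse_job_arguments` and the `--dataset_fraction` flag are hypothetical, the three arguments shown are only a subset, and the use of `shlex.split` (rather than a plain `str.split`) is an assumption made so that quoted values survive.

```python
import argparse
import os
import shlex


def parse_job_arguments(raw=None):
    """Parse a space-delimited "--key value" string into an argparse namespace."""
    if raw is None:
        # The CML Job exposes the argument string through the environment.
        raw = os.environ.get("JOB_ARGUMENTS", "")

    parser = argparse.ArgumentParser(description="Fine-tuning job arguments (illustrative)")
    # Hypothetical subset of parameters; the real script defines its own set.
    parser.add_argument("--base_model_id", type=str, default=None)
    parser.add_argument("--dataset_fraction", type=float, default=1.0)
    parser.add_argument("--auto_add_adapter", type=str, default="false")

    # shlex.split honours shell-style quoting, so values with spaces stay intact.
    # parse_known_args tolerates flags this sketch does not declare.
    args, _unknown = parser.parse_known_args(shlex.split(raw))
    return args
```

Calling `parse_job_arguments()` with no argument reads the environment variable; passing a string directly is convenient for local testing.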
Execution Flow
- Parse JOB_ARGUMENTS – The JOB_ARGUMENTS environment variable is split and parsed via argparse into a namespace containing all training parameters.
- Load base model – The HuggingFace model is loaded with optional BitsAndBytesConfig for 4-bit NF4 quantization. The model ID is resolved from the Studio database using base_model_id.
- Configure tokenizer padding – The tokenizer is inspected for a suitable pad token. The function find_padding_token_candidate() searches the vocabulary for tokens containing “pad” or “reserved”.
- Apply PEFT LoRA adapter – A LoraConfig is constructed from the config blob stored in the database, and the model is wrapped with get_peft_model().
- Load and preprocess dataset:
  - load_dataset_into_memory() reads the dataset into a HuggingFace DatasetDict.
  - map_dataset_with_prompt_template() formats each row using the prompt template, appending the EOS token.
  - sample_and_split_dataset() downsamples by the configured fraction and splits into train/test sets (seed=42).
- Initialize SFTTrainer – A TRL SFTTrainer is created with the processed dataset, model, tokenizer, and training arguments.
- Train – trainer.train() executes the training loop.
- Save adapter weights – The trained LoRA adapter is saved to the output directory.
- Auto-register adapter – If auto_add_adapter=true, the adapter is registered in the Studio database automatically after training completes.
Dataset Preprocessing Chain
| Step | Function | Input | Output |
|---|---|---|---|
| Load | load_dataset_into_memory() | Dataset metadata (type, path, HF name) | HF DatasetDict |
| Format | map_dataset_with_prompt_template() | DatasetDict + prompt template | DatasetDict with prediction column |
| Sample/Split | sample_and_split_dataset() | DatasetDict + fraction + split ratio | Train/test DatasetDict |
The prediction column contains the fully formatted training text for each row – the prompt template applied to dataset features with the EOS token appended. This column name is defined by TRAINING_DATA_TEXT_FIELD.
Key Training Utilities
All utilities are defined in ft/training/utils.py.
get_model_parameters(model)
Returns a tuple of (total_params, trainable_params) for the model. Used for logging the parameter count before and after applying the LoRA adapter.
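A parameter count of this kind is usually a short loop over the model's parameters. The sketch below assumes a PyTorch-style interface (`model.parameters()`, `param.numel()`, `param.requires_grad`). The `FakeParam` and `FakeModel` classes are hypothetical stand-ins so the example runs without PyTorch installed, and the function body is an illustration rather than the code in ft/training/utils.py.

```python
class FakeParam:
    """Stand-in for a torch.nn.Parameter (illustration only)."""

    def __init__(self, size, requires_grad):
        self._size = size
        self.requires_grad = requires_grad

    def numel(self):
        return self._size


class FakeModel:
    """Stand-in for a model exposing a PyTorch-style parameters() iterator."""

    def __init__(self, params):
        self._params = params

    def parameters(self):
        return iter(self._params)


def get_model_parameters(model):
    """Return (total_params, trainable_params) for a PyTorch-style model."""
    total_params = 0
    trainable_params = 0
    for param in model.parameters():
        count = param.numel()
        total_params += count
        # After LoRA wrapping, typically only the adapter weights require gradients.
        if param.requires_grad:
            trainable_params += count
    return total_params, trainable_params
```

Comparing the tuple before and after wrapping the model shows how small the trainable fraction becomes once the base weights are frozen.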
map_dataset_with_prompt_template(dataset, template)
Applies the prompt template to each row in the dataset. The template contains prompt_template, input_template, and completion_template fields that are formatted with the dataset’s feature columns. The EOS token is appended to the prediction field to signal sequence boundaries during training.
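The formatting step can be illustrated on plain Python dictionaries. This sketch covers only a single format string; how input_template and completion_template are combined is not described here, so that part is left out. The helper names `format_row` and `map_rows_with_prompt_template` are hypothetical, and a list of dicts stands in for a HuggingFace dataset (where `Dataset.map` would play the role of the list comprehension).

```python
TRAINING_DATA_TEXT_FIELD = "prediction"


def format_row(row, prompt_template, eos_token):
    """Fill the template from the row's columns and append the EOS token."""
    # str.format ignores unused keyword arguments, so extra columns are harmless.
    text = prompt_template.format(**row) + eos_token
    # Keep the original columns and add the formatted text alongside them.
    return {**row, TRAINING_DATA_TEXT_FIELD: text}


def map_rows_with_prompt_template(rows, prompt_template, eos_token):
    """Apply format_row to every row (a stand-in for Dataset.map)."""
    return [format_row(row, prompt_template, eos_token) for row in rows]
```

For example, the template `"Q: {question}\nA: {answer}"` applied to a row with `question` and `answer` columns yields a prediction string that ends with the tokenizer's EOS token.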
sample_and_split_dataset(ds, fraction, split)
Downsamples the dataset to the specified fraction (e.g., 0.5 = 50% of rows), then splits into train and test sets at the given ratio. Uses TRAINING_DATASET_SEED = 42 for reproducible splits across runs.
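The downsample-then-split behaviour can be sketched with the standard library. This is an illustration on a plain list, not the utility itself: the name `sample_and_split_rows` is hypothetical, and the real function operates on HuggingFace dataset objects, where a seeded shuffle and split would replace the list slicing shown here.

```python
import random

TRAINING_DATASET_SEED = 42


def sample_and_split_rows(rows, fraction=1.0, split=0.9, seed=TRAINING_DATASET_SEED):
    """Downsample rows to `fraction`, then split into train/test at ratio `split`."""
    # A dedicated, seeded RNG keeps the result identical across runs.
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)

    # Keep the first `fraction` of the shuffled rows.
    keep = int(round(len(shuffled) * fraction))
    sampled = shuffled[:keep]

    # The first `split` share becomes the training set, the rest the test set.
    n_train = int(round(len(sampled) * split))
    return {"train": sampled[:n_train], "test": sampled[n_train:]}
```

With 100 rows, `fraction=0.5`, and `split=0.9`, this keeps 50 rows and divides them into 45 training rows and 5 test rows; because the seed is fixed, repeated calls return the same partition.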
find_padding_token_candidate(tokenizer)
Searches the tokenizer vocabulary for tokens containing “pad” or “reserved” as substrings. Returns the first match found, or None if no candidate exists.
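The substring search is simple enough to sketch directly. The version below takes a vocabulary mapping (token string to ID) rather than a tokenizer object, and the case-insensitive comparison is an assumption; the helper name `find_pad_candidate` is hypothetical.

```python
def find_pad_candidate(vocab):
    """Return the first token containing "pad" or "reserved", or None."""
    for token in vocab:
        lowered = token.lower()
        if "pad" in lowered or "reserved" in lowered:
            return token
    # No suitable token exists in this vocabulary.
    return None
```

With a HuggingFace tokenizer, the mapping returned by `tokenizer.get_vocab()` could be passed in as `vocab`.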
configure_tokenizer_padding(tokenizer, pad_token)
Sets the tokenizer’s padding token using a fallback chain:
- Use the tokenizer’s existing pad_token if already set.
- Use the provided pad_token argument if given.
- Use the tokenizer’s unk_token if available.
- Search for reserved token candidates via find_padding_token_candidate().
This ensures every tokenizer has a valid pad token regardless of the base model’s configuration.
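The fallback chain above can be sketched as a sequence of guarded assignments. This is a hedged illustration: the function name `configure_padding`, the `find_candidate` callback parameter, and the use of `SimpleNamespace` as a stand-in tokenizer are all assumptions made to keep the example self-contained.

```python
from types import SimpleNamespace


def configure_padding(tokenizer, pad_token=None, find_candidate=None):
    """Set tokenizer.pad_token using the fallback order described above."""
    # 1. An existing pad token always wins.
    if tokenizer.pad_token is not None:
        return tokenizer.pad_token
    if pad_token is not None:
        # 2. Explicit argument supplied by the caller.
        tokenizer.pad_token = pad_token
    elif getattr(tokenizer, "unk_token", None) is not None:
        # 3. Reuse the unknown token.
        tokenizer.pad_token = tokenizer.unk_token
    elif find_candidate is not None:
        # 4. Last resort: search the vocabulary (may still yield None).
        tokenizer.pad_token = find_candidate(tokenizer)
    return tokenizer.pad_token


# Stand-in tokenizer with no pad token but an unk token.
demo_tokenizer = SimpleNamespace(pad_token=None, unk_token="<unk>")
configure_padding(demo_tokenizer)
```

Passing the candidate search as a callback keeps the sketch independent of any particular tokenizer class; in this illustration the last step can still leave the pad token unset if the search finds nothing.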
Training Constants
Defined in ft/consts.py:
| Constant | Value | Purpose |
|---|---|---|
| TRAINING_DATA_TEXT_FIELD | "prediction" | Column name for the formatted training text in the preprocessed dataset |
| TRAINING_DEFAULT_TRAIN_TEST_SPLIT | 0.9 | Default train/test split ratio (90% train, 10% test) |
| TRAINING_DEFAULT_DATASET_FRACTION | 1.0 | Default dataset fraction (use full dataset) |
| TRAINING_DATASET_SEED | 42 | Random seed for reproducible dataset splitting and sampling |
Relationship to Job Lifecycle
The training script is the execution payload of the Fine-Tuning Job Lifecycle. The job dispatch process builds the JOB_ARGUMENTS string, creates the CML Job pointing to this script, and starts a JobRun. The script then runs independently inside the CML workload: it reads its configuration from the environment, queries the Studio database directly for resource metadata (model paths, dataset locations, config blobs), and writes adapter weights to the output directory.
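On the dispatch side, building the argument string amounts to flattening a parameter mapping into `--key value` pairs. The sketch below is an assumption about how such a string might be assembled, not the dispatcher's actual code; the function name `build_job_arguments` and the choice to skip `None` values are hypothetical.

```python
import shlex


def build_job_arguments(params):
    """Flatten a dict into a space-delimited "--key value" string."""
    parts = []
    for key, value in params.items():
        # Omit unset parameters so the script falls back to its own defaults.
        if value is None:
            continue
        parts.extend([f"--{key}", str(value)])
    # shlex.join quotes any value containing spaces or shell metacharacters.
    return shlex.join(parts)
```

A string produced this way can be split back into the same tokens with `shlex.split`, so values containing spaces survive the trip through the environment variable.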