
ML Pipeline Stages

The ML pipeline is implemented in notebooks/distributed-xgboost-with-dask.ipynb. It proceeds through five stages: data loading, feature engineering, train/dev/val split, model training (with baselines and distributed XGBoost), and model validation with threshold selection.

Stage 1: Data Loading

dask_df = dd.read_csv("../data/creditcardsample.csv", assume_missing=True)

Data is loaded as a lazy Dask DataFrame. No computation occurs until .compute() is called. The assume_missing=True flag tells Dask to read columns it would otherwise infer as integers as floats (which can hold NaN), avoiding dtype mismatches between partitions.

Input: data/creditcardsample.csv — 94,926 rows, 31 columns (Time, V1–V28, Amount, Class).

Stage 2: Feature Engineering

  1. Drop the Time column (an elapsed-time index, not useful for predicting fraud on individual transactions).
  2. Separate Class as the target variable y.
  3. Remaining 29 columns (V1–V28, Amount) become the feature matrix X.
  4. Scale all features via StandardScaler (zero mean, unit variance).
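The four steps above can be sketched on a small pandas frame (the real pipeline applies the same operations to the Dask DataFrame; only two of the V1–V28 columns are shown here):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Time":   [0, 1, 2, 3],
                   "V1":     [0.5, -1.0, 0.3, 0.2],
                   "Amount": [10.0, 250.0, 3.5, 99.0],
                   "Class":  [0, 0, 1, 0]})

X = df.drop(columns=["Time", "Class"])        # steps 1+3: feature matrix
y = df["Class"]                               # step 2: target variable
X_scaled = StandardScaler().fit_transform(X)  # step 4: zero mean, unit variance

print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # ~[1. 1.]
```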

See Feature Engineering Contract for the exact transformation rules.

Stage 3: Train/Dev/Val Split

Two successive calls to dask_ml.model_selection.train_test_split(shuffle=True):

  1. Split into 70% training and 30% holdout.
  2. Split the 30% holdout into 20% development and 10% validation (relative to original).
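The arithmetic of the two-step split can be sketched with scikit-learn's train_test_split, whose signature dask_ml mirrors for Dask collections (sizes and random_state here are illustrative). Note that 10% of the original data is one third of the 30% holdout:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

# Step 1: 70% train / 30% holdout
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

# Step 2: split the holdout 2:1 -> 20% dev / 10% val of the original data
X_dev, X_val, y_dev, y_val = train_test_split(
    X_hold, y_hold, test_size=1/3, shuffle=True, random_state=0)

print(len(X_train), len(X_dev), len(X_val))  # 700 200 100
```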

All splits approximately preserve the class balance (~0.16% fraud). Data remains as distributed Dask arrays throughout training — only the final validation set is converted to NumPy for threshold analysis.

Stage 4: Model Training

Baselines (single-node)

Two baseline models establish performance floors:

| Model | Method | Dev AUCPR |
|-------|--------|-----------|
| DummyClassifier (majority) | Always predicts non-fraud | ~0.50 |
| LogisticRegression | No regularization penalty, sklearn Pipeline with StandardScaler | ~0.75 |

Distributed XGBoost

  1. Convert Dask arrays to DaskDMatrix objects (memory-optimized for distributed training).
  2. Train with xgb.dask.train() using fixed parameters (tree_method=hist, objective=reg:logistic, eval_metric=aucpr).
  3. Tune hyperparameters via sequential random search (20 samples from 8-dimensional search space).
  4. Select the best model by dev AUCPR.

See XGBoost Training Configuration and Hyperparameter Search Space for full parameter specifications.

Stage 5: Validation and Threshold Selection

The best model is evaluated on the held-out validation set (10% of data, never seen during training or hyperparameter selection).

  1. Compute continuous predictions via xgb.dask.predict() on the validation Dask DataFrame, then convert to NumPy.
  2. Calculate AUCPR on the validation set.
  3. Generate precision-recall curve to select a classification threshold.
  4. The threshold of 0.35 is chosen to balance precision and recall for the fraud detection use case.
  5. Serialize the best model:
results["best_model"].save_model("../model/best-xgboost-model")
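Steps 2–3 of the validation stage can be sketched with synthetic scores (the data and the max-F1 selection rule are illustrative; the notebook's 0.35 threshold came from its own precision-recall analysis):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.random(5000) < 0.02                               # ~2% positives
scores = np.clip(y_true * 0.6 + rng.random(5000) * 0.5, 0, 1)  # noisy model scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)
aucpr = auc(recall, precision)  # step 2: area under the PR curve

# Step 3: one reasonable balance point is the threshold maximizing F1;
# precision/recall have one more entry than thresholds, hence the [:-1]
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)
best = float(thresholds[np.argmax(f1)])
print(round(aucpr, 3), round(best, 3))
```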

See Model Serialization Format for details on the saved artifact.