ML Pipeline Stages
The ML pipeline is implemented in `notebooks/distributed-xgboost-with-dask.ipynb`. It proceeds through five stages: data loading, feature engineering, train/dev/val split, model training (with baselines and distributed XGBoost), and model validation with threshold selection.
Stage 1: Data Loading
```python
import dask.dataframe as dd

dask_df = dd.read_csv("../data/creditcardsample.csv", assume_missing=True)
```
Data is loaded as a lazy Dask DataFrame; no computation occurs until `.compute()` is called. The `assume_missing=True` flag tells Dask to treat ambiguous columns as nullable floats, avoiding type-inference errors across partitions.
Input: `data/creditcardsample.csv` (94,926 rows, 31 columns: Time, V1–V28, Amount, Class).
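To illustrate the lazy-evaluation point, a minimal sketch (using the `dask_df` handle from the snippet above; the percentage formatting is just for display):

```python
# Nothing is read from disk yet; this only extends the task graph
fraud_rate = dask_df["Class"].mean()

# .compute() triggers the actual read and the reduction across partitions
print(f"Fraud prevalence: {fraud_rate.compute():.4%}")
```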
Stage 2: Feature Engineering
- Drop the `Time` column (not useful for point-like fraud predictions).
- Separate `Class` as the target variable `y`.
- The remaining 29 columns (V1–V28, Amount) become the feature matrix `X`.
- Scale all features via `StandardScaler` (zero mean, unit variance); see the sketch below.
See Feature Engineering Contract for the exact transformation rules.
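A minimal sketch of these steps, assuming the `dask_df` from Stage 1 and `dask_ml`'s drop-in `StandardScaler` (the notebook's exact calls may differ):

```python
from dask_ml.preprocessing import StandardScaler

# Target: the Class column (1 = fraud, 0 = legitimate)
y = dask_df["Class"]

# Features: everything except Time (dropped) and Class (the label)
X = dask_df.drop(columns=["Time", "Class"])

# Standardize all 29 remaining features to zero mean and unit variance
X = StandardScaler().fit_transform(X)
```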
Stage 3: Train/Dev/Val Split
Two successive calls to `dask_ml.model_selection.train_test_split(shuffle=True)`:
- Split into 70% training and 30% holdout.
- Split the 30% holdout into 20% development and 10% validation (relative to original).
All splits approximately preserve the class balance (~0.16% fraud). Data remains distributed as Dask arrays throughout training; only the final validation set is converted to NumPy for threshold analysis.
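A minimal sketch of the two-step split, assuming the `X` and `y` from Stage 2 (the fixed `random_state` is an illustrative assumption):

```python
from dask_ml.model_selection import train_test_split

# First split: 70% train / 30% holdout
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

# Second split: carve the holdout into dev (20% of the original)
# and val (10% of the original), i.e. a 2:1 ratio within the holdout
X_dev, X_val, y_dev, y_val = train_test_split(
    X_holdout, y_holdout, test_size=1 / 3, shuffle=True, random_state=42
)
```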
Stage 4: Model Training
Baselines (single-node)
Two baseline models establish performance floors:
| Model | Method | Dev AUCPR |
|---|---|---|
| `DummyClassifier` (majority) | Always predicts non-fraud | ~0.50 |
| `LogisticRegression` | No regularization penalty; sklearn `Pipeline` with `StandardScaler` | ~0.75 |
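A minimal single-node sketch of the two baselines, assuming the Dask splits have been materialized to NumPy (e.g. via `.compute()`) and using `average_precision_score` as the AUCPR estimate:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Majority-class floor: always predicts the non-fraud class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Unregularized logistic regression behind a scaling step
# (penalty=None needs scikit-learn >= 1.2; older versions use "none")
logreg = make_pipeline(StandardScaler(), LogisticRegression(penalty=None, max_iter=1000))
logreg.fit(X_train, y_train)

# AUCPR on the dev set via average precision
dev_scores = logreg.predict_proba(X_dev)[:, 1]
print(average_precision_score(y_dev, dev_scores))
```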
Distributed XGBoost
- Convert Dask arrays to `DaskDMatrix` objects (memory-optimized for distributed training).
- Train with `xgb.dask.train()` using fixed parameters (`tree_method=hist`, `objective=reg:logistic`, `eval_metric=aucpr`), as sketched below.
- Tune hyperparameters via sequential random search (20 samples from an 8-dimensional search space).
- Select the best model by dev AUCPR.
See XGBoost Training Configuration and Hyperparameter Search Space for full parameter specifications.
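A minimal sketch of the distributed-training step, assuming a local `dask.distributed` cluster and the fixed parameters above (`num_boost_round=100` is an illustrative assumption; tuned values come from the random search):

```python
import xgboost as xgb
from dask.distributed import Client

client = Client()  # connect to (or spin up) a Dask cluster

# DaskDMatrix wraps the distributed partitions without collecting them
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
ddev = xgb.dask.DaskDMatrix(client, X_dev, y_dev)

params = {
    "tree_method": "hist",
    "objective": "reg:logistic",
    "eval_metric": "aucpr",
}

# xgb.dask.train returns {"booster": Booster, "history": eval log}
output = xgb.dask.train(
    client, params, dtrain,
    num_boost_round=100,
    evals=[(ddev, "dev")],
)
booster = output["booster"]
```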
Stage 5: Validation and Threshold Selection
The best model is evaluated on the held-out validation set (10% of data, never seen during training or hyperparameter selection).
- Compute continuous predictions via `xgb.dask.predict()` on the validation Dask DataFrame, then convert to NumPy.
- Calculate AUCPR on the validation set.
- Generate a precision-recall curve to select a classification threshold (sketched at the end of this stage).
- The threshold of 0.35 is chosen to balance precision and recall for the fraud detection use case.
- Serialize the best model:

```python
results["best_model"].save_model("../model/best-xgboost-model")
```
See Model Serialization Format for details on the saved artifact.
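A minimal sketch of the threshold step, assuming `y_val` and `val_scores` are the NumPy labels and continuous predictions described above:

```python
from sklearn.metrics import precision_recall_curve

# Trade-off curve over all candidate thresholds
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

# Apply the operating point chosen in the notebook
THRESHOLD = 0.35
y_pred = (val_scores >= THRESHOLD).astype(int)
```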