CatBoost Migration Roadmap
This roadmap details how to migrate the distributed fraud detection pipeline from XGBoost to CatBoost, drawing on the production CatBoost patterns established in the Signals-360 metadata classification project. Signals uses CatBoost as one of five evidence sources in a Dempster-Shafer fusion pipeline for classifying database columns into a 174-category SIGDG taxonomy — a harder problem (extreme class imbalance, 2 samples per category) that validates CatBoost’s suitability for the comparatively straightforward binary fraud detection task here.
Why CatBoost
| Concern | XGBoost (current) | CatBoost (target) |
|---|---|---|
| Ordered boosting | Not available | posterior_sampling=True — prevents prediction shift in low-data regimes |
| Categorical features | Requires one-hot or label encoding | Native categorical handling via cat_features parameter |
| Regularization | L1/L2 on weights | L2 on leaves + Bayesian priors via ordered boosting |
| GPU training | tree_method="gpu_hist" | task_type="GPU" with automatic multi-GPU |
| Dask integration | First-class (xgb.dask.train, DaskDMatrix) | None — this is the primary migration challenge |
| Model format | Binary (.ubj) | CatBoost binary (.cbm) + sidecar class mapping (.classes.json) |
| Feature importance | Built-in SHAP | Built-in TreeSHAP via get_feature_importance(type="ShapValues") |
Migration Phases
Phase 1: Drop-in Replacement
Goal: Replace XGBoost with CatBoost while keeping the existing Dask infrastructure for data loading and preprocessing. CatBoost trains on collected (non-distributed) data.
Architecture Change
The current pipeline uses Dask end-to-end: DaskDMatrix → xgb.dask.train() → distributed predictions. CatBoost has no Dask integration, so the architecture splits into two stages: Dask for ETL, CatBoost for training.
Dependency Changes
```diff
# requirements.txt
- xgboost==1.6.1
+ catboost>=1.2.7
dask[complete]==2022.5.1
scikit-learn==1.0.2
numpy==1.22.4
```
Training Code
Replace `xgb.dask.train()` with `CatBoostClassifier.fit()`:

```python
from catboost import CatBoostClassifier, Pool

# Collect Dask arrays to NumPy (data must fit in driver memory)
X_train_np = X_train.compute()
y_train_np = y_train.compute()
X_dev_np = X_dev.compute()
y_dev_np = y_dev.compute()

# Create CatBoost Pools
train_pool = Pool(X_train_np, y_train_np)
dev_pool = Pool(X_dev_np, y_dev_np)

cb = CatBoostClassifier(
    loss_function="Logloss",        # Binary classification
    depth=6,
    iterations=100,
    l2_leaf_reg=0.5,
    learning_rate=0.1,
    random_seed=42,
    verbose=10,
    eval_metric="PRAUC",            # Equivalent to XGBoost's aucpr
    auto_class_weights="Balanced",  # Handle 0.16% fraud imbalance
)
cb.fit(
    train_pool,
    eval_set=dev_pool,
    early_stopping_rounds=10,
)
```
Key differences from XGBoost:
- `Logloss` replaces `reg:logistic` (binary classification)
- `PRAUC` replaces `aucpr` (same metric, different name)
- `auto_class_weights="Balanced"` addresses the 636:1 class imbalance natively — XGBoost required manual `scale_pos_weight`
- `early_stopping_rounds` replaces fixed `num_round=5`
- No `tree_method` needed — CatBoost uses symmetric trees by default
Inference Changes
```python
# scripts/predict_fraud.py
import numpy as np
from catboost import CatBoostClassifier

booster = CatBoostClassifier()
booster.load_model('/home/cdsw/model/best-catboost-model.cbm')
threshold = 0.35

def predict_fraud(args):
    # Reshape to (1, n_features) so predict_proba returns a 2-D matrix
    features = np.array(args['features']).reshape(1, -1)
    prediction = booster.predict_proba(features)[:, 1]  # P(fraud)
    if prediction[0] <= threshold:
        return 0
    return 1
```
Key change: CatBoost’s predict_proba() returns a 2-column matrix [P(legit), P(fraud)]. Index [:, 1] extracts the fraud probability, replacing XGBoost’s inplace_predict() which returned a single value.
Model Serialization
```python
# Save
cb.save_model("../model/best-catboost-model.cbm")

# Load
loaded = CatBoostClassifier()
loaded.load_model("../model/best-catboost-model.cbm")
```
The .cbm extension is CatBoost’s native binary format. Unlike XGBoost, the class mapping is embedded in the model file for binary classification, so no sidecar .classes.json is needed (that pattern from Signals is for multi-class).
What Stays the Same
- Dask cluster orchestration (`utils/dask_utils.py`) — unchanged
- Data loading (`dd.read_csv()`) — unchanged
- Feature engineering (drop `Time`, `StandardScaler`) — unchanged
- Train/dev/val split ratios — unchanged
- Threshold selection logic — unchanged
- CML AMP structure (`.project-metadata.yaml`, `cdsw-build.sh`) — unchanged
Phase 2: CatBoost-Native Features
Goal: Adopt CatBoost-specific capabilities that XGBoost lacks, following patterns proven in Signals.
Ordered Boosting
Signals uses posterior_sampling=True in all CatBoost configurations to prevent prediction shift — a form of target leakage where training samples influence their own gradient estimates. This is particularly valuable in the fraud dataset’s extreme imbalance regime.
```python
cb = CatBoostClassifier(
    loss_function="Logloss",
    depth=8,                     # Deeper (Signals uses 8 for complex tasks)
    iterations=500,              # More rounds (Signals uses 500)
    l2_leaf_reg=0.3,             # Lighter regularization
    learning_rate=0.08,          # Slower learning rate
    bootstrap_type="Bernoulli",  # Bernoulli subsampling
    subsample=0.8,               # 80% per iteration
    posterior_sampling=True,     # Ordered boosting (CatBoost-unique)
    min_data_in_leaf=2,          # From Signals: critical for rare classes
    random_seed=42,
    eval_metric="PRAUC",
    auto_class_weights="Balanced",
)
```
GPU caveat (from Signals): `posterior_sampling=True` and `rsm` (random subspace method) are not supported on CatBoost GPU for classification. Training must run on CPU when these are enabled. Signals handles this by detecting GPU availability and forcing CPU when ordered boosting is active.
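A minimal sketch of that fallback, assuming a plain parameter dict and a precomputed GPU-availability flag (the function name and shape are illustrative, not taken from Signals):

```python
def catboost_device_kwargs(params, gpu_available):
    """Return task_type kwargs, forcing CPU when CPU-only options are set."""
    # posterior_sampling and rsm are unsupported on GPU for classification
    cpu_only = bool(params.get("posterior_sampling")) or "rsm" in params
    if gpu_available and not cpu_only:
        return {"task_type": "GPU"}
    return {"task_type": "CPU"}
```

The returned dict can be unpacked into `CatBoostClassifier(**...)` alongside the other parameters.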
GPU Acceleration
For configurations that don’t use ordered boosting (e.g., initial exploration):
```python
def _catboost_gpu_kwargs(devices=None):
    """GPU detection pattern from Signals."""
    import torch
    if torch.cuda.is_available():
        n = torch.cuda.device_count()
        if devices is None:
            devices = ":".join(str(i) for i in range(n))
        return {"task_type": "GPU", "devices": devices}
    return {}

cb = CatBoostClassifier(
    **_catboost_gpu_kwargs(),
    # ... other params (without posterior_sampling)
)
```
TreeSHAP Feature Importance
Signals uses CatBoost’s built-in TreeSHAP for per-prediction explanations. Apply the same pattern for fraud detection:
```python
from catboost import Pool

pool = Pool(X_val_np)
shap_values = cb.get_feature_importance(
    type="ShapValues",
    data=pool,
)
# shap_values shape: (n_samples, n_features + 1)
# Last column is the base value (bias term)
feature_shap = shap_values[:, :-1]  # Per-sample, per-feature importance
```
This replaces the need for a separate SHAP library and is much faster for tree models.
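The per-sample matrix can be reduced to a global feature ranking by mean absolute SHAP value; a small helper sketch (feature names are whatever your pipeline uses):

```python
import numpy as np

def rank_features_by_shap(shap_values, feature_names):
    """Rank features by mean |SHAP|, dropping the trailing base-value column."""
    mean_abs = np.abs(shap_values[:, :-1]).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]  # descending importance
    return [(feature_names[i], float(mean_abs[i])) for i in order]
```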
Hyperparameter Search Space
Adapted from XGBoost space with CatBoost equivalents:
| XGBoost Parameter | CatBoost Equivalent | Recommended Range |
|---|---|---|
| `learning_rate` | `learning_rate` | [0.01, 0.3] |
| `gamma` | `min_data_in_leaf` | [1, 20] (integer) |
| `max_depth` | `depth` | [4, 10] |
| `min_child_weight` | `min_data_in_leaf` | (merged with `gamma`) |
| `max_delta_step` | (not needed) | — |
| `subsample` | `subsample` | [0.6, 1.0] |
| `lambda` (L2) | `l2_leaf_reg` | [0.1, 10] (log-uniform) |
| `alpha` (L1) | (not directly available) | — |
CatBoost-specific parameters to add:
| Parameter | Range | Description |
|---|---|---|
| `random_strength` | [0.1, 10] | Score randomization at each split |
| `bagging_temperature` | [0, 5] | Bayesian bootstrap intensity |
| `border_count` | [32, 255] | Number of splits per feature |
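A simple way to explore these ranges is a random-search draw; a NumPy sketch using the endpoints tabulated above (treating `random_strength` as log-uniform, like `l2_leaf_reg`, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_catboost_params():
    """Draw one CatBoost configuration from the ranges above."""
    return {
        "learning_rate": float(rng.uniform(0.01, 0.3)),
        "depth": int(rng.integers(4, 11)),
        "min_data_in_leaf": int(rng.integers(1, 21)),
        "subsample": float(rng.uniform(0.6, 1.0)),
        "l2_leaf_reg": float(10 ** rng.uniform(-1, 1)),      # log-uniform [0.1, 10]
        "random_strength": float(10 ** rng.uniform(-1, 1)),  # assumed log-uniform
        "bagging_temperature": float(rng.uniform(0, 5)),
        "border_count": int(rng.integers(32, 256)),
    }
```

Each sampled dict can be passed directly as `CatBoostClassifier(**sample_catboost_params())` inside the search loop.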
Phase 3: Distributed Strategy
Goal: Restore memory-constrained distributed training capability without Dask integration.
CatBoost does not support Dask. Three strategies can replace xgb.dask.train() for datasets that exceed driver memory:
Strategy A: Dask Preprocessing + Quantized Pools
Use Dask for data loading and preprocessing, then convert to CatBoost’s quantized pool format which uses ~8x less memory than raw floats:
```python
import dask.dataframe as dd
import pandas as pd
from catboost import Pool

# 1. Dask loads and preprocesses (distributed)
dask_df = dd.read_csv("large_dataset.csv", assume_missing=True)
dask_df = dask_df.drop(columns=["Time"])
# ... feature engineering on Dask ...

# 2. Save preprocessed partitions to disk
dask_df.to_parquet("/tmp/preprocessed/", engine="pyarrow")

# 3. CatBoost builds and quantizes the Pool (memory-efficient)
#    (Pool's file reader expects CatBoost's dsv/libsvm formats, not parquet,
#    so load the partitions via pandas first)
train_df = pd.read_parquet("/tmp/preprocessed/")
train_pool = Pool(train_df.drop(columns=["Class"]), train_df["Class"])
train_pool.quantize()  # ~8x memory reduction
```
Strategy B: CatBoost Multi-Node Training
CatBoost has its own distributed training mode via --node-count:
```shell
# On each CML worker:
catboost fit \
  --loss-function Logloss \
  --learn-set /data/train.tsv \
  --node-count 3 \
  --node-port 8788 \
  --file-with-hosts hosts.txt
```
This could be orchestrated via `dask_utils.py`'s worker-launching pattern — replace `!dask-worker` with `!catboost fit --node-count N`.
Strategy C: Partition-Level Ensembling
Train separate CatBoost models on each Dask partition, then ensemble predictions:
```python
# On each Dask worker
def train_partition(partition_df):
    cb = CatBoostClassifier(...)
    cb.fit(partition_df[features], partition_df["Class"])
    return cb

# Collect the trained models back to the driver, then average predictions
futures = client.map(train_partition, dask_df.to_delayed())
models = client.gather(futures)
predictions = np.mean([m.predict_proba(X_val) for m in models], axis=0)
```
This is the simplest approach but sacrifices some accuracy compared to true distributed training where the full dataset informs each tree.
Recommended Strategy
For datasets up to ~10M rows: Strategy A (quantized pools). CatBoost’s quantization reduces a 10M × 29 float64 matrix from ~2.3 GB to ~290 MB, fitting comfortably in a single 4 GiB CML session.
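The memory arithmetic behind those figures can be checked directly (assuming 8-bit quantization, i.e. roughly one byte per value, consistent with the ~8x claim):

```python
rows, cols = 10_000_000, 29
raw_gb = rows * cols * 8 / 1e9        # float64 = 8 bytes per value
quantized_mb = rows * cols * 1 / 1e6  # ~1 byte per value after quantization
print(f"raw: {raw_gb:.2f} GB, quantized: {quantized_mb:.0f} MB")
```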
For datasets beyond ~10M rows: Strategy B (multi-node). Requires modifying dask_utils.py to launch CatBoost worker processes instead of Dask workers.
Phase 4: Signals Patterns (Advanced)
Goal: Adopt advanced techniques from Signals that would improve fraud detection beyond what basic CatBoost provides.
Self-Training
Signals implements multi-round self-training where high-confidence predictions are injected back as training labels. This is valuable when labeled fraud data is scarce:
```python
# Round 1: Train on labeled data
cb.fit(X_train, y_train)
proba = cb.predict_proba(X_unlabeled)

# Pseudo-label high-confidence predictions
confident_mask = proba.max(axis=1) >= 0.80  # Signals default threshold
pseudo_X = X_unlabeled[confident_mask]
pseudo_y = proba[confident_mask].argmax(axis=1)

# Round 2: Retrain with pseudo-labels added
X_aug = np.vstack([X_train, pseudo_X])
y_aug = np.concatenate([y_train, pseudo_y])
cb.fit(X_aug, y_aug)
```
Signals achieves 99.4% accuracy with self-training vs 81.6% without — though the fraud detection domain may see smaller gains since the signal-to-noise ratio in transaction features is higher than in column metadata.
Synthetic Data Augmentation
Signals generates 50 synthetic variants per category to handle classes with only 2 real samples. For fraud detection, a similar approach could generate synthetic fraud transactions:
1. Fit a distribution to real fraud transactions (V1–V28 are already PCA-transformed, so a multivariate normal is reasonable).
2. Generate `N` synthetic fraud samples.
3. Include synthetic samples in training with reduced sample weight.
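Steps 1 and 2 can be sketched with a multivariate-normal fit; the function name and the 0.3 weight are illustrative choices, not Signals defaults:

```python
import numpy as np

def synthesize_fraud(X_fraud, n_samples, seed=0):
    """Fit a multivariate normal to real fraud rows and draw synthetic variants."""
    rng = np.random.default_rng(seed)
    mean = X_fraud.mean(axis=0)
    cov = np.cov(X_fraud, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Step 3: downweight synthetic rows during training, e.g.
# weights = np.concatenate([np.ones(len(X_real)), np.full(n_syn, 0.3)])
# cb.fit(X_aug, y_aug, sample_weight=weights)
```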
Evidence Fusion
The Signals architecture combines five evidence sources via Dempster-Shafer theory. For fraud detection, the CatBoost model could become one source alongside:
| Source | Mass Function | Discount |
|---|---|---|
| CatBoost probability | catboost_to_mass() | 0.15 (Signals default for well-calibrated CatBoost) |
| Pattern detection | Rule-based (card number regex, etc.) | 0.10 (high-reliability rules) |
| Amount anomaly | Z-score of transaction amount | 0.30 |
| Velocity check | Transactions per time window | 0.30 |
This produces belief intervals [Bel(fraud), Pl(fraud)] instead of a single confidence score — uncertainty-aware decisions where the gap between belief and plausibility flags transactions needing human review. The evidence sources and risk levels map to the SIGDG Financial Transaction Ontology, which provides BFO-grounded categories for transaction risk classification, regulatory scope, and fraud detection processes.
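To make the fusion concrete, here is a hedged two-source sketch of Dempster's rule over the frame {fraud, legit} with an uncommitted mass on Θ (the real pipeline fuses five discounted sources; the function and dict keys are illustrative):

```python
def combine_dempster(m1, m2):
    """Combine two mass functions over {fraud, legit, theta} via Dempster's rule."""
    # Conflicting mass: one source says fraud while the other says legit
    conflict = m1["fraud"] * m2["legit"] + m1["legit"] * m2["fraud"]
    k = 1.0 - conflict  # normalization constant
    fused = {
        "fraud": (m1["fraud"] * m2["fraud"] + m1["fraud"] * m2["theta"]
                  + m1["theta"] * m2["fraud"]) / k,
        "legit": (m1["legit"] * m2["legit"] + m1["legit"] * m2["theta"]
                  + m1["theta"] * m2["legit"]) / k,
    }
    fused["theta"] = 1.0 - fused["fraud"] - fused["legit"]
    # Belief is committed mass; plausibility adds the uncommitted mass
    bel = fused["fraud"]
    pl = fused["fraud"] + fused["theta"]
    return fused, (bel, pl)
```

A wide [Bel, Pl] gap after fusion signals disagreement or missing evidence, which is exactly the case routed to human review.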
Migration Checklist
Phase 1 (minimum viable)
- Replace `xgboost` with `catboost` in `requirements.txt`
- Add `.compute()` calls to collect Dask arrays before training
- Replace `xgb.dask.train()` with `CatBoostClassifier.fit()`
- Replace `xgb.dask.DaskDMatrix` with `catboost.Pool`
- Update `predict_fraud.py`: `load_model()` + `predict_proba()[:, 1]`
- Update model path from `best-xgboost-model` to `best-catboost-model.cbm`
- Update hyperparameter search space for CatBoost parameters
- Update evaluation: CatBoost uses `PRAUC`, not `aucpr`
- Update `cdsw-build.sh` if dependency names changed
- Run threshold selection on validation set (threshold may shift)
Phase 2 (CatBoost-native)
- Enable `posterior_sampling=True` (ordered boosting)
- Add `auto_class_weights="Balanced"` for imbalance handling
- Add GPU detection with CPU fallback (from Signals)
- Replace SHAP library with CatBoost's built-in TreeSHAP
- Tune CatBoost-specific parameters (`random_strength`, `bagging_temperature`)
Phase 3 (distributed)
- Implement quantized Pool loading for large datasets
- Or: modify `dask_utils.py` to launch CatBoost multi-node workers
- Benchmark memory usage: quantized vs raw on target dataset size
Phase 4 (Signals patterns)
- Implement self-training loop with confidence threshold
- Evaluate synthetic fraud augmentation
- Prototype evidence fusion with rule-based sources
Files Modified
| File | Change |
|---|---|
| `requirements.txt` | `xgboost` → `catboost` |
| `notebooks/distributed-xgboost-with-dask.ipynb` | All training/prediction cells |
| `scripts/predict_fraud.py` | Model loading and inference API |
| `utils/dask_utils.py` | No change (Phase 1–2); modify for Phase 3B |
| `.project-metadata.yaml` | No change |
| `cdsw-build.sh` | No change (dependencies installed via `requirements.txt`) |