
CatBoost Migration Roadmap

This roadmap details how to migrate the distributed fraud detection pipeline from XGBoost to CatBoost, drawing on the production CatBoost patterns established in the Signals-360 metadata classification project. Signals uses CatBoost as one of five evidence sources in a Dempster-Shafer fusion pipeline for classifying database columns into a 174-category SIGDG taxonomy — a harder problem (extreme class imbalance, 2 samples per category) that validates CatBoost’s suitability for the comparatively straightforward binary fraud detection task here.

Why CatBoost

| Concern | XGBoost (current) | CatBoost (target) |
| --- | --- | --- |
| Ordered boosting | Not available | posterior_sampling=True — prevents prediction shift in low-data regimes |
| Categorical features | Requires one-hot or label encoding | Native categorical handling via cat_features parameter |
| Regularization | L1/L2 on weights | L2 on leaves + Bayesian priors via ordered boosting |
| GPU training | tree_method="gpu_hist" | task_type="GPU" with automatic multi-GPU |
| Dask integration | First-class (xgb.dask.train, DaskDMatrix) | None — this is the primary migration challenge |
| Model format | Binary (.ubj) | CatBoost binary (.cbm) + sidecar class mapping (.classes.json) |
| Feature importance | Built-in SHAP | Built-in TreeSHAP via get_feature_importance(type="ShapValues") |

Migration Phases


Phase 1: Drop-in Replacement

Goal: Replace XGBoost with CatBoost while keeping the existing Dask infrastructure for data loading and preprocessing. CatBoost trains on collected (non-distributed) data.

Architecture Change

The current pipeline uses Dask end-to-end: DaskDMatrix → xgb.dask.train() → distributed predictions. CatBoost has no Dask integration, so the architecture splits into two stages: Dask for ETL, CatBoost for training.

Dependency Changes

# requirements.txt
- xgboost==1.6.1
+ catboost>=1.2.7
  dask[complete]==2022.5.1
  scikit-learn==1.0.2
  numpy==1.22.4

Training Code

Replace xgb.dask.train() with CatBoostClassifier.fit():

from catboost import CatBoostClassifier, Pool

# Collect Dask arrays to NumPy (data must fit in driver memory)
X_train_np = X_train.compute()
y_train_np = y_train.compute()
X_dev_np = X_dev.compute()
y_dev_np = y_dev.compute()

# Create CatBoost Pools
train_pool = Pool(X_train_np, y_train_np)
dev_pool = Pool(X_dev_np, y_dev_np)

cb = CatBoostClassifier(
    loss_function="Logloss",        # Binary classification
    depth=6,
    iterations=100,
    l2_leaf_reg=0.5,
    learning_rate=0.1,
    random_seed=42,
    verbose=10,
    eval_metric="PRAUC",            # Equivalent to XGBoost's aucpr
    auto_class_weights="Balanced",  # Handle 0.16% fraud imbalance
)

cb.fit(
    train_pool,
    eval_set=dev_pool,
    early_stopping_rounds=10,
)

Key differences from XGBoost:

  • Logloss replaces reg:logistic (binary classification)
  • PRAUC replaces aucpr (same metric, different name)
  • auto_class_weights="Balanced" addresses the 636:1 class imbalance natively — XGBoost required manual scale_pos_weight
  • early_stopping_rounds replaces fixed num_round=5
  • No tree_method needed — CatBoost uses symmetric trees by default
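As a sanity check on the class-weighting change, this toy sketch (illustrative numbers, not the production dataset) shows the weight that auto_class_weights="Balanced" effectively derives, alongside the XGBoost-style scale_pos_weight it replaces:

```python
import numpy as np

# Toy split at roughly the fraud dataset's imbalance (0.16% positives)
y = np.array([0] * 9984 + [1] * 16)

neg, pos = np.bincount(y)
scale_pos_weight = neg / pos            # XGBoost-style single ratio: 624.0
class_weights = {0: 1.0, 1: neg / pos}  # explicit CatBoost class_weights equivalent
```

With auto_class_weights="Balanced", CatBoost derives these weights from the training Pool automatically, so no manual computation is needed in the migration itself.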

Inference Changes

# scripts/predict_fraud.py
import numpy as np
from catboost import CatBoostClassifier

booster = CatBoostClassifier()
booster.load_model('/home/cdsw/model/best-catboost-model.cbm')
threshold = 0.35

def predict_fraud(args):
    # predict_proba expects a 2-D matrix, so reshape the single feature vector
    features = np.array(args['features']).reshape(1, -1)
    prediction = booster.predict_proba(features)[:, 1]  # P(fraud)
    if prediction[0] <= threshold:
        return 0
    return 1

Key change: CatBoost’s predict_proba() returns a 2-column matrix [P(legit), P(fraud)]. Index [:, 1] extracts the fraud probability, replacing XGBoost’s inplace_predict() which returned a single value.

Model Serialization

# Save
cb.save_model("../model/best-catboost-model.cbm")

# Load
loaded = CatBoostClassifier()
loaded.load_model("../model/best-catboost-model.cbm")

The .cbm extension is CatBoost’s native binary format. Unlike XGBoost, the class mapping is embedded in the model file for binary classification, so no sidecar .classes.json is needed (that pattern from Signals is for multi-class).

What Stays the Same

  • Dask cluster orchestration (utils/dask_utils.py) — unchanged
  • Data loading (dd.read_csv()) — unchanged
  • Feature engineering (drop Time, StandardScaler) — unchanged
  • Train/dev/val split ratios — unchanged
  • Threshold selection logic — unchanged
  • CML AMP structure (.project-metadata.yaml, cdsw-build.sh) — unchanged

Phase 2: CatBoost-Native Features

Goal: Adopt CatBoost-specific capabilities that XGBoost lacks, following patterns proven in Signals.

Ordered Boosting

Signals uses posterior_sampling=True in all CatBoost configurations to prevent prediction shift — a form of target leakage where training samples influence their own gradient estimates. This is particularly valuable in the fraud dataset’s extreme imbalance regime.

cb = CatBoostClassifier(
    loss_function="Logloss",
    depth=8,                        # Deeper (Signals uses 8 for complex tasks)
    iterations=500,                 # More rounds (Signals uses 500)
    l2_leaf_reg=0.3,               # Lighter regularization
    learning_rate=0.08,            # Slower learning rate
    bootstrap_type="Bernoulli",    # Bernoulli subsampling
    subsample=0.8,                 # 80% per iteration
    posterior_sampling=True,       # Ordered boosting (CatBoost-unique)
    min_data_in_leaf=2,            # From Signals: critical for rare classes
    random_seed=42,
    eval_metric="PRAUC",
    auto_class_weights="Balanced",
)

GPU caveat (from Signals): posterior_sampling=True and rsm (random subspace method) are not supported on CatBoost GPU for classification. Training must run on CPU when these are enabled. Signals handles this by detecting GPU availability and forcing CPU when ordered boosting is active.
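A minimal sketch of that guard (hypothetical helper name; the Signals implementation and its exact option list may differ):

```python
def resolve_task_type(params, gpu_available):
    """Pick CPU/GPU, forcing CPU when GPU-unsupported options are enabled."""
    params = dict(params)
    needs_cpu = params.get("posterior_sampling", False) or "rsm" in params
    params["task_type"] = "GPU" if gpu_available and not needs_cpu else "CPU"
    return params

# Ordered boosting wins over GPU availability
assert resolve_task_type({"posterior_sampling": True}, gpu_available=True)["task_type"] == "CPU"
assert resolve_task_type({}, gpu_available=True)["task_type"] == "GPU"
```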

GPU Acceleration

For configurations that don’t use ordered boosting (e.g., initial exploration):

def _catboost_gpu_kwargs(devices=None):
    """GPU detection pattern from Signals."""
    import torch
    if torch.cuda.is_available():
        n = torch.cuda.device_count()
        if devices is None:
            devices = ":".join(str(i) for i in range(n))
        return {"task_type": "GPU", "devices": devices}
    return {}

cb = CatBoostClassifier(
    **_catboost_gpu_kwargs(),
    # ... other params (without posterior_sampling)
)

TreeSHAP Feature Importance

Signals uses CatBoost’s built-in TreeSHAP for per-prediction explanations. Apply the same pattern for fraud detection:

from catboost import Pool

pool = Pool(X_val_np)
shap_values = cb.get_feature_importance(
    type="ShapValues",
    data=pool,
)
# shap_values shape: (n_samples, n_features + 1)
# Last column is the base value (bias term)
feature_shap = shap_values[:, :-1]  # Per-sample, per-feature importance

This replaces the need for a separate SHAP library and is much faster for tree models.
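A common follow-up is a global importance ranking by mean absolute SHAP value; sketched here on synthetic values so it runs standalone (in practice feature_shap comes from the snippet above):

```python
import numpy as np

rng = np.random.default_rng(0)
shap_values = rng.normal(size=(1000, 30))  # stand-in for get_feature_importance output
feature_shap = shap_values[:, :-1]         # drop the trailing bias column

mean_abs = np.abs(feature_shap).mean(axis=0)  # global importance per feature
top5 = np.argsort(mean_abs)[::-1][:5]         # indices of the 5 strongest features
```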

Hyperparameter Search Space

Adapted from XGBoost space with CatBoost equivalents:

| XGBoost Parameter | CatBoost Equivalent | Recommended Range |
| --- | --- | --- |
| learning_rate | learning_rate | [0.01, 0.3] |
| gamma | min_data_in_leaf | [1, 20] (integer) |
| max_depth | depth | [4, 10] |
| min_child_weight | min_data_in_leaf | (merged with gamma) |
| max_delta_step | (not needed) | |
| subsample | subsample | [0.6, 1.0] |
| lambda (L2) | l2_leaf_reg | [0.1, 10] (log-uniform) |
| alpha (L1) | (not directly available) | |

CatBoost-specific parameters to add:

| Parameter | Range | Description |
| --- | --- | --- |
| random_strength | [0.1, 10] | Score randomization at each split |
| bagging_temperature | [0, 5] | Bayesian bootstrap intensity |
| border_count | [32, 255] | Number of splits per feature |

Phase 3: Distributed Strategy

Goal: Restore memory-constrained distributed training capability without Dask integration.

CatBoost does not support Dask. Three strategies can replace xgb.dask.train() for datasets that exceed driver memory:

Strategy A: Dask Preprocessing + Quantized Pools

Use Dask for data loading and preprocessing, then convert to CatBoost’s quantized pool format which uses ~8x less memory than raw floats:

# 1. Dask loads and preprocesses (distributed)
dask_df = dd.read_csv("large_dataset.csv", assume_missing=True)
dask_df = dask_df.drop(columns=["Time"])
# ... feature engineering on Dask ...

# 2. Save preprocessed partitions to disk
dask_df.to_parquet("/tmp/preprocessed/", engine="pyarrow")

# 3. CatBoost loads the preprocessed data and quantizes (memory-efficient).
# Note: Pool does not read Parquet directly, so load via pandas first.
import pandas as pd
from catboost import Pool

df = pd.read_parquet("/tmp/preprocessed/")
train_pool = Pool(df.drop(columns=["Class"]), df["Class"])
# Quantize in place for a ~8x memory reduction during training
train_pool.quantize()

Strategy B: CatBoost Multi-Node Training

CatBoost's command-line tool ships its own distributed training mode:

# On each CML worker:
catboost run-worker --node-port 8788

# On the master node:
catboost fit \
    --loss-function Logloss \
    --learn-set /data/train.tsv \
    --node-type Master \
    --file-with-hosts hosts.txt

This could be orchestrated via dask_utils.py’s worker-launching pattern — replace !dask-worker with !catboost run-worker.

Strategy C: Partition-Level Ensembling

Train separate CatBoost models on each Dask partition, then ensemble predictions:

# On each Dask worker
def train_partition(partition_df):
    cb = CatBoostClassifier(...)
    cb.fit(partition_df[features], partition_df["Class"])
    return cb

# Build one delayed training task per partition, then collect the models
import dask
from dask import delayed

models = dask.compute(*[delayed(train_partition)(p) for p in dask_df.to_delayed()])

# Average the per-model fraud probabilities
predictions = np.mean([m.predict_proba(X_val)[:, 1] for m in models], axis=0)

This is the simplest approach but sacrifices some accuracy compared to true distributed training where the full dataset informs each tree.

For datasets up to ~10M rows: Strategy A (quantized pools). CatBoost’s quantization reduces a 10M × 29 float64 matrix from ~2.3 GB to ~290 MB, fitting comfortably in a single 4 GiB CML session.
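The figures above check out with back-of-envelope arithmetic (assuming one uint8 bin index per quantized value):

```python
rows, cols = 10_000_000, 29

raw_gb = rows * cols * 8 / 1e9        # float64: 2.32 GB (the "~2.3 GB" above)
quantized_mb = rows * cols * 1 / 1e6  # uint8 bins: 290.0 MB (the "~290 MB" above)
```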

For datasets beyond ~10M rows: Strategy B (multi-node). Requires modifying dask_utils.py to launch CatBoost worker processes instead of Dask workers.


Phase 4: Signals Patterns (Advanced)

Goal: Adopt advanced techniques from Signals that would improve fraud detection beyond what basic CatBoost provides.

Self-Training

Signals implements multi-round self-training where high-confidence predictions are injected back as training labels. This is valuable when labeled fraud data is scarce:

# Round 1: Train on labeled data
cb.fit(X_train, y_train)
proba = cb.predict_proba(X_unlabeled)

# Pseudo-label high-confidence predictions
confident_mask = proba.max(axis=1) >= 0.80  # Signals default threshold
pseudo_X = X_unlabeled[confident_mask]
pseudo_y = proba[confident_mask].argmax(axis=1)

# Round 2: Retrain with pseudo-labels added
X_aug = np.vstack([X_train, pseudo_X])
y_aug = np.concatenate([y_train, pseudo_y])
cb.fit(X_aug, y_aug)

Signals achieves 99.4% accuracy with self-training vs 81.6% without — though the fraud detection domain may see smaller gains since the signal-to-noise ratio in transaction features is higher than in column metadata.

Synthetic Data Augmentation

Signals generates 50 synthetic variants per category to handle classes with only 2 real samples. For fraud detection, a similar approach could generate synthetic fraud transactions:

  1. Fit a distribution to real fraud transactions (V1–V28 are already PCA-transformed, so multivariate normal is reasonable).
  2. Generate N synthetic fraud samples.
  3. Include synthetic samples in training with reduced sample weight.
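A minimal sketch of those three steps, on synthetic stand-in data (the 0.5 sample weight is an assumption to tune, not a Signals constant):

```python
import numpy as np

rng = np.random.default_rng(42)
fraud_X = rng.normal(size=(492, 28))  # stand-in for the real V1-V28 fraud rows

# 1. Fit a multivariate normal to the fraud class
mean = fraud_X.mean(axis=0)
cov = np.cov(fraud_X, rowvar=False)

# 2. Generate N synthetic fraud samples
synthetic_X = rng.multivariate_normal(mean, cov, size=1000)

# 3. Combine, down-weighting the synthetic rows
X_aug = np.vstack([fraud_X, synthetic_X])
w_aug = np.concatenate([np.ones(len(fraud_X)), np.full(len(synthetic_X), 0.5)])
# Pass weights at training time, e.g. Pool(X_aug, y_aug, weight=w_aug)
```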

Evidence Fusion

The Signals architecture combines five evidence sources via Dempster-Shafer theory. For fraud detection, the CatBoost model could become one source alongside:

| Source | Mass Function | Discount |
| --- | --- | --- |
| CatBoost probability | catboost_to_mass() | 0.15 (Signals default for well-calibrated CatBoost) |
| Pattern detection | Rule-based (card number regex, etc.) | 0.10 (high-reliability rules) |
| Amount anomaly | Z-score of transaction amount | 0.30 |
| Velocity check | Transactions per time window | 0.30 |

This produces belief intervals [Bel(fraud), Pl(fraud)] instead of a single confidence score — uncertainty-aware decisions where the gap between belief and plausibility flags transactions needing human review. The evidence sources and risk levels map to the SIGDG Financial Transaction Ontology, which provides BFO-grounded categories for transaction risk classification, regulatory scope, and fraud detection processes.
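To make the fusion concrete, here is a generic sketch of Dempster's rule on the binary frame {fraud, legit} with Shafer discounting; the mass assignments are illustrative, and the Signals catboost_to_mass() implementation may differ:

```python
def discount(m, alpha):
    """Shafer discounting: move a fraction alpha of each mass to the frame."""
    return {"F": (1 - alpha) * m["F"], "L": (1 - alpha) * m["L"],
            "FL": (1 - alpha) * m["FL"] + alpha}

def combine(m1, m2):
    """Dempster's rule over focal sets {F}, {L}, and {F, L} (the frame)."""
    sets = {"F": frozenset("F"), "L": frozenset("L"), "FL": frozenset("FL")}
    names = {v: k for k, v in sets.items()}
    fused = {"F": 0.0, "L": 0.0, "FL": 0.0}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = sets[a] & sets[b]
            if inter:
                fused[names[inter]] += ma * mb
            else:
                conflict += ma * mb  # mass assigned to the empty set
    return {k: v / (1.0 - conflict) for k, v in fused.items()}

# CatBoost strongly suggests fraud; the amount-anomaly source weakly agrees
m_cb = discount({"F": 0.9, "L": 0.1, "FL": 0.0}, alpha=0.15)
m_amt = discount({"F": 0.6, "L": 0.0, "FL": 0.4}, alpha=0.30)
fused = combine(m_cb, m_amt)

belief_fraud = fused["F"]                      # Bel(fraud)
plausibility_fraud = fused["F"] + fused["FL"]  # Pl(fraud)
```

The gap between belief_fraud and plausibility_fraud is the residual ignorance; a wide gap is what would route a transaction to human review.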


Migration Checklist

Phase 1 (minimum viable)

  • Replace xgboost with catboost in requirements.txt
  • Add .compute() calls to collect Dask arrays before training
  • Replace xgb.dask.train() with CatBoostClassifier.fit()
  • Replace xgb.dask.DaskDMatrix with catboost.Pool
  • Update predict_fraud.py: load_model() + predict_proba()[:, 1]
  • Update model path from best-xgboost-model to best-catboost-model.cbm
  • Update hyperparameter search space for CatBoost parameters
  • Update evaluation: CatBoost uses PRAUC not aucpr
  • Update cdsw-build.sh if dependency names changed
  • Run threshold selection on validation set (threshold may shift)
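The last checklist item, re-running threshold selection, can be sketched as a sweep over candidate thresholds on the validation set (F1 is used here as the selection criterion; the project's existing selection logic may optimize something else):

```python
import numpy as np

def select_threshold(y_val, proba_fraud, candidates=None):
    """Return the threshold maximizing F1 on the validation set."""
    if candidates is None:
        candidates = np.arange(0.05, 0.95, 0.05)
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        pred = (proba_fraud > t).astype(int)
        tp = np.sum((pred == 1) & (y_val == 1))
        fp = np.sum((pred == 1) & (y_val == 0))
        fn = np.sum((pred == 0) & (y_val == 1))
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```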

Phase 2 (CatBoost-native)

  • Enable posterior_sampling=True (ordered boosting)
  • Add auto_class_weights="Balanced" for imbalance handling
  • Add GPU detection with CPU fallback (from Signals)
  • Replace SHAP library with CatBoost’s built-in TreeSHAP
  • Tune CatBoost-specific parameters (random_strength, bagging_temperature)

Phase 3 (distributed)

  • Implement quantized Pool loading for large datasets
  • Or: modify dask_utils.py to launch CatBoost multi-node workers
  • Benchmark memory usage: quantized vs raw on target dataset size

Phase 4 (Signals patterns)

  • Implement self-training loop with confidence threshold
  • Evaluate synthetic fraud augmentation
  • Prototype evidence fusion with rule-based sources

Files Modified

| File | Change |
| --- | --- |
| requirements.txt | xgboost → catboost |
| notebooks/distributed-xgboost-with-dask.ipynb | All training/prediction cells |
| scripts/predict_fraud.py | Model loading and inference API |
| utils/dask_utils.py | No change (Phase 1–2); modify for Phase 3B |
| .project-metadata.yaml | No change |
| cdsw-build.sh | No change (dependencies installed via requirements.txt) |