CatBoost Migration Roadmap
This roadmap details how to migrate the distributed fraud detection pipeline from XGBoost to CatBoost, drawing on the production CatBoost patterns established in the Signals-360 metadata classification project. Signals uses CatBoost as one of five evidence sources in a Dempster-Shafer fusion pipeline for classifying database columns into a 174-category SIGDG taxonomy — a harder problem (extreme class imbalance, 2 samples per category) that validates CatBoost’s suitability for the comparatively straightforward binary fraud detection task here.
Why CatBoost
| Concern | XGBoost (current) | CatBoost (target) |
|---|---|---|
| Ordered boosting | Not available | posterior_sampling=True — prevents prediction shift in low-data regimes |
| Categorical features | Requires one-hot or label encoding | Native categorical handling via cat_features parameter |
| Regularization | L1/L2 on weights | L2 on leaves + Bayesian priors via ordered boosting |
| GPU training | tree_method="gpu_hist" | task_type="GPU" with automatic multi-GPU |
| Dask integration | First-class (xgb.dask.train, DaskDMatrix) | None — this is the primary migration challenge |
| Model format | Binary (.ubj) | CatBoost binary (.cbm) + sidecar class mapping (.classes.json) |
| Feature importance | Built-in SHAP | Built-in TreeSHAP via get_feature_importance(type="ShapValues") |
Migration Phases
Phase 1: Drop-in Replacement
Goal: Replace XGBoost with CatBoost while keeping the existing Dask infrastructure for data loading and preprocessing. CatBoost trains on collected (non-distributed) data.
Architecture Change
The current pipeline uses Dask end-to-end: DaskDMatrix → xgb.dask.train() → distributed predictions. CatBoost has no Dask integration, so the architecture splits into two stages: Dask for ETL, CatBoost for training.
Dependency Changes
```diff
# requirements.txt
- xgboost==1.6.1
+ catboost>=1.2.7
dask[complete]==2022.5.1
scikit-learn==1.0.2
numpy==1.22.4
```
Training Code
Replace `xgb.dask.train()` with `CatBoostClassifier.fit()`:

```python
from catboost import CatBoostClassifier, Pool

# Collect Dask arrays to NumPy (data must fit in driver memory)
X_train_np = X_train.compute()
y_train_np = y_train.compute()
X_dev_np = X_dev.compute()
y_dev_np = y_dev.compute()

# Create CatBoost Pools
train_pool = Pool(X_train_np, y_train_np)
dev_pool = Pool(X_dev_np, y_dev_np)

cb = CatBoostClassifier(
    loss_function="Logloss",        # Binary classification
    depth=6,
    iterations=100,
    l2_leaf_reg=0.5,
    learning_rate=0.1,
    random_seed=42,
    verbose=10,
    eval_metric="PRAUC",            # Equivalent to XGBoost's aucpr
    auto_class_weights="Balanced",  # Handle 0.16% fraud imbalance
)
cb.fit(
    train_pool,
    eval_set=dev_pool,
    early_stopping_rounds=10,
)
```
Key differences from XGBoost:
- `Logloss` replaces `reg:logistic` (binary classification)
- `PRAUC` replaces `aucpr` (same metric, different name)
- `auto_class_weights="Balanced"` addresses the 636:1 class imbalance natively — XGBoost required manual `scale_pos_weight`
- `early_stopping_rounds` replaces fixed `num_round=5`
- No `tree_method` needed — CatBoost uses symmetric trees by default
Inference Changes
```python
# scripts/predict_fraud.py
import numpy as np
from catboost import CatBoostClassifier

booster = CatBoostClassifier()
booster.load_model('/home/cdsw/model/best-catboost-model.cbm')
threshold = 0.35

def predict_fraud(args):
    # Reshape to (1, n_features) so predict_proba returns a 2-D matrix
    features = np.array(args['features']).reshape(1, -1)
    prediction = booster.predict_proba(features)[:, 1]  # P(fraud)
    if prediction[0] <= threshold:
        return 0
    return 1
```
Key change: CatBoost’s predict_proba() returns a 2-column matrix [P(legit), P(fraud)]. Index [:, 1] extracts the fraud probability, replacing XGBoost’s inplace_predict() which returned a single value.
Model Serialization
```python
# Save
cb.save_model("../model/best-catboost-model.cbm")

# Load
loaded = CatBoostClassifier()
loaded.load_model("../model/best-catboost-model.cbm")
```
The .cbm extension is CatBoost’s native binary format. Unlike XGBoost, the class mapping is embedded in the model file for binary classification, so no sidecar .classes.json is needed (that pattern from Signals is for multi-class).
What Stays the Same
- Dask cluster orchestration (`utils/dask_utils.py`) — unchanged
- Data loading (`dd.read_csv()`) — unchanged
- Feature engineering (drop `Time`, `StandardScaler`) — unchanged
- Train/dev/val split ratios — unchanged
- Threshold selection logic — unchanged
- CML AMP structure (`.project-metadata.yaml`, `cdsw-build.sh`) — unchanged
Phase 2: CatBoost-Native Features
Goal: Adopt CatBoost-specific capabilities that XGBoost lacks, following patterns proven in Signals.
Ordered Boosting
Signals uses posterior_sampling=True in all CatBoost configurations to prevent prediction shift — a form of target leakage where training samples influence their own gradient estimates. This is particularly valuable in the fraud dataset’s extreme imbalance regime.
```python
cb = CatBoostClassifier(
    loss_function="Logloss",
    depth=8,                     # Deeper (Signals uses 8 for complex tasks)
    iterations=500,              # More rounds (Signals uses 500)
    l2_leaf_reg=0.3,             # Lighter regularization
    learning_rate=0.08,          # Slower learning rate
    bootstrap_type="Bernoulli",  # Bernoulli subsampling
    subsample=0.8,               # 80% per iteration
    posterior_sampling=True,     # Ordered boosting (CatBoost-unique)
    min_data_in_leaf=2,          # From Signals: critical for rare classes
    random_seed=42,
    eval_metric="PRAUC",
    auto_class_weights="Balanced",
)
```
GPU caveat (from Signals): `posterior_sampling=True` and `rsm` (random subspace method) are not supported on CatBoost GPU for classification. Training must run on CPU when these are enabled. Signals handles this by detecting GPU availability and forcing CPU when ordered boosting is active.
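A minimal sketch of that fallback, assuming a plain parameter dict and a precomputed GPU-availability flag (the function name and shape are illustrative, not taken from Signals):

```python
def catboost_device_kwargs(params, gpu_available):
    """Return task_type kwargs, forcing CPU when CPU-only options are set."""
    # posterior_sampling and rsm are unsupported on GPU for classification
    cpu_only = bool(params.get("posterior_sampling")) or "rsm" in params
    if gpu_available and not cpu_only:
        return {"task_type": "GPU"}
    return {"task_type": "CPU"}
```

The returned dict can be unpacked into `CatBoostClassifier(**...)` alongside the other parameters.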
GPU Acceleration
For configurations that don’t use ordered boosting (e.g., initial exploration):
```python
def _catboost_gpu_kwargs(devices=None):
    """GPU detection pattern from Signals."""
    import torch
    if torch.cuda.is_available():
        n = torch.cuda.device_count()
        if devices is None:
            devices = ":".join(str(i) for i in range(n))
        return {"task_type": "GPU", "devices": devices}
    return {}

cb = CatBoostClassifier(
    **_catboost_gpu_kwargs(),
    # ... other params (without posterior_sampling)
)
```
TreeSHAP Feature Importance
Signals uses CatBoost’s built-in TreeSHAP for per-prediction explanations. Apply the same pattern for fraud detection:
```python
from catboost import Pool

pool = Pool(X_val_np)
shap_values = cb.get_feature_importance(
    type="ShapValues",
    data=pool,
)
# shap_values shape: (n_samples, n_features + 1)
# Last column is the base value (bias term)
feature_shap = shap_values[:, :-1]  # Per-sample, per-feature importance
```
This replaces the need for a separate SHAP library and is much faster for tree models.
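The per-sample matrix can be reduced to a global feature ranking by mean absolute SHAP value; a small helper sketch (feature names are whatever your pipeline uses):

```python
import numpy as np

def rank_features_by_shap(shap_values, feature_names):
    """Rank features by mean |SHAP|, dropping the trailing base-value column."""
    mean_abs = np.abs(shap_values[:, :-1]).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]  # descending importance
    return [(feature_names[i], float(mean_abs[i])) for i in order]
```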
Hyperparameter Search Space
Adapted from XGBoost space with CatBoost equivalents:
| XGBoost Parameter | CatBoost Equivalent | Recommended Range |
|---|---|---|
| `learning_rate` | `learning_rate` | [0.01, 0.3] |
| `gamma` | `min_data_in_leaf` | [1, 20] (integer) |
| `max_depth` | `depth` | [4, 10] |
| `min_child_weight` | `min_data_in_leaf` | (merged with `gamma`) |
| `max_delta_step` | (not needed) | — |
| `subsample` | `subsample` | [0.6, 1.0] |
| `lambda` (L2) | `l2_leaf_reg` | [0.1, 10] (log-uniform) |
| `alpha` (L1) | (not directly available) | — |
CatBoost-specific parameters to add:
| Parameter | Range | Description |
|---|---|---|
| `random_strength` | [0.1, 10] | Score randomization at each split |
| `bagging_temperature` | [0, 5] | Bayesian bootstrap intensity |
| `border_count` | [32, 255] | Number of splits per feature |
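A simple way to explore these ranges is a random-search draw; a NumPy sketch using the endpoints tabulated above (treating `random_strength` as log-uniform, like `l2_leaf_reg`, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_catboost_params():
    """Draw one CatBoost configuration from the ranges above."""
    return {
        "learning_rate": float(rng.uniform(0.01, 0.3)),
        "depth": int(rng.integers(4, 11)),
        "min_data_in_leaf": int(rng.integers(1, 21)),
        "subsample": float(rng.uniform(0.6, 1.0)),
        "l2_leaf_reg": float(10 ** rng.uniform(-1, 1)),      # log-uniform [0.1, 10]
        "random_strength": float(10 ** rng.uniform(-1, 1)),  # assumed log-uniform
        "bagging_temperature": float(rng.uniform(0, 5)),
        "border_count": int(rng.integers(32, 256)),
    }
```

Each sampled dict can be passed directly as `CatBoostClassifier(**sample_catboost_params())` inside the search loop.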
Phase 3: Distributed Strategy
Goal: Restore memory-constrained distributed training capability without Dask integration.
CatBoost does not support Dask. Three strategies can replace xgb.dask.train() for datasets that exceed driver memory:
Strategy A: Dask Preprocessing + Quantized Pools
Use Dask for data loading and preprocessing, then convert to CatBoost’s quantized pool format which uses ~8x less memory than raw floats:
```python
import dask.dataframe as dd
import pandas as pd
from catboost import Pool

# 1. Dask loads and preprocesses (distributed)
dask_df = dd.read_csv("large_dataset.csv", assume_missing=True)
dask_df = dask_df.drop(columns=["Time"])
# ... feature engineering on Dask ...

# 2. Save preprocessed partitions to disk
dask_df.to_parquet("/tmp/preprocessed/", engine="pyarrow")

# 3. CatBoost builds and quantizes the Pool (memory-efficient)
#    (Pool's file reader expects CatBoost's dsv/libsvm formats, not parquet,
#    so load the partitions via pandas first)
train_df = pd.read_parquet("/tmp/preprocessed/")
train_pool = Pool(train_df.drop(columns=["Class"]), train_df["Class"])
train_pool.quantize()  # ~8x memory reduction
```
Strategy B: CatBoost Multi-Node Training
CatBoost has its own distributed training mode via --node-count:
```shell
# On each CML worker:
catboost fit \
  --loss-function Logloss \
  --learn-set /data/train.tsv \
  --node-count 3 \
  --node-port 8788 \
  --file-with-hosts hosts.txt
```
This could be orchestrated via `dask_utils.py`'s worker-launching pattern — replace `!dask-worker` with `!catboost fit --node-count N`.
Strategy C: Partition-Level Ensembling
Train separate CatBoost models on each Dask partition, then ensemble predictions:
```python
# On each Dask worker
def train_partition(partition_df):
    cb = CatBoostClassifier(...)
    cb.fit(partition_df[features], partition_df["Class"])
    return cb

# Collect the trained models back to the driver, then average predictions
futures = client.map(train_partition, dask_df.to_delayed())
models = client.gather(futures)
predictions = np.mean([m.predict_proba(X_val) for m in models], axis=0)
```
This is the simplest approach but sacrifices some accuracy compared to true distributed training where the full dataset informs each tree.
Recommended Strategy
For datasets up to ~10M rows: Strategy A (quantized pools). CatBoost’s quantization reduces a 10M × 29 float64 matrix from ~2.3 GB to ~290 MB, fitting comfortably in a single 4 GiB CML session.
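The memory arithmetic behind those figures can be checked directly (assuming 8-bit quantization, i.e. roughly one byte per value, consistent with the ~8x claim):

```python
rows, cols = 10_000_000, 29
raw_gb = rows * cols * 8 / 1e9        # float64 = 8 bytes per value
quantized_mb = rows * cols * 1 / 1e6  # ~1 byte per value after quantization
print(f"raw: {raw_gb:.2f} GB, quantized: {quantized_mb:.0f} MB")
```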
For datasets beyond ~10M rows: Strategy B (multi-node). Requires modifying dask_utils.py to launch CatBoost worker processes instead of Dask workers.
Phase 4: Signals Patterns (Advanced)
Goal: Adopt advanced techniques from Signals that would improve fraud detection beyond what basic CatBoost provides.
Self-Training
Signals implements multi-round self-training where high-confidence predictions are injected back as training labels. This is valuable when labeled fraud data is scarce:
```python
# Round 1: Train on labeled data
cb.fit(X_train, y_train)
proba = cb.predict_proba(X_unlabeled)

# Pseudo-label high-confidence predictions
confident_mask = proba.max(axis=1) >= 0.80  # Signals default threshold
pseudo_X = X_unlabeled[confident_mask]
pseudo_y = proba[confident_mask].argmax(axis=1)

# Round 2: Retrain with pseudo-labels added
X_aug = np.vstack([X_train, pseudo_X])
y_aug = np.concatenate([y_train, pseudo_y])
cb.fit(X_aug, y_aug)
```
Signals achieves 99.4% accuracy with self-training vs 81.6% without — though the fraud detection domain may see smaller gains since the signal-to-noise ratio in transaction features is higher than in column metadata.
Synthetic Data Augmentation
Signals generates 50 synthetic variants per category to handle classes with only 2 real samples. For fraud detection, a similar approach could generate synthetic fraud transactions:
1. Fit a distribution to real fraud transactions (V1–V28 are already PCA-transformed, so a multivariate normal is reasonable).
2. Generate `N` synthetic fraud samples.
3. Include synthetic samples in training with reduced sample weight.
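Steps 1 and 2 can be sketched with a multivariate-normal fit; the function name and the 0.3 weight are illustrative choices, not Signals defaults:

```python
import numpy as np

def synthesize_fraud(X_fraud, n_samples, seed=0):
    """Fit a multivariate normal to real fraud rows and draw synthetic variants."""
    rng = np.random.default_rng(seed)
    mean = X_fraud.mean(axis=0)
    cov = np.cov(X_fraud, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Step 3: downweight synthetic rows during training, e.g.
# weights = np.concatenate([np.ones(len(X_real)), np.full(n_syn, 0.3)])
# cb.fit(X_aug, y_aug, sample_weight=weights)
```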
Evidence Fusion
The Signals architecture combines five evidence sources via Dempster-Shafer theory. For fraud detection, the CatBoost model could become one source alongside:
| Source | Mass Function | Discount |
|---|---|---|
| CatBoost probability | catboost_to_mass() | 0.15 (Signals default for well-calibrated CatBoost) |
| Pattern detection | Rule-based (card number regex, etc.) | 0.10 (high-reliability rules) |
| Amount anomaly | Z-score of transaction amount | 0.30 |
| Velocity check | Transactions per time window | 0.30 |
This produces belief intervals [Bel(fraud), Pl(fraud)] instead of a single confidence score — uncertainty-aware decisions where the gap between belief and plausibility flags transactions needing human review. The evidence sources and risk levels map to the SIGDG Financial Transaction Ontology, which provides BFO-grounded categories for transaction risk classification, regulatory scope, and fraud detection processes.
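To make the fusion concrete, here is a hedged two-source sketch of Dempster's rule over the frame {fraud, legit} with an uncommitted mass on Θ (the real pipeline fuses five discounted sources; the function and dict keys are illustrative):

```python
def combine_dempster(m1, m2):
    """Combine two mass functions over {fraud, legit, theta} via Dempster's rule."""
    # Conflicting mass: one source says fraud while the other says legit
    conflict = m1["fraud"] * m2["legit"] + m1["legit"] * m2["fraud"]
    k = 1.0 - conflict  # normalization constant
    fused = {
        "fraud": (m1["fraud"] * m2["fraud"] + m1["fraud"] * m2["theta"]
                  + m1["theta"] * m2["fraud"]) / k,
        "legit": (m1["legit"] * m2["legit"] + m1["legit"] * m2["theta"]
                  + m1["theta"] * m2["legit"]) / k,
    }
    fused["theta"] = 1.0 - fused["fraud"] - fused["legit"]
    # Belief is committed mass; plausibility adds the uncommitted mass
    bel = fused["fraud"]
    pl = fused["fraud"] + fused["theta"]
    return fused, (bel, pl)
```

A wide [Bel, Pl] gap after fusion signals disagreement or missing evidence, which is exactly the case routed to human review.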
Migration Checklist
Phase 1 (minimum viable)
- Replace `xgboost` with `catboost` in `requirements.txt`
- Add `.compute()` calls to collect Dask arrays before training
- Replace `xgb.dask.train()` with `CatBoostClassifier.fit()`
- Replace `xgb.dask.DaskDMatrix` with `catboost.Pool`
- Update `predict_fraud.py`: `load_model()` + `predict_proba()[:, 1]`
- Update model path from `best-xgboost-model` to `best-catboost-model.cbm`
- Update hyperparameter search space for CatBoost parameters
- Update evaluation: CatBoost uses `PRAUC`, not `aucpr`
- Update `cdsw-build.sh` if dependency names changed
- Run threshold selection on validation set (threshold may shift)
Phase 2 (CatBoost-native)
- Enable `posterior_sampling=True` (ordered boosting)
- Add `auto_class_weights="Balanced"` for imbalance handling
- Add GPU detection with CPU fallback (from Signals)
- Replace SHAP library with CatBoost's built-in TreeSHAP
- Tune CatBoost-specific parameters (`random_strength`, `bagging_temperature`)
Phase 3 (distributed)
- Implement quantized Pool loading for large datasets
- Or: modify `dask_utils.py` to launch CatBoost multi-node workers
- Benchmark memory usage: quantized vs raw on target dataset size
Phase 4 (Signals patterns)
- Implement self-training loop with confidence threshold
- Evaluate synthetic fraud augmentation
- Prototype evidence fusion with rule-based sources
Files Modified
| File | Change |
|---|---|
| `requirements.txt` | `xgboost` → `catboost` |
| `notebooks/distributed-xgboost-with-dask.ipynb` | All training/prediction cells |
| `scripts/predict_fraud.py` | Model loading and inference API |
| `utils/dask_utils.py` | No change (Phase 1–2); modify for Phase 3B |
| `.project-metadata.yaml` | No change |
| `cdsw-build.sh` | No change (dependencies installed via `requirements.txt`) |