
# Model Serialization Format

## Serialization

The best model from hyperparameter tuning is saved using XGBoost’s native serialization:

```python
results["best_model"].save_model("../model/best-xgboost-model")
```
| Property | Value |
| --- | --- |
| Path | `model/best-xgboost-model` |
| CML path | `/home/cdsw/model/best-xgboost-model` |
| Format | XGBoost binary (default when no file extension is provided) |
| Type | `xgb.core.Booster` (the raw XGBoost model, not a scikit-learn wrapper) |

Deserialization

```python
import xgboost as xgb

booster = xgb.Booster(model_file="/home/cdsw/model/best-xgboost-model")
```

The model is loaded at module scope in scripts/predict_fraud.py, meaning it is loaded once when the CML Model Endpoint starts and reused for all subsequent prediction requests.

Inference Methods

| Method | Context | Input Type | Output |
| --- | --- | --- | --- |
| `booster.inplace_predict(np.array(...))` | Production (model endpoint) | NumPy array | Float array, each value in `[0.0, 1.0]` |
| `xgb.dask.predict(client, booster, dask_df)` | Training (notebook) | Dask DataFrame | Distributed predictions |

The model endpoint uses inplace_predict because inference is single-node — no Dask cluster is required at serving time.

Threshold Contract

The model outputs a continuous probability. A threshold is applied to produce a binary classification:

| Condition | Output | Meaning |
| --- | --- | --- |
| `prediction[0] <= 0.35` | `0` | Not fraud |
| `prediction[0] > 0.35` | `1` | Fraud |

The threshold of 0.35 is hardcoded in scripts/predict_fraud.py. It was selected by analyzing the precision-recall curve on the validation set — it produces a fraud prediction rate approximately matching the base rate (~0.16%). The optimal threshold depends on the business use case (cost of false positives vs. false negatives) and should be re-evaluated when training on new data.
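The contract above can be expressed in a few lines; the `classify` helper here is illustrative, not a function from the script:

```python
import numpy as np

THRESHOLD = 0.35  # hardcoded in scripts/predict_fraud.py

def classify(prediction: np.ndarray) -> int:
    """Map the model's probability output to the binary contract."""
    return 1 if prediction[0] > THRESHOLD else 0

# The boundary value itself classifies as "not fraud" (<= 0.35 -> 0).
assert classify(np.array([0.35])) == 0
assert classify(np.array([0.36])) == 1
```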

Compatibility Notes

- The serialized model requires `xgboost >= 1.6.1` to load; older versions may fail to deserialize.
- The model expects exactly 29 input features, in the order defined by the Feature Engineering Contract.
- The model was trained with `tree_method="hist"` and `objective="reg:logistic"`. These settings are embedded in the serialized model and do not need to be specified at inference time.
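Because a wrong-width input silently produces garbage rather than a clean error at the endpoint boundary, a guard on the 29-feature contract can be useful. A hypothetical sketch (the `validate_batch` helper is an assumption, not part of the project):

```python
import numpy as np

N_FEATURES = 29  # order defined by the Feature Engineering Contract

def validate_batch(batch) -> np.ndarray:
    """Coerce a request payload to shape (n, 29) or raise."""
    arr = np.asarray(batch, dtype=float)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # single record -> one-row batch
    if arr.ndim != 2 or arr.shape[1] != N_FEATURES:
        raise ValueError(
            f"expected {N_FEATURES} features per record, got shape {arr.shape}"
        )
    return arr
```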