# Model Serialization Format

## Serialization
The best model from hyperparameter tuning is saved using XGBoost’s native serialization:
```python
results["best_model"].save_model("../model/best-xgboost-model")
```
| Property | Value |
|---|---|
| Path | `model/best-xgboost-model` |
| CML path | `/home/cdsw/model/best-xgboost-model` |
| Format | XGBoost binary (the default when no file extension is provided) |
| Type | `xgboost.core.Booster`, the raw XGBoost model rather than a scikit-learn wrapper |
## Deserialization
```python
import xgboost as xgb

booster = xgb.Booster(model_file="/home/cdsw/model/best-xgboost-model")
```
The model is loaded at module scope in `scripts/predict_fraud.py`, so it is loaded once when the CML Model Endpoint starts and reused across all subsequent prediction requests.
## Inference Methods
| Method | Context | Input Type | Output |
|---|---|---|---|
| `booster.inplace_predict(np.array(...))` | Production (model endpoint) | NumPy array | Float array, each value in [0.0, 1.0] |
| `xgb.dask.predict(client, booster, dask_df)` | Training (notebook) | Dask DataFrame | Distributed predictions |
The model endpoint uses `inplace_predict` because inference is single-node; no Dask cluster is required at serving time.
## Threshold Contract
The model outputs a continuous probability. A threshold is applied to produce a binary classification:
| Condition | Output | Meaning |
|---|---|---|
| `prediction[0] <= 0.35` | 0 | Not fraud |
| `prediction[0] > 0.35` | 1 | Fraud |
The threshold of 0.35 is hardcoded in `scripts/predict_fraud.py`. It was selected by analyzing the precision-recall curve on the validation set: it produces a fraud prediction rate approximately matching the base rate (~0.16%). The optimal threshold depends on the business trade-off between false positives and false negatives, and should be re-evaluated whenever the model is retrained on new data.
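The decision rule is simple enough to state as code. The `classify` helper below is illustrative, not the project's actual function; only the threshold value and the boundary behavior come from the contract:

```python
# Mirrors the value hardcoded in scripts/predict_fraud.py.
FRAUD_THRESHOLD = 0.35

def classify(score: float, threshold: float = FRAUD_THRESHOLD) -> int:
    """Map a model probability to the binary fraud label per the contract."""
    return 1 if score > threshold else 0

assert classify(0.35) == 0   # boundary value: <= 0.35 means "not fraud"
assert classify(0.90) == 1
```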
## Compatibility Notes
- The serialized model requires `xgboost >= 1.6.1` to load. Older versions may fail to deserialize.
- The model expects exactly 29 input features in the order defined by the Feature Engineering Contract.
- The model was trained with `tree_method=hist` and `objective=reg:logistic`. These settings are embedded in the serialized model and do not need to be specified at inference time.