
XGBoost Training Configuration

This page documents the fixed XGBoost parameters used during both initial training and hyperparameter tuning.

Fixed Parameters

| Parameter | Value | Rationale |
| --- | --- | --- |
| `tree_method` | `hist` | Histogram-based tree construction. Required for distributed training with Dask; the `exact` method does not support distribution. |
| `objective` | `reg:logistic` | Outputs a continuous probability in [0.0, 1.0], suitable for threshold-based binary classification. |
| `eval_metric` | `aucpr` | Area under the precision-recall curve. Appropriate for highly imbalanced datasets where accuracy is misleading (a majority-class classifier achieves 99.84%). |
| `verbosity` | `2` | Detailed logging during training. |
| `num_round` | `5` | Number of boosting rounds. Kept low for the sample dataset; increase for larger data. |
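
The fixed parameters above can be collected into a single params dict and merged with tunable hyperparameters before training. A minimal sketch (the `max_depth` entry is an illustrative tunable value, not from this page; note that the number of rounds is passed separately as `num_boost_round`, not inside the dict):

```python
# Fixed parameters from the table above.
fixed_params = {
    "tree_method": "hist",
    "objective": "reg:logistic",
    "eval_metric": "aucpr",
    "verbosity": 2,
}

# Hypothetical tunable hyperparameters (values illustrative).
tunable = {"max_depth": 6}

# Tunable values are layered on top of the fixed ones.
params = {**fixed_params, **tunable}
```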

Evaluation Protocol

Both training and development sets are monitored per boosting round:

eval_list = [(ddev, "dev"), (dtrain, "train")]

The dev AUCPR at the final boosting round is the primary metric for model selection during hyperparameter tuning.
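
Extracting that primary metric from the training history is a one-liner. A sketch, using a plain dict with made-up per-round values in place of the `OrderedDict` structure returned by training:

```python
# Illustrative history for 5 boosting rounds; only the first two
# dev values match the example shown later on this page.
history = {
    "dev": {"aucpr": [0.647, 0.822, 0.853, 0.861, 0.864]},
    "train": {"aucpr": [0.806, 0.846, 0.871, 0.880, 0.885]},
}

# Primary model-selection metric: dev AUCPR at the final round.
final_dev_aucpr = history["dev"]["aucpr"][-1]
```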

Training API

Training uses the Dask-aware XGBoost API, not the scikit-learn wrapper:

import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
ddev = xgb.dask.DaskDMatrix(client, X_dev, y_dev)

result = xgb.dask.train(
    client,
    params,          # dict of fixed + tunable parameters
    dtrain,
    num_boost_round=5,
    evals=[(ddev, "dev"), (dtrain, "train")],
)

Return Value

xgb.dask.train() returns a dict with two keys:

{
    "booster": xgb.core.Booster,   # trained model
    "history": {
        "dev": OrderedDict([("aucpr", [0.647, 0.822, ...])]),
        "train": OrderedDict([("aucpr", [0.806, 0.846, ...])])
    }
}
  • booster — the trained XGBoost model, an xgb.core.Booster object (not a scikit-learn estimator).
  • history — per-round evaluation metrics for each dataset in evals. The list length equals num_boost_round.
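
During hyperparameter tuning, the selection rule above amounts to comparing the final dev AUCPR across candidates. A hypothetical sketch, where `results` maps a candidate label to the dict returned by `xgb.dask.train()` (labels and values are made up):

```python
# Each value mimics the {"booster": ..., "history": ...} return shape;
# boosters are omitted for brevity.
results = {
    "eta=0.1": {"history": {"dev": {"aucpr": [0.60, 0.71]}}},
    "eta=0.3": {"history": {"dev": {"aucpr": [0.65, 0.82]}}},
}

# Pick the candidate with the highest final-round dev AUCPR.
best = max(results, key=lambda k: results[k]["history"]["dev"]["aucpr"][-1])
```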

Distributed Prediction

predictions = xgb.dask.predict(client, booster, dask_dataframe)

Returns a Dask collection of predictions distributed across the cluster. Call .compute() to materialize it locally as a NumPy array (or pandas object) for analysis.

For non-distributed inference (model endpoint), use booster.inplace_predict(numpy_array) instead. See Model Endpoint Contract.
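
Either prediction path yields reg:logistic probabilities in [0.0, 1.0]; turning them into class labels is a simple threshold. A sketch with made-up model outputs (the 0.5 cutoff is illustrative; on a highly imbalanced dataset the threshold is typically tuned on the dev set):

```python
# Example probabilities as a model might emit them.
probs = [0.03, 0.91, 0.47, 0.66]

threshold = 0.5
labels = [int(p >= threshold) for p in probs]
```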