# XGBoost Training Configuration
This page documents the fixed XGBoost parameters used during both initial training and hyperparameter tuning.
## Fixed Parameters
| Parameter | Value | Rationale |
|---|---|---|
| `tree_method` | `hist` | Histogram-based tree construction. Required for distributed training with Dask; the `exact` method does not support distribution. |
| `objective` | `reg:logistic` | Outputs a continuous probability in [0.0, 1.0], suitable for threshold-based binary classification. |
| `eval_metric` | `aucpr` | Area under the precision-recall curve. Appropriate for highly imbalanced datasets where accuracy is misleading (a majority-class classifier achieves 99.84%). |
| `verbosity` | `2` | Detailed logging during training. |
| `num_round` | `5` | Number of boosting rounds. Kept low for the sample dataset; increase for larger data. |
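In practice, the fixed values are assembled into the `params` dict passed to training. A minimal sketch; the tunable entries shown (`max_depth`, `eta`) are illustrative placeholders, not documented choices:

```python
# Fixed parameters from the table above.
fixed_params = {
    "tree_method": "hist",
    "objective": "reg:logistic",
    "eval_metric": "aucpr",
    "verbosity": 2,
}

# Hypothetical tunable values, for illustration only; the real values
# come from hyperparameter tuning.
tunable_params = {"max_depth": 6, "eta": 0.3}

params = {**fixed_params, **tunable_params}
```

Note that `num_round` is not part of `params`: it is passed separately to `xgb.dask.train()` as the `num_boost_round` keyword argument.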
## Evaluation Protocol
Both training and development sets are monitored per boosting round:
```python
eval_list = [(ddev, "dev"), (dtrain, "train")]
```
The dev AUCPR at the final boosting round is the primary metric for model selection during hyperparameter tuning.
## Training API
Training uses the Dask-aware XGBoost API, not the scikit-learn wrapper:
```python
import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
ddev = xgb.dask.DaskDMatrix(client, X_dev, y_dev)

result = xgb.dask.train(
    client,
    params,  # dict of fixed + tunable parameters
    dtrain,
    num_boost_round=5,
    evals=[(ddev, "dev"), (dtrain, "train")],
)
```
### Return Value
xgb.dask.train() returns a dict with two keys:
```python
{
    "booster": xgb.core.Booster,  # trained model
    "history": {
        "dev": OrderedDict([("aucpr", [0.647, 0.822, ...])]),
        "train": OrderedDict([("aucpr", [0.806, 0.846, ...])]),
    },
}
```
- `booster`: the trained XGBoost model, an `xgb.core.Booster` object (not a scikit-learn estimator).
- `history`: per-round evaluation metrics for each dataset in `evals`. The list length equals `num_boost_round`.
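Unpacking the return value might look like the sketch below. The history numbers here are illustrative stand-ins, not real training output:

```python
# Illustrative stand-in for the dict returned by xgb.dask.train().
result = {
    "booster": None,  # placeholder; really an xgb.core.Booster
    "history": {
        "dev": {"aucpr": [0.647, 0.822, 0.851, 0.874, 0.881]},
        "train": {"aucpr": [0.806, 0.846, 0.871, 0.890, 0.902]},
    },
}

booster = result["booster"]
history = result["history"]

# Dev AUCPR at the final boosting round: the model-selection criterion.
final_dev_aucpr = history["dev"]["aucpr"][-1]

# One history entry per boosting round (num_boost_round=5).
assert len(history["dev"]["aucpr"]) == 5
```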
## Distributed Prediction
```python
predictions = xgb.dask.predict(client, booster, dask_dataframe)
```
Returns a Dask collection of predictions distributed across the cluster; call `.compute()` to materialize them locally as NumPy for analysis.
For non-distributed inference (the model endpoint), use `booster.inplace_predict(numpy_array)` instead. See Model Endpoint Contract.
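Because `reg:logistic` emits probabilities rather than class labels, downstream consumers threshold the output themselves. A minimal sketch; the 0.5 cutoff and the probability values are illustrative, not documented choices:

```python
import numpy as np

# Illustrative probabilities, as returned by predict / inplace_predict.
probabilities = np.array([0.02, 0.97, 0.51, 0.10])

THRESHOLD = 0.5  # assumption: tune against precision/recall requirements
labels = (probabilities >= THRESHOLD).astype(int)
# labels -> array([0, 1, 1, 0])
```

On a heavily imbalanced dataset like this one, the threshold is itself a tuning knob: raising it trades recall for precision, which the AUCPR evaluation metric summarizes across all cutoffs.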