# XGBoost Training Configuration
This page documents the fixed XGBoost parameters used during both initial training and hyperparameter tuning.
## Fixed Parameters
| Parameter | Value | Rationale |
|---|---|---|
| `tree_method` | `hist` | Histogram-based tree construction. Required for distributed training with Dask; the `exact` method does not support distribution. |
| `objective` | `reg:logistic` | Outputs a continuous probability in [0.0, 1.0], suitable for threshold-based binary classification. |
| `eval_metric` | `aucpr` | Area under the precision-recall curve. Appropriate for highly imbalanced datasets where accuracy is misleading (a majority-class classifier achieves 99.84%). |
| `verbosity` | `2` | Detailed logging during training. |
| `num_round` | `5` | Number of boosting rounds. Kept low for the sample dataset; increase for larger data. |
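In practice, the fixed values are assembled into the `params` dict passed to training. A minimal sketch; the tunable entries shown (`max_depth`, `eta`) are illustrative placeholders, not documented choices:

```python
# Fixed parameters from the table above.
fixed_params = {
    "tree_method": "hist",
    "objective": "reg:logistic",
    "eval_metric": "aucpr",
    "verbosity": 2,
}

# Hypothetical tunable values, for illustration only; the real values
# come from hyperparameter tuning.
tunable_params = {"max_depth": 6, "eta": 0.3}

params = {**fixed_params, **tunable_params}
```

Note that `num_round` is not part of `params`: it is passed separately to `xgb.dask.train()` as the `num_boost_round` keyword argument.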
## Evaluation Protocol
Both training and development sets are monitored per boosting round:
```python
eval_list = [(ddev, "dev"), (dtrain, "train")]
```
The dev AUCPR at the final boosting round is the primary metric for model selection during hyperparameter tuning.
## Training API
Training uses the Dask-aware XGBoost API, not the scikit-learn wrapper:
```python
import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
ddev = xgb.dask.DaskDMatrix(client, X_dev, y_dev)

result = xgb.dask.train(
    client,
    params,  # dict of fixed + tunable parameters
    dtrain,
    num_boost_round=5,
    evals=[(ddev, "dev"), (dtrain, "train")],
)
```
### Return Value
xgb.dask.train() returns a dict with two keys:
```python
{
    "booster": xgb.core.Booster,  # trained model
    "history": {
        "dev": OrderedDict([("aucpr", [0.647, 0.822, ...])]),
        "train": OrderedDict([("aucpr", [0.806, 0.846, ...])]),
    },
}
```
- `booster`: the trained XGBoost model, an `xgb.core.Booster` object (not a scikit-learn estimator).
- `history`: per-round evaluation metrics for each dataset in `evals`. The list length equals `num_boost_round`.
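Unpacking the return value might look like the sketch below. The history numbers here are illustrative stand-ins, not real training output:

```python
# Illustrative stand-in for the dict returned by xgb.dask.train().
result = {
    "booster": None,  # placeholder; really an xgb.core.Booster
    "history": {
        "dev": {"aucpr": [0.647, 0.822, 0.851, 0.874, 0.881]},
        "train": {"aucpr": [0.806, 0.846, 0.871, 0.890, 0.902]},
    },
}

booster = result["booster"]
history = result["history"]

# Dev AUCPR at the final boosting round: the model-selection criterion.
final_dev_aucpr = history["dev"]["aucpr"][-1]

# One history entry per boosting round (num_boost_round=5).
assert len(history["dev"]["aucpr"]) == 5
```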
## Distributed Prediction
```python
predictions = xgb.dask.predict(client, booster, dask_dataframe)
```
Returns a Dask collection of predictions distributed across the cluster; call `.compute()` to materialize them locally as NumPy for analysis.
For non-distributed inference (the model endpoint), use `booster.inplace_predict(numpy_array)` instead. See Model Endpoint Contract.
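Because `reg:logistic` emits probabilities rather than class labels, downstream consumers threshold the output themselves. A minimal sketch; the 0.5 cutoff and the probability values are illustrative, not documented choices:

```python
import numpy as np

# Illustrative probabilities, as returned by predict / inplace_predict.
probabilities = np.array([0.02, 0.97, 0.51, 0.10])

THRESHOLD = 0.5  # assumption: tune against precision/recall requirements
labels = (probabilities >= THRESHOLD).astype(int)
# labels -> array([0, 1, 1, 0])
```

On a heavily imbalanced dataset like this one, the threshold is itself a tuning knob: raising it trades recall for precision, which the AUCPR evaluation metric summarizes across all cutoffs.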