# Hyperparameter Search Space

Hyperparameter tuning uses a sequential random search over eight XGBoost parameters. Trials run sequentially rather than in parallel because each trial itself trains a complete distributed model, so a single trial already occupies the full Dask cluster.
## Search Space

| Parameter | Distribution | Range | Description |
|---|---|---|---|
| learning_rate | Uniform | [0, 1] | Step size shrinkage applied after each boosting round to prevent overfitting |
| gamma | Log-Uniform | [1×10⁻⁶, 10] | Minimum loss reduction required to make a further partition on a leaf node |
| max_depth | Uniform Integer | [1, 20) | Maximum depth of each tree |
| min_child_weight | Uniform | [0, 10] | Minimum sum of instance weight (hessian) in a child node |
| max_delta_step | Uniform | [0, 10] | Maximum delta step allowed for each tree's weight estimation |
| subsample | Uniform | [0, 1] | Fraction of training instances sampled per tree |
| lambda | Uniform | [0, 1] | L2 regularization term on weights |
| alpha | Uniform | [0, 1] | L1 regularization term on weights |
## Search Strategy

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import uniform, loguniform, randint

search_space = {
    "learning_rate": uniform(0, 1),
    "gamma": loguniform(1e-6, 1e+1),
    "max_depth": randint(1, 20),
    "min_child_weight": uniform(0, 10),
    "max_delta_step": uniform(0, 10),
    "subsample": uniform(0, 1),
    "lambda": uniform(0, 1),
    "alpha": uniform(0, 1),
}

sampler = ParameterSampler(search_space, n_iter=20, random_state=42)
```
Each sample is merged with the fixed parameters and used to train a complete distributed XGBoost model. The model with the highest dev AUCPR is selected.
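The sample-merge-train-select loop can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `FIXED_PARAMS`, the reduced two-parameter space, and the placeholder score are all assumptions standing in for the real fixed-parameter dict and the distributed training/evaluation step.

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import randint, uniform

# Illustrative fixed parameters merged into every trial; these values are
# assumptions, not the notebook's actual fixed-parameter dict.
FIXED_PARAMS = {"objective": "binary:logistic", "eval_metric": "aucpr"}

# A reduced two-parameter space keeps the sketch short.
search_space = {
    "learning_rate": uniform(0, 1),
    "max_depth": randint(1, 20),
}

best_score, best_params = float("-inf"), None
for trial in ParameterSampler(search_space, n_iter=5, random_state=42):
    # Sampled values override nothing in FIXED_PARAMS; the two dicts are disjoint.
    params = {**FIXED_PARAMS, **trial}
    # The notebook trains a full distributed model at this point (e.g. via
    # xgb.dask.train(client, params, dtrain, ...)) and reads off the dev AUCPR.
    # A placeholder score stands in for that evaluation in this sketch.
    score = 1.0 - abs(params["learning_rate"] - 0.3)
    if score > best_score:
        best_score, best_params = score, params
```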
To switch to grid search, replace ParameterSampler with sklearn.model_selection.ParameterGrid and provide discrete values instead of distributions.
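A minimal sketch of that swap, with illustrative candidate values (the particular grid values are not from the notebook):

```python
from sklearn.model_selection import ParameterGrid

# Discrete candidate values replace the continuous distributions.
grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [4, 8, 12],
}

# ParameterGrid enumerates the full Cartesian product: 3 x 3 = 9 configurations.
configs = list(ParameterGrid(grid))
```

Unlike `ParameterSampler`, the number of trials is fixed by the grid itself rather than by `n_iter`.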
## `tune_xgboost()` Contract
The tuning function is defined inline in the notebook. Its effective signature and return value are:
```python
def tune_xgboost(client, dtrain, params, search_space, num_samples,
                 random_state=42) -> dict:
    """
    Returns:
        {
            "best_model": xgb.core.Booster,  # best model by dev AUCPR
            "best_params": str,              # string repr of best parameter dict
            "best_score": float,             # best dev AUCPR
            "hp_history": pd.DataFrame,      # all trials with parameters and scores
        }
    """
```
## Reference Results (Sample Dataset)
These results are from the 94,926-row sample dataset with 20 random search samples:
| Metric | Value |
|---|---|
| Best dev AUCPR (tuning) | 0.8040 |
| Validation AUCPR (holdout) | 0.9325 |
| Default model dev AUCPR (no tuning) | 0.8402 |
Results will vary with different random seeds and with the full Kaggle dataset.
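For reference, AUCPR is the area under the precision-recall curve. A quick way to sanity-check such a score outside XGBoost's built-in `aucpr` metric is scikit-learn's `average_precision_score`, which computes a closely related step-wise summary of the same curve; the labels and scores below are toy values, not from the dataset.

```python
from sklearn.metrics import average_precision_score

# Toy binary labels and predicted scores for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Step-wise area under the precision-recall curve.
aucpr = average_precision_score(y_true, y_score)
```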