# Hyperparameter Search Space

Hyperparameter tuning uses a sequential random search over eight XGBoost parameters. Trials run sequentially rather than in parallel because each trial trains a complete distributed model, so the Dask cluster's workers are already fully occupied by a single trial's data and computation.

## Search Space

| Parameter | Distribution | Range | Description |
|---|---|---|---|
| `learning_rate` | Uniform | [0, 1] | Step size shrinkage applied after each boosting round to prevent overfitting |
| `gamma` | Log-uniform | [1×10⁻⁶, 10] | Minimum loss reduction required to make a further partition on a leaf node |
| `max_depth` | Uniform integer | [1, 20) | Maximum depth of each tree |
| `min_child_weight` | Uniform | [0, 10] | Minimum sum of instance weight (hessian) in a child node |
| `max_delta_step` | Uniform | [0, 10] | Maximum delta step allowed for each tree's weight estimation |
| `subsample` | Uniform | [0, 1] | Fraction of training instances sampled per tree |
| `lambda` | Uniform | [0, 1] | L2 regularization term on weights |
| `alpha` | Uniform | [0, 1] | L1 regularization term on weights |

## Search Strategy

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

search_space = {
    "learning_rate": uniform(0, 1),   # uniform on [0, 1]
    "gamma": loguniform(1e-6, 1e+1),  # log-uniform on [1e-6, 10]
    "max_depth": randint(1, 20),      # integers 1..19
    "min_child_weight": uniform(0, 10),
    "max_delta_step": uniform(0, 10),
    "subsample": uniform(0, 1),
    "lambda": uniform(0, 1),          # L2 regularization
    "alpha": uniform(0, 1),           # L1 regularization
}

# Draw 20 random configurations; the fixed seed makes runs reproducible.
sampler = ParameterSampler(search_space, n_iter=20, random_state=42)
```

Each sample is merged with the fixed parameters and used to train a complete distributed XGBoost model. The model with the highest dev AUCPR is selected.
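Concretely, the selection loop can be sketched as follows. The `train_and_score` helper here is a hypothetical stand-in for what the notebook actually does (train a distributed XGBoost model and compute its dev AUCPR); the `fixed_params` values are illustrative:

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

fixed_params = {"objective": "binary:logistic", "eval_metric": "aucpr"}
search_space = {"learning_rate": uniform(0, 1), "subsample": uniform(0, 1)}

def train_and_score(params):
    # Hypothetical stand-in: in the notebook this trains a complete
    # distributed XGBoost model with `params` and returns its dev AUCPR.
    return params["learning_rate"] * params["subsample"]

best_score, best_params = float("-inf"), None
for sampled in ParameterSampler(search_space, n_iter=20, random_state=42):
    params = {**fixed_params, **sampled}  # merge fixed + sampled parameters
    score = train_and_score(params)
    if score > best_score:               # keep the best trial by dev AUCPR
        best_score, best_params = score, sampled
```

Because trials are independent, the loop body is also the natural unit to log into the `hp_history` record of all trials.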

To switch to grid search, replace ParameterSampler with sklearn.model_selection.ParameterGrid and provide discrete values instead of distributions.
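A minimal sketch of that swap, with illustrative discrete values (not values from the notebook):

```python
from sklearn.model_selection import ParameterGrid

# Discrete candidate values replace the scipy distributions.
grid = ParameterGrid({
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [4, 8, 12],
})

# ParameterGrid enumerates every combination: 3 × 3 = 9 configurations.
configs = list(grid)
```

Note that grid search cost grows multiplicatively with each parameter, so it is only practical here for a small subset of the eight parameters.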

## `tune_xgboost()` Contract

The tuning function is defined inline in the notebook. Its effective signature and return value are:

```python
def tune_xgboost(client, dtrain, params, search_space, num_samples,
                 random_state=42) -> dict:
    """
    Returns:
        {
            "best_model": xgb.core.Booster,  # best model by dev AUCPR
            "best_params": str,              # string repr of best parameter dict
            "best_score": float,             # best dev AUCPR
            "hp_history": pd.DataFrame,      # all trials with parameters and scores
        }
    """
```

## Reference Results (Sample Dataset)

These results are from the 94,926-row sample dataset with 20 random search samples:

| Metric | Value |
|---|---|
| Best dev AUCPR (tuning) | 0.8040 |
| Validation AUCPR (holdout) | 0.9325 |
| Default model dev AUCPR (no tuning) | 0.8402 |

Results will vary with different random seeds and with the full Kaggle dataset.