# Hyperparameter Search Space

Hyperparameter tuning uses a sequential random search over eight XGBoost parameters. Trials run sequentially rather than in parallel because each trial itself trains a complete distributed model, so a single trial already occupies the full Dask cluster.
## Search Space

| Parameter | Distribution | Range | Description |
|---|---|---|---|
| learning_rate | Uniform | [0, 1] | Step size shrinkage applied after each boosting round to prevent overfitting |
| gamma | Log-Uniform | [1×10⁻⁶, 10] | Minimum loss reduction required to make a further partition on a leaf node |
| max_depth | Uniform Integer | [1, 20) | Maximum depth of each tree |
| min_child_weight | Uniform | [0, 10] | Minimum sum of instance weight (hessian) in a child node |
| max_delta_step | Uniform | [0, 10] | Maximum delta step allowed for each tree's weight estimation |
| subsample | Uniform | [0, 1] | Fraction of training instances sampled per tree |
| lambda | Uniform | [0, 1] | L2 regularization term on weights |
| alpha | Uniform | [0, 1] | L1 regularization term on weights |
## Search Strategy

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import uniform, loguniform, randint

search_space = {
    "learning_rate": uniform(0, 1),
    "gamma": loguniform(1e-6, 1e+1),
    "max_depth": randint(1, 20),
    "min_child_weight": uniform(0, 10),
    "max_delta_step": uniform(0, 10),
    "subsample": uniform(0, 1),
    "lambda": uniform(0, 1),
    "alpha": uniform(0, 1),
}

sampler = ParameterSampler(search_space, n_iter=20, random_state=42)
```
Each sample is merged with the fixed parameters and used to train a complete distributed XGBoost model. The model with the highest dev AUCPR is selected.
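The sample-merge-train-select loop can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `FIXED_PARAMS`, the reduced two-parameter space, and the placeholder score are all assumptions standing in for the real fixed-parameter dict and the distributed training/evaluation step.

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import randint, uniform

# Illustrative fixed parameters merged into every trial; these values are
# assumptions, not the notebook's actual fixed-parameter dict.
FIXED_PARAMS = {"objective": "binary:logistic", "eval_metric": "aucpr"}

# A reduced two-parameter space keeps the sketch short.
search_space = {
    "learning_rate": uniform(0, 1),
    "max_depth": randint(1, 20),
}

best_score, best_params = float("-inf"), None
for trial in ParameterSampler(search_space, n_iter=5, random_state=42):
    # Sampled values override nothing in FIXED_PARAMS; the two dicts are disjoint.
    params = {**FIXED_PARAMS, **trial}
    # The notebook trains a full distributed model at this point (e.g. via
    # xgb.dask.train(client, params, dtrain, ...)) and reads off the dev AUCPR.
    # A placeholder score stands in for that evaluation in this sketch.
    score = 1.0 - abs(params["learning_rate"] - 0.3)
    if score > best_score:
        best_score, best_params = score, params
```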
To switch to grid search, replace ParameterSampler with sklearn.model_selection.ParameterGrid and provide discrete values instead of distributions.
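A minimal sketch of that swap, with illustrative candidate values (the particular grid values are not from the notebook):

```python
from sklearn.model_selection import ParameterGrid

# Discrete candidate values replace the continuous distributions.
grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [4, 8, 12],
}

# ParameterGrid enumerates the full Cartesian product: 3 x 3 = 9 configurations.
configs = list(ParameterGrid(grid))
```

Unlike `ParameterSampler`, the number of trials is fixed by the grid itself rather than by `n_iter`.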
## `tune_xgboost()` Contract
The tuning function is defined inline in the notebook. Its effective signature and return value are:
```python
def tune_xgboost(client, dtrain, params, search_space, num_samples,
                 random_state=42) -> dict:
    """
    Returns:
        {
            "best_model": xgb.core.Booster,  # best model by dev AUCPR
            "best_params": str,              # string repr of best parameter dict
            "best_score": float,             # best dev AUCPR
            "hp_history": pd.DataFrame,      # all trials with parameters and scores
        }
    """
```
## Reference Results (Sample Dataset)
These results are from the 94,926-row sample dataset with 20 random search samples:
| Metric | Value |
|---|---|
| Best dev AUCPR (tuning) | 0.8040 |
| Validation AUCPR (holdout) | 0.9325 |
| Default model dev AUCPR (no tuning) | 0.8402 |
Results will vary with different random seeds and with the full Kaggle dataset.
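For reference, AUCPR is the area under the precision-recall curve. A quick way to sanity-check such a score outside XGBoost's built-in `aucpr` metric is scikit-learn's `average_precision_score`, which computes a closely related step-wise summary of the same curve; the labels and scores below are toy values, not from the dataset.

```python
from sklearn.metrics import average_precision_score

# Toy binary labels and predicted scores for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Step-wise area under the precision-recall curve.
aucpr = average_precision_score(y_true, y_score)
```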