Dask Cluster Lifecycle

The Dask cluster is orchestrated by utils/dask_utils.py, which wraps the CML Workers API into a simple startup/shutdown sequence. The cluster is ephemeral — created at the start of a training session and torn down when training completes.

Lifecycle Sequence

Function Specifications

`run_dask_cluster(num_workers, cpu, memory, nvidia_gpu=0, dashboard_port=DASHBOARD_PORT)`

Entry point. Orchestrates the full cluster creation sequence by calling the four functions below in order.

Parameters:

Parameter	Type	Default	Description
`num_workers`	int	—	Number of Dask worker nodes
`cpu`	int	—	vCPUs per worker
`memory`	int	—	GiB RAM per worker
`nvidia_gpu`	int	0	GPUs per worker
`dashboard_port`	str	`CDSW_READONLY_PORT`	Port for the Dask dashboard

Returns:

{
    "scheduler": [{"id": str, "app_url": str}],
    "workers": [{"id": str, "app_url": str}, ...],
    "scheduler_address": "tcp://{ip}:8786",
    "dashboard_address": "https://{domain}/status"
}

`run_scheduler(dashboard_port, num_workers=1, cpu=1, memory=2)`

Launches a single CML worker running dask-scheduler --host 0.0.0.0 --dashboard-address 127.0.0.1:{dashboard_port}. Waits up to 90 seconds for the scheduler to become available. Raises RuntimeError if the scheduler fails to launch.

`get_scheduler_url(dask_scheduler)`

Queries workers_v1.list_workers() to find the scheduler’s IP address by matching worker IDs. Returns tcp://{ip}:8786.

`get_dashboard_url(dask_scheduler)`

Constructs the dashboard URL from the scheduler worker’s app_url field, appending /status.

`run_dask_workers(scheduler_url, num_workers, cpu, memory, nvidia_gpu=0)`

Launches N CML workers, each running dask-worker {scheduler_url}. Raises RuntimeError if any worker fails to launch.

Shutdown

After training completes, all cluster sessions must be explicitly stopped:

from cml import workers_v1

workers_v1.stop_workers(
    *[w["id"] for w in dask_cluster["scheduler"] + dask_cluster["workers"]]
)

This frees CML compute resources. There is no automatic cleanup — if the notebook kernel dies without calling stop_workers(), the sessions remain running until they hit the CML idle timeout.

Keyboard shortcuts

Distributed XGBoost with Dask on CML — Developer's Guide