Dask Cluster Lifecycle
The Dask cluster is orchestrated by utils/dask_utils.py, which wraps the CML Workers API into a simple startup/shutdown sequence. The cluster is ephemeral — created at the start of a training session and torn down when training completes.
Lifecycle Sequence
Function Specifications
run_dask_cluster(num_workers, cpu, memory, nvidia_gpu=0, dashboard_port=DASHBOARD_PORT)
Entry point. Orchestrates the full cluster creation sequence by calling the four functions below in order.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
num_workers | int | — | Number of Dask worker nodes |
cpu | int | — | vCPUs per worker |
memory | int | — | GiB RAM per worker |
nvidia_gpu | int | 0 | GPUs per worker |
dashboard_port | str | CDSW_READONLY_PORT | Port for the Dask dashboard |
Returns:
{
"scheduler": [{"id": str, "app_url": str}],
"workers": [{"id": str, "app_url": str}, ...],
"scheduler_address": "tcp://{ip}:8786",
"dashboard_address": "https://{domain}/status"
}
run_scheduler(dashboard_port, num_workers=1, cpu=1, memory=2)
Launches a single CML worker running dask-scheduler --host 0.0.0.0 --dashboard-address 127.0.0.1:{dashboard_port}. Waits up to 90 seconds for the scheduler to become available. Raises RuntimeError if the scheduler fails to launch.
get_scheduler_url(dask_scheduler)
Queries workers_v1.list_workers() to find the scheduler’s IP address by matching worker IDs. Returns tcp://{ip}:8786.
get_dashboard_url(dask_scheduler)
Constructs the dashboard URL from the scheduler worker’s app_url field, appending /status.
run_dask_workers(scheduler_url, num_workers, cpu, memory, nvidia_gpu=0)
Launches N CML workers, each running dask-worker {scheduler_url}. Raises RuntimeError if any worker fails to launch.
Shutdown
After training completes, all cluster sessions must be explicitly stopped:
from cml import workers_v1
workers_v1.stop_workers(
*[w["id"] for w in dask_cluster["scheduler"] + dask_cluster["workers"]]
)
This frees CML compute resources. There is no automatic cleanup — if the notebook kernel dies without calling stop_workers(), the sessions remain running until they hit the CML idle timeout.