Dask Cluster Lifecycle

The Dask cluster is orchestrated by utils/dask_utils.py, which wraps the CML Workers API into a simple startup/shutdown sequence. The cluster is ephemeral — created at the start of a training session and torn down when training completes.

Lifecycle Sequence

Function Specifications

run_dask_cluster(num_workers, cpu, memory, nvidia_gpu=0, dashboard_port=DASHBOARD_PORT)

Entry point. Orchestrates the full cluster creation sequence by calling the four functions below in order.

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| num_workers | int | (required) | Number of Dask worker nodes |
| cpu | int | (required) | vCPUs per worker |
| memory | int | (required) | GiB RAM per worker |
| nvidia_gpu | int | 0 | GPUs per worker |
| dashboard_port | str | CDSW_READONLY_PORT | Port for the Dask dashboard |

Returns:

{
    "scheduler": [{"id": str, "app_url": str}],
    "workers": [{"id": str, "app_url": str}, ...],
    "scheduler_address": "tcp://{ip}:8786",
    "dashboard_address": "https://{domain}/status"
}

run_scheduler(dashboard_port, num_workers=1, cpu=1, memory=2)

Launches a single CML worker running dask-scheduler --host 0.0.0.0 --dashboard-address 127.0.0.1:{dashboard_port}. Waits up to 90 seconds for the scheduler to become available. Raises RuntimeError if the scheduler fails to launch.
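The 90-second wait can be implemented with a simple polling loop. The sketch below shows only the timeout pattern; the actual launch call through the CML Workers API is elided, and the probe callable stands in for whatever availability check the real code performs:

```python
import time

def wait_for(probe, timeout=90, interval=2):
    """Poll probe() until it returns True; raise RuntimeError on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    raise RuntimeError("Scheduler failed to launch within %d seconds" % timeout)
```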

get_scheduler_url(dask_scheduler)

Queries workers_v1.list_workers() to find the scheduler’s IP address by matching worker IDs. Returns tcp://{ip}:8786.
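The ID-matching step can be sketched as a pure function over the data the API returns. This hypothetical helper takes the list_workers() rows as an argument so it can be shown without a CML session; the "ip_address" field name is an assumption for illustration:

```python
def scheduler_tcp_url(dask_scheduler, listed_workers):
    # dask_scheduler: the [{"id": ..., "app_url": ...}] entry from launch
    # listed_workers: rows from workers_v1.list_workers()
    sched_id = dask_scheduler[0]["id"]
    for row in listed_workers:
        if row["id"] == sched_id:
            # 8786 is the default Dask scheduler port
            return "tcp://{}:8786".format(row["ip_address"])
    raise RuntimeError("Scheduler {} not found in worker list".format(sched_id))
```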

get_dashboard_url(dask_scheduler)

Constructs the dashboard URL from the scheduler worker’s app_url field, appending /status.
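The URL construction amounts to appending a path segment to the app_url. A minimal sketch (the example domain is illustrative):

```python
def dashboard_url_from(app_url):
    # e.g. "https://dask-abc.example.site/" -> "https://dask-abc.example.site/status"
    return app_url.rstrip("/") + "/status"
```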

run_dask_workers(scheduler_url, num_workers, cpu, memory, nvidia_gpu=0)

Launches num_workers CML workers, each running dask-worker {scheduler_url}. Raises RuntimeError if any worker fails to launch.
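The launch-and-check loop can be sketched as follows. The launch parameter stands in for the CML Workers API launch call and is injected here only so the failure handling can be shown without a CML session; it is not part of the real signature:

```python
def launch_dask_workers(scheduler_url, num_workers, cpu, memory,
                        nvidia_gpu=0, launch=None):
    # launch(...) is assumed to return a worker descriptor, or a falsy
    # value when the session fails to start
    workers = []
    for i in range(num_workers):
        worker = launch(cpu=cpu, memory=memory, nvidia_gpu=nvidia_gpu,
                        code="!dask-worker {}".format(scheduler_url))
        if not worker:
            raise RuntimeError(
                "Dask worker {}/{} failed to launch".format(i + 1, num_workers))
        workers.append(worker)
    return workers
```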

Shutdown

After training completes, all cluster sessions must be explicitly stopped:

from cml import workers_v1

workers_v1.stop_workers(
    *[w["id"] for w in dask_cluster["scheduler"] + dask_cluster["workers"]]
)

This frees CML compute resources. There is no automatic cleanup — if the notebook kernel dies without calling stop_workers(), the sessions remain running until they hit the CML idle timeout.
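Because cleanup is manual, it is safer to tie teardown to the training block with try/finally. The helper below is a hypothetical convenience (not part of dask_utils) that collects every session ID from the dict returned by run_dask_cluster:

```python
def cluster_worker_ids(dask_cluster):
    # Scheduler entry first, then workers; stop_workers() takes the IDs as *args
    return [w["id"] for w in dask_cluster["scheduler"] + dask_cluster["workers"]]

# Usage inside a CML session:
#
#     try:
#         run_training(dask_cluster["scheduler_address"])
#     finally:
#         workers_v1.stop_workers(*cluster_worker_ids(dask_cluster))
```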