System Overview

The system comprises four layers: a JupyterLab session (the user entry point), an ephemeral Dask cluster (distributed compute), a CML Model Endpoint (inference serving), and the CML platform services that orchestrate the other three.

Repository Layout

.
├── .project-metadata.yaml          AMP declarative config
├── cdsw-build.sh                   Model endpoint build script
├── requirements.txt                Pinned dependencies (Dask 2022.5.1, XGBoost 1.6.1)
├── setup.py                        Local utils package (cdsw-dask-utils 0.1.0)
├── data/
│   └── creditcardsample.csv        Sample dataset (94,926 rows × 31 columns)
├── model/
│   └── best-xgboost-model          Serialized XGBoost Booster (binary)
├── notebooks/
│   ├── dask-intro.ipynb            Dask concepts introduction
│   └── distributed-xgboost-with-dask.ipynb   Full ML pipeline
├── scripts/
│   ├── install_dependencies.py     Dependency installation (CML job)
│   └── predict_fraud.py            Inference endpoint function
└── utils/
    ├── __init__.py
    └── dask_utils.py               Dask cluster orchestration
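
The layout above describes requirements.txt as holding pinned versions. As a minimal sketch of how such pinning could be checked, the snippet below parses requirement lines and reports any that are not locked with `==`. The helper names (`parse_pins`, `unpinned_lines`) and the sample file contents are assumptions for illustration; only the Dask and XGBoost version numbers come from the layout, and the actual file may list more packages or use a different line format.

```python
import re
from typing import Dict, List

# Matches "name==1.2.3" and "name[extra]==1.2.3"; anything else counts as unpinned.
PIN_PATTERN = re.compile(r"^([A-Za-z0-9_.\-]+)(\[[^\]]*\])?==([^\s;#]+)")


def _requirement_lines(requirements_text: str) -> List[str]:
    """Return non-empty, non-comment lines with surrounding whitespace removed."""
    lines = [raw.strip() for raw in requirements_text.splitlines()]
    return [line for line in lines if line and not line.startswith("#")]


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Map each exactly pinned package name (lowercased) to its version."""
    pins: Dict[str, str] = {}
    for line in _requirement_lines(requirements_text):
        match = PIN_PATTERN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(3)
    return pins


def unpinned_lines(requirements_text: str) -> List[str]:
    """Return requirement lines that are not locked to an exact version."""
    return [
        line
        for line in _requirement_lines(requirements_text)
        if not PIN_PATTERN.match(line)
    ]


# Hypothetical file contents; the real requirements.txt is not shown in this document.
sample = """
# core libraries
dask==2022.5.1
xgboost==1.6.1
"""

pins = parse_pins(sample)
loose = unpinned_lines(sample + "numpy>=1.20\n")
```

Here `pins` holds the two locked versions and `loose` contains only the `numpy>=1.20` line, which a pre-deployment check could treat as an error.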

Design Principles

  1. CML Workers as compute fabric. All distributed compute runs as CML worker sessions managed via cml.workers_v1. No external cluster manager (YARN, Kubernetes scheduler) is required.

  2. Ephemeral clusters. The Dask cluster exists only for the duration of training. Workers are explicitly stopped after training completes, freeing CML resources.

  3. Notebook-driven development. Notebooks are the primary deliverables. The utility library (dask_utils.py) exists solely to simplify cluster orchestration from notebooks.

  4. Decoupled inference. The model endpoint uses standard (non-Dask) XGBoost calls. Inference does not require a distributed cluster: a single CML session loads the serialized Booster and applies a decision threshold to its predictions.

  5. Pinned dependencies. All library versions are locked in requirements.txt to ensure reproducibility across CML deployments.
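
The ephemeral-cluster principle (item 2) can be sketched as a context manager that guarantees workers are stopped even if training fails. This is an illustrative pattern only, not the contents of dask_utils.py: `ephemeral_cluster`, `launch_workers`, and `stop_workers` are hypothetical names, and the stand-in functions below replace real CML worker calls so the sketch runs anywhere.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def ephemeral_cluster(
    launch_workers: Callable[[int], List[str]],
    stop_workers: Callable[[List[str]], None],
    n_workers: int,
) -> Iterator[List[str]]:
    """Start workers, hand their ids to the caller, and always stop them afterwards.

    launch_workers and stop_workers are hypothetical callables standing in
    for the platform's worker API; they are assumptions, not a real interface.
    """
    worker_ids = launch_workers(n_workers)
    try:
        yield worker_ids
    finally:
        # Runs on success and on failure, so resources are always released.
        stop_workers(worker_ids)


# Stand-in implementations so the sketch runs without a CML environment.
running: List[str] = []


def fake_launch(n: int) -> List[str]:
    ids = [f"worker-{i}" for i in range(n)]
    running.extend(ids)
    return ids


def fake_stop(ids: List[str]) -> None:
    for worker_id in ids:
        running.remove(worker_id)


with ephemeral_cluster(fake_launch, fake_stop, n_workers=3) as workers:
    active_during_training = len(running)  # training would happen here

active_after_training = len(running)
```

Placing the shutdown in a `finally` clause means a failed training run cannot leave workers holding resources, which is the property the principle describes.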