System Overview
The system comprises four layers: a JupyterLab session (user entry point), an ephemeral Dask cluster (distributed compute), a CML Model Endpoint (inference serving), and the CML platform services that orchestrate everything.
Repository Layout
.
├── .project-metadata.yaml                      AMP declarative config
├── cdsw-build.sh                               Model endpoint build script
├── requirements.txt                            Pinned dependencies (Dask 2022.5.1, XGBoost 1.6.1)
├── setup.py                                    Local utils package (cdsw-dask-utils 0.1.0)
├── data/
│   └── creditcardsample.csv                    Sample dataset (94,926 rows × 31 columns)
├── model/
│   └── best-xgboost-model                      Serialized XGBoost Booster (binary)
├── notebooks/
│   ├── dask-intro.ipynb                        Dask concepts introduction
│   └── distributed-xgboost-with-dask.ipynb     Full ML pipeline
├── scripts/
│   ├── install_dependencies.py                 Dependency installation (CML job)
│   └── predict_fraud.py                        Inference endpoint function
└── utils/
    ├── __init__.py
    └── dask_utils.py                           Dask cluster orchestration
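The role of dask_utils.py — launch CML workers for a Dask cluster, then tear them down — can be sketched as a context manager. The launch/stop callables below are injected stand-ins for the CML workers API (e.g. functions wrapping cml.workers_v1); the function names and signatures here are illustrative assumptions, not the actual library interface.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(launch, stop, n_workers=3):
    """Run a block of code against a short-lived worker pool.

    ``launch`` and ``stop`` are caller-supplied callables wrapping the
    CML workers API (hypothetical here); injecting them keeps the
    lifecycle logic testable outside a CML session.
    """
    workers = launch(n=n_workers)   # start scheduler/worker sessions
    try:
        yield workers               # caller trains against the cluster
    finally:
        stop(workers)               # always free CML resources, even on error
```

The `finally` clause is the point of the pattern: worker sessions are released whether training succeeds or raises, matching the "ephemeral clusters" principle described below.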
Design Principles
- CML Workers as compute fabric. All distributed compute runs as CML worker sessions managed via cml.workers_v1. No external cluster manager (YARN, Kubernetes scheduler) is required.
- Ephemeral clusters. The Dask cluster exists only for the duration of training. Workers are explicitly stopped after training completes, freeing CML resources.
- Notebook-driven development. Notebooks are the primary deliverables. The utility library (dask_utils.py) exists solely to simplify cluster orchestration from notebooks.
- Decoupled inference. The model endpoint uses standard (non-Dask) XGBoost calls. Inference does not require a distributed cluster: a single CML session loads the serialized Booster and applies a threshold.
- Pinned dependencies. All library versions are locked in requirements.txt to ensure reproducibility across CML deployments.
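The decoupled-inference principle above amounts to a small post-processing step: the serialized Booster returns a fraud probability per transaction, and a fixed threshold turns it into a binary decision. A minimal sketch of that step follows; the function name and the 0.5 default are illustrative assumptions, not taken from predict_fraud.py.

```python
def classify(scores, threshold=0.5):
    """Map raw fraud-probability scores to binary labels.

    ``threshold=0.5`` is an assumed default; the real endpoint may
    tune it for the class imbalance in the fraud dataset.
    """
    return [int(score >= threshold) for score in scores]


# In the endpoint itself, the scores would come from standard
# (non-distributed) XGBoost calls, roughly:
#   booster = xgboost.Booster()
#   booster.load_model("model/best-xgboost-model")
#   scores = booster.predict(xgboost.DMatrix(features))
```

Because prediction is a single `Booster.predict` call plus this thresholding, the endpoint runs in one CML session with no Dask cluster attached.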