Introduction

This Applied ML Prototype (AMP) demonstrates distributed XGBoost training using Dask on Cloudera Machine Learning (CML). The use case is credit card fraud detection under a memory-constrained, data-parallel paradigm: the training data is partitioned across worker nodes while the model is replicated on each, enabling training on datasets that exceed single-node memory. A Dask cluster is orchestrated on-demand via the CML Workers API, used for training, then torn down.
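That on-demand lifecycle can be sketched roughly as follows. The `launch_workers` call, its parameters, the resource sizes, and the scheduler-address placeholder are illustrative assumptions, not the AMP's actual code; consult the CML Workers API documentation for the exact signatures in your release.

```python
def launch_dask_cluster(n_workers: int) -> str:
    """Hypothetical sketch: start a Dask scheduler and n_workers Dask
    workers as CML worker sessions. Resource sizes are illustrative."""
    import cml.workers_v1 as workers  # only importable inside a CML session

    # One worker session runs dask-scheduler, listening on TCP port 8786.
    workers.launch_workers(
        n=1, cpu=2, memory=8,
        code="!dask-scheduler --host 0.0.0.0 --port 8786",
    )
    scheduler_url = "tcp://<scheduler-ip>:8786"  # resolve from the scheduler worker's IP

    # The remaining sessions run dask-worker, pointed at the scheduler.
    workers.launch_workers(
        n=n_workers, cpu=2, memory=8,
        code=f"!dask-worker {scheduler_url}",
    )
    return scheduler_url
```

The import is deferred into the function body because `cml.workers_v1` exists only inside a running CML session; after training completes, the worker sessions are stopped to release the cluster.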

This guide serves two audiences:

| If you are… | Start here |
|---|---|
| Building a distributed ML pipeline on CML using Dask or a similar framework | Architecture Reference |
| Building a validation SDK or CI/CD pipeline for ML artifacts targeting CML | Data & Model Specification and Validation Rules |

Terminology

| Term | Definition |
|---|---|
| AMP | Applied ML Prototype — a portable, declarative CML project with a .project-metadata.yaml that automates setup, dependency installation, and deployment. |
| CML Workers API | cml.workers_v1 — the Python API for launching and managing compute worker sessions within a CML project. Used here to orchestrate Dask scheduler and worker processes. |
| Dask Cluster | An ephemeral distributed compute cluster consisting of a scheduler, one or more workers, and a client. Orchestrated on CML via the Workers API. |
| Dask Scheduler | A single CML worker process running dask-scheduler, responsible for coordinating task distribution among Dask workers. Listens on TCP port 8786. |
| Dask Worker | A CML worker process running dask-worker, executing distributed computation tasks assigned by the scheduler. |
| Dask Client | A Python object (dask.distributed.Client) instantiated in the notebook session that submits work to the scheduler. |
| DaskDMatrix | An XGBoost data structure (xgb.dask.DaskDMatrix) optimized for distributed training. Partitions data across the Dask cluster. |
| Model Endpoint | A CML-hosted REST endpoint that loads a serialized XGBoost model and serves predictions via a predict_fraud(args) function. |
| Booster | The trained XGBoost model object (xgb.core.Booster). Serialized to disk as a binary file and loaded at inference time. |
| AUCPR | Area Under the Precision-Recall Curve — the primary evaluation metric, chosen over accuracy due to extreme class imbalance (0.16% fraud rate). |
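The class-imbalance argument behind the AUCPR choice is easy to verify numerically: at a 0.16% fraud rate, a degenerate model that labels every transaction legitimate still reaches 99.84% accuracy while catching no fraud at all. A minimal, self-contained illustration:

```python
# At a 0.16% fraud rate, the trivial "always legitimate" classifier
# looks excellent by accuracy and useless by recall.
n = 100_000
n_fraud = int(n * 0.0016)            # 160 fraudulent transactions
labels = [1] * n_fraud + [0] * (n - n_fraud)
preds = [0] * n                      # predict "legitimate" for everything

accuracy = sum(p == y for p, y in zip(preds, labels)) / n
recall = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1) / n_fraud

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9984
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```

Precision-recall curves, and hence AUCPR, ignore the flood of true negatives that inflates accuracy here, which is why it serves as the primary metric.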

End-to-End Lifecycle

The upstream validation SDK’s responsibility ends at structural and contract validation — ensuring the project conforms to the structure documented in the Data & Model Specification and Deployment chapters. CML handles runtime orchestration and endpoint serving.
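On the runtime side that CML handles, the training step can be sketched using the structures named in the terminology table. The hyperparameters, boosting rounds, and output path below are illustrative assumptions, not the AMP's actual configuration:

```python
def train_fraud_model(scheduler_url: str, X, y, model_path: str = "fraud_model.bin"):
    """Hypothetical sketch: distributed XGBoost training against a running
    Dask cluster. Imports are deferred since dask/xgboost are cluster-side deps."""
    import xgboost as xgb
    from dask.distributed import Client

    client = Client(scheduler_url)               # connect to dask-scheduler (port 8786)
    dtrain = xgb.dask.DaskDMatrix(client, X, y)  # partitions data across the workers

    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "eval_metric": "aucpr"},
        dtrain,
        num_boost_round=100,
    )
    booster = output["booster"]                  # xgb.core.Booster
    booster.save_model(model_path)               # serialized for the model endpoint
    return booster
```

The serialized Booster written at the end is the artifact the model endpoint later loads to serve `predict_fraud(args)` requests.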