Introduction
This Applied ML Prototype (AMP) demonstrates distributed XGBoost training using Dask on Cloudera Machine Learning (CML). The use case is credit card fraud detection under the memory-constrained distributed computing paradigm — data is partitioned across worker nodes while the model is replicated, enabling training on datasets that exceed single-node memory. A Dask cluster is orchestrated on-demand via the CML Workers API, used for training, then torn down.
This guide serves two audiences:
| If you are… | Start here |
|---|---|
| Building a distributed ML pipeline on CML using Dask or a similar framework | Architecture Reference |
| Building a validation SDK or CI/CD pipeline for ML artifacts targeting CML | Data & Model Specification and Validation Rules |
Terminology
| Term | Definition |
|---|---|
| AMP | Applied ML Prototype — a portable, declarative CML project with a .project-metadata.yaml that automates setup, dependency installation, and deployment. |
| CML Workers API | cml.workers_v1 — the Python API for launching and managing compute worker sessions within a CML project. Used here to orchestrate Dask scheduler and worker processes. |
| Dask Cluster | An ephemeral distributed compute cluster consisting of a scheduler, one or more workers, and a client. Orchestrated on CML via the Workers API. |
| Dask Scheduler | A single CML worker process running dask-scheduler, responsible for coordinating task distribution among Dask workers. Listens on TCP port 8786. |
| Dask Worker | A CML worker process running dask-worker, executing distributed computation tasks assigned by the scheduler. |
| Dask Client | A Python object (dask.distributed.Client) instantiated in the notebook session that submits work to the scheduler. |
| DaskDMatrix | An XGBoost data structure (xgb.dask.DaskDMatrix) optimized for distributed training. Partitions data across the Dask cluster. |
| Model Endpoint | A CML-hosted REST endpoint that loads a serialized XGBoost model and serves predictions via a predict_fraud(args) function. |
| Booster | The trained XGBoost model object (xgb.core.Booster). Serialized to disk as a binary file and loaded at inference time. |
| AUCPR | Area Under the Precision-Recall Curve — the primary evaluation metric, chosen over accuracy due to extreme class imbalance (0.16% fraud rate). |
End-to-End Lifecycle
The upstream validation SDK’s responsibility ends at structural and contract validation — ensuring the project conforms to the structure documented in the Data & Model Specification and Deployment chapters. CML handles runtime orchestration and endpoint serving.