Introduction

This Applied ML Prototype (AMP) demonstrates distributed XGBoost training using Dask on Cloudera Machine Learning (CML). The use case is credit card fraud detection under a memory-constrained, data-parallel paradigm: the training data is partitioned across worker nodes while the model is replicated on each, enabling training on datasets that exceed single-node memory. A Dask cluster is orchestrated on-demand via the CML Workers API, used for training, then torn down.
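That on-demand lifecycle can be sketched roughly as follows. The `launch_workers` call, its parameters, the resource sizes, and the scheduler-address placeholder are illustrative assumptions, not the AMP's actual code; consult the CML Workers API documentation for the exact signatures in your release.

```python
def launch_dask_cluster(n_workers: int) -> str:
    """Hypothetical sketch: start a Dask scheduler and n_workers Dask
    workers as CML worker sessions. Resource sizes are illustrative."""
    import cml.workers_v1 as workers  # only importable inside a CML session

    # One worker session runs dask-scheduler, listening on TCP port 8786.
    workers.launch_workers(
        n=1, cpu=2, memory=8,
        code="!dask-scheduler --host 0.0.0.0 --port 8786",
    )
    scheduler_url = "tcp://<scheduler-ip>:8786"  # resolve from the scheduler worker's IP

    # The remaining sessions run dask-worker, pointed at the scheduler.
    workers.launch_workers(
        n=n_workers, cpu=2, memory=8,
        code=f"!dask-worker {scheduler_url}",
    )
    return scheduler_url
```

The import is deferred into the function body because `cml.workers_v1` exists only inside a running CML session; after training completes, the worker sessions are stopped to release the cluster.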

This guide serves two audiences:

| If you are… | Start here |
|---|---|
| Building a distributed ML pipeline on CML using Dask or a similar framework | Architecture Reference |
| Building a validation SDK or CI/CD pipeline for ML artifacts targeting CML | Data & Model Specification and Validation Rules |

Terminology

| Term | Definition |
|---|---|
| AMP | Applied ML Prototype — a portable, declarative CML project with a .project-metadata.yaml that automates setup, dependency installation, and deployment. |
| CML Workers API | cml.workers_v1 — the Python API for launching and managing compute worker sessions within a CML project. Used here to orchestrate Dask scheduler and worker processes. |
| Dask Cluster | An ephemeral distributed compute cluster consisting of a scheduler, one or more workers, and a client. Orchestrated on CML via the Workers API. |
| Dask Scheduler | A single CML worker process running dask-scheduler, responsible for coordinating task distribution among Dask workers. Listens on TCP port 8786. |
| Dask Worker | A CML worker process running dask-worker, executing distributed computation tasks assigned by the scheduler. |
| Dask Client | A Python object (dask.distributed.Client) instantiated in the notebook session that submits work to the scheduler. |
| DaskDMatrix | An XGBoost data structure (xgb.dask.DaskDMatrix) optimized for distributed training. Partitions data across the Dask cluster. |
| Model Endpoint | A CML-hosted REST endpoint that loads a serialized XGBoost model and serves predictions via a predict_fraud(args) function. |
| Booster | The trained XGBoost model object (xgb.core.Booster). Serialized to disk as a binary file and loaded at inference time. |
| AUCPR | Area Under the Precision-Recall Curve — the primary evaluation metric, chosen over accuracy due to extreme class imbalance (0.16% fraud rate). |
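The class-imbalance argument behind the AUCPR choice is easy to verify numerically: at a 0.16% fraud rate, a degenerate model that labels every transaction legitimate still reaches 99.84% accuracy while catching no fraud at all. A minimal, self-contained illustration:

```python
# At a 0.16% fraud rate, the trivial "always legitimate" classifier
# looks excellent by accuracy and useless by recall.
n = 100_000
n_fraud = int(n * 0.0016)            # 160 fraudulent transactions
labels = [1] * n_fraud + [0] * (n - n_fraud)
preds = [0] * n                      # predict "legitimate" for everything

accuracy = sum(p == y for p, y in zip(preds, labels)) / n
recall = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1) / n_fraud

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9984
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```

Precision-recall curves, and hence AUCPR, ignore the flood of true negatives that inflates accuracy here, which is why it serves as the primary metric.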

End-to-End Lifecycle

The upstream validation SDK’s responsibility ends at structural and contract validation — ensuring the project conforms to the structure documented in the Data & Model Specification and Deployment chapters. CML handles runtime orchestration and endpoint serving.
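On the runtime side that CML handles, the training step can be sketched using the structures named in the terminology table. The hyperparameters, boosting rounds, and output path below are illustrative assumptions, not the AMP's actual configuration:

```python
def train_fraud_model(scheduler_url: str, X, y, model_path: str = "fraud_model.bin"):
    """Hypothetical sketch: distributed XGBoost training against a running
    Dask cluster. Imports are deferred since dask/xgboost are cluster-side deps."""
    import xgboost as xgb
    from dask.distributed import Client

    client = Client(scheduler_url)               # connect to dask-scheduler (port 8786)
    dtrain = xgb.dask.DaskDMatrix(client, X, y)  # partitions data across the workers

    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "eval_metric": "aucpr"},
        dtrain,
        num_boost_round=100,
    )
    booster = output["booster"]                  # xgb.core.Booster
    booster.save_model(model_path)               # serialized for the model endpoint
    return booster
```

The serialized Booster written at the end is the artifact the model endpoint later loads to serve `predict_fraud(args)` requests.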