Dataset Schema

The project uses a sample of the Kaggle credit card fraud dataset, curated by the Machine Learning Group at Université Libre de Bruxelles.

Raw Schema

The CSV file contains 31 columns and 94,926 rows.

Column	Type	Range	Description
`Time`	float64	0 – 172,792	Seconds elapsed since the first transaction in the dataset. Dropped during feature engineering because it is not useful for scoring individual transactions.
`V1`	float64	≈ −56 to 2	PCA component 1 (privacy-preserving transformation applied by the dataset curators)
`V2`	float64	≈ −73 to 22	PCA component 2
`V3`	float64	≈ −48 to 10	PCA component 3
`V4` – `V28`	float64	varies	PCA components 4 through 28. Exact ranges vary per component. All are the result of a PCA transformation; original feature names are undisclosed for confidentiality.
`Amount`	float64	0 – 25,691	Transaction amount in original currency units
`Class`	float64	0.0 or 1.0	Target variable. 0 = legitimate transaction, 1 = fraudulent transaction.
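
The schema above can be checked before any feature engineering runs. The following is a minimal sketch, not the project's own loader: the helper name `check_raw_header` is hypothetical, and the expected column order (Time, V1 to V28, Amount, Class) is an assumption taken from the order in which this table lists the columns. It uses only the standard library.

```python
import csv
import io

# Assumed raw column order, following the Raw Schema table:
# Time, V1..V28, Amount, Class (31 columns in total).
EXPECTED_RAW_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_raw_header(csv_file):
    """Compare the header row of an open CSV file with the expected schema.

    Returns a list of human-readable problems; an empty list means the
    header matches the expected columns in the expected order.
    """
    header = next(csv.reader(csv_file), [])
    problems = []
    missing = [c for c in EXPECTED_RAW_COLUMNS if c not in header]
    extra = [c for c in header if c not in EXPECTED_RAW_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not problems and header != EXPECTED_RAW_COLUMNS:
        problems.append("columns present but in a different order")
    return problems
```

For a real file, the function would be called as `check_raw_header(open("data/creditcardsample.csv", newline=""))`; an in-memory `io.StringIO` works the same way for quick checks.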

Class Distribution

Class	Count	Percentage
0 (legitimate)	94,777	99.84%
1 (fraud)	149	0.16%

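The dataset is extremely imbalanced: fraud accounts for only 0.16% of rows, roughly 1 in 637 transactions. A model that labels every transaction as legitimate would still reach 99.84% accuracy, which is why AUCPR (Area Under the Precision-Recall Curve) rather than accuracy is the primary evaluation metric.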
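
The arithmetic behind these figures can be reproduced from the two counts in the table. This is an illustrative sketch; the function name `class_summary` is hypothetical. It also reports the no-skill reference point for AUCPR, which for a scorer that ranks transactions at random is approximately the fraud prevalence.

```python
def class_summary(n_legit, n_fraud):
    """Summarise class balance from raw counts."""
    total = n_legit + n_fraud
    prevalence = n_fraud / total
    return {
        "total": total,
        "legit_pct": round(100 * n_legit / total, 2),
        "fraud_pct": round(100 * prevalence, 2),
        # Accuracy of a model that labels every transaction as legitimate.
        "all_legit_accuracy": n_legit / total,
        # Expected AUCPR of a random (no-skill) scorer is about the prevalence.
        "no_skill_aucpr": prevalence,
    }


# Counts taken from the Class Distribution table.
summary = class_summary(94_777, 149)
```

With these counts, `summary["total"]` is 94926, matching the row count of the CSV. The gap between the trivial accuracy (above 0.998) and the no-skill AUCPR (below 0.002) shows why accuracy says little about fraud detection quality here.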

Raw vs. Model-Input Schema

Schema	Columns	Purpose
Raw (CSV)	31: Time, V1–V28, Amount, Class	As loaded from `data/creditcardsample.csv`
Model input	29: V1–V28, Amount	After dropping Time and separating Class as the target. Features are ordered alphabetically by column name, which places Amount before the V columns (Amount, V1, V2, …, V28).
Model target	1: Class	Binary fraud label

The distinction between the raw and model-input schemas is critical for building a compatible inference client: the Model Endpoint Contract expects exactly 29 features in the model-input ordering.
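
A client can derive the model-input vector from a raw record along the following lines. This is a sketch under stated assumptions: `to_model_input` is a hypothetical helper, and the feature order is written out explicitly as Amount followed by V1 through V28 in numeric order, reading the list in the table literally (a plain lexicographic sort would instead place V10 before V2). How the vector is wrapped in a request is defined by the Model Endpoint Contract and is not shown here.

```python
# Assumed model-input order: Amount first, then V1..V28 in numeric order.
MODEL_INPUT_COLUMNS = ["Amount"] + [f"V{i}" for i in range(1, 29)]


def to_model_input(raw_row):
    """Turn one raw record into (features, target).

    raw_row is a dict keyed by raw column name, for example a row produced
    by csv.DictReader. Time is ignored, Class is split off as the target,
    and the 29 features are returned in MODEL_INPUT_COLUMNS order.
    """
    missing = [c for c in MODEL_INPUT_COLUMNS if c not in raw_row]
    if missing:
        raise KeyError(f"raw row is missing feature columns: {missing}")
    features = [float(raw_row[c]) for c in MODEL_INPUT_COLUMNS]
    label = raw_row.get("Class")
    # Class is stored as 0.0 / 1.0; it is absent for unlabeled inference rows.
    target = None if label is None else int(float(label))
    return features, target
```

Keeping the column list in one constant shared by training and inference code is one way to avoid silent feature misalignment, since a list of 29 floats carries no column names of its own.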

Distributed XGBoost with Dask on CML — Developer's Guide
