Dataset Schema

The project uses a sample of the Kaggle credit card fraud dataset, curated by the Machine Learning Group at Université Libre de Bruxelles.

Raw Schema

The CSV file contains 31 columns and 94,926 rows.

| Column | Type | Range | Description |
|--------|------|-------|-------------|
| Time | float64 | 0 – 172,792 | Seconds elapsed since the first transaction in the dataset. Dropped during feature engineering; not useful for per-transaction fraud prediction. |
| V1 | float64 | ≈ −56 to 2 | PCA component 1 (privacy-preserving transformation applied by the dataset curators) |
| V2 | float64 | ≈ −73 to 22 | PCA component 2 |
| V3 | float64 | ≈ −48 to 10 | PCA component 3 |
| V4 – V28 | float64 | varies | PCA components 4 through 28. Exact ranges vary per component. All are the result of a PCA transformation; the original feature names are withheld for confidentiality. |
| Amount | float64 | 0 – 25,691 | Transaction amount in the original currency units |
| Class | float64 | 0.0 or 1.0 | Target variable: 0 = legitimate transaction, 1 = fraudulent transaction. |
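A loader can verify this raw schema before any feature engineering begins. The sketch below is illustrative, not project code: `validate_raw_schema` is a hypothetical helper, and the expected column names are taken directly from the table above.

```python
# Expected raw-schema column order, per the table above (assumption:
# the CSV header matches these names exactly; the file is not read here).
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def validate_raw_schema(columns):
    """Return a list of problems found in a candidate column list (empty = OK)."""
    problems = []
    if len(columns) != 31:
        problems.append(f"expected 31 columns, got {len(columns)}")
    missing = set(EXPECTED_COLUMNS) - set(columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    return problems


print(validate_raw_schema(EXPECTED_COLUMNS))  # → []
```

With a real DataFrame, the same check would run against `list(df.columns)` right after loading the CSV.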

Class Distribution

| Class | Count | Percentage |
|-------|-------|------------|
| 0 (legitimate) | 94,777 | 99.84% |
| 1 (fraud) | 149 | 0.16% |

The dataset is extremely imbalanced. This motivates the use of AUCPR (Area Under the Precision-Recall Curve) rather than accuracy as the primary evaluation metric.
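To see concretely why accuracy misleads here, consider a degenerate classifier that labels every transaction legitimate. Using the class counts above (pure Python, no dataset or libraries needed):

```python
# Class counts from the distribution table above.
n_legit, n_fraud = 94_777, 149
total = n_legit + n_fraud

# A classifier that always predicts "legitimate" gets every legitimate
# transaction right and every fraudulent one wrong.
accuracy = n_legit / total
recall_fraud = 0 / n_fraud  # it never flags a single fraud case

print(f"accuracy:     {accuracy:.4f}")      # → accuracy:     0.9984
print(f"fraud recall: {recall_fraud:.4f}")  # → fraud recall: 0.0000
```

An apparently excellent 99.84% accuracy with zero fraud detected; AUCPR, by contrast, scores the model only on how well it ranks the rare positive class.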

Raw vs. Model-Input Schema

| Schema | Columns | Purpose |
|--------|---------|---------|
| Raw (CSV) | 31: Time, V1–V28, Amount, Class | As loaded from `data/creditcardsample.csv` |
| Model input | 29: V1–V28, Amount | After dropping Time and separating Class as the target. Features are ordered alphabetically by column name (Amount, V1, V2, …, V28). |
| Model target | 1: Class | Binary fraud label |

The distinction between the raw and model-input schemas is critical when building a compatible inference client: the Model Endpoint Contract expects exactly 29 features, in the model-input ordering.
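The raw-to-model-input transformation can be sketched in a few lines. This is an illustrative reconstruction from the tables above (`model_input_columns` is a hypothetical helper, not project code); note that Python's default string sort is strictly lexicographic, so V10–V19 fall between V1 and V2 under "alphabetical" ordering.

```python
# Raw CSV header, per the raw-schema table above.
RAW_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def model_input_columns(raw_columns):
    """Drop Time, separate Class as the target, and return (features, target)."""
    features = sorted(c for c in raw_columns if c not in ("Time", "Class"))
    return features, "Class"


features, target = model_input_columns(RAW_COLUMNS)
print(len(features))  # → 29
print(features[:3])   # → ['Amount', 'V1', 'V10']  (plain string sort)
```

An inference client would send feature values to the endpoint in exactly this `features` order.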