Dataset Schema
The project uses a sample of the Kaggle credit card fraud dataset, curated by the Machine Learning Group at Université Libre de Bruxelles.
Raw Schema
The CSV file contains 31 columns and 94,926 rows.
| Column | Type | Range | Description |
|---|---|---|---|
| Time | float64 | 0 – 172,792 | Seconds elapsed since the first transaction in the dataset. Dropped during feature engineering because it is not useful for scoring individual transactions. |
| V1 | float64 | ≈ −56 to 2 | PCA component 1 (privacy-preserving transformation applied by the dataset curators) |
| V2 | float64 | ≈ −73 to 22 | PCA component 2 |
| V3 | float64 | ≈ −48 to 10 | PCA component 3 |
| V4 – V28 | float64 | varies | PCA components 4 through 28. Exact ranges vary per component. All are the result of a PCA transformation; original feature names are undisclosed for confidentiality. |
| Amount | float64 | 0 – 25,691 | Transaction amount in original currency units |
| Class | float64 | 0.0 or 1.0 | Target variable. 0 = legitimate transaction, 1 = fraudulent transaction. |
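
The raw schema can be checked before any feature engineering runs. The following is a minimal sketch, not the project's own loader: the names `RAW_COLUMNS` and `check_raw_header` are hypothetical, and the check compares column names only, so it makes no assumption about the order of columns in the file.

```python
import csv
import io

# Column names taken from the raw schema table: Time, V1..V28, Amount, Class.
RAW_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_raw_header(csv_text):
    """Return the CSV header row; raise ValueError if a column is missing or unexpected."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = sorted(set(RAW_COLUMNS) - set(header))
    unexpected = sorted(set(header) - set(RAW_COLUMNS))
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return header
```

Passing the first line of the CSV file to `check_raw_header` should be enough to catch a file that was exported with a different column set.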
Class Distribution
| Class | Count | Percentage |
|---|---|---|
| 0 (legitimate) | 94,777 | 99.84% |
| 1 (fraud) | 149 | 0.16% |
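The dataset is extremely imbalanced: fraud accounts for only 0.16% of rows, so a model that labels every transaction as legitimate would still reach 99.84% accuracy. This motivates the use of AUCPR (Area Under the Precision-Recall Curve) rather than accuracy as the primary evaluation metric.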
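
To make the imbalance concrete, the counts from the table can be plugged into a short calculation. This is an illustrative sketch only; the variable names are hypothetical. It relies on the standard observation that a scorer ranking transactions at random has an expected precision equal to the fraud prevalence, which is why that prevalence is commonly treated as the AUCPR baseline.

```python
# Counts copied from the class distribution table.
legit, fraud = 94_777, 149
total = legit + fraud

# A trivial model that labels every transaction as legitimate:
accuracy_all_legit = legit / total  # high, despite being useless
recall_all_legit = 0 / fraud        # it never flags a single fraud

# Fraud prevalence: the expected precision of a random scorer,
# and therefore a rough floor for a meaningful AUCPR.
prevalence = fraud / total

print(f"accuracy (all legitimate): {accuracy_all_legit:.2%}")
print(f"recall   (all legitimate): {recall_all_legit:.2%}")
print(f"fraud prevalence:          {prevalence:.2%}")
```

A model is only useful here if its AUCPR sits well above the prevalence, whereas almost any model will post an accuracy near 99.8%.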
Raw vs. Model-Input Schema
| Schema | Columns | Purpose |
|---|---|---|
| Raw (CSV) | 31: Time, V1–V28, Amount, Class | As loaded from data/creditcardsample.csv |
| Model input | 29: V1–V28, Amount | After dropping Time and separating Class as the target. Features are ordered alphabetically by column name (Amount, V1, V2, …, V28). |
| Model target | 1: Class | Binary fraud label |
The distinction between the raw and model-input schemas is critical when building a compatible inference client: the Model Endpoint Contract expects exactly 29 features in the model-input ordering.
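
A client can perform the raw-to-model-input conversion with a small helper. The sketch below is an assumption-laden illustration, not the project's code: `MODEL_INPUT_COLUMNS`, `TARGET_COLUMN`, and `to_model_input` are hypothetical names, and the feature order is copied from the listing in the table (Amount, V1, V2, …, V28). The order is spelled out explicitly because a plain lexicographic sort of the column names would place V10 before V2; the endpoint contract should be checked to confirm which order is expected.

```python
# Assumed model-input order, copied from the listing "Amount, V1, V2, ..., V28".
MODEL_INPUT_COLUMNS = ["Amount"] + [f"V{i}" for i in range(1, 29)]
TARGET_COLUMN = "Class"


def to_model_input(raw_row):
    """Split one raw record (a dict keyed by column name) into (features, target).

    Time is dropped simply because it is not listed in MODEL_INPUT_COLUMNS.
    The target is None when the record carries no Class label, as at inference time.
    """
    features = [float(raw_row[name]) for name in MODEL_INPUT_COLUMNS]
    target = None
    if TARGET_COLUMN in raw_row:
        target = int(float(raw_row[TARGET_COLUMN]))
    return features, target
```

Keeping the column order in one named constant means the training pipeline and the inference client can share it instead of each re-deriving the order.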