Feature Engineering Contract
This page documents the exact transformations applied between raw CSV data and model input. A compatible implementation must reproduce these steps to generate features that the trained model can score correctly.
Transformation Steps
Step 1: Drop Time
dask_df = dask_df.drop(columns=["Time"])
The Time column records the number of seconds elapsed since the first transaction in the dataset. It carries no useful signal when scoring individual transactions in isolation, so it is dropped before any further processing.
Step 2: Separate Target Variable
y = dask_df["Class"]
X = dask_df.drop(columns=["Class"])
After this step, X contains 29 feature columns and y contains the binary fraud label.
Step 3: Feature Ordering
The 29 columns in X are ordered with Amount first, followed by V1 through V28 in ascending numeric order (note this is not a plain lexicographic sort, which would place V10 before V2):
Amount, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13,
V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26,
V27, V28
This ordering is inherited from the Dask DataFrame column order and is significant — the model endpoint receives features as a positional array, so the ordering must match exactly.
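Because the endpoint consumes a positional array, a compatible implementation may want to enforce this ordering explicitly rather than rely on incidental DataFrame column order. A minimal sketch using pandas (the Dask DataFrame indexing API is analogous; the toy frame and its contents are hypothetical):

```python
import pandas as pd

# The contract's exact feature order: Amount first, then V1..V28.
FEATURE_ORDER = ["Amount"] + [f"V{i}" for i in range(1, 29)]

# Hypothetical frame standing in for the real data, deliberately
# constructed with its columns in the wrong order.
X = pd.DataFrame({col: [0.0] for col in reversed(FEATURE_ORDER)})

# Selecting with an explicit column list both reorders the frame and
# raises a KeyError if any expected feature is missing.
X = X[FEATURE_ORDER]
```

Selecting with an explicit list fails loudly on a missing column, which is preferable to silently scoring misaligned features.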
Step 4: StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
All 29 features are scaled to zero mean and unit variance. The PCA features (V1–V28) are already approximately centered by the PCA transformation, so the scaler primarily affects the Amount column, which has a much larger raw range (0 – 25,691).
Note: The scaler is fitted on the training split only, then applied to dev and validation splits. This prevents data leakage from the evaluation sets into the scaling parameters.
Summary
| Step | Input | Output | Notes |
|---|---|---|---|
| Drop Time | 31 columns | 30 columns | Removes non-predictive temporal feature |
| Separate Class | 30 columns | 29 features + 1 target | Class becomes the label y |
| Fix feature order | 29 features | 29 features (Amount, V1–V28) | Positional contract for the model endpoint |
| StandardScaler | 29 features (raw scale) | 29 features (zero mean, unit variance) | Fitted on training split only |
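Putting the steps together, an end-to-end sketch of the contract might look like the following (pandas shown for brevity; the corresponding Dask DataFrame calls have the same names, and the helper, toy frame, and split handling are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The contract's feature order: Amount first, then V1..V28.
FEATURE_ORDER = ["Amount"] + [f"V{i}" for i in range(1, 29)]

def make_features(df: pd.DataFrame, scaler: StandardScaler, fit: bool):
    """Apply the contract: drop Time, split off Class, fix order, scale."""
    df = df.drop(columns=["Time"])                      # Step 1
    y = df["Class"]                                     # Step 2
    X = df.drop(columns=["Class"])[FEATURE_ORDER]       # Step 3
    X_scaled = scaler.fit_transform(X) if fit else scaler.transform(X)  # Step 4
    return X_scaled, y.to_numpy()

# Hypothetical toy frame with the raw 31-column layout.
rng = np.random.default_rng(1)
raw = pd.DataFrame(
    rng.normal(size=(50, 31)),
    columns=["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"],
)
scaler = StandardScaler()
X_train, y_train = make_features(raw, scaler, fit=True)
```

A scoring path for dev or validation data would call `make_features(df, scaler, fit=False)` so the training-split scaling parameters are reused.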