Feature Engineering Contract

This page documents the exact transformations applied between raw CSV data and model input. A compatible implementation must reproduce these steps to generate features that the trained model can score correctly.

Transformation Steps

Step 1: Drop Time

dask_df = dask_df.drop(columns=["Time"])

The Time column represents seconds elapsed since the first transaction in the dataset. It is not useful for point-wise fraud detection, where each transaction is scored independently, so it is dropped before any further processing.

Step 2: Separate Target Variable

y = dask_df["Class"]
X = dask_df.drop(columns=["Class"])

After this step, X contains 29 feature columns and y contains the binary fraud label.
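For an implementation that does not use Dask, Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual code: the in-memory CSV, its values, its header order, and the helper name split_features_and_target are all hypothetical.

```python
import csv
import io

# Hypothetical two-row sample with the 31 raw columns; all values are made up.
header = ["Time", "Amount"] + [f"V{i}" for i in range(1, 29)] + ["Class"]
rows = [
    [0.0, 149.62] + [0.1 * i for i in range(1, 29)] + [0],
    [1.0, 2.69] + [-0.1 * i for i in range(1, 29)] + [1],
]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
buf.seek(0)


def split_features_and_target(reader):
    """Drop Time, then separate Class (y) from the remaining columns (X)."""
    X, y = [], []
    for record in reader:
        record.pop("Time")                  # Step 1: drop Time
        y.append(int(record.pop("Class")))  # Step 2: Class becomes the label
        X.append({name: float(value) for name, value in record.items()})
    return X, y


X, y = split_features_and_target(csv.DictReader(buf))
print(len(X[0]), y)
```

Each row of X ends up with 29 named features, matching the column count stated above.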

Step 3: Feature Ordering

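The 29 columns in X are ordered with Amount first, followed by V1 through V28 in numeric order. This is not a strict lexicographic sort, which would place V10 before V2: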

Amount, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13,
V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26,
V27, V28

This ordering is inherited from the Dask DataFrame column order and is significant: the model endpoint receives features as a positional array, so the ordering must match exactly.
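Because the endpoint reads features by position, a client can guard against ordering mistakes by building the array from an explicit column list rather than relying on dictionary or DataFrame order. The sketch below assumes features arrive as a name-to-value mapping; FEATURE_ORDER and to_positional are hypothetical names, not part of the documented pipeline.

```python
# Column order from Step 3: Amount first, then V1..V28 in numeric order.
FEATURE_ORDER = ["Amount"] + [f"V{i}" for i in range(1, 29)]


def to_positional(features):
    """Return the feature values as a list in the contract's column order."""
    missing = [name for name in FEATURE_ORDER if name not in features]
    extra = [name for name in features if name not in FEATURE_ORDER]
    if missing or extra:
        raise ValueError(f"missing={missing} unexpected={extra}")
    return [float(features[name]) for name in FEATURE_ORDER]


# A mapping built in a different order still yields the contract order.
sample = {f"V{i}": float(i) for i in range(28, 0, -1)}
sample["Amount"] = 99.5
vector = to_positional(sample)

# A plain string sort does NOT reproduce the contract order (V10 sorts before V2).
lexicographic = sorted(FEATURE_ORDER)
```

Here vector starts with the Amount value followed by V1 through V28, while lexicographic begins Amount, V1, V10, which is why sorting column names is not a safe substitute for the explicit list.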

Step 4: StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

All 29 features are scaled to zero mean and unit variance. The PCA features (V1–V28) are already approximately centered by the PCA transformation, so the scaler primarily affects the Amount column, which has a much larger raw range (0 – 25,691).

Note: The scaler is fitted on the training split only, then applied to dev and validation splits. This prevents data leakage from the evaluation sets into the scaling parameters.
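The fit-on-train-only rule can be illustrated without scikit-learn. The sketch below mirrors what StandardScaler computes, namely the per-column mean and population standard deviation, with a scale of 1.0 substituted for constant columns. The function names and the toy data are hypothetical, and a real compatible implementation would more likely load the fitted scaler's stored parameters than refit.

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Compute per-column mean and population std from the training split only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero std; use 1.0 so the division is a no-op.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def transform(rows, means, scales):
    """Apply already-fitted parameters; never refit on dev or validation data."""
    return [
        [(value - m) / s for value, m, s in zip(row, means, scales)]
        for row in rows
    ]


# Toy data: column 0 varies, column 1 is constant in the training split.
train = [[0.0, 1.0], [10.0, 1.0], [20.0, 1.0]]
dev = [[30.0, 2.0]]

means, scales = fit_scaler(train)
train_scaled = transform(train, means, scales)
dev_scaled = transform(dev, means, scales)  # reuses the training statistics
```

Only the training split is guaranteed to come out with zero mean and unit variance. The dev and validation splits are shifted and scaled by the training statistics, so their means and variances will generally differ slightly.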

Summary

| Step | Input | Output | Notes |
|---|---|---|---|
| Drop Time | 31 columns | 30 columns | Removes non-predictive temporal feature |
| Separate Class | 30 columns | 29 features + 1 target | Class becomes the label y |
| StandardScaler | 29 features (raw scale) | 29 features (zero mean, unit variance) | Fitted on training split only |