Feature Engineering Contract
This page documents the exact transformations applied between raw CSV data and model input. A compatible implementation must reproduce these steps to generate features that the trained model can score correctly.
Transformation Steps
Step 1: Drop Time
dask_df = dask_df.drop(columns=["Time"])
The Time column records the number of seconds elapsed since the first transaction in the dataset. It carries no useful signal when scoring individual transactions in isolation, so it is dropped before any further processing.
Step 2: Separate Target Variable
y = dask_df["Class"]
X = dask_df.drop(columns=["Class"])
After this step, X contains 29 feature columns and y contains the binary fraud label.
Step 3: Feature Ordering
The 29 columns in X are ordered with Amount first, followed by V1 through V28 in ascending numeric order (note this is not a plain lexicographic sort, which would place V10 before V2):
Amount, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13,
V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26,
V27, V28
This ordering is inherited from the Dask DataFrame column order and is significant — the model endpoint receives features as a positional array, so the ordering must match exactly.
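Because the endpoint consumes a positional array, a compatible implementation may want to enforce this ordering explicitly rather than rely on incidental DataFrame column order. A minimal sketch using pandas (the Dask DataFrame indexing API is analogous; the toy frame and its contents are hypothetical):

```python
import pandas as pd

# The contract's exact feature order: Amount first, then V1..V28.
FEATURE_ORDER = ["Amount"] + [f"V{i}" for i in range(1, 29)]

# Hypothetical frame standing in for the real data, deliberately
# constructed with its columns in the wrong order.
X = pd.DataFrame({col: [0.0] for col in reversed(FEATURE_ORDER)})

# Selecting with an explicit column list both reorders the frame and
# raises a KeyError if any expected feature is missing.
X = X[FEATURE_ORDER]
```

Selecting with an explicit list fails loudly on a missing column, which is preferable to silently scoring misaligned features.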
Step 4: StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
All 29 features are scaled to zero mean and unit variance. The PCA features (V1–V28) are already approximately centered by the PCA transformation, so the scaler primarily affects the Amount column, which has a much larger raw range (0 – 25,691).
Note: The scaler is fitted on the training split only, then applied to dev and validation splits. This prevents data leakage from the evaluation sets into the scaling parameters.
Summary
| Step | Input | Output | Notes |
|---|---|---|---|
| Drop Time | 31 columns | 30 columns | Removes non-predictive temporal feature |
| Separate Class | 30 columns | 29 features + 1 target | Class becomes the label y |
| Fix feature order | 29 features | 29 features (Amount, V1–V28) | Positional contract for the model endpoint |
| StandardScaler | 29 features (raw scale) | 29 features (zero mean, unit variance) | Fitted on training split only |
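Putting the steps together, an end-to-end sketch of the contract might look like the following (pandas shown for brevity; the corresponding Dask DataFrame calls have the same names, and the helper, toy frame, and split handling are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The contract's feature order: Amount first, then V1..V28.
FEATURE_ORDER = ["Amount"] + [f"V{i}" for i in range(1, 29)]

def make_features(df: pd.DataFrame, scaler: StandardScaler, fit: bool):
    """Apply the contract: drop Time, split off Class, fix order, scale."""
    df = df.drop(columns=["Time"])                      # Step 1
    y = df["Class"]                                     # Step 2
    X = df.drop(columns=["Class"])[FEATURE_ORDER]       # Step 3
    X_scaled = scaler.fit_transform(X) if fit else scaler.transform(X)  # Step 4
    return X_scaled, y.to_numpy()

# Hypothetical toy frame with the raw 31-column layout.
rng = np.random.default_rng(1)
raw = pd.DataFrame(
    rng.normal(size=(50, 31)),
    columns=["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"],
)
scaler = StandardScaler()
X_train, y_train = make_features(raw, scaler, fit=True)
```

A scoring path for dev or validation data would call `make_features(df, scaler, fit=False)` so the training-split scaling parameters are reused.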