Why Pipelines Matter¶
What you'll learn:
- What data leakage is and why it invalidates your model
- How sklearn pipelines prevent leakage automatically
- Why sklab enforces pipeline-first design
Prerequisites: Basic familiarity with sklearn estimators and train/test splits.
The problem: data leakage¶
Imagine you're building a model to predict house prices. Your workflow looks reasonable:
- Load the data
- Scale features with StandardScaler
- Split into train and test sets
- Train a model on the training set
- Evaluate on the test set
The test metrics look fantastic. Ship it!
But in production, predictions are wildly wrong. What happened?
The scaler was fit on all the data—including the test set. When you scaled the training data, you used statistics (mean, variance) computed from data your model should never have seen. The model "cheated" by learning from the future.
This is data leakage: information from outside the training set influencing the training process. It's one of the most common and dangerous mistakes in machine learning.
Demonstration: leakage vs. no leakage¶
Let's see this concretely. We'll create a scenario where leakage dramatically inflates apparent performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Create data where test set has different distribution
rng = np.random.default_rng(42)
# Training distribution: centered at 0
X_train_raw = rng.normal(0, 1, size=(100, 5))
y_train = (X_train_raw[:, 0] > 0).astype(int)
# Test distribution: centered at 3 (different!)
X_test_raw = rng.normal(3, 1, size=(50, 5))
y_test = (X_test_raw[:, 0] > 3).astype(int)
X_all = np.vstack([X_train_raw, X_test_raw])
y_all = np.hstack([y_train, y_test])
The wrong way: scale before splitting¶
# WRONG: Fit scaler on ALL data (including test)
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(X_all)
# Now split
X_train_leaked = X_all_scaled[:100]
X_test_leaked = X_all_scaled[100:]
# Train and evaluate
model_leaked = LogisticRegression()
model_leaked.fit(X_train_leaked, y_train)
score_leaked = model_leaked.score(X_test_leaked, y_test)
print(f"Score with leakage: {score_leaked:.3f}")
The right way: scale after splitting¶
# RIGHT: Fit scaler only on training data
scaler_right = StandardScaler()
X_train_clean = scaler_right.fit_transform(X_train_raw)
X_test_clean = scaler_right.transform(X_test_raw) # transform only!
# Train and evaluate
model_clean = LogisticRegression()
model_clean.fit(X_train_clean, y_train)
score_clean = model_clean.score(X_test_clean, y_test)
print(f"Score without leakage: {score_clean:.3f}")
The leaked model appears to perform better, but this is an illusion. In production, you won't have access to future data to inform your scaling.
Why this matters for cross-validation¶
The problem gets worse with cross-validation. If you scale before CV, every fold's validation data leaks into every fold's training data through the scaler statistics.
from sklearn.model_selection import cross_val_score
# WRONG: Scale all data, then cross-validate
scaler = StandardScaler()
X_scaled_all = scaler.fit_transform(X_all)
scores_leaked = cross_val_score(
LogisticRegression(),
X_scaled_all,
y_all,
cv=5,
)
print(f"CV with leakage: {scores_leaked.mean():.3f} (+/- {scores_leaked.std():.3f})")
The solution: sklearn Pipelines¶
Pipelines bundle preprocessing and modeling into a single estimator. When you
call fit() on a pipeline, it fits each step in sequence. When sklearn's
cross-validation fits a pipeline, it refits the entire pipeline on each
fold's training data.
This means the scaler never sees validation data.
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
("scale", StandardScaler()),
("model", LogisticRegression()),
])
# RIGHT: Cross-validate the pipeline
scores_clean = cross_val_score(pipeline, X_all, y_all, cv=5)
print(f"CV without leakage: {scores_clean.mean():.3f} (+/- {scores_clean.std():.3f})")
How sklab enforces this¶
sklab requires a Pipeline object—not a raw estimator. This isn't a limitation; it's a forcing function for correct methodology.
from sklab.experiment import Experiment
experiment = Experiment(
pipeline=pipeline,
scoring="accuracy",
name="leakage-demo",
)
# Every sklab method uses the pipeline correctly
cv_result = experiment.cross_validate(X_all, y_all, cv=5, run_name="cv")
print(f"sklab CV: {cv_result.metrics['cv/accuracy_mean']:.3f}")
When you use sklab:
- fit() fits the entire pipeline
- evaluate() uses the fitted pipeline to transform and predict
- cross_validate() refits the pipeline on each fold
- search() searches over pipeline parameters, refitting on each trial
You can't accidentally leak because the pipeline encapsulates the correct order of operations.
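Searching over pipeline parameters uses sklearn's step__parameter naming, where the prefix is the step name from the Pipeline definition. Here's a sketch with plain sklearn's GridSearchCV; the grid values are illustrative, and sklab's search() presumably accepts a similar grid (check its docs for the exact signature).
from sklearn.model_selection import GridSearchCV
param_grid = {
    "model__C": [0.1, 1.0, 10.0],       # parameter of the LogisticRegression step
    "scale__with_mean": [True, False],  # parameter of the StandardScaler step
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_all, y_all)  # each candidate refits the whole pipeline on each fold
print(grid.best_params_)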
Common leakage patterns¶
Beyond scaling, watch out for these leakage sources:
| Pattern | Problem | Solution |
|---|---|---|
| Feature selection on full data | Selected features "know" about test labels | Use SelectKBest inside pipeline |
| Target encoding on full data | Encoded values include test target info | Use TargetEncoder inside pipeline |
| Imputation on full data | Imputed values use test statistics | Use SimpleImputer inside pipeline |
| Oversampling before split | Synthetic samples from test distribution | Use imblearn.Pipeline with SMOTE |
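As a sketch of how the first three rows look in practice, here is one pipeline with imputation and feature selection moved inside it. SimpleImputer, SelectKBest, and f_classif are sklearn's own; the step order and k=3 are illustrative choices.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
safe_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # statistics from training folds only
    ("select", SelectKBest(f_classif, k=3)),       # scores features against training labels only
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
# Every step, including imputation and selection, is refit on each fold
scores = cross_val_score(safe_pipeline, X_all, y_all, cv=5)
Because selection happens inside the pipeline, each fold may pick a different feature subset, which is exactly the honest behavior you want.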
Best practices¶
- Always use pipelines. Every preprocessing step that learns from data (scaling, encoding, imputation, feature selection) belongs in the pipeline.
- Split early. If you must do exploratory analysis, split first and only look at training data.
- Be paranoid about temporal data. Time series adds another dimension of leakage: you can't use future data to predict the past. Use TimeSeriesSplit.
- Validate with holdout. Even with correct CV, keep a final holdout set that you only touch once, at the very end. A sketch of these last two practices follows below.
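A minimal sketch combining TimeSeriesSplit with a one-time holdout. Our demo data isn't a real time series, so treat this as the shape of the workflow only: shuffle=False preserves row order, and the 0.2 holdout fraction is an arbitrary choice.
from sklearn.model_selection import TimeSeriesSplit, train_test_split
# Carve off the final holdout first, preserving temporal order
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, shuffle=False
)
# Forward-chaining CV: each fold trains on the past, validates on the future
ts_scores = cross_val_score(pipeline, X_dev, y_dev, cv=TimeSeriesSplit(n_splits=5))
print(f"Time-series CV: {ts_scores.mean():.3f}")
# Touch the holdout exactly once, after all modeling decisions are final
pipeline.fit(X_dev, y_dev)
print(f"Holdout score: {pipeline.score(X_holdout, y_holdout):.3f}")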
Further reading¶
- sklearn Pipeline documentation
- sklearn Cross-validation guide
- Common pitfalls in ML
- Kaufman et al., "Leakage in Data Mining" (2012) — foundational paper on leakage taxonomy