Quickstart: The Experiment Class¶
What you'll learn:
- The four core operations: fit, evaluate, cross_validate, search
- When to use each operation
- How sklab standardizes the ML workflow
Time to complete: 5 minutes.
The Experiment class at a glance¶
The Experiment class is the heart of sklab. It wraps a sklearn pipeline
with consistent scoring, logging, and methods for the full ML lifecycle:
| Method | Purpose | When to use |
|---|---|---|
| fit() | Train the pipeline | Initial training, final model |
| evaluate() | Score on held-out data | Holdout evaluation |
| cross_validate() | k-fold cross-validation | Robust performance estimate |
| search() | Hyperparameter search | Finding better configurations |
Each method logs results automatically using the configured logger (MLflow, W&B, or no-op by default).
Setup: Create an experiment¶
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklab.experiment import Experiment
X, y = load_iris(return_X_y=True)
# Build a pipeline
pipeline = Pipeline([
("scale", StandardScaler()),
("model", LogisticRegression(max_iter=200)),
])
# Create the experiment
experiment = Experiment(
pipeline=pipeline,
scoring="accuracy",
name="quickstart",
)
What this does:
- pipeline: The sklearn pipeline to train and evaluate
- scoring: A scorer string, callable, or list of scorers
- name: A human-readable name for logging and identification
Operation 1: fit()¶
Train the pipeline on your data.
fit_result = experiment.fit(X, y, run_name="fit")
print(f"Estimator type: {type(fit_result.estimator).__name__}")
print(f"Logged params: {fit_result.params}")
Returns: FitResult with:
- estimator: The fitted pipeline (cloned from original)
- params: Parameters logged for the run
Use when: Training on your full training set, or training a final model.
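The fitted pipeline in fit_result.estimator is a regular sklearn estimator, so you can call predict on it directly. A minimal sketch using only standard sklearn calls:
# The fitted pipeline behaves like any sklearn estimator
preds = fit_result.estimator.predict(X)
print(f"First five predictions: {preds[:5]}")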
Operation 2: evaluate()¶
Score a fitted model on held-out data.
# In practice, you'd evaluate on a separate test set
# Here we reuse the data for demonstration
eval_result = experiment.evaluate(
X, y,
run_name="eval",
)
print(f"Metrics: {eval_result.metrics}")
Returns: EvalResult with:
- metrics: Dict of metric names to values
Use when: Evaluating on a holdout test set after training.
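In a real project you would fit on a training split and evaluate on a held-out test split. A short sketch combining fit() and evaluate() from above; only the train/test split is new:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
experiment.fit(X_train, y_train, run_name="fit-train")
holdout_result = experiment.evaluate(X_test, y_test, run_name="eval-holdout")
print(f"Holdout metrics: {holdout_result.metrics}")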
Operation 3: cross_validate()¶
Get a robust performance estimate by training and evaluating across multiple folds.
cv_result = experiment.cross_validate(
X, y,
cv=5, # 5-fold cross-validation
run_name="cv",
)
print(f"CV accuracy: {cv_result.metrics['cv/accuracy_mean']:.4f}")
print(f"CV std: {cv_result.metrics['cv/accuracy_std']:.4f}")
Returns: CVResult with:
- metrics: Mean and std for each metric (prefixed with cv/)
- fold_metrics: Per-fold scores
Use when: Estimating model performance, comparing model variants, model selection.
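If you want the raw per-fold scores rather than the aggregated mean and std, they are exposed on the result as fold_metrics. A minimal sketch; the exact structure of fold_metrics is not assumed here, so it is simply printed:
# Aggregates (cv/ prefix) and the underlying per-fold scores
print(cv_result.metrics)
print(cv_result.fold_metrics)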
Cross-validation variants¶
sklab accepts any sklearn splitter:
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
# Stratified (preserves class balance) - good for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_result = experiment.cross_validate(X, y, cv=skf, run_name="strat-cv")
print(f"Stratified CV: {strat_result.metrics['cv/accuracy_mean']:.4f}")
# Time series split - respects temporal order
import numpy as np
rng = np.random.default_rng(42)
X_ts = np.arange(100).reshape(-1, 1)
y_ts = np.sin(X_ts[:, 0] / 10) + rng.normal(0, 0.1, size=100)
ts_experiment = Experiment(
pipeline=Pipeline([
("scale", StandardScaler()),
("model", LogisticRegression(max_iter=200)),
]),
scoring="accuracy",
name="ts-demo",
)
# Binarize for classification demo
y_ts_binary = (y_ts > 0).astype(int)
tscv = TimeSeriesSplit(n_splits=3)
ts_result = ts_experiment.cross_validate(X_ts, y_ts_binary, cv=tscv, run_name="ts-cv")
print(f"Time series CV: {ts_result.metrics['cv/accuracy_mean']:.4f}")
Concept: Choosing a Splitter
- Classification: Use StratifiedKFold to preserve class balance
- Regression: Use KFold (or pass an integer like cv=5)
- Time series: Use TimeSeriesSplit to avoid using future data to predict the past
- Grouped data: Use GroupKFold to keep groups together (see the sketch below)
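For grouped data the pattern is the same: build the splitter and pass it as cv. A hedged sketch, assuming cross_validate forwards a groups keyword to the splitter the way sklearn's own cross_validate does (the group labels below are made up for illustration):
import numpy as np
from sklearn.model_selection import GroupKFold
# Hypothetical group labels: pretend each iris sample belongs to one of 10 groups
groups = np.repeat(np.arange(10), len(X) // 10)
gkf = GroupKFold(n_splits=5)
# Assumption: cross_validate accepts a `groups` keyword and forwards it to the splitter
group_result = experiment.cross_validate(X, y, cv=gkf, groups=groups, run_name="group-cv")
print(f"Grouped CV: {group_result.metrics['cv/accuracy_mean']:.4f}")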
Operation 4: search()¶
Find better hyperparameters by searching over a parameter space.
from sklab.search import GridSearchConfig
search_result = experiment.search(
GridSearchConfig(param_grid={"model__C": [0.1, 1.0, 10.0]}),
X, y,
cv=3,
run_name="search",
)
print(f"Best params: {search_result.best_params}")
print(f"Best score: {search_result.best_score:.4f}")
Returns: SearchResult with:
- best_params: Best parameter combination found
- best_score: Score achieved by best params
- best_estimator: Fitted pipeline with best params (if refit=True)
Use when: Tuning hyperparameters, exploring the parameter space.
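When refit is enabled (sklearn searchers default to refit=True; check whether GridSearchConfig does the same), best_estimator is a fitted pipeline you can use for prediction straight away:
# best_estimator is a fitted sklearn pipeline when refit=True
best_model = search_result.best_estimator
print(f"Predictions from best model: {best_model.predict(X[:5])}")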
Search options¶
sklab supports multiple search strategies:
# Grid search via config
from sklab.search import GridSearchConfig
grid_result = experiment.search(
GridSearchConfig(param_grid={"model__C": [0.1, 1.0, 10.0]}),
X, y, cv=3, run_name="grid",
)
# Random search via sklearn searcher
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
random_searcher = RandomizedSearchCV(
pipeline,
param_distributions={"model__C": loguniform(0.01, 100)},
n_iter=10,
cv=3,
random_state=42,
refit=True,
)
random_result = experiment.search(random_searcher, X, y, run_name="random")
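Successive-halving search (covered in the Hyperparameter Search guide) should follow the same pattern of passing an sklearn searcher. A sketch using sklearn's experimental HalvingGridSearchCV; the enable_halving_search_cv import is required by sklearn itself:
# Halving grid search via sklearn searcher
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
halving_searcher = HalvingGridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    factor=2,
    cv=3,
    random_state=42,
)
halving_result = experiment.search(halving_searcher, X, y, run_name="halving")
print(f"Halving best params: {halving_result.best_params}")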
See Hyperparameter Search for a complete guide to search strategies.
Multiple metrics¶
Track multiple metrics simultaneously:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklab.experiment import Experiment
X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
("scale", StandardScaler()),
("model", LogisticRegression(max_iter=200)),
])
# Define multiple metrics
experiment = Experiment(
pipeline=pipeline,
scoring=["accuracy", "f1_macro", "precision_macro"],
name="multi-metric",
)
cv_result = experiment.cross_validate(X, y, cv=5, run_name="multi-cv")
for key, value in cv_result.metrics.items():
print(f"{key}: {value:.4f}")
Putting it all together¶
A typical workflow combines these operations:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklab.experiment import Experiment
from sklab.search import GridSearchConfig
# 1. Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Create experiment
pipeline = Pipeline([
("scale", StandardScaler()),
("model", LogisticRegression(max_iter=200)),
])
experiment = Experiment(
pipeline=pipeline,
scoring="accuracy",
name="full-workflow",
)
# 3. Cross-validate to estimate baseline performance
cv_result = experiment.cross_validate(X_train, y_train, cv=5, run_name="baseline-cv")
print(f"Baseline CV: {cv_result.metrics['cv/accuracy_mean']:.4f}")
# 4. Search for better hyperparameters
search_result = experiment.search(
GridSearchConfig(param_grid={"model__C": [0.01, 0.1, 1.0, 10.0, 100.0]}),
X_train, y_train,
cv=5,
run_name="search",
)
print(f"Best params: {search_result.best_params}")
print(f"Search CV: {search_result.best_score:.4f}")
# 5. Final evaluation on holdout (search already stored the best estimator)
eval_result = experiment.evaluate(
X_test, y_test,
run_name="final-eval",
)
print(f"Holdout accuracy: {eval_result.metrics['accuracy']:.4f}")
Next steps¶
- Classification Workflow — Detailed classification tutorial
- Why Pipelines Matter — Understanding data leakage
- Hyperparameter Search — Grid, random, and halving search
- Bayesian Optimization — Intelligent search with Optuna
- Logger Adapters — Track experiments with MLflow or W&B