Mixed Data Types: Titanic Classification¶
What you'll learn:
- How to handle datasets with both numeric and categorical features
- Why ColumnTransformer is essential for mixed-type preprocessing
- How to deal with missing values in a pipeline-safe way
- The importance of stratified CV for imbalanced classification
Prerequisites: Classification Workflow, understanding of pipelines.
The problem: real data is messy¶
The Iris dataset is clean—numeric features, no missing values, balanced classes. Real-world data rarely cooperates like this.
The Titanic dataset represents reality better: passenger age (numeric, sometimes missing), ticket fare (numeric), gender (categorical), embarkation port (categorical, sometimes missing), and class (ordinal). Different feature types need different preprocessing. Missing values must be handled. Class imbalance (more deaths than survivals) affects evaluation.
This tutorial shows how to build a pipeline that handles all of this correctly, without leaking information between training and test data.
Step 1: Load and explore the data¶
from sklearn.datasets import fetch_openml
titanic = fetch_openml(data_id=40945, as_frame=True)
titanic_df = titanic.frame
print(f"Shape: {titanic_df.shape}")
print(f"\nFeatures:")
for col in ["pclass", "sex", "age", "fare", "embarked"]:
    missing = titanic_df[col].isna().sum()
    print(f"  {col}: {titanic_df[col].dtype}, {missing} missing")
print(f"\nTarget distribution:")
print(titanic_df["survived"].value_counts())
Concept: Mixed Data Types
Datasets often contain different types of features:
- Numeric: age, fare, continuous measurements
- Categorical: gender, port of embarkation, discrete labels
- Ordinal: passenger class (1st > 2nd > 3rd), education level
Each type needs different preprocessing. Scalers work on numbers. Encoders work on categories. Applying the wrong transform corrupts your data.
Why it matters: A pipeline that treats "sex" as numeric will try to compute its mean—nonsense that sklearn might not catch.
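A two-line illustration of the nonsense (using a hypothetical male=0/female=1 label coding, for illustration only):

```python
# "Average gender" computes without complaint, but means nothing.
sex_as_numbers = [0, 1, 1, 0, 1]  # hypothetical male=0, female=1 coding
print(sum(sex_as_numbers) / len(sex_as_numbers))  # 0.6
```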
Step 2: Define feature groups¶
We need to tell sklearn which columns are which type, so it applies the correct preprocessing to each.
feature_cols = ["pclass", "sex", "age", "fare", "embarked"]
X = titanic_df[feature_cols].to_numpy()
y = titanic_df["survived"].to_numpy().astype(int) # Convert string labels to int
# Column indices by type (after selecting feature_cols)
categorical_cols = [0, 1, 4] # pclass, sex, embarked
numeric_cols = [2, 3] # age, fare
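Positional indices are easy to get wrong when the feature list changes, so a quick plain-Python sanity check that they point at the intended columns is cheap insurance:

```python
feature_cols = ["pclass", "sex", "age", "fare", "embarked"]
categorical_cols = [0, 1, 4]
numeric_cols = [2, 3]

# Map the indices back to names to confirm they refer to the right columns.
print([feature_cols[i] for i in categorical_cols])  # ['pclass', 'sex', 'embarked']
print([feature_cols[i] for i in numeric_cols])      # ['age', 'fare']
```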
Step 3: Build the preprocessing pipeline¶
This is where ColumnTransformer shines. It routes different columns through
different transformers.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Categorical pipeline: impute missing, then one-hot encode
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric pipeline: impute missing, then scale
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Combine with ColumnTransformer
preprocess = ColumnTransformer(
    transformers=[
        ("cat", categorical_pipeline, categorical_cols),
        ("num", numeric_pipeline, numeric_cols),
    ],
    remainder="drop",  # drop any columns not explicitly handled
)
What this does:
- Categorical columns: Fill missing with most frequent value, then create binary columns for each category
- Numeric columns: Fill missing with median, then scale to mean=0, std=1
- Other columns: Dropped (we've selected only the features we want)
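To make the two routes concrete, here is a plain-Python sketch of the same operations on a handful of made-up values (illustrative only; in the pipeline above, SimpleImputer, OneHotEncoder, and StandardScaler do this work, refit on training data):

```python
import statistics

# Numeric route: median-impute, then standardize.
ages = [22.0, 38.0, None, 26.0, 35.0]
median_age = statistics.median(a for a in ages if a is not None)  # 30.5
imputed = [median_age if a is None else a for a in ages]
mean, std = statistics.mean(imputed), statistics.pstdev(imputed)
scaled = [(a - mean) / std for a in imputed]  # mean=0, std=1

# Categorical route: most-frequent-impute, then one-hot encode.
ports = ["S", "C", None, "S", "Q"]
most_frequent = statistics.mode(p for p in ports if p is not None)  # "S"
filled = [most_frequent if p is None else p for p in ports]
categories = sorted(set(filled))  # ["C", "Q", "S"]
onehot = [[1 if p == c else 0 for c in categories] for p in filled]
print(onehot[0])  # first passenger embarked at "S" -> [0, 0, 1]
```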
Concept: Imputation Inside the Pipeline
Missing values must be handled inside the pipeline, not before. Why?
If you impute before splitting, the imputer sees test data statistics—leakage. The median age in your training set shouldn't include test passengers.
Why it matters: Inside the pipeline, imputation is refit on each fold's training data. The test fold's missing values are filled using only training statistics—the correct approach.
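A minimal sketch of the fold-safe behavior, with made-up ages: the test fold's missing value is filled using the training fold's median only.

```python
import statistics

train_ages = [22.0, 38.0, 26.0]  # this fold's training data
test_ages = [None, 54.0]         # this fold's held-out data

# Fit on train only: the test passenger's age never enters the statistic.
train_median = statistics.median(train_ages)  # 26.0
filled_test = [train_median if a is None else a for a in test_ages]
print(filled_test)  # [26.0, 54.0]
```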
Concept: OneHotEncoder Options
handle_unknown="ignore" prevents errors when the test set contains
categories not seen during training. Instead of crashing, it creates a
row of zeros for that observation.
Why it matters: In production, you might see a new embarkation port or edge case. The model should handle it gracefully, not crash.
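A plain-Python sketch of that behavior: the category set is fixed at fit time, and anything unseen maps to all zeros rather than raising an error.

```python
fitted_categories = ["C", "Q", "S"]  # ports seen during training

def encode(port):
    # One column per fitted category; unknown values match none of them.
    return [1 if port == c else 0 for c in fitted_categories]

print(encode("S"))  # [0, 0, 1]
print(encode("X"))  # unseen port -> [0, 0, 0], not an exception
```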
Step 4: Build the full pipeline and experiment¶
from sklearn.linear_model import LogisticRegression
from sklab.experiment import Experiment
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=200)),
])
experiment = Experiment(
    pipeline=pipeline,
    scoring=["accuracy", "f1"],
    name="titanic",
)
Step 5: Cross-validate with stratification¶
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
result = experiment.cross_validate(X, y, cv=cv, run_name="titanic-cv")
print(f"Accuracy: {result.metrics['cv/accuracy_mean']:.3f} (+/- {result.metrics['cv/accuracy_std']:.3f})")
print(f"F1 Score: {result.metrics['cv/f1_mean']:.3f} (+/- {result.metrics['cv/f1_std']:.3f})")
Concept: Stratified Splits for Imbalanced Data
The Titanic dataset is imbalanced—more passengers died than survived. Random splits might accidentally put most survivors in one fold, distorting metrics.
StratifiedKFold ensures each fold has the same proportion of survivors and
non-survivors as the original data.
Why it matters: Without stratification, fold metrics vary wildly based on random chance rather than model quality.
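The guarantee can be sketched in plain Python by dealing each class's samples round-robin into folds (the essence of what StratifiedKFold does, minus its shuffling), using an assumed 60/40 class split:

```python
labels = [0] * 60 + [1] * 40  # 60% died, 40% survived
n_splits = 5

# Deal each class into folds separately so proportions are preserved.
folds = [[] for _ in range(n_splits)]
for cls in (0, 1):
    idx = [i for i, y in enumerate(labels) if y == cls]
    for j, i in enumerate(idx):
        folds[j % n_splits].append(i)

for fold in folds:
    survivors = sum(labels[i] for i in fold)
    print(f"fold size {len(fold)}, survivors {survivors}")  # 20 and 8 each
```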
Understanding the metrics¶
For imbalanced classification, accuracy alone is misleading. If 60% of passengers died, a model that always predicts "died" gets 60% accuracy—useless but high-scoring.
F1 score balances precision and recall:
- Precision: Of passengers predicted to survive, how many actually did?
- Recall: Of passengers who survived, how many did we predict correctly?
A good model has high F1, not just high accuracy.
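A worked example on hypothetical counts (not results from this dataset): out of 100 passengers, 40 survived and 60 died.

```python
tp = 30  # predicted survived, did survive
fp = 10  # predicted survived, actually died
fn = 10  # predicted died, actually survived
tn = 50  # predicted died, did die

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.80
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

# The degenerate "always predict died" model scores 60% accuracy here,
# but with no true positives its precision, recall, and F1 are all zero.
baseline_accuracy = 60 / 100
print(accuracy, f1, baseline_accuracy)
```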
Complete example¶
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklab.experiment import Experiment
# 1. Load data
titanic = fetch_openml(data_id=40945, as_frame=True)
titanic_df = titanic.frame
feature_cols = ["pclass", "sex", "age", "fare", "embarked"]
X = titanic_df[feature_cols].to_numpy()
y = titanic_df["survived"].to_numpy().astype(int) # Convert string labels to int
categorical_cols = [0, 1, 4]
numeric_cols = [2, 3]
# 2. Build preprocessing
preprocess = ColumnTransformer(
    transformers=[
        (
            "cat",
            Pipeline([
                ("impute", SimpleImputer(strategy="most_frequent")),
                ("onehot", OneHotEncoder(handle_unknown="ignore")),
            ]),
            categorical_cols,
        ),
        (
            "num",
            Pipeline([
                ("impute", SimpleImputer(strategy="median")),
                ("scale", StandardScaler()),
            ]),
            numeric_cols,
        ),
    ],
    remainder="drop",
)
# 3. Build full pipeline
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=200)),
])
# 4. Create experiment and cross-validate
experiment = Experiment(
    pipeline=pipeline,
    scoring=["accuracy", "f1"],
    name="titanic",
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
result = experiment.cross_validate(X, y, cv=cv, run_name="cv")
print(f"Accuracy: {result.metrics['cv/accuracy_mean']:.3f}")
print(f"F1 Score: {result.metrics['cv/f1_mean']:.3f}")
Best practices¶
- Use ColumnTransformer for mixed types. Don't try to preprocess everything the same way—different features need different handling.
- Impute inside the pipeline. Missing value handling must be part of the pipeline to prevent leakage.
- Use stratified splits for imbalanced data. Ensures each fold has representative class proportions.
- Look beyond accuracy. F1, precision, recall, and AUC-ROC tell you more about real-world performance on imbalanced problems.
- Handle unknown categories. Use handle_unknown="ignore" for production robustness.
Tradeoffs¶
| Choice | Pros | Cons |
|---|---|---|
| OneHotEncoder | No ordinal assumptions | High dimensionality |
| OrdinalEncoder | Compact | Implies ordering that may not exist |
| Median imputation | Robust to outliers | Ignores feature relationships |
| Model-based imputation | Uses correlations | Adds complexity, may overfit |
Next steps¶
- Hyperparameter Search — Find better model parameters
- Why Pipelines Matter — Deeper dive into leakage prevention
- Logger Adapters — Track these experiments