References¶

A curated list of the papers, documentation, and resources referenced throughout the sklab documentation, organized by topic.

Hyperparameter Optimization¶

Random Search¶

Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.

Paper

The foundational paper showing that random search often outperforms grid search, especially in high-dimensional spaces where only a few parameters matter.

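Key insight: Random search draws a fresh value for every parameter on each trial, so N trials test N distinct values of each important parameter no matter how many unimportant parameters exist. A grid with the same budget tests far fewer distinct values per parameter.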
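The coverage argument can be checked with a small count. The following is a minimal sketch using only the standard library; the two-parameter space, the budget of nine trials, and the helper name `distinct_values_per_dim` are illustrative assumptions, not part of sklab or the paper.

```python
import itertools
import random


def distinct_values_per_dim(points):
    """Count how many distinct values each dimension receives."""
    return [len({p[d] for p in points}) for d in range(len(points[0]))]


budget = 9  # same number of trials for both strategies

# Grid search: 3 values per dimension gives 3 x 3 = 9 trials.
grid_axis = [0.0, 0.5, 1.0]
grid_points = list(itertools.product(grid_axis, grid_axis))

# Random search: 9 trials, each dimension drawn independently.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(budget)]

grid_counts = distinct_values_per_dim(grid_points)
random_counts = distinct_values_per_dim(random_points)
print("grid:  ", grid_counts)
print("random:", random_counts)
```

With nine trials each, the grid tests three distinct values per parameter while random search tests nine. If only one of the two parameters matters, the grid has effectively spent nine trials on three settings.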

Bayesian Optimization and TPE¶

Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems, 24.

Paper

Introduces the TPE (Tree-structured Parzen Estimator) algorithm, which models the density of good vs. bad configurations rather than the objective function directly.

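Key insight: TPE tends to scale better to high-dimensional search spaces than Gaussian process-based Bayesian optimization, in part because it models each parameter's density separately rather than fitting one joint surrogate over all parameters.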
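The good-versus-bad density idea can be illustrated in one dimension. This is a simplified sketch, not the algorithm as published or as implemented in any library: the fixed kernel bandwidth, the `gamma` split fraction, the toy objective, and the function names `parzen_density` and `tpe_suggest` are all assumptions made for illustration.

```python
import math
import random


def parzen_density(x, centers, bandwidth):
    """Average of Gaussian kernels centred on past observations."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(
        math.exp(-0.5 * ((x - c) / bandwidth) ** 2) / norm for c in centers
    ) / len(centers)


def tpe_suggest(history, rng, gamma=0.25, bandwidth=0.1, n_candidates=24):
    """Propose the next x from (x, loss) pairs using a good/bad density ratio."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    # Draw candidates from the "good" density l(x), clipped to [0, 1].
    candidates = [
        min(1.0, max(0.0, rng.gauss(rng.choice(good), bandwidth)))
        for _ in range(n_candidates)
    ]
    # Keep the candidate that maximises l(x) / g(x).
    return max(
        candidates,
        key=lambda x: parzen_density(x, good, bandwidth)
        / (parzen_density(x, bad, bandwidth) + 1e-12),
    )


def objective(x):
    return (x - 0.3) ** 2  # toy loss with its minimum at x = 0.3


rng = random.Random(0)
history = [(x, objective(x)) for x in (rng.random() for _ in range(10))]
for _ in range(30):
    x_next = tpe_suggest(history, rng)
    history.append((x_next, objective(x_next)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(f"best x = {best_x:.3f}, loss = {best_loss:.5f}")
```

With several parameters, a sketch along these lines would build one good/bad density pair per parameter, which is the independence that the key insight refers to.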

Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Paper | Documentation

The Optuna framework paper, describing its define-by-run API and its efficient implementations of sampling (including TPE) and pruning strategies.

Early Stopping and Successive Halving¶

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2018). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18, 1-52.

Paper

Introduces Hyperband, which combines successive halving with multiple brackets to robustly handle different convergence rates.

Key insight: Running many configurations with small budgets, then progressively eliminating the worst performers, uses resources more efficiently than running all configurations to completion.
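The elimination schedule can be sketched in a few lines. This is a minimal illustration of a single successive-halving bracket under assumed settings: the halving factor `eta=2`, the 16 starting configurations, and the toy learning curve `toy_loss` are hypothetical and chosen only to make the budget arithmetic visible.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of configs each round while multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    total_cost = 0
    while len(survivors) > 1:
        # Evaluate every survivor at the current budget, lowest loss first.
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        total_cost += budget * len(survivors)
        survivors = scored[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], total_cost


def toy_loss(config, budget):
    """Hypothetical learning curve: loss approaches `config` as budget grows."""
    return config + 1.0 / budget


rng = random.Random(0)
configs = [rng.random() for _ in range(16)]
best, cost = successive_halving(configs, toy_loss)

# Cost of running all 16 configurations at the largest budget used (8).
full_cost = len(configs) * 8
print(f"successive halving cost: {cost}, full evaluation cost: {full_cost}")
```

Here the four rounds (16, 8, 4, and 2 configurations at budgets 1, 2, 4, and 8) cost 64 budget units, against 128 for running everything at budget 8. The toy curves never cross, which is the favorable case; when early rankings are unreliable, a single bracket can discard a slow starter, and that is the situation Hyperband's multiple brackets are meant to hedge against.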

Jamieson, K., & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS).

Paper

Frames hyperparameter optimization as a non-stochastic best-arm identification problem and proves theoretical guarantees for successive halving in that setting.

Gaussian Process Bayesian Optimization¶

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016). Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1), 148-175.

Paper

Comprehensive survey of Bayesian optimization, covering acquisition functions, surrogate models, and practical considerations.

Cross-Validation and Model Evaluation¶

scikit-learn User Guide: Cross-validation

Documentation

Comprehensive guide to cross-validation strategies in sklearn, including k-fold, stratified, time series, and grouped variants.

scikit-learn User Guide: Pipelines and Composite Estimators

Documentation

Official documentation on sklearn Pipelines, ColumnTransformers, and avoiding data leakage.

Data Leakage¶

Kaufman, S., Rosset, S., Perlich, C., & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4), 1-21.

Paper

Foundational paper defining and categorizing data leakage in machine learning.

Key insight: Leakage can occur through many subtle mechanisms—feature engineering, sampling, temporal ordering—not just obvious train/test contamination.

scikit-learn: Common pitfalls and recommended practices

Documentation

Official guide to common mistakes in ML workflows, including data leakage, overfitting, and evaluation errors.
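One leakage mechanism that is easy to reproduce is preprocessing fitted on rows that later serve as validation data. The sketch below assumes scikit-learn is installed and uses a synthetic dataset; the dataset sizes and the choice of `StandardScaler` with `LogisticRegression` are illustrative only and are not taken from the references above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Leaky: the scaler sees every row, including rows later used for validation.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Safer: the scaler is refit on the training portion of each fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(pipeline, X, y, cv=5)

print("leaky mean accuracy:", leaky_scores.mean())
print("clean mean accuracy:", clean_scores.mean())
```

On a small, well-behaved dataset like this one the two scores may be nearly identical. The point of the pipeline version is structural: no statistic computed from validation rows can reach the model, whatever the preprocessing step is.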

External Tools¶

Experiment Tracking¶

MLflow Documentation — Open-source platform for ML lifecycle management
Weights & Biases Documentation — ML experiment tracking and visualization

Hyperparameter Optimization¶

Optuna Documentation — Hyperparameter optimization framework
scikit-learn GridSearchCV — Exhaustive grid search
scikit-learn RandomizedSearchCV — Randomized search
scikit-learn HalvingRandomSearchCV — Successive halving

How to cite sklab¶

If you use sklab in your research, please cite this repository:

@software{sklab,
  title = {sklab: A lightweight experiment runner for sklearn pipelines},
  url = {https://github.com/your-username/scikit-lab},
  year = {2024}
}