Scikit-Learn Data Preprocessing Guide
Sebastian Raschka
[Link]
6. Scikit-learn Pipelines
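Figure: supervised learning workflow. A learning algorithm is fit to the training dataset and its labels to produce a final model, which predicts labels for new data. Supporting steps: model selection, cross-validation, performance metrics, hyperparameter optimization.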
Dataset paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7, Part II, 179-188 (1936);
also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
[Link]
Source: [Link]
Modin: [Link]
MLXTEND [Link]
Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and
extensions to Python’s scientific computing stack."
The Journal of Open Source Software 3.24 (2018).
Code notebook: [Link]
L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 28
Python Classes
The "Main" Machine Learning Library for Python
[Link]
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12, pp. 2825-2830.
Figure: the scikit-learn estimator API.
① [Link](X_train, y_train): fit on the training data and training labels.
② [Link](X_test): return the predicted labels for the test data.
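The two-step estimator API can be sketched as follows. This is a minimal illustration, not the lecture's own code: the Iris data, the `KNeighborsClassifier`, the 70/30 split, and the `random_state` value are all assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the Iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Step 1: fit the estimator on training data and training labels.
clf = KNeighborsClassifier(n_neighbors=3)  # assumed estimator choice
clf.fit(X_train, y_train)

# Step 2: predict labels for data the model has not seen.
y_pred = clf.predict(X_test)

# Fraction of test examples whose predicted label matches the true label.
accuracy = (y_pred == y_test).mean()
print(f"Test accuracy: {accuracy:.2f}")
```

Any scikit-learn classifier could be swapped in for `KNeighborsClassifier` here, since they all expose the same `fit` and `predict` methods.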
Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets.
Stratified Splits
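A stratified split keeps the class proportions of the full dataset in both subsets, whereas a purely random split can skew them. The sketch below compares the two on Iris; the 30% test size and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 examples, 50 per class

# Plain random split: class counts in the test set may be unbalanced.
_, _, _, y_test_random = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Stratified split: class proportions of y are preserved in both subsets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

random_counts = np.bincount(y_test_random)
stratified_counts = np.bincount(y_test_strat)
print("Random test-set class counts:    ", random_counts)
print("Stratified test-set class counts:", stratified_counts)  # [15 15 15]
```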
Min-max scaling (normalization):

$$x_{\text{norm}}^{[i]} = \frac{x^{[i]} - x_{\min}}{x_{\max} - x_{\min}}$$
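As a small sketch of min-max scaling, the snippet below applies the formula by hand with NumPy and compares the result with scikit-learn's `MinMaxScaler`. The five feature values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature dataset (one column, five examples).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual min-max scaling: (x - x_min) / (x_max - x_min), per column.
x_min = x.min(axis=0)
x_max = x.max(axis=0)
x_norm_manual = (x - x_min) / (x_max - x_min)

# The same rescaling via scikit-learn (default feature_range is (0, 1)).
x_norm_sklearn = MinMaxScaler().fit_transform(x)

print(x_norm_manual.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```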
Standardization:

$$x_{\text{std}}^{[i]} = \frac{x^{[i]} - \mu_x}{\sigma_x}$$
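Standardization can be sketched in the same way. The values below are made up for illustration; the manual version uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual standardization with the population standard deviation (ddof=0).
mu_x = x.mean(axis=0)
sigma_x = x.std(axis=0)  # NumPy's default is ddof=0
x_std_manual = (x - mu_x) / sigma_x

# StandardScaler also divides by the ddof=0 standard deviation.
x_std_sklearn = StandardScaler().fit_transform(x)

print(x_std_manual.mean(), x_std_manual.std())  # approximately 0 and 1
```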
Sample standard deviation:

$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x^{[i]} - \bar{x}\right)^2}$$

Population standard deviation:

$$\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x^{[i]} - \mu_x\right)^2}$$
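The two definitions differ only in the divisor, which NumPy exposes through the `ddof` ("delta degrees of freedom") argument. The sketch below uses made-up values. NumPy defaults to `ddof=0`, while pandas' `std()` defaults to `ddof=1`, so the two libraries return different numbers for the same data unless `ddof` is set explicitly.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Population standard deviation: divide by n (NumPy's default, ddof=0).
sigma_x = np.std(x)

# Sample standard deviation: divide by n - 1 (ddof=1).
s_x = np.std(x, ddof=1)

print(f"sigma_x = {sigma_x:.4f}")  # 1.4142
print(f"s_x     = {s_x:.4f}")      # 1.5811
```

For large n the difference is negligible; it matters mostly for small datasets and when comparing results across libraries.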
Estimate:
mean: 20 cm
Standardize:
- example6: 7 cm -> class ?
Estimate "new" mean and std.:
mean: 20 cm
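The numbers below sketch why new data should be scaled with the parameters estimated on the training set and not re-estimated. Only the 20 cm mean and the 7 cm example come from the slide; the individual training lengths (10, 20, 30 cm) and the other new examples (5 and 6 cm) are hypothetical values chosen to match them.

```python
import numpy as np

# Hypothetical training lengths in cm; their mean is 20 cm.
train_cm = np.array([10.0, 20.0, 30.0])
# Hypothetical new examples in cm; the last one is the 7 cm example.
new_cm = np.array([5.0, 6.0, 7.0])

# Parameters estimated once, on the training set only.
mu = train_cm.mean()
sigma = train_cm.std()

# Correct: reuse the training parameters for the new data.
new_with_train_params = (new_cm - mu) / sigma

# Problematic: re-estimate a "new" mean and std. from the new data.
new_reestimated = (new_cm - new_cm.mean()) / new_cm.std()

print(new_with_train_params)  # all negative: short relative to training data
print(new_reestimated)        # 7 cm now looks like a long example
```

With re-estimated parameters, the 7 cm example gets the same standardized value as the 30 cm training example, so a model trained on the original scale would misjudge it.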
Figure: ① [Link](X_train) estimates the scaling parameters on the training data only; the same parameters then produce both the transformed training data and the transformed test data.
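In scikit-learn this pattern corresponds to calling `fit` on the training data and `transform` on both subsets. The sketch below assumes `StandardScaler` and the Iris data for illustration; the slide does not say which scaler it uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Estimate the per-feature mean and standard deviation on the training data only.
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the same training-set parameters to both subsets.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print("Training means after scaling:", X_train_std.mean(axis=0).round(2))
print("Test means after scaling:    ", X_test_std.mean(axis=0).round(2))
```

The test-set means are close to zero but not exactly zero, which is expected because the parameters come from the training set.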
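Figure: a scikit-learn Pipeline chaining scaling, dimensionality reduction, and a learning algorithm.
(Step 1) [Link](…) on the training set and its class labels: scaling .fit(…) & .transform(…), then dimensionality reduction .fit(…) & .transform(…), then the learning algorithm .fit(…), yielding the predictive model.
(Step 2) [Link](…) on the test set: scaling .transform(…), then dimensionality reduction .transform(…), then the predictive model .predict(…), yielding class labels.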
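The two steps in the pipeline figure can be sketched with `make_pipeline`. The specific components (`StandardScaler` for scaling, `PCA` for dimensionality reduction, `KNeighborsClassifier` as the learning algorithm), the Iris data, and the split settings are assumptions for this example; the figure only names the generic stages.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Scaling -> dimensionality reduction -> learning algorithm.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=3),
)

# Step 1: fit() calls fit and transform on each preprocessing stage in order,
# then fit on the final estimator, all using the training set only.
pipe.fit(X_train, y_train)

# Step 2: predict() calls transform on each preprocessing stage
# (no refitting), then predict on the final estimator.
y_pred = pipe.predict(X_test)
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.2f}")
```

Wrapping the stages in one object means the test set can never leak into the fitted scaling or PCA parameters.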
Figure: model selection loop. Fit the machine learning algorithm to obtain a predictive model, evaluate it, change hyperparameters and repeat; the chosen model then gets a final performance estimate.
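One way to sketch this loop is a three-way holdout split: fit on a training subset, evaluate on a validation subset, change the hyperparameter and repeat, then report a final estimate on a test set that was never used for tuning. The split sizes, the candidate values of `n_neighbors`, and the choice of `KNeighborsClassifier` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then split the remainder into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=123, stratify=y_temp
)

best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7):  # assumed hyperparameter candidates
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipe.fit(X_train, y_train)           # fit
    acc = pipe.score(X_valid, y_valid)   # evaluate
    if acc > best_acc:                   # keep the best setting so far
        best_k, best_acc = k, acc

# Refit the chosen configuration on train + validation data,
# then compute the final performance estimate on the untouched test set.
final_model = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=best_k)
)
final_model.fit(X_temp, y_temp)
final_acc = final_model.score(X_test, y_test)
print(f"Best k: {best_k}, final test accuracy: {final_acc:.2f}")
```

scikit-learn's `GridSearchCV` automates the same loop with cross-validation in place of a single validation split.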
[Link]
[Link]
master/L05/code/05-preprocessing-and-sklearn__notes.ipynb