Lecture 6. Data preprocessing

Real-world machine learning pipelines

Joaquin Vanschoren

Data transformations

  • Machine learning models make a lot of assumptions about the data
  • In reality, these assumptions are often violated
  • We build pipelines that transform the data before feeding it to the learners
    • Scaling (or other numeric transformations)
    • Encoding (convert categorical features into numerical ones)
    • Automatic feature selection
    • Feature engineering (e.g. binning, polynomial features,...)
    • Handling missing data
    • Handling imbalanced data
    • Dimensionality reduction (e.g. PCA)
    • Learned embeddings (e.g. for text)
  • Seek the best combinations of transformations and learning methods
    • Often done empirically, using cross-validation
    • Make sure that there is no data leakage during this process!
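
As a minimal sketch of that last point (assuming scikit-learn and its built-in wine dataset), a pipeline couples the transformation to the learner, so cross-validation refits the scaler on each training fold and nothing leaks from the validation folds:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_wine(return_X_y=True)

    # The scaler is re-fitted on the training folds of every split, so no
    # information from the validation fold leaks into the preprocessing
    pipe = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(pipe, X, y, cv=5)
    print(scores.mean())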

Scaling

  • Use when different numeric features have different scales (different range of values)
    • Features with much higher values may overpower the others
  • Goal: bring them all within the same range
  • Different methods exist
(Interactive demo: compare StandardScaler, RobustScaler, MinMaxScaler, ... on example data)

Why do we need scaling?

  • KNN: Distances depend mainly on the features with the largest values
  • SVMs: (kernelized) dot products are also based on distances
  • Linear models: Feature scale affects the regularization
    • With scaling, the weights have similar scales and are more interpretable

(Interactive demo: effect of scaling for KNeighborsClassifier, SVC, LinearSVC, ...)
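
A rough sketch of this effect (assuming scikit-learn and its breast cancer dataset, whose features differ in scale by several orders of magnitude): compare a distance-based model with and without standardization.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Without scaling, the distances are dominated by the features with the largest values
    knn = KNeighborsClassifier()
    print("raw:    ", cross_val_score(knn, X, y, cv=5).mean())

    # With scaling, every feature contributes comparably to the distance
    scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
    print("scaled: ", cross_val_score(scaled_knn, X, y, cv=5).mean())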

Standard scaling (standardization)

  • Generally most useful, assumes data is more or less normally distributed
  • Per feature, subtract the mean value $\mu$, scale by standard deviation $\sigma$
  • New feature has $\mu=0$ and $\sigma=1$, values can still be arbitrarily large $$\mathbf{x}_{new} = \frac{\mathbf{x} - \mu}{\sigma}$$
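
A minimal sketch with scikit-learn's StandardScaler on a tiny toy array:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

    # fit() computes the per-feature mean and standard deviation,
    # transform() applies (x - mu) / sigma
    scaler = StandardScaler().fit(X)
    X_new = scaler.transform(X)
    print(X_new.mean(axis=0))  # ~0 per feature
    print(X_new.std(axis=0))   # 1 per feature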

Min-max scaling

  • Scales all features between a given $min$ and $max$ value (e.g. 0 and 1)
  • Makes sense if min/max values have meaning in your data
  • Sensitive to outliers
$$\mathbf{x}_{new} = \frac{\mathbf{x} - x_{min}}{x_{max} - x_{min}} \cdot (max - min) + min $$
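
The same idea with scikit-learn's MinMaxScaler, as a minimal sketch; note how a single outlier compresses the other values:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0,  -5.0],
                  [5.0,   0.0],
                  [9.0, 100.0]])  # 100.0 acts as an outlier in the second feature

    # feature_range sets the target interval (default (0, 1));
    # the outlier squeezes all other values of that feature towards 0
    scaler = MinMaxScaler(feature_range=(0, 1))
    print(scaler.fit_transform(X))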

Robust scaling

  • Subtracts the median, scales by the range between the quantiles $q_{25}$ and $q_{75}$ (the interquartile range)
  • New feature has median 0 and interquartile range 1
  • Similar to standard scaling, but much less sensitive to outliers
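
A minimal sketch contrasting RobustScaler with StandardScaler on data with one extreme outlier:

    import numpy as np
    from sklearn.preprocessing import RobustScaler, StandardScaler

    # One extreme outlier in an otherwise small-valued feature
    X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

    # StandardScaler's mean and standard deviation are pulled towards the outlier,
    # squashing the remaining values together
    print(StandardScaler().fit_transform(X).ravel())

    # RobustScaler's median and interquartile range are barely affected,
    # so the non-outlier values keep a sensible spread
    print(RobustScaler().fit_transform(X).ravel())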

Normalization

  • Makes sure that feature values of each point (each row) sum up to 1 (L1 norm)
    • Useful for count data (e.g. word counts in documents)
  • Can also be used with L2 norm (sum of squares is 1)
    • Useful when computing distances in high dimensions
    • For L2-normalized points, Euclidean distance is a monotone function of cosine similarity
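
A minimal sketch with scikit-learn's Normalizer on toy word counts:

    import numpy as np
    from sklearn.preprocessing import Normalizer

    # Toy word counts for two documents of very different lengths
    X = np.array([[ 3.0,  1.0, 0.0],
                  [30.0, 10.0, 0.0]])

    # L1: each row sums to 1 (relative word frequencies);
    # both documents now get the same representation
    print(Normalizer(norm='l1').fit_transform(X))

    # L2: each row has unit Euclidean length
    print(Normalizer(norm='l2').fit_transform(X))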

Maximum Absolute scaler

  • For sparse data (many features, but few are non-zero)
    • Maintain sparseness (efficient storage)
  • Scales all values so that maximum absolute value is 1
  • Similar to Min-Max scaling without changing 0 values
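
A minimal sketch with MaxAbsScaler on a sparse matrix (assuming SciPy for the sparse format):

    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import MaxAbsScaler

    # Sparse data: most entries are zero and must stay zero after scaling
    X = csr_matrix([[ 0.0, 10.0, 0.0],
                    [ 5.0,  0.0, 0.0],
                    [ 0.0, -2.0, 4.0]])

    # Each feature is divided by its maximum absolute value, so values fall in [-1, 1]
    # without centering (which would destroy the sparsity)
    print(MaxAbsScaler().fit_transform(X).toarray())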

Power transformations

  • Some features follow certain distributions
    • E.g. the number of Twitter followers is log-normally distributed
  • Box-Cox transformations map these to a normal distribution ($\lambda$ is fitted)
    • Only works for positive values, use Yeo-Johnson otherwise $$bc_{\lambda}(x) = \begin{cases} \log(x) & \lambda = 0\\ \frac{x^{\lambda}-1}{\lambda} & \lambda \neq 0 \\ \end{cases}$$
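
A minimal sketch with scikit-learn's PowerTransformer on synthetic log-normal data:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    rng = np.random.RandomState(0)
    # Log-normally distributed feature (strictly positive, heavy right tail)
    X = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))

    # Box-Cox fits lambda per feature; it requires strictly positive inputs
    pt = PowerTransformer(method='box-cox')
    X_transformed = pt.fit_transform(X)
    print(pt.lambdas_)  # fitted lambda, close to 0 for log-normal data

    # For data with zeros or negative values, use method='yeo-johnson' (the default)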

Categorical feature encoding

  • Many algorithms can only handle numeric features, so we need to encode the categorical ones
        boro  salary  vegan
0  Manhattan     103      0
1     Queens      89      0
2  Manhattan     142      0
3   Brooklyn      54      1
4   Brooklyn      63      1
5      Bronx     219      0

Ordinal encoding

  • Simply assigns an integer value to each category (e.g. in alphabetical order, as in the table below)
  • Only really useful if there exists a natural order in the categories
    • The model will consider one category to be 'higher' or 'closer' to another
        boro  boro_ordinal  salary
0  Manhattan             2     103
1     Queens             3      89
2  Manhattan             2     142
3   Brooklyn             1      54
4   Brooklyn             1      63
5      Bronx             0     219
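
A minimal sketch with scikit-learn's OrdinalEncoder on the example data (by default it assigns integers in sorted order, matching the table above):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan',
                                'Brooklyn', 'Brooklyn', 'Bronx']})

    # Categories are sorted, so: Bronx=0, Brooklyn=1, Manhattan=2, Queens=3
    enc = OrdinalEncoder()
    df['boro_ordinal'] = enc.fit_transform(df[['boro']]).ravel()
    print(df)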

One-hot encoding (dummy encoding)

  • Simply adds a new 0/1 feature for every category, having 1 (hot) if the sample has that category
  • Can explode if a feature has lots of values, causing issues with high dimensionality
  • What if test set contains a new category not seen in training data?
    • Either ignore it (just use all 0's in row), or handle manually (e.g. resample)
        boro  boro_Bronx  boro_Brooklyn  boro_Manhattan  boro_Queens  salary
0  Manhattan           0              0               1            0     103
1     Queens           0              0               0            1      89
2  Manhattan           0              0               1            0     142
3   Brooklyn           0              1               0            0      54
4   Brooklyn           0              1               0            0      63
5      Bronx           1              0               0            0     219
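
A minimal sketch with scikit-learn's OneHotEncoder (assuming scikit-learn ≥ 1.2, where the dense-output flag is called sparse_output); handle_unknown='ignore' implements the "all 0's" option for unseen categories:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan',
                                'Brooklyn', 'Brooklyn', 'Bronx']})

    # One new 0/1 column per category
    enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    onehot = enc.fit_transform(df[['boro']])
    print(pd.DataFrame(onehot, columns=enc.get_feature_names_out(['boro'])))

    # A category never seen during fit() is encoded as an all-zero row
    print(enc.transform(pd.DataFrame({'boro': ['Staten Island']})))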

Target encoding

  • Value close to 1 if category correlates with class 1, close to 0 if correlates with class 0
  • Preferred when you have lots of category values. It only creates one new feature per class
  • Blends the posterior probability of the target, $\frac{n_{iY}}{n_i}$, and the prior probability $\frac{n_Y}{n}$
    • $n_{iY}$: number of samples with category i and class Y=1, $n_{i}$: number of samples with category i
    • Blending: the weight of the prior gradually decreases as you see more samples of category i $$Enc(i) = \color{blue}{\frac{1}{1+e^{-(n_{i}-1)}} \frac{n_{iY}}{n_i}} + \color{green}{\left(1-\frac{1}{1+e^{-(n_{i}-1)}}\right) \frac{n_Y}{n}}$$
    • Same for regression, with $\frac{n_{iY}}{n_i}$: mean target value for category i, and $\frac{n_{Y}}{n}$: overall mean target value
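
A direct, minimal implementation of the blending formula above on the example data (dedicated implementations, such as category_encoders' or scikit-learn's TargetEncoder, use similar though not identical smoothing schemes):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'boro':  ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
        'vegan': [0, 0, 0, 1, 1, 0]})

    prior = df['vegan'].mean()                                  # n_Y / n
    stats = df.groupby('boro')['vegan'].agg(['mean', 'count'])  # n_iY / n_i and n_i

    # Blending weight 1 / (1 + e^-(n_i - 1)): approaches 1 as a category gets more samples
    weight = 1 / (1 + np.exp(-(stats['count'] - 1)))
    encoding = weight * stats['mean'] + (1 - weight) * prior

    df['boro_encoded'] = df['boro'].map(encoding)
    print(df)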