Lecture 6. Data preprocessing

Real-world machine learning pipelines

Joaquin Vanschoren

Data transformations

  • Machine learning models make a lot of assumptions about the data
  • In reality, these assumptions are often violated
  • We build pipelines that transform the data before feeding it to the learners
    • Scaling (or other numeric transformations)
    • Encoding (convert categorical features into numerical ones)
    • Automatic feature selection
    • Feature engineering (e.g. binning, polynomial features,...)
    • Handling missing data
    • Handling imbalanced data
    • Dimensionality reduction (e.g. PCA)
    • Learned embeddings (e.g. for text)
  • Seek the best combinations of transformations and learning methods
    • Often done empirically, using cross-validation
    • Make sure that there is no data leakage during this process!
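
As a minimal sketch of that last point (assuming scikit-learn and its built-in wine dataset), a pipeline couples the transformation to the learner, so cross-validation refits the scaler on each training fold and nothing leaks from the validation folds:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_wine(return_X_y=True)

    # The scaler is re-fitted on the training folds of every split, so no
    # information from the validation fold leaks into the preprocessing
    pipe = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(pipe, X, y, cv=5)
    print(scores.mean())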

Scaling

  • Use when different numeric features have different scales (different range of values)
    • Features with much higher values may overpower the others
  • Goal: bring them all within the same range
  • Different methods exist
(Interactive demo: compare StandardScaler, RobustScaler, MinMaxScaler, ... on example data)

Why do we need scaling?

  • KNN: Distances depend mainly on the features with the largest values
  • SVMs: (kernelized) dot products are also based on distances
  • Linear models: Feature scale affects the regularization
    • With scaling, the weights have similar scales and are more interpretable

(Interactive demo: effect of scaling for KNeighborsClassifier, SVC, LinearSVC, ...)
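
A rough sketch of this effect (assuming scikit-learn and its breast cancer dataset, whose features differ in scale by several orders of magnitude): compare a distance-based model with and without standardization.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Without scaling, the distances are dominated by the features with the largest values
    knn = KNeighborsClassifier()
    print("raw:    ", cross_val_score(knn, X, y, cv=5).mean())

    # With scaling, every feature contributes comparably to the distance
    scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
    print("scaled: ", cross_val_score(scaled_knn, X, y, cv=5).mean())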

Standard scaling (standardization)

  • Generally most useful, assumes data is more or less normally distributed
  • Per feature, subtract the mean value $\mu$, scale by standard deviation $\sigma$
  • New feature has $\mu=0$ and $\sigma=1$, values can still be arbitrarily large $$\mathbf{x}_{new} = \frac{\mathbf{x} - \mu}{\sigma}$$
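
A minimal sketch with scikit-learn's StandardScaler on a tiny toy array:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

    # fit() computes the per-feature mean and standard deviation,
    # transform() applies (x - mu) / sigma
    scaler = StandardScaler().fit(X)
    X_new = scaler.transform(X)
    print(X_new.mean(axis=0))  # ~0 per feature
    print(X_new.std(axis=0))   # 1 per feature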

Min-max scaling

  • Scales all features between a given $min$ and $max$ value (e.g. 0 and 1)
  • Makes sense if min/max values have meaning in your data
  • Sensitive to outliers
$$\mathbf{x}_{new} = \frac{\mathbf{x} - x_{min}}{x_{max} - x_{min}} \cdot (max - min) + min $$
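
The same idea with scikit-learn's MinMaxScaler, as a minimal sketch; note how a single outlier compresses the other values:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0,  -5.0],
                  [5.0,   0.0],
                  [9.0, 100.0]])  # 100.0 acts as an outlier in the second feature

    # feature_range sets the target interval (default (0, 1));
    # the outlier squeezes all other values of that feature towards 0
    scaler = MinMaxScaler(feature_range=(0, 1))
    print(scaler.fit_transform(X))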

Robust scaling

  • Subtracts the median, scales by the range between the quantiles $q_{25}$ and $q_{75}$ (the interquartile range)
  • New feature has median 0 and interquartile range 1
  • Similar to standard scaling, but much less sensitive to outliers
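
A minimal sketch contrasting RobustScaler with StandardScaler on data with one extreme outlier:

    import numpy as np
    from sklearn.preprocessing import RobustScaler, StandardScaler

    # One extreme outlier in an otherwise small-valued feature
    X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

    # StandardScaler's mean and standard deviation are pulled towards the outlier,
    # squashing the remaining values together
    print(StandardScaler().fit_transform(X).ravel())

    # RobustScaler's median and interquartile range are barely affected,
    # so the non-outlier values keep a sensible spread
    print(RobustScaler().fit_transform(X).ravel())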

Normalization

  • Makes sure that feature values of each point (each row) sum up to 1 (L1 norm)
    • Useful for count data (e.g. word counts in documents)
  • Can also be used with L2 norm (sum of squares is 1)
    • Useful when computing distances in high dimensions
    • For L2-normalized points, Euclidean distance is a monotone function of cosine similarity
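
A minimal sketch with scikit-learn's Normalizer on toy word counts:

    import numpy as np
    from sklearn.preprocessing import Normalizer

    # Toy word counts for two documents of very different lengths
    X = np.array([[ 3.0,  1.0, 0.0],
                  [30.0, 10.0, 0.0]])

    # L1: each row sums to 1 (relative word frequencies);
    # both documents now get the same representation
    print(Normalizer(norm='l1').fit_transform(X))

    # L2: each row has unit Euclidean length
    print(Normalizer(norm='l2').fit_transform(X))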

Maximum Absolute scaler

  • For sparse data (many features, but few are non-zero)
    • Maintain sparseness (efficient storage)
  • Scales all values so that maximum absolute value is 1
  • Similar to Min-Max scaling without changing 0 values
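
A minimal sketch with MaxAbsScaler on a sparse matrix (assuming SciPy for the sparse format):

    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import MaxAbsScaler

    # Sparse data: most entries are zero and must stay zero after scaling
    X = csr_matrix([[ 0.0, 10.0, 0.0],
                    [ 5.0,  0.0, 0.0],
                    [ 0.0, -2.0, 4.0]])

    # Each feature is divided by its maximum absolute value, so values fall in [-1, 1]
    # without centering (which would destroy the sparsity)
    print(MaxAbsScaler().fit_transform(X).toarray())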

Power transformations

  • Some features follow certain distributions
    • E.g. the number of Twitter followers is log-normally distributed
  • Box-Cox transformations map these to a normal distribution ($\lambda$ is fitted)
    • Only works for positive values, use Yeo-Johnson otherwise $$bc_{\lambda}(x) = \begin{cases} \log(x) & \lambda = 0\\ \frac{x^{\lambda}-1}{\lambda} & \lambda \neq 0 \\ \end{cases}$$
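
A minimal sketch with scikit-learn's PowerTransformer on synthetic log-normal data:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    rng = np.random.RandomState(0)
    # Log-normally distributed feature (strictly positive, heavy right tail)
    X = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))

    # Box-Cox fits lambda per feature; it requires strictly positive inputs
    pt = PowerTransformer(method='box-cox')
    X_transformed = pt.fit_transform(X)
    print(pt.lambdas_)  # fitted lambda, close to 0 for log-normal data

    # For data with zeros or negative values, use method='yeo-johnson' (the default)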

Categorical feature encoding

  • Many algorithms can only handle numeric features, so we need to encode the categorical ones
        boro  salary  vegan
0  Manhattan     103      0
1     Queens      89      0
2  Manhattan     142      0
3   Brooklyn      54      1
4   Brooklyn      63      1
5      Bronx     219      0

Ordinal encoding

  • Simply assigns an integer value to each category (e.g. in alphabetical order, as in the table below)
  • Only really useful if there exists a natural order in the categories
    • The model will consider one category to be 'higher' or 'closer' to another
        boro  boro_ordinal  salary
0  Manhattan             2     103
1     Queens             3      89
2  Manhattan             2     142
3   Brooklyn             1      54
4   Brooklyn             1      63
5      Bronx             0     219
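
A minimal sketch with scikit-learn's OrdinalEncoder on the example data (by default it assigns integers in sorted order, matching the table above):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan',
                                'Brooklyn', 'Brooklyn', 'Bronx']})

    # Categories are sorted, so: Bronx=0, Brooklyn=1, Manhattan=2, Queens=3
    enc = OrdinalEncoder()
    df['boro_ordinal'] = enc.fit_transform(df[['boro']]).ravel()
    print(df)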

One-hot encoding (dummy encoding)

  • Simply adds a new 0/1 feature for every category, having 1 (hot) if the sample has that category
  • Can explode if a feature has lots of values, causing issues with high dimensionality
  • What if test set contains a new category not seen in training data?
    • Either ignore it (just use all 0's in row), or handle manually (e.g. resample)
        boro  boro_Bronx  boro_Brooklyn  boro_Manhattan  boro_Queens  salary
0  Manhattan           0              0               1            0     103
1     Queens           0              0               0            1      89
2  Manhattan           0              0               1            0     142
3   Brooklyn           0              1               0            0      54
4   Brooklyn           0              1               0            0      63
5      Bronx           1              0               0            0     219
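
A minimal sketch with scikit-learn's OneHotEncoder (assuming scikit-learn ≥ 1.2, where the dense-output flag is called sparse_output); handle_unknown='ignore' implements the "all 0's" option for unseen categories:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan',
                                'Brooklyn', 'Brooklyn', 'Bronx']})

    # One new 0/1 column per category
    enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    onehot = enc.fit_transform(df[['boro']])
    print(pd.DataFrame(onehot, columns=enc.get_feature_names_out(['boro'])))

    # A category never seen during fit() is encoded as an all-zero row
    print(enc.transform(pd.DataFrame({'boro': ['Staten Island']})))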

Target encoding

  • Value close to 1 if category correlates with class 1, close to 0 if correlates with class 0
  • Preferred when you have lots of category values. It only creates one new feature per class
  • Blends the posterior probability of the target, $\frac{n_{iY}}{n_i}$, and the prior probability $\frac{n_Y}{n}$
    • $n_{iY}$: number of samples with category i and class Y=1, $n_{i}$: number of samples with category i
    • Blending: the weight of the prior gradually decreases as you see more samples of category i $$Enc(i) = \color{blue}{\frac{1}{1+e^{-(n_{i}-1)}} \frac{n_{iY}}{n_i}} + \color{green}{\left(1-\frac{1}{1+e^{-(n_{i}-1)}}\right) \frac{n_Y}{n}}$$
    • Same for regression, with $\frac{n_{iY}}{n_i}$: mean target value for category i, and $\frac{n_{Y}}{n}$: overall mean target value
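
A direct, minimal implementation of the blending formula above on the example data (dedicated implementations, such as category_encoders' or scikit-learn's TargetEncoder, use similar though not identical smoothing schemes):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'boro':  ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
        'vegan': [0, 0, 0, 1, 1, 0]})

    prior = df['vegan'].mean()                                  # n_Y / n
    stats = df.groupby('boro')['vegan'].agg(['mean', 'count'])  # n_iY / n_i and n_i

    # Blending weight 1 / (1 + e^-(n_i - 1)): approaches 1 as a category gets more samples
    weight = 1 / (1 + np.exp(-(stats['count'] - 1)))
    encoding = weight * stats['mean'] + (1 - weight) * prior

    df['boro_encoded'] = df['boro'].map(encoding)
    print(df)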