Data Transformation
Data transformation helps convert raw datasets into
usable, uniform formats for improved analysis and
insights. Answering these interview questions effectively
requires a solid understanding of how and when different
methods are implemented.
What to expect
Example questions include:
• Explain how scaling and normalization affect the
distribution and scale of the data.
• When would you use Box-Cox transformation over
other types of transformations?
• When can one-hot encoding be a problem?
This lesson will discuss:
• Scaling, standardization, and normalization
• Transformation
• Encoding categorical variables
For each topic, we’ll provide a brief description and list
commonly used methods.
Scaling, standardization, and normalization
Scaling, standardization, and normalization are data
preprocessing techniques used to rescale and transform
the features of a dataset to a common scale.
Scaling
Scaling rescales the features to a specific range, such as
[0, 1] or [-1, 1]. Scaling ensures that all features contribute
equally to the analysis and prevents features with larger
magnitudes from dominating the model.
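As a minimal sketch, scaling to [0, 1] can be done with
scikit-learn's MinMaxScaler (the small array below is made
up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Rescale each feature (column) to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # each column now spans [0, 1]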
Standardization
Standardization transforms the features to have a mean of
0 and a standard deviation of 1. It does not change the
shape of the feature distribution, but putting features on
a common scale helps many algorithms converge faster and
perform better.
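A minimal sketch with scikit-learn's StandardScaler, using
made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 0.1],
              [20.0, 0.2],
              [30.0, 0.9]])

# Shift each feature to mean 0 and rescale to standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # [1, 1]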
Normalization
Normalization rescales feature vectors to have unit norm
(for example, an L2 norm of 1) and does not necessarily
constrain the feature values to a specific range.
Normalization is particularly useful when the feature
distribution is not Gaussian and the data has varying
scales.
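A minimal sketch with scikit-learn's Normalizer, which
rescales each sample (row) to unit L2 norm; the values are
made up:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# Rescale each row so its L2 norm equals 1
X_norm = Normalizer(norm="l2").fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # [1.0, 1.0]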
Transformation
Data transformation involves converting the original data
into a different format or representation to make it more
suitable for analysis or modeling. The list below describes
common types of transformations and their typical
applications; a short code sketch follows the list.
• Logarithmic: takes the logarithm of the original data
values. It is useful for reducing the skewness of data
distributions and making them more symmetrical.
Commonly applied to data with highly skewed
distributions, such as financial data or counts of events.
• Square root: takes the square root of the original data
values. It is effective for reducing the variance of data
distributions and stabilizing the variance across
different levels of the data. Often used for count data or
data with right-skewed distributions.
• Box-Cox: a family of power transformations that
includes both logarithmic and square root
transformations as special cases. It optimizes the
transformation parameter lambda (λ) to find the best fit
for the data. Particularly useful when the appropriate
transformation is not obvious or when the data
distribution is highly skewed.
• Z-score: transforms the data so that it has a mean of 0
and a standard deviation of 1. It is useful for
standardizing the scale of features and ensuring that
they contribute equally. Commonly used in statistical
analysis and machine learning algorithms.
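These transformations can be sketched with NumPy and
SciPy; the right-skewed sample below is made up, and
scipy.stats.boxcox fits lambda by maximum likelihood
(Box-Cox requires strictly positive values):

import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0, 100.0])

log_data = np.log(data)            # logarithmic transformation
sqrt_data = np.sqrt(data)          # square root transformation
bc_data, lam = stats.boxcox(data)  # Box-Cox with fitted lambda

print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(data):.2f}, after: {stats.skew(bc_data):.2f}")

The Z-score transformation is the same standardization
performed by StandardScaler in the earlier example.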
Encoding categorical variables
Encoding categorical variables involves converting
categorical data, which represents categories or labels,
into numerical representations that can be used in
machine learning algorithms.
Categorical variables can be of two types: ordinal and
nominal.
Ordinal variables have a natural order or ranking among
their categories. For example, a variable representing
educational attainment might have categories like ‘High
School Diploma’, ‘Bachelor's Degree’, and ‘Master's
Degree’, which have a clear order from lowest to highest.
Nominal variables do not have a natural order or ranking
among their categories. For example, a variable
representing colors might have categories like ‘Red’,
‘Blue’, etc., which do not have a meaningful order.
Common techniques for encoding include:
• Label encoding: assigns a unique integer to each
category of the categorical variable. This is suitable
for ordinal variables, but should be used with caution
for nominal variables, as it may inadvertently
introduce order where none exists.
• One-hot encoding: creates binary dummy variables
for each category of the categorical variable. Each
category is represented by a column, and a value of 1
indicates the presence of that category, while a value
of 0 indicates its absence. One-hot encoding is
suitable for both ordinal and nominal variables and
avoids the issue of introducing unintended order.
• Dummy encoding: similar to one-hot encoding but
creates n−1 dummy variables for a variable with n
categories. This helps avoid multicollinearity issues in
regression models while still capturing all the necessary
information. All three techniques are sketched in the
code below.
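As a minimal sketch, all three techniques can be applied
with pandas (the column names and categories below are
illustrative):

import pandas as pd

df = pd.DataFrame({
    "education": ["High School Diploma", "Bachelor's Degree", "Master's Degree"],
    "color": ["Red", "Blue", "Red"],
})

# Label encoding for the ordinal variable: map categories to ranked integers
order = {"High School Diploma": 0, "Bachelor's Degree": 1, "Master's Degree": 2}
df["education_encoded"] = df["education"].map(order)

# One-hot encoding for the nominal variable: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: drop the first category, leaving n-1 columns
dummy = pd.get_dummies(df["color"], prefix="color", drop_first=True)

print(one_hot)
print(dummy)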