Mastering Categorical Encoding
Encoding categorical variables is a crucial step in preparing data for machine
learning models. This presentation will explore the different encoding
techniques, their applications, and the trade-offs to consider when choosing
the right method.
Nominal Vs. Ordinal Encoding
Nominal Categorical Variables
Variables with unordered categories, like gender or state, where the order of the categories doesn't matter.
Ordinal Categorical Variables
Variables with ordered categories, like education level, where the rank of the categories is important.
One-Hot Encoding
1 For Nominal Variables
Creates new binary columns, one for each unique category, indicating the presence or absence of each category.
2 Avoids Dummy Variable Trap
Drop one of the binary columns (for example, the last) to avoid perfect multicollinearity.
3 Downsides
High dimensionality for datasets with many categories.
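A minimal sketch of the points above, in plain Python so it runs without extra libraries. The function name `one_hot_encode`, the `drop_last` flag, and the sample data are illustrative assumptions, not part of the original slides.

```python
def one_hot_encode(values, drop_last=True):
    """One-hot encode a list of category labels.

    Returns (column_names, rows), where each row is a list of 0/1 flags.
    """
    # Sort the unique categories so the column order is deterministic.
    categories = sorted(set(values))
    # Dropping one category removes the redundant column: that category
    # is implied whenever every remaining flag is 0.
    if drop_last and len(categories) > 1:
        kept = categories[:-1]
    else:
        kept = categories
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows


# Hypothetical example data.
states = ["NY", "CA", "TX", "CA"]
columns, encoded = one_hot_encode(states)
print(columns)  # ['CA', 'NY']
print(encoded)  # [[0, 1], [1, 0], [0, 0], [1, 0]]
```

Here "TX" is the dropped category and is represented by a row of zeros. In pandas, `pd.get_dummies(..., drop_first=True)` drops the first level rather than the last; either choice removes the redundancy. The column count still grows with the number of unique categories, which is the dimensionality downside noted above.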
Label Encoding
For Ordinal Variables
Assign numerical labels to ordered categories based on their rank.
Preserves Ordering
Maintains the inherent order of the categories.
Caution
Assumes equal distance between categories, which may not always be the case.
Combination
Can be combined with other techniques like target encoding for better performance.
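As a rough illustration of label encoding for an ordinal variable, the sketch below maps each category to its rank using an explicitly declared order. The education levels, the `EDUCATION_ORDER` list, and the `label_encode` helper are hypothetical examples chosen for this sketch.

```python
# Hypothetical education levels, listed from lowest to highest rank.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]


def label_encode(values, ordered_categories):
    """Map each value to the integer rank of its category (0 = lowest)."""
    rank = {category: index
            for index, category in enumerate(ordered_categories)}
    # A KeyError here signals a value missing from the declared order.
    return [rank[value] for value in values]


education = ["Master's", "High School", "PhD", "Bachelor's"]
print(label_encode(education, EDUCATION_ORDER))  # [2, 0, 3, 1]
```

The order is supplied by hand because sorting the labels alphabetically would produce ranks unrelated to their meaning. The resulting codes 0, 1, 2, 3 are evenly spaced, which is exactly the equal-distance assumption flagged under Caution.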
Target Encoding
1 For Nominal or Ordinal Variables
Replace categories with the mean or median of the target variable for each category.
2 Captures Relationship
Exploits the relationship between the categorical variable and the target variable.
3 Caution
Prone to overfitting, so cross-validation is essential.
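One common way to apply cross-validation here is out-of-fold encoding: each row is encoded with a category mean computed only from rows in other folds, so a row's own target never leaks into its feature. The sketch below is one possible version under stated assumptions: it uses the mean (not the median), assigns folds by row index, and falls back to the training-fold mean for unseen categories. The function name and fold scheme are illustrative.

```python
def target_encode_out_of_fold(categories, targets, n_folds=2):
    """Encode each row with the target mean of its category, computed
    only from rows that belong to the other folds."""
    n = len(categories)
    if n < 2 or n_folds < 2:
        raise ValueError("need at least 2 rows and 2 folds")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Rows with index % n_folds == fold are held out; the rest
        # form the training rows for this fold.
        sums, counts = {}, {}
        train_total, train_count = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:
                c = categories[i]
                sums[c] = sums.get(c, 0.0) + targets[i]
                counts[c] = counts.get(c, 0) + 1
                train_total += targets[i]
                train_count += 1
        train_mean = train_total / train_count
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Categories unseen in the training rows fall back to
                # the overall mean of those training rows.
                encoded[i] = sums[c] / counts[c] if c in counts else train_mean
    return encoded


# Hypothetical example data.
cities = ["A", "A", "B", "B", "A", "B"]
bought = [1, 0, 1, 1, 1, 0]
print(target_encode_out_of_fold(cities, bought, n_folds=2))
```

With real data the rows would normally be shuffled before folds are assigned, and more folds (for example 5) would be typical.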
Mean Encoding
For Nominal Variables
Replace categories with the mean of the target variable for each category.
Captures Relationship
Effectively captures the relationship between the categorical variable and the target variable.
Caution
May lead to overfitting, so cross-validation and regularization are important.
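One widely used form of regularization for mean encoding is smoothing: each category mean is blended with the global mean, and categories with few rows are pulled more strongly toward the global value. The sketch below assumes an additive smoothing formula, (sum + m * global_mean) / (count + m); the `smoothing` parameter `m` and the function name are illustrative choices, not something the slides specify.

```python
def smoothed_mean_encode(categories, targets, smoothing=10.0):
    """Return a dict mapping each category to its smoothed target mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for category, target in zip(categories, targets):
        sums[category] = sums.get(category, 0.0) + target
        counts[category] = counts.get(category, 0) + 1
    # Blend the category mean with the global mean. The fewer rows a
    # category has, the more its encoding shrinks toward global_mean.
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Hypothetical example data: "blue" appears only once.
colors = ["red", "red", "red", "blue"]
clicked = [1, 1, 0, 1]
encoding = smoothed_mean_encode(colors, clicked, smoothing=2.0)
print(encoding)
```

With `smoothing=0` the function returns the plain per-category means. In this example the single "blue" row has a raw mean of 1.0, and smoothing moves it toward the global mean of 0.75.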
Ordinal Encoding for Nominal Variables
Step 1
Calculate the mean/median of the target variable for each nominal category.
Step 2
Rank the categories based on the calculated mean/median values.
Step 3
Assign numerical labels to the categories based on their rank.
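The three steps above can be sketched as follows. This version uses the mean, ranks in ascending order, and breaks ties by category name so the result is deterministic; those choices, the function name, and the sample data are assumptions made for illustration.

```python
def rank_encode_by_target(categories, targets):
    """Return a dict mapping each category to its rank by target mean."""
    # Step 1: mean of the target for each category.
    sums, counts = {}, {}
    for category, target in zip(categories, targets):
        sums[category] = sums.get(category, 0.0) + target
        counts[category] = counts.get(category, 0) + 1
    means = {category: sums[category] / counts[category]
             for category in counts}
    # Step 2: order categories by mean (ties broken by name).
    ordered = sorted(means, key=lambda category: (means[category], category))
    # Step 3: assign integer labels following that order (0 = lowest mean).
    return {category: rank for rank, category in enumerate(ordered)}


# Hypothetical example data.
states = ["NY", "CA", "TX", "CA", "NY", "TX"]
price = [10, 30, 20, 50, 20, 20]
labels = rank_encode_by_target(states, price)
print(labels)                       # {'NY': 0, 'TX': 1, 'CA': 2}
print([labels[s] for s in states])  # [0, 2, 1, 2, 0, 1]
```

Because the ranks are derived from the target, the mapping would normally be fitted on training data only and then applied to validation or test data.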
Choosing the Right Encoding
Model Performance
Select the encoding that maximizes model accuracy and generalization.
Interpretability
Consider the interpretability of the encoded features for better insights.
Dimensionality
Avoid high-dimensional encodings that can lead to the curse of dimensionality.
Overfitting
Employ cross-validation and regularization to mitigate the risk of overfitting.