Machine Learning Concepts
•One-Hot Encoding
•Dummy Encoding
•Effect Encoding
•Binary Encoding
•BaseN Encoding
•Hash Encoding
•Target Encoding
•Ordinal Data: The categories have an inherent order
•Nominal Data: The categories do not have an inherent order
When encoding ordinal data, one should retain the information about the order of the categories. As in the example above, the highest degree a person possesses gives vital information about their qualification, and the degree is an important feature in deciding whether a person is suitable for a post.
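The point above can be sketched with pandas, using a hypothetical "degree" column and an assumed ranking of degrees:

```python
import pandas as pd

# Hypothetical data: highest degree held, an ordinal feature
df = pd.DataFrame({"degree": ["High School", "Bachelors", "Masters", "PhD", "Bachelors"]})

# Encode while preserving the inherent order of the categories:
# the assigned integer is the category's position in the ranking
order = ["High School", "Bachelors", "Masters", "PhD"]
df["degree_encoded"] = pd.Categorical(df["degree"], categories=order, ordered=True).codes

print(df["degree_encoded"].tolist())
```

Here the encoding deliberately preserves the order High School < Bachelors < Masters < PhD, which is exactly the information we want the model to see.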
When encoding nominal data, we only have to consider the presence or absence of a feature; no notion of order is present. For example, consider the city a person lives in. It is important to retain where the person lives, but there is no order or sequence among cities: whether a person lives in Delhi or Bangalore makes no ordinal difference.
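For a nominal feature like city, one-hot encoding captures presence/absence without imposing any order. A minimal sketch with pandas (the city values are illustrative):

```python
import pandas as pd

# Hypothetical nominal feature: city of residence (no inherent order)
df = pd.DataFrame({"city": ["Delhi", "Bangalore", "Delhi", "Mumbai"]})

# One binary indicator column per city; no city ranks above another
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```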
Drawbacks of One-Hot and Dummy Encoding
One-hot encoding and dummy encoding are two powerful and effective encoding
schemes. They are also very popular among data scientists, but they may not be as
effective when:
1.A large number of levels are present in the data. If a feature variable has
many categories, we need a similar number of dummy variables to encode
the data. For example, a column with 30 different values will require 30 new
variables for encoding.
2.The dataset has multiple categorical features. A similar situation occurs, and
we again end up with several binary features, each representing one
categorical feature and its multiple categories, e.g. a dataset having 10 or
more categorical columns.
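Hash encoding (listed at the top of these notes) is one way around the high-cardinality problem: the number of output columns is fixed regardless of how many levels the feature has. A self-contained sketch, not a library API:

```python
import hashlib

def hash_encode(value, n_buckets=8):
    """Map a category string to a fixed number of binary columns via hashing.
    Illustrative only; distinct categories may collide in the same bucket."""
    bucket = int(hashlib.md5(value.encode()).hexdigest(), 16) % n_buckets
    return [1 if i == bucket else 0 for i in range(n_buckets)]

# 30 distinct zip codes still yield only 8 columns (vs. 30 dummy variables)
zips = [f"5600{i:02d}" for i in range(30)]
encoded = [hash_encode(z) for z in zips]
print(len(encoded[0]))
```

The trade-off is collisions: several categories can share a bucket, which is the price paid for the fixed width.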
Machine learning algorithm basics
ML-based techniques involve several steps. First, features are extracted by computing statistics over multiple packets of a flow (such as packet lengths, flow duration, or inter-packet arrival times) [17]. Then, features are refined by feature-selection algorithms where possible.
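The feature-extraction step above can be sketched in plain Python; the flow representation and feature names below are assumptions for illustration:

```python
# Hypothetical flow: list of (timestamp_seconds, packet_length_bytes) tuples
flow = [(0.00, 60), (0.12, 1500), (0.25, 1500), (0.31, 60)]

timestamps = [t for t, _ in flow]
lengths = [length for _, length in flow]

# Per-flow features of the kind mentioned in the text
features = {
    "mean_packet_length": sum(lengths) / len(lengths),
    "flow_duration": timestamps[-1] - timestamps[0],
    "mean_inter_arrival": (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1),
}
print(features)
```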
Label Encoding is a popular encoding technique for handling categorical variables.
In this technique, each label is assigned a unique integer based on alphabetical
ordering. Because of this, there is a high probability that the model captures a
spurious ordinal relationship between countries, such as India < Japan < the US.
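The alphabetical assignment can be shown with a small pure-Python sketch that mimics what sklearn's `LabelEncoder` does:

```python
countries = ["India", "Japan", "US", "India", "Japan"]

# Label encoding assigns integers by alphabetical order of the labels
mapping = {label: code for code, label in enumerate(sorted(set(countries)))}
codes = [mapping[c] for c in countries]

print(mapping)  # India gets 0, Japan 1, US 2 - purely alphabetical
print(codes)
```

The integers 0 < 1 < 2 carry no real-world meaning here, which is exactly why a model may learn a spurious ordering.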
One-Hot Encoding is the process of creating dummy variables.
Challenges of One-Hot Encoding: Dummy Variable Trap
One-hot encoding can result in a dummy variable trap, because the outcome of one
variable can easily be predicted from the remaining variables.
The dummy variable trap is a scenario in which the dummy variables are highly
correlated with each other.
The dummy variable trap leads to the problem known as multicollinearity.
Multicollinearity occurs when there is a dependency between the independent
features. It is a serious issue in machine learning models like linear
regression and logistic regression.
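The standard remedy is to drop one dummy column, which breaks the perfect linear dependence; `pd.get_dummies` supports this directly via `drop_first` (the city values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Bangalore", "Mumbai", "Delhi"]})

# Dropping one level avoids the dummy variable trap: the dropped
# category is implied when all remaining dummy columns are 0
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded.columns.tolist())  # the first (alphabetical) level is dropped
```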
•A categorical variable has too many levels. This pulls down the performance of
the model. For example, a categorical variable "zip code" would have numerous levels.
•A categorical variable has levels which rarely occur. Many of these levels have
minimal chance of making a real impact on model fit. For example, a variable
‘disease’ might have some levels which would rarely occur.
•There is one level which always occurs, i.e. for most of the observations in the
data set there is only one level. Variables with such levels fail to make a positive
impact on model performance due to very low variation.
•If the categorical variable is masked, it becomes a laborious task to decipher its
meaning. Such situations are commonly found in data science competitions.
•You can’t fit categorical variables into a regression equation in their raw form.
They must be treated.
•Most algorithms (or ML libraries) produce better results with numerical
variables. In Python, the library "sklearn" requires features as numerical arrays.
For example, applying a random forest from sklearn to the Titanic data set (with
only the two features sex and pclass as independent variables) returns an error,
because the feature "sex" is categorical and has not been converted to numerical
form.
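The fix is to encode the categorical column before fitting. A minimal sketch using a tiny synthetic stand-in for the Titanic columns mentioned above (the data values here are made up for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the Titanic features discussed above
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 2, 3],
    "survived": [0, 1, 1, 0],
})

# sklearn needs numeric arrays, so map the categorical "sex" column first;
# fitting on the raw strings would raise an error
df["sex"] = df["sex"].map({"male": 0, "female": 1})

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(df[["sex", "pclass"]], df["survived"])
preds = clf.predict(df[["sex", "pclass"]])
print(preds)
```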