Machine Learning Concepts
•One-Hot Encoding
•Dummy Encoding
•Effect Encoding
•Binary Encoding
•BaseN Encoding
•Hash Encoding
•Target Encoding
•Ordinal Data: The categories have an inherent order
•Nominal Data: The categories do not have an inherent order
When encoding ordinal data, one should retain the information about the order of the categories. As in the example above, the highest degree a person possesses gives vital information about their qualification, and the degree is an important feature in deciding whether a person is suitable for a post.
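The point above can be sketched with pandas, using a hypothetical "degree" column and an assumed ranking of degrees:

```python
import pandas as pd

# Hypothetical data: highest degree held, an ordinal feature
df = pd.DataFrame({"degree": ["High School", "Bachelors", "Masters", "PhD", "Bachelors"]})

# Encode while preserving the inherent order of the categories:
# the assigned integer is the category's position in the ranking
order = ["High School", "Bachelors", "Masters", "PhD"]
df["degree_encoded"] = pd.Categorical(df["degree"], categories=order, ordered=True).codes

print(df["degree_encoded"].tolist())
```

Here the encoding deliberately preserves the order High School < Bachelors < Masters < PhD, which is exactly the information we want the model to see.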
When encoding nominal data, we only have to consider the presence or absence of a feature; no notion of order is present. For example, consider the city a person lives in. It is important to retain where the person lives, but there is no order or sequence among cities: whether a person lives in Delhi or Bangalore makes no ordinal difference.
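For a nominal feature like city, one-hot encoding captures presence/absence without imposing any order. A minimal sketch with pandas (the city values are illustrative):

```python
import pandas as pd

# Hypothetical nominal feature: city of residence (no inherent order)
df = pd.DataFrame({"city": ["Delhi", "Bangalore", "Delhi", "Mumbai"]})

# One binary indicator column per city; no city ranks above another
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```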
Drawbacks of One-Hot and Dummy Encoding
One-hot encoding and dummy encoding are two powerful and effective encoding
schemes. They are also very popular among data scientists, but they may not be as
effective when:
1.A large number of levels are present in the data. If a feature variable has
many categories, we need a similar number of dummy variables to encode
the data. For example, a column with 30 different values will require 30 new
variables for encoding.
2.The dataset has multiple categorical features. A similar situation occurs, and
we again end up with several binary features, each representing one
categorical feature and its multiple categories, e.g. a dataset having 10 or
more categorical columns.
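Hash encoding (listed at the top of these notes) is one way around the high-cardinality problem: the number of output columns is fixed regardless of how many levels the feature has. A self-contained sketch, not a library API:

```python
import hashlib

def hash_encode(value, n_buckets=8):
    """Map a category string to a fixed number of binary columns via hashing.
    Illustrative only; distinct categories may collide in the same bucket."""
    bucket = int(hashlib.md5(value.encode()).hexdigest(), 16) % n_buckets
    return [1 if i == bucket else 0 for i in range(n_buckets)]

# 30 distinct zip codes still yield only 8 columns (vs. 30 dummy variables)
zips = [f"5600{i:02d}" for i in range(30)]
encoded = [hash_encode(z) for z in zips]
print(len(encoded[0]))
```

The trade-off is collisions: several categories can share a bucket, which is the price paid for the fixed width.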
Machine learning algorithm basics
ML-based techniques involve several steps. First, features are extracted by computing statistics over multiple packets of a flow (such as packet lengths, flow duration, or inter-packet arrival times) [17]. Then, features are refined by feature-selection algorithms where possible.
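The feature-extraction step above can be sketched in plain Python; the flow representation and feature names below are assumptions for illustration:

```python
# Hypothetical flow: list of (timestamp_seconds, packet_length_bytes) tuples
flow = [(0.00, 60), (0.12, 1500), (0.25, 1500), (0.31, 60)]

timestamps = [t for t, _ in flow]
lengths = [length for _, length in flow]

# Per-flow features of the kind mentioned in the text
features = {
    "mean_packet_length": sum(lengths) / len(lengths),
    "flow_duration": timestamps[-1] - timestamps[0],
    "mean_inter_arrival": (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1),
}
print(features)
```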
Label Encoding is a popular encoding technique for handling categorical variables.
In this technique, each label is assigned a unique integer based on alphabetical
ordering. Because of this, there is a high probability that the model captures a
spurious ordinal relationship between countries, such as India < Japan < the US.
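The alphabetical assignment can be shown with a small pure-Python sketch that mimics what sklearn's `LabelEncoder` does:

```python
countries = ["India", "Japan", "US", "India", "Japan"]

# Label encoding assigns integers by alphabetical order of the labels
mapping = {label: code for code, label in enumerate(sorted(set(countries)))}
codes = [mapping[c] for c in countries]

print(mapping)  # India gets 0, Japan 1, US 2 - purely alphabetical
print(codes)
```

The integers 0 < 1 < 2 carry no real-world meaning here, which is exactly why a model may learn a spurious ordering.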
One-Hot Encoding is the process of creating dummy variables.
Challenges of One-Hot Encoding: Dummy Variable Trap
One-hot encoding can result in a dummy variable trap, because the outcome of one
variable can easily be predicted from the remaining variables.
The dummy variable trap is a scenario in which the dummy variables are highly
correlated with each other.
The dummy variable trap leads to the problem known as multicollinearity.
Multicollinearity occurs when there is a dependency between the independent
features. It is a serious issue in machine learning models like linear
regression and logistic regression.
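The standard remedy is to drop one dummy column, which breaks the perfect linear dependence; `pd.get_dummies` supports this directly via `drop_first` (the city values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Bangalore", "Mumbai", "Delhi"]})

# Dropping one level avoids the dummy variable trap: the dropped
# category is implied when all remaining dummy columns are 0
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded.columns.tolist())  # the first (alphabetical) level is dropped
```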
•A categorical variable has too many levels. This pulls down the performance of
the model. For example, a categorical variable "zip code" would have numerous levels.
•A categorical variable has levels which rarely occur. Many of these levels have
minimal chance of making a real impact on model fit. For example, a variable
‘disease’ might have some levels which would rarely occur.
•There is one level which always occurs, i.e. for most of the observations in the
data set there is only one level. Variables with such levels fail to make a positive
impact on model performance due to very low variation.
•If the categorical variable is masked, it becomes a laborious task to decipher its
meaning. Such situations are commonly found in data science competitions.
•You can’t fit categorical variables into a regression equation in their raw form.
They must be treated.
•Most algorithms (or ML libraries) produce better results with numerical
variables. In Python, the library "sklearn" requires features as numerical arrays.
For example, applying a random forest from sklearn to the Titanic data set (with
only the two features sex and pclass as independent variables) returns an error,
because the feature "sex" is categorical and has not been converted to numerical
form.
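The fix is to encode the categorical column before fitting. A minimal sketch using a tiny synthetic stand-in for the Titanic columns mentioned above (the data values here are made up for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the Titanic features discussed above
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 2, 3],
    "survived": [0, 1, 1, 0],
})

# sklearn needs numeric arrays, so map the categorical "sex" column first;
# fitting on the raw strings would raise an error
df["sex"] = df["sex"].map({"male": 0, "female": 1})

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(df[["sex", "pclass"]], df["survived"])
preds = clf.predict(df[["sex", "pclass"]])
print(preds)
```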