Data Pre-processing
• Importing the libraries
• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Splitting the dataset into training and testing datasets
• Feature Scaling
We shall work through a practical example to better understand these concepts
Basic libraries
import numpy as np                                     # used for handling numbers
import pandas as pd                                    # used for handling the dataset
from sklearn.impute import SimpleImputer               # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder   # used for encoding categorical data
from sklearn.model_selection import train_test_split   # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler       # used for feature scaling
import matplotlib.pyplot as plt                        # used for plotting figures and graphs
pip install <package name>   # to install any package
e.g. pip install matplotlib
Importing the Dataset
# Reading the dataset
dataset = pd.read_csv('datasetname.csv')   # import the dataset into a DataFrame

# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values   # independent attributes used to predict the class
Y = dataset.iloc[:, -1].values    # dependent variable / class
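A minimal sketch of this split, assuming a hypothetical datasetname.csv with the columns Region, Age, Salary and Online Shopper (all values below are purely illustrative):

import pandas as pd

# hypothetical toy data standing in for datasetname.csv
dataset = pd.DataFrame({
    'Region':         ['India', 'Brazil', 'USA', 'Brazil'],
    'Age':            [49, 32, 35, 43],
    'Salary':         [86400, 57600, 64800, 73200],
    'Online Shopper': ['No', 'Yes', 'No', 'Yes'],
})

X = dataset.iloc[:, :-1].values   # every column except the last: Region, Age, Salary
Y = dataset.iloc[:, -1].values    # the last column only: Online Shopper
print(X.shape, Y.shape)           # (4, 3) (4,)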
Handling missing data
Ways to handle missing data (a pandas sketch of both approaches follows this list):
1. Deleting the particular row or column:
➢ Delete the specific rows or columns that contain null values.
This may lead to loss of information, compromising model accuracy.
2. Replacing with the mean:
➢ Compute the mean of the column that contains missing values and put it
in place of each missing value.
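A minimal pandas sketch of both approaches, using a hypothetical DataFrame with missing entries:

import numpy as np
import pandas as pd

# hypothetical data with a missing Age and a missing Salary
dataset = pd.DataFrame({'Age': [49, np.nan, 35], 'Salary': [86400, 57600, np.nan]})

dropped = dataset.dropna()               # approach 1: delete rows containing any null value
filled = dataset.fillna(dataset.mean())  # approach 2: replace missing values with column means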
Handling missing data …
Example
# handling the missing data: replace the np.nan entries with the mean
# of all the other values in the same column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])          # fit on the numeric columns only
X[:, 1:] = imputer.transform(X[:, 1:])   # substitute the NaNs with column means
• The missing values will be replaced by the average values of the respective columns.
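A quick illustration on hypothetical numbers (the NaN is filled with the mean of the other values in its column):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[49.0, 86400.0],
              [np.nan, 57600.0],
              [35.0, 64800.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer.fit_transform(X))   # the NaN becomes (49 + 35) / 2 = 42.0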
Handling of Categorical Data
• Categorical data takes a limited set of values (categories), such as country, gender, hair colour, or product type.
• Machine learning models work with numbers, so categorical variables can complicate model building.
• We therefore encode these categorical variables into numbers.
Handling of Categorical Data
Example
The Region variable contains three categories (India, USA and Brazil), and the Online Shopper variable contains two categories (Yes and No).
# encode categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# one-hot encode the Region column (column 0); keep the remaining columns as-is
# (OneHotEncoder's categorical_features argument was removed in recent
#  scikit-learn versions, so a ColumnTransformer is used instead)
ct = ColumnTransformer([('region', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)

# label-encode the binary class: No -> 0, Yes -> 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Output
• The Region variable is now represented by three binary (dummy) columns, one per category; OneHotEncoder orders the categories alphabetically by default, giving Brazil, India, USA. A 1 in a column indicates that the row belongs to that country, otherwise 0. For the Online Shopper variable, 1 represents Yes and 0 represents No.
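A compact sketch of the encoding on hypothetical rows (the values are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical rows: [Region, Age, Salary]
X = [['India', 49, 86400], ['Brazil', 32, 57600], ['USA', 35, 64800]]

ct = ColumnTransformer([('region', OneHotEncoder(), [0])], remainder='passthrough')
Xt = ct.fit_transform(X)
# resulting columns: Brazil | India | USA (one-hot, alphabetical), then Age, Salary
# e.g. the 'India' row becomes [0, 1, 0, 49, 86400]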
Splitting the dataset into training and testing datasets
Training set: used to train the model, i.e. to make the algorithm learn the data patterns
Test set: used to check the correctness (accuracy/efficiency) of the trained model
Example:
# splitting the dataset into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
80%: for training
20%: for testing
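A quick sanity check of the split sizes on hypothetical data; random_state=0 simply makes the shuffle reproducible:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(10, 3)   # hypothetical: 10 samples, 3 features
Y = np.arange(10)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # (8, 3) (2, 3) -> an 80/20 split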
Feature Scaling
▪ Rescale features to a common range (common ground), so that all variables are kept on the same scale and no single variable dominates the others.
Feature scaling Methods
1. Normalization
2. Standardization
Normalization
▪ Normalization (min-max scaling) rescales each feature to the range 0.0 to 1.0, retaining the values' proportions relative to each other.
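A minimal sketch of normalization with scikit-learn's MinMaxScaler (the values are illustrative):

from sklearn.preprocessing import MinMaxScaler

# hypothetical Age and Salary columns on very different scales
X = [[49, 86400], [32, 57600], [35, 64800]]

scaler = MinMaxScaler()          # rescales each column to the range [0, 1]
print(scaler.fit_transform(X))   # every column now has min 0.0 and max 1.0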
Standardization
▪ Standardization expresses each value by how many standard deviations it lies from the column mean: z = (x − mean) / standard deviation.
▪ It transforms the data such that the resulting distribution has
a mean of 0 and a standard deviation of 1.
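A minimal sketch of standardization with StandardScaler, reusing the X_train/X_test variables from the earlier split; the scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn the mean and std from the training set
X_test = sc.transform(X_test)         # apply the same transformation to the test set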