Lec 2 Data Preprocessing
Data Pre-processing

• Importing the libraries


• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Splitting the dataset into training and testing datasets
• Feature Scaling

We shall have a practical example to better understand the concepts


Basic libraries
• import numpy as np # for handling numbers and arrays
• import pandas as pd # for handling the dataset
• from sklearn.impute import SimpleImputer # for handling missing data
• from sklearn.preprocessing import LabelEncoder, OneHotEncoder # for encoding categorical data
• from sklearn.model_selection import train_test_split # for splitting training and testing data
• from sklearn.preprocessing import StandardScaler # for feature scaling
• import matplotlib.pyplot as plt # for plotting figures and graphs

pip install <package name> # to install any package

e.g. pip install matplotlib
Importing the Dataset

# Reading the dataset


• dataset = pd.read_csv('datasetname.csv') # to import the dataset into a variable

# Splitting the attributes into independent and dependent attributes

• X = dataset.iloc[:, :-1].values # attributes to determine dependent variable / Class


• Y = dataset.iloc[:, -1].values # dependent variable / Class
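The two steps above can be run end to end. A minimal runnable sketch follows; the CSV content here is a made-up stand-in for 'datasetname.csv' (column names Region, Age, Salary, Online Shopper are assumed for illustration):

```python
from io import StringIO
import pandas as pd

# Hypothetical CSV text standing in for 'datasetname.csv'
csv_text = """Region,Age,Salary,Online Shopper
India,49,86400,No
Brazil,32,57600,Yes
USA,35,64800,No
Brazil,43,73200,No
"""

# Reading the dataset
dataset = pd.read_csv(StringIO(csv_text))

# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values  # all columns except the last
Y = dataset.iloc[:, -1].values   # last column: dependent variable / class
print(X.shape, Y.shape)          # (4, 3) (4,)
```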
Handling missing data
Ways to handle missing data:

1. Deleting the particular row or column:

➢ Delete the specific row or column which contains null values

This may lead to loss of information, compromising the model accuracy

2. Calculating the mean:

➢ Compute the mean of the column that contains the missing value and
put it in place of the missing value
Handling missing data …
Example

• # handling the missing data: replace nan values from numpy with the mean of all the other values in the column

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])

• The missing values will be replaced by the average values of the respective columns.
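As a minimal runnable sketch of the imputation step, using a small made-up numeric matrix (the values are illustrative, not from a real dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries marked as np.nan
X = np.array([[49.0, 86400.0],
              [32.0, np.nan],
              [np.nan, 64800.0],
              [43.0, 73200.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)  # fit learns column means; transform fills the gaps
print(X)
```

Each nan is replaced by the mean of the non-missing values in its own column: column 0's gap becomes (49 + 32 + 43) / 3 and column 1's gap becomes (86400 + 64800 + 73200) / 3 = 74800.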
Handling of Categorical Data

• Data which has some categories, such as country, gender, hair color, or product type.
• Machine learning models deal with numbers.
• Categorical variables may complicate the model-building procedure.
• So we encode these categorical variables into numbers.
Handling of Categorical Data
Example
The Region variable contains three categories: India, USA and Brazil. The Online
Shopper variable contains two categories: Yes and No.
# encode categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
ct = ColumnTransformer([('region', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

Note: the older OneHotEncoder(categorical_features=[0]) argument has been removed from scikit-learn; ColumnTransformer now selects the column to encode, and OneHotEncoder accepts string categories directly, so a prior LabelEncoder pass over X is no longer needed.

Output
• The Region variable is now represented by three binary (one-hot) columns, one per
category in alphabetical order: Brazil, India, USA. A 1 in a column means the row
belongs to that country, otherwise 0. For the Online Shopper variable, 1 represents
Yes and 0 represents No.
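A minimal runnable sketch of the encoding with toy data (the rows are made up; note that current scikit-learn's OneHotEncoder handles string columns directly, and ColumnTransformer replaces the removed categorical_features argument):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data: Region, Age, Salary / Online Shopper
X = np.array([['India', 49, 86400],
              ['Brazil', 32, 57600],
              ['USA', 35, 64800]], dtype=object)
Y = np.array(['No', 'Yes', 'No'])

# One-hot encode column 0; pass the numeric columns through unchanged
ct = ColumnTransformer([('region', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = ct.fit_transform(X)
print(X)  # columns: Brazil, India, USA (alphabetical), then Age, Salary

# Encode Yes/No as 1/0
Y = LabelEncoder().fit_transform(Y)
print(Y)  # [0 1 0]
```

The first row ('India') becomes 0, 1, 0 in the three one-hot columns because OneHotEncoder orders categories alphabetically.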
Splitting the dataset into training and testing datasets

Training set: used to make (train) the algorithm (model) learn the
data patterns

Test set: used to check the correctness (accuracy/efficiency) of the
algorithm

Example:

# splitting the dataset into training set and test set


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

80%: for training


20%: for testing
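A minimal runnable sketch of the split, using a made-up array of 10 samples so the 80/20 proportions are easy to see:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features each, with matching labels
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# test_size=0.2 keeps 20% for testing; random_state=0 makes the shuffle reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Fixing random_state means every run produces the same partition, which keeps experiments comparable.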
Feature Scaling

▪ Rescale features to a common range (common ground).

Features (variables) are kept in the same range and on the
same scale so that no single variable dominates the others.

Feature scaling Methods

1. Normalization

2. Standardization
Normalization

▪ Normalization scales each feature to between 0.0 and 1.0,
retaining the proportions between the values.
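A minimal runnable sketch of normalization using scikit-learn's MinMaxScaler on a made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature column
X = np.array([[20.0], [30.0], [40.0], [60.0]])

# MinMaxScaler applies (x - min) / (max - min) per column
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.ravel())  # 0.0, 0.25, 0.5, 1.0
```

The minimum (20) maps to 0.0 and the maximum (60) maps to 1.0; intermediate values keep their relative spacing.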
Standardization

▪ Standardization expresses each value as its distance from the
mean, measured in units of standard deviation.

▪ It transforms the data such that the resulting distribution has
a mean of 0 and a standard deviation of 1.
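A minimal runnable sketch of standardization using scikit-learn's StandardScaler on a made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column with mean 3
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# StandardScaler applies (x - mean) / std per column
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and 1.0
```

After scaling, the column has mean 0 and standard deviation 1, so features with very different original magnitudes (e.g. age vs salary) end up on the same scale.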
