Data Pre-processing
• Importing the libraries
• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Splitting the dataset into training and testing datasets
• Feature Scaling
We shall work through a practical example to better understand these concepts
Basic libraries
import numpy as np                                     # used for handling numbers
import pandas as pd                                    # used for handling the dataset
from sklearn.impute import SimpleImputer               # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder   # used for encoding categorical data
from sklearn.model_selection import train_test_split   # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler       # used for feature scaling
import matplotlib.pyplot as plt                        # used for plotting figures and graphs
pip install <package name>   # to install any package
e.g. pip install matplotlib
Importing the Dataset
# Reading the dataset
dataset = pd.read_csv('datasetname.csv')   # import the dataset into a DataFrame

# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values   # independent attributes used to predict the class
Y = dataset.iloc[:, -1].values    # dependent variable / class
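A minimal sketch of this split, assuming a hypothetical datasetname.csv with the columns Region, Age, Salary and Online Shopper (all values below are purely illustrative):

import pandas as pd

# hypothetical toy data standing in for datasetname.csv
dataset = pd.DataFrame({
    'Region':         ['India', 'Brazil', 'USA', 'Brazil'],
    'Age':            [49, 32, 35, 43],
    'Salary':         [86400, 57600, 64800, 73200],
    'Online Shopper': ['No', 'Yes', 'No', 'Yes'],
})

X = dataset.iloc[:, :-1].values   # every column except the last: Region, Age, Salary
Y = dataset.iloc[:, -1].values    # the last column only: Online Shopper
print(X.shape, Y.shape)           # (4, 3) (4,)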
Handling missing data
Ways to handle missing data (a pandas sketch of both approaches follows this list):
1. Deleting the particular row or column:
➢ Delete the specific rows or columns that contain null values.
This may lead to loss of information, compromising model accuracy.
2. Replacing with the mean:
➢ Compute the mean of the column that contains missing values and put it
in place of each missing value.
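A minimal pandas sketch of both approaches, using a hypothetical DataFrame with missing entries:

import numpy as np
import pandas as pd

# hypothetical data with a missing Age and a missing Salary
dataset = pd.DataFrame({'Age': [49, np.nan, 35], 'Salary': [86400, 57600, np.nan]})

dropped = dataset.dropna()               # approach 1: delete rows containing any null value
filled = dataset.fillna(dataset.mean())  # approach 2: replace missing values with column means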
Handling missing data …
Example
# handling the missing data: replace the np.nan entries with the mean
# of all the other values in the same column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])          # fit on the numeric columns only
X[:, 1:] = imputer.transform(X[:, 1:])   # substitute the NaNs with column means
• The missing values will be replaced by the average values of the respective columns.
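A quick illustration on hypothetical numbers (the NaN is filled with the mean of the other values in its column):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[49.0, 86400.0],
              [np.nan, 57600.0],
              [35.0, 64800.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer.fit_transform(X))   # the NaN becomes (49 + 35) / 2 = 42.0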
Handling of Categorical Data
• Categorical data takes a limited set of values (categories), such as country, gender, hair colour, or product type.
• Machine learning models work with numbers, so categorical variables can complicate model building.
• We therefore encode these categorical variables into numbers.
Handling of Categorical Data
Example
The Region variable contains three categories (India, USA and Brazil), and the Online Shopper variable contains two categories (Yes and No).
# encode categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# one-hot encode the Region column (column 0); keep the remaining columns as-is
# (OneHotEncoder's categorical_features argument was removed in recent
#  scikit-learn versions, so a ColumnTransformer is used instead)
ct = ColumnTransformer([('region', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)

# label-encode the binary class: No -> 0, Yes -> 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Output
• The Region variable is now represented by three binary (dummy) columns, one per category; OneHotEncoder orders the categories alphabetically by default, giving Brazil, India, USA. A 1 in a column indicates that the row belongs to that country, otherwise 0. For the Online Shopper variable, 1 represents Yes and 0 represents No.
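A compact sketch of the encoding on hypothetical rows (the values are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical rows: [Region, Age, Salary]
X = [['India', 49, 86400], ['Brazil', 32, 57600], ['USA', 35, 64800]]

ct = ColumnTransformer([('region', OneHotEncoder(), [0])], remainder='passthrough')
Xt = ct.fit_transform(X)
# resulting columns: Brazil | India | USA (one-hot, alphabetical), then Age, Salary
# e.g. the 'India' row becomes [0, 1, 0, 49, 86400]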
Splitting the dataset into training and testing datasets
Training set: used to train the model, i.e. to make the algorithm learn the data patterns
Test set: used to check the correctness (accuracy/efficiency) of the trained model
Example:
# splitting the dataset into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
80%: for training
20%: for testing
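A quick sanity check of the split sizes on hypothetical data; random_state=0 simply makes the shuffle reproducible:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(10, 3)   # hypothetical: 10 samples, 3 features
Y = np.arange(10)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # (8, 3) (2, 3) -> an 80/20 split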
Feature Scaling
▪ Rescale features to a common range (common ground), so that all variables are kept on the same scale and no single variable dominates the others.
Feature scaling Methods
1. Normalization
2. Standardization
Normalization
▪ Normalization (min-max scaling) rescales each feature to the range 0.0 to 1.0, retaining the values' proportions relative to each other.
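A minimal sketch of normalization with scikit-learn's MinMaxScaler (the values are illustrative):

from sklearn.preprocessing import MinMaxScaler

# hypothetical Age and Salary columns on very different scales
X = [[49, 86400], [32, 57600], [35, 64800]]

scaler = MinMaxScaler()          # rescales each column to the range [0, 1]
print(scaler.fit_transform(X))   # every column now has min 0.0 and max 1.0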
Standardization
▪ Standardization expresses each value by how many standard deviations it lies from the column mean: z = (x − mean) / standard deviation.
▪ It transforms the data such that the resulting distribution has
a mean of 0 and a standard deviation of 1.
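A minimal sketch of standardization with StandardScaler, reusing the X_train/X_test variables from the earlier split; the scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn the mean and std from the training set
X_test = sc.transform(X_test)         # apply the same transformation to the test set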