Data Preprocessing in Machine Learning

The document provides an overview of data in machine learning, emphasizing its importance for model training and performance. It discusses different forms of data, the significance of data preprocessing, and outlines the steps involved in preparing data for machine learning, including handling missing values, encoding categorical data, and feature scaling. Additionally, it highlights the need to split datasets into training and testing sets for effective model evaluation.

Uploaded by

Manya Raghuwani
Copyright
© All Rights Reserved

Data in Machine Learning

• Data refers to the set of observations or measurements that can be used to train a
machine-learning model.
• The quality and quantity of data available for training and testing play a significant
role in determining the performance of a machine-learning model.
• Data can take various forms, such as numerical, categorical, or time-series data,
and can come from various sources, such as databases, spreadsheets, or APIs.
• Machine learning algorithms use data to learn patterns and relationships between
input variables and target outputs, which can then be used for prediction or
classification tasks.
• Data is typically divided into two types:
• Labeled data
• Unlabeled data
Forms of Data

• Numeric Data: If a feature represents a characteristic measured in
numbers, it is called a numeric feature.
• Categorical Data: A categorical feature is an attribute that can take
on one of a limited, and usually fixed, number of possible values on
the basis of some qualitative property. A categorical feature is also
called a nominal feature.
• Ordinal Data: This denotes a nominal variable whose categories fall
in an ordered list. Examples include clothing sizes such as small,
medium, and large, or a measurement of customer satisfaction on a
scale from “not at all happy” to “very happy”.
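As a minimal sketch of how an ordinal feature differs from a purely nominal one, the clothing-size example above can be mapped to integer ranks that preserve the order. The sample values and the helper name `encode_ordinal` are assumptions made for illustration only.

```python
# The size order comes from the text; the sample values are hypothetical.
SIZE_ORDER = ["small", "medium", "large"]
SIZE_RANK = {label: rank for rank, label in enumerate(SIZE_ORDER)}


def encode_ordinal(values, ranks):
    """Map each ordinal label to its integer rank, preserving the order."""
    return [ranks[value] for value in values]


sizes = ["medium", "small", "large", "small"]
encoded = encode_ordinal(sizes, SIZE_RANK)
print(encoded)  # [1, 0, 2, 0]
```

Because the ranks follow the category order, comparisons such as "small < large" remain meaningful after encoding, which would not hold for an arbitrary numbering of nominal categories.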
Data Preprocessing in Machine Learning
• Data preprocessing is the process of preparing raw data and making
it suitable for a machine learning model.
• Real-world data generally contains noise and missing values, and
may be in a format that machine learning models cannot use directly.
• Data preprocessing can improve the accuracy and efficiency of a machine
learning model.
Steps of Data Preprocessing
• Getting the dataset
• Importing libraries
• Importing datasets
• Handling Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
1) Get the Dataset
• To create a machine learning model, the first thing we require is a
dataset, because a machine learning model works entirely on data.
• The data collected for a particular problem, in a proper format, is
known as the dataset.
• Datasets are most often available as comma-separated values (CSV) files.
• CSV is a file format that allows us to save tabular data, such as
spreadsheets.
2) Importing Libraries

• We need to import some predefined Python libraries, such as:

• NumPy
• Matplotlib
• pandas

• Example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
3) Importing the Datasets
Reading the data into a DataFrame:
data_set = pd.read_csv('[Link]')

Extracting the independent variables:
x = data_set.iloc[:, :-1].values

Extracting the dependent variable:
y = data_set.iloc[:, 3].values
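The extraction step can be tried without a file on disk. The sketch below is an illustration under stated assumptions: the in-memory CSV text, the Age and Salary columns, and all row values are hypothetical stand-ins for the real dataset (only Country and Purchased are named in this deck). It selects the last column with index -1, which is equivalent to index 3 when the table has exactly four columns.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real dataset file.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
"""

data_set = pd.read_csv(io.StringIO(csv_text))

# Independent variables: every column except the last one.
x = data_set.iloc[:, :-1].values
# Dependent variable: the last column only.
y = data_set.iloc[:, -1].values

print(x.shape)  # (3, 3)
print(list(y))  # ['No', 'Yes', 'No']
```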
4) Handling Missing data:

• If our dataset contains missing data, it may cause serious problems for our
machine learning model.
• It is therefore necessary to handle the missing values present in the dataset.
• There are mainly two ways to handle missing data:
• By deleting the particular row
• By replacing the missing value with the mean of its column

• Example
import numpy as np
from sklearn.impute import SimpleImputer
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fitting the imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
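To make the two strategies concrete, here is a plain-Python sketch that applies both to a tiny table. The rows, the column meaning (age, salary), and the helper names `drop_incomplete` and `impute_mean` are hypothetical; this illustrates the idea and is not the scikit-learn implementation.

```python
import math
from statistics import mean

# Hypothetical rows of (age, salary); NaN marks a missing value.
rows = [
    [38.0, 68000.0],
    [43.0, math.nan],
    [math.nan, 54000.0],
    [30.0, 52000.0],
]


def drop_incomplete(table):
    """Strategy 1: delete every row that contains a missing value."""
    return [row for row in table if not any(math.isnan(v) for v in row)]


def impute_mean(table):
    """Strategy 2: replace each missing value with the mean of its column."""
    n_cols = len(table[0])
    col_means = [
        mean(row[c] for row in table if not math.isnan(row[c]))
        for c in range(n_cols)
    ]
    return [
        [col_means[c] if math.isnan(row[c]) else row[c] for c in range(n_cols)]
        for row in table
    ]


dropped = drop_incomplete(rows)
imputed = impute_mean(rows)

print(len(dropped))   # 2
print(imputed[1][1])  # 58000.0
print(imputed[2][0])  # 37.0
```

Deleting rows is simple but discards half of this small table, whereas mean imputation keeps every row at the cost of inserting estimated values.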
5) Encoding Categorical data:

• Categorical data is data that falls into categories. In our
dataset there are two categorical variables: Country and Purchased.
• A machine learning model works entirely on mathematics and
numbers.
• If the dataset has a categorical variable, it may cause trouble
while building the model.
• So it is necessary to encode these categorical variables into numbers.
# Categorical data for the Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
The three country values are encoded as 0, 1, and 2.
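The mapping from categories to integers can be sketched in plain Python. The example assigns codes in sorted order, which is also how LabelEncoder orders its classes; the country names and the helper `fit_labels` are hypothetical.

```python
def fit_labels(values):
    """Assign each distinct category an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Hypothetical Country column.
countries = ["India", "France", "Germany", "France", "India"]

mapping = fit_labels(countries)
codes = [mapping[country] for country in countries]

print(mapping)  # {'France': 0, 'Germany': 1, 'India': 2}
print(codes)    # [2, 0, 1, 0, 2]
```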
Dummy variables and OneHotEncoder
• The machine learning model may assume that the label-encoded values
have some order or correlation, which can produce wrong output.
To remove this issue, we use dummy encoding.
• Dummy variables are variables that take only the values 0 or 1.
# For the Country variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Encoding for dummy variables
# (the categorical_features argument was removed in scikit-learn 0.22; ColumnTransformer selects column 0 instead)
column_transformer = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = column_transformer.fit_transform(x)
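What dummy encoding produces can be shown with a short plain-Python sketch: one 0/1 column per category, with exactly one 1 in each row. The country values and the helper `one_hot` are hypothetical and only illustrate the idea.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 dummy row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


# Hypothetical Country column.
countries = ["India", "France", "Germany", "France"]

categories, dummies = one_hot(countries)

print(categories)  # ['France', 'Germany', 'India']
for row in dummies:
    print(row)
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]
# [1, 0, 0]
```

Unlike the integer codes 0, 1, and 2, these columns do not suggest that one country is "greater" than another.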
6) Splitting the Dataset

• We divide our dataset into a training set and a test set.
• Training set: a subset of the dataset used to train the machine learning model,
for which we already know the output.
• Test set: a subset of the dataset used to test the machine learning model; the
model predicts the output for the test set, and the predictions are compared with the known values.
• Example:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2)


where:
test_size specifies the size of the test set. For example, test_size may be 0.5, 0.3, or 0.2,
which gives the fraction of the dataset placed in the test set.
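The idea behind the split can be sketched in plain Python: shuffle the row indices, then cut them at the requested fraction. The function `split_train_test`, its `seed` parameter, and the toy data are assumptions for illustration; this is not how scikit-learn implements train_test_split.

```python
import random


def split_train_test(x, y, test_size=0.2, seed=0):
    """Shuffle the row indices and split them into train and test parts."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = max(1, round(len(x) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: ten rows where each target is ten times its feature.
x = list(range(10))
y = [value * 10 for value in x]

x_train, x_test, y_train, y_test = split_train_test(x, y, test_size=0.2)
print(len(x_train), len(x_test))  # 8 2
```

Each feature row stays paired with its target after shuffling, and no row appears in both sets.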
7) Feature Scaling

• It is a technique to standardize the independent variables of the
dataset to a specific range.
• We put our variables in the same range and on the same scale
so that no variable dominates the others.
• Many machine learning models are based on Euclidean distance, and if we
do not scale the variables, features with large values can dominate the
distance.
• Euclidean distance between a and b: $d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
• There are two ways to perform feature scaling in machine learning:
• Standardization
• Normalization
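The sketch below illustrates why scaling matters for Euclidean distance and what the two methods compute. It assumes the common definitions: standardization subtracts the mean and divides by the (population) standard deviation, and normalization here means min-max scaling to the range 0 to 1. The age and salary values and the helper names are hypothetical.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def standardize(column):
    """Standardization: (value - mean) / standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]


def normalize(column):
    """Min-max normalization: (value - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical features on very different scales.
ages = [25.0, 35.0, 45.0]
salaries = [40000.0, 60000.0, 50000.0]

# Unscaled: the salary difference swamps the age difference.
raw = euclidean((ages[0], salaries[0]), (ages[1], salaries[1]))

# Scaled: both features contribute on a comparable scale.
ages_n, salaries_n = normalize(ages), normalize(salaries)
scaled = euclidean((ages_n[0], salaries_n[0]), (ages_n[1], salaries_n[1]))

print(normalize(ages))  # [0.0, 0.5, 1.0]
print(raw > 20000, scaled < 2)  # True True
```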
Example of feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
# Fit the scaler on the training set only, then reuse it for the test set
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
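The example above calls fit_transform on the training set but only transform on the test set. A minimal plain-Python sketch of that pattern follows; the class `SimpleStandardScaler` and the numbers are hypothetical stand-ins for a single feature column, not the scikit-learn implementation.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Hypothetical one-column stand-in for a standard scaler."""

    def fit(self, column):
        # Learn the statistics from the training data only.
        self.mean_ = mean(column)
        self.scale_ = pstdev(column)
        return self

    def transform(self, column):
        # Apply the already-learned statistics to any data.
        return [(value - self.mean_) / self.scale_ for value in column]


train = [10.0, 20.0, 30.0]
test = [40.0]

scaler = SimpleStandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_)  # 20.0
print(test_scaled[0] > max(train_scaled))  # True
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, so the test set still behaves like unseen data during evaluation.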
