Data Preprocessing in Machine Learning

The document provides an overview of data in machine learning, emphasizing its importance for model training and performance. It discusses different forms of data, the significance of data preprocessing, and outlines the steps involved in preparing data for machine learning, including handling missing values, encoding categorical data, and feature scaling. Additionally, it highlights the need to split datasets into training and testing sets for effective model evaluation.

Uploaded by

Manya Raghuwani

Data in Machine Learning

• Data refers to the set of observations or measurements that can be used to train a
machine-learning model.
• The quality and quantity of data available for training and testing play a significant
role in determining the performance of a machine-learning model.
• Data can take various forms, such as numerical, categorical, or time-series data,
and can come from various sources, such as databases, spreadsheets, or APIs.
• Machine learning algorithms use data to learn patterns and relationships between
input variables and target outputs, which can then be used for prediction or
classification tasks.
• Data is typically divided into two types:
• Labeled data
• Unlabeled data
Forms of Data

• Numeric Data: If a feature represents a characteristic measured in
numbers, it is called a numeric feature.
• Categorical Data: A categorical feature is an attribute that can take
on one of a limited, and usually fixed, number of possible values on
the basis of some qualitative property. A categorical feature is also
called a nominal feature.
• Ordinal Data: This denotes a nominal variable whose categories fall
in an ordered list. Examples include clothing sizes such as small,
medium, and large, or a measurement of customer satisfaction on a
scale from “not at all happy” to “very happy”.
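
To make the three forms concrete, here is a minimal sketch of how they could be represented in pandas. The column names and values are hypothetical and not taken from the slides; the point is only that an ordinal feature carries an order, while a nominal one does not.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47],                        # numeric
    "country": ["India", "France", "Germany"],  # categorical (nominal)
    "size": ["small", "large", "medium"],       # ordinal
})

# Nominal: categories with no order between them.
df["country"] = df["country"].astype("category")

# Ordinal: categories with an explicit order.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_type)

# Ordered categories support comparisons such as "larger than small".
larger_than_small = df["size"] > "small"

print(df.dtypes)
print(larger_than_small.tolist())
```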
Data Preprocessing in Machine Learning
• Data preprocessing is the process of preparing raw data and making
it suitable for a machine learning model.
• Real-world data generally contains noise and missing values, and
may be in a format that cannot be used directly by machine learning
models.
• Data preprocessing can improve the accuracy and efficiency of a
machine learning model.
Steps of Data Preprocessing
• Getting the dataset
• Importing libraries
• Importing datasets
• Handling missing data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
1) Get the Dataset
• To create a machine learning model, the first thing we require is a
dataset, because a machine learning model works entirely on data.
• The data collected for a particular problem, in a proper format, is
known as the dataset.
• Datasets are most often available as comma-separated values (CSV) files.
• CSV is a file format that lets us save tabular data, such as
spreadsheets.
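
As a small illustration of what a CSV dataset looks like, the sketch below parses CSV text held in a string rather than a file on disk. The Country and Purchased columns are mentioned later in the slides; the Age and Salary columns and all of the values are assumptions made for this example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
# The empty Salary field in the last row is read as a missing value.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,,No
"""

data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.shape)         # (rows, columns)
print(data_set.isna().sum())  # number of missing values per column
```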
2) Importing Libraries

• We need to import some predefined Python libraries, such as:

• NumPy
• Matplotlib
• pandas

• Example:
import matplotlib.pyplot as plt
import numpy as nm
import pandas as pd
3) Importing the Datasets
Reading the data into a DataFrame:
data_set = pd.read_csv('[Link]')

Extracting the independent variables:

x = data_set.iloc[:, :-1].values

Extracting the dependent variable:

y = data_set.iloc[:, 3].values
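
The extraction above takes column index 3 as the dependent variable, which matches the last column only when the dataset has exactly four columns. A minimal sketch on a hypothetical four-column table (the Age and Salary names and all values are assumptions) shows the two forms agreeing, and that index -1 selects the last column regardless of width.

```python
import pandas as pd

# Hypothetical four-column dataset; the last column is the target.
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38, 43, 30],
    "Salary": [68000, 45000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

x = data_set.iloc[:, :-1].values  # every column except the last
y = data_set.iloc[:, -1].values   # the last column, whatever its index

print(x.shape, y.shape)
```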
4) Handling Missing Data

• If our dataset contains missing data, it can cause serious problems for our
machine learning model.
• It is therefore necessary to handle the missing values present in the dataset.
• There are mainly two ways to handle missing data:
By deleting the particular row
By calculating the mean and using it to replace the missing value

• Example
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
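
The two strategies listed above can be compared side by side. This is a minimal sketch on hypothetical Age and Salary values, assuming a recent scikit-learn where `SimpleImputer` lives in `sklearn.impute`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "Age": [38.0, np.nan, 30.0],
    "Salary": [68000.0, 45000.0, np.nan],
})

# Option 1: delete every row that contains a missing value.
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(df)

print(len(dropped))  # rows that survive deletion
print(filled)
```

Deleting rows is simple but discards data (here only one of three rows survives), whereas mean imputation keeps every row at the cost of introducing estimated values.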
5) Encoding Categorical Data

• Categorical data is data that falls into categories. In our
dataset, there are two categorical variables: Country and Purchased.
• A machine learning model works entirely on mathematics and
numbers.
• If the dataset contains a categorical variable, it may cause trouble
while building the model.
• So it is necessary to encode these categorical variables into numbers.
#Categorical data for Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
The country values are encoded as 0, 1, and 2.
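
The slides encode only Country, but Purchased is also categorical. A minimal sketch of label-encoding it is shown here; the values are hypothetical. `LabelEncoder` assigns integer codes in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values of the Purchased column.
purchased = ["No", "Yes", "No", "Yes"]

label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(purchased)

print(list(label_encoder_y.classes_))  # classes in sorted order
print(y.tolist())                      # integer code of each value

# The mapping can be reversed to recover the original labels.
restored = label_encoder_y.inverse_transform(y)
```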
Dummy Variables and OneHotEncoder
• The machine learning model may treat label-encoded integers as
ordered quantities (for example, 2 > 1 > 0), which can produce the
wrong output. To avoid this issue, we use dummy encoding.
• Dummy variables are variables that take the value 0 or 1.
#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#Encoding for dummy variables: column 0 (Country) is one-hot encoded, the other columns pass through
onehot_encoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = onehot_encoder.fit_transform(x)
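
To show what dummy variables look like, here is a minimal sketch that one-hot encodes a hypothetical Country column on its own. Each country becomes one 0/1 column, ordered by sorted category name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical Country column as a 2-D array (one feature, four rows).
countries = np.array([["India"], ["France"], ["Germany"], ["France"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
dummies = encoder.fit_transform(countries).toarray()

print(list(encoder.categories_[0]))  # column order of the dummy variables
print(dummies)
```

Every row contains exactly one 1, so no country is numerically "larger" than another.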
6) Splitting the Dataset

• We divide our dataset into a training set and a test set.

• Training set: A subset of the dataset used to train the machine learning
model; its outputs are already known.
• Test set: A subset of the dataset used to test the machine learning model;
the model predicts the output for the test set.
• Example:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

where:
test_size specifies the size of the test set. It may be 0.5, 0.3, or 0.2, which
gives the fraction of the dataset held out for testing.
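
A minimal sketch of the split on hypothetical data follows. The `random_state` argument is an addition not present in the slides; it fixes the shuffle so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one target per sample.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
```

Rows of `x` and entries of `y` are shuffled together, so each sample stays paired with its own target.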
7) Feature Scaling

• It is a technique to bring the independent variables of the
dataset into a specific range.
• We put our variables in the same range and on the same scale
so that no variable dominates another.
• Many machine learning models (for example, k-nearest neighbors and
k-means) are based on Euclidean distance; if we do not scale the
variables, features with larger magnitudes dominate the distance.
• Euclidean distance between a and b: $d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
• There are two ways to perform feature scaling in machine learning:
Standardization
Normalization
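
The following sketch illustrates both methods with plain NumPy on hypothetical age and salary values. It assumes the usual definitions: standardization subtracts the column mean and divides by the column standard deviation, and normalization (min-max scaling) maps each column to the range 0 to 1.

```python
import numpy as np

# Hypothetical two-feature data: age (years) and salary.
x = np.array([[25.0, 40000.0],
              [35.0, 60000.0],
              [45.0, 80000.0]])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Unscaled: the salary difference swamps the age difference.
raw_distance = euclidean(x[0], x[1])

# Standardization: zero mean and unit standard deviation per column.
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Normalization (min-max): each column rescaled to the range [0, 1].
normalized = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

scaled_distance = euclidean(standardized[0], standardized[1])

print(raw_distance, scaled_distance)
```

After scaling, both features contribute equally to the distance between the first two rows.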
Example of Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
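
Note that the example calls `fit_transform` on the training set but only `transform` on the test set. A minimal sketch on hypothetical one-feature data shows the consequence: the test values are scaled with the mean and standard deviation learned from the training set, so no information from the test set leaks into the scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature training and test sets.
x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[4.0]])

st_x = StandardScaler()
x_train_scaled = st_x.fit_transform(x_train)  # learns mean and std from the training data
x_test_scaled = st_x.transform(x_test)        # reuses the training mean and std

print(st_x.mean_, st_x.scale_)
print(x_test_scaled)
```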
