Data pre-processing
with Python
Data Processes
Antonio Jesús Díaz Honrubia
[email protected]
1
Data pre-processing
with scikit-learn
Even though there are other options, we are going to
use the scikit-learn library for this purpose.
It is one of the most widespread libraries for data processing with Python.
2
Data pre-processing in CRISP-DM
[CRISP-DM process diagram; a “We are here” marker points at the data preparation phase]
3
Why is data preprocessing so important?
[Figure: “what we expect when working with data” vs. “what we get when working with data”]
4
Data wrangling
• We are going to use the Titanic dataset.
• But let’s first apply what we know about “data wrangling”.
5
Training and testing
• We need to install the scikit-learn library:
• pip install scikit-learn
• If we are going to train a supervised model, the first thing we need to do is split the dataset into two subsets: training and testing.
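• A minimal sketch of this split, assuming the Titanic data is read from a CSV file and that the target column is named Survived (the file name and column name are assumptions):
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Titanic data (the file name is an assumption).
df = pd.read_csv("titanic.csv")

# Separate the features from the target column (assumed to be "Survived").
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)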
6
Missing values
• The first step is to impute missing values.
• Identify variables with missing values.
• Options:
• Drop rows.
• Drop columns.
• Imputation.
7
Missing values
• We are going to use the SimpleImputer class from sklearn.
• We are going to use simple techniques:
• For numerical variables, we can impute the values with the mean.
• For categorical variables, we can impute the values with the mode.
• To process the DataFrames with sklearn, two steps are always
required:
• fit: it is in charge of “fitting” to our data, i.e., it obtains the data
transformation function.
• transform: it applies the function obtained in the previous step to a
given DataFrame.
• Sometimes it is possible to apply both steps with a combined method called fit_transform.
8
Missing values
• We create two imputers, each with a different strategy, and
apply them to the corresponding columns.
• Possible strategies: mean, median, most_frequent, constant.
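• A minimal sketch, assuming Age (numerical) and Embarked (categorical) are the columns with missing values in the training fold X_train from the previous split:
from sklearn.impute import SimpleImputer

num_cols = ["Age"]        # numerical columns with missing values (assumption)
cat_cols = ["Embarked"]   # categorical columns with missing values (assumption)

# One imputer per type of variable: mean for numerical, mode for categorical.
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit learns the statistics on the training data and transform applies them;
# fit_transform performs both steps at once.
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])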
9
Transformation into numerical variables
• Although there are algorithms that can work with categorical variables encoded as text, in scikit-learn it is generally necessary to work with numerically encoded variables.
• We are going to apply a simple transformation first.
• A categorical variable will be transformed into a numerical one by
simply assigning a value to each category.
• This can be done by using a LabelEncoder.
• However, this approach presents a major drawback:
• We are introducing an order in the categories.
• It can still be used for binary variables.
• There is another approach that solves this problem.
10
Transformation into numerical variables
• First, we need to know which features are categorical.
• Now, we create a LabelEncoder for each of the previous
features.
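• A minimal sketch (one encoder per column, kept in a dictionary so the same mapping can be reused later; it assumes missing values have already been imputed):
from sklearn.preprocessing import LabelEncoder

# Categorical features are the columns that still have the object (text) dtype.
cat_features = X_train.select_dtypes(include="object").columns

encoders = {}
for col in cat_features:
    le = LabelEncoder()
    # fit_transform learns the category-to-integer mapping and applies it.
    X_train[col] = le.fit_transform(X_train[col])
    encoders[col] = le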
11
Transformation into numerical variables
• An inverse transformation can also be applied with the
inverse_transform method.
• Future data can be transformed with the same
LabelEncoder (transform method instead of
fit_transform).
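• For example, reusing the encoders dictionary from the previous sketch (the Sex column is just an illustration):
# Recover the original text categories from the encoded values.
original_sex = encoders["Sex"].inverse_transform(X_train["Sex"])

# Encode future/test data with the already fitted encoder: transform, not fit_transform.
X_test["Sex"] = encoders["Sex"].transform(X_test["Sex"])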
12
Transformation into numerical variables
• As it has been mentioned, a LabelEncoder presents the drawback
of introducing an order in the categories.
• It can still be used for feature Sex, since it only has two categories.
• This problem can be solved using a OneHotEncoder.
• This transformer creates as many columns as categories exist in the
original column.
• It will place a 1 in the column that corresponds with the original
category and a 0 in the others.
• This way, all the new variables have a binary nature.
13
Transformation into numerical variables
• We need to consider that the result after the transformation
will contain several columns.
• We need to accommodate them in the DataFrame.
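• A minimal sketch for the Embarked column (the column choice is an assumption; sparse_output is the parameter name in recent scikit-learn versions, and the transformer works the same way whether the column still holds text or the integers produced by a LabelEncoder):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One binary column per category; unknown categories in future data become all zeros.
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = ohe.fit_transform(X_train[["Embarked"]])

# Put the new columns into a DataFrame and replace the original column with them.
encoded_df = pd.DataFrame(
    encoded,
    columns=ohe.get_feature_names_out(["Embarked"]),
    index=X_train.index,
)
X_train = pd.concat([X_train.drop(columns=["Embarked"]), encoded_df], axis=1)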
14
Feature selection
• We are going to perform a statistical feature selection.
• We can select the k best or a percentile of them.
• Methods: χ² test or mutual information.
• We are going to try first with SelectKBest.
• The transform method returns an array and the feature
names cannot be recovered.
• It is better to filter the columns that are selected.
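• A minimal sketch with the χ² criterion, keeping k = 5 features (the value of k is an assumption; χ² requires non-negative feature values, which holds after the encodings above):
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X_train, y_train)

# Instead of selector.transform(X_train), which returns a plain array,
# keep only the selected columns so the feature names are preserved.
selected_cols = X_train.columns[selector.get_support()]
X_train = X_train[selected_cols]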
15
Feature selection
• What would have happened if we had transformed the Sex column with a OneHotEncoder?
• The two resulting columns would carry exactly the same information and would therefore be “equally important”.
• The importance of that feature would effectively be doubled in the selection.
16
Feature selection
• Using a SelectPercentile (keeping 40% of the original features) and the mutual information criterion.
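• A minimal sketch:
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Keep the top 40% of features according to their mutual information with the target.
selector = SelectPercentile(score_func=mutual_info_classif, percentile=40)
selector.fit(X_train, y_train)
X_train = X_train[X_train.columns[selector.get_support()]]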
17
Feature discretization
• Contrary to what we have done before, some algorithms
work better with discrete features.
• We can use a KBinsDiscretizer to discretize the range of a
feature so that each bin contains the same number of samples.
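• A minimal sketch for the Age column (the column and the number of bins are assumptions); the quantile strategy is what puts the same number of samples in each bin:
from sklearn.preprocessing import KBinsDiscretizer

# 5 equal-frequency bins, encoded as the ordinal integers 0..4.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train[["Age"]] = disc.fit_transform(X_train[["Age"]])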
18
Feature discretization
• The same process can be done for feature Fare:
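• For instance (a sketch using a separate discretizer so the Age bin edges are not reused):
from sklearn.preprocessing import KBinsDiscretizer

fare_disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train[["Fare"]] = fare_disc.fit_transform(X_train[["Fare"]])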
19
Feature scaling
• Some algorithms work better when all features are on the same scale.
• We can use a MinMaxScaler or a StandardScaler.
• With MinMaxScaler the range is transformed into the interval [0, 1].
• Another interval can be specified with the parameter feature_range.
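• A minimal sketch for the Fare column (the column choice is an assumption):
from sklearn.preprocessing import MinMaxScaler

# Default range is [0, 1]; e.g. MinMaxScaler(feature_range=(-1, 1)) for another interval.
minmax_scaler = MinMaxScaler()
X_train[["Fare"]] = minmax_scaler.fit_transform(X_train[["Fare"]])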
20
Feature scaling
• With StandardScaler a standardization is performed (based on a Normal distribution).
• The mean is subtracted from each value, which is then divided by the standard deviation.
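• A minimal sketch, again on the Fare column:
from sklearn.preprocessing import StandardScaler

# For each column: subtract the mean and divide by the standard deviation,
# both computed on the training data.
std_scaler = StandardScaler()
X_train[["Fare"]] = std_scaler.fit_transform(X_train[["Fare"]])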
21
Transforming the test data
• The same transformations must be done to the test dataset.
• We must not fit new models, the existing ones need to be used.
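• A minimal sketch reusing some of the transformers fitted above (only a few are shown; every other fitted transformer is applied in the same way):
# Only transform here: the statistics learned on the training data are reused as-is.
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])
X_test[["Age"]] = disc.transform(X_test[["Age"]])
X_test[["Fare"]] = std_scaler.transform(X_test[["Fare"]])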
22
Other preprocessing tasks
• We have covered the most common preprocessing tasks.
• However, the number of preprocessing tasks is nearly limitless.
• Some examples:
• Data integration from different sources.
• Advanced outlier detection.
• Derivation of new features.
• Transformation of existing features.
• Treatment of duplicates.
• Treatment of inconsistencies.
• Dimensionality reduction.
• Treatment of imbalanced data.
23