03 - Data Preprocessing

The document provides a comprehensive overview of data pre-processing techniques using Python's scikit-learn library, emphasizing the importance of data wrangling, handling missing values, and transforming categorical variables into numerical formats. It covers various methods such as imputation, feature selection, discretization, and scaling, along with practical examples using the Titanic dataset. Additionally, it highlights the necessity of applying consistent transformations to both training and testing datasets.

Data pre-processing

with Python
Data Processes

Antonio Jesús Díaz Honrubia


[email protected]

1
Data pre-processing
with scikit-learn
Even though there are other options, we are going to use the scikit-learn library for this purpose.

It is one of the most widespread libraries for data processes with Python.

2
Data pre-processing in CRISP-DM

(CRISP-DM process diagram: we are at the data preparation phase.)

3
Why is data preprocessing so important?

(Image: what we expect when working with data vs. what we get when working with data.)

4
Data wrangling
• We are going to use the Titanic dataset.
• But let’s apply what we know about “data wrangling” before.

5
Training and testing
• We need to install the scikit-learn library:
• pip install scikit-learn (the PyPI package is named scikit-learn, not sklearn)

• If we are going to train a supervised model, the first thing we need to do is divide the dataset into two folds: training and testing.
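
A minimal sketch of this split, assuming the Titanic data is available as a local titanic.csv with a Survived target column (the file name and column name are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and separate the features from the target.
df = pd.read_csv("titanic.csv")
X = df.drop(columns=["Survived"])
y = df["Survived"]

# 80% of the rows for training, 20% for testing; fixing random_state
# makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)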

6
Missing values
• The first step is to handle missing values.
• Identify variables with missing values.

• Options:
• Drop rows.
• Drop columns.
• Imputation.

7
Missing values
• We are going to use the SimpleImputer class from sklearn.
• We are going to use simple techniques:
• For numerical variables, we can impute the values with the mean.
• For categorical variables, we can impute the values with the mode.

• To process the DataFrames with sklearn, two steps are always required:
• fit: it is in charge of “fitting” to our data, i.e., it obtains the data transformation function.
• transform: it applies the function obtained in the previous step to a given DataFrame.

• Sometimes it is possible to apply both steps with a combined method called fit_transform.
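
The general pattern, sketched here with a SimpleImputer on an assumed Age column:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train[["Age"]])  # learn the transformation (here, the column mean)
X_train[["Age"]] = imputer.transform(X_train[["Age"]])  # apply it

# Equivalent shortcut that fits and transforms in one call:
X_train[["Age"]] = imputer.fit_transform(X_train[["Age"]])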

8
Missing values
• We create two imputers, each with a different strategy, and
apply them to the corresponding columns.
• Possible strategies: mean, median, most_frequent, constant.
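
A sketch of the two imputers (the column lists are assumptions about which Titanic columns need imputation):

from sklearn.impute import SimpleImputer

num_cols = ["Age", "Fare"]   # numerical columns with missing values (assumed)
cat_cols = ["Embarked"]      # categorical columns with missing values (assumed)

# Mean for numerical variables, mode for categorical ones.
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])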

9
Transformation into numerical variables
• Although there are algorithms that can work with categorical
variables encoded as text, in scikit-learn (in general) it will always
be necessary to work with numerically encoded variables.
• We are going to apply a simple transformation first.
• A categorical variable will be transformed into a numerical one by
simply assigning a value to each category.
• This can be done by using a LabelEncoder.
• However, this approach presents a major drawback:
• We are introducing an order in the categories.
• It can still be used for binary variables.

• There is another approach that solves this problem.


10
Transformation into numerical variables
• First, we need to know which features are categorical.

• Now, we create a LabelEncoder for each of the previous features.
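
A sketch of both steps (detecting categorical columns through their object dtype is an assumption about how the data is stored):

from sklearn.preprocessing import LabelEncoder

cat_cols = X_train.select_dtypes(include="object").columns
encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    # fit_transform learns the label-to-integer mapping and applies it.
    X_train[col] = le.fit_transform(X_train[col])
    encoders[col] = le  # keep the fitted encoder for future data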

11
Transformation into numerical variables
• An inverse transformation can also be applied with the inverse_transform method.

• Future data can be transformed with the same LabelEncoder (transform method instead of fit_transform).
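
For example, reusing the encoders fitted above (the Sex column name is an assumption):

# Apply the mappings learned on the training data to the test data;
# transform raises an error on labels that were unseen during fitting.
for col, le in encoders.items():
    X_test[col] = le.transform(X_test[col])

# Recover the original text labels from the numeric codes.
original_sex = encoders["Sex"].inverse_transform(X_test["Sex"])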

12
Transformation into numerical variables
• As it has been mentioned, a LabelEncoder presents the drawback
of introducing an order in the categories.
• It can still be used for feature Sex, since it only has two categories.

• This problem can be solved using a OneHotEncoder.
• This transformer creates as many columns as categories exist in the original column.
original column.
• It will place a 1 in the column that corresponds with the original
category and a 0 in the others.
• This way, all the new variables have a binary nature.

13
Transformation into numerical variables
• We need to consider that the result after the transformation
will contain several columns.
• We need to accommodate them in the DataFrame.
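
A sketch with an assumed Embarked column (the sparse_output parameter requires scikit-learn 1.2 or newer; older versions use sparse=False instead):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = ohe.fit_transform(X_train[["Embarked"]])

# Wrap the resulting array in a DataFrame with readable column names,
# then replace the original column with the new binary ones.
encoded_df = pd.DataFrame(encoded,
                          columns=ohe.get_feature_names_out(["Embarked"]),
                          index=X_train.index)
X_train = pd.concat([X_train.drop(columns=["Embarked"]), encoded_df], axis=1)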

14
Feature selection
• We are going to perform a statistical feature selection.
• We can select the k best or a percentile of them.
• Methods: the χ² test or mutual information.

• We are going to try first with SelectKBest.
• The transform method returns an array and the feature names cannot be recovered.
• It is better to filter the columns that are selected.
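
A sketch with the χ² criterion (k=4 is an arbitrary choice; note that χ² requires non-negative feature values):

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=4)
selector.fit(X_train, y_train)

# get_support() returns a boolean mask, so we can keep the column names
# instead of losing them in the unlabeled array returned by transform().
selected = X_train.columns[selector.get_support()]
X_train = X_train[selected]
X_test = X_test[selected]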

15
Feature selection
• What would have happened if we had transformed the Sex column with a OneHotEncoder?

• Both columns have the same information and, therefore, they are “equally important”.
• The importance of that feature has been doubled.

16
Feature selection
• Using a SelectPercentile (keeping 40% of the original features) and the mutual information criterion.
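
A sketch of this variant, with the same filter-by-name trick as above:

from sklearn.feature_selection import SelectPercentile, mutual_info_classif

selector = SelectPercentile(score_func=mutual_info_classif, percentile=40)
selector.fit(X_train, y_train)
selected = X_train.columns[selector.get_support()]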

17
Feature discretization
• Contrary to what we have done before, some algorithms
work better with discrete features.
• We can use a KBinsDiscretizer to discretize the range of a
feature so that each bin contains the same number of samples.
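
A sketch discretizing an assumed Age column (strategy="quantile" yields equal-frequency bins; 5 bins is an arbitrary choice):

from sklearn.preprocessing import KBinsDiscretizer

disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train[["Age"]] = disc.fit_transform(X_train[["Age"]])
X_test[["Age"]] = disc.transform(X_test[["Age"]])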

18
Feature discretization
• The same process can be done for feature Fare.

19
Feature scaling
• Some algorithms work better when features are on the same scale.
• We can use a MinMaxScaler or a StandardScaler.

• With MinMaxScaler the range is transformed into the interval [0, 1].
• Another interval can be specified with the parameter feature_range.
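
A sketch rescaling the numerical columns (the num_cols list is an assumption, as above):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # e.g. MinMaxScaler(feature_range=(-1, 1)) for another interval
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])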

20
Feature scaling
• With StandardScaler a standardization is performed (based on a Normal distribution).
• The mean is subtracted from each value and, then, the result is divided by the standard deviation: z = (x - μ) / σ.
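
The corresponding sketch, using the same assumed num_cols list:

from sklearn.preprocessing import StandardScaler

# z = (x - mean) / std, with mean and std learned from the training fold.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])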

21
Transforming the test data
• The same transformations must be done to the test dataset.
• We must not fit new transformers; the ones fitted on the training data must be reused.
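
In short, every object fitted in the previous sections is reused through transform() only (names taken from the earlier sketches):

# fit/fit_transform were called on the training fold only;
# the test fold always goes through transform().
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])
X_test[["Age"]] = disc.transform(X_test[["Age"]])
X_test[num_cols] = scaler.transform(X_test[num_cols])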

22
Other preprocessing tasks
• We have covered the most common preprocessing tasks.
• However, the number of preprocessing tasks is nearly limitless.

• Some examples:
• Data integration from different sources.
• Advanced outlier detection.
• Derivation of new features.
• Transformation of existing features.
• Treatment of duplicates.
• Treatment of inconsistencies.
• Dimensionality reduction.
• Treatment of imbalanced data.
23
