Data pre-processing
with Python
Data Processes
Antonio Jesús Díaz Honrubia
[email protected]
1
Data pre-processing
with scikit-learn
Even though there are other options, we are going to
use the scikit-learn library for this purpose.
It is one of the most widespread libraries for data processing with Python.
2
Data pre-processing in CRISP-DM
[CRISP-DM process diagram; a “We are here” marker points at the data preparation phase]
3
Why is data preprocessing so important?
[Figure: “what we expect when working with data” vs. “what we get when working with data”]
4
Data wrangling
• We are going to use the Titanic dataset.
• But let’s first apply what we know about “data wrangling”.
5
Training and testing
• We need to install the scikit-learn library:
• pip install scikit-learn
• If we are going to train a supervised model, the first thing we need to do is split the dataset into two subsets: training and testing.
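• A minimal sketch of this split, assuming the Titanic data is read from a CSV file and that the target column is named Survived (the file name and column name are assumptions):
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Titanic data (the file name is an assumption).
df = pd.read_csv("titanic.csv")

# Separate the features from the target column (assumed to be "Survived").
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)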
6
Missing values
• The first step is to impute missing values.
• Identify variables with missing values.
• Options:
• Drop rows.
• Drop columns.
• Imputation.
7
Missing values
• We are going to use the SimpleImputer class from sklearn.
• We are going to use simple techniques:
• For numerical variables, we can impute the values with the mean.
• For categorical variables, we can impute the values with the mode.
• To process the DataFrames with sklearn, two steps are always
required:
• fit: it is in charge of “fitting” to our data, i.e., it obtains the data
transformation function.
• transform: it applies the function obtained in the previous step to a
given DataFrame.
• Sometimes it is possible to apply both steps with a combined method called fit_transform.
8
Missing values
• We create two imputers, each with a different strategy, and
apply them to the corresponding columns.
• Possible strategies: mean, median, most_frequent, constant.
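• A minimal sketch, assuming Age (numerical) and Embarked (categorical) are the columns with missing values in the training fold X_train from the previous split:
from sklearn.impute import SimpleImputer

num_cols = ["Age"]        # numerical columns with missing values (assumption)
cat_cols = ["Embarked"]   # categorical columns with missing values (assumption)

# One imputer per type of variable: mean for numerical, mode for categorical.
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit learns the statistics on the training data and transform applies them;
# fit_transform performs both steps at once.
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])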
9
Transformation into numerical variables
• Although there are algorithms that can work with categorical variables encoded as text, in scikit-learn it is generally necessary to work with numerically encoded variables.
• We are going to apply a simple transformation first.
• A categorical variable will be transformed into a numerical one by
simply assigning a value to each category.
• This can be done by using a LabelEncoder.
• However, this approach presents a major drawback:
• We are introducing an order in the categories.
• It can still be used for binary variables.
• There is another approach that solves this problem.
10
Transformation into numerical variables
• First, we need to know which features are categorical.
• Now, we create a LabelEncoder for each of the previous
features.
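• A minimal sketch (one encoder per column, kept in a dictionary so the same mapping can be reused later; it assumes missing values have already been imputed):
from sklearn.preprocessing import LabelEncoder

# Categorical features are the columns that still have the object (text) dtype.
cat_features = X_train.select_dtypes(include="object").columns

encoders = {}
for col in cat_features:
    le = LabelEncoder()
    # fit_transform learns the category-to-integer mapping and applies it.
    X_train[col] = le.fit_transform(X_train[col])
    encoders[col] = le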
11
Transformation into numerical variables
• An inverse transformation can also be applied with the
inverse_transform method.
• Future data can be transformed with the same
LabelEncoder (transform method instead of
fit_transform).
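• For example, reusing the encoders dictionary from the previous sketch (the Sex column is just an illustration):
# Recover the original text categories from the encoded values.
original_sex = encoders["Sex"].inverse_transform(X_train["Sex"])

# Encode future/test data with the already fitted encoder: transform, not fit_transform.
X_test["Sex"] = encoders["Sex"].transform(X_test["Sex"])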
12
Transformation into numerical variables
• As it has been mentioned, a LabelEncoder presents the drawback
of introducing an order in the categories.
• It can still be used for feature Sex, since it only has two categories.
• This problem can be solved using a OneHotEncoder.
• This transformer creates as many columns as categories exist in the
original column.
• It will place a 1 in the column that corresponds with the original
category and a 0 in the others.
• This way, all the new variables have a binary nature.
13
Transformation into numerical variables
• We need to consider that the result after the transformation
will contain several columns.
• We need to accommodate them in the DataFrame.
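• A minimal sketch for the Embarked column (the column choice is an assumption; sparse_output is the parameter name in recent scikit-learn versions, and the transformer works the same way whether the column still holds text or the integers produced by a LabelEncoder):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One binary column per category; unknown categories in future data become all zeros.
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = ohe.fit_transform(X_train[["Embarked"]])

# Put the new columns into a DataFrame and replace the original column with them.
encoded_df = pd.DataFrame(
    encoded,
    columns=ohe.get_feature_names_out(["Embarked"]),
    index=X_train.index,
)
X_train = pd.concat([X_train.drop(columns=["Embarked"]), encoded_df], axis=1)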
14
Feature selection
• We are going to perform a statistical feature selection.
• We can select the k best or a percentile of them.
• Methods: χ² test or mutual information.
• We are going to try first with SelectKBest.
• The transform method returns an array and the feature
names cannot be recovered.
• It is better to filter the columns that are selected.
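• A minimal sketch with the χ² criterion, keeping k = 5 features (the value of k is an assumption; χ² requires non-negative feature values, which holds after the encodings above):
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X_train, y_train)

# Instead of selector.transform(X_train), which returns a plain array,
# keep only the selected columns so the feature names are preserved.
selected_cols = X_train.columns[selector.get_support()]
X_train = X_train[selected_cols]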
15
Feature selection
• What would have happened if we had transformed the Sex column with a OneHotEncoder?
• The two resulting columns would carry exactly the same information and would therefore be “equally important”.
• The importance of that feature would effectively be doubled in the selection.
16
Feature selection
• Using a SelectPercentile (keeping 40% of the original features) and the mutual information criterion.
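• A minimal sketch:
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Keep the top 40% of features according to their mutual information with the target.
selector = SelectPercentile(score_func=mutual_info_classif, percentile=40)
selector.fit(X_train, y_train)
X_train = X_train[X_train.columns[selector.get_support()]]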
17
Feature discretization
• Contrary to what we have done before, some algorithms
work better with discrete features.
• We can use a KBinsDiscretizer to discretize the range of a
feature so that each bin contains the same number of samples.
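• A minimal sketch for the Age column (the column and the number of bins are assumptions); the quantile strategy is what puts the same number of samples in each bin:
from sklearn.preprocessing import KBinsDiscretizer

# 5 equal-frequency bins, encoded as the ordinal integers 0..4.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train[["Age"]] = disc.fit_transform(X_train[["Age"]])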
18
Feature discretization
• The same process can be done for feature Fare:
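• For instance (a sketch using a separate discretizer so the Age bin edges are not reused):
from sklearn.preprocessing import KBinsDiscretizer

fare_disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train[["Fare"]] = fare_disc.fit_transform(X_train[["Fare"]])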
19
Feature scaling
• Some algorithms work better when all features are on the same scale.
• We can use a MinMaxScaler or a StandardScaler.
• With MinMaxScaler the range is transformed into the interval [0, 1].
• Another interval can be specified with the parameter feature_range.
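• A minimal sketch for the Fare column (the column choice is an assumption):
from sklearn.preprocessing import MinMaxScaler

# Default range is [0, 1]; e.g. MinMaxScaler(feature_range=(-1, 1)) for another interval.
minmax_scaler = MinMaxScaler()
X_train[["Fare"]] = minmax_scaler.fit_transform(X_train[["Fare"]])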
20
Feature scaling
• With StandardScaler a standardization is performed (based on a Normal distribution).
• The mean is subtracted from each value, which is then divided by the standard deviation.
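• A minimal sketch, again on the Fare column:
from sklearn.preprocessing import StandardScaler

# For each column: subtract the mean and divide by the standard deviation,
# both computed on the training data.
std_scaler = StandardScaler()
X_train[["Fare"]] = std_scaler.fit_transform(X_train[["Fare"]])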
21
Transforming the test data
• The same transformations must be done to the test dataset.
• We must not fit new models, the existing ones need to be used.
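• A minimal sketch reusing some of the transformers fitted above (only a few are shown; every other fitted transformer is applied in the same way):
# Only transform here: the statistics learned on the training data are reused as-is.
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])
X_test[["Age"]] = disc.transform(X_test[["Age"]])
X_test[["Fare"]] = std_scaler.transform(X_test[["Fare"]])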
22
Other preprocessing tasks
• We have covered the most common preprocessing tasks.
• However, the number of preprocessing tasks is nearly limitless.
• Some examples:
• Data integration from different sources.
• Advanced outlier detection.
• Derivation of new features.
• Transformation of existing features.
• Treatment of duplicates.
• Treatment of inconsistencies.
• Dimensionality reduction.
• Treatment of imbalanced data.
23