Module 4: Data Preprocessing


The following tutorial contains Python examples for data preprocessing. You should refer to the "Data" chapter of the "Introduction to Data
Mining" book (slides are available at https://www-users.cs.umn.edu/~kumar001/dmbook/index.php) to understand some of the concepts
introduced in this tutorial. Data preprocessing consists of a broad set of techniques for cleaning, selecting, and transforming data to
improve data mining analysis. Read the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and
press the SHIFT-ENTER keys simultaneously.

Data Quality Issues


Poor data quality can have an adverse effect on data mining. Common data quality issues include noise, outliers, missing values, and duplicate data. This section presents examples of Python code to alleviate some of these data quality problems. We begin with an example dataset from the UCI machine learning repository containing information about breast cancer patients. We will first download the dataset using the Pandas read_csv() function and display its first 5 data points.

Code:

import pandas as pd
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                   header=None)  # the raw file has no header row
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

data = data.drop(['Sample code'], axis=1)

print('Number of instances = %d' % (data.shape[0]))
print('Number of attributes = %d' % (data.shape[1]))
data.head()

Number of instances = 699
Number of attributes = 10

   Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
   Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
0  5          1             1              1         2              1       3          1         1        2
1  5          4             4              5         7              10      3          2         1        2
2  3          1             1              1         2              2       3          1         1        2
3  6          8             8              1         3              4       3          7         1        2


Missing Values


It is not unusual for an object to be missing one or more attribute values. In some cases the information was not collected, while in other cases some attributes are inapplicable to the data instances. This section presents examples of the different approaches for handling missing values.

According to the description of the data (https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)), the missing values are encoded as '?' in the original data. Our first task is to convert the missing values to NaNs. We can then count the number of missing values in each column of the data.

Code:

import numpy as np

data = data.replace('?', np.nan)

print('Number of instances = %d' % (data.shape[0]))


print('Number of attributes = %d' % (data.shape[1]))

print('Number of missing values:')


for col in data.columns:
    print('\t%s: %d' % (col, data[col].isna().sum()))

Number of instances = 699


Number of attributes = 10
Number of missing values:
Clump Thickness: 0
Uniformity of Cell Size: 0
Uniformity of Cell Shape: 0
Marginal Adhesion: 0
Single Epithelial Cell Size: 0
Bare Nuclei: 16
Bland Chromatin: 0
Normal Nucleoli: 0
Mitoses: 0
Class: 0

Observe that only the 'Bare Nuclei' column contains missing values. In the following example, the missing values in the 'Bare Nuclei'
column are replaced by the median value of that column. The values before and after replacement are shown for a subset of the data
points.

Code:

# Convert the 'Bare Nuclei' column to numeric before calculating the median.
data2 = pd.to_numeric(data['Bare Nuclei'], errors='coerce')

print('Before replacing missing values:')
print(data2[20:25])

data2 = data2.fillna(data2.median())

print('\nAfter replacing missing values:')
print(data2[20:25])

Before replacing missing values:


20 10.0
21 7.0
22 1.0
23 NaN
24 1.0
Name: Bare Nuclei, dtype: float64

After replacing missing values:


20 10.0
21 7.0
22 1.0
23 1.0
24 1.0
Name: Bare Nuclei, dtype: float64

Instead of replacing the missing values, another common approach is to discard the data points that contain missing values. This can be
easily accomplished by applying the dropna() function to the data frame.

Code:

print('Number of rows in original data = %d' % (data.shape[0]))

data2 = data.dropna()
print('Number of rows after discarding missing values = %d' % (data2.shape[0]))

Number of rows in original data = 699


Number of rows after discarding missing values = 683
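
dropna() also accepts a subset argument when only particular columns should be checked for missing values; a variant not used in the original notebook:

data2 = data.dropna(subset=['Bare Nuclei'])  # drop only rows whose 'Bare Nuclei' value is missing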

Outliers
Outliers are data instances with characteristics that are considerably different from the rest of the dataset. In the example code below, we
will draw a boxplot to identify the columns in the table that contain outliers. Note that the values in all columns (except for 'Bare Nuclei')
are originally stored as 'int64' whereas the values in the 'Bare Nuclei' column are stored as string objects (since the column initially
contains strings such as '?' for representing missing values). Thus, we must convert the column into numeric values first before creating
the boxplot. Otherwise, the column will not be displayed when drawing the boxplot.

Code:

%matplotlib inline

data2 = data.drop(['Class'],axis=1)
data2['Bare Nuclei'] = pd.to_numeric(data2['Bare Nuclei'])
data2.boxplot(figsize=(20,3))

<Axes: >

The boxplots suggest that only 5 of the columns (Marginal Adhesion, Single Epithelial Cell Size, Bland Chromatin, Normal Nucleoli, and Mitoses) contain abnormally high values. To discard the outliers, we can compute the Z-score for each attribute and remove those instances containing attributes with an abnormally high or low Z-score (e.g., if Z > 3 or Z <= -3).

The following code shows the results of standardizing the columns of the data. Note that missing values (NaN) are not affected by the standardization process.

Code:

Z = (data2-data2.mean())/data2.std()
Z[20:25]

    Clump      Uniformity    Uniformity     Marginal   Single Epith.  Bare       Bland      Normal     Mitoses
    Thickness  of Cell Size  of Cell Shape  Adhesion   Cell Size      Nuclei     Chromatin  Nucleoli
20  0.917080   -0.044070     -0.406284      2.519152   0.805662       1.771569   0.640688   0.371049   1.405526
21  1.982519   0.611354      0.603167       0.067638   1.257272       0.948266   1.460910   2.335921   -0.343666
22  -0.503505  -0.699494     -0.742767      -0.632794  -0.549168      -0.698341  -0.589645  -0.611387  -0.343666
23  1.272227   0.283642      0.603167       -0.632794  -0.549168      NaN        1.460910   0.043570   -0.343666

The following code shows the results of discarding rows that contain any attribute with Z > 3 or Z <= -3.

Code:

print('Number of rows before discarding outliers = %d' % (Z.shape[0]))

Z2 = Z.loc[((Z > -3).sum(axis=1)==9) & ((Z <= 3).sum(axis=1)==9),:]

print('Number of rows after discarding outliers = %d' % (Z2.shape[0]))

Number of rows before discarding outliers = 699
Number of rows after discarding outliers = 632


Duplicate Data


Some datasets, especially those obtained by merging multiple data sources, may contain duplicates or near duplicate instances. The term
deduplication is often used to refer to the process of dealing with duplicate data issues.

In the following example, we first check for duplicate instances in the breast cancer dataset.

Code:

dups = data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
data.loc[[11,28]]

Number of duplicate rows = 236

    Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
    Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
11  2          1             1              1         2              1       2          1         1        2
28  2          1             1              1         2              1       2          1         1        2

The duplicated() function will return a Boolean array that indicates whether each row is a duplicate of a previous row in the table. The
results suggest there are 236 duplicate rows in the breast cancer dataset. For example, the instance with row index 11 has identical
attribute values as the instance with row index 28. Although such duplicate rows may correspond to samples for different individuals, in
this hypothetical example, we assume that the duplicates are samples taken from the same individual and illustrate below how to remove
the duplicated rows.

Code:

print('Number of rows before discarding duplicates = %d' % (data.shape[0]))


data2 = data.drop_duplicates()
print('Number of rows after discarding duplicates = %d' % (data2.shape[0]))

Number of rows before discarding duplicates = 699


Number of rows after discarding duplicates = 463
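
drop_duplicates() likewise accepts subset and keep arguments; for example (a hypothetical variant, not in the original notebook), keep='last' retains the final occurrence of each duplicated row instead of the first:

data2 = data.drop_duplicates(keep='last')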

Shuffling Dataframes


It is possible to shuffle the rows of a dataframe into a random order. This is often done before splitting a dataset into training and test portions, so that any ordering in the original file does not bias the split. The example below shuffles the auto-mpg dataset.

import os
import numpy as np
import pandas as pd

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

#np.random.seed(30) # Uncomment this line to get the same shuffle each time

df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)  # drop=True discards the old shuffled index instead of keeping it as a column
df
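
An equivalent, arguably more idiomatic way to shuffle (not used in the original notebook) is Pandas' own sample():

# Sample 100% of the rows in random order, then rebuild a clean 0..n-1 index
df = df.sample(frac=1, random_state=30).reset_index(drop=True)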


mpg cylinders displacement horsepower weight acceleration year origin name

0 14.0 8 318.0 150.0 4237 14.5 73 1 plymouth fury gran sedan

1 22.0 6 232.0 112.0 2835 14.7 82 1 ford granada l

2 19.0 6 250.0 100.0 3282 15.0 71 1 pontiac firebird

3 13.0 8 302.0 129.0 3169 12.0 75 1 ford mustang ii

4 25.0 4 140.0 92.0 2572 14.9 76 1 capri ii

... ... ... ... ... ... ... ... ... ...

393 18.0 6 250.0 88.0 3139 14.5 71 1 ford mustang

394 29.8 4 134.0 90.0 2711 15.5 80 3 toyota corona liftback

395 16.0 6 225.0 105.0 3439 15.5 71 1 plymouth satellite custom

396 28.4 4 151.0 90.0 2670 16.0 79 1 buick skylark limited

397 21.5 3 80.0 110.0 2720 13.5 77 3 mazda rx-4


Sorting Dataframes


It is possible to sort a dataframe by the values of one or more of its columns. The example below sorts the cars by name in ascending order.

df = df.sort_values(by='name',ascending=True)
df

mpg cylinders displacement horsepower weight acceleration year origin name

367 13.0 8 360.0 175.0 3821 11.0 73 1 amc ambassador brougham

123 15.0 8 390.0 190.0 3850 8.5 70 1 amc ambassador dpl

151 17.0 8 304.0 150.0 3672 11.5 72 1 amc ambassador sst

54 24.3 4 151.0 90.0 3003 20.1 80 1 amc concord

306 19.4 6 232.0 90.0 3210 17.2 78 1 amc concord

... ... ... ... ... ... ... ... ... ...

148 44.0 4 97.0 52.0 2130 24.6 82 2 vw pickup

338 29.0 4 90.0 70.0 1937 14.2 76 2 vw rabbit

72 41.5 4 98.0 76.0 2144 14.7 80 2 vw rabbit

79 44.3 4 90.0 48.0 2085 21.7 80 2 vw rabbit c (diesel)

205 31.9 4 89.0 71.0 1925 14.0 79 2 vw rabbit custom


print("The first car is: {}".format(df['name'].iloc[0]))

The first car is: amc ambassador brougham

print("The first car is: {}".format(df['name'].loc[0]))

#loc gets rows (or columns) with particular labels from the index.
#iloc gets rows (or columns) at particular positions in the index (so it only takes integers).


The first car is: datsun 710
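
A tiny self-contained illustration of the difference (a hypothetical three-row frame, not part of the original notebook):

import pandas as pd

s = pd.DataFrame({'name': ['b', 'c', 'a']}).sort_values(by='name')
print(s['name'].iloc[0])  # 'a' -> first row by position after sorting
print(s['name'].loc[0])   # 'b' -> row whose original index label is 0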

Saving a Dataframe


The following code performs a shuffle and then saves a new copy.

import os
import pandas as pd
import numpy as np

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
filename_write = os.path.join(path, "auto-mpg-shuffle.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write, index=False)  # index=False so row numbers are not written to the file
print("Done")

Done

Dropping Fields


Some fields are of no value to the neural network and can be dropped. The following code removes the name column from the MPG
dataset.

import os
import pandas as pd
import numpy as np

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

print("Before drop: {}".format(df.columns))


df.drop('name', axis=1, inplace=True)
print("After drop: {}".format(df.columns))
df[0:5]

Before drop: Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',


'acceleration', 'year', 'origin', 'name'],
dtype='object')
After drop: Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'year', 'origin'],
dtype='object')
mpg cylinders displacement horsepower weight acceleration year origin

0 18.0 8 307.0 130.0 3504 12.0 70 1

1 15.0 8 350.0 165.0 3693 11.5 70 1

2 18.0 8 318.0 150.0 3436 11.0 70 1

3 16.0 8 304.0 150.0 3433 12.0 70 1

Calculated Fields


It is possible to add new fields to the dataframe that are calculated from the other fields. We can create a new column that gives the
weight in kilograms. The equation to calculate a metric weight, given a weight in pounds is:

m(kg) = m(lb) × 0.45359237

This can be used with the following Python code:

import os
import pandas as pd
import numpy as np

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df.insert(1,'weight_kg',(df['weight']*0.45359237).astype(int))
df

mpg weight_kg cylinders displacement horsepower weight acceleration year origin name

0 18.0 1589 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu

1 15.0 1675 8 350.0 165.0 3693 11.5 70 1 buick skylark 320

2 18.0 1558 8 318.0 150.0 3436 11.0 70 1 plymouth satellite

3 16.0 1557 8 304.0 150.0 3433 12.0 70 1 amc rebel sst

4 17.0 1564 8 302.0 140.0 3449 10.5 70 1 ford torino

... ... ... ... ... ... ... ... ... ... ...

393 27.0 1265 4 140.0 86.0 2790 15.6 82 1 ford mustang gl

394 44.0 966 4 97.0 52.0 2130 24.6 82 2 vw pickup

395 32.0 1040 4 135.0 84.0 2295 11.6 82 1 dodge rampage

396 28.0 1190 4 120.0 79.0 2625 18.6 82 1 ford ranger

397 31.0 1233 4 119.0 82.0 2720 19.4 82 1 chevy s-10


Feature Normalization


A normalization puts numbers into a standard form so that two values can easily be compared. One very common machine learning normalization is the Z-score:

$$z = \frac{x - \mu}{\sigma}$$

To calculate the Z-score you also need the mean ($\mu$) and the standard deviation ($\sigma$). The mean is calculated as follows:

$$\mu = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The standard deviation is calculated as follows:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}, \quad \text{where } \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The following Python code replaces the mpg column with its z-score. Cars with average MPG will be near zero, above zero is above average, and below zero is below average. Z-scores above 3 or below -3 are very rare; these are outliers.

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df['mpg'] = zscore(df['mpg'])
df


mpg cylinders displacement horsepower weight acceleration year origin name

0 -0.706439 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu

1 -1.090751 8 350.0 165.0 3693 11.5 70 1 buick skylark 320

2 -0.706439 8 318.0 150.0 3436 11.0 70 1 plymouth satellite

3 -0.962647 8 304.0 150.0 3433 12.0 70 1 amc rebel sst

4 -0.834543 8 302.0 140.0 3449 10.5 70 1 ford torino

... ... ... ... ... ... ... ... ... ...

393 0.446497 4 140.0 86.0 2790 15.6 82 1 ford mustang gl

394 2.624265 4 97.0 52.0 2130 24.6 82 2 vw pickup

395 1.087017 4 135.0 84.0 2295 11.6 82 1 dodge rampage

396 0.574601 4 120.0 79.0 2625 18.6 82 1 ford ranger

397 0.958913 4 119.0 82.0 2720 19.4 82 1 chevy s-10


Missing Values


You can also simply drop any rows with any NA values. Another common practice is to replace missing values with the median value for
that column. The following code replaces any NA values in horsepower with the median:

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)

# df = df.dropna() # you can also simply drop NA values

Concatenating Rows and Columns


Rows and columns can be concatenated together to form new data frames.

# Create a new dataframe from name and horsepower

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower],axis=1)
result


name horsepower

0 chevrolet chevelle malibu 130.0

1 buick skylark 320 165.0

2 plymouth satellite 150.0

3 amc rebel sst 150.0

4 ford torino 140.0

... ... ...

393 ford mustang gl 86.0

394 vw pickup 52.0

395 dodge rampage 84.0

396 ford ranger 79.0

397 chevy s-10 82.0


# Create a new dataframe from name and horsepower, but this time by row

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower])
result

0 chevrolet chevelle malibu

1 buick skylark 320

2 plymouth satellite

3 amc rebel sst

4 ford torino

... ...

393 86.0

394 52.0

395 84.0

396 79.0

397 82.0

796 rows × 1 columns

Helpful Functions for TensorFlow (Little Gems)


The following functions will be used with TensorFlow to help preprocess the data.

They allow you to build the feature vector in the format that TensorFlow expects from raw data.
(1) Encoding data:

encode_text_dummy - Encode text fields as numeric dummy variables, such as the iris species as a single field for each class. Three classes would become "0,0,1", "0,1,0", and "1,0,0". Encode non-target features this way; used when the data is part of the input (one-hot encoding).
encode_text_index - Encode text fields to numeric indexes, such as the iris species as a single numeric field taking the values "0", "1", and "2". Encode the target field of a classification this way; used when the data is part of the output (label encoding).

(2) Normalizing data:

encode_numeric_zscore - Encode numeric values as a z-score. Neural networks deal well with "normalized" fields only.
encode_numeric_range - Encode a column to a range between the given normalized_low and normalized_high.

(3) Dealing with missing data:

missing_median - Fill all missing values with the median value.

(4) Removing outliers:

remove_outliers - Remove outliers in a certain column with a value beyond X times SD

(5) Creating the feature vector and target vector that TensorFlow needs:

to_xy - Once all fields are encoded to numeric, this function can provide the x and y matrices that TensorFlow needs to fit the neural network with data.

(6) Other utility functions:

hms_string - Print out an elapsed time string.


chart_regression - Display a chart to show how well a regression performs.

import collections.abc
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os

# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to indexes (i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.
    target_type = df[target].dtypes
    target_type = target_type[0] if isinstance(target_type, collections.abc.Sequence) else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[target].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    a = plt.plot(t['y'].tolist(), label='expected')
    b = plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)

# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    if data_low is None:
        data_low = min(df[name])
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) * (normalized_high - normalized_low) + normalized_low

Examples of label encoding, one-hot encoding, and creating X/Y for TensorFlow

df = pd.read_csv("/data/iris.csv", na_values=['NA', '?'])  # iris.csv is available on Canvas under Files --> Lab Help

encode_text_index(df,"species") # label encoding
df

sepal_l sepal_w petal_l petal_w species

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2

149 5.9 3.0 5.1 1.8 2


df=pd.read_csv("/data/iris.csv",na_values=['NA','?']) #The file is available to download from canvas in path Files --> Lab Help -

encode_text_dummy(df,"species") # One hot encoding


df

sepal_l sepal_w petal_l petal_w species-Iris-setosa species-Iris-versicolor species-Iris-virginica

0 5.1 3.5 1.4 0.2 True False False

1 4.9 3.0 1.4 0.2 True False False

2 4.7 3.2 1.3 0.2 True False False

3 4.6 3.1 1.5 0.2 True False False

4 5.0 3.6 1.4 0.2 True False False

... ... ... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 False False True

146 6.3 2.5 5.0 1.9 False False True

147 6.5 3.0 5.2 2.0 False False True

148 6.2 3.4 5.4 2.3 False False True

149 5.9 3.0 5.1 1.8 False False True


Make sure you encode the labels first before you call to_xy()

df = pd.read_csv("/data/iris.csv", na_values=['NA', '?'])

encode_text_index(df,"species") # encoding first before you call to_xy()

df


sepal_l sepal_w petal_l petal_w species

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2

149 5.9 3.0 5.1 1.8 2


x,y = to_xy(df,"species")

Displaying x and y shows that x is a 150 x 4 float32 array holding the four feature columns and y is a 150 x 3 float32 array holding the one-hot encoded species labels. (The lengthy array printouts are condensed here; x ends with the row [5.9, 3., 5.1, 1.8] and y ends with rows of [0., 0., 1.].)


Example of Dealing with Missing Values and Outliers


path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina

filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])

# Handle mising values in horsepower


missing_median(df, 'horsepower')
#df.drop('name', 1,inplace=True)

# Drop outliers in horsepower


print("Length before MPG outliers dropped: {}".format(len(df)))
remove_outliers(df,'mpg',2)
print("Length after MPG outliers dropped: {}".format(len(df)))

Length before MPG outliers dropped: 398


Length after MPG outliers dropped: 388

Training and Validation


The machine learning model will learn from the training data, but ultimately be evaluated on the validation data.

Training Data - In-Sample Data - The data that the machine learning model was fit to/created from.
Validation Data - Out-of-Sample Data - The data that the machine learning model is evaluated on after it is fit to the training data.

There are two predominant means of dealing with training and validation data:

Training/Test Split - The data are split according to some ratio between a training and a validation (hold-out) set. Common ratios are 80% training and 20% validation.
K-Fold Cross Validation - The data are split into a number of folds, and one model is trained per fold. Because a number of models equal to the number of folds is created, out-of-sample predictions can be generated for the entire dataset (see the sketch below).
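
K-fold cross validation is only described in this notebook, not demonstrated; a minimal sketch using scikit-learn's KFold (the 5 folds and the logistic regression model are assumptions) follows:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_iris, y_iris = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_iris)):
    # Each fold trains on 4/5 of the data and validates on the held-out 1/5
    model = LogisticRegression(max_iter=1000)
    model.fit(X_iris[train_idx], y_iris[train_idx])
    print("Fold {}: accuracy = {:.3f}".format(fold, model.score(X_iris[test_idx], y_iris[test_idx])))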

Training/Test Split


The code below performs a split of the iris data into a training and validation set. The training set uses 75% of the data and the test (validation) set uses the remaining 25%. The model is fit to the training portion and then validated against the held-out portion.

import pandas as pd
import io
import numpy as np
import os
from sklearn.model_selection import train_test_split

path = "/data"  # iris.csv is available to download from Canvas under Files --> Lab Help

filename = os.path.join(path, "iris.csv")
df = pd.read_csv(filename, na_values=['NA', '?'])

df[0:5]

sepal_l sepal_w petal_l petal_w species

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df['encoded_species'] = le.fit_transform(df['species'])

df[0:5]

sepal_l sepal_w petal_l petal_w species encoded_species

0 5.1 3.5 1.4 0.2 Iris-setosa 0

1 4.9 3.0 1.4 0.2 Iris-setosa 0

2 4.7 3.2 1.3 0.2 Iris-setosa 0

3 4.6 3.1 1.5 0.2 Iris-setosa 0

# Split into train/test

x_train, x_test, y_train, y_test = train_test_split(df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']],
                                                    df['encoded_species'],
                                                    test_size=0.25)  # 25% held out -> the 112/38 shapes below

x_train.shape

(112, 4)

y_train.shape

(112,)

x_test.shape

(38, 4)

y_test.shape

(38,)
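
The split arrays can now be fed to any estimator. The notebook itself does not fit a model at this point; a minimal sketch (the decision tree is an assumption) would be:

from sklearn.tree import DecisionTreeClassifier

# Fit on the training portion, evaluate on the held-out portion
clf = DecisionTreeClassifier(random_state=42)
clf.fit(x_train, y_train)
print("Validation accuracy: {:.3f}".format(clf.score(x_test, y_test)))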

Aggregation

Data aggregation is a preprocessing task where the values of two or more objects are combined into a single object. The motivation for aggregation includes (1) reducing the size of the data to be processed, (2) changing the granularity of analysis (from fine-scale to coarser-scale), and (3) improving the stability of the data.

In the example below, we will use the daily precipitation time series data for a weather station located at Detroit Metro Airport. The raw
data was obtained from the Climate Data Online website (https://www.ncdc.noaa.gov/cdo-web/). The daily precipitation time series will be
compared against its monthly values.

Code:

The code below will load the precipitation time series data and draw a line plot of its daily time series.

daily = pd.read_csv('/data/DTW_prec.csv', header='infer')  # DTW_prec.csv is available on Canvas under Files --> Lab Help
daily.index = pd.to_datetime(daily['DATE'])
daily = daily['PRCP']
ax = daily.plot(kind='line', figsize=(15,3))
ax.set_title('Daily Precipitation (variance = %.4f)' % (daily.var()))

Text(0.5, 1.0, 'Daily Precipitation (variance = 0.0530)')

daily

DATE
2001-01-01 0.00
2001-01-02 0.00
2001-01-03 0.00
2001-01-04 0.04
2001-01-05 0.14
...
2017-12-27 0.00
2017-12-28 0.00
2017-12-29 0.00
2017-12-30 0.00
2017-12-31 0.00
Name: PRCP, Length: 6191, dtype: float64

Observe that the daily time series appears quite chaotic, varying significantly from one time step to the next. The series can be grouped and aggregated by month to obtain total monthly precipitation values. The resulting monthly series varies more smoothly than the daily series.

Code:

monthly = daily.groupby(pd.Grouper(freq='M')).sum()
ax = monthly.plot(kind='line',figsize=(15,3))
ax.set_title('Monthly Precipitation (variance = %.4f)' % (monthly.var()))


<ipython-input-60-5ae1711a1e83>:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME'
monthly = daily.groupby(pd.Grouper(freq='M')).sum()
Text(0.5, 1.0, 'Monthly Precipitation (variance = 2.4241)')

In the example below, the daily precipitation time series are grouped and aggregated by year to obtain the annual precipitation values.

Code:

annual = daily.groupby(pd.Grouper(freq='Y')).sum()
ax = annual.plot(kind='line',figsize=(15,3))
ax.set_title('Annual Precipitation (variance = %.4f)' % (annual.var()))

<ipython-input-61-46ed05734b2b>:1: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE'
annual = daily.groupby(pd.Grouper(freq='Y')).sum()
Text(0.5, 1.0, 'Annual Precipitation (variance = 23.6997)')

Sampling
Sampling is an approach commonly used to facilitate (1) data reduction for exploratory data analysis and scaling up algorithms to big data
applications and (2) quantifying uncertainties due to varying data distributions. There are various methods available for data sampling,
such as sampling without replacement, where each selected instance is removed from the dataset, and sampling with replacement, where
each selected instance is not removed, thus allowing it to be selected more than once in the sample.

In the example below, we will apply sampling with replacement and without replacement to the breast cancer dataset obtained from the
UCI machine learning repository.

Code:

We initially display the first five records of the table.

data.head()


   Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
   Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
0  5          1             1              1         2              1       3          1         1        2
1  5          4             4              5         7              10      3          2         1        2
2  3          1             1              1         2              2       3          1         1        2
3  6          8             8              1         3              4       3          7         1        2

In the following code, a sample of size 3 is randomly selected (without replacement) from the original data.

Code:

sample = data.sample(n=3)
sample

     Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
     Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
505  3          3             1              1         2              1       1          1         1        2
595  5          1             1              1         2              1       2          1         1        2

In the next example, we randomly select 1% of the data (without replacement) and display the selected samples. The random_state
argument of the function specifies the seed value of the random number generator.

Code:

sample = data.sample(frac=0.01, random_state=1)


sample

     Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
     Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
584  5          1             1              6         3              1       1          1         1        2
417  1          1             1              1         2              1       2          1         1        2
606  4          1             1              2         2              1       1          1         1        2
349  4          2             3              5         3              8       7          6         1        4
134  3          1             1              1         3              1       2          1         1        2
502  4          1             1              2         2              1       2          1         1        2

Finally, we perform a sampling with replacement to create a sample whose size is equal to 1% of the entire data. You should be able to
observe duplicate instances in the sample by increasing the sample size.

Code:

sample = data.sample(frac=0.01, replace=True, random_state=1)
sample

     Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
     Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
37   6          2             1              1         1              1       7          1         1        2
235  3          1             4              1         2              NaN     3          1         1        2
72   1          3             3              2         2              1       7          2         1        2
645  3          1             1              1         2              1       2          1         1        2
144  2          1             1              1         2              1       2          1         1        2
129  1          1             1              1         10             1       1          1         1        2
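As a quick check of the claim above (a hypothetical larger sample, not part of the original notebook), counting repeated index labels reveals the duplicates that sampling with replacement produces:

bigger = data.sample(frac=0.2, replace=True, random_state=1)
print('Duplicated rows in sample = %d' % bigger.index.duplicated().sum())
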

Discretization
Discretization is a data preprocessing step that is often used to transform a continuous-valued attribute to a categorical attribute. The
example below illustrates two simple but widely-used unsupervised discretization methods (equal width and equal depth) applied to the
'Clump Thickness' attribute of the breast cancer dataset.

First, we plot a histogram that shows the distribution of the attribute values. The value_counts() function can also be applied to count the
frequency of each attribute value.

Code:

data['Clump Thickness'].hist(bins=10)
data['Clump Thickness'].value_counts(sort=False)


Clump Thickness
5     130
3     108
6      34
4      80
8      46
1     145
2      50
7      23
10     69
9      14
Name: count, dtype: int64

For the equal width method, we can apply the cut() function to discretize the attribute into 4 bins of similar interval widths. The
value_counts() function can be used to determine the number of instances in each bin.

Code:

bins = pd.cut(data['Clump Thickness'],4)


bins.value_counts(sort=False)

Clump Thickness
(0.991, 3.25]    303
(3.25, 5.5]      210
(5.5, 7.75]       57
(7.75, 10.0]     129
Name: count, dtype: int64

For the equal frequency method, the qcut() function can be used to partition the values into 4 bins such that each bin has nearly the same
number of instances.

Code:

bins = pd.qcut(data['Clump Thickness'],4)


bins.value_counts(sort=False)

Clump Thickness
(0.999, 2.0]    195
(2.0, 4.0]      188
(4.0, 6.0]      164
(6.0, 10.0]     152
Name: count, dtype: int64

Principal Component Analysis


Principal component analysis (PCA) is a classical method for reducing the number of attributes in the data by projecting the data from its
original high-dimensional space into a lower-dimensional space. The new attributes (also known as components) created by PCA have the
following properties: (1) they are linear combinations of the original attributes, (2) they are orthogonal (perpendicular) to each other, and
(3) they capture the maximum amount of variation in the data.

The example below illustrates the application of PCA to an image dataset. There are 16 RGB files, each of which has a size of 111 x 111 pixels. The example code below will read each image file and flatten the RGB image into 111 x 111 x 3 = 36963 feature values, creating a data matrix of size 16 x 36963.

Code:

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

numImages = 16
fig = plt.figure(figsize=(7,7))
imgData = np.zeros(shape=(numImages,36963))

for i in range(1, numImages+1):
    filename = '/data/pics/Picture'+str(i)+'.jpg'  # the pics folder is available on Canvas under Files --> Lab Help
    img = mpimg.imread(filename)
    ax = fig.add_subplot(4, 4, i)
    plt.imshow(img)
    plt.axis('off')
    ax.set_title(str(i))
    imgData[i-1] = np.array(img.flatten()).reshape(1, img.shape[0]*img.shape[1]*img.shape[2])


Using PCA, the data matrix is projected to its first two principal components. The projected values of the original image data are stored in
a pandas DataFrame object named projected.

Code:

import pandas as pd
from sklearn.decomposition import PCA

numComponents = 2
pca = PCA(n_components=numComponents)
pca.fit(imgData)

projected = pca.transform(imgData)
projected = pd.DataFrame(projected,columns=['pc1','pc2'],index=range(1,numImages+1))
projected['food'] = ['burger', 'burger','burger','burger','drink','drink','drink','drink',
'pasta', 'pasta', 'pasta', 'pasta', 'chicken', 'chicken', 'chicken', 'chicken']
projected


pc1 pc2 food

1 1592.891398 -6650.655315 burger

2 513.010062 -6333.930780 burger

3 -963.256334 -7210.464745 burger

4 -2165.075382 -9038.865901 burger

5 7842.472601 1064.496270 drink

6 8458.912039 5385.049711 drink

7 11181.792685 5359.952065 drink

8 6831.001348 -1129.478069 drink

9 -7639.881092 5060.785108 pasta

10 704.458513 532.387235 pasta

11 -7237.636932 5287.195103 pasta

12 -4426.758058 4630.679800 pasta

13 -11866.492500 -1519.390075 chicken

14 -73.978096 -1380.857789 chicken

15 7510.615629 1188.363637 chicken

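It can also be instructive to check how much of the total variation the two retained components capture; scikit-learn exposes this on the fitted PCA object:

print(pca.explained_variance_ratio_)  # fraction of the total variance captured by pc1 and pc2
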

Finally, we draw a scatter plot to display the projected values. Observe that the images of burgers, drinks, and pasta each project to their own region of the plot. The images of fried chicken (shown as black squares in the diagram), however, are harder to discriminate.

Code:

import matplotlib.pyplot as plt

colors = {'burger':'b', 'drink':'r', 'pasta':'g', 'chicken':'k'}


markerTypes = {'burger':'+', 'drink':'x', 'pasta':'o', 'chicken':'s'}

for foodType in markerTypes:
    d = projected[projected['food']==foodType]
    plt.scatter(d['pc1'], d['pc2'], c=colors[foodType], s=60, marker=markerTypes[foodType])


Feature Selection Techniques


The goal of feature selection techniques in machine learning is to find the best set of features that allows one to build optimized models of
studied phenomena.

Information Gain

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv('/data/Admission_Predict_Ver1.1.csv')  # the file is available on Canvas under Files --> Lab Help
X = df.iloc[:,1:-1]
Y = df.iloc[:,-1]
importances = mutual_info_regression(X, Y)
feat_importances = pd.Series(importances,X.columns)
feat_importances.plot(kind='barh',color = 'teal')
plt.show()

Chi-square Test


The Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with the best Chi-square scores. In order to correctly apply the Chi-square test to the relation between various features in the dataset and the target variable, the following conditions have to be met: the variables have to be categorical and sampled independently, and the values should have an expected frequency greater than 5.

from sklearn.feature_selection import SelectKBest


from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder

X_Cat = X.astype(int)
le = LabelEncoder()
chi2_features = SelectKBest(chi2, k = 3)
Y = le.fit_transform(Y)
X_kbest_features = chi2_features.fit_transform(X_Cat,Y)

print('Original feature number:', X_Cat.shape[1])


print('Reduced feature number: ',X_kbest_features.shape[1])


Original feature number: 7


Reduced feature number: 3
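
To see which three features survived the selection, get_support() on the fitted selector can be mapped back to the column names:

selected = X_Cat.columns[chi2_features.get_support()]
print('Selected features:', list(selected))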

Correlation Coefficient


Correlation is a measure of the linear relationship between 2 or more variables. Through correlation, we can predict one variable from the
other. The logic behind using correlation for feature selection is that good variables correlate highly with the target. Furthermore, variables
should be correlated with the target but uncorrelated among themselves.

If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only needs one, as
the second does not add additional information. We will use the Pearson Correlation here.

import seaborn as sns


import matplotlib.pyplot as plt
%matplotlib inline

cor = df.corr()

plt.figure(figsize=(10,6))
sns.heatmap(cor,annot= True)

<Axes: >

Variance Threshold


The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that this does not take the relationship between feature variables, or between feature and target variables, into account, which is one of the drawbacks of filter methods.

get_support() returns a Boolean vector where True means the variable does not have zero variance.

from sklearn.feature_selection import VarianceThreshold

# X still holds the feature columns created in the Information Gain example above
v_threshold = VarianceThreshold(threshold=0)
v_threshold.fit(X)
v_threshold.get_support()

array([ True, True, True, True, True, True, True])

Mean Absolute Difference (MAD)


‘The mean absolute difference (MAD) computes the absolute difference from the mean value. The main difference between the variance
and MAD measures is the absence of the square in the latter. The MAD, like the variance, is also a scaled variant.’ This means that the
higher the MAD, the higher the discriminatory power.

mean_abs_diff = np.sum(np.abs(X -np.mean(X, axis=0)), axis=0)/X.shape[0]

plt.bar(np.arange(X.shape[1]),mean_abs_diff,color = 'teal')

<BarContainer object of 7 artists>

Backward Elimination Techniques


Importing required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing as prepr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score

Import the dataset and make the necessary changes


df = pd.read_csv('/data/Admission_Predict_Ver1.1.csv')  # the file is available on Canvas under Files --> Lab Help
print('Shape:{}'.format(df.shape))
print(df.head(10))
print(df.describe())
df.columns = [x.replace(' ', '').replace('.', '').lower() for x in list(df)]  # convert column names to lowercase single words
del df['serialno']

Shape:(500, 9)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \
0 1 337 118 4 4.5 4.5 9.65
1 2 324 107 4 4.0 4.5 8.87
2 3 316 104 3 3.0 3.5 8.00
3 4 322 110 3 3.5 2.5 8.67
4 5 314 103 2 2.0 3.0 8.21
5 6 330 115 5 4.5 3.0 9.34
6 7 321 109 3 3.0 4.0 8.20
7 8 308 101 2 3.0 4.0 7.90
8 9 302 102 1 2.0 1.5 8.00
9 10 323 108 3 3.5 3.0 8.60

Research Chance of Admit


0 1 0.92
1 1 0.76
2 1 0.72
3 1 0.80
4 0 0.65
5 1 0.90
6 1 0.75
7 0 0.68
8 0 0.50
9 0 0.45
Serial No. GRE Score TOEFL Score University Rating SOP \
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 250.500000 316.472000 107.192000 3.114000 3.374000
std 144.481833 11.295148 6.081868 1.143512 0.991004
min 1.000000 290.000000 92.000000 1.000000 1.000000
25% 125.750000 308.000000 103.000000 2.000000 2.500000
50% 250.500000 317.000000 107.000000 3.000000 3.500000
75% 375.250000 325.000000 112.000000 4.000000 4.000000
max 500.000000 340.000000 120.000000 5.000000 5.000000

LOR CGPA Research Chance of Admit


count 500.00000 500.000000 500.000000 500.00000
mean 3.48400 8.576440 0.560000 0.72174
std 0.92545 0.604813 0.496884 0.14114
min 1.00000 6.800000 0.000000 0.34000
25% 3.00000 8.127500 0.000000 0.63000
50% 3.50000 8.560000 1.000000 0.72000
75% 4.00000 9.040000 1.000000 0.82000
max 5.00000 9.920000 1.000000 0.97000

df.boxplot(showbox=True, figsize=(10,8))


<Axes: >

Normalizing the entire dataset using Min-Max scaling


cols = list(df)
scaler = prepr.MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=cols)
print(scaled_df.head(10))
print(scaled_df.describe())

grescore toeflscore universityrating sop lor cgpa research \


0 0.94 0.928571 0.75 0.875 0.875 0.913462 1.0
1 0.68 0.535714 0.75 0.750 0.875 0.663462 1.0
2 0.52 0.428571 0.50 0.500 0.625 0.384615 1.0
3 0.64 0.642857 0.50 0.625 0.375 0.599359 1.0
4 0.48 0.392857 0.25 0.250 0.500 0.451923 0.0
5 0.80 0.821429 1.00 0.875 0.500 0.814103 1.0
6 0.62 0.607143 0.50 0.500 0.750 0.448718 1.0
7 0.36 0.321429 0.25 0.500 0.750 0.352564 0.0
8 0.24 0.357143 0.00 0.250 0.125 0.384615 0.0
9 0.66 0.571429 0.50 0.625 0.500 0.576923 0.0

chanceofadmit
0 0.920635
1 0.666667
2 0.603175
3 0.730159
4 0.492063
5 0.888889
6 0.650794
7 0.539683
8 0.253968
9 0.174603
grescore toeflscore universityrating sop lor \
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.529440 0.542571 0.528500 0.593500 0.621000
std 0.225903 0.217210 0.285878 0.247751 0.231362
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.360000 0.392857 0.250000 0.375000 0.500000
50% 0.540000 0.535714 0.500000 0.625000 0.625000
75% 0.700000 0.714286 0.750000 0.750000 0.750000
max 1.000000 1.000000 1.000000 1.000000 1.000000

cgpa research chanceofadmit


count 500.000000 500.000000 500.000000
mean 0.569372 0.560000 0.605937
std 0.193850 0.496884 0.224032
min 0.000000 0.000000 0.000000
25% 0.425481 0.000000 0.460317
50% 0.564103 1.000000 0.603175
75% 0.717949 1.000000 0.761905
max 1.000000 1.000000 1.000000

Pairplot

This is a scatter plot of all the attributes against each other on the x and y axes. 'chanceofadmit' is the dependent variable, and the plot clearly shows that a few attributes have a linear relation with it, so we can move ahead with linear regression.

sns.pairplot(scaled_df)


<seaborn.axisgrid.PairGrid at 0x7fb49256cc50>

Creating X and Y for the linear regression equation ($Y = a + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$), where Y is the dependent variable and the X's are the independent variables

cols = list(scaled_df)
X = scaled_df.iloc[:, :-1]
y = scaled_df[cols[-1]]

Using the statsmodels Linear Regression Model

OLS is the function used to create a linear model. This model defines the linear regression formula as $Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$, where $X_0$ is identically 1 so that $\beta_0$ plays the role of the intercept a.

So we create a new attribute with all ones and append it to the start of the dataframe; this will serve as X0.
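
The notebook is cut off at this point. A minimal sketch of how these remaining steps are commonly completed is shown below; the sm.add_constant call, the elimination loop, and the 0.05 significance threshold are all sketch assumptions, not taken from the original:

import statsmodels.api as sm

X_ols = sm.add_constant(X)  # prepends the all-ones 'const' column that serves as X0
kept = list(X_ols.columns)

# Repeatedly fit OLS and drop the predictor with the highest p-value above 0.05
while True:
    model = sm.OLS(y, X_ols[kept]).fit()
    pvals = model.pvalues.drop('const')
    if pvals.empty or pvals.max() <= 0.05:
        break
    kept.remove(pvals.idxmax())

print('Remaining features:', [c for c in kept if c != 'const'])
print(model.summary())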

