Module 4: Data Preprocessing


The following tutorial contains Python examples for data preprocessing. You should refer to the "Data" chapter of the "Introduction to Data
Mining" book (slides are available at https://www-users.cs.umn.edu/~kumar001/dmbook/index.php) to understand some of the concepts
introduced in this tutorial. Data preprocessing consists of a broad set of techniques for cleaning, selecting, and transforming data to
improve data mining analysis. Read the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and
press the SHIFT-ENTER keys simultaneously.

Data Quality Issues


Poor data quality can have an adverse effect on data mining. Common data quality issues include noise, outliers, missing values, and duplicate data. This section presents examples of Python code to alleviate some of these data quality problems. We begin with an example dataset from the UCI machine learning repository containing information about breast cancer patients. We will first download the dataset using the Pandas read_csv() function and display its first 5 data points.

Code:

import pandas as pd
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                   header=None)  # the raw file has no header row
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

data = data.drop(['Sample code'], axis=1)

print('Number of instances = %d' % (data.shape[0]))
print('Number of attributes = %d' % (data.shape[1]))
data.head()

Number of instances = 699
Number of attributes = 10

   Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
   Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
0  5          1             1              1         2              1       3          1         1        2
1  5          4             4              5         7              10      3          2         1        2
2  3          1             1              1         2              2       3          1         1        2
3  6          8             8              1         3              4       3          7         1        2


Missing Values


It is not unusual for an object to be missing one or more attribute values. In some cases the information was not collected, while in other cases some attributes are inapplicable to the data instances. This section presents examples of the different approaches for handling missing values.

According to the description of the data (https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)), the missing values are encoded as '?' in the original data. Our first task is to convert the missing values to NaNs. We can then count the number of missing values in each column of the data.

Code:

import numpy as np

data = data.replace('?', np.nan)

print('Number of instances = %d' % (data.shape[0]))


print('Number of attributes = %d' % (data.shape[1]))

print('Number of missing values:')


for col in data.columns:
    print('\t%s: %d' % (col, data[col].isna().sum()))

Number of instances = 699


Number of attributes = 10
Number of missing values:
Clump Thickness: 0
Uniformity of Cell Size: 0
Uniformity of Cell Shape: 0
Marginal Adhesion: 0
Single Epithelial Cell Size: 0
Bare Nuclei: 16
Bland Chromatin: 0
Normal Nucleoli: 0
Mitoses: 0
Class: 0

Observe that only the 'Bare Nuclei' column contains missing values. In the following example, the missing values in the 'Bare Nuclei'
column are replaced by the median value of that column. The values before and after replacement are shown for a subset of the data
points.

Code:

# Convert the 'Bare Nuclei' column to numeric before calculating the median.
data2 = pd.to_numeric(data['Bare Nuclei'], errors='coerce')

print('Before replacing missing values:')
print(data2[20:25])

data2 = data2.fillna(data2.median())

print('\nAfter replacing missing values:')
print(data2[20:25])

Before replacing missing values:


20 10.0
21 7.0
22 1.0
23 NaN
24 1.0
Name: Bare Nuclei, dtype: float64

After replacing missing values:


20 10.0
21 7.0
22 1.0
23 1.0
24 1.0
Name: Bare Nuclei, dtype: float64

Instead of replacing the missing values, another common approach is to discard the data points that contain missing values. This can be
easily accomplished by applying the dropna() function to the data frame.

Code:

print('Number of rows in original data = %d' % (data.shape[0]))

data2 = data.dropna()
print('Number of rows after discarding missing values = %d' % (data2.shape[0]))

Number of rows in original data = 699


Number of rows after discarding missing values = 683
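
dropna() also accepts a subset argument when only particular columns should be checked for missing values; a variant not used in the original notebook:

data2 = data.dropna(subset=['Bare Nuclei'])  # drop only rows whose 'Bare Nuclei' value is missing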

Outliers
Outliers are data instances with characteristics that are considerably different from the rest of the dataset. In the example code below, we
will draw a boxplot to identify the columns in the table that contain outliers. Note that the values in all columns (except for 'Bare Nuclei')
are originally stored as 'int64' whereas the values in the 'Bare Nuclei' column are stored as string objects (since the column initially
contains strings such as '?' for representing missing values). Thus, we must convert the column into numeric values first before creating
the boxplot. Otherwise, the column will not be displayed when drawing the boxplot.

Code:

%matplotlib inline

data2 = data.drop(['Class'],axis=1)
data2['Bare Nuclei'] = pd.to_numeric(data2['Bare Nuclei'])
data2.boxplot(figsize=(20,3))

<Axes: >

The boxplots suggest that only 5 of the columns (Marginal Adhesion, Single Epithelial Cell Size, Bland Chromatin, Normal Nucleoli, and Mitoses) contain abnormally high values. To discard the outliers, we can compute the Z-score for each attribute and remove those instances containing attributes with an abnormally high or low Z-score (e.g., if Z > 3 or Z <= -3).

The following code shows the results of standardizing the columns of the data. Note that missing values (NaN) are not affected by the standardization process.

Code:

Z = (data2-data2.mean())/data2.std()
Z[20:25]

    Clump      Uniformity    Uniformity     Marginal   Single Epith.  Bare       Bland      Normal     Mitoses
    Thickness  of Cell Size  of Cell Shape  Adhesion   Cell Size      Nuclei     Chromatin  Nucleoli
20  0.917080   -0.044070     -0.406284      2.519152   0.805662       1.771569   0.640688   0.371049   1.405526
21  1.982519   0.611354      0.603167       0.067638   1.257272       0.948266   1.460910   2.335921   -0.343666
22  -0.503505  -0.699494     -0.742767      -0.632794  -0.549168      -0.698341  -0.589645  -0.611387  -0.343666
23  1.272227   0.283642      0.603167       -0.632794  -0.549168      NaN        1.460910   0.043570   -0.343666

The following code shows the results of discarding rows that contain any attribute with Z > 3 or Z <= -3.

Code:

print('Number of rows before discarding outliers = %d' % (Z.shape[0]))

Z2 = Z.loc[((Z > -3).sum(axis=1)==9) & ((Z <= 3).sum(axis=1)==9),:]

print('Number of rows after discarding outliers = %d' % (Z2.shape[0]))

Number of rows before discarding outliers = 699
Number of rows after discarding outliers = 632


Duplicate Data


Some datasets, especially those obtained by merging multiple data sources, may contain duplicates or near duplicate instances. The term
deduplication is often used to refer to the process of dealing with duplicate data issues.

In the following example, we first check for duplicate instances in the breast cancer dataset.

Code:

dups = data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
data.loc[[11,28]]

Number of duplicate rows = 236

    Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
    Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
11  2          1             1              1         2              1       2          1         1        2
28  2          1             1              1         2              1       2          1         1        2

The duplicated() function will return a Boolean array that indicates whether each row is a duplicate of a previous row in the table. The
results suggest there are 236 duplicate rows in the breast cancer dataset. For example, the instance with row index 11 has identical
attribute values as the instance with row index 28. Although such duplicate rows may correspond to samples for different individuals, in
this hypothetical example, we assume that the duplicates are samples taken from the same individual and illustrate below how to remove
the duplicated rows.

Code:

print('Number of rows before discarding duplicates = %d' % (data.shape[0]))


data2 = data.drop_duplicates()
print('Number of rows after discarding duplicates = %d' % (data2.shape[0]))

Number of rows before discarding duplicates = 699


Number of rows after discarding duplicates = 463
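
drop_duplicates() likewise accepts subset and keep arguments; for example (a hypothetical variant, not in the original notebook), keep='last' retains the final occurrence of each duplicated row instead of the first:

data2 = data.drop_duplicates(keep='last')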

Shuffling Dataframes


It is possible to shuffle the rows of a dataframe into a random order. This is often done before splitting a dataset into training and test portions, so that any ordering in the original file does not bias the split. The example below shuffles the auto-mpg dataset.

import os
import numpy as np
import pandas as pd

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

#np.random.seed(30) # Uncomment this line to get the same shuffle each time

df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)  # drop=True discards the old shuffled index instead of keeping it as a column
df
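
An equivalent, arguably more idiomatic way to shuffle (not used in the original notebook) is Pandas' own sample():

# Sample 100% of the rows in random order, then rebuild a clean 0..n-1 index
df = df.sample(frac=1, random_state=30).reset_index(drop=True)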


mpg cylinders displacement horsepower weight acceleration year origin name

0 14.0 8 318.0 150.0 4237 14.5 73 1 plymouth fury gran sedan

1 22.0 6 232.0 112.0 2835 14.7 82 1 ford granada l

2 19.0 6 250.0 100.0 3282 15.0 71 1 pontiac firebird

3 13.0 8 302.0 129.0 3169 12.0 75 1 ford mustang ii

4 25.0 4 140.0 92.0 2572 14.9 76 1 capri ii

... ... ... ... ... ... ... ... ... ...

393 18.0 6 250.0 88.0 3139 14.5 71 1 ford mustang

394 29.8 4 134.0 90.0 2711 15.5 80 3 toyota corona liftback

395 16.0 6 225.0 105.0 3439 15.5 71 1 plymouth satellite custom

396 28.4 4 151.0 90.0 2670 16.0 79 1 buick skylark limited

397 21.5 3 80.0 110.0 2720 13.5 77 3 mazda rx-4


Sorting Dataframes


It is possible to sort a dataframe by the values of one or more of its columns. The example below sorts the cars by name in ascending order.

df = df.sort_values(by='name',ascending=True)
df

mpg cylinders displacement horsepower weight acceleration year origin name

367 13.0 8 360.0 175.0 3821 11.0 73 1 amc ambassador brougham

123 15.0 8 390.0 190.0 3850 8.5 70 1 amc ambassador dpl

151 17.0 8 304.0 150.0 3672 11.5 72 1 amc ambassador sst

54 24.3 4 151.0 90.0 3003 20.1 80 1 amc concord

306 19.4 6 232.0 90.0 3210 17.2 78 1 amc concord

... ... ... ... ... ... ... ... ... ...

148 44.0 4 97.0 52.0 2130 24.6 82 2 vw pickup

338 29.0 4 90.0 70.0 1937 14.2 76 2 vw rabbit

72 41.5 4 98.0 76.0 2144 14.7 80 2 vw rabbit

79 44.3 4 90.0 48.0 2085 21.7 80 2 vw rabbit c (diesel)

205 31.9 4 89.0 71.0 1925 14.0 79 2 vw rabbit custom


print("The first car is: {}".format(df['name'].iloc[0]))

The first car is: amc ambassador brougham

print("The first car is: {}".format(df['name'].loc[0]))

#loc gets rows (or columns) with particular labels from the index.
#iloc gets rows (or columns) at particular positions in the index (so it only takes integers).


The first car is: datsun 710
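
A tiny self-contained illustration of the difference (a hypothetical three-row frame, not part of the original notebook):

import pandas as pd

s = pd.DataFrame({'name': ['b', 'c', 'a']}).sort_values(by='name')
print(s['name'].iloc[0])  # 'a' -> first row by position after sorting
print(s['name'].loc[0])   # 'b' -> row whose original index label is 0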

Saving a Dataframe


The following code performs a shuffle and then saves a new copy.

import os
import pandas as pd
import numpy as np

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
filename_write = os.path.join(path, "auto-mpg-shuffle.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write, index=False)  # index=False so row numbers are not written to the file
print("Done")

Done

Dropping Fields


Some fields are of no value to the neural network and can be dropped. The following code removes the name column from the MPG
dataset.

import os
import pandas as pd
import numpy as np

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

print("Before drop: {}".format(df.columns))


df.drop('name', axis=1, inplace=True)
print("After drop: {}".format(df.columns))
df[0:5]

Before drop: Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',


'acceleration', 'year', 'origin', 'name'],
dtype='object')
After drop: Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'year', 'origin'],
dtype='object')
mpg cylinders displacement horsepower weight acceleration year origin

0 18.0 8 307.0 130.0 3504 12.0 70 1

1 15.0 8 350.0 165.0 3693 11.5 70 1

2 18.0 8 318.0 150.0 3436 11.0 70 1

3 16.0 8 304.0 150.0 3433 12.0 70 1

Calculated Fields


It is possible to add new fields to the dataframe that are calculated from the other fields. We can create a new column that gives the
weight in kilograms. The equation to calculate a metric weight, given a weight in pounds is:

m(kg) = m(lb) × 0.45359237

This can be used with the following Python code:

import os
import pandas as pd
import numpy as np

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df.insert(1,'weight_kg',(df['weight']*0.45359237).astype(int))
df

mpg weight_kg cylinders displacement horsepower weight acceleration year origin name

0 18.0 1589 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu

1 15.0 1675 8 350.0 165.0 3693 11.5 70 1 buick skylark 320

2 18.0 1558 8 318.0 150.0 3436 11.0 70 1 plymouth satellite

3 16.0 1557 8 304.0 150.0 3433 12.0 70 1 amc rebel sst

4 17.0 1564 8 302.0 140.0 3449 10.5 70 1 ford torino

... ... ... ... ... ... ... ... ... ... ...

393 27.0 1265 4 140.0 86.0 2790 15.6 82 1 ford mustang gl

394 44.0 966 4 97.0 52.0 2130 24.6 82 2 vw pickup

395 32.0 1040 4 135.0 84.0 2295 11.6 82 1 dodge rampage

396 28.0 1190 4 120.0 79.0 2625 18.6 82 1 ford ranger

397 31.0 1233 4 119.0 82.0 2720 19.4 82 1 chevy s-10


Feature Normalization


A normalization puts numbers into a standard form so that two values can easily be compared. One very common machine learning normalization is the Z-score:

$$z = \frac{x - \mu}{\sigma}$$

To calculate the Z-score you also need the mean ($\mu$) and the standard deviation ($\sigma$). The mean is calculated as follows:

$$\mu = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The standard deviation is calculated as follows:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}, \quad \text{where } \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The following Python code replaces the mpg column with its z-score. Cars with average MPG will be near zero, above zero is above average, and below zero is below average. Z-scores above 3 or below -3 are very rare; these are outliers.

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df['mpg'] = zscore(df['mpg'])
df


mpg cylinders displacement horsepower weight acceleration year origin name

0 -0.706439 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu

1 -1.090751 8 350.0 165.0 3693 11.5 70 1 buick skylark 320

2 -0.706439 8 318.0 150.0 3436 11.0 70 1 plymouth satellite

3 -0.962647 8 304.0 150.0 3433 12.0 70 1 amc rebel sst

4 -0.834543 8 302.0 140.0 3449 10.5 70 1 ford torino

... ... ... ... ... ... ... ... ... ...

393 0.446497 4 140.0 86.0 2790 15.6 82 1 ford mustang gl

394 2.624265 4 97.0 52.0 2130 24.6 82 2 vw pickup

395 1.087017 4 135.0 84.0 2295 11.6 82 1 dodge rampage

396 0.574601 4 120.0 79.0 2625 18.6 82 1 ford ranger

397 0.958913 4 119.0 82.0 2720 19.4 82 1 chevy s-10


Missing Values


You can also simply drop any rows with any NA values. Another common practice is to replace missing values with the median value for
that column. The following code replaces any NA values in horsepower with the median:

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)

# df = df.dropna() # you can also simply drop NA values

Concatenating Rows and Columns


Rows and columns can be concatenated together to form new data frames.

# Create a new dataframe from name and horsepower

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower],axis=1)
result


name horsepower

0 chevrolet chevelle malibu 130.0

1 buick skylark 320 165.0

2 plymouth satellite 150.0

3 amc rebel sst 150.0

4 ford torino 140.0

... ... ...

393 ford mustang gl 86.0

394 vw pickup 52.0

395 dodge rampage 84.0

396 ford ranger 79.0

397 chevy s-10 82.0


# Create a new dataframe from name and horsepower, but this time by row

import os
import pandas as pd
import numpy as np
from scipy.stats import zscore

path = "/data"  # auto-mpg.csv is available to download from Canvas under Files --> Lab Help

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower])
result

0 chevrolet chevelle malibu

1 buick skylark 320

2 plymouth satellite

3 amc rebel sst

4 ford torino

... ...

393 86.0

394 52.0

395 84.0

396 79.0

397 82.0

796 rows × 1 columns

Helpful Functions for TensorFlow (Little Gems)


The following functions will be used with TensorFlow to help preprocess the data.

They allow you to build the feature vector in the format that TensorFlow expects from raw data.
(1) Encoding data:

encode_text_dummy - Encode text fields as numeric dummy variables, such as the iris species as a single field for each class. Three classes would become "0,0,1", "0,1,0", and "1,0,0". Encode non-target features this way; used when the data is part of the input (one-hot encoding).
encode_text_index - Encode text fields to numeric indexes, such as the iris species as a single numeric field taking the values "0", "1", and "2". Encode the target field of a classification this way; used when the data is part of the output (label encoding).

(2) Normalizing data:

encode_numeric_zscore - Encode numeric values as a z-score. Neural networks deal well with "normalized" fields only.
encode_numeric_range - Encode a column to a range between the given normalized_low and normalized_high.

(3) Dealing with missing data:

missing_median - Fill all missing values with the median value.

(4) Removing outliers:

remove_outliers - Remove outliers in a certain column with a value beyond X times SD

(5) Creating the feature vector and target vector that TensorFlow needs:

to_xy - Once all fields are encoded to numeric, this function can provide the x and y matrices that TensorFlow needs to fit the neural network with data.

(6) Other utility functions:

hms_string - Print out an elapsed time string.


chart_regression - Display a chart to show how well a regression performs.

import collections.abc
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os

# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to indexes (i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.
    target_type = df[target].dtypes
    target_type = target_type[0] if isinstance(target_type, collections.abc.Sequence) else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[target].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    a = plt.plot(t['y'].tolist(), label='expected')
    b = plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)

# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    if data_low is None:
        data_low = min(df[name])
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) * (normalized_high - normalized_low) + normalized_low

Examples of label encoding, one-hot encoding, and creating X/Y for TensorFlow

df = pd.read_csv("/data/iris.csv", na_values=['NA', '?'])  # iris.csv is available on Canvas under Files --> Lab Help

encode_text_index(df,"species") # label encoding
df

sepal_l sepal_w petal_l petal_w species

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2

149 5.9 3.0 5.1 1.8 2


df=pd.read_csv("/data/iris.csv",na_values=['NA','?']) #The file is available to download from canvas in path Files --> Lab Help -

encode_text_dummy(df,"species") # One hot encoding


df

sepal_l sepal_w petal_l petal_w species-Iris-setosa species-Iris-versicolor species-Iris-virginica

0 5.1 3.5 1.4 0.2 True False False

1 4.9 3.0 1.4 0.2 True False False

2 4.7 3.2 1.3 0.2 True False False

3 4.6 3.1 1.5 0.2 True False False

4 5.0 3.6 1.4 0.2 True False False

... ... ... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 False False True

146 6.3 2.5 5.0 1.9 False False True

147 6.5 3.0 5.2 2.0 False False True

148 6.2 3.4 5.4 2.3 False False True

149 5.9 3.0 5.1 1.8 False False True


Make sure you encode the labels first before you call to_xy()

df = pd.read_csv("/data/iris.csv", na_values=['NA', '?'])

encode_text_index(df,"species") # encoding first before you call to_xy()

df


sepal_l sepal_w petal_l petal_w species

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2

149 5.9 3.0 5.1 1.8 2


x,y = to_xy(df,"species")

Displaying x and y shows that x is a 150 x 4 float32 array holding the four feature columns and y is a 150 x 3 float32 array holding the one-hot encoded species labels. (The lengthy array printouts are condensed here; x ends with the row [5.9, 3., 5.1, 1.8] and y ends with rows of [0., 0., 1.].)


Example of Dealing with Missing Values and Outliers


path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina

filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])

# Handle mising values in horsepower


missing_median(df, 'horsepower')
#df.drop('name', 1,inplace=True)

# Drop outliers in horsepower


print("Length before MPG outliers dropped: {}".format(len(df)))
remove_outliers(df,'mpg',2)
print("Length after MPG outliers dropped: {}".format(len(df)))

Length before MPG outliers dropped: 398


Length after MPG outliers dropped: 388

Training and Validation


The machine learning model will learn from the training data, but ultimately be evaluated on the validation data.

Training Data - In-Sample Data - The data that the machine learning model was fit to/created from.
Validation Data - Out-of-Sample Data - The data that the machine learning model is evaluated on after it is fit to the training data.

There are two predominant means of dealing with training and validation data:

Training/Test Split - The data are split according to some ratio between a training and a validation (hold-out) set. Common ratios are 80% training and 20% validation.
K-Fold Cross Validation - The data are split into a number of folds, and one model is trained per fold. Because a number of models equal to the number of folds is created, out-of-sample predictions can be generated for the entire dataset (see the sketch below).
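
K-fold cross validation is only described in this notebook, not demonstrated; a minimal sketch using scikit-learn's KFold (the 5 folds and the logistic regression model are assumptions) follows:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_iris, y_iris = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_iris)):
    # Each fold trains on 4/5 of the data and validates on the held-out 1/5
    model = LogisticRegression(max_iter=1000)
    model.fit(X_iris[train_idx], y_iris[train_idx])
    print("Fold {}: accuracy = {:.3f}".format(fold, model.score(X_iris[test_idx], y_iris[test_idx])))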

Training/Test Split


The code below performs a split of the iris data into a training and validation set. The training set uses 75% of the data and the test (validation) set uses the remaining 25%. The model is fit to the training portion and then validated against the held-out portion.

import pandas as pd
import io
import numpy as np
import os
from sklearn.model_selection import train_test_split

path = "/data"  # iris.csv is available to download from Canvas under Files --> Lab Help

filename = os.path.join(path, "iris.csv")
df = pd.read_csv(filename, na_values=['NA', '?'])

df[0:5]

sepal_l sepal_w petal_l petal_w species

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df['encoded_species'] = le.fit_transform(df['species'])

df[0:5]

sepal_l sepal_w petal_l petal_w species encoded_species

0 5.1 3.5 1.4 0.2 Iris-setosa 0

1 4.9 3.0 1.4 0.2 Iris-setosa 0

2 4.7 3.2 1.3 0.2 Iris-setosa 0

3 4.6 3.1 1.5 0.2 Iris-setosa 0

# Split into train/test

x_train, x_test, y_train, y_test = train_test_split(df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']],
                                                    df['encoded_species'],
                                                    test_size=0.25)  # 25% held out -> the 112/38 shapes below

x_train.shape

(112, 4)

y_train.shape

(112,)

x_test.shape

(38, 4)

y_test.shape

(38,)
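
The split arrays can now be fed to any estimator. The notebook itself does not fit a model at this point; a minimal sketch (the decision tree is an assumption) would be:

from sklearn.tree import DecisionTreeClassifier

# Fit on the training portion, evaluate on the held-out portion
clf = DecisionTreeClassifier(random_state=42)
clf.fit(x_train, y_train)
print("Validation accuracy: {:.3f}".format(clf.score(x_test, y_test)))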

Aggregation

Data aggregation is a preprocessing task where the values of two or more objects are combined into a single object. The motivation for aggregation includes (1) reducing the size of the data to be processed, (2) changing the granularity of analysis (from fine-scale to coarser-scale), and (3) improving the stability of the data.

In the example below, we will use the daily precipitation time series data for a weather station located at Detroit Metro Airport. The raw
data was obtained from the Climate Data Online website (https://www.ncdc.noaa.gov/cdo-web/). The daily precipitation time series will be
compared against its monthly values.

Code:

The code below will load the precipitation time series data and draw a line plot of its daily time series.

daily = pd.read_csv('/data/DTW_prec.csv', header='infer')  # DTW_prec.csv is available on Canvas under Files --> Lab Help
daily.index = pd.to_datetime(daily['DATE'])
daily = daily['PRCP']
ax = daily.plot(kind='line', figsize=(15,3))
ax.set_title('Daily Precipitation (variance = %.4f)' % (daily.var()))

Text(0.5, 1.0, 'Daily Precipitation (variance = 0.0530)')

daily

DATE
2001-01-01 0.00
2001-01-02 0.00
2001-01-03 0.00
2001-01-04 0.04
2001-01-05 0.14
...
2017-12-27 0.00
2017-12-28 0.00
2017-12-29 0.00
2017-12-30 0.00
2017-12-31 0.00
Name: PRCP, Length: 6191, dtype: float64

Observe that the daily time series appears quite chaotic, varying significantly from one time step to the next. The series can be grouped and aggregated by month to obtain total monthly precipitation values. The resulting monthly series varies more smoothly than the daily series.

Code:

monthly = daily.groupby(pd.Grouper(freq='M')).sum()
ax = monthly.plot(kind='line',figsize=(15,3))
ax.set_title('Monthly Precipitation (variance = %.4f)' % (monthly.var()))


<ipython-input-60-5ae1711a1e83>:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME'
monthly = daily.groupby(pd.Grouper(freq='M')).sum()
Text(0.5, 1.0, 'Monthly Precipitation (variance = 2.4241)')

In the example below, the daily precipitation time series are grouped and aggregated by year to obtain the annual precipitation values.

Code:

annual = daily.groupby(pd.Grouper(freq='Y')).sum()
ax = annual.plot(kind='line',figsize=(15,3))
ax.set_title('Annual Precipitation (variance = %.4f)' % (annual.var()))

<ipython-input-61-46ed05734b2b>:1: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE'
annual = daily.groupby(pd.Grouper(freq='Y')).sum()
Text(0.5, 1.0, 'Annual Precipitation (variance = 23.6997)')

Sampling
Sampling is an approach commonly used to facilitate (1) data reduction for exploratory data analysis and scaling up algorithms to big data
applications and (2) quantifying uncertainties due to varying data distributions. There are various methods available for data sampling,
such as sampling without replacement, where each selected instance is removed from the dataset, and sampling with replacement, where
each selected instance is not removed, thus allowing it to be selected more than once in the sample.

In the example below, we will apply sampling with replacement and without replacement to the breast cancer dataset obtained from the
UCI machine learning repository.

Code:

We initially display the first five records of the table.

data.head()


   Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
   Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
0  5          1             1              1         2              1       3          1         1        2
1  5          4             4              5         7              10      3          2         1        2
2  3          1             1              1         2              2       3          1         1        2
3  6          8             8              1         3              4       3          7         1        2

In the following code, a sample of size 3 is randomly selected (without replacement) from the original data.

Code:

sample = data.sample(n=3)
sample

     Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
     Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
505  3          3             1              1         2              1       1          1         1        2
595  5          1             1              1         2              1       2          1         1        2

In the next example, we randomly select 1% of the data (without replacement) and display the selected samples. The random_state
argument of the function specifies the seed value of the random number generator.

Code:

sample = data.sample(frac=0.01, random_state=1)


sample

     Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
     Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
584  5          1             1              6         3              1       1          1         1        2
417  1          1             1              1         2              1       2          1         1        2
606  4          1             1              2         2              1       1          1         1        2
349  4          2             3              5         3              8       7          6         1        4
134  3          1             1              1         3              1       2          1         1        2
502  4          1             1              2         2              1       2          1         1        2

Finally, we perform a sampling with replacement to create a sample whose size is equal to 1% of the entire data. You should be able to
observe duplicate instances in the sample by increasing the sample size.

Code:

sample = data.sample(frac=0.01, replace=True, random_state=1)
sample

     Clump      Uniformity    Uniformity     Marginal  Single Epith.  Bare    Bland      Normal    Mitoses  Class
     Thickness  of Cell Size  of Cell Shape  Adhesion  Cell Size      Nuclei  Chromatin  Nucleoli
37   6          2             1              1         1              1       7          1         1        2
235  3          1             4              1         2              NaN     3          1         1        2
72   1          3             3              2         2              1       7          2         1        2
645  3          1             1              1         2              1       2          1         1        2
144  2          1             1              1         2              1       2          1         1        2
129  1          1             1              1         10             1       1          1         1        2
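As a quick check of the claim above (a hypothetical larger sample, not part of the original notebook), counting repeated index labels reveals the duplicates that sampling with replacement produces:

bigger = data.sample(frac=0.2, replace=True, random_state=1)
print('Duplicated rows in sample = %d' % bigger.index.duplicated().sum())
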

Discretization
Discretization is a data preprocessing step that is often used to transform a continuous-valued attribute to a categorical attribute. The
example below illustrates two simple but widely-used unsupervised discretization methods (equal width and equal depth) applied to the
'Clump Thickness' attribute of the breast cancer dataset.

First, we plot a histogram that shows the distribution of the attribute values. The value_counts() function can also be applied to count the
frequency of each attribute value.

Code:

data['Clump Thickness'].hist(bins=10)
data['Clump Thickness'].value_counts(sort=False)


Clump Thickness
5     130
3     108
6      34
4      80
8      46
1     145
2      50
7      23
10     69
9      14
Name: count, dtype: int64

For the equal width method, we can apply the cut() function to discretize the attribute into 4 bins of similar interval widths. The
value_counts() function can be used to determine the number of instances in each bin.

Code:

bins = pd.cut(data['Clump Thickness'],4)


bins.value_counts(sort=False)

Clump Thickness
(0.991, 3.25]    303
(3.25, 5.5]      210
(5.5, 7.75]       57
(7.75, 10.0]     129
Name: count, dtype: int64

For the equal frequency method, the qcut() function can be used to partition the values into 4 bins such that each bin has nearly the same
number of instances.

Code:

bins = pd.qcut(data['Clump Thickness'],4)


bins.value_counts(sort=False)

Clump Thickness
(0.999, 2.0]    195
(2.0, 4.0]      188
(4.0, 6.0]      164
(6.0, 10.0]     152
Name: count, dtype: int64

Principal Component Analysis


Principal component analysis (PCA) is a classical method for reducing the number of attributes in the data by projecting the data from its
original high-dimensional space into a lower-dimensional space. The new attributes (also known as components) created by PCA have the
following properties: (1) they are linear combinations of the original attributes, (2) they are orthogonal (perpendicular) to each other, and
(3) they capture the maximum amount of variation in the data.

The example below illustrates the application of PCA to an image dataset. There are 16 RGB files, each of which has a size of 111 x 111 pixels. The example code below will read each image file and flatten the RGB image into 111 x 111 x 3 = 36963 feature values, creating a data matrix of size 16 x 36963.

Code:

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

numImages = 16
fig = plt.figure(figsize=(7,7))
imgData = np.zeros(shape=(numImages,36963))

for i in range(1, numImages+1):
    filename = '/data/pics/Picture'+str(i)+'.jpg'  # the pics folder is available on Canvas under Files --> Lab Help
    img = mpimg.imread(filename)
    ax = fig.add_subplot(4, 4, i)
    plt.imshow(img)
    plt.axis('off')
    ax.set_title(str(i))
    imgData[i-1] = np.array(img.flatten()).reshape(1, img.shape[0]*img.shape[1]*img.shape[2])


Using PCA, the data matrix is projected to its first two principal components. The projected values of the original image data are stored in
a pandas DataFrame object named projected.

Code:

import pandas as pd
from sklearn.decomposition import PCA

numComponents = 2
pca = PCA(n_components=numComponents)
pca.fit(imgData)

projected = pca.transform(imgData)
projected = pd.DataFrame(projected,columns=['pc1','pc2'],index=range(1,numImages+1))
projected['food'] = ['burger', 'burger','burger','burger','drink','drink','drink','drink',
'pasta', 'pasta', 'pasta', 'pasta', 'chicken', 'chicken', 'chicken', 'chicken']
projected


pc1 pc2 food

1 1592.891398 -6650.655315 burger

2 513.010062 -6333.930780 burger

3 -963.256334 -7210.464745 burger

4 -2165.075382 -9038.865901 burger

5 7842.472601 1064.496270 drink

6 8458.912039 5385.049711 drink

7 11181.792685 5359.952065 drink

8 6831.001348 -1129.478069 drink

9 -7639.881092 5060.785108 pasta

10 704.458513 532.387235 pasta

11 -7237.636932 5287.195103 pasta

12 -4426.758058 4630.679800 pasta

13 -11866.492500 -1519.390075 chicken

14 -73.978096 -1380.857789 chicken

15 7510.615629 1188.363637 chicken

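It can also be instructive to check how much of the total variation the two retained components capture; scikit-learn exposes this on the fitted PCA object:

print(pca.explained_variance_ratio_)  # fraction of the total variance captured by pc1 and pc2
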

Finally, we draw a scatter plot to display the projected values. Observe that the images of burgers, drinks, and pasta each project to their own region of the plot. The images of fried chicken (shown as black squares in the diagram), however, are harder to discriminate.

Code:

import matplotlib.pyplot as plt

colors = {'burger':'b', 'drink':'r', 'pasta':'g', 'chicken':'k'}


markerTypes = {'burger':'+', 'drink':'x', 'pasta':'o', 'chicken':'s'}

for foodType in markerTypes:
    d = projected[projected['food']==foodType]
    plt.scatter(d['pc1'], d['pc2'], c=colors[foodType], s=60, marker=markerTypes[foodType])


Feature Selection Techniques


The goal of feature selection techniques in machine learning is to find the best set of features that allows one to build optimized models of
studied phenomena.

Information Gain

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv('/data/Admission_Predict_Ver1.1.csv')  # the file is available on Canvas under Files --> Lab Help
X = df.iloc[:,1:-1]
Y = df.iloc[:,-1]
importances = mutual_info_regression(X, Y)
feat_importances = pd.Series(importances,X.columns)
feat_importances.plot(kind='barh',color = 'teal')
plt.show()

Chi-square Test


The Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with the best Chi-square scores. In order to correctly apply the Chi-square test to the relation between various features in the dataset and the target variable, the following conditions have to be met: the variables have to be categorical and sampled independently, and the values should have an expected frequency greater than 5.

from sklearn.feature_selection import SelectKBest


from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder

X_Cat = X.astype(int)
le = LabelEncoder()
chi2_features = SelectKBest(chi2, k = 3)
Y = le.fit_transform(Y)
X_kbest_features = chi2_features.fit_transform(X_Cat,Y)

print('Original feature number:', X_Cat.shape[1])


print('Reduced feature number: ',X_kbest_features.shape[1])


Original feature number: 7


Reduced feature number: 3
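
To see which three features survived the selection, get_support() on the fitted selector can be mapped back to the column names:

selected = X_Cat.columns[chi2_features.get_support()]
print('Selected features:', list(selected))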

Correlation Coefficient


Correlation is a measure of the linear relationship between 2 or more variables. Through correlation, we can predict one variable from the
other. The logic behind using correlation for feature selection is that good variables correlate highly with the target. Furthermore, variables
should be correlated with the target but uncorrelated among themselves.

If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only needs one, as
the second does not add additional information. We will use the Pearson Correlation here.

import seaborn as sns


import matplotlib.pyplot as plt
%matplotlib inline

cor = df.corr()

plt.figure(figsize=(10,6))
sns.heatmap(cor,annot= True)

<Axes: >

Variance Threshold


The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that this does not take the relationship between feature variables, or between feature and target variables, into account, which is one of the drawbacks of filter methods.

get_support() returns a Boolean vector where True means the variable does not have zero variance.

from sklearn.feature_selection import VarianceThreshold

# X still holds the feature columns created in the Information Gain example above
v_threshold = VarianceThreshold(threshold=0)
v_threshold.fit(X)
v_threshold.get_support()

array([ True, True, True, True, True, True, True])

Mean Absolute Difference (MAD)


‘The mean absolute difference (MAD) computes the absolute difference from the mean value. The main difference between the variance
and MAD measures is the absence of the square in the latter. The MAD, like the variance, is also a scaled variant.’ This means that the
higher the MAD, the higher the discriminatory power.

mean_abs_diff = np.sum(np.abs(X -np.mean(X, axis=0)), axis=0)/X.shape[0]

plt.bar(np.arange(X.shape[1]),mean_abs_diff,color = 'teal')

<BarContainer object of 7 artists>

Backward Elimination Techniques


Importing required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing as prepr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score

Import the dataset and make the necessary changes


df = pd.read_csv('/data/Admission_Predict_Ver1.1.csv')  # the file is available on Canvas under Files --> Lab Help
print('Shape:{}'.format(df.shape))
print(df.head(10))
print(df.describe())
df.columns = [x.replace(' ', '').replace('.', '').lower() for x in list(df)]  # convert column names to lowercase single words
del df['serialno']

Shape:(500, 9)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \
0 1 337 118 4 4.5 4.5 9.65
1 2 324 107 4 4.0 4.5 8.87
2 3 316 104 3 3.0 3.5 8.00
3 4 322 110 3 3.5 2.5 8.67
4 5 314 103 2 2.0 3.0 8.21
5 6 330 115 5 4.5 3.0 9.34
6 7 321 109 3 3.0 4.0 8.20
7 8 308 101 2 3.0 4.0 7.90
8 9 302 102 1 2.0 1.5 8.00
9 10 323 108 3 3.5 3.0 8.60

Research Chance of Admit


0 1 0.92
1 1 0.76
2 1 0.72
3 1 0.80
4 0 0.65
5 1 0.90
6 1 0.75
7 0 0.68
8 0 0.50
9 0 0.45
Serial No. GRE Score TOEFL Score University Rating SOP \
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 250.500000 316.472000 107.192000 3.114000 3.374000
std 144.481833 11.295148 6.081868 1.143512 0.991004
min 1.000000 290.000000 92.000000 1.000000 1.000000
25% 125.750000 308.000000 103.000000 2.000000 2.500000
50% 250.500000 317.000000 107.000000 3.000000 3.500000
75% 375.250000 325.000000 112.000000 4.000000 4.000000
max 500.000000 340.000000 120.000000 5.000000 5.000000

LOR CGPA Research Chance of Admit


count 500.00000 500.000000 500.000000 500.00000
mean 3.48400 8.576440 0.560000 0.72174
std 0.92545 0.604813 0.496884 0.14114
min 1.00000 6.800000 0.000000 0.34000
25% 3.00000 8.127500 0.000000 0.63000
50% 3.50000 8.560000 1.000000 0.72000
75% 4.00000 9.040000 1.000000 0.82000
max 5.00000 9.920000 1.000000 0.97000

df.boxplot(showbox=True, figsize=(10,8))


<Axes: >

Normalizing the entire dataset using Min-Max scaling


cols = list(df)
scaler = prepr.MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=cols)
print(scaled_df.head(10))
print(scaled_df.describe())

grescore toeflscore universityrating sop lor cgpa research \


0 0.94 0.928571 0.75 0.875 0.875 0.913462 1.0
1 0.68 0.535714 0.75 0.750 0.875 0.663462 1.0
2 0.52 0.428571 0.50 0.500 0.625 0.384615 1.0
3 0.64 0.642857 0.50 0.625 0.375 0.599359 1.0
4 0.48 0.392857 0.25 0.250 0.500 0.451923 0.0
5 0.80 0.821429 1.00 0.875 0.500 0.814103 1.0
6 0.62 0.607143 0.50 0.500 0.750 0.448718 1.0
7 0.36 0.321429 0.25 0.500 0.750 0.352564 0.0
8 0.24 0.357143 0.00 0.250 0.125 0.384615 0.0
9 0.66 0.571429 0.50 0.625 0.500 0.576923 0.0

chanceofadmit
0 0.920635
1 0.666667
2 0.603175
3 0.730159
4 0.492063
5 0.888889
6 0.650794
7 0.539683
8 0.253968
9 0.174603
grescore toeflscore universityrating sop lor \
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.529440 0.542571 0.528500 0.593500 0.621000
std 0.225903 0.217210 0.285878 0.247751 0.231362
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.360000 0.392857 0.250000 0.375000 0.500000
50% 0.540000 0.535714 0.500000 0.625000 0.625000
75% 0.700000 0.714286 0.750000 0.750000 0.750000
max 1.000000 1.000000 1.000000 1.000000 1.000000

cgpa research chanceofadmit


count 500.000000 500.000000 500.000000
mean 0.569372 0.560000 0.605937
std 0.193850 0.496884 0.224032
min 0.000000 0.000000 0.000000
25% 0.425481 0.000000 0.460317
50% 0.564103 1.000000 0.603175
75% 0.717949 1.000000 0.761905
max 1.000000 1.000000 1.000000

Pairplot

This is a scatter plot of all the attributes against each other on the x and y axes. 'chanceofadmit' is the dependent variable, and the plot clearly shows that a few attributes have a linear relation with it, so we can move ahead with linear regression.

sns.pairplot(scaled_df)


<seaborn.axisgrid.PairGrid at 0x7fb49256cc50>

Creating X and Y for the linear regression equation ($Y = a + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$), where Y is the dependent variable and the X's are the independent variables

cols = list(scaled_df)
X = scaled_df.iloc[:, :-1]
y = scaled_df[cols[-1]]

Using the statsmodels Linear Regression Model

OLS is the function used to create a linear model. This model defines the linear regression formula as $Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$, where $X_0$ is identically 1 so that $\beta_0$ plays the role of the intercept a.

So we create a new attribute with all ones and append it to the start of the dataframe; this will serve as X0.
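
The notebook is cut off at this point. A minimal sketch of how these remaining steps are commonly completed is shown below; the sm.add_constant call, the elimination loop, and the 0.05 significance threshold are all sketch assumptions, not taken from the original:

import statsmodels.api as sm

X_ols = sm.add_constant(X)  # prepends the all-ones 'const' column that serves as X0
kept = list(X_ols.columns)

# Repeatedly fit OLS and drop the predictor with the highest p-value above 0.05
while True:
    model = sm.OLS(y, X_ols[kept]).fit()
    pvals = model.pvalues.drop('const')
    if pvals.empty or pvals.max() <= 0.05:
        break
    kept.remove(pvals.idxmax())

print('Remaining features:', [c for c in kept if c != 'const'])
print(model.summary())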

