Tutorial 4: Data Preprocessing (CSC177, CSUS)
Code:
import pandas as pd
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)  # header=None assumed here, since column names are assigned on the next line
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
'Normal Nucleoli', 'Mitoses','Class']
(columns: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class)
0  5  1  1  1  2   1  3  1  1  2
1  5  4  4  5  7  10  3  2  1  2
2  3  1  1  1  2   2  3  1  1  2
3  6  8  8  1  3   4  3  7  1  2
Code:
import numpy as np
data = data.replace('?', np.nan)
Observe that only the 'Bare Nuclei' column contains missing values. In the following example, the missing values in the 'Bare Nuclei'
column are replaced by the median value of that column. The values before and after replacement are shown for a subset of the data
points.
Code:
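A minimal sketch of this step, assuming the column is first converted to numeric so the median can be computed; the printed subset is illustrative:
data2 = data['Bare Nuclei'].astype(float)   # convert the string values to numbers
print('Before replacement:')
print(data2[20:25])
data2 = data2.fillna(data2.median())        # replace NaN with the column median
print('After replacement:')
print(data2[20:25])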
Instead of replacing the missing values, another common approach is to discard the data points that contain missing values. This can be
easily accomplished by applying the dropna() function to the data frame.
Code:
data2 = data.dropna()
print('Number of rows after discarding missing values = %d' % (data2.shape[0]))
Outliers
Outliers are data instances with characteristics that are considerably different from the rest of the dataset. In the example code below, we
will draw a boxplot to identify the columns in the table that contain outliers. Note that the values in all columns (except for 'Bare Nuclei')
are originally stored as 'int64' whereas the values in the 'Bare Nuclei' column are stored as string objects (since the column initially
contains strings such as '?' for representing missing values). Thus, we must convert the column into numeric values first before creating
the boxplot. Otherwise, the column will not be displayed when drawing the boxplot.
Code:
%matplotlib inline
data2 = data.drop(['Class'],axis=1)
data2['Bare Nuclei'] = pd.to_numeric(data2['Bare Nuclei'])
data2.boxplot(figsize=(20,3))
<Axes: >
The boxplots suggest that only 5 of the columns (Marginal Adhesion, Single Epithelial Cell Size, Bland Chromatin, Normal Nucleoli, and
Mitoses) contain abnormally high values. To discard the outliers, we can compute the Z-score for each attribute and remove those
instances containing attributes with an abnormally high or low Z-score (e.g., if Z > 3 or Z <= -3).
Code:
The following code shows the results of standardizing the columns of the data. Note that missing values (NaN) are not affected by the
standardization process.
Z = (data2-data2.mean())/data2.std()
Z[20:25]
(output: rows 20-24 of Z, with columns Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses)
Code:
The following code shows the results of discarding columns with Z > 3 or Z <= -3.
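A sketch of the discarding step: keep only the rows whose nine attribute Z-scores all fall in the interval (-3, 3]. Comparisons against NaN evaluate to False, so rows with missing values are dropped here as well.
Z2 = Z.loc[((Z > -3).sum(axis=1) == 9) & ((Z <= 3).sum(axis=1) == 9), :]
print('Number of rows before discarding outliers = %d' % (Z.shape[0]))
print('Number of rows after discarding outliers = %d' % (Z2.shape[0]))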
Code:
In the following example, we first check for duplicate instances in the breast cancer dataset.
dups = data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
data.loc[[11,28]]
(output: row 11 has attribute values 2 1 1 1 2 1 2 1 1 2; row 28 is identical)
The duplicated() function will return a Boolean array that indicates whether each row is a duplicate of a previous row in the table. The
results suggest there are 236 duplicate rows in the breast cancer dataset. For example, the instance with row index 11 has identical
attribute values as the instance with row index 28. Although such duplicate rows may correspond to samples for different individuals, in
this hypothetical example, we assume that the duplicates are samples taken from the same individual and illustrate below how to remove
the duplicated rows.
Code:
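A minimal sketch of the removal step using drop_duplicates(), which keeps the first occurrence of each row:
data = data.drop_duplicates()
print('Number of rows after discarding duplicates = %d' % (data.shape[0]))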
import os
import numpy as np
import pandas as pd
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read, na_values=['NA','?'])
#np.random.seed(30) # Uncomment this line to get the same shuffle each time
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
# (pass inplace=False instead if you want reset_index to return a new DataFrame)
df
df = df.sort_values(by='name',ascending=True)
df
#loc gets rows (or columns) with particular labels from the index.
#iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
import os
import pandas as pd
import numpy as np
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origin
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help -->
filename_write = os.path.join(path,"auto-mpg-shuffle.csv") #The file is available to download from canvas in path Files --> Lab
df = pd.read_csv(filename_read,na_values=['NA','?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write, index=False)  # specify index=False so the row numbers are not written
print("Done")
Done
import os
import pandas as pd
import numpy as np
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])
This can be used with the following Python code:
import os
import pandas as pd
import numpy as np
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])
df.insert(1,'weight_kg',(df['weight']*0.45359237).astype(int))
df
(output: the DataFrame now has columns mpg, weight_kg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and name)
To calculate the z-score you also need to calculate the mean (μ) and the standard deviation (σ). The mean is calculated as

μ = x̄ = (x₁ + x₂ + ⋯ + xₙ) / n

and the z-score of a value x is then z = (x − μ) / σ.

The following Python code replaces the mpg column with its z-score. Cars with average MPG will be near zero; values above zero are above average, and values below zero are below average. Z-scores above 3 or below −3 are very rare; such values are outliers.
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])
df['mpg'] = zscore(df['mpg'])
df
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origin
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help -->
df = pd.read_csv(filename_read,na_values=['NA','?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower],axis=1)
result
(output: a DataFrame with columns name and horsepower)
# Create a new dataframe from name and horsepower, but this time by row
import os
import pandas as pd
import numpy as np
from scipy.stats import zscore
path = "/data" #data file can be dowloaded from the link https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(origina
filename_read = os.path.join(path,"auto-mpg.csv") #The file is available to download from canvas in path Files --> Lab Help --> L
df = pd.read_csv(filename_read,na_values=['NA','?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name,col_horsepower])
result
(output excerpt: the two Series are stacked into a single column, the name values first, followed by the horsepower values)
2      plymouth satellite
4      ford torino
...    ...
393    86.0
394    52.0
395    84.0
396    79.0
397    82.0
These helper functions allow you to build the feature vector in the format that TensorFlow expects from raw data.
(1) Encoding data:
encode_text_dummy - Encode text fields as numeric, such as the iris species, with a single field for each class. Three classes would
become "0,0,1", "0,1,0", and "1,0,0". Encode non-target features this way; used when the data is part of the input (one-hot encoding).
encode_text_index - Encode text fields to numeric, such as the iris species, as a single numeric field with values "0", "1", and "2". Encode the
target field of a classification this way; used when the data is part of the output (label encoding).
encode_numeric_zscore - Encode numeric values as a z-score. Neural networks deal well only with "normalized" fields.
encode_numeric_range - Encode a column to a range between the given normalized_low and normalized_high.
(2) Creating the feature vector and target vector that TensorFlow needs:
to_xy - Once all fields are encoded to numeric, this function can provide the x and y matrices that TensorFlow needs to fit the neural
network with data.
import collections.abc
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os
# Encode a numeric column as z-scores (the start of this definition is cut off;
# the conventional form of this helper is reconstructed here)
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd
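# NOTE: the definitions of the other helpers called later in this tutorial are not
# visible here; conventional sketches of them are assumed below.

# One-hot encode a text column into one 0/1 column per class (non-target features)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        df[f"{name}-{x}"] = dummies[x].astype(int)
    df.drop(name, axis=1, inplace=True)

# Label-encode a text column to integers 0..k-1; returns the class names
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Simplified split of an all-numeric frame into the x matrix and y vector
def to_xy(df, target):
    x = df.drop(columns=[target]).values.astype(np.float32)
    y = df[target].values
    return x, y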
# Convert all missing values in the specified column to the median
def missing_median(df, name):
med = df[name].median()
df[name] = df[name].fillna(med)
# Regression chart.
def chart_regression(pred,y,sort=True):
t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
if sort:
t.sort_values(by=['y'],inplace=True)
a = plt.plot(t['y'].tolist(),label='expected')
b = plt.plot(t['pred'].tolist(),label='prediction')
plt.ylabel('output')
plt.legend()
plt.show()
# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
df.drop(drop_rows, axis=0, inplace=True)
Examples of label encoding, one hot encoding, and creating X/Y for TensorFlow
df=pd.read_csv("/data/iris.csv",na_values=['NA','?']) #The file is available to download from canvas in path Files --> Lab Help
encode_text_index(df,"species") # label encoding
df
df=pd.read_csv("/data/iris.csv",na_values=['NA','?']) #The file is available to download from canvas in path Files --> Lab Help -
Make sure you encode the labels first before you call to_xy()
df=pd.read_csv("/data/iris.csv",na_values=['NA','?']) #The file is available to download from canvas in path Files --> Lab Help -
df
x,y = to_xy(df,"species")
(output excerpt: the tail of the returned x matrix)
       [6.3, 3.4, 5.6, 2.4],
       [6.4, 3.1, 5.5, 1.8],
       [6. , 3. , 4.8, 1.8],
       [6.9, 3.1, 5.4, 2.1],
       [6.7, 3.1, 5.6, 2.4],
       [6.9, 3.1, 5.1, 2.3],
       [5.8, 2.7, 5.1, 1.9],
       [6.8, 3.2, 5.9, 2.3],
       [6.7, 3.3, 5.7, 2.5],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]], dtype=float32)
filename_read = os.path.join(path, "auto-mpg.csv")  # available to download from Canvas under Files --> Lab Help --> Labs
df = pd.read_csv(filename_read,na_values=['NA','?'])
Training Data - In Sample Data - The data that the machine learning model was fit to/created from.
Validation Data - Out of Sample Data - The data that the machine learning model is evaluated upon after it is fit to the training data.
There are two predominant means of dealing with training and validation data:
Training/Test Split - The data are split according to some ratio between a training and validation (hold-out) set. Common ratios are
80% training and 20% validation.
K-Fold Cross Validation - The data are split into a number of folds, and one model is trained per fold. Because a number of models
equal to the number of folds is created, out-of-sample predictions can be generated for the entire dataset.
The following image shows how a model is trained on 80% of the data and then validated against the remaining 20%.
import pandas as pd
import io
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn import preprocessing  # needed below for preprocessing.LabelEncoder
path = "/data"  # the data file can be downloaded from https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
filename = os.path.join(path, "iris.csv")  # available to download from Canvas under Files --> Lab Help --> Labs
df = pd.read_csv(filename, na_values=['NA','?'])
df[0:5]
le = preprocessing.LabelEncoder()
df['encoded_species'] = le.fit_transform(df['species'])
df[0:5]
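The call that produces the shapes below is not visible here; a sketch under the assumption of a 75/25 train/test split over the four measurement columns, with the encoded species as the target (the random_state value is illustrative):
X = df.drop(columns=['species', 'encoded_species'])
y = df['encoded_species']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)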
x_train.shape
(112, 4)
y_train.shape
(112,)
x_test.shape
(38, 4)
y_test.shape
(38,)
Aggregation
Data aggregation is a preprocessing task where the values of two or more objects are combined into a single object. The motivation for
aggregation includes (1) reducing the size of data to be processed, (2) changing the granularity of analysis (from fine-scale to
coarser-scale), and (3) improving the stability of the data.
In the example below, we will use the daily precipitation time series data for a weather station located at Detroit Metro Airport. The raw
data was obtained from the Climate Data Online website (https://www.ncdc.noaa.gov/cdo-web/). The daily precipitation time series will be
compared against its monthly values.
Code:
The code below will load the precipitation time series data and draw a line plot of its daily time series.
daily = pd.read_csv('/data/DTW_prec.csv', header='infer')  # available to download from Canvas under Files --> Lab Help --> Labs
daily.index = pd.to_datetime(daily['DATE'])
daily = daily['PRCP']
ax = daily.plot(kind='line',figsize=(15,3))
ax.set_title('Daily Precipitation (variance = %.4f)' % (daily.var()))
daily
DATE
2001-01-01 0.00
2001-01-02 0.00
2001-01-03 0.00
2001-01-04 0.04
2001-01-05 0.14
...
2017-12-27 0.00
2017-12-28 0.00
2017-12-29 0.00
2017-12-30 0.00
2017-12-31 0.00
Name: PRCP, Length: 6191, dtype: float64
Observe that the daily time series appears to be quite chaotic and varies significantly from one time step to another. The time series can be
grouped and aggregated by month to obtain the total monthly precipitation values. The resulting time series appears to vary more
smoothly compared to the daily time series.
Code:
monthly = daily.groupby(pd.Grouper(freq='M')).sum()
ax = monthly.plot(kind='line',figsize=(15,3))
ax.set_title('Monthly Precipitation (variance = %.4f)' % (monthly.var()))
<ipython-input-60-5ae1711a1e83>:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME'
monthly = daily.groupby(pd.Grouper(freq='M')).sum()
Text(0.5, 1.0, 'Monthly Precipitation (variance = 2.4241)')
In the example below, the daily precipitation time series are grouped and aggregated by year to obtain the annual precipitation values.
Code:
annual = daily.groupby(pd.Grouper(freq='Y')).sum()
ax = annual.plot(kind='line',figsize=(15,3))
ax.set_title('Annual Precipitation (variance = %.4f)' % (annual.var()))
<ipython-input-61-46ed05734b2b>:1: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE'
annual = daily.groupby(pd.Grouper(freq='Y')).sum()
Text(0.5, 1.0, 'Annual Precipitation (variance = 23.6997)')
Sampling
Sampling is an approach commonly used to facilitate (1) data reduction for exploratory data analysis and scaling up algorithms to big data
applications and (2) quantifying uncertainties due to varying data distributions. There are various methods available for data sampling,
such as sampling without replacement, where each selected instance is removed from the dataset, and sampling with replacement, where
each selected instance is not removed, thus allowing it to be selected more than once in the sample.
In the example below, we will apply sampling with replacement and without replacement to the breast cancer dataset obtained from the
UCI machine learning repository.
Code:
data.head()
(columns: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class)
0  5  1  1  1  2   1  3  1  1  2
1  5  4  4  5  7  10  3  2  1  2
2  3  1  1  1  2   2  3  1  1  2
3  6  8  8  1  3   4  3  7  1  2
In the following code, a sample of size 3 is randomly selected (without replacement) from the original data.
Code:
sample = data.sample(n=3)
sample
(columns: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class)
505  3  3  1  1  2  1  1  1  1  2
595  5  1  1  1  2  1  2  1  1  2
In the next example, we randomly select 1% of the data (without replacement) and display the selected samples. The random_state
argument of the function specifies the seed value of the random number generator.
Code:
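A sketch of this step; sample() with frac selects the given fraction of rows, and random_state (the seed value shown here is illustrative) makes the selection reproducible:
sample = data.sample(frac=0.01, random_state=1)
sample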
(columns: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class)
584  5  1  1  6  3  1  1  1  1  2
417  1  1  1  1  2  1  2  1  1  2
606  4  1  1  2  2  1  1  1  1  2
349  4  2  3  5  3  8  7  6  1  4
134  3  1  1  1  3  1  2  1  1  2
502  4  1  1  2  2  1  2  1  1  2
Finally, we perform a sampling with replacement to create a sample whose size is equal to 1% of the entire data. You should be able to
observe duplicate instances in the sample by increasing the sample size.
Code:
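A sketch of sampling with replacement; replace=True allows an instance to be drawn more than once (the seed is again illustrative):
sample = data.sample(frac=0.01, replace=True, random_state=1)
sample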
(columns: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class)
37   6  2  1  1  1    1  7  1  1  2
235  3  1  4  1  2  NaN  3  1  1  2
72   1  3  3  2  2    1  7  2  1  2
645  3  1  1  1  2    1  2  1  1  2
144  2  1  1  1  2    1  2  1  1  2
129  1  1  1  1 10    1  1  1  1  2
Discretization
Discretization is a data preprocessing step that is often used to transform a continuous-valued attribute to a categorical attribute. The
example below illustrates two simple but widely-used unsupervised discretization methods (equal width and equal depth) applied to the
'Clump Thickness' attribute of the breast cancer dataset.
First, we plot a histogram that shows the distribution of the attribute values. The value_counts() function can also be applied to count the
frequency of each attribute value.
Code:
data['Clump Thickness'].hist(bins=10)
data['Clump Thickness'].value_counts(sort=False)
Clump Thickness  count
5     130
3     108
6      34
4      80
8      46
1     145
2      50
7      23
10     69
9      14
dtype: int64
For the equal width method, we can apply the cut() function to discretize the attribute into 4 bins of similar interval widths. The
value_counts() function can be used to determine the number of instances in each bin.
Code:
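A sketch using cut() to create 4 equal-width bins and count the instances per bin:
bins = pd.cut(data['Clump Thickness'], 4)
bins.value_counts(sort=False)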
(output excerpt: for example, the equal-width bin (5.5, 7.75] contains 57 instances)
For the equal frequency method, the qcut() function can be used to partition the values into 4 bins such that each bin has nearly the same
number of instances.
Code:
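A sketch using qcut() to create 4 bins with nearly equal numbers of instances:
bins = pd.qcut(data['Clump Thickness'], 4)
bins.value_counts(sort=False)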
The example below illustrates the application of PCA to an image dataset. There are 16 RGB files, each of which has a size of 111 x 111
pixels. The example code below will read each image file and convert the RGB image into 111 x 111 x 3 = 36963 feature values. This will
create a data matrix of size 16 x 36963.
Code:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
numImages = 16
fig = plt.figure(figsize=(7,7))
imgData = np.zeros(shape=(numImages,36963))
for i in range(1,numImages+1):
filename = '/data/pics/Picture' + str(i) + '.jpg'  # available to download from Canvas under Files --> Lab Help --> Labs
img = mpimg.imread(filename)
ax = fig.add_subplot(4,4,i)
plt.imshow(img)
plt.axis('off')
ax.set_title(str(i))
imgData[i-1] = np.array(img.flatten()).reshape(1,img.shape[0]*img.shape[1]*img.shape[2])
Using PCA, the data matrix is projected to its first two principal components. The projected values of the original image data are stored in
a pandas DataFrame object named projected.
Code:
import pandas as pd
from sklearn.decomposition import PCA
numComponents = 2
pca = PCA(n_components=numComponents)
pca.fit(imgData)
projected = pca.transform(imgData)
projected = pd.DataFrame(projected,columns=['pc1','pc2'],index=range(1,numImages+1))
projected['food'] = ['burger', 'burger','burger','burger','drink','drink','drink','drink',
'pasta', 'pasta', 'pasta', 'pasta', 'chicken', 'chicken', 'chicken', 'chicken']
projected
Finally, we draw a scatter plot to display the projected values. Observe that the images of burgers, drinks, and pastas are all projected to
the same region. However, the images for fried chicken (shown as black squares in the diagram) are harder to discriminate.
Code:
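The plotting code is not visible here; a sketch that draws each food type with its own marker (the marker styles are illustrative, using black squares for chicken to match the description above):
markers = {'burger': 'b+', 'drink': 'rx', 'pasta': 'g^', 'chicken': 'ks'}
plt.figure(figsize=(5,5))
for foodType, marker in markers.items():
    d = projected[projected['food'] == foodType]
    plt.plot(d['pc1'], d['pc2'], marker, label=foodType)
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.legend()
plt.show()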
Information Gain
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_selection import mutual_info_regression
df = pd.read_csv('/data/Admission_Predict_Ver1.1.csv')  # available to download from Canvas under Files --> Lab Help --> Labs
X = df.iloc[:,1:-1]
Y = df.iloc[:,-1]
importances = mutual_info_regression(X, Y)
feat_importances = pd.Series(importances,X.columns)
feat_importances.plot(kind='barh',color = 'teal')
plt.show()
from sklearn.preprocessing import LabelEncoder        # these imports are needed for the code below
from sklearn.feature_selection import SelectKBest, chi2
X_Cat = X.astype(int)
le = LabelEncoder()
chi2_features = SelectKBest(chi2, k=3)
Y = le.fit_transform(Y)  # chi2 expects a discrete target, so the continuous Y is label-encoded
X_kbest_features = chi2_features.fit_transform(X_Cat, Y)
If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only needs one, as
the second does not add additional information. We will use the Pearson Correlation here.
import seaborn as sns  # needed for the heatmap below
cor = df.corr()
plt.figure(figsize=(10,6))
sns.heatmap(cor, annot=True)
<Axes: >
The get_support() method returns a Boolean vector where True means the variable does not have zero variance.
from sklearn.feature_selection import VarianceThreshold
X = globals()["X"]
v_threshold = VarianceThreshold(threshold = 0)
v_threshold.fit(X)
v_threshold.get_support()
# The computation of mean_abs_diff is not visible here; the conventional
# mean-absolute-difference score per feature is assumed:
mean_abs_diff = np.sum(np.abs(X - X.mean(axis=0)), axis=0) / X.shape[0]
plt.bar(np.arange(X.shape[1]), mean_abs_diff, color='teal')
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing as prepr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score
Shape:(500, 9)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \
0 1 337 118 4 4.5 4.5 9.65
1 2 324 107 4 4.0 4.5 8.87
2 3 316 104 3 3.0 3.5 8.00
3 4 322 110 3 3.5 2.5 8.67
4 5 314 103 2 2.0 3.0 8.21
5 6 330 115 5 4.5 3.0 9.34
6 7 321 109 3 3.0 4.0 8.20
7 8 308 101 2 3.0 4.0 7.90
8 9 302 102 1 2.0 1.5 8.00
9 10 323 108 3 3.5 3.0 8.60
df.boxplot(showbox=True, figsize=(10,8))
<Axes: >
chanceofadmit
0 0.920635
1 0.666667
2 0.603175
3 0.730159
4 0.492063
5 0.888889
6 0.650794
7 0.539683
8 0.253968
9 0.174603
grescore toeflscore universityrating sop lor \
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.529440 0.542571 0.528500 0.593500 0.621000
std 0.225903 0.217210 0.285878 0.247751 0.231362
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.360000 0.392857 0.250000 0.375000 0.500000
50% 0.540000 0.535714 0.500000 0.625000 0.625000
75% 0.700000 0.714286 0.750000 0.750000 0.750000
max 1.000000 1.000000 1.000000 1.000000 1.000000
Pairplot
This is a scatter plot of all the attributes on both the x and y axes. 'chanceofadmit' is the dependent variable, and the plot clearly shows that
a few attributes have a linear relation with the dependent attribute, so we can move ahead with linear regression.
sns.pairplot(scaled_df)
<seaborn.axisgrid.PairGrid at 0x7fb49256cc50>
Creating X and Y for the linear regression equation (Y = a + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ), where Y is the dependent variable
and the Xᵢ are the independent variables
cols = list(scaled_df)
X = scaled_df.iloc[:, :-1]
y = scaled_df[cols[-1]]
So we create a new attribute with all ones and append it to the start of the dataframe. This will serve as X₀.
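A sketch of this step using the statsmodels helper imported earlier; add_constant prepends a column named 'const' filled with 1.0:
X = sm.add_constant(X)
X[0:5]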