LAB EXERCISE – 2
Data Preprocessing
Aim of the Experiment.
The main aim of this experiment is to preprocess the given dataset. The dataset has already been
created and is available in the file [Link].
Sample Dataset
id  first        last       gender  Marks  selected
1   Leone        Debrick    Female  50     TRUE
2   Romola       Phinnessy  Female  60     FALSE
3   Geri         Prium      Male    65     FALSE
4   Sandy        Doveston   Female  95     FALSE
5   Jacenta      Jansik     Female  31     TRUE
6   Diane-marie  Medhurst   Female  45     TRUE
7   Austen       Pool       Male    45     TRUE
8   Vanya        Teffrey    Male    70     FALSE
9   Giordano     Elloy      Male    36     FALSE
10  Rozele       Fawcett    Female  50     FALSE
The objectives of this experiment are:
1. Explore LabelEncoder
2. Explore scikit-learn preprocessing routines such as scaling
3. Explore scikit-learn preprocessing routines such as Binarizer
Reference to the Textbook and Explanation
All the fundamentals are given in Chapter 2 and Appendix 2.
The values Female and Male in the gender column can be converted to 0 and 1 using LabelEncoder.
It is done as given below:
df_gender_encode = LabelEncoder()
df["gender"] = df_gender_encode.fit_transform(df["gender"])
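The encoding step can be illustrated without scikit-learn. The sketch below assumes, as LabelEncoder does, that the distinct labels are sorted and numbered from 0, which is why Female becomes 0 and Male becomes 1. The helper name encode_labels is hypothetical and is used only for illustration.

```python
# Hypothetical helper (not part of scikit-learn): assigns integer codes to
# labels in sorted order, which mirrors how LabelEncoder numbers its classes.
def encode_labels(values):
    classes = sorted(set(values))  # distinct labels in sorted order
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

genders = ["Female", "Female", "Male", "Female", "Male"]
codes, classes = encode_labels(genders)
print(classes)  # ['Female', 'Male']
print(codes)    # [0, 0, 1, 0, 1]
```

Because the codes follow the sorted order of the labels, adding a new label such as "Other" would be coded 2 and would not change the codes of Female and Male.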
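Scaling can be done as follows:
df["Marks"] = preprocessing.scale(df["Marks"])
scaled_df = preprocessing.scale(df["Marks"])
Scaling removes the mean and divides by the standard deviation, so the scaled marks have zero mean
and unit variance.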
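The arithmetic behind this kind of scaling can be checked by hand. The sketch below is an illustration under the assumption that the scaling routine standardizes a column by subtracting its mean and dividing by its population standard deviation; it uses only the Python standard library and the Marks column of the sample dataset.

```python
from statistics import fmean, pstdev

# Marks column of the sample dataset
marks = [50, 60, 65, 95, 31, 45, 45, 70, 36, 50]

mean = fmean(marks)   # arithmetic mean of the column
std = pstdev(marks)   # population standard deviation (divides by n)

# Standardize: subtract the mean, then divide by the standard deviation
scaled = [(m - mean) / std for m in marks]

print(round(mean, 2), round(std, 2))
print([round(z, 2) for z in scaled])
```

After this step the scaled values have a mean of 0 and a standard deviation of 1, up to floating-point rounding.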
Copyright @ Oxford University Press, India 2021
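Binarization applies a threshold: values greater than the threshold become 1 and all other values
become 0. Here newarr is the scaled marks reshaped into a single column:
scaled_df_bin = preprocessing.Binarizer(threshold=0.5).transform(newarr)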
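The thresholding rule can be sketched in plain Python. The function binarize below is a hypothetical helper, not a scikit-learn call, and the input values are made up for illustration; it assumes the usual convention that only values strictly greater than the threshold map to 1.

```python
# Hypothetical helper: values strictly greater than the threshold become 1,
# everything else (including values equal to the threshold) becomes 0.
def binarize(values, threshold=0.5):
    return [1 if v > threshold else 0 for v in values]

z_scores = [-1.2, 0.3, 0.5, 0.58, 2.3]  # illustrative scaled values
binary = binarize(z_scores, threshold=0.5)
print(binary)  # [0, 0, 0, 1, 1]
```

Note that the value 0.5 itself maps to 0, because it is not greater than the threshold.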
Duplicate rows can be removed as follows:
df_duplicates_removed = df_duplicated.drop_duplicates()
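What duplicate removal does can be sketched without pandas. The example below assumes the default behaviour of keeping the first occurrence of each row and discarding later identical rows; the function drop_duplicate_rows and the three-field rows are hypothetical and serve only as an illustration.

```python
# Hypothetical helper: keeps the first occurrence of every row and drops
# later rows that are identical to one already seen.
def drop_duplicate_rows(rows):
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

rows = [(1, "Leone", 50), (2, "Romola", 60), (3, "Geri", 65)]
duplicated = rows * 2                 # every row now appears twice
deduped = drop_duplicate_rows(duplicated)
print(len(duplicated), len(deduped))  # 6 3
```

The original row order is preserved, so the result matches the data before it was duplicated.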
The NaN values of a column can be replaced as shown below:
df['m5'] = df['m5'].fillna(0)
This replaces every NaN in column m5 with zero.
The command
df = df.dropna(axis=1)
removes every column that contains a NaN.
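The three ways of handling NaN used in this exercise (fill with a constant, fill with the column mean, drop the column) can be sketched in plain Python. The helper names below are hypothetical, the table is a small dictionary of columns rather than a DataFrame, and the mean is assumed to be computed over the non-NaN values only.

```python
import math

nan = math.nan

# Hypothetical helpers operating on plain lists and dicts
def fill_nan(column, value):
    """Replace every NaN in the column with the given value."""
    return [value if math.isnan(x) else x for x in column]

def mean_ignoring_nan(column):
    """Mean of the entries that are not NaN."""
    present = [x for x in column if not math.isnan(x)]
    return sum(present) / len(present)

def drop_nan_columns(table):
    """Keep only the columns that contain no NaN at all."""
    return {name: col for name, col in table.items()
            if not any(math.isnan(x) for x in col)}

table = {
    "m1": [50.0, nan, 60.0, nan, 80.0],
    "m5": [nan, nan, nan, 10.0, 20.0],
}

table["m5"] = fill_nan(table["m5"], 0.0)       # constant fill
m1_mean = mean_ignoring_nan(table["m1"])       # (50 + 60 + 80) / 3
m1_filled = fill_nan(table["m1"], m1_mean)     # mean fill
remaining = drop_nan_columns(table)            # m1 still has NaN, so it is dropped

print(table["m5"])
print(round(m1_mean, 2))
print(list(remaining))
```

Filling keeps every column but changes its values, whereas dropping keeps only columns that were already complete, so the choice depends on how much data can be discarded.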
Listing 1
import pandas as pd
col_list=["id","first","last","gender","Marks","selected"]
df = pd.read_csv("[Link]",usecols=col_list)
print(df)
print("End of Listing\n\n\n")
# Encode the gender column with scikit-learn's LabelEncoder:
# Female becomes 0 and Male becomes 1
from sklearn.preprocessing import LabelEncoder
df_gender_encode = LabelEncoder()
df["gender"] = df_gender_encode.fit_transform(df["gender"])
# One can observe that Female is coded as 0 and Male as 1
print(df)
print("End of Listing\n\n\n")
# Now scale the marks to remove the mean and obtain unit variance
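from sklearn import preprocessing
df["Marks"] = preprocessing.scale(df["Marks"])
scaled_df = preprocessing.scale(df["Marks"])
print(df)
print("Scaling of marks is completed\n\n\n\n")
# Binarizer expects a 2-D array, so reshape the marks into a single column
newarr = scaled_df.reshape(-1, 1)
scaled_df_bin = preprocessing.Binarizer(threshold=0.5).transform(newarr)
# Flatten the result back to 1-D before storing it in the column
df['Marks'] = scaled_df_bin.ravel()
print(df)
print("Binarization of marks is completed\n\n\n\n")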
Output
Listing 2
import pandas as pd
col_list=["id","first","last","gender","Marks","selected"]
df = pd.read_csv("[Link]",usecols=col_list)
print(df)
print("End of Listing\n\n\n")
# Create duplicate rows by concatenating the DataFrame with itself,
# so that every row appears twice
df_duplicated = pd.concat([df]*2, ignore_index=True)
print(df_duplicated)
print("Display before removing duplicates\n\n\n\n")
# Keep the first occurrence of each row and drop the repeats
df_duplicates_removed = df_duplicated.drop_duplicates()
print(df_duplicates_removed)
print("Display after removing duplicates\n\n\n\n")
Output
Listing 3
import pandas as pd
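# The letter 'A' marks a missing entry in each column
df = pd.DataFrame({
'm1':[50,'A',60,'A',80],
'm2':[60,'A','60','A',80],
'm3':[50,70,'A','A',60],
'm4':[60,'A','A','A',60],
'm5':['A','A','A',10,20]
})
# Convert every column to numbers; entries that cannot be parsed become NaN
df = df.apply(pd.to_numeric, errors='coerce')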
print(df)
print('Dataframe with NaN\n\n\n')
# Replace every NaN in m5 with zero
df['m5'] = df['m5'].fillna(0)
print(df)
print('Making m5 NaN as 0 using fillna() function\n\n\n\n')
# Replace every NaN in m2 with the column mean
df1 = df.copy()
df1['m2'] = df1['m2'].fillna(df1['m2'].mean())
print(df1)
print('Making m2 NaN as mean using fillna() function\n\n\n\n')
# Replace every NaN in m3 with the column median
df2 = df.copy()
df2['m3'] = df2['m3'].fillna(df2['m3'].median())
print(df2)
print('Making m3 NaN as median using fillna() function\n\n\n\n')
# Dropping all columns having NaN
df = df.dropna(axis=1)
print(df)
print('Dropping all columns having NaN\n\n\n\n')
Output
Listing 4
This listing illustrates MinMax scaling, which maps each column to the range 0 to 1, and Standard
scaling, which converts each column to z-scores.
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
data = asarray([[1,3],[8,5],[6,7],[8,9]])
print("\n Original Data")
print(data)
scaler1 = MinMaxScaler()
scaler2 = StandardScaler()
scaled1 = scaler1.fit_transform(data)
scaled2 = scaler2.fit_transform(data)
print("\n\nThe output of MinMax Scaling")
print(scaled1)
print("\n\nThe output of Standard scaling as z-score")
print(scaled2)
Output
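As a hand check on the MinMax scaling in Listing 4, the sketch below applies the formula (x - min) / (max - min) to each column of the same data using plain Python. The function min_max_scale is a hypothetical helper, and it assumes that no column is constant, since a constant column would give a zero denominator.

```python
# Same data as in Listing 4: four rows, two columns
data = [[1, 3], [8, 5], [6, 7], [8, 9]]

# Hypothetical helper: rescales every column to the range 0 to 1.
# Assumes each column has at least two distinct values.
def min_max_scale(rows):
    columns = list(zip(*rows))            # transpose rows into columns
    scaled_cols = []
    for col in columns:
        lo, hi = min(col), max(col)
        scaled_cols.append([(x - lo) / (hi - lo) for x in col])
    return [list(row) for row in zip(*scaled_cols)]  # transpose back

scaled = min_max_scale(data)
for row in scaled:
    print([round(v, 4) for v in row])
```

In the first column the minimum 1 maps to 0 and the maximum 8 maps to 1, so the value 6 maps to 5/7; in the second column the values 3, 5, 7, 9 map to 0, 1/3, 2/3 and 1.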