Lab report
Course code: CSE326
Course Title: Data Mining and Machine Learning Lab
Lab report: 04
Topic: Categorical Data Handling and Feature Scaling.
Submitted To:
Name: Sadman Sadik Khan
Designation: Lecturer
Department: CSE
Daffodil International University
Submitted By:
Name: Fardus Alam
ID: 222-15-6167
Section: 62-G
Department: CSE
Daffodil International University
Submission Date: 15-03-2025
Code: Categorical Columns Information
# Columns with object ('O') dtype are treated as categorical
categorical_columns = [feature for feature in df2.columns if df2[feature].dtype == 'O']

print("Value Counts for Categorical Columns:")
for column in categorical_columns:
    print("\n")
    print(df2[column].value_counts())
Output:
Explanation:
This code prints the value counts for each categorical column in df2, showing how often each unique
value occurs in those columns. It helps to understand the distribution of categories in every categorical
column of the dataset.
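As a minimal, self-contained sketch of the same pattern (the demo DataFrame below is made up and is not the stroke dataset used in this lab):

import pandas as pd

demo = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'ever_married': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'age': [34, 61, 45, 28, 52],
})

# Only the object-dtype (string) columns are picked up as categorical
categorical_columns = [c for c in demo.columns if demo[c].dtype == 'O']

print("Value Counts for Categorical Columns:")
for column in categorical_columns:
    print()
    print(demo[column].value_counts())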
Code: Label Encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for column in ['gender', 'ever_married', 'Residence_type']:
    df2[column] = le.fit_transform(df2[column])
df2
Output:
Explanation:
This code uses Label Encoding to convert the categorical columns gender, ever_married, and
Residence_type in df2 into numeric values.
Keys:
1. from sklearn.preprocessing import LabelEncoder
Imports LabelEncoder from Scikit-learn.
2. le = LabelEncoder()
Initializes the label encoder object.
3. for column in ['gender', 'ever_married', 'Residence_type']:
Loops through the selected categorical columns in df2.
df2[column] = le.fit_transform(df2[column]) converts each of these columns into
numeric values using the fit_transform method.
4. df2
Displays the updated DataFrame with encoded values for the three label-encoded columns.
Purpose:
Efficiently converts the selected categorical columns into numeric values, preparing the dataset
for machine learning models.
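Note that because the same LabelEncoder object is refit inside the loop, le ends up holding only the mapping of the last column (Residence_type). The following is a sketch of an alternative version of the loop (not part of the original notebook) that keeps one encoder per column so each mapping stays inspectable; the category values shown in the comments are illustrative:

from sklearn.preprocessing import LabelEncoder

encoders = {}
for column in ['gender', 'ever_married', 'Residence_type']:
    le = LabelEncoder()                       # a separate encoder per column
    df2[column] = le.fit_transform(df2[column])
    encoders[column] = le

# classes_ lists the original categories; their positions are the encoded values
print(encoders['gender'].classes_)                    # e.g. ['Female' 'Male' 'Other']
print(encoders['gender'].inverse_transform([0, 1]))   # map codes back to the labels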
Code: One Hot Encoding
df2 = pd.get_dummies(df2, columns=['work_type', 'smoking_status'], drop_first=True)
df2
Output:
Explanation:
This code applies one-hot encoding to the categorical columns work_type and smoking_status,
converting them into numeric form.
Keys:
pd.get_dummies(df2, columns=['work_type', 'smoking_status'], drop_first=True)
Creates dummy variables (one-hot encoding) for work_type and smoking_status.
pd.get_dummies() returns Boolean (True/False) columns by default.
drop_first=True removes the first category from each column to avoid the
dummy variable trap (multicollinearity).
df2 is updated with the transformed data.
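To make the effect of drop_first concrete, here is a minimal sketch on a made-up smoking_status column (the rows are invented and do not come from the dataset above):

import pandas as pd

demo = pd.DataFrame({'smoking_status': ['never smoked', 'smokes',
                                        'formerly smoked', 'never smoked']})

# Without drop_first: one Boolean column per category
print(pd.get_dummies(demo, columns=['smoking_status']))

# With drop_first=True: the alphabetically first category ('formerly smoked')
# is dropped; a row with all False columns then implicitly means that category
print(pd.get_dummies(demo, columns=['smoking_status'], drop_first=True))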
Code: Boolean to Integer
bool_col_list = ['work_type_Never_worked', 'work_type_Private',
                 'work_type_Self-employed', 'work_type_children',
                 'smoking_status_formerly smoked', 'smoking_status_never smoked',
                 'smoking_status_smokes']
df2[bool_col_list] = df2[bool_col_list].astype(int)

df2.head()
Output: (only the converted columns are shown here)
Explanation:
This code converts Boolean columns (True/False) into integer format (0/1) for consistency in numerical
processing.
Keys:
bool_col_list → List of the one-hot encoded dummy columns.
df2[bool_col_list] = df2[bool_col_list].astype(int)
Converts their True/False values into 1/0 integers.
df2.head() → Displays the first 5 rows of the updated DataFrame.
Purpose:
Ensures categorical dummy variables are in a consistent numeric format for ML models.
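As an alternative to typing out bool_col_list by hand, the Boolean dummy columns can also be picked up automatically. A sketch, assuming no other Boolean columns in df2 should be left untouched:

# Select every Boolean column (the ones produced by pd.get_dummies) and cast to int
bool_cols = df2.select_dtypes(include='bool').columns
df2[bool_cols] = df2[bool_cols].astype(int)

df2.head()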
Code: Feature Scaling
from sklearn.preprocessing import MinMaxScaler, RobustScaler

scaler = MinMaxScaler()
robust = RobustScaler()  # for outliers

df2['age'] = scaler.fit_transform(df2[['age']])
df2['bmi'] = scaler.fit_transform(df2[['bmi']])

df2['avg_glucose_level'] = robust.fit_transform(df2[['avg_glucose_level']])

df2.head()
Output:
Explanation:
This code scales the numerical features in df2 using Min-Max Scaling and Robust Scaling.
Keys:
MinMaxScaler() → Scales age and bmi between 0 and 1.
RobustScaler() → Scales avg_glucose_level using the median and interquartile range (IQR),
making it resistant to outliers.
.fit_transform() → Applies the transformation to each feature.
df2.head() → Displays the first 5 rows of the updated dataset.
Purpose:
Ensures consistent scaling for better model performance while handling outliers in avg_glucose_level.
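For reference, Min-Max scaling computes (x - min) / (max - min) and robust scaling computes (x - median) / (Q3 - Q1). The sketch below checks both by hand on small made-up arrays (the numbers are illustrative, not taken from the dataset):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

ages = np.array([[20.0], [35.0], [50.0], [80.0]])                   # made-up ages
print(MinMaxScaler().fit_transform(ages).ravel())                   # [0.   0.25 0.5  1.  ]
print(((ages - ages.min()) / (ages.max() - ages.min())).ravel())    # manual Min-Max check

glucose = np.array([[70.0], [85.0], [100.0], [115.0], [300.0]])     # one outlier
print(RobustScaler().fit_transform(glucose).ravel())
q1, q3 = np.percentile(glucose, [25, 75])
print(((glucose - np.median(glucose)) / (q3 - q1)).ravel())         # manual robust check

In a complete modeling pipeline, the scalers would normally be fit on the training split only and then applied to the test split with .transform(), so that no information from the test data leaks into the scaling parameters.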