
DMML Lab Report 04

This lab report focuses on handling categorical data and feature scaling in a dataset using Python. It includes code examples for value counts of categorical columns, label encoding, one-hot encoding, converting boolean to integer, and feature scaling using MinMaxScaler and RobustScaler. The report is submitted by Fardus Alam for the CSE326 course at Daffodil International University.

Uploaded by

Atick Arman

Lab report

Course code: CSE326


Course Title: Data Mining and Machine Learning Lab
Lab report: 04
Topic: Categorical Data Handling and Feature Scaling.

Submitted To:
Name: Sadman Sadik Khan
Designation: Lecturer
Department: CSE
Daffodil International University

Submitted By:
Name: Fardus Alam
ID: 222-15-6167
Section: 62-G
Department: CSE
Daffodil International University

Submission Date: 15-03-2025


Code: Categorical Columns information
    # Collect the columns whose dtype is object ('O'), i.e. the categorical columns.
    categorical_columns = [feature for feature in df2.columns
                           if df2[feature].dtype == 'O']

    print("Value Counts for Categorical Columns:")
    for column in categorical_columns:
        print("\n")
        # Frequency of each unique value in this column.
        print(df2[column].value_counts())

Output:

Explanation:
This code prints the value counts for each categorical column in df2, showing how often each unique
value occurs. This helps in understanding the distribution of categories in each categorical column
of the dataset.
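As a minimal sketch of the same idea, assuming a small made-up DataFrame (the name toy and its values are hypothetical, not the lab dataset), the non-numeric columns can be selected by dtype and counted. Here select_dtypes is used instead of the dtype == 'O' test, which is a swapped-in but equivalent selection for this toy data:

```python
import pandas as pd

# Hypothetical toy data; the lab's df2 is not reproduced here.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "smoking_status": ["smokes", "never smoked", "never smoked", "Unknown", "smokes"],
    "age": [67, 61, 80, 49, 79],
})

# Keep every column that is not numeric (the text columns in this toy frame).
categorical_columns = list(toy.select_dtypes(exclude="number").columns)

# Frequency of each unique value, per categorical column.
counts = {c: toy[c].value_counts() for c in categorical_columns}
for c, vc in counts.items():
    print(f"\n{vc}")
```

Storing the counts in a dictionary keeps them available for later checks, for example spotting a very rare category before encoding.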
Code: Label Encoding
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    # Encode each listed column as integers; the encoder is refit on every column.
    for column in ['gender', 'ever_married', 'Residence_type']:
        df2[column] = le.fit_transform(df2[column])
    df2

Output:

Explanation:
This code uses label encoding to convert the categorical columns gender, ever_married, and
Residence_type in df2 into integer codes.
Keys:
1. from sklearn.preprocessing import LabelEncoder
- Imports LabelEncoder from scikit-learn.
2. le = LabelEncoder()
- Initializes the label encoder object.
3. for column in ['gender', 'ever_married', 'Residence_type']:
- Loops through the three listed columns only, not every categorical column in df2.
- df2[column] = le.fit_transform(df2[column]) refits the encoder on each column and replaces its
values with integer codes (0, 1, ...), assigned in sorted order of the class labels.
4. df2
- Displays the updated DataFrame with encoded values for these three columns.
Purpose:
- Converts the selected categorical columns into numeric values, preparing the dataset
for machine learning models.
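Because a single encoder refit in a loop only remembers the last column it saw, one possible variation is to keep a separate encoder per column so the integer codes can be mapped back to labels. The sketch below assumes a hypothetical toy DataFrame; the names toy, encoders, and mapping are illustrative and not part of the lab code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for df2.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

encoders = {}  # one fitted LabelEncoder per column
for column in ["gender", "ever_married"]:
    le = LabelEncoder()
    toy[column] = le.fit_transform(toy[column])
    encoders[column] = le

# classes_ lists the labels in sorted order; position i is the label encoded as i.
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)

# The stored encoder can turn the integer codes back into the original labels.
original_gender = encoders["gender"].inverse_transform(toy["gender"])
print(list(original_gender))
```

Keeping the encoders also allows the same mapping to be reused on new data with transform instead of refitting.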
Code: One Hot Encoding

    df2 = pd.get_dummies(df2, columns=['work_type', 'smoking_status'],
                         drop_first=True)
    df2

Output:

Explanation:
This code applies one-hot encoding to the categorical columns work_type and smoking_status,
converting them into numeric form.
Keys:
- pd.get_dummies(df2, columns=['work_type', 'smoking_status'], drop_first=True)
- Creates dummy variables (one-hot encoding) for work_type and smoking_status: one new column
per category, replacing the original columns.
- In pandas 2.0 and later, pd.get_dummies() returns Boolean (True/False) columns by default.
- drop_first=True removes the first category of each column to avoid the
dummy variable trap (multicollinearity). A row whose dummies are all False belongs to the dropped category.
- df2 is updated with the transformed data.
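To make the effect of drop_first concrete, here is a small sketch on a hypothetical three-row DataFrame (toy and encoded are illustrative names, and the values are made up):

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "age": [67, 61, 80],
    "work_type": ["Private", "Self-employed", "Govt_job"],
})

# Categories are ordered alphabetically, so 'Govt_job' is the first one
# and is the category removed by drop_first=True.
encoded = pd.get_dummies(toy, columns=["work_type"], drop_first=True)
print(encoded)
print(list(encoded.columns))
```

The third row (Govt_job) has no dedicated column; it is represented by both remaining dummy columns being False (or 0).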

Code: Boolean to Integer

    bool_col_list = ['work_type_Never_worked', 'work_type_Private',
                     'work_type_Self-employed', 'work_type_children',
                     'smoking_status_formerly smoked', 'smoking_status_never smoked',
                     'smoking_status_smokes']
    # Cast the Boolean dummy columns to integers (True -> 1, False -> 0).
    df2[bool_col_list] = df2[bool_col_list].astype(int)

    df2.head()

Output (only the converted columns are shown):

Explanation:
This code converts the Boolean dummy columns (False/True) into integer format (0/1) for consistency in numerical
processing.
Keys:
- bool_col_list → List of the one-hot encoded columns produced by pd.get_dummies().
- df2[bool_col_list] = df2[bool_col_list].astype(int)
- Converts True/False values (if any) into 1/0 integers.
- df2.head() → Displays the first 5 rows of the updated DataFrame.
Purpose:
- Ensures the dummy variables are in a consistent numeric format for ML models.
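As a hedged alternative to listing the dummy columns by hand, the sketch below shows two options on a hypothetical toy DataFrame: asking pd.get_dummies() for integer output through its dtype parameter, or finding the Boolean columns with select_dtypes and casting them. The names toy, as_int, as_bool, and converted are illustrative only:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "smoking_status": ["smokes", "never smoked", "formerly smoked"],
})

# Option A: request integer dummies directly.
as_int = pd.get_dummies(toy, columns=["smoking_status"], drop_first=True, dtype=int)

# Option B: create Boolean dummies, then cast every Boolean column to int.
as_bool = pd.get_dummies(toy, columns=["smoking_status"], drop_first=True, dtype=bool)
bool_cols = list(as_bool.select_dtypes(include="bool").columns)
converted = as_bool.astype({c: int for c in bool_cols})

print(converted)
```

Both options yield the same 0/1 columns; Option B is closer to the lab code, while Option A skips the separate conversion step.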

Code: Feature Scaling

    from sklearn.preprocessing import MinMaxScaler, RobustScaler

    scaler = MinMaxScaler()
    robust = RobustScaler()  # for outliers

    # fit_transform refits the scaler on each column before transforming it.
    df2['age'] = scaler.fit_transform(df2[['age']])
    df2['bmi'] = scaler.fit_transform(df2[['bmi']])

    df2['avg_glucose_level'] = robust.fit_transform(df2[['avg_glucose_level']])

    df2.head()

Output:

Explanation:
This code rescales numerical features in df2 using Min-Max scaling and robust scaling.
Keys:
- MinMaxScaler() → Scales age and bmi to the range 0 to 1 using (x - min) / (max - min).
- RobustScaler() → Centers avg_glucose_level on its median and divides by the interquartile range (IQR),
making it resistant to outliers. The result is not confined to a fixed range.
- .fit_transform() → Learns the scaling parameters from the column and applies the transformation in one step.
- df2.head() → Displays the first 5 rows of the updated dataset.
Purpose:
- Puts the numerical features on comparable scales for better model performance while limiting the effect of outliers in avg_glucose_level.
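The two scalers can be checked against their formulas on a small example. This is only a sketch on made-up numbers (toy and all values are hypothetical), with one deliberately large glucose value acting as an outlier:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical toy data; 300.0 plays the role of an outlier.
toy = pd.DataFrame({
    "age": [20.0, 35.0, 50.0, 65.0, 80.0],
    "avg_glucose_level": [80.0, 90.0, 100.0, 110.0, 300.0],
})

# Library results, one scaler instance per column.
age_scaled = MinMaxScaler().fit_transform(toy[["age"]]).ravel()
glucose_scaled = RobustScaler().fit_transform(toy[["avg_glucose_level"]]).ravel()

# Min-Max by hand: (x - min) / (max - min).
x = toy["age"].to_numpy()
age_manual = (x - x.min()) / (x.max() - x.min())

# Robust scaling by hand: (x - median) / (Q3 - Q1).
g = toy["avg_glucose_level"].to_numpy()
q1, med, q3 = np.percentile(g, [25, 50, 75])
glucose_manual = (g - med) / (q3 - q1)

print(age_scaled)
print(glucose_scaled)
```

With robust scaling the four ordinary glucose values stay evenly spaced around 0 and only the outlier lands far away, whereas Min-Max scaling of the same column would squeeze the ordinary values into a narrow band near 0.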
