Assignment 2 DSBDA

This document outlines a notebook for analyzing student performance data through various techniques including data preprocessing, outlier detection, and machine learning. It details steps such as loading data, handling missing values, detecting outliers using methods like Z-Score and IQR, and applying transformations and feature engineering. The notebook also includes code snippets for implementing these techniques using Python libraries.

1. Introduction

Student Performance Data Analysis

This notebook explores student academic performance using data preprocessing, outlier detection, transformation, feature engineering, and machine learning.

**Steps Covered**
1. Data Loading & Exploration
2. Handling Missing Values
3. Detecting & Handling Outliers (Boxplot, Z-Score, IQR, Winsorization, Capping)
4. Data Transformation (Log Scaling, Normalization, Box-Cox)
5. Feature Engineering & Encoding
6. Exploratory Data Analysis (EDA)
7. Saving Processed Data

2. Importing Required Libraries

# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore, boxcox
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Display all columns in Pandas output
pd.set_option('display.max_columns', None)

3. Loading and Exploring the Dataset


file_path = "student_data.csv"  # Update this if needed
df = pd.read_csv(file_path)

# Display first few rows
print("\n📌 First 5 rows of the dataset:")
display(df.head())

📌 First 5 rows of the dataset:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences  G1  G2  G3
0     1       3         6   5   6   6
1     1       3         4   5   5   6
2     3       3        10   7   8  10
3     1       5         2  15  14  15
4     2       5         4   6  10  10

print("\n📌 Dataset Info:")
df.info()

📌 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 sex 395 non-null object
2 age 395 non-null int64
3 address 395 non-null object
4 famsize 395 non-null object
5 Pstatus 395 non-null object
6 Medu 395 non-null int64
7 Fedu 395 non-null int64
8 Mjob 395 non-null object
9 Fjob 395 non-null object
10 reason 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null int64
13 studytime 395 non-null int64
14 failures 395 non-null int64
15 schoolsup 395 non-null object
16 famsup 395 non-null object
17 paid 395 non-null object
18 activities 395 non-null object
19 nursery 395 non-null object
20 higher 395 non-null object
21 internet 395 non-null object
22 romantic 395 non-null object
23 famrel 395 non-null int64
24 freetime 395 non-null int64
25 goout 395 non-null int64
26 Dalc 395 non-null int64
27 Walc 395 non-null int64
28 health 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB
print("\n📌 Missing Values in Each Column:")
print(df.isnull().sum())

📌 Missing Values in Each Column:


school 0
sex 0
age 0
address 0
famsize 0
Pstatus 0
Medu 0
Fedu 0
Mjob 0
Fjob 0
reason 0
guardian 0
traveltime 0
studytime 0
failures 0
schoolsup 0
famsup 0
paid 0
activities 0
nursery 0
higher 0
internet 0
romantic 0
famrel 0
freetime 0
goout 0
Dalc 0
Walc 0
health 0
absences 0
G1 0
G2 0
G3 0
dtype: int64
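This dataset happens to have no missing values, so step 2 of the outline never fires here. For completeness, a minimal sketch of the usual fill strategy (median for numeric columns, mode for categorical ones) on a small synthetic frame that borrows this dataset's column names:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: the real student_data.csv has no NaNs.
sample = pd.DataFrame({
    "G1": [5, np.nan, 7, 15],
    "Mjob": ["at_home", None, "other", "health"],
})

# Numeric column: fill with the median; categorical column: fill with the mode.
sample["G1"] = sample["G1"].fillna(sample["G1"].median())
sample["Mjob"] = sample["Mjob"].fillna(sample["Mjob"].mode()[0])

print(sample.isnull().sum().sum())  # 0
```

Median filling is robust to the same outliers the next section detects, which is why it is usually preferred over the mean here.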

4. Detecting Outliers (Multiple Techniques)


Detecting Outliers using Boxplot, Z-Score, and IQR

plt.figure(figsize=(12, 6))
sns.boxplot(data=df[['G1', 'G2', 'G3', 'absences']])
plt.title("Boxplot for Outlier Detection")
plt.show()

# Absolute Z-scores, kept as a DataFrame so column names survive
z_scores = pd.DataFrame(
    np.abs(zscore(df[['G1', 'G2', 'G3', 'absences']])),
    columns=['G1', 'G2', 'G3', 'absences']
)

print("\nZ-Scores of Features:")
print(z_scores)

Z-Scores of Features:
G1 G2 G3 absences
0 1.782467 1.254791 0.964934 0.036424
1 1.782467 1.520979 0.964934 0.213796
2 1.179147 0.722415 0.090739 0.536865
3 1.234133 0.874715 1.002004 0.464016
4 1.480807 0.190038 0.090739 0.213796
.. ... ... ... ...
390 0.575827 0.456226 0.309288 0.661975
391 0.932473 1.407091 1.220553 0.338906
392 0.274167 0.722415 0.746385 0.338906
393 0.027493 0.342338 0.090739 0.714236
394 0.877487 0.456226 0.309288 0.088686

[395 rows x 4 columns]
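With absolute Z-scores in hand, the common rule is to flag any observation whose score exceeds 3. A self-contained sketch (synthetic absence counts, since a single extreme value in a tiny sample rarely crosses the threshold):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Mostly typical absence counts plus one extreme entry.
demo = pd.DataFrame({"absences": [2, 3, 4, 5] * 5 + [75]})

z = np.abs(zscore(demo["absences"]))
print(demo[z > 3])  # only the extreme row survives the filter
```

The same mask, inverted, is what the removal step in section 5 applies across all four columns at once with `.all(axis=1)`.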

# Select only numeric columns before calculating IQR
numeric_cols = df.select_dtypes(include=[np.number])

# Calculate Q1 (25th percentile) and Q3 (75th percentile)


Q1 = numeric_cols.quantile(0.25)
Q3 = numeric_cols.quantile(0.75)
IQR = Q3 - Q1 # Interquartile Range

# Define lower and upper bounds for outliers


lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("\n📌 IQR Outlier Boundaries:")


print(f"Lower Bound:\n{lower_bound}")
print(f"Upper Bound:\n{upper_bound}")

📌 IQR Outlier Boundaries:


Lower Bound:
age 13.0
Medu -1.0
Fedu 0.5
traveltime -0.5
studytime -0.5
failures 0.0
famrel 2.5
freetime 1.5
goout -1.0
Dalc -0.5
Walc -2.0
health 0.0
absences -12.0
G1 0.5
G2 3.0
G3 -1.0
dtype: float64
Upper Bound:
age 21.0
Medu 7.0
Fedu 4.5
traveltime 3.5
studytime 3.5
failures 0.0
famrel 6.5
freetime 5.5
goout 7.0
Dalc 3.5
Walc 6.0
health 8.0
absences 20.0
G1 20.5
G2 19.0
G3 23.0
dtype: float64
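Rows falling outside these bounds can then be filtered out. A small sketch of the same 1.5 × IQR rule applied to a toy series:

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8, 9, 40])  # one value far above the rest

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
is_outlier = (s < Q1 - 1.5 * IQR) | (s > Q3 + 1.5 * IQR)

print(s[~is_outlier])  # the 40 is dropped; the other five values remain
```

Unlike the Z-score rule, the IQR rule depends only on quartiles, so the extreme value itself cannot inflate the cutoff that is supposed to catch it.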
5. Handling Outliers (Removing, Capping, Winsorization)

df_no_outliers = df[(z_scores < 3).all(axis=1)]

print("\nData after Removing Outliers (Z-Score Method):")
display(df_no_outliers.head())

Data after Removing Outliers (Z-Score Method):

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences  G1  G2  G3
0     1       3         6   5   6   6
1     1       3         4   5   5   6
2     3       3        10   7   8  10
3     1       5         2  15  14  15
4     2       5         4   6  10  10

def cap_outliers(column):
    """Cap outliers to the 5th and 95th percentiles."""
    lower_limit = column.quantile(0.05)
    upper_limit = column.quantile(0.95)
    return column.clip(lower=lower_limit, upper=upper_limit)

df_winsorized = df.copy()
for col in ['G1', 'G2', 'G3', 'absences']:
    df_winsorized[col] = cap_outliers(df_winsorized[col])

print("\nData after Winsorization:")
display(df_winsorized.head())

Data after Winsorization:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences  G1    G2  G3
0     1       3       6.0   6   6.0   6
1     1       3       4.0   6   5.0   6
2     3       3      10.0   7   8.0  10
3     1       5       2.0  15  14.0  15
4     2       5       4.0   6  10.0  10
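The manual percentile capping above can also be done with scipy's ready-made `winsorize` (from `scipy.stats.mstats`), which clips a given fraction of each tail to the nearest remaining value:

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Clip the bottom 10% and top 10% of values (one value at each end here).
capped = winsorize(data, limits=[0.1, 0.1])
print(capped.min(), capped.max())
```

Note the subtle difference from `quantile`-based capping: `winsorize` snaps clipped values to actual data points rather than to interpolated percentiles.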

df_replaced = df.copy()
for col in ['G1', 'G2', 'G3', 'absences']:
    median_value = df_replaced[col].median()
    df_replaced[col] = np.where(
        (df_replaced[col] < lower_bound[col]) | (df_replaced[col] > upper_bound[col]),
        median_value, df_replaced[col]
    )

print("\nData after Replacing Outliers with Median:")
display(df_replaced.head())

Data after Replacing Outliers with Median:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences    G1    G2    G3
0     1       3       6.0   5.0   6.0   6.0
1     1       3       4.0   5.0   5.0   6.0
2     3       3      10.0   7.0   8.0  10.0
3     1       5       2.0  15.0  14.0  15.0
4     2       5       4.0   6.0  10.0  10.0

6. Data Transformation

# 1. Log Transformation
df['Log_Absences'] = np.log1p(df['absences'])

# 2. Square Root Transformation
df['Sqrt_Absences'] = np.sqrt(df['absences'])

# 3. Box-Cox Transformation (only for positive values)
df['BoxCox_G3'], _ = boxcox(df['G3'] + 1)  # Adding 1 to avoid zero values

# 4. Min-Max Scaling
scaler = MinMaxScaler()
cols = ['G1', 'G2', 'G3', 'Log_Absences', 'Sqrt_Absences', 'BoxCox_G3']
df[cols] = scaler.fit_transform(df[cols])

print("\nData after Transformation:")
display(df.head())
Data after Transformation:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences      G1        G2    G3  Log_Absences  \
0     1       3         6  0.1250  0.315789  0.30      0.449326
1     1       3         4  0.1250  0.263158  0.30      0.371632
2     3       3        10  0.2500  0.421053  0.50      0.553693
3     1       5         2  0.7500  0.736842  0.75      0.253678
4     2       5         4  0.1875  0.526316  0.50      0.371632

   Sqrt_Absences  BoxCox_G3
0       0.282843   0.230437
1       0.230940   0.230437
2       0.365148   0.426526
3       0.163299   0.700747
4       0.230940   0.426526
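A quick way to check whether these transforms are doing their job is to compare skewness before and after; `log1p` should pull a long right tail in. A sketch on synthetic right-skewed counts similar in shape to an absences column:

```python
import numpy as np
import pandas as pd

# Right-skewed counts with one large value, as absence data often has.
absences = pd.Series([0, 0, 1, 2, 2, 3, 4, 6, 10, 30])

print(f"skew before:       {absences.skew():.2f}")
print(f"skew after log1p:  {np.log1p(absences).skew():.2f}")
```

`log1p` rather than `log` is what the notebook uses precisely because zero absences are common and `log(0)` is undefined.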

7. Saving Processed Data

df.to_csv("processed_student_data.csv", index=False)
print("Processed dataset saved as 'processed_student_data.csv'.")

Processed dataset saved as 'processed_student_data.csv'.
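The classifier imports at the top (`RandomForestClassifier`, `train_test_split`, `accuracy_score`) are never exercised in the notebook itself. A hedged sketch of how a pass/fail model could be trained on the processed data follows; the feature set and the "average of G1 and G2 at least 10 counts as a pass" rule are illustrative assumptions, not part of the original assignment, and synthetic data stands in for the CSV so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a few numeric columns of the processed dataset.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "G1": rng.integers(0, 21, 200),
    "G2": rng.integers(0, 21, 200),
    "studytime": rng.integers(1, 5, 200),
})
# Hypothetical target: pass if the G1/G2 average is at least 10.
y = ((X["G1"] + X["G2"]) / 2 >= 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

On the real data, G3 (the final grade) would supply the target, and the Min-Max-scaled features saved above could be fed in directly.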
