Assignment 2 DSBDA

This document outlines a notebook for analyzing student performance data through various techniques including data preprocessing, outlier detection, and machine learning. It details steps such as loading data, handling missing values, detecting outliers using methods like Z-Score and IQR, and applying transformations and feature engineering. The notebook also includes code snippets for implementing these techniques using Python libraries.

1. Introduction

Student Performance Data Analysis

This notebook explores student academic performance using data preprocessing, outlier detection, transformation, feature engineering, and machine learning.

**Steps Covered**
1. Data Loading & Exploration
2. Handling Missing Values
3. Detecting & Handling Outliers (Boxplot, Z-Score, IQR, Winsorization, Capping)
4. Data Transformation (Log Scaling, Normalization, Box-Cox)
5. Feature Engineering & Encoding
6. Exploratory Data Analysis (EDA)
7. Saving Processed Data

2. Importing Required Libraries

# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore, boxcox
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Display all columns in Pandas output
pd.set_option('display.max_columns', None)

3. Loading and Exploring the Dataset


file_path = "student_data.csv"  # Update this if needed
df = pd.read_csv(file_path)

# Display first few rows
print("\n📌 First 5 rows of the dataset:")
display(df.head())

📌 First 5 rows of the dataset:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences  G1  G2  G3
0     1       3         6   5   6   6
1     1       3         4   5   5   6
2     3       3        10   7   8  10
3     1       5         2  15  14  15
4     2       5         4   6  10  10

print("\n📌 Dataset Info:")
df.info()

📌 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 sex 395 non-null object
2 age 395 non-null int64
3 address 395 non-null object
4 famsize 395 non-null object
5 Pstatus 395 non-null object
6 Medu 395 non-null int64
7 Fedu 395 non-null int64
8 Mjob 395 non-null object
9 Fjob 395 non-null object
10 reason 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null int64
13 studytime 395 non-null int64
14 failures 395 non-null int64
15 schoolsup 395 non-null object
16 famsup 395 non-null object
17 paid 395 non-null object
18 activities 395 non-null object
19 nursery 395 non-null object
20 higher 395 non-null object
21 internet 395 non-null object
22 romantic 395 non-null object
23 famrel 395 non-null int64
24 freetime 395 non-null int64
25 goout 395 non-null int64
26 Dalc 395 non-null int64
27 Walc 395 non-null int64
28 health 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB
print("\n📌 Missing Values in Each Column:")
print(df.isnull().sum())

📌 Missing Values in Each Column:


school 0
sex 0
age 0
address 0
famsize 0
Pstatus 0
Medu 0
Fedu 0
Mjob 0
Fjob 0
reason 0
guardian 0
traveltime 0
studytime 0
failures 0
schoolsup 0
famsup 0
paid 0
activities 0
nursery 0
higher 0
internet 0
romantic 0
famrel 0
freetime 0
goout 0
Dalc 0
Walc 0
health 0
absences 0
G1 0
G2 0
G3 0
dtype: int64
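This dataset happens to have no missing values, so step 2 of the outline never fires here. For completeness, a minimal sketch of the usual fill strategy (median for numeric columns, mode for categorical ones) on a small synthetic frame that borrows this dataset's column names:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: the real student_data.csv has no NaNs.
sample = pd.DataFrame({
    "G1": [5, np.nan, 7, 15],
    "Mjob": ["at_home", None, "other", "health"],
})

# Numeric column: fill with the median; categorical column: fill with the mode.
sample["G1"] = sample["G1"].fillna(sample["G1"].median())
sample["Mjob"] = sample["Mjob"].fillna(sample["Mjob"].mode()[0])

print(sample.isnull().sum().sum())  # 0
```

Median filling is robust to the same outliers the next section detects, which is why it is usually preferred over the mean here.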

4. Detecting Outliers (Multiple Techniques)


Detecting Outliers using Boxplot, Z-Score, and IQR

plt.figure(figsize=(12, 6))
sns.boxplot(data=df[['G1', 'G2', 'G3', 'absences']])
plt.title("Boxplot for Outlier Detection")
plt.show()

# Absolute Z-scores, kept as a DataFrame so column names survive
z_scores = pd.DataFrame(
    np.abs(zscore(df[['G1', 'G2', 'G3', 'absences']])),
    columns=['G1', 'G2', 'G3', 'absences']
)

print("\nZ-Scores of Features:")
print(z_scores)

Z-Scores of Features:
G1 G2 G3 absences
0 1.782467 1.254791 0.964934 0.036424
1 1.782467 1.520979 0.964934 0.213796
2 1.179147 0.722415 0.090739 0.536865
3 1.234133 0.874715 1.002004 0.464016
4 1.480807 0.190038 0.090739 0.213796
.. ... ... ... ...
390 0.575827 0.456226 0.309288 0.661975
391 0.932473 1.407091 1.220553 0.338906
392 0.274167 0.722415 0.746385 0.338906
393 0.027493 0.342338 0.090739 0.714236
394 0.877487 0.456226 0.309288 0.088686

[395 rows x 4 columns]
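With absolute Z-scores in hand, the common rule is to flag any observation whose score exceeds 3. A self-contained sketch (synthetic absence counts, since a single extreme value in a tiny sample rarely crosses the threshold):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Mostly typical absence counts plus one extreme entry.
demo = pd.DataFrame({"absences": [2, 3, 4, 5] * 5 + [75]})

z = np.abs(zscore(demo["absences"]))
print(demo[z > 3])  # only the extreme row survives the filter
```

The same mask, inverted, is what the removal step in section 5 applies across all four columns at once with `.all(axis=1)`.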

# Select only numeric columns before calculating IQR
numeric_cols = df.select_dtypes(include=[np.number])

# Calculate Q1 (25th percentile) and Q3 (75th percentile)


Q1 = numeric_cols.quantile(0.25)
Q3 = numeric_cols.quantile(0.75)
IQR = Q3 - Q1 # Interquartile Range

# Define lower and upper bounds for outliers


lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("\n📌 IQR Outlier Boundaries:")


print(f"Lower Bound:\n{lower_bound}")
print(f"Upper Bound:\n{upper_bound}")

📌 IQR Outlier Boundaries:


Lower Bound:
age 13.0
Medu -1.0
Fedu 0.5
traveltime -0.5
studytime -0.5
failures 0.0
famrel 2.5
freetime 1.5
goout -1.0
Dalc -0.5
Walc -2.0
health 0.0
absences -12.0
G1 0.5
G2 3.0
G3 -1.0
dtype: float64
Upper Bound:
age 21.0
Medu 7.0
Fedu 4.5
traveltime 3.5
studytime 3.5
failures 0.0
famrel 6.5
freetime 5.5
goout 7.0
Dalc 3.5
Walc 6.0
health 8.0
absences 20.0
G1 20.5
G2 19.0
G3 23.0
dtype: float64
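Rows falling outside these bounds can then be filtered out. A small sketch of the same 1.5 × IQR rule applied to a toy series:

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8, 9, 40])  # one value far above the rest

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
is_outlier = (s < Q1 - 1.5 * IQR) | (s > Q3 + 1.5 * IQR)

print(s[~is_outlier])  # the 40 is dropped; the other five values remain
```

Unlike the Z-score rule, the IQR rule depends only on quartiles, so the extreme value itself cannot inflate the cutoff that is supposed to catch it.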
5. Handling Outliers (Removing, Capping, Winsorization)

df_no_outliers = df[(z_scores < 3).all(axis=1)]

print("\nData after Removing Outliers (Z-Score Method):")
display(df_no_outliers.head())

Data after Removing Outliers (Z-Score Method):

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences  G1  G2  G3
0     1       3         6   5   6   6
1     1       3         4   5   5   6
2     3       3        10   7   8  10
3     1       5         2  15  14  15
4     2       5         4   6  10  10

def cap_outliers(column):
    """Cap outliers to the 5th and 95th percentiles."""
    lower_limit = column.quantile(0.05)
    upper_limit = column.quantile(0.95)
    return column.clip(lower=lower_limit, upper=upper_limit)

df_winsorized = df.copy()
for col in ['G1', 'G2', 'G3', 'absences']:
    df_winsorized[col] = cap_outliers(df_winsorized[col])

print("\nData after Winsorization:")
display(df_winsorized.head())

Data after Winsorization:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences  G1    G2  G3
0     1       3       6.0   6   6.0   6
1     1       3       4.0   6   5.0   6
2     3       3      10.0   7   8.0  10
3     1       5       2.0  15  14.0  15
4     2       5       4.0   6  10.0  10
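The manual percentile capping above can also be done with scipy's ready-made `winsorize` (from `scipy.stats.mstats`), which clips a given fraction of each tail to the nearest remaining value:

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Clip the bottom 10% and top 10% of values (one value at each end here).
capped = winsorize(data, limits=[0.1, 0.1])
print(capped.min(), capped.max())
```

Note the subtle difference from `quantile`-based capping: `winsorize` snaps clipped values to actual data points rather than to interpolated percentiles.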

df_replaced = df.copy()
for col in ['G1', 'G2', 'G3', 'absences']:
    median_value = df_replaced[col].median()
    df_replaced[col] = np.where(
        (df_replaced[col] < lower_bound[col]) | (df_replaced[col] > upper_bound[col]),
        median_value, df_replaced[col]
    )

print("\nData after Replacing Outliers with Median:")
display(df_replaced.head())

Data after Replacing Outliers with Median:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences    G1    G2    G3
0     1       3       6.0   5.0   6.0   6.0
1     1       3       4.0   5.0   5.0   6.0
2     3       3      10.0   7.0   8.0  10.0
3     1       5       2.0  15.0  14.0  15.0
4     2       5       4.0   6.0  10.0  10.0

6. Data Transformation

# 1. Log Transformation
df['Log_Absences'] = np.log1p(df['absences'])

# 2. Square Root Transformation
df['Sqrt_Absences'] = np.sqrt(df['absences'])

# 3. Box-Cox Transformation (only for positive values)
df['BoxCox_G3'], _ = boxcox(df['G3'] + 1)  # Adding 1 to avoid zero values

# 4. Min-Max Scaling
scaler = MinMaxScaler()
cols = ['G1', 'G2', 'G3', 'Log_Absences', 'Sqrt_Absences', 'BoxCox_G3']
df[cols] = scaler.fit_transform(df[cols])

print("\nData after Transformation:")
display(df.head())
Data after Transformation:

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher
1     GP   F   17       U     GT3       T     1     1  at_home     other
2     GP   F   15       U     LE3       T     1     1  at_home     other
3     GP   F   15       U     GT3       T     4     2   health  services
4     GP   F   16       U     GT3       T     3     3    other     other

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no
1  course   father           1          2         0        no    yes   no
2   other   mother           1          2         3       yes     no  yes
3    home   mother           1          3         0        no    yes  yes
4    home   father           1          2         0        no    yes  yes

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1
1         no      no    yes      yes       no       5         3      3     1
2         no     yes    yes      yes       no       4         3      2     2
3        yes     yes    yes      yes      yes       3         2      2     1
4         no     yes    yes       no       no       4         3      2     1

   Walc  health  absences      G1        G2    G3  Log_Absences  \
0     1       3         6  0.1250  0.315789  0.30      0.449326
1     1       3         4  0.1250  0.263158  0.30      0.371632
2     3       3        10  0.2500  0.421053  0.50      0.553693
3     1       5         2  0.7500  0.736842  0.75      0.253678
4     2       5         4  0.1875  0.526316  0.50      0.371632

   Sqrt_Absences  BoxCox_G3
0       0.282843   0.230437
1       0.230940   0.230437
2       0.365148   0.426526
3       0.163299   0.700747
4       0.230940   0.426526
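A quick way to check whether these transforms are doing their job is to compare skewness before and after; `log1p` should pull a long right tail in. A sketch on synthetic right-skewed counts similar in shape to an absences column:

```python
import numpy as np
import pandas as pd

# Right-skewed counts with one large value, as absence data often has.
absences = pd.Series([0, 0, 1, 2, 2, 3, 4, 6, 10, 30])

print(f"skew before:       {absences.skew():.2f}")
print(f"skew after log1p:  {np.log1p(absences).skew():.2f}")
```

`log1p` rather than `log` is what the notebook uses precisely because zero absences are common and `log(0)` is undefined.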

7. Saving Processed Data

df.to_csv("processed_student_data.csv", index=False)
print("Processed dataset saved as 'processed_student_data.csv'.")

Processed dataset saved as 'processed_student_data.csv'.
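The classifier imports at the top (`RandomForestClassifier`, `train_test_split`, `accuracy_score`) are never exercised in the notebook itself. A hedged sketch of how a pass/fail model could be trained on the processed data follows; the feature set and the "average of G1 and G2 at least 10 counts as a pass" rule are illustrative assumptions, not part of the original assignment, and synthetic data stands in for the CSV so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a few numeric columns of the processed dataset.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "G1": rng.integers(0, 21, 200),
    "G2": rng.integers(0, 21, 200),
    "studytime": rng.integers(1, 5, 200),
})
# Hypothetical target: pass if the G1/G2 average is at least 10.
y = ((X["G1"] + X["G2"]) / 2 >= 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

On the real data, G3 (the final grade) would supply the target, and the Min-Max-scaled features saved above could be fed in directly.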
