Student Dataset Analysis: EDA, KNN & K-Means
A Data Science Approach to Understanding JEE Aspirants
Exploratory Data Analysis (EDA)
• Dataset contains 5,000 students with 12 features
• No missing values detected
• 21% of students dropped out after Class 12
• Dataset is clean and ready for numeric analysis
Dropout Distribution
• Majority (79%) of students did NOT drop out
• Class imbalance exists (important for classification modeling)
• Useful for identifying dropout risk patterns
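The 79/21 split above can be reproduced with `value_counts`; this is a minimal sketch using a synthetic stand-in for the real dataframe (the `dropout` column is assumed to hold 'yes'/'no' strings, as in the code later in this document):

```python
import pandas as pd

# Synthetic stand-in for the real dataset: 79 non-dropouts, 21 dropouts
df = pd.DataFrame({'dropout': ['no'] * 79 + ['yes'] * 21})

# Absolute counts and class proportions
counts = df['dropout'].value_counts()
proportions = df['dropout'].value_counts(normalize=True)

print(counts)       # 'no': 79, 'yes': 21
print(proportions)  # 'no': 0.79, 'yes': 0.21
```

The `normalize=True` form gives the class proportions directly, which is what the imbalance note above refers to.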
Correlation Heatmap Insights
• Weak correlation between most features and dropout
• Daily Study Hours negatively correlated with dropout
• No strong multicollinearity between numeric predictors
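The heatmap values come from a Pearson correlation matrix. The sketch below uses deterministic toy data (not the real dataset) in which low study hours coincide with dropout, mirroring the negative correlation noted above:

```python
import numpy as np
import pandas as pd

# Toy data mirroring the heatmap insight: students with fewer study hours
# drop out more often (deterministic stand-in, not the real dataset)
hours = np.linspace(1, 8, 100)
dropout = (hours < 3.5).astype(int)
df = pd.DataFrame({'daily_study_hours': hours, 'dropout': dropout})

# Pearson correlation matrix over numeric columns, as a heatmap would display
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

On the real data the off-diagonal values would be weakly negative rather than strongly so, consistent with the "weak correlation" bullet.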
Summary Statistics
• JEE Main Score: Mean = 71.96, Std = 13.67
• Class 12 Percent: Mean = 74.96, Std = 9.89
• Mock Test Avg: Mean = 69.91, Std = 13.65
• Daily Study Hours: Mean = 4.48, Std = 1.98
• Dropout Rate: 21%
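A table like the one above can be produced in one call. This sketch draws synthetic data with roughly the reported means and standard deviations, since the real CSV is assumed unavailable here:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with roughly the means/stds reported on the slide
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'jee_main_score': rng.normal(71.96, 13.67, 5000),
    'class_12_percent': rng.normal(74.96, 9.89, 5000),
    'mock_test_score_avg': rng.normal(69.91, 13.65, 5000),
    'daily_study_hours': rng.normal(4.48, 1.98, 5000),
})

# One agg call yields the mean/std table summarised above
summary = df.agg(['mean', 'std']).round(2)
print(summary)
```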
KNN Classification Overview
• Objective: Predict dropout risk using student performance data
• Selected features: JEE scores, Mock Avg, Class 12%, Study Hours
• Applied StandardScaler to normalize features
• Chose k = 5 (from elbow method & cross-validation)
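The setup above (StandardScaler feeding a k=5 classifier) can be sketched with scikit-learn; synthetic features stand in for the real columns, so the accuracy printed is illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))      # stand-ins for JEE, Mock Avg, Class 12%, Study Hours
y = (X[:, 3] < -0.5).astype(int)   # toy dropout label driven by the "study hours" column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scaling inside a pipeline keeps train/test preprocessing consistent
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print("Test accuracy:", round(acc, 3))
```

Putting the scaler inside the pipeline (rather than scaling the whole dataset first) avoids leaking test-set statistics into training.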
KNN Classification Results
• Evaluated with Accuracy, Precision, Recall, F1-score
• Confusion Matrix shows model performance
• Moderate performance due to class imbalance
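The metrics listed above come from `sklearn.metrics`. This sketch evaluates hypothetical predictions on an imbalanced toy split like the 79/21 one, not the real model output:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels/predictions with a minority positive (dropout) class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("Precision:", precision_score(y_true, y_pred))  # 0.5
print("Recall   :", recall_score(y_true, y_pred))     # 0.5
print("F1       :", f1_score(y_true, y_pred))         # 0.5
print(confusion_matrix(y_true, y_pred))               # [[7 1] [1 1]]
```

Note how accuracy (0.8) looks fine while precision/recall on the minority class are much lower; this is why the slide flags class imbalance.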
K-Means Clustering: Objective
• Group students into performance-based clusters
• Identify: High, Moderate, and At-Risk performers
• Useful for planning custom interventions
Features & Preprocessing
• Features: JEE scores, Mock Avg, Class 12%, Study Hours
• Applied StandardScaler for normalization
• Verified there were no missing values before clustering
Choosing Number of Clusters
• Used Elbow Method to determine optimal k
• Plotted WCSS vs. number of clusters
• Elbow observed at k = 3
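The elbow plot described above tracks WCSS (scikit-learn's `inertia_`) as k grows. A sketch on synthetic data with three obvious groups, standing in for the student features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic groups in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

for k, w in zip(range(1, 7), wcss):
    print(f"k={k}: WCSS={w:.1f}")
# WCSS drops sharply up to k=3, then flattens: the elbow
```

Plotting `wcss` against k (e.g. with matplotlib) reproduces the elbow chart; the bend marks the point where extra clusters stop paying for themselves.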
Cluster Interpretation (k=3)
• Cluster 0: High scores, high study hours – low dropout risk
• Cluster 1: Moderate scores, mixed study habits – moderate risk
• Cluster 2: Low scores, low effort – high dropout risk
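Profiles like the ones above are typically read off the per-cluster feature means. A minimal sketch with toy cluster labels and two of the features (not the real assignments):

```python
import pandas as pd

# Toy labelled data: two students per cluster, two of the features
df = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 2, 2],
    'jee_main_score':    [90, 88, 70, 72, 50, 48],
    'daily_study_hours': [7, 6.5, 4, 4.5, 2, 1.5],
})

# Mean feature values per cluster give each cluster its "profile"
profile = df.groupby('cluster').mean().round(2)
print(profile)
# Cluster 0 has the highest scores/hours, cluster 2 the lowest,
# matching the High / Moderate / At-Risk labels above
```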
Applications of Clustering
• Targeted support for Cluster 2 (at-risk group)
• Design personalized learning for Cluster 1
• Encourage excellence in Cluster 0
• Efficient allocation of mentorship resources
Visualizations (figure slides)
• Dropout Count Plot
• Class 12 Percentage Distribution
• JEE Main vs JEE Advanced Scores
• Additional Student Insights
Solving the first question:
# Load and examine the student dataset
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv')

# Basic info about the dataset
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
Questions 1 to 8
# Question 1: How many students have 100 in class_12_percent?
students_with_100_percent = (df['class_12_percent'] == 100).sum()
print("Question 1: Students with 100% in class 12:", students_with_100_percent)
# Question 2: What is the average of class_12_percent?
avg_class_12_percent = df['class_12_percent'].mean()
print("Question 2: Average class 12 percent:", round(avg_class_12_percent, 2))
# Question 3: What is the most frequent value in location_type?
location_type_counts = df['location_type'].value_counts()
most_frequent_location = location_type_counts.index[0]
print("Question 3: Most frequent location type:", most_frequent_location)
print("Location type counts:")
print(location_type_counts)
# Question 4: What is the median value of daily_study_hours?
median_study_hours = df['daily_study_hours'].median()
print("Question 4: Median daily study hours:", median_study_hours)
# Question 5: How many students have mental health issues?
students_with_mental_health_issues = (df['mental_health_issues'] == 'Yes').sum()
print("Question 5: Students with mental health issues:", students_with_mental_health_issues)
# Question 6: How many students whose location_type is Urban and they dropout?
urban_dropout_students = ((df['location_type'] == 'Urban') & (df['dropout'] == 'yes')).sum()
print("Question 6: Urban students who dropped out:", urban_dropout_students)
# Question 7: How many students attempted more than once?
students_multiple_attempts = (df['attempt_count'] > 1).sum()
print("Question 7: Students who attempted more than once:", students_multiple_attempts)
# Question 8: Which school board has the highest average JEE Main score?
avg_jee_main_by_board = df.groupby('school_board')['jee_main_score'].mean()
highest_avg_board = avg_jee_main_by_board.idxmax()
print("Question 8: School board with highest average JEE Main score:", highest_avg_board)
print("Average JEE Main scores by board:")
print(avg_jee_main_by_board.round(2))
Section 2
# First, let's create the encoding mappings and apply them to the dataset
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
# Load the dataset
df = pd.read_csv('updated_student_dataset.csv')
# Define the encoding mappings
encoding_mappings = {
'school_board': {'CBSE': 0, 'ICSE': 1, 'State': 2},
'coaching_institute': {'No Coaching': 0, 'Allen': 1, 'FIITJEE': 2, 'Local': 3},
'family_income': {'Low': 0, 'Mid': 1, 'High': 2},
'parent_education': {'Upto 10th': 0, '12th': 1, 'Graduate': 2, 'PG': 3},
'location_type': {'Rural': 0, 'Semi-Urban': 1, 'Urban': 2},
'peer_pressure_level': {'Low': 0, 'Medium': 1, 'High': 2},
'mental_health_issues': {'No': 0, 'Yes': 1}
}
# Create a copy of the dataframe for encoding
df_encoded = df.copy()
# Apply encodings
for column, mapping in encoding_mappings.items():
    df_encoded[column] = df_encoded[column].map(mapping)
# Also encode dropout for easier processing
df_encoded['dropout'] = df_encoded['dropout'].map({'no': 0, 'yes': 1})
print("Encoded dataset shape:", df_encoded.shape)
print("First few rows of encoded data:")
print(df_encoded.head())
# Define the new student data
new_student = {
'id': 'ST10001',
'jee_main_score': 61.26,
'jee_advanced_score': 65.14,
'mock_test_score_avg': 71.535,
'school_board': 1, # ICSE
'class_12_percent': 71.85,
'attempt_count': 1,
'coaching_institute': 3, # Local
'daily_study_hours': 4.05,
'family_income': 1, # Mid
'parent_education': 3, # PG
'location_type': 0, # Rural
'peer_pressure_level': 2, # High
'mental_health_issues': 1 # Yes
}
# Select the features for distance calculation (excluding id and dropout)
feature_columns = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']
# Extract new student features
new_student_features = [new_student[col] for col in feature_columns]
print("New student features:", new_student_features)
print("Feature columns:", feature_columns)
# Calculate Euclidean distances between the new student and all existing students
distances = []
for idx, row in df_encoded.iterrows():
    # Extract features for the current student
    current_student_features = [row[col] for col in feature_columns]
    # Calculate Euclidean distance
    distance = euclidean(new_student_features, current_student_features)
    distances.append({
        'student_id': row['id'],
        'distance': distance,
        'dropout': row['dropout']
    })

# Convert to DataFrame and sort by distance
distances_df = pd.DataFrame(distances)
distances_df = distances_df.sort_values('distance').reset_index(drop=True)
print("Top 10 nearest neighbors:")
print(distances_df.head(10))
# Get the 3 nearest neighbors
top_3_neighbors = distances_df.head(3)['student_id'].tolist()
print("\nTop 3 nearest neighbors (Student IDs):")
for i, student_id in enumerate(top_3_neighbors, 1):
    print(f"{i}. {student_id}")
# K-NN with K=5 for dropout prediction
k5_neighbors = distances_df.head(5)
k5_dropout_values = k5_neighbors['dropout'].tolist()
k5_prediction = 1 if sum(k5_dropout_values) > 2.5 else 0
print("K-NN with K=5:")
print("Top 5 neighbors and their dropout values:")
for i, row in k5_neighbors.iterrows():
    print(f"  {row['student_id']}: dropout = {row['dropout']}")
print(f"\nDropout values: {k5_dropout_values}")
print(f"Sum of dropout values: {sum(k5_dropout_values)}")
print(f"K=5 Prediction (1 for Yes, 0 for No): {k5_prediction}")
# K-NN with K=20
k20_neighbors = distances_df.head(20)
k20_dropout_values = k20_neighbors['dropout'].tolist()
k20_dropout_0_count = k20_dropout_values.count(0)
print("\nK-NN with K=20:")
print(f"Number of neighbors with dropout = 0: {k20_dropout_0_count}")
print(f"Number of neighbors with dropout = 1: {k20_dropout_values.count(1)}")
print("\nTop 20 neighbors:")
print(k20_neighbors[['student_id', 'distance', 'dropout']])
Section 3
# First, let's find the initial cluster centers ST7612 and ST9269
st7612_data = df_encoded[df_encoded['id'] == 'ST7612']
st9269_data = df_encoded[df_encoded['id'] == 'ST9269']
print("ST7612 (C1) data:")
print(st7612_data)
print("\nST9269 (C2) data:")
print(st9269_data)
# Extract features for cluster centers (excluding id and dropout)
feature_columns = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']
c1_features = st7612_data[feature_columns].values[0]
c2_features = st9269_data[feature_columns].values[0]
print("\nC1 (ST7612) features:", c1_features)
print("C2 (ST9269) features:", c2_features)
# Question 1: Distance between ST5001 and cluster center C2 (ST9269)
st5001_data = df_encoded[df_encoded['id'] == 'ST5001']
st5001_features = st5001_data[feature_columns].values[0]
print("ST5001 features:", st5001_features)
print("C2 (ST9269) features:", c2_features)
# Calculate Euclidean distance between ST5001 and C2
distance_st5001_c2 = euclidean(st5001_features, c2_features)
print(f"\nDistance between ST5001 and C2 (ST9269): {distance_st5001_c2:.2f}")
# Question 2: Which cluster would ST5745 be assigned to?
st5745_data = df_encoded[df_encoded['id'] == 'ST5745']
st5745_features = st5745_data[feature_columns].values[0]
print("\nST5745 features:", st5745_features)
# Calculate distances to both cluster centers
distance_st5745_c1 = euclidean(st5745_features, c1_features)
distance_st5745_c2 = euclidean(st5745_features, c2_features)
print(f"Distance from ST5745 to C1 (ST7612): {distance_st5745_c1:.2f}")
print(f"Distance from ST5745 to C2 (ST9269): {distance_st5745_c2:.2f}")
if distance_st5745_c1 < distance_st5745_c2:
    assigned_cluster = "Cluster 1 (ST7612)"
else:
    assigned_cluster = "Cluster 2 (ST9269)"
print(f"ST5745 would be assigned to: {assigned_cluster}")
# Question 3: Perform cluster assignment for all students and then recompute cluster centers
# Calculate distances for all students to both cluster centers
cluster_assignments = []
for idx, row in df_encoded.iterrows():
    student_features = row[feature_columns].values
    # Calculate distances to both cluster centers
    dist_c1 = euclidean(student_features, c1_features)
    dist_c2 = euclidean(student_features, c2_features)
    # Assign to the closest cluster
    if dist_c1 < dist_c2:
        assigned_cluster = 1
    else:
        assigned_cluster = 2
    cluster_assignments.append({
        'id': row['id'],
        'cluster': assigned_cluster,
        'features': student_features
    })

# Convert to DataFrame for easier manipulation
assignments_df = pd.DataFrame(cluster_assignments)
# Count assignments
cluster1_count = len(assignments_df[assignments_df['cluster'] == 1])
cluster2_count = len(assignments_df[assignments_df['cluster'] == 2])
print(f"Cluster 1 assignments: {cluster1_count}")
print(f"Cluster 2 assignments: {cluster2_count}")
# Recompute cluster centers
cluster1_students = assignments_df[assignments_df['cluster'] == 1]
cluster2_students = assignments_df[assignments_df['cluster'] == 2]

# Calculate new cluster centers (mean of all assigned points)
new_c1 = np.mean(np.vstack(cluster1_students['features'].values), axis=0)
new_c2 = np.mean(np.vstack(cluster2_students['features'].values), axis=0)
# Format the new cluster centers in a more readable way
print("New Cluster Centers (formatted):")
print("\nC1 (new center):")
feature_names = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']
for i, feature in enumerate(feature_names):
    print(f"{feature}: {new_c1[i]:.2f}")

print("\nC2 (new center):")
for i, feature in enumerate(feature_names):
    print(f"{feature}: {new_c2[i]:.2f}")
# Create a summary table
import pandas as pd
cluster_centers_df = pd.DataFrame({
'Feature': feature_names,
'C1_New': [round(val, 2) for val in new_c1],
'C2_New': [round(val, 2) for val in new_c2]
})
print("\nCluster Centers Summary Table:")
print(cluster_centers_df)
# Load and examine the student dataset
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv', encoding='ascii')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nColumn information:")
df.info()  # info() prints directly and returns None
print("\nBasic statistics:")
print(df.describe())