Student Performance Analysis

The document presents an analysis of a dataset containing 5,000 JEE aspirants, focusing on dropout rates and student performance through Exploratory Data Analysis (EDA), K-Nearest Neighbors (KNN), and K-Means clustering. Key findings include a 21% dropout rate, weak correlations between most features and dropout, and the identification of three performance clusters: high, moderate, and at-risk performers. The analysis aims to provide insights for targeted interventions and resource allocation to improve student outcomes.

Student Dataset Analysis: EDA, KNN & K-Means


A Data Science Approach to Understanding JEE Aspirants
Exploratory Data Analysis (EDA)
• Dataset contains 5,000 students with 12 features
• No missing values detected
• 21% of students dropped out after Class 12
• Dataset is clean and balanced for numeric analysis
Dropout Distribution
• Majority (79%) of students did NOT drop out
• Class imbalance exists (important for classification modeling)
• Useful for identifying dropout risk patterns
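The imbalance described above can be quantified directly with pandas; a minimal sketch on a toy column (the real dataset's `dropout` values are assumed to be 'yes'/'no', mirroring the reported 79%/21% split):

```python
import pandas as pd

# Toy stand-in mirroring the reported 79% / 21% dropout split
df = pd.DataFrame({'dropout': ['no'] * 79 + ['yes'] * 21})

counts = df['dropout'].value_counts()                     # absolute counts
proportions = df['dropout'].value_counts(normalize=True)  # class proportions

print(counts)
print(proportions)  # 'no' 0.79, 'yes' 0.21 -> imbalanced classes
```

The `normalize=True` form is what flags the imbalance at a glance before any modeling.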
Correlation Heatmap Insights
• Weak correlation between most features and dropout
• Daily Study Hours negatively correlated with dropout
• No strong multicollinearity between numeric predictors
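The negative study-hours/dropout relationship can be checked with `DataFrame.corr()`; a sketch on synthetic data (column names match the dataset, but the values here are simulated for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Simulated study hours; lower hours -> higher dropout probability
study_hours = rng.normal(4.5, 2.0, n)
p_dropout = 1 / (1 + np.exp(study_hours - 3.5))
dropout = (rng.random(n) < p_dropout).astype(int)

df = pd.DataFrame({'daily_study_hours': study_hours, 'dropout': dropout})
corr = df.corr()  # Pearson correlation matrix
print(corr)
# The off-diagonal entry is negative, matching the heatmap finding above
```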
Summary Statistics
• JEE Main Score: Mean = 71.96, Std = 13.67
• Class 12 Percent: Mean = 74.96, Std = 9.89
• Mock Test Avg: Mean = 69.91, Std = 13.65
• Daily Study Hours: Mean = 4.48, Std = 1.98
• Dropout Rate: 21%
KNN Classification Overview
• Objective: Predict dropout risk using student performance data
• Selected features: JEE scores, Mock Avg, Class 12%, Study Hours
• Applied StandardScaler to normalize features
• Chose k = 5 (from elbow method & cross-validation)
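The setup above (scaling plus k = 5) can be sketched with scikit-learn; the features here are simulated stand-ins, and `make_pipeline` keeps the scaler and classifier fitted together on the same data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
n = 300

# Simulated stand-ins for JEE Main score, Mock Avg, Class 12 %, Study Hours
X = np.column_stack([
    rng.normal(72, 14, n),
    rng.normal(70, 14, n),
    rng.normal(75, 10, n),
    rng.normal(4.5, 2.0, n),
])
# Synthetic dropout label driven mostly by low study hours
y = (X[:, 3] + rng.normal(0, 1.0, n) < 3.5).astype(int)

# Scaling matters for KNN: unscaled score columns would dominate the distance
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
preds = model.predict(X)
print("Training accuracy:", (preds == y).mean())
```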
KNN Classification Results
• Evaluated with Accuracy, Precision, Recall, F1-score
• Confusion Matrix shows model performance
• Moderate performance due to class imbalance
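The metrics named above are available in `sklearn.metrics`; a worked toy example with hypothetical imbalanced labels (8 non-dropouts, 2 dropouts) shows why accuracy alone can mislead here:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels: 8 non-dropouts (0), 2 dropouts (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 7/10
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1/3
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 1/2
f1 = f1_score(y_true, y_pred)            # harmonic mean of the two = 0.4
print(acc, prec, rec, f1)
```

Accuracy looks respectable at 0.7, but precision and recall on the minority (dropout) class are far weaker, which is exactly the effect of class imbalance noted above.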
K-Means Clustering: Objective
• Group students into performance-based clusters
• Identify: High, Moderate, and At-Risk performers
• Useful for planning custom interventions
Features & Preprocessing
• Features: JEE scores, Mock Avg, Class 12%, Study Hours
• Applied StandardScaler for normalization
• Handled missing values before clustering
Choosing Number of Clusters
• Used Elbow Method to determine optimal k
• Plotted WCSS vs. number of clusters
• Elbow observed at k = 3
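The elbow procedure above can be reproduced with scikit-learn's `inertia_` attribute, which is the WCSS; here on synthetic two-feature blobs standing in for three performance groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs (score, study hours) standing in for
# high / moderate / at-risk performers
X = np.vstack([
    rng.normal([85, 7], 1.0, (100, 2)),
    rng.normal([70, 4], 1.0, (100, 2)),
    rng.normal([50, 2], 1.0, (100, 2)),
])

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

print(wcss)  # steep drop until k = 3, then it flattens: the elbow
```

Plotting `wcss` against k gives the elbow curve; the k at which the curve bends is chosen, matching the k = 3 reported above.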
Cluster Interpretation (k=3)
• Cluster 0: High scores, high study hours – low dropout risk
• Cluster 1: Moderate scores, mixed study habits – moderate risk
• Cluster 2: Low scores, low effort – high dropout risk
Applications of Clustering
• Targeted support for Cluster 2 (at-risk group)
• Design personalized learning for Cluster 1
• Encourage excellence in Cluster 0
• Efficient allocation of mentorship resources
[Chart-only slides: Dropout Count Plot; Class 12 Percentage Distribution; JEE Main vs JEE Advanced Scores; Additional Student Insights]
Solving the first question:

# Load and examine the student dataset
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv')

# Basic info about the dataset
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
Questions 1 to 8
# Question 1: How many students have 100 in class_12_percent?
students_with_100_percent = (df['class_12_percent'] == 100).sum()
print("Question 1: Students with 100% in class 12:", students_with_100_percent)

# Question 2: What is the average of class_12_percent?


avg_class_12_percent = df['class_12_percent'].mean()
print("Question 2: Average class 12 percent:", round(avg_class_12_percent, 2))

# Question 3: What is the most frequent value in location_type?


location_type_counts = df['location_type'].value_counts()
most_frequent_location = location_type_counts.index[0]
print("Question 3: Most frequent location type:", most_frequent_location)
print("Location type counts:")
print(location_type_counts)

# Question 4: What is the median value of daily_study_hours?


median_study_hours = df['daily_study_hours'].median()
print("Question 4: Median daily study hours:", median_study_hours)

# Question 5: How many students have mental health issues?


students_with_mental_health_issues = (df['mental_health_issues'] == 'Yes').sum()
print("Question 5: Students with mental health issues:", students_with_mental_health_issues)

# Question 6: How many students with location_type Urban dropped out?
urban_dropout_students = ((df['location_type'] == 'Urban') & (df['dropout'] == 'yes')).sum()
print("Question 6: Urban students who dropped out:", urban_dropout_students)

# Question 7: How many students attempted more than once?


students_multiple_attempts = (df['attempt_count'] > 1).sum()
print("Question 7: Students who attempted more than once:", students_multiple_attempts)

# Question 8: Which school board has the highest average JEE Main score?
avg_jee_main_by_board = df.groupby('school_board')['jee_main_score'].mean()
highest_avg_board = avg_jee_main_by_board.idxmax()
print("Question 8: School board with highest average JEE Main score:", highest_avg_board)
print("Average JEE Main scores by board:")
print(avg_jee_main_by_board.round(2))
Section 2
# First, let's create the encoding mappings and apply them to the dataset
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv')

# Define the encoding mappings
encoding_mappings = {
    'school_board': {'CBSE': 0, 'ICSE': 1, 'State': 2},
    'coaching_institute': {'No Coaching': 0, 'Allen': 1, 'FIITJEE': 2, 'Local': 3},
    'family_income': {'Low': 0, 'Mid': 1, 'High': 2},
    'parent_education': {'Upto 10th': 0, '12th': 1, 'Graduate': 2, 'PG': 3},
    'location_type': {'Rural': 0, 'Semi-Urban': 1, 'Urban': 2},
    'peer_pressure_level': {'Low': 0, 'Medium': 1, 'High': 2},
    'mental_health_issues': {'No': 0, 'Yes': 1}
}

# Create a copy of the dataframe for encoding
df_encoded = df.copy()

# Apply encodings
for column, mapping in encoding_mappings.items():
    df_encoded[column] = df_encoded[column].map(mapping)

# Also encode dropout for easier processing
df_encoded['dropout'] = df_encoded['dropout'].map({'no': 0, 'yes': 1})

print("Encoded dataset shape:", df_encoded.shape)
print("First few rows of encoded data:")
print(df_encoded.head())
Section 2
# Define the new student data
new_student = {
    'id': 'ST10001',
    'jee_main_score': 61.26,
    'jee_advanced_score': 65.14,
    'mock_test_score_avg': 71.535,
    'school_board': 1,           # ICSE
    'class_12_percent': 71.85,
    'attempt_count': 1,
    'coaching_institute': 3,     # Local
    'daily_study_hours': 4.05,
    'family_income': 1,          # Mid
    'parent_education': 3,       # PG
    'location_type': 0,          # Rural
    'peer_pressure_level': 2,    # High
    'mental_health_issues': 1    # Yes
}

# Select the features for distance calculation (excluding id and dropout)


feature_columns = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']

# Extract new student features


new_student_features = [new_student[col] for col in feature_columns]

print("New student features:", new_student_features)


print("Feature columns:", feature_columns)
Section 2
# Calculate Euclidean distances between the new student and all existing students
distances = []

for idx, row in df_encoded.iterrows():
    # Extract features for the current student
    current_student_features = [row[col] for col in feature_columns]

    # Calculate Euclidean distance
    distance = euclidean(new_student_features, current_student_features)
    distances.append({
        'student_id': row['id'],
        'distance': distance,
        'dropout': row['dropout']
    })

# Convert to DataFrame and sort by distance
distances_df = pd.DataFrame(distances)
distances_df = distances_df.sort_values('distance').reset_index(drop=True)

print("Top 10 nearest neighbors:")
print(distances_df.head(10))

# Get the 3 nearest neighbors
top_3_neighbors = distances_df.head(3)['student_id'].tolist()
print("\nTop 3 nearest neighbors (Student IDs):")
for i, student_id in enumerate(top_3_neighbors, 1):
    print(f"{i}. {student_id}")
Section 2
# K-NN with K=5 for dropout prediction
k5_neighbors = distances_df.head(5)
k5_dropout_values = k5_neighbors['dropout'].tolist()
k5_prediction = 1 if sum(k5_dropout_values) > 2.5 else 0  # majority vote among 5 neighbors

print("K-NN with K=5:")
print("Top 5 neighbors and their dropout values:")
for i, row in k5_neighbors.iterrows():
    print(f"  {row['student_id']}: dropout = {row['dropout']}")

print(f"\nDropout values: {k5_dropout_values}")
print(f"Sum of dropout values: {sum(k5_dropout_values)}")
print(f"K=5 Prediction (1 for Yes, 0 for No): {k5_prediction}")

# K-NN with K=20
k20_neighbors = distances_df.head(20)
k20_dropout_values = k20_neighbors['dropout'].tolist()
k20_dropout_0_count = k20_dropout_values.count(0)

print(f"\nK-NN with K=20:")
print(f"Number of neighbors with dropout = 0: {k20_dropout_0_count}")
print(f"Number of neighbors with dropout = 1: {k20_dropout_values.count(1)}")

print("\nTop 20 neighbors:")
print(k20_neighbors[['student_id', 'distance', 'dropout']])
Section 3
# First, let's find the initial cluster centers ST7612 and ST9269
st7612_data = df_encoded[df_encoded['id'] == 'ST7612']
st9269_data = df_encoded[df_encoded['id'] == 'ST9269']

print("ST7612 (C1) data:")
print(st7612_data)
print("\nST9269 (C2) data:")
print(st9269_data)

# Extract features for cluster centers (excluding id and dropout)
feature_columns = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
                   'school_board', 'class_12_percent', 'attempt_count',
                   'coaching_institute', 'daily_study_hours', 'family_income',
                   'parent_education', 'location_type', 'peer_pressure_level',
                   'mental_health_issues']

c1_features = st7612_data[feature_columns].values[0]
c2_features = st9269_data[feature_columns].values[0]

print("\nC1 (ST7612) features:", c1_features)
print("C2 (ST9269) features:", c2_features)
Section 3
# Question 1: Distance between ST5001 and cluster center C2 (ST9269)
st5001_data = df_encoded[df_encoded['id'] == 'ST5001']
st5001_features = st5001_data[feature_columns].values[0]

print("ST5001 features:", st5001_features)
print("C2 (ST9269) features:", c2_features)

# Calculate Euclidean distance between ST5001 and C2
distance_st5001_c2 = euclidean(st5001_features, c2_features)
print(f"\nDistance between ST5001 and C2 (ST9269): {distance_st5001_c2:.2f}")

# Question 2: Which cluster would ST5745 be assigned to?
st5745_data = df_encoded[df_encoded['id'] == 'ST5745']
st5745_features = st5745_data[feature_columns].values[0]

print("\nST5745 features:", st5745_features)

# Calculate distances to both cluster centers
distance_st5745_c1 = euclidean(st5745_features, c1_features)
distance_st5745_c2 = euclidean(st5745_features, c2_features)

print(f"Distance from ST5745 to C1 (ST7612): {distance_st5745_c1:.2f}")
print(f"Distance from ST5745 to C2 (ST9269): {distance_st5745_c2:.2f}")

# Assign to the nearer center
if distance_st5745_c1 < distance_st5745_c2:
    assigned_cluster = "Cluster 1 (ST7612)"
else:
    assigned_cluster = "Cluster 2 (ST9269)"

print(f"ST5745 would be assigned to: {assigned_cluster}")


Section 3
# Question 3: Perform cluster assignment for all students, then recompute cluster centers
cluster_assignments = []

for idx, row in df_encoded.iterrows():
    student_features = row[feature_columns].values

    # Calculate distances to both cluster centers
    dist_c1 = euclidean(student_features, c1_features)
    dist_c2 = euclidean(student_features, c2_features)

    # Assign to the closest cluster
    if dist_c1 < dist_c2:
        assigned_cluster = 1
    else:
        assigned_cluster = 2

    cluster_assignments.append({
        'id': row['id'],
        'cluster': assigned_cluster,
        'features': student_features
    })

# Convert to DataFrame for easier manipulation
assignments_df = pd.DataFrame(cluster_assignments)

# Count assignments
cluster1_count = len(assignments_df[assignments_df['cluster'] == 1])
cluster2_count = len(assignments_df[assignments_df['cluster'] == 2])

print(f"Cluster 1 assignments: {cluster1_count}")
print(f"Cluster 2 assignments: {cluster2_count}")

# Recompute cluster centers
cluster1_students = assignments_df[assignments_df['cluster'] == 1]
cluster2_students = assignments_df[assignments_df['cluster'] == 2]

# Calculate new cluster centers (mean of all assigned points)
new_c1 = np.mean(np.vstack(cluster1_students['features'].values), axis=0)
new_c2 = np.mean(np.vstack(cluster2_students['features'].values), axis=0)
Section 3
# Format the new cluster centers in a more readable way
print("New Cluster Centers (formatted):")
print("\nC1 (new center):")
feature_names = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
                 'school_board', 'class_12_percent', 'attempt_count',
                 'coaching_institute', 'daily_study_hours', 'family_income',
                 'parent_education', 'location_type', 'peer_pressure_level',
                 'mental_health_issues']

for i, feature in enumerate(feature_names):
    print(f"{feature}: {new_c1[i]:.2f}")

print("\nC2 (new center):")
for i, feature in enumerate(feature_names):
    print(f"{feature}: {new_c2[i]:.2f}")

# Create a summary table
cluster_centers_df = pd.DataFrame({
    'Feature': feature_names,
    'C1_New': [round(val, 2) for val in new_c1],
    'C2_New': [round(val, 2) for val in new_c2]
})

print("\nCluster Centers Summary Table:")
print(cluster_centers_df)
Section 3
# Load and examine the student dataset
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv', encoding='ascii')

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nColumn information:")
print(df.info())
print("\nBasic statistics:")
print(df.describe())