Student Dataset Analysis: EDA, KNN & K-Means
A Data Science Approach to Understanding JEE Aspirants
Exploratory Data Analysis (EDA)
• Dataset contains 5,000 students with 12 features
• No missing values detected
• 21% of students dropped out after Class 12
• Dataset is clean and ready for numeric analysis
Dropout Distribution
• Majority (79%) of students did NOT drop out
• Class imbalance exists (important for classification modeling)
• Useful for identifying dropout risk patterns
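The 79/21 split above can be reproduced with `value_counts`; this is a minimal sketch using a synthetic stand-in for the real dataframe (the `dropout` column is assumed to hold 'yes'/'no' strings, as in the code later in this document):

```python
import pandas as pd

# Synthetic stand-in for the real dataset: 79 non-dropouts, 21 dropouts
df = pd.DataFrame({'dropout': ['no'] * 79 + ['yes'] * 21})

# Absolute counts and class proportions
counts = df['dropout'].value_counts()
proportions = df['dropout'].value_counts(normalize=True)

print(counts)       # 'no': 79, 'yes': 21
print(proportions)  # 'no': 0.79, 'yes': 0.21
```

The `normalize=True` form gives the class proportions directly, which is what the imbalance note above refers to.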
Correlation Heatmap Insights
• Weak correlation between most features and dropout
• Daily Study Hours negatively correlated with dropout
• No strong multicollinearity between numeric predictors
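The heatmap values come from a Pearson correlation matrix. The sketch below uses deterministic toy data (not the real dataset) in which low study hours coincide with dropout, mirroring the negative correlation noted above:

```python
import numpy as np
import pandas as pd

# Toy data mirroring the heatmap insight: students with fewer study hours
# drop out more often (deterministic stand-in, not the real dataset)
hours = np.linspace(1, 8, 100)
dropout = (hours < 3.5).astype(int)
df = pd.DataFrame({'daily_study_hours': hours, 'dropout': dropout})

# Pearson correlation matrix over numeric columns, as a heatmap would display
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

On the real data the off-diagonal values would be weakly negative rather than strongly so, consistent with the "weak correlation" bullet.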
Summary Statistics
• JEE Main Score: Mean = 71.96, Std = 13.67
• Class 12 Percent: Mean = 74.96, Std = 9.89
• Mock Test Avg: Mean = 69.91, Std = 13.65
• Daily Study Hours: Mean = 4.48, Std = 1.98
• Dropout Rate: 21%
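A table like the one above can be produced in one call. This sketch draws synthetic data with roughly the reported means and standard deviations, since the real CSV is assumed unavailable here:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with roughly the means/stds reported on the slide
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'jee_main_score': rng.normal(71.96, 13.67, 5000),
    'class_12_percent': rng.normal(74.96, 9.89, 5000),
    'mock_test_score_avg': rng.normal(69.91, 13.65, 5000),
    'daily_study_hours': rng.normal(4.48, 1.98, 5000),
})

# One agg call yields the mean/std table summarised above
summary = df.agg(['mean', 'std']).round(2)
print(summary)
```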
KNN Classification Overview
• Objective: Predict dropout risk using student performance data
• Selected features: JEE scores, Mock Avg, Class 12%, Study Hours
• Applied StandardScaler to normalize features
• Chose k = 5 (from elbow method & cross-validation)
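The setup above (StandardScaler feeding a k=5 classifier) can be sketched with scikit-learn; synthetic features stand in for the real columns, so the accuracy printed is illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))      # stand-ins for JEE, Mock Avg, Class 12%, Study Hours
y = (X[:, 3] < -0.5).astype(int)   # toy dropout label driven by the "study hours" column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scaling inside a pipeline keeps train/test preprocessing consistent
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print("Test accuracy:", round(acc, 3))
```

Putting the scaler inside the pipeline (rather than scaling the whole dataset first) avoids leaking test-set statistics into training.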
KNN Classification Results
• Evaluated with Accuracy, Precision, Recall, F1-score
• Confusion Matrix shows model performance
• Moderate performance due to class imbalance
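The metrics listed above come from `sklearn.metrics`. This sketch evaluates hypothetical predictions on an imbalanced toy split like the 79/21 one, not the real model output:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels/predictions with a minority positive (dropout) class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("Precision:", precision_score(y_true, y_pred))  # 0.5
print("Recall   :", recall_score(y_true, y_pred))     # 0.5
print("F1       :", f1_score(y_true, y_pred))         # 0.5
print(confusion_matrix(y_true, y_pred))               # [[7 1] [1 1]]
```

Note how accuracy (0.8) looks fine while precision/recall on the minority class are much lower; this is why the slide flags class imbalance.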
K-Means Clustering: Objective
• Group students into performance-based clusters
• Identify: High, Moderate, and At-Risk performers
• Useful for planning custom interventions
Features & Preprocessing
• Features: JEE scores, Mock Avg, Class 12%, Study Hours
• Applied StandardScaler for normalization
• Verified there were no missing values before clustering
Choosing Number of Clusters
• Used Elbow Method to determine optimal k
• Plotted WCSS vs. number of clusters
• Elbow observed at k = 3
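The elbow plot described above tracks WCSS (scikit-learn's `inertia_`) as k grows. A sketch on synthetic data with three obvious groups, standing in for the student features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic groups in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

for k, w in zip(range(1, 7), wcss):
    print(f"k={k}: WCSS={w:.1f}")
# WCSS drops sharply up to k=3, then flattens: the elbow
```

Plotting `wcss` against k (e.g. with matplotlib) reproduces the elbow chart; the bend marks the point where extra clusters stop paying for themselves.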
Cluster Interpretation (k=3)
• Cluster 0: High scores, high study hours – low dropout risk
• Cluster 1: Moderate scores, mixed study habits – moderate risk
• Cluster 2: Low scores, low effort – high dropout risk
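Profiles like the ones above are typically read off the per-cluster feature means. A minimal sketch with toy cluster labels and two of the features (not the real assignments):

```python
import pandas as pd

# Toy labelled data: two students per cluster, two of the features
df = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 2, 2],
    'jee_main_score':    [90, 88, 70, 72, 50, 48],
    'daily_study_hours': [7, 6.5, 4, 4.5, 2, 1.5],
})

# Mean feature values per cluster give each cluster its "profile"
profile = df.groupby('cluster').mean().round(2)
print(profile)
# Cluster 0 has the highest scores/hours, cluster 2 the lowest,
# matching the High / Moderate / At-Risk labels above
```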
Applications of Clustering
• Targeted support for Cluster 2 (at-risk group)
• Design personalized learning for Cluster 1
• Encourage excellence in Cluster 0
• Efficient allocation of mentorship resources
Visualizations (figure slides)
• Dropout Count Plot
• Class 12 Percentage Distribution
• JEE Main vs JEE Advanced Scores
• Additional Student Insights
Solving the first question:
# Load and examine the student dataset
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv')

# Basic info about the dataset
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
Questions 1 to 8
# Question 1: How many students have 100 in class_12_percent?
students_with_100_percent = (df['class_12_percent'] == 100).sum()
print("Question 1: Students with 100% in class 12:", students_with_100_percent)
# Question 2: What is the average of class_12_percent?
avg_class_12_percent = df['class_12_percent'].mean()
print("Question 2: Average class 12 percent:", round(avg_class_12_percent, 2))
# Question 3: What is the most frequent value in location_type?
location_type_counts = df['location_type'].value_counts()
most_frequent_location = location_type_counts.index[0]
print("Question 3: Most frequent location type:", most_frequent_location)
print("Location type counts:")
print(location_type_counts)
# Question 4: What is the median value of daily_study_hours?
median_study_hours = df['daily_study_hours'].median()
print("Question 4: Median daily study hours:", median_study_hours)
# Question 5: How many students have mental health issues?
students_with_mental_health_issues = (df['mental_health_issues'] == 'Yes').sum()
print("Question 5: Students with mental health issues:", students_with_mental_health_issues)
# Question 6: How many students whose location_type is Urban and they dropout?
urban_dropout_students = ((df['location_type'] == 'Urban') & (df['dropout'] == 'yes')).sum()
print("Question 6: Urban students who dropped out:", urban_dropout_students)
# Question 7: How many students attempted more than once?
students_multiple_attempts = (df['attempt_count'] > 1).sum()
print("Question 7: Students who attempted more than once:", students_multiple_attempts)
# Question 8: Which school board has the highest average JEE Main score?
avg_jee_main_by_board = df.groupby('school_board')['jee_main_score'].mean()
highest_avg_board = avg_jee_main_by_board.idxmax()
print("Question 8: School board with highest average JEE Main score:", highest_avg_board)
print("Average JEE Main scores by board:")
print(avg_jee_main_by_board.round(2))
Section 2
# First, let's create the encoding mappings and apply them to the dataset
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
# Load the dataset
df = pd.read_csv('updated_student_dataset.csv')
# Define the encoding mappings
encoding_mappings = {
'school_board': {'CBSE': 0, 'ICSE': 1, 'State': 2},
'coaching_institute': {'No Coaching': 0, 'Allen': 1, 'FIITJEE': 2, 'Local': 3},
'family_income': {'Low': 0, 'Mid': 1, 'High': 2},
'parent_education': {'Upto 10th': 0, '12th': 1, 'Graduate': 2, 'PG': 3},
'location_type': {'Rural': 0, 'Semi-Urban': 1, 'Urban': 2},
'peer_pressure_level': {'Low': 0, 'Medium': 1, 'High': 2},
'mental_health_issues': {'No': 0, 'Yes': 1}
}
# Create a copy of the dataframe for encoding
df_encoded = df.copy()
# Apply encodings
for column, mapping in encoding_mappings.items():
    df_encoded[column] = df_encoded[column].map(mapping)
# Also encode dropout for easier processing
df_encoded['dropout'] = df_encoded['dropout'].map({'no': 0, 'yes': 1})
print("Encoded dataset shape:", df_encoded.shape)
print("First few rows of encoded data:")
print(df_encoded.head())
# Define the new student data
new_student = {
'id': 'ST10001',
'jee_main_score': 61.26,
'jee_advanced_score': 65.14,
'mock_test_score_avg': 71.535,
'school_board': 1, # ICSE
'class_12_percent': 71.85,
'attempt_count': 1,
'coaching_institute': 3, # Local
'daily_study_hours': 4.05,
'family_income': 1, # Mid
'parent_education': 3, # PG
'location_type': 0, # Rural
'peer_pressure_level': 2, # High
'mental_health_issues': 1 # Yes
}
# Select the features for distance calculation (excluding id and dropout)
feature_columns = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']
# Extract new student features
new_student_features = [new_student[col] for col in feature_columns]
print("New student features:", new_student_features)
print("Feature columns:", feature_columns)
# Calculate Euclidean distances between the new student and all existing students
distances = []
for idx, row in df_encoded.iterrows():
    # Extract features for the current student
    current_student_features = [row[col] for col in feature_columns]
    # Calculate Euclidean distance
    distance = euclidean(new_student_features, current_student_features)
    distances.append({
        'student_id': row['id'],
        'distance': distance,
        'dropout': row['dropout']
    })

# Convert to DataFrame and sort by distance
distances_df = pd.DataFrame(distances)
distances_df = distances_df.sort_values('distance').reset_index(drop=True)
print("Top 10 nearest neighbors:")
print(distances_df.head(10))
# Get the 3 nearest neighbors
top_3_neighbors = distances_df.head(3)['student_id'].tolist()
print("\nTop 3 nearest neighbors (Student IDs):")
for i, student_id in enumerate(top_3_neighbors, 1):
    print(f"{i}. {student_id}")
# K-NN with K=5 for dropout prediction
k5_neighbors = distances_df.head(5)
k5_dropout_values = k5_neighbors['dropout'].tolist()
k5_prediction = 1 if sum(k5_dropout_values) > 2.5 else 0
print("K-NN with K=5:")
print("Top 5 neighbors and their dropout values:")
for i, row in k5_neighbors.iterrows():
    print(f"  {row['student_id']}: dropout = {row['dropout']}")
print(f"\nDropout values: {k5_dropout_values}")
print(f"Sum of dropout values: {sum(k5_dropout_values)}")
print(f"K=5 Prediction (1 for Yes, 0 for No): {k5_prediction}")
# K-NN with K=20
k20_neighbors = distances_df.head(20)
k20_dropout_values = k20_neighbors['dropout'].tolist()
k20_dropout_0_count = k20_dropout_values.count(0)
print("\nK-NN with K=20:")
print(f"Number of neighbors with dropout = 0: {k20_dropout_0_count}")
print(f"Number of neighbors with dropout = 1: {k20_dropout_values.count(1)}")
print("\nTop 20 neighbors:")
print(k20_neighbors[['student_id', 'distance', 'dropout']])
Section 3
# First, let's find the initial cluster centers ST7612 and ST9269
st7612_data = df_encoded[df_encoded['id'] == 'ST7612']
st9269_data = df_encoded[df_encoded['id'] == 'ST9269']
print("ST7612 (C1) data:")
print(st7612_data)
print("\nST9269 (C2) data:")
print(st9269_data)
# Extract features for cluster centers (excluding id and dropout)
feature_columns = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']
c1_features = st7612_data[feature_columns].values[0]
c2_features = st9269_data[feature_columns].values[0]
print("\nC1 (ST7612) features:", c1_features)
print("C2 (ST9269) features:", c2_features)
# Question 1: Distance between ST5001 and cluster center C2 (ST9269)
st5001_data = df_encoded[df_encoded['id'] == 'ST5001']
st5001_features = st5001_data[feature_columns].values[0]
print("ST5001 features:", st5001_features)
print("C2 (ST9269) features:", c2_features)
# Calculate Euclidean distance between ST5001 and C2
distance_st5001_c2 = euclidean(st5001_features, c2_features)
print(f"\nDistance between ST5001 and C2 (ST9269): {distance_st5001_c2:.2f}")
# Question 2: Which cluster would ST5745 be assigned to?
st5745_data = df_encoded[df_encoded['id'] == 'ST5745']
st5745_features = st5745_data[feature_columns].values[0]
print("\nST5745 features:", st5745_features)
# Calculate distances to both cluster centers
distance_st5745_c1 = euclidean(st5745_features, c1_features)
distance_st5745_c2 = euclidean(st5745_features, c2_features)
print(f"Distance from ST5745 to C1 (ST7612): {distance_st5745_c1:.2f}")
print(f"Distance from ST5745 to C2 (ST9269): {distance_st5745_c2:.2f}")
if distance_st5745_c1 < distance_st5745_c2:
    assigned_cluster = "Cluster 1 (ST7612)"
else:
    assigned_cluster = "Cluster 2 (ST9269)"
print(f"ST5745 would be assigned to: {assigned_cluster}")
# Question 3: Perform cluster assignment for all students and then recompute cluster centers
# Calculate distances for all students to both cluster centers
cluster_assignments = []
for idx, row in df_encoded.iterrows():
    student_features = row[feature_columns].values
    # Calculate distances to both cluster centers
    dist_c1 = euclidean(student_features, c1_features)
    dist_c2 = euclidean(student_features, c2_features)
    # Assign to the closest cluster
    if dist_c1 < dist_c2:
        assigned_cluster = 1
    else:
        assigned_cluster = 2
    cluster_assignments.append({
        'id': row['id'],
        'cluster': assigned_cluster,
        'features': student_features
    })

# Convert to DataFrame for easier manipulation
assignments_df = pd.DataFrame(cluster_assignments)
# Count assignments
cluster1_count = len(assignments_df[assignments_df['cluster'] == 1])
cluster2_count = len(assignments_df[assignments_df['cluster'] == 2])
print(f"Cluster 1 assignments: {cluster1_count}")
print(f"Cluster 2 assignments: {cluster2_count}")
# Recompute cluster centers
cluster1_students = assignments_df[assignments_df['cluster'] == 1]
cluster2_students = assignments_df[assignments_df['cluster'] == 2]

# Calculate new cluster centers (mean of all assigned points)
new_c1 = np.mean(np.vstack(cluster1_students['features'].values), axis=0)
new_c2 = np.mean(np.vstack(cluster2_students['features'].values), axis=0)
# Format the new cluster centers in a more readable way
print("New Cluster Centers (formatted):")
print("\nC1 (new center):")
feature_names = ['jee_main_score', 'jee_advanced_score', 'mock_test_score_avg',
'school_board', 'class_12_percent', 'attempt_count',
'coaching_institute', 'daily_study_hours', 'family_income',
'parent_education', 'location_type', 'peer_pressure_level',
'mental_health_issues']
for i, feature in enumerate(feature_names):
    print(f"{feature}: {new_c1[i]:.2f}")

print("\nC2 (new center):")
for i, feature in enumerate(feature_names):
    print(f"{feature}: {new_c2[i]:.2f}")
# Create a summary table
import pandas as pd
cluster_centers_df = pd.DataFrame({
'Feature': feature_names,
'C1_New': [round(val, 2) for val in new_c1],
'C2_New': [round(val, 2) for val in new_c2]
})
print("\nCluster Centers Summary Table:")
print(cluster_centers_df)
# Load and examine the student dataset
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean

# Load the dataset
df = pd.read_csv('updated_student_dataset.csv', encoding='ascii')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nColumn information:")
df.info()  # info() prints directly and returns None
print("\nBasic statistics:")
print(df.describe())