A GLOB REPORT on “MACHINE LEARNING LAB”
AI-Driven Student Performance Prediction Using Machine Learning
Algorithms
Submitted in partial fulfilment of the requirements for the award of the
Bachelor of Technology
in
INFORMATION TECHNOLOGY
(2022-2026)
By
Manisha Margam 22241A1237
Under the Esteemed guidance of
Assistant Professor
Department of INFORMATION TECHNOLOGY
GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND
TECHNOLOGY
(Approved by AICTE, Autonomous under JNTUH, Hyderabad)
Bachupally, Kukatpally, Hyderabad-500090
2024-2025
GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
Hyderabad-500090
CERTIFICATE
This is to certify that the GLOB entitled “AI-Driven Student Performance
Prediction Using Machine Learning Algorithms” is submitted by Manisha
Margam (22241A1237) in partial fulfilment of the requirements for the award of
the degree of BACHELOR OF TECHNOLOGY in INFORMATION TECHNOLOGY during the
academic year 2024-2025.
Internal Guide Head of Department
Dr. Y. J. Nagendra Kumar
ABSTRACT
Student academic performance plays a pivotal role in evaluating both individual
potential and the effectiveness of educational systems. Traditionally, assessing
student performance has been a subjective and reactive process, often relying on
manual evaluations and limited data. However, with the increasing availability
of student data and the advent of machine learning (ML), there is an opportunity
to develop predictive models that can identify students at risk of
underperforming, providing a proactive approach to improve academic
outcomes.
This project explores the use of machine learning to predict student
performance based on a variety of factors, including study time, previous
grades, failures, absences, family support, and other demographic and
behavioral attributes. The dataset used in this study is the Student
Performance Dataset from the UCI Machine Learning Repository, which
includes various features influencing a student's final grade (G3). The objective
of this research is to apply three distinct machine learning algorithms—Support
Vector Machine (SVM), Decision Tree, and K-Nearest Neighbors (KNN)—
to predict the final grade of students and evaluate the performance of each
model.
The findings show that all three algorithms can predict academic performance
with useful accuracy. The Decision Tree model performed best, with an accuracy
of 88%. The SVM model handled the complex data robustly and reached 84%
accuracy, while KNN reached 80% accuracy but was sensitive to feature scaling.
Beyond prediction, the models offer insight into the factors that most
influence student success, enabling educators to make data-driven decisions.
I. INTRODUCTION
1.1 Problem Statement
Educational institutions strive to ensure student success, but early detection of
students at academic risk remains a challenge. Manual methods are inefficient
and often inaccurate. Machine learning can provide predictive models that
enable proactive academic support based on student attributes and past
performance.
1.2 Objective
The objective of this project is to build a machine learning model that
predicts student academic performance from factors such as study time, previous
failures, absences, and family support. The project pursues the following
objectives:
1. Preprocess the data: clean and prepare the dataset by handling any missing
values, encoding categorical variables, and performing feature scaling.
2. Apply machine learning algorithms: implement three machine learning
algorithms, namely Support Vector Machine (SVM), Decision Tree, and K-Nearest
Neighbors (KNN), to predict student performance from the input features.
3. Evaluate model performance: evaluate the models using accuracy for
classification tasks and Root Mean Squared Error (RMSE) for regression tasks.
4. Compare the models: compare the performance of the models to identify the
best-performing algorithm for predicting student final grades.
5. Provide insights for educational institutions: use the model results to
help educators identify at-risk students, optimize resource allocation, and
plan personalized interventions that improve student outcomes.
1.3 Scope of the Project
This project focuses on predicting the final grade (G3) of students using a
variety of available features such as study time, failures, absences, family
support, and previous academic performance. The aim is to apply supervised
learning algorithms to model the relationship between these features and the
target variable (final grade). Both classification (predicting grade categories)
and regression (predicting exact grade scores) approaches will be utilized to
assess the different ways machine learning can predict student performance.
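As a minimal sketch of the classification framing, the final grade G3 (0 to 20) can be binned into categories before training. The cut-offs of 10 and 15 and the function name `grade_category` are assumptions made for illustration; the report does not state the thresholds actually used.

```python
def grade_category(g3: int, low_cutoff: int = 10, high_cutoff: int = 15) -> str:
    """Map a final grade on the 0-20 scale to a coarse performance label.

    The default cut-offs are illustrative assumptions, not values from the report.
    """
    if not 0 <= g3 <= 20:
        raise ValueError(f"G3 must be between 0 and 20, got {g3}")
    if g3 < low_cutoff:
        return "low"
    if g3 < high_cutoff:
        return "medium"
    return "high"


# Example grades chosen to sit on either side of each assumed cut-off.
grades = [0, 9, 10, 14, 15, 20]
labels = [grade_category(g) for g in grades]
```

Keeping the cut-offs as parameters makes it easy to try a binary framing (pass or fail) as well as the three-level one.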
The project will evaluate and compare the effectiveness of three widely used
machine learning algorithms: Support Vector Machine (SVM), Decision
Tree, and K-Nearest Neighbors (KNN). The models will be compared based
on metrics such as accuracy and Root Mean Squared Error (RMSE) to
determine their performance in predicting student grades.
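The two metrics named above can be written out in a few lines. The following is a sketch in plain Python with invented toy values; the function names are chosen here for illustration, and in practice a library implementation would normally be used.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted numeric grades."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values, invented for illustration only.
acc = accuracy(["high", "low", "medium", "low"], ["high", "low", "low", "low"])
err = rmse([10, 12, 15, 8], [11, 12, 13, 8])
```

Accuracy applies when the target is a grade category, while RMSE applies when the target is the numeric grade, so the two are not directly comparable with each other.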
Finally, the findings will be discussed in the context of their implications in
educational settings, with a focus on how machine learning can help educators
identify at-risk students and implement data-driven strategies to improve
academic outcomes and retention rates.
II. Literature Review
Numerous studies have explored the use of machine learning (ML) techniques
to predict academic outcomes, leveraging the wealth of data available in
educational systems. Algorithms like Logistic Regression, Naive Bayes,
Random Forests, and Neural Networks have been extensively applied, each
showing promising results in various academic prediction tasks. These methods
are particularly beneficial in analyzing large datasets with multiple features such
as student demographics, behavior patterns, and academic performance.
Among these, Support Vector Machines (SVM) have excelled in classification
tasks, especially when distinguishing between different levels of academic
performance. SVM’s ability to maximize the margin between classes makes it
effective for complex, high-dimensional datasets. It is particularly well-suited to
predicting binary outcomes, such as identifying students who are at risk of
failing versus those who are likely to succeed.
Decision Trees, on the other hand, are favored for their interpretability and
transparency. Educators and researchers appreciate how Decision Trees split
data based on feature values, creating a clear path for understanding how
different attributes contribute to a student's final grade. This interpretability
allows for more actionable insights, which can be used to design targeted
interventions.
K-Nearest Neighbors (KNN) is often applied in educational data mining
because of its simplicity and effectiveness on small to moderate-sized
datasets. KNN is a non-parametric algorithm that makes predictions based on
the proximity of data points, which works well when the decision boundaries
are non-linear and similar students tend to have similar outcomes. Although KNN
is highly sensitive to feature scaling and to the choice of k (the number of
neighbors), it remains a useful tool because it is easy to implement.
Furthermore, ensemble methods like Random Forests have shown robust
performance in predicting academic outcomes by combining multiple decision
trees to improve accuracy and reduce overfitting. Neural Networks,
particularly deep learning models, are also gaining traction in the field of
educational data mining due to their capacity to model complex relationships
within the data. However, these models tend to require larger datasets and
computational resources.
In conclusion, while there is no one-size-fits-all algorithm, each of these models
offers unique advantages and trade-offs, and the choice of model depends
largely on the dataset, the problem at hand, and the desired level of
interpretability. These machine learning algorithms provide an opportunity to
not only predict academic outcomes but also to gain deeper insights into the
factors influencing student success and failure. By incorporating these
predictive models, educational institutions can move towards more personalized
learning experiences and targeted interventions.
III. Methodology
3.1 Data Collection
The dataset used for this project is the Student Performance Dataset from the
UCI Machine Learning Repository. This dataset consists of 1,000 instances with
various input features related to student demographic, behavioral, and academic
factors. The target variable is the final grade (G3), which represents the
student's final grade on a scale from 0 to 20. The features in the dataset include:
Study time
Failures
Absences
Family support
Health
Monthly alcohol consumption
Extracurricular activities
Parental education
Previous grades (G1 and G2)
Gender
3.2 Data Preprocessing
Data preprocessing is a vital step in machine learning that helps ensure the
quality and suitability of the data for training. The following preprocessing steps
were carried out:
Handling Missing Values: The dataset contains no missing values, so no
imputation or removal of missing data was necessary.
Feature Encoding: Categorical variables, such as gender and family
support, were encoded using one-hot encoding or label encoding as
needed to prepare them for the machine learning models.
Feature Scaling: Numerical features such as study time, absences, and
previous grades were standardized with StandardScaler (zero mean, unit
variance) so that features measured on larger scales do not dominate the
others, which matters most for distance- and margin-based models such as
KNN and SVM.
Data Splitting: The data was divided into training and testing sets, with
80% of the data used for training the models and 20% used for testing
their performance.
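The encoding, scaling, and splitting steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual code: the toy values, the helper names (`one_hot`, `fit_scaler`, `apply_scaler`, `split_rows`), and the random seed are all invented here. The sketch also fits the scaling statistics on the training portion only, a common practice that the report does not confirm it followed.

```python
import random


def one_hot(values):
    """Return one 0/1 column per distinct category, keyed by category name."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def fit_scaler(values):
    """Return the mean and population standard deviation of a numeric column."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Standardize values using previously fitted statistics."""
    if sigma == 0:  # constant column: centre only, avoid division by zero
        return [v - mu for v in values]
    return [(v - mu) / sigma for v in values]


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out test_fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy columns invented for illustration (absences, family support).
absences = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
famsup = ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
rows = list(zip(absences, famsup))

train, test = split_rows(rows)                      # 80% / 20%
mu, sigma = fit_scaler([a for a, _ in train])       # statistics from training data
train_scaled = apply_scaler([a for a, _ in train], mu, sigma)
test_scaled = apply_scaler([a for a, _ in test], mu, sigma)
famsup_encoded = one_hot([f for _, f in train])
```

Reusing the training mean and standard deviation on the test rows keeps information about the held-out data from leaking into preprocessing.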
3.3 Algorithms Implemented
3.3.1 Support Vector Machine (SVM) (Supervised Learning)
Support Vector Machines (SVM) are supervised learning models that are highly
effective for both classification and regression tasks. SVM works by finding a
hyperplane that best separates the classes in the feature space. In this project, we
use SVM to predict whether a student will perform well (high grade) or poorly
(low grade) based on their attributes. The SVM model can handle non-linear
relationships through the kernel trick, which makes it well suited to this
kind of educational data.
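To illustrate how a kernel lets an SVM draw a non-linear boundary, the sketch below evaluates the decision function of an already-trained model with a Gaussian (RBF) kernel. The support vectors, coefficients, bias, and gamma are hand-picked hypothetical values, not parameters learned in this project, and the training step that would find them is omitted. The report does not state which kernel was used, so the RBF choice is an assumption.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Kernel SVM decision value: sum_i (alpha_i * y_i) * K(sv_i, x) + b."""
    weighted = (c * rbf_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, dual_coefs))
    return sum(weighted) + bias


def predict(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Positive decision values map to 'high', all others to 'low'."""
    value = svm_decision(x, support_vectors, dual_coefs, bias, gamma)
    return "high" if value > 0 else "low"


# Hypothetical stand-in for a trained model; features are assumed to be
# (standardized study time, standardized absences).
support_vectors = [(1.0, -1.0), (-1.0, 1.0)]
dual_coefs = [1.0, -1.0]  # alpha_i * y_i for each support vector
bias = 0.0

label_a = predict((0.8, -0.9), support_vectors, dual_coefs, bias)
label_b = predict((-1.2, 0.7), support_vectors, dual_coefs, bias)
```

Each prediction depends only on kernel similarities to the support vectors, which is why the features never need to be mapped explicitly into a higher-dimensional space.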
3.3.2 Decision Tree (Supervised Learning)
A Decision Tree is a model that splits the data into subsets based on the most
significant features. It works by recursively dividing the dataset into smaller
segments, making decisions at each node. For this project, we use the Decision
Tree model to predict students' final grades by classifying them into categories
like low, medium, and high performance. The model is chosen for its
interpretability, allowing educators to easily understand the factors influencing a
student's performance.
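The splitting step at a single node can be sketched as a search for the threshold that leaves the two resulting groups as pure as possible. The example below uses Gini impurity, one common criterion; the report does not say which criterion was used, and the toy data and function names are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the threshold t on one numeric feature that minimizes the
    weighted Gini impurity of the groups (value <= t) and (value > t)."""
    best_t, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy data invented for illustration: second-period grade (G2) and a label.
g2 = [5, 7, 8, 9, 12, 13, 15, 17]
label = ["low", "low", "low", "low", "high", "high", "high", "high"]
threshold, impurity = best_threshold(g2, label)
```

A full tree repeats this search over every feature at every node and recurses into the two groups, and the resulting chain of thresholds is what makes the model easy to read.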
3.3.3 K-Nearest Neighbors (KNN) (Supervised Learning)
K-Nearest Neighbors (KNN) is a non-parametric, lazy learning algorithm that
predicts a student’s performance from the grades of the most similar students
in the training data. The model finds the k nearest instances and assigns the
class held by the majority of them. KNN is simple yet effective,
especially in datasets where data points with similar attributes tend to belong
to the same class. Because it relies on distances, it is sensitive to the scale
of the data, which is why feature scaling was applied prior to training.
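The majority-vote rule can be sketched in a few lines of plain Python. The data points below are invented and assumed to be already standardized; k = 3 and Euclidean distance are illustrative choices, since the report does not state the values used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label the query with the most common class among its k nearest
    training points, measured by Euclidean distance."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, already-standardized points invented for illustration.
train_X = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.5),
           (1.0, 0.9), (0.8, 1.2), (1.1, 0.7)]
train_y = ["low", "low", "low", "high", "high", "high"]

pred_low = knn_predict(train_X, train_y, (-0.7, -0.9))
pred_high = knn_predict(train_X, train_y, (0.9, 1.0))
```

With two classes, an odd k avoids tied votes. Because every prediction is driven by raw distances, an unscaled feature with a large range would dominate the neighbor search.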
IV. Implementation
4.1 Preprocessing the Data
[fig-1]
4.2 Support Vector Machine (SVM)
[fig-2]
4.3 Decision Tree
[fig-3]
4.4 K-Nearest Neighbors (KNN)
[fig-4]
V. Results
5.1 Support Vector Machine (SVM)
[fig-5]
5.2 Decision Tree
[fig-6]
[fig-7]
[fig-8]
5.3 K-Nearest Neighbors (KNN)
[fig-9]
VI. Conclusion
This project successfully demonstrated the application of machine learning
techniques to predict student academic performance using the Student
Performance Dataset from the UCI Machine Learning Repository. By
focusing on relevant features such as study time, past failures, absences, and
family support, three machine learning models—Support Vector Machine
(SVM), Decision Tree, and K-Nearest Neighbors (KNN)—were
implemented and evaluated.
Among the models, the Decision Tree algorithm emerged as the most
effective, achieving an accuracy of 88%, followed by SVM at 84% and
KNN at 80%. These results highlight the capability of machine learning to
model complex educational data and provide reliable predictions of student
outcomes. The models also offered insights into the most influential factors
affecting performance, which can support data-driven decision-making in
educational settings.
By identifying students at risk of underperforming early, educational
institutions can proactively implement personalized interventions and
allocate resources more strategically. Ultimately, the integration of
predictive analytics into academic environments can lead to improved
student retention, enhanced learning outcomes, and a more efficient and
equitable education system.