Predicting student academic performance using Bi-LSTM: a deep learning framework with SHAP-based interpretability and statistical validation

Emi Kalita 1, Abdullah Mana Alfarwan 2, Houssam El Aouifi 3,4, Ashima Kukkar 5, Sadiq Hussain 1, Tazid Ali 1 and Silvia Gaftandzhieva 6*

1 Centre for Computer Science and Applications, Dibrugarh University, Dibrugarh, India, 2 Department of Education and Psychology, Najran University, Najran, Saudi Arabia, 3 FSJES, Ibn Zohr University, Ait Melloul, Morocco, 4 IRF-SIC Laboratory, Faculty of Science, Ibn Zohr University, Agadir, Morocco, 5 Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, India, 6 Faculty of Mathematics and Informatics, University of Plovdiv Paisii Hilendarski, Plovdiv, Bulgaria

REVIEWED BY
Mostafa Aboulnour Salem, King Faisal University, Saudi Arabia
Gail Augustine, Walden University, United States

*CORRESPONDENCE
Silvia Gaftandzhieva, [email protected]

RECEIVED 19 March 2025
ACCEPTED 02 June 2025
PUBLISHED 23 June 2025

CITATION
Kalita E, Alfarwan AM, El Aouifi H, Kukkar A, Hussain S, Ali T and Gaftandzhieva S (2025). Predicting student academic performance using Bi-LSTM: a deep learning framework with SHAP-based interpretability and statistical validation. Front. Educ. 10:1581247. doi: 10.3389/feduc.2025.1581247

COPYRIGHT
© 2025 Kalita, Alfarwan, El Aouifi, Kukkar, Hussain, Ali and Gaftandzhieva. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Introduction: Educational Data Mining (EDM) involves analysing educational data to identify patterns and trends. By uncovering these insights, educators can better understand student learning, optimise teaching methods, and refine curricula. One of the main tasks in educational data mining is predicting students' academic performance, because it makes it possible to provide appropriate interventions supporting students' achievements. Predicting academic performance also helps to identify at-risk students and to explore possible intervention techniques.

Methods: In this paper, a deep learning model using a Bi-LSTM network is introduced to predict second-term GPA.

Results: The model achieved an average accuracy of 88.23% and was statistically better than traditional machine learning algorithms such as CatBoost, XGBoost, Hist Gradient Boosting, and LightGBM on the accuracy, precision, recall, and F1-score metrics. The results are also analysed with the help of SHAP values for model interpretability to understand feature contributions, making the proposed framework more transparent. The performance of the models is also compared using various statistical tests.

Discussion: The results demonstrate that Bi-LSTM performance is significantly different from that of the other models. Hence, the proposed model provides a way to prevent student dropouts and improve academic achievement.

KEYWORDS
student academic outcome, XAI, SHAP, Bi-LSTM, student dropout, statistical test
1 Introduction
Student academic performance is a key factor when evaluating the outcomes of global education systems. Education is a crucial component on which our civilisation heavily depends. Research in many areas, particularly education, has been reshaped by information and communication technology; for instance, the recent COVID-19 pandemic forced many countries to adopt various e-learning platforms (Albreiki et al., 2021). Higher education institutions prioritise student academic achievement as a key indicator of quality education.
However, identifying the factors that significantly impact student success early in their academic journey is a complex challenge. Several useful strategies have been employed to address the academic performance issues of students (Bravo-Agapito et al., 2021; Alamri and Alharbi, 2021; Hamsa et al., 2016), but these resources may not be easily implemented everywhere. Also, while technology has improved student performance prediction, further work is necessary to achieve higher accuracy through new data and techniques. Additionally, clustering and classification techniques have been proposed to identify the impact of students' early performance on the GPA. The Grade Point Average, commonly known as GPA, is the widely used and accepted criterion for determining student academic performance and a very significant component of the overall academic evaluation process. However, GPA needs to be predicted early so that any student who is most likely to drop out during their academic period can be tracked and supported. To address this challenge, this study applies modern computational techniques.

Student performance is a major component of the learning process. Predicting student performance is necessary to identify those most likely to experience poor academic accomplishment in the future. The data may be helpful and utilised to make predictions if it has been converted into knowledge. Therefore, the information could help students reach their academic goals and enhance the quality of education and learning. This study draws on Educational Data Mining (EDM), which analyses data from educational settings using data mining techniques (Kaunang and Rotikan, 2018; Yağcı, 2022). EDM applications also assist in preparing action plans for enhancing student performance, which ultimately leads to improved teaching, learning, and the overall student experience within the institution (Ajibade et al., 2022; Nabil et al., 2021). Analysing academic data with machine learning has shown promising results in identifying learning patterns and predicting student performance (Hussain and Khan, 2023). Through the application of ML algorithms, an assessment of student outcomes can be made by identifying patterns that exist within the data (Dabhade et al., 2021). While machine learning offers potential for academic data analysis, traditional model-building methods are often inadequate: they suffer from issues such as lack of interpretability, vulnerability to overfitting in imbalanced datasets, and difficulty managing feature interdependencies (Alam and Mohanty, 2022). These limitations, in turn, make it difficult for those who apply the models to make important decisions based on the information the models provide. Deep Learning (DL) has emerged as a promising solution to address the limitations of traditional machine learning models (Rodríguez-Hernández et al., 2021). However, even with DL, handling the complexities and non-linear relationships found in academic datasets remains a significant challenge (Waheed et al., 2020; Lee et al., 2021; Al-Azazi and Ghurab, 2023; Shen, 2024; Sateesh et al., 2023; Manigandan et al., 2024). Moreover, DL's capability of handling big data will enhance the prediction accuracy of GPA if integrated with workflows for handling imbalanced data and for assessing feature importance, as shown in Figure 1.

FIGURE 1
A flowchart of the ML & DL process with the constraints and transition stages.

Academic achievement is significant since it is closely related to the favourable results that we appreciate. Students' academic achievement in college or university is one of the aspects that contribute to academic success, and every college or university's performance is still determined by the total academic achievement of its students. To enhance our analysis and prediction of academic achievement, we can incorporate variables like aptitude test results, high school GPAs, and the student's graduating high school. We think that a student's success during their first year of college can be used as a predictor of how well they will perform during the remaining years of their education. These elements enable students to receive early feedback and take steps to enhance their performance. The main purpose of this study is to achieve early classification of at-risk students and the prediction of their GPA to allow timely intervention by educators and other policymakers; recognising potential dropouts can help an institution improve dropout and retention rates. The key objectives of this research are:

• To predict at-risk students using classification so that teachers and policymakers can stop the possible dropout of these students.
• To find the best classifier among different classifiers for predicting at-risk students, which may be applied to similar datasets of other universities.
• To utilise SHAP (Shapley Additive exPlanations) to interpret the results, providing stakeholders with insights into the key features influencing predictions and reinforcing the principles of Explainable AI (XAI).
• To compare the performance of the best classifier with the others, statistical analyses such as the mean, median, standard deviation, t-test, bootstrap confidence intervals, Friedman test, effect sizes (Cohen's d) and Tukey's HSD test are employed on the four performance metrics.

This study aims to improve predictive accuracy while providing comprehensible and practical recommendations to educational stakeholders using deep learning methodologies and interpretability tools like SHAP. The proposed framework offers a reference model for early GPA prediction, contributing to better academic outcomes and fewer student dropouts.

The rest of the paper is organised as follows. Section 2 describes the related works, while Section 3 depicts the methodology. Results and discussion are presented in Section 4, and Section 5 concludes the paper.

2 Related work

The growth and development of a country depend on the achievements of students in school. Therefore, various researchers work to develop diverse methods for the early prediction of students' academic performance.

Sarker et al. (2024) conducted a study applying the EDM method to investigate student achievement in higher secondary education in Bangladesh. The research focused on categorising students into good, average, and poorly performing groups. It evaluated their academic performance through four key aspects: assessment of probable outcomes, comparison of subject-wise performance analysis, performance trends, and internal examination pattern parameters. A two-year dataset of humanities students was used, and five machine learning algorithms were used for analysis: Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), Neural Network (NN), and Nearest Neighbour. The study demonstrated a clear correlation between students' performance during the term and their final grades, and it also identified specific subjects that significantly contribute to high academic achievement. Such a concept can help college administrations with intervention strategies that can be used to help low achievers while motivating high achievers.

Kukkar et al. (2023) proposed a new Student Academic Performance Predicting (SAPP) system to enhance the prediction accuracy and solve performance prediction issues. The proposed system combined a 4-layer stacked LSTM with RF and Gradient Boosting (GB) algorithms. The system performance was evaluated using accuracy, precision, F-measure and recall on a newly created emotional dataset together with the OULAD dataset. The accuracy of the proposed SAPP system was around 96%, which is higher than ANN, RNN, CNN, SVM, DT, and NB. These results supported its accuracy over other approaches employed in student performance prediction.

Mahawar and Rattan (2025) developed a performance prediction model using ML models involving demographic, social, psychological, and economic indicators. An online survey was performed, and a dataset of pre-year undergraduate students was considered for analysis using eight different ML classifiers, namely Logistic Regression (LR), RF, Support Vector Machine (SVM), and XGB. The proposed system also included nine feature selection techniques, including variance threshold and recursive feature elimination. The ensemble DXK (DT + XGB + KNN) model achieved 97.83% accuracy with an 80:20 data split, showing better results than traditional classifiers. Furthermore, the ACO-DT model achieved a 98.15% accuracy rate, higher than all the other models used. The authors highlighted that more research should target more accurate and faster predictions.

Another analysis was done by Liang et al. (2024) using five machine learning models to predict academic performance in an engineering mechanics course, with online learning behaviours and comprehensive performance as inputs and final exam scores (FESs) as outputs. The best performance was achieved by GB Regression (GBR), with an RMSE of 9.3595 and a correlation coefficient of 0.7558. They found that the Intellectual Education Score (IES) was the most important performance indicator affecting the change in the scores; live viewing rate (LVR), replay viewing rate (RVR), and number of completed assignments (NOCA) were also critical for FESs. They presented practical information for educators who could incorporate or modify particular practices to help a student at risk.
Huang and Zeng (2024) developed a novel academic performance prediction model leveraging dual graph NNs to utilise both interaction-based structural information and attribute feature spaces of students. The model included a local academic performance representation module obtained from online interaction activities and a global representation module constructed from attribute features with the help of dynamic graph convolution. These various data representations are integrated with a learning module that analyses information from individual and overall perspectives to predict performance on a test. The experiment outcome showed that performance was improved, with 83.96% accuracy for pass/fail prediction and 90.18% for pass/withdraw prediction on a public dataset. Additionally, ablation studies were performed to validate these improvements and to showcase that the proposed model outperformed the other approaches.

Hussain et al. (2024) implemented an innovative deep learning approach that uses the Levenberg-Marquardt Algorithm (MLA), which solves problems like insufficient attributes and model complexity in current approaches. The input data included assignments, class tests, midterm scores, and attendance. This data is fed through the NN via four input variables, three hidden layers and an output layer. The proposed model obtained an accuracy of 88.6%, more accurate than previous approaches. The study achieved its goal of predicting final grades, which proved beneficial for students, teachers, and educational leaders by providing actionable information.

Kukkar et al. (2024) developed a system that analysed the sequences and long-dependent structures of OULAD and self-derived emotional data using RNN and LSTM networks. Integrating RF, SVM, NB, and DT with RNN and LSTM improves the method's predictive capability. The proposed RNN + LSTM + RF model achieved a high accuracy of 97% compared to the other models: RNN + LSTM + SVM with 90.67%, RNN + LSTM + NB with 86.45% and RNN + LSTM + DT with 84.42%. This method effectively modelled the intricate time-dependent relationships within the data and outperformed all other tested configurations.

Demographic and personality features were combined by Shaninah and Mohd Noor (2024) to develop a SAP prediction model. They collected the dataset from 305 students studying at Al-Zintan University, Libya, through a questionnaire containing 44 questions. The proposed approach involved one latent dependent construct, i.e., SAP, and five independent constructs. Both were tested using PLS-SEM, which was more effective in handling smaller samples and complex models than CB-SEM. The research outcomes identified personality features as the most influential factors that affect SAP.

The issues faced by DHH students in their education were addressed by Raji et al. (2024). They proposed a new ML system with LIME and SHAP methods. The proposed system predicted the students at risk and weighted the key risk factors like early intervention, family deafness history, mode of communication, and type of schooling. They generated a new dataset combining 454 DHH student records with synthetic and SMOTE datasets. After that, various ML methods were applied, among which a stacked model with XGB + RF + Extra Trees gained 92.99% accuracy. This system provided practical recommendations allowing stakeholders to enhance DHH students' performance.

Kapucu et al. (2024) explored ML and DL approaches to predict student performance in science classes. They collected the data from 445 students in grades 5–8 from a school in Central Anatolia, Turkey, during the 2022–2023 academic year. The results revealed that, out of several factors, the average number of books read per year affected performance significantly more than other factors. The DNN model achieved the highest accuracy, i.e., 90%.

Nurudeen et al. (2024) established the correlation between the first-year GPA and the final-year CGPA. Data were collected using an ex-post facto design and analysed using Pearson's correlation and regression in Minitab. It was found that first-year GPA had a consistently high correlation (i.e., 0.9334) with the final-year CGPA, proving that early academic performance is a major determinant of success. However, other demographic characteristics were not significantly related to CGPA.

The problem of imbalanced datasets in learning was minimised by Wang et al. (2023). They proposed a ProbSAP system for predicting academic performance. ProbSAP incorporated three key modules: a cooperative data enhancement sub-module for improving data quality, an accessible large-scale metadata clustering sub-module for reducing potential imbalances of academic features, and an XGBoost-based prediction sub-module for final course mark prediction. The comparative assessments revealed that ProbSAP leads to lower mean absolute error than current methods, including CNN, SVR, and CatBoost-SHAP, with an average improvement of up to 84.76%. It provided a sample accuracy above 98%, with less than 1–9% prediction error. Table 1 showcases different state-of-the-art studies in this domain.

3 Methodology

In this section, the different methods used in this study for second-term GPA prediction are explained in detail. The design, implementation, and evaluation of the proposed methodologies and their comparison with the conventional machine learning approaches are also explained.

3.1 Different methods utilised in the study

This section provides a detailed analysis of seven methods, examining their architecture, functionality, and effectiveness in predicting second-term GPA. Following this, we discuss the advantages and disadvantages of each method in the context of academic performance prediction.

3.1.1 XGBoost

eXtreme Gradient Boosting (XGBoost) is a machine learning technique known for its exceptional predictive performance, as well as its high accuracy, efficiency and speed. It creates a sequence of weak learners and, based on this sequence, develops an accurate predictive model. XGBoost minimises the overfitting problem by improving generalisation. It is mostly used for classification and regression problems. It can handle missing values, which allows the model to work with real-world data without requiring pre-processing.
TABLE 1 Some of the state-of-the-art studies with their findings and limitations.

| Study | Dataset | Features | Techniques | Best model (performance) | Key findings | Limitations |
|---|---|---|---|---|---|---|
| Kukkar et al. (2023) | Emotional + OULAD datasets | Emotional states, academic records | RF, GB, ANN, CNN, SVM, DT, NB | Stacked LSTM + RF + GB (96% accuracy) | Achieved 96% accuracy; enhanced prediction over traditional methods. | Requires additional real-world validation for diverse datasets. |
| Mahawar and Rattan (2025) | Online survey (pre-year undergraduate) | Demographic, social, psychological, and economic factors | LR, RF, SVM, XGB, DXK, ACO-DT | ACO-DT (98.15% accuracy) | Identified effective features using advanced feature selection; improved accuracy with ensemble models. | Limited to pre-year undergraduates; economic data inconsistencies may affect generalisation. |
| Hussain et al. (2024) | BS program 1st-semester data | Attendance, assignments, midterm scores, class tests | MLA | NN + MLA (88.6% accuracy) | Successfully predicted final grades using simple input features; beneficial for educators and policy-makers. | Accuracy is slightly lower than modern ensemble methods. |
| Kukkar et al. (2024) | Emotional + OULAD datasets | Temporal dependencies from sequence-based data | RNN, LSTM, RF, SVM, NB, DT | RNN + LSTM + RF (97% accuracy) | Captured complex temporal dependencies with superior performance compared to other combinations. | Needs scalability testing for larger datasets. |
| Shaninah and Mohd Noor (2024) | 305 students (survey) | Personality traits, demographics, employment factors | PLS-SEM, CB-SEM | PLS-SEM | Identified personality traits as most influential on SAP; performed well with smaller sample sizes. | Limited sample size; focused only on Libyan universities. |
| Kapucu et al. (2024) | 445 students (grades 5–8) | Number of books read per year, midterm scores | DNN | DNN (90% accuracy) | Determined books read per year as a significant factor for predicting science course performance. | Applied only to grades 5–8; additional factors for higher education are not included. |
| Nurudeen et al. (2024) | First- and final-year GPAs | Demographics, first-year GPA | Regression, Pearson's correlation | Regression (correlation: 0.9334) | Strong correlation between first-year GPA and final CGPA; demographic variables had no significant influence. | Focused only on GPA progression; external factors were not considered. |
| Wang et al. (2023) | Massive educational dataset | Academic features, metadata clustering | XGBoost, CNN, SVR, ProbSAP | ProbSAP | ProbSAP reduced MAE by 84.76% and achieved 98% accuracy in predictions with a reduced error margin (1–9%). | Requires extensive computational resources for large-scale datasets. |
XGBoost has several key features: it uses a decision tree as the base learner; to enhance its performance, it supports parallel processing for improved efficiency and scalability; and it utilises regularisation to avoid overfitting. Its advantages are high accuracy, efficiency, handling of large datasets, and interpretability (Chen and Guestrin, 2016).

3.1.2 CatBoost

Native handling of categorical features, robustness to overfitting, high performance, interpretability and scalability are the advantages of CatBoost (Prokhorenkova et al., 2018). Mathematically, CatBoost can be expressed as an additive ensemble (Equation 1):

F(x) = F_0(x) + Σ_{m=1}^{M} Σ_{i=1}^{N} f_m(x_i, y_i)    (1)
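To make the boosted-tree baselines concrete, the following is a minimal sketch (not the authors' released code) of how the XGBoost and CatBoost classifiers could be configured through their scikit-learn-style APIs; every hyperparameter shown is an illustrative assumption, since the paper does not report the configurations used.

```python
# Illustrative sketch only: the paper does not publish its boosting
# configurations, so every hyperparameter below is an assumed placeholder.
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def fit_boosting_baselines(X, y):
    """Fit two of the boosted-tree baselines and return their test accuracies."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    cat = CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1, verbose=0)
    xgb.fit(X_tr, y_tr)
    cat.fit(X_tr, y_tr)
    return xgb.score(X_te, y_te), cat.score(X_te, y_te)
```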
3.1.3 Histogram-based gradient boosting

Traditional Gradient Boosting is an ensemble decision-tree algorithm that is slow to train; to minimise this problem, the Hist Gradient Boosting or Histogram-Based Gradient Boosting (HGB) concept was introduced. Hist Gradient Boosting is an efficient implementation of traditional gradient boosting. This boosting technique divides data into bins and histograms, reducing the computational complexity and memory usage. These bins or histograms are used to find the gradient of the loss function and then update the model using the calculated gradients. This process iterates until it reaches the stopping criteria or convergence. Hist Gradient Boosting offers advantages such as accelerated gradient computation, scalability to large datasets and high-dimensional features, and resilience to outliers and noisy data. Common applications of Hist Gradient Boosting are classification, regression and recommendation systems (Si et al., 2017).

3.1.4 LightGBM

Microsoft's LightGBM is a fast and efficient gradient-boosting framework designed for high performance. It tackles classification, regression, and ranking problems through a tree-structured approach, combining weak models into a strong predictor. LightGBM's focus on instances with large and small gradients contributes to its accuracy. It is a flexible model because it can support various objective functions, and thanks to its support for sparse data, it is highly memory-efficient. Its operation involves initialising a basic model and then calculating gradients. LightGBM applies efficient algorithms to obtain an effective model by searching for the optimal split point in each feature. This is an iterative process that updates the model prediction based on the split points and calculated gradients, continuously adding new decision trees until a stopping criterion is met, which may be either a maximum number of trees or a minimum improvement in performance. High accuracy, speed, scalability, efficient histogram construction, and low memory usage are the advantages of LightGBM (Ke et al., 2017).

The selection among these methodologies depends on the problem, the dataset and the performance metrics, because each also has demerits. XGBoost gives high accuracy but can suffer from overfitting. CatBoost can handle categorical features, but it can be sensitive to outliers. Hist Gradient Boosting is fast and memory-efficient, but it can give lower accuracy. LightGBM is also fast and memory-efficient and gives higher accuracy, but can be less robust to outliers.

3.1.5 BiLSTM

Bi-directional Long Short-Term Memory, commonly known as Bi-LSTM, belongs to the recurrent neural network (RNN) category. It is called a sequence model because it processes sequential data. It has two LSTM layers, which make it bi-directional: a forward LSTM and a backward LSTM. These two LSTM layers simultaneously process the input sequence in the forward and backward directions. The network thus combines forward and backward passes to capture past and future context: the forward pass processes the input from start to end, and the backward pass from end to start.

FIGURE 2
Structure of BiLSTM.

In Figure 2, the input sequence represents data such as characters in a text or words in a sentence; these data points are transformed into dense vectors. The Bi-LSTM layer applies its parameters to the vector sequence. In the forward pass, information is collected from the past (prior time steps), and in the backward pass, information is recorded from the future (following time steps). The output of the BiLSTM is the combination of the hidden states from the forward and backward directions (Graves and Schmidhuber, 2005) (Equation 2):

p_t = p_t^f + p_t^b    (2)

where p_t is the final probability vector combining the records from both the forward and backward LSTM networks, p_t^f is the probability vector obtained from the forward LSTM network, and p_t^b is the probability vector obtained from the backward LSTM network.

3.1.6 SHAP (Shapley Additive exPlanations)

The concepts of cooperative game theory and Shapley values are the foundation of SHAP (Lundberg and Lee, 2017). The output of an ML model is interpreted and explained using the Shapley Additive exPlanations (SHAP) framework. SHAP values help to understand the contribution of each feature to the model prediction. They explain the significance of each feature, how it affects the output, and the interactions between features. A positive SHAP value of a feature indicates a positive impact on the model prediction, and a negative value indicates a negative impact; the magnitude represents the strength of the effect. SHAP uses the training data to measure the contribution of each feature, and a reference value is calculated that represents the average prediction for the dataset. The SHAP value of a feature is the difference between the predicted value and the reference value, computed by considering all possible feature coalitions. Finally, SHAP values are used to determine how each feature affects the outcome and to understand and interpret the result; the insight gained helps explain how the model makes its decisions. Interpretability, model explainability and feature selection are the advantages of SHAP.

3.1.7 SMOTE
The Synthetic Minority Over-sampling Technique (SMOTE) is known for handling imbalanced datasets in machine learning (Chawla et al., 2002). SMOTE helps solve oversampling, undersampling and threshold-moving issues. An underrepresented minority class causes the majority class to dominate the class distribution. SMOTE handles these imbalance issues by generating samples of the minority classes: it identifies minority class instances from the imbalanced dataset, finds their k-nearest neighbours, and generates synthetic samples by interpolating between each minority instance and its k-nearest neighbours. SMOTE repeats these steps to obtain a more balanced dataset.
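As a concrete illustration of this resampling step, here is a minimal sketch using the imbalanced-learn implementation of SMOTE; the feature matrix X and label vector y are synthetic placeholders, and k_neighbors = 5 is the library default rather than a value reported in the paper.

```python
# Minimal SMOTE sketch with imbalanced-learn; X and y are synthetic placeholders.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

X = np.random.rand(200, 5)  # five hypothetical predictors (sex, age, HSGPA, ACT, FTGPA)
y = np.array([0] * 120 + [1] * 40 + [2] * 25 + [3] * 15)  # imbalanced 4-class labels

smote = SMOTE(k_neighbors=5, random_state=42)  # k_neighbors=5 is the library default
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # every class upsampled to the majority count
```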
3.2 Data pre-processing to model evaluation

In this work, we followed a systematic methodology starting with data pre-processing, which involved data preparation, transformation, and oversampling to address class imbalance issues. The raw dataset was cleaned and transformed into a suitable format, and oversampling techniques were applied to balance the data. This resulted in a refined new dataset, which was then used for model development and evaluation to assess the performance and accuracy of the proposed approach. Figure 3 describes the steps of our model.

FIGURE 3
Structure of proposed model.

3.2.1 Dataset description

The dataset was collected from a Midwestern university in the USA. The dataset comprised sex, age, high school grade point average (HSGPA), American College Testing (ACT) composite score, and grade point averages for the first (FTGPA) and second terms (STGPA). STGPA is our target variable. The dataset consisted of three cohorts of students' records (N = 6,500) on six variables (features).

3.2.2 Data pre-processing

The dataset underwent a systematic preparation process to ensure its reliability and accuracy. Data cleaning was a critical step involving the identification and removal of missing values, as well as the elimination of duplicate records to maintain data consistency. These measures were essential to produce a clean and error-free dataset, providing a robust foundation for subsequent analytical tasks.

In addition to data cleaning, data augmentation was applied to enhance the dataset. This process involved generating new data points by introducing small random perturbations to key features, such as HSGPA, ACT, and FTGPA. Adding subtle variations to the data increased its diversity, better reflecting real-world variability. This data augmentation expanded the dataset and enhanced the model's generalisation ability, leading to more robust analyses. Figure 4 shows the distribution of classes before the data augmentation process.

FIGURE 4
Distribution of classes before the data augmentation process.

To further address the class imbalance, SMOTE (Synthetic Minority Over-sampling Technique) was applied. SMOTE generates synthetic data points for the minority classes, ensuring a more balanced data distribution across all classes. This balance is critical for training machine learning models, as it prevents bias toward any particular class and ensures that the model is equally exposed to all possible outcomes, improving its overall performance and generalisation ability. The final balanced distribution is shown in Figure 5.

FIGURE 5
Distribution of classes after the data augmentation process.
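The perturbation-based augmentation described in Section 3.2.2 can be sketched as follows; the noise scale (1% of each feature's standard deviation) is an assumed value for illustration, since the paper does not specify the perturbation magnitude.

```python
# Illustrative augmentation sketch: jitter selected numeric features with small
# Gaussian noise. The 1% noise scale is an assumption, not a reported setting.
import numpy as np
import pandas as pd

def augment_with_noise(df: pd.DataFrame, cols=("HSGPA", "ACT", "FTGPA"),
                       scale=0.01, seed=42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    jittered = df.copy()
    for c in cols:
        jittered[c] = jittered[c] + rng.normal(0.0, scale * df[c].std(), size=len(df))
    return pd.concat([df, jittered], ignore_index=True)  # originals + perturbed copies
```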
3.2.3 Model architecture

The model implemented is a Recurrent Neural Network (RNN) architecture utilising Bidirectional Long Short-Term Memory (Bi-LSTM) and Bidirectional Gated Recurrent Unit (Bi-GRU) layers to capture sequential patterns in the data (Figure 6 shows the proposed model architecture).

FIGURE 6
Model architecture.

• Input pre-processing: The input features are reshaped to a 3D tensor of shape (samples, time steps, features), where samples corresponds to the number of training/testing samples, time steps is set to 1, signifying a single time step, and features represents the number of input features.
• Recurrent layers: The core of the model leverages a combination of BiLSTM and BiGRU layers. The 1st layer is a Bidirectional LSTM layer with 512 units and return_sequences = True, allowing the output sequence to be passed to the next layer. The 2nd layer is a Bidirectional GRU layer with 256 units, configured to output sequences for further processing. The 3rd layer is another Bidirectional LSTM layer with 256 units, reducing the sequence to a single vector representation.
• Dense layers: A stack of fully connected layers captures complex, high-level representations of the processed sequential data: Dense(64) → Dense(32) → BatchNormalization → Dense(16) → Dense(8) layers refine the feature space. Batch normalisation ensures stability and mitigates the risk of vanishing/exploding gradients.
• Dropout: Dropout layers introduce regularisation, preventing overfitting by randomly setting a fraction of units to zero during training.
• Output layer: A Dense layer with four units and a sigmoid activation function outputs class probabilities for the four classes.
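The description above maps naturally onto a Keras Sequential model. The sketch below is our reading of the stated layer stack, not the authors' released code; the dense-layer activations (ReLU) and the exact placement of the Dropout and BatchNormalization layers are assumptions.

```python
# Assumed reconstruction of the described stack, not the authors' released code.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features: int) -> keras.Model:
    return keras.Sequential([
        keras.Input(shape=(1, n_features)),              # (time steps = 1, features)
        layers.Bidirectional(layers.LSTM(512, return_sequences=True)),
        layers.Bidirectional(layers.GRU(256, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(256)),          # collapses to one vector
        layers.Dense(64, activation="relu"),             # ReLU is an assumption
        layers.Dropout(0.2),                             # dropout placement assumed
        layers.Dense(32, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(16, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(4, activation="sigmoid"),           # four classes, as described
    ])
```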
The model was trained for up to 200 epochs with a batch size of 128, while early stopping was applied to prevent overfitting. Early stopping monitored the validation loss and halted training if no improvement was observed for 15 consecutive epochs, restoring the best model weights to ensure optimal performance. Dropout was applied with a rate of 0.2 in the fully connected layers to reduce overfitting by randomly deactivating some units during training. The model was compiled using the Adam optimiser, which is efficient and adaptive, and the categorical cross-entropy loss function, suitable for multi-class classification tasks.
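Under the stated training setup, compiling and fitting the model might look like the following sketch; the validation split of 0.2 and the reuse of the resampled arrays from the SMOTE sketch are assumptions, as the paper does not report how the validation set was carved out.

```python
# Training sketch matching the reported settings (200 epochs, batch size 128,
# patience-15 early stopping on validation loss, Adam, categorical cross-entropy).
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=15, restore_best_weights=True)

model = build_model(n_features=5)  # build_model from the architecture sketch above
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    X_res.reshape(-1, 1, X_res.shape[1]),              # (samples, 1, features)
    keras.utils.to_categorical(y_res, num_classes=4),
    epochs=200, batch_size=128,
    validation_split=0.2, callbacks=[early_stop])
```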
The accuracy metric was used to evaluate the model's performance during training and validation (Figure 7).

FIGURE 7
Training and validation accuracy over epochs.

FIGURE 8
Training and validation loss over epochs.

4 Results and discussion

In this section, we describe the results obtained from comparing the performance of various machine learning algorithms. The evaluation was based on several key metrics, including accuracy, precision, recall, and F1-score, which help assess the performance of the models in predicting the target variable, STGPA. The algorithms used in the comparison include CatBoost, XGBoost, HistGradientBoosting, and LightGBM (Figure 8).

For each algorithm, the following metrics were calculated based on the values of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):

1. Accuracy: This metric measures the proportion of correct predictions made by the model relative to the total number of predictions (see Equation 3). Higher accuracy indicates better overall performance.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

2. Precision: This metric measures the proportion of true positive predictions among all positive predictions made by the model (see Equation 4). This is particularly crucial when incorrect positive predictions have significant negative consequences.

Precision = TP / (TP + FP)    (4)

3. Recall: This metric indicates how well the model identifies all relevant instances of the positive class (see Equation 5). It is critical when false negatives are costly.

Recall = TP / (TP + FN)    (5)
FIGURE 9
Performance of the models by metrics.

The precision results indicate that Bi-LSTM is more accurate in identifying students who are truly at risk of underperforming, reducing false interventions, while the recall mechanism protected the identification of most students who need attention. The F1-score demonstrated that Bi-LSTM achieves better overall performance through its single balanced metric reflecting both precision gains and recall enhancement. The information presented here becomes vital for educators who need systems that perform detection and intervention activities without making errors. The significant margin demonstrated an important increase in the trustworthiness of the models, particularly when applied to real-world academic tasks.

4.2.2 Feature importance via SHAP values

The opacity of DL models required the use of SHAP to explain the Bi-LSTM output and validate its predictions. SHAP attributes numerical values to each feature to identify how much it impacts the prediction results. The insights obtained from the SHAP evaluations can be seen in Figure 10 (SHAP violin summary plot) and Figure 11 (SHAP heatmap plot).

FIGURE 10
Violin summary plot based on SHAP values.

FIGURE 11
Heatmap plot based on SHAP values.

4.2.2.1 SHAP violin plot

Figure 10 revealed that, among all predictive factors, FTGPA (First-Term GPA) showed the greatest impact, because its data distribution extends the furthest from zero along the x-axis. Students' first-term and high school performance, together with ACT scores, demonstrated similar importance levels, capturing their academic development and standardised testing abilities. The model indicated that the AGE and SEX variables had only small predictive power due to their negligible impact. The graphical representation shows that historical academic data supersedes demographic characteristics in predicting GPA, which strengthens the model's relevance for educational applications.
4.2.2.2 SHAP heatmap

Figure 11 provides local explanation through a visual presentation of how individual student predictions relate to each feature. A positive SHAP contribution appears as red, while a negative SHAP influence shows up as blue. For instance, predicted GPA values are consistently higher when students demonstrate high FTGPA and HSGPA levels, which appear in red. The model uses blue to mark instances where these variables have lower values, which results in decreased predicted outcomes. This approach lends confidence to the model's predictions, allowing advisors to identify the reasons behind each prediction so they can deliver tailored guidance.

Stakeholders can thus identify at-risk students early and deliver appropriate advice in a timely manner. This can help prevent students from dropping out of the institution and improve the institution's overall performance.

4.3.1 Descriptive statistics

• Percentiles (25, 50, 75%): provide insights into the distribution of performance scores.

These descriptive statistics reveal that Bi-LSTM consistently outperforms the other models in accuracy and F1-score, with a notable difference in precision and recall.

4.3.2 Friedman test

The Friedman test for repeated measures is applied to compare the models and identify any significant differences in their performance across the four metrics. The results are:

• Chi-squared: 11.1600
• p-value: 0.0109

Thus, the p-value of 0.0109 indicates a difference between the models, meaning that Bi-LSTM is statistically different from the others when comparing the mean values across the complete combination of all aspects.
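The Friedman statistic reported above can be reproduced with SciPy as sketched below; the per-model metric vectors are hypothetical stand-ins, with only the Bi-LSTM scores taken from the paper and the baseline values assumed for illustration.

```python
# Friedman test sketch; each vector holds one model's four metric values
# (accuracy, precision, recall, F1). Baseline numbers are assumed placeholders.
from scipy.stats import friedmanchisquare

bi_lstm  = [88.23, 92.02, 92.11, 91.98]  # reported Bi-LSTM scores
xgboost  = [87.14, 87.00, 86.90, 86.95]  # assumed values for the baselines
catboost = [86.60, 86.40, 86.30, 86.35]
lightgbm = [86.10, 85.90, 85.79, 85.84]

stat, p = friedmanchisquare(bi_lstm, xgboost, catboost, lightgbm)
print(f"chi-squared={stat:.4f}, p-value={p:.4f}")  # cf. the reported 11.16 and 0.0109
```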
4.4.1 Comparative performance of models

All performance evaluation metrics from Table 2 demonstrate a clear superiority of Bi-LSTM compared to the ML approaches on all precision, recall, accuracy and F1-score measures. In particular:

• Bi-LSTM achieved 88.23% accuracy, outperforming the next-best model, XGBoost, which reached 87.14%.
• Precision and recall, both critical for identifying at-risk students, reached 92.02 and 92.11%, respectively, for Bi-LSTM. These values are significantly higher than those of all ML counterparts (which ranged from 85.79 to 87.18%).
• The F1-score of Bi-LSTM (91.98%) reflects an excellent balance between precision and recall, signifying that the model effectively minimises both false positives and false negatives.

The research demonstrated that deep learning algorithms such as Bi-LSTM exceed traditional ML models when processing educational data through sequential and contextual dependency modelling. The model employed bidirectional memory to access past and future temporal data, which proved crucial for understanding academic trajectories.

The SHAP analysis highlighted the following key predictors:

• First-Term GPA (FTGPA): reflects initial academic performance and is a strong early indicator.
• High School GPA (HSGPA): captures foundational academic preparedness.
• Standardised Test Scores (ACT): signify cognitive aptitude and readiness for the college-level curriculum.

Research in educational data mining supports a clear connection between previous academic performance and future student achievement levels. Understanding the connection between data points and student outcomes through SHAP analysis makes model transparency possible, which leads to better adoption by HEIs' top management and administrators of non-technical backgrounds.

4.4.4 Statistical validation of performance superiority

The following statistical techniques were used to validate the findings along with their generalisability and credibility:
significant difference (χ2 = 11.16, p = 0.0109) among the models. • Temporal Dynamics: Real-time updates and time-series
This confirms that the observed performance differences are not changes have not been included into the present model
due to random variation. framework. The predictive capabilities and applicability of
• Bootstrap confidence intervals were calculated to assess the the model will improve by implementing longitudinal
uncertainty around the performance gaps. All intervals tracking systems.
comparing Bi-LSTM with other models (e.g., CatBoost, • Holistic Feature Space: Additional metadata about mental
LightGBM) had negative lower and upper bounds, indicating health and financial stress as well as engagement levels is
Bi-LSTM consistently outperformed its counterparts with missing from the current model assessment. Future versions
95% confidence. of the model must incorporate socio-emotional and
• Cohen’s d effect size provided further confirmation. The behavioural information to build a predictive instrument
magnitude of the effect sizes ranged from −3.4 to −4.4, with a broader scope.
representing very large effects. This statistically supports the
assertion that Bi-LSTM is meaningfully better, not
just marginally. 5 Conclusion
• Tukey’s HSD (Honestly Significant Difference) test
confirmed pairwise statistical superiority of Bi-LSTM over In this work, we proposed a deep learning-based model,
each individual model (p < 0.0001 in all cases), providing specifically a Bi-LSTM (Bidirectional Long Short-Term Memory)
robust post-hoc evidence to the Friedman results. network, to predict the second-term GPA. Our model was
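A minimal sketch of the bootstrap interval and effect-size computations is given below; the two score arrays are hypothetical per-run metric samples, since the paper does not publish the raw resampling data.

```python
# Bootstrap CI and Cohen's d sketch; scores_a/scores_b are placeholder per-run
# metric samples for a baseline model and Bi-LSTM, respectively.
import numpy as np

def bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=42):
    """95% percentile CI for mean(a) - mean(b); all-negative bounds favour b."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    diffs = [rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
             for _ in range(n_boot)]
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def cohens_d(scores_a, scores_b):
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    pooled = np.sqrt(((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1))
                     / (a.size + b.size - 2))
    return (a.mean() - b.mean()) / pooled  # negative when Bi-LSTM scores higher
```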
Our analysis utilised multiple approaches for validation to enhance the credibility of the study's findings. These evaluations create confidence for decision-makers, who typically need empirical validation to feel comfortable adopting AI-based systems.

4.4.5 Relevance for non-technical stakeholders

The technical aspects of this study produce significant practical benefits for educational institutions. The results generated by this statistically validated model serve practical strategic purposes:

• HEIs' top management and academic advisors can use the predictive results, along with SHAP explanations, to engage students in informed discussions and recommend tailored support plans.
• Administrators can incorporate the model into early alert systems to drive data-informed policies aimed at reducing dropout rates and improving overall institutional performance.
• Policymakers can explore this model as a blueprint for scalable national or state-level educational interventions, especially in systems that are resource-constrained but rich in historical academic data.

The Bi-LSTM model provided a unique combination of outstanding predictive capabilities and easy interpretability, which makes it valuable for education domains requiring both technical excellence and ethical clarity.

Two limitations of the present framework remain:

• Temporal dynamics: Real-time updates and time-series changes have not been included in the present model framework. The predictive capabilities and applicability of the model will improve by implementing longitudinal tracking systems.
• Holistic feature space: Additional metadata about mental health and financial stress, as well as engagement levels, is missing from the current model assessment. Future versions of the model must incorporate socio-emotional and behavioural information to build a predictive instrument with a broader scope.

5 Conclusion

In this work, we proposed a deep learning-based model, specifically a Bi-LSTM (Bidirectional Long Short-Term Memory) network, to predict the second-term GPA. Our model was evaluated against several other algorithms, including CatBoost, XGBoost, HistGradientBoosting, and LightGBM, using key performance metrics such as accuracy, precision, recall, and F1-score. The results demonstrated that our proposed Bi-LSTM model outperforms the traditional machine learning algorithms in terms of predictive accuracy, highlighting the potential of deep learning techniques for academic performance prediction. This type of model can be utilised to mitigate student dropout and enhance the performance of students. One of the limitations of the study is the size of the dataset; in the future, we shall try to collect more data to boost the performance of the deep learning model. The integration of deep learning strategies and SHAP values in a single framework could overcome the challenges of the trade-off between the explainability and intricacy of student academic performance models and augment model accuracy and transparency. The performance of the selected ML and DL models was also compared using the mean, median, standard deviation, t-test, bootstrap confidence intervals, Friedman test, effect sizes (Cohen's d) and Tukey's HSD test. The results demonstrate that Bi-LSTM performance is significantly different from that of the other models. This study could open horizons for other researchers to conduct analogous studies in the domain.

Data availability statement

The data analyzed in this study is subject to the following licenses/restrictions: data will be provided on a request. Requests to access these datasets should be directed to [email protected].
References
Ajibade, S. S. M., Dayupay, J., Ngo-Hoang, D. L., Oyebode, O. J., and Sasan, J. M. (2022). Utilization of ensemble techniques for prediction of the academic performance of students. J. Optoelectron. Laser 41, 48–54.

Alam, A., and Mohanty, A. (2022). Predicting students' performance employing educational data mining techniques, machine learning, and learning analytics. In International conference on communication, networks and computing (166–177). Cham: Springer Nature Switzerland.

Alamri, R., and Alharbi, B. (2021). Explainable student performance prediction models: a systematic review. IEEE Access 9, 33132–33143. doi: 10.1109/ACCESS.2021.3061368

Al-Azazi, F. A., and Ghurab, M. (2023). ANN-LSTM: a deep learning model for early student performance prediction in MOOC. Heliyon 9:e15382. doi: 10.1016/j.heliyon.2023.e15382

Albreiki, B., Zaki, N., and Alashwal, H. (2021). A systematic literature review of student performance prediction using machine learning techniques. Educ. Sci. 11:552. doi: 10.3390/educsci11090552

Bravo-Agapito, J., Romero, S. J., and Pamplona, S. (2021). Early prediction of undergraduate student's academic performance in completely online learning: a five-year study. Comput. Human Behav. 115:106595. doi: 10.1016/j.chb.2020.106595

Carpenter, J., and Bithell, J. (2000). Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 1141–1164. doi: 10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. doi: 10.1613/jair.953

Chen, T., and Guestrin, C. (2016). XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 785–794.

Dabhade, P., Agarwal, R., Alameen, K. P., Fathima, A. T., Sridharan, R., and Gopakumar, G. (2021). Educational data mining for predicting students' academic performance using machine learning algorithms. Mater. Today Proc. 47, 5260–5267. doi: 10.1016/j.matpr.2021.05.646

Graves, A., and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610. doi: 10.1016/j.neunet.2005.06.042

Hamsa, H., Indiradevi, S., and Kizhakkethottam, J. J. (2016). Student academic performance prediction model using decision tree and fuzzy genetic algorithm. Procedia Technol. 25, 326–332. doi: 10.1016/j.protcy.2016.08.114

Huang, Q., and Zeng, Y. (2024). Improving academic performance predictions with dual graph neural networks. Complex Intell. Syst. 10, 3557–3575. doi: 10.1007/s40747-024-01344-z

Hussain, M. M., Akbar, S., Hassan, S. A., Aziz, M. W., and Urooj, F. (2024). Prediction of student's academic performance through data mining approach. J. Inform. Web Eng. 3, 241–251. doi: 10.33093/jiwe.2024.3.1.16

Hussain, S., and Khan, M. Q. (2023). Student-performulator: predicting students' academic performance at secondary and intermediate level using machine learning. Ann. Data Sci. 10, 637–655. doi: 10.1007/s40745-021-00341-0

Kapucu, M. S., Özcan, H., and Aypay, A. (2024). Predicting secondary school students' academic performance in science course by machine learning. Int. J. Technol. Educ. Sci. 8, 41–62. doi: 10.46328/ijtes.518

Kaunang, F. J., and Rotikan, R. (2018). Students' academic performance prediction using data mining. In 2018 third international conference on informatics and computing (ICIC) (1–5).

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Proces. Syst. 30, 3149–3157.

Kukkar, A., Mohana, R., Sharma, A., and Nayyar, A. (2023). Prediction of student academic performance based on their emotional wellbeing and interaction on various e-learning platforms. Educ. Inf. Technol. 28, 9655–9684. doi: 10.1007/s10639-022-11573-9

Kukkar, A., Mohana, R., Sharma, A., and Nayyar, A. (2024). A novel methodology using RNN + LSTM + ML for predicting student's academic performance. Educ. Inf. Technol. 29, 14365–14401. doi: 10.1007/s10639-023-12394-0

Lee, C. A., Tzeng, J. W., Huang, N. F., and Su, Y. S. (2021). Prediction of student performance in massive open online courses using deep learning system based on learning behaviors. Educ. Technol. Soc. 24, 130–146.

Liang, G., Jiang, C., Ping, Q., and Jiang, X. (2024). Academic performance prediction associated with synchronous online interactive learning behaviours based on the machine learning approach. Interact. Learn. Environ. 32, 3092–3107. doi: 10.1080/10494820.2023.2167836

Liu, J., and Xu, Y. (2022). T-friedman test: a new statistical test for multiple comparison with an adjustable conservativeness measure. Int. J. Comput. Intell. Syst. 15:29. doi: 10.1007/s44196-022-00083-8

Lundberg, S. M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Adv. Neural Inform. Proces. Syst. 30, 4765–4774.

Mahawar, K., and Rattan, P. (2025). Empowering education: harnessing ensemble machine learning approach and ACO-DT classifier for early student academic performance prediction. Educ. Inf. Technol. 30, 4639–4667. doi: 10.1007/s10639-024-12976-6

Manigandan, E., Anispremkoilraj, P., Kumar, B. S., Satre, S. M., Chauhan, A., and Jeyaganthan, C. (2024). An effective BiLSTM-CRF based approach to predict student achievement: an experimental evaluation. In 2024 2nd international conference on intelligent data communication technologies and internet of things (IDCIoT) (779–784). IEEE.

Nabil, A., Seyam, M., and Abou-Elfetouh, A. (2021). Prediction of students' academic performance based on courses' grades using deep neural networks. IEEE Access 9, 140731–140746. doi: 10.1109/ACCESS.2021.3119596

Nurudeen, A. H., Fakhrou, A., Lawal, N., and Ghareeb, S. (2024). Academic performance of engineering students: a predictive validity study of first-year GPA and final-year CGPA. Eng. Rep. 6:e12766. doi: 10.1002/eng2.12766

Penick, J. E., and Brewer, J. K. (1972). The power of statistical tests in science teaching research. J. Res. Sci. Teach. 9, 377–381. doi: 10.1002/tea.3660090410

Prokhorenkova, L., Gusev, G., and Vorobev, A. (2018). CatBoost: gradient boosting on decision trees with categorical features support. Proceedings of the 2nd ACM SIGKDD international conference on knowledge discovery and data mining, 1125–1134.

Raji, N. R., Kumar, R. M. S., and Biji, C. L. (2024). Explainable machine learning prediction for the academic performance of deaf scholars. IEEE Access 12, 23595–23612.

Rodríguez-Hernández, C. F., Musso, M., Kyndt, E., and Cascallar, E. (2021). Artificial neural networks in academic performance prediction: systematic implementation and predictor evaluation. Comput. Educ. Artif. Intell. 2:100018. doi: 10.1016/j.caeai.2021.100018

Sarker, S., Paul, M. K., Thasin, S. T. H., and Hasan, M. A. M. (2024). Analyzing students' academic performance using educational data mining. Comput. Educ. Artif. Intell. 7:100263. doi: 10.1016/j.caeai.2024.100263

Sateesh, N., Rao, P. S., and Lakshmi, D. R. (2023). Deep belief bi-directional LSTM network-based intelligent student's performance prediction model with entropy weighted fuzzy rough set mining. Int. J. Intell. Inf. Database Syst. 16, 107–142. doi: 10.1504/IJIIDS.2023.131411

Shaninah, F. S. E., and Mohd Noor, M. H. (2024). The impact of big five personality trait in predicting student academic performance. J. Appl. Res. High. Educ. 16, 523–539. doi: 10.1108/JARHE-08-2022-0274

Shen, Y. (2024). Using long short-term memory networks (LSTM) to predict student academic achievement: dynamic learning path adjustment. In Proceedings of the 2024 international conference on machine intelligence and digital applications (627–634).

Si, S., Zhang, S., and Keerthi, S. S. (2017). Histogram-based gradient boosting for categorical and numerical features. Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 765–774.

Waheed, H., Hassan, S. U., Aljohani, N. R., Hardman, J., Alelyani, S., and Nawaz, R. (2020). Predicting academic performance of students from VLE big data using deep learning models. Comput. Human Behav. 104:106189. doi: 10.1016/j.chb.2019.106189

Wang, X., Zhao, Y., Li, C., and Ren, P. (2023). ProbSAP: a comprehensive and high-performance system for student academic performance prediction. Pattern Recogn. 137:109309. doi: 10.1016/j.patcog.2023.109309

Yağcı, M. (2022). Educational data mining: prediction of students' academic performance using machine learning algorithms. Smart Learn. Environ. 9:11. doi: 10.1186/s40561-022-00192-z