Design and Implementation of Real-Time Student Performance Evaluation and Feedback System
A Thesis Presented to
the University of Houston
In Partial Fulfillment
of the Requirements for the Degree of
Master of Science
By
Karthik Bibireddy
May 2017
DESIGN AND IMPLEMENTATION OF REAL-TIME STUDENT PERFORMANCE
EVALUATION AND FEEDBACK SYSTEM
________________________________________
Karthik Bibireddy
APPROVED:
________________________________________
Dr. Christoph F. Eick
Department of Computer Science
_________________________________________
Dr. Sara G. McNeil
Department of Curriculum and Instruction
_________________________________________
Dr. Weidong Shi
Department of Computer Science
_________________________________________
Dean, College of Natural Sciences and
Mathematics
Acknowledgements
First and foremost, I would like to thank and convey my sincere gratitude to my graduate
advisors, Dr. Christoph F. Eick, and Dr. Nouhad Rizk for giving me an opportunity to work
in the data analysis and intelligent systems team. This opportunity eventually led to the
work done in this thesis. Their advice, guidance, and support were instrumental throughout
the course of this study. It is difficult to envisage this work without their help and insight.
I would also like to thank Dr. Weidong Shi and Dr. Sara G. McNeil for agreeing to be a
part of my Thesis committee. My appreciation also goes to the Educational Data Mining
team, led by Dr. Nouhad Rizk, for their continuous efforts in obtaining the real datasets.
I would like to especially acknowledge the significant role played by my friends at the
University of Houston for making sure that I settled into my personal as well as academic
life at the University of Houston. Also, I would like to take this opportunity to thank my
family for encouraging and motivating me in all my endeavors. Finally, I would like to
express my deepest gratitude towards my parents and my sister for their unparalleled love
and support. They continue to be the greatest source of inspiration for me.
DESIGN AND IMPLEMENTATION OF REAL-TIME STUDENT PERFORMANCE EVALUATION AND FEEDBACK SYSTEM
An Abstract of a Thesis
Presented to
the University of Houston
In Partial Fulfillment
of the Requirements for the Degree of
Master of Science
By
Karthik Bibireddy
May 2017
Abstract
Many students delay their graduation due to dropping courses or having to repeat courses due to low academic performance. It is important for students to understand where they stand amongst their peers and to change their attitude about the course they are taking. Moreover, it is important to identify, in time, students who need additional help, for example by mentoring them or conducting review sessions. The goal of this thesis is the design and implementation of a Real-Time Student Performance Evaluation and Feedback (RSPEF) system. Its contributions include a Predictive Analysis System (PAS) that uses machine learning techniques to interpolate past student performance into the future, the development of an Emergency Warning System (EWS) that detects poor-performing students early in the semester, and the design of a relational database schema that is suitable for assessing student performance across multiple courses, which is critical for the generalizability of the RSPEF system. The system design includes a core machine learning & data analysis engine, a relational database that is reusable across courses, and an interactive web-based interface to continuously collect data and present predictions and summaries to students and instructors.
Table of Contents
1. Introduction
2. Background & Related Work
3. Architecture of the Proposed System
4. Design of Relational Database to Store Student Data
5. Learning Models that Predict Student Performance
6. Conclusion & Future Work
References
List of Figures
Figure 2.1. Multi-Layer Neural Network with a single hidden layer
Figure 4.2. Student and professor entities and their associated properties
Figure 4.3. Course and section entities and their associated properties
Figure 4.4. Course element and survey entities and their associated properties
Figure 4.7. Illustration of "course has section" and "professor teaches section" relationships
Figure 4.9. Illustration of "course has course element" and "course element scores students" relationships
Figure 4.10. Illustration of "course has performance summary" and "student has performance summary" relationships
Figure 4.12. Relational schema for storing student data
List of Tables
Table 4.1. Demographic & personal questions and their respective response ranges
Table 4.3. Course-specific questions and their respective response ranges
Table 4.4. Student-specific questions and their respective response ranges
Table 4.5. Student-institution involvement questions and their respective response ranges
Table 4.6. Student performance dataset with labels, and weights of each course element indicated in brackets as percentages (they add up to 100%)
Table 5.1. Structure of the COSC1430 (FALL 2015) dataset in train mode
Table 5.5. Mapping between grade I, grade II and their respective numeric grade points
Table 5.11. Review of the various dataset nomenclatures used and the schemas they represent
Table 5.23. Accuracy for predicting grade point for dataset-o3
Table 5.24. MAE 2 for predicting grade point for dataset-o3
Table 5.25. RMSE 2 for predicting grade point for dataset-o3
Table 5.26. Metric comparison for detecting poor-performing students using the student survey dataset of COSC1430 (FALL 2015)
Table 5.26. Comparison of confusion matrices across various algorithms for the COSC1430 (FALL 2015) survey dataset
1. Introduction
Despite a strong and intensive effort over decades by colleges and universities to improve
their graduation rates by trying to help students graduate in time, the graduation rates in
the US remain flat. According to a report by the National Center for Education Statistics
[1], about 28% of bachelor’s degree students entered a STEM field (i.e., chose a STEM
major) at some point within six years of entering postsecondary education in 2003-2004;
however, 48% left the field either by changing majors or leaving college without
completing a degree. Other studies indicated that many of the students who left the STEM
fields were actually high-performing students who might have made valuable additions to
the STEM workforce had they stayed [2,3]. To produce more graduates in STEM fields,
some recent U.S. policies have focused on reducing students' attrition from STEM fields in college, arguing that increasing STEM graduation even by a small percentage can be a cost-efficient way to contribute substantially to the supply of STEM workers [1,4,5]. The University of Houston has reported its graduation rates for full-time, first-time-in-college undergraduates who entered between 2008 and 2011. For the 3486 freshmen who enrolled at UH in 2008, the cumulative graduation rate was only 18% after four years, 38% after five years, and 48.1% after six years. For the 3100 freshmen who enrolled in 2009, the rate was 19.7% after four years, 41.6% after five years, and 51% after six years. For the 3453 freshmen who enrolled in 2010, the rate was 22.7% after four years and 42.5% after five years. For the 3556 freshmen who enrolled in 2011, the cumulative four-year graduation rate was only 25.2%. In this research, we tackle these problems using educational data-mining techniques.
Educational data mining applies machine learning and statistical techniques to analyze educational datasets, such as student, professor or instructor, course, and school data, in order to better understand students and the environment they are learning in. The administration of the school or the university provides the data. One important use of this data is to identify poorly performing students in an early phase of the semester, so that instructors can take appropriate interventions with such students, like mentoring or reviewing topics with them. Faculty members need functional tools that allow them to identify such students. Additionally, students need a system that helps them understand where they stand among their peers, so that they can assess themselves and plan accordingly.
The goal of this thesis is to alleviate the problems mentioned in the previous paragraphs; its focus is the design and implementation of a real-time student performance evaluation and feedback (RSPEF) system. The RSPEF system has three components: a data pre-processing layer, a machine learning and data analysis engine, and an application layer. The data pre-processing layer consists of the relational database and the dataset selector.
The data required to build models is collected through surveys answered by students and from instructors, who upload student course scores through an interface. The survey and performance data are stored in the relational database. The role of the dataset selector is to extract relevant training datasets from the relational database, to clean and standardize them, and to pass these datasets to the machine learning and data analysis engine, which learns models from training data and makes predictions using the learned models. The machine learning and data analysis system, also called the MLDA engine, has three sub-systems: a Predictive Analysis System (PAS), a Report Generation System (RGS), and an Emergency Warning System (EWS). The student-performance dataset is passed as input to PAS, which builds models by learning from the training datasets and uses these models to predict student grades, starting very early in the semester.
The EWS learns models from student survey datasets and classifies students who are currently taking a course into either "passing" or "failing" the course. The RGS uses both student-performance and survey datasets to compute statistical summaries, such as the mean, median, and standard deviation of course elements such as attendance, quizzes, assignments, and exams. The summaries, predicted grades, and the list of identified poor-performing students are passed on to the application layer. The application layer is web-based; it visualizes the information and predictions from these models, which are displayed to the students and instructors. Students can register and log into their account to view their predicted grades for the courses they are enrolled in and assess themselves based on the other information available on their dashboard. Instructors can view the list of students identified as poor-performing for their courses at a very early stage, so that they can intervene in time.
The remainder of this Master's thesis is organized as follows. In chapter 2, we will discuss the background and the related work done in the field of educational data mining and machine learning using educational datasets. In chapter 3, we will discuss the architecture of our proposed system and the functionality of its components. In chapter 4, we will briefly explain the design of the relational database and introduce the relational schema of the designed database. In chapter 5, we will discuss the various models built by machine learning algorithms and evaluate and compare their performance. In chapter 6, we will summarize our conclusions and discuss future work.
2. Background & Related Work
In this chapter, we will discuss the background and related work concerning educational data mining, prediction and classification models, and the entity-relationship model used in this research.
2.1 Machine Learning
Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. Machine learning is essentially classified into three types: supervised learning, unsupervised learning, and reinforcement learning. The main goal of supervised learning is to learn a model from labeled or training data which will allow us to make predictions about future data. Supervised learning can be further classified into two tasks: a classification task, where the predicted outcome is a categorical value, and a regression task, where the outcome is a continuous value. In reinforcement learning, the goal is to develop a system or agent that improves its performance based on interactions with the environment; there are rewards and penalties in this approach. In unsupervised learning, given unlabeled data, we extract meaningful information without any prior knowledge about the data. Clustering is a technique of unsupervised learning [23]. In this section, we will discuss various classification and regression algorithms which are relevant to our thesis.
2.1.1 Naïve Bayes
The Naïve Bayes classifier belongs to the family of probabilistic classifiers, built upon Bayes' theorem with strong independence assumptions between the features or attributes [22]. In the equation below, X denotes the evidence and C a hypothesis; P(C|X) is the posterior probability of the hypothesis C given the evidence X:
\[ P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)} \]
P(C) is the prior probability of the hypothesis, i.e., the ratio of examples where the hypothesis C is true; P(X|C) is the likelihood, the probability of the evidence given our hypothesis; and P(X) is the prior probability of the evidence X. Given the evidence X, the task of Naïve Bayes is to predict the hypothesis C having the highest posterior probability:
\[ P(C_i \mid X) > P(C_j \mid X) \qquad \text{for all } j \ne i \]
The class Ci with maximum P(Ci|X) is called the maximum posterior hypothesis. An unseen example is classified into the class with the maximum posterior. Usually, our datasets have many attributes X1 to Xd, and it is computationally expensive to acquire all the necessary class-conditional probabilities for every combination of attribute values. The Naïve Bayes classifier therefore makes a conditional independence assumption, which assumes that the probabilities that different attributes take specific values are independent of each other, given the class. With the conditional independence assumption, we only need to estimate the conditional probability of each Xi given C and the prior probabilities of Xi and C, instead of knowing the class-conditional probabilities for every combination of X. To classify an unseen record of a student, the Naïve Bayes classifier computes the posterior probability for each class C:
\[ P(C \mid X) = \frac{P(C)\,\prod_{i=1}^{d} P(X_i \mid C)}{P(X)} \]
Since P(X) does not depend on the class membership, it is fixed for each class. Therefore, it is sufficient to choose the class that maximizes the numerator term. By dropping the denominator, we can deduce the following relation, where \( \propto \) denotes proportionality:
\[ P(C \mid X) \propto P(C)\,\prod_{i=1}^{d} P(X_i \mid C) \]
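To make the decision rule above concrete, the following toy sketch (not code from the RSPEF implementation; the priors and conditional probabilities are made-up numbers) computes the unnormalized posterior P(C) times the product of the P(Xi|C) for two classes and predicts the class with the larger value.

public class NaiveBayesToy {
    public static void main(String[] args) {
        String[] classes = {"pass", "fail"};
        double[] prior = {0.7, 0.3};            // P(C), estimated from training data
        // P(X_i | C) for two observed attribute values of one hypothetical student
        double[][] likelihood = {
            {0.6, 0.8},                         // given "pass"
            {0.2, 0.5}                          // given "fail"
        };
        int best = 0;
        double bestScore = -1.0;
        for (int c = 0; c < classes.length; c++) {
            double score = prior[c];
            for (double p : likelihood[c]) {
                score *= p;                     // P(C) * prod_i P(X_i | C)
            }
            System.out.printf("score(%s) = %.4f%n", classes[c], score);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        System.out.println("predicted class: " + classes[best]);
    }
}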
2.1.2 Logistic Regression Classification
Multinomial logistic regression classification uses a one-versus-all policy, dividing the problem into K (the number of classes) simple logistic regression problems [24]. In each simple logistic regression problem, one of the K classes is represented as 1 and the rest of the classes are represented as 0; in this way, K simple logistic regression models are built, one for each class. Given an unseen record as input, each model returns the probability that the record belongs to the class labeled 1 in that simple logistic regression model. The student is classified into the class whose returned probability is highest.
If there are K classes for N instances with M attributes, the parameter matrix B to be calculated is an M x (K-1) matrix. Since a ridge estimator is added as an L2 regularization for all parameters except the last one, the formulation is divided into two parts: the probability for a class j (for every class except the last one) is given by the first equation below, and the probability of the last class is given by the second equation.
\[ P_j(X_i) = \frac{e^{X_i B_j}}{\sum_{j'=1}^{K-1} e^{X_i B_{j'}} + 1} \]
\[ P_K(X_i) = 1 - \sum_{j=1}^{K-1} P_j(X_i) = \frac{1}{\sum_{j'=1}^{K-1} e^{X_i B_{j'}} + 1} \]
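As a small illustration of how the two formulas above turn a learned parameter matrix B into class probabilities, the sketch below evaluates them for one record; the record x and the entries of B are made-up values, not parameters learned in this work.

public class LogisticProbabilities {
    public static void main(String[] args) {
        double[] x = {1.0, 0.8, 0.5};        // one record with M = 3 attributes
        double[][] B = {                     // M x (K-1) parameter matrix for K = 3 classes
            { 0.4, -0.2},
            { 1.1,  0.3},
            {-0.7,  0.9}
        };
        int K = B[0].length + 1;
        double[] expScores = new double[K - 1];
        double denom = 1.0;                  // the "+ 1" term in the denominator
        for (int j = 0; j < K - 1; j++) {
            double dot = 0.0;
            for (int m = 0; m < x.length; m++) {
                dot += x[m] * B[m][j];       // X_i * B_j
            }
            expScores[j] = Math.exp(dot);
            denom += expScores[j];
        }
        for (int j = 0; j < K - 1; j++) {
            System.out.printf("P(class %d | x) = %.4f%n", j + 1, expScores[j] / denom);
        }
        System.out.printf("P(class %d | x) = %.4f%n", K, 1.0 / denom);   // the last class
    }
}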
2.1.3 Random Forest
Random forests use an ensemble of simple decision trees as classification models [25]. An ensemble approach constructs a set of base classifiers from training data and performs classification by taking a vote over the predictions made by each base classifier. Random Forests obtain their set of base classifiers by generating different training sets from the training data using bootstrap sampling with replacement (bagging) and by considering only a random subset of the features when splitting each tree node. A related ensemble strategy is boosting, which assigns a weight to each training example and adaptively changes the weights at the end of each boosting round. AdaBoost [26] is the most popular boosting algorithm. The AdaBoost algorithm initially assigns equal weights to the training examples. A training set is generated by drawing examples from the original training set. Next, a classifier is induced from this training set and used to classify all the examples in the original data. The weights of the training examples are updated at the end of each boosting round: examples that are classified incorrectly have their weights increased, while those that are classified correctly have their weights decreased. This forces the classifier to focus in subsequent iterations on the examples that are difficult to classify.
2.1.4 Multi-Layer Neural Network
An artificial Neural Network is a computational model inspired by a biological neural network such as the brain [27]. Like the brain, a Neural Network consists of nodes (neurons) which are connected to each other. Each node has an inhibition and an excitation threshold; these thresholds are the minimum and maximum input it expects to receive. Each node has an activation function: the inputs from one or more nodes are fed into this function, and the output of this function is the output of the node. The node will transmit only if the activation function's output is above the excitation threshold. A Multi-Layer Neural Network (MLNN) is an artificial neural network with one or more hidden layers. An MLNN with a single hidden layer is illustrated in figure 2.1.
The activation flows through the network, through the hidden layers, until it reaches the output nodes. The output nodes then reflect the input in a meaningful way to the outside world. The difference between the predicted value and the actual value (the error) is propagated backward by apportioning it to each node's weights according to the amount of the error that node is responsible for.
The transfer function translates the input signals to output signals. Four types of transfer functions are commonly used, but we will discuss only the unit step and the sigmoid. With the unit step, the output is set at one of two levels, depending on whether the total input is greater than or less than some threshold value (see figure 2.2). The sigmoid family consists of two functions, logistic and tangential: the values of the logistic function range from 0 to 1, while the tangential (tanh) function ranges from -1 to 1.
Figure 2.2: Unit-step transfer function for MLNN
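For reference, the standard forms of the transfer functions discussed above are given below (θ denotes the unit-step threshold); these formulas are supplied for completeness and are not reproduced from the thesis figures.
\[
\phi_{\text{step}}(v) = \begin{cases} 1 & \text{if } v \ge \theta \\ 0 & \text{if } v < \theta \end{cases},
\qquad
\sigma(v) = \frac{1}{1 + e^{-v}} \in (0,1),
\qquad
\tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}} \in (-1,1)
\]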
2.1.5 Support Vector Regression (SMOreg)
Support vector regression applies the support vector machine framework to regression problems [28, 29]. Suppose we have a training dataset with N observations and observed labels. The goal of SMOreg is to find a function f(x) that has at most ε deviation from the actually obtained targets y_n for all the training data, and at the same time is as flat as possible. We will describe the working of SMOreg using a linear function of the form
\[ f(x) = x'\beta + b \]
Our cost function J is given below; the goal is to minimize this cost function, which amounts to minimizing the norm value \(\beta'\beta\). This is a convex optimization problem: minimize
\[ J(\beta) = \tfrac{1}{2}\,\beta'\beta \]
subject to the condition that all residuals stay within the ε band, i.e. \( \forall n: |y_n - (x_n'\beta + b)| \le \varepsilon \). No such function f(x) might exist that satisfies this constraint for all points. To deal with this problem, we introduce slack variables ξn and ξn* for each point. This approach is similar to the soft-margin concept in SVM classification, which lets SMOreg tolerate deviations up to the slack variables and yet still satisfy the required conditions. Introducing the slack variables into our objective function gives
\[ J(\beta) = \tfrac{1}{2}\,\beta'\beta + C \sum_{n=1}^{N} (\xi_n + \xi_n^{*}) \]
subject to the conditions
\[ \forall n:\; y_n - (x_n'\beta + b) \le \varepsilon + \xi_n, \qquad
   \forall n:\; (x_n'\beta + b) - y_n \le \varepsilon + \xi_n^{*}, \qquad
   \forall n:\; \xi_n \ge 0, \qquad
   \forall n:\; \xi_n^{*} \ge 0 \]
2.2 Entity-Relationship Model
The entity-relationship (ER) model is popular for its simple graphical representations and safe constructs. ER models are currently being used widely in databases, information systems, and software engineering. An entity can be a real-world object, animate or inanimate, which can exist independently. All entities have attributes, which are properties of the entity. There can be different types of attributes: simple, composite, derived, and multi-valued attributes. A key is an attribute which uniquely identifies an instance of that entity. See figure 2.4. A relationship is an association among entities; for example, a student entity can be associated with a course entity through a relationship such as "takes".
A cardinality defines the number of instances of one entity type which can be associated with the number of instances of another entity type via a relationship. Four kinds of cardinalities are possible (A and B denote entity types):
One to one: one instance in A can be associated with at most one instance in B, and vice versa.
One to many: one instance in A can be associated with more than one instance in B.
Many to one: many instances in A can be associated with at most one instance in B.
Many to many: many instances in A can be associated with many instances in B.
Participation describes whether every instance of an entity must take part in a relationship. Total participation is represented by a small line cutting the line between the entity and the relationship, while a small oval near the entity rectangle represents the partial participation of that entity.
2.3 Educational Data Mining
Educational data mining (EDM) has emerged as an active research field over the last decade. The main aim of EDM is to develop models capable of better understanding how students learn. In this section, we will briefly discuss approaches that are closely related to our study. Han and Kamber [6] describe an approach that uses association rule mining to identify the factors influencing the success of students, and decision tree models are used to predict student academic performance. Priya et al. [8] applied a decision tree classification technique towards the end of the semester, leading to the discovery of valuable knowledge that can be used to improve students' performances. Many studies (Ardila, 2001; Boxus, 1993; Busato et al., 1999, 2000; Chidolue, 1996; Furnham et al., 1999; Gallagher, 1996; Garton et al., 2002; King, 2000; Minnaert and Janssen, 1999; among others) have used student characteristics to predict their success or their failure. Galit [9] presented a case study that used students' data to analyze their learning behavior, to predict the results, and to warn students at risk.
Goyal and Vohra [10] showed that if data-mining techniques such as clustering, decision
tree induction, and association analysis were applied to higher education processes, it
helped to predict student retention rate. Surjeet Kumar et al. [11] used decision-tree
classifiers to predict student drop-out rates. Pathan et al. [12] developed a decision-tree
model to improve students’ programming skills in the C language where they collected
data from 70 students of Structured Programming Language (SLP) course and generated
two datasets.
Furthermore, they tried to learn an appropriate weighting of the features using a genetic algorithm approach, which further improved prediction accuracy. Pittman [14] performed a comparative study of data-mining techniques for predicting the success of students in a course and found that SVM-based approaches and random forest methods accomplished the highest accuracies. Romero et al. [15] compared different data-mining
methods for classifying students based on their Moodle (e-learning system) usage data and
the final marks obtained in their respective programs and they observed that decision trees
were the most suitable approach for this task. Nguyen et al. [16] compared the accuracy of decision tree and Bayesian models for predicting the academic performance of undergraduate and postgraduate students and observed that the decision tree models produced better results.
Al-Radaideh et al. [17] proposed a classification model to enhance the quality of the higher
educational system by evaluating students by predicting their final grades in courses. They
used three different classification methods ID3, C4.5, and the Naïve Bayes. The results
indicated that the decision tree model had better prediction accuracy than the other models.
Muslihan et al. [18] compared an artificial neural network and the combination of clustering and decision tree classification techniques for predicting and classifying students' academic performance, and found that the combination of clustering and decision tree classification gave better results. Another study [19] developed a predictive model called CHAID with 7-class response variables, using predictive data-mining techniques to model the academic performance of students at higher secondary schools in India. Tripti et al. [20] used J48 decision trees and random tree classifiers to predict the performance of students in their work and observed that random tree has better accuracy and consumes less time than the J48 decision tree algorithm. Minaei-Bidgoli et al. [21] developed a web-based interface to track learning patterns in students and used genetic algorithms to improve the prediction of student performance.
Our proposed approach differs from previous work in the following aspects. Prior work has focused on predicting failing students or final grades; in our approach, we build a complete system that predicts grades, detects poor-performing students, and presents these insights to students and instructors so that they can act well informed. Moreover, our approach uses both student personal data and class performance data to build models, and a relational database schema was designed to store and use this student information; the relational database is scalable across various courses and universities. Finally, unlike other approaches, ours predicts grades and detects poor-performing students based on data collected in the first two to three weeks of the semester, so that students and professors have enough time to act on the feedback.
3. Architecture of the Proposed System
This thesis centers on the design and implementation of the RSPEF system whose primary
goal is to improve the graduation rate at the University of Houston, by providing feedback
to students and instructors. In this chapter, we will present and discuss the architecture of
RSPEF.
The RSPEF system architecture has three layers: i) a Data Pre-processing Layer, ii) a Machine Learning & Data Analysis (MLDA) Engine, and iii) an Application Layer. Figure 3.1 gives an overview of this architecture.
3.1 Data Pre-Processing Layer
The Data Pre-processing Layer has two components: a relational database and a dataset selector. Performance and personal data of students are stored in the relational database. The dataset selector is a programmed API that creates datasets from the relational database, cleans them, and standardizes them; these datasets are later passed as input to the MLDA engine.
New information is inserted or updated in the relational database when, for example, a student submits a survey (which inserts the student's responses into the database) or an instructor updates scores (which updates existing information in the database). When such an insert or update event takes place, it invokes the dataset selector. The dataset selector is programmed to obtain metadata from these tables, which is used to identify training datasets. When a student submits a survey, an insert event happens which invokes the dataset selector; it in turn collects metadata by identifying the student by "student ID" and using this ID to retrieve the list of courses that this student has taken. Using the information in the metadata, the dataset selector generates SQL commands to obtain all survey training datasets for each of the courses. Similarly, when an instructor updates student scores, an update event happens which invokes the dataset selector to collect metadata by identifying the course in which performance data was updated and to retrieve all the relevant training datasets which have the same structure as the current course. The dataset selector extracts these datasets and passes them as input to the MLDA engine. The current performance dataset and its training datasets are passed as input to the Predictive Analysis System (PAS) and the Report Generation System (RGS); the current survey data and its corresponding training datasets are passed into the Emergency Warning System (EWS) and the Report Generation System (RGS).
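The dataset selector is described above only at the conceptual level; the following sketch shows what the generated retrieval of a performance training dataset could look like when implemented with plain JDBC. The connection URL, table names, and column names are assumptions for illustration and do not necessarily match the actual RSPEF schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DatasetSelectorSketch {
    // Retrieves all recorded course-element scores for one course (e.g., a previous-semester offering).
    public static void printTrainingScores(String courseId) throws Exception {
        String sql = "SELECT sc.student_id, ce.name AS element, sc.score "
                   + "FROM scores sc "
                   + "JOIN course_element ce ON sc.element_id = ce.element_id "
                   + "WHERE ce.course_id = ?";
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/rspef", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, courseId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("student_id") + " "
                            + rs.getString("element") + " " + rs.getDouble("score"));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        printTrainingScores("COSC1430-FALL2015");   // hypothetical course identifier
    }
}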
3.2 Machine Learning & Data Analysis (MLDA) Engine
The MLDA Engine has three sub-systems: i) a Predictive Analysis System (PAS), ii) a Report Generation System (RGS), and iii) an Emergency Warning System (EWS).
The purpose of the Predictive Analysis System, also called PAS, is to predict student
grades. PAS receives training and current performance datasets from the dataset selector.
A model is learned from the training dataset. This model uses current performance datasets
and predicts grades for students who are currently taking a course.
The purpose of the Report Generation System (RGS) is to help students assess themselves amongst their peers by generating course-specific summaries from the student-performance and survey datasets, so that they can be displayed to students and instructors by the application layer. RGS receives the student survey and performance datasets from the dataset selector and generates statistical summaries from them. For performance datasets, RGS computes the mean, median, and standard deviation of each column in the dataset. For survey datasets, RGS computes the individual conditional probabilities of a student passing the course given his or her age, gender, race, major, or minor. For example, given that a student is male, the probability of the student passing the course is computed. These outputs are passed into the application layer.
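As an illustration of the kind of conditional probability RGS reports (using the gender example above), the estimate is simply a relative frequency computed over the survey records of the course:
\[
\hat{P}(\text{pass} \mid \text{gender} = \text{male}) =
\frac{\#\{\text{students who are male and passed the course}\}}{\#\{\text{students who are male}\}}
\]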
The purpose of the Emergency Warning System (EWS) is to detect poor-performing students very early in the semester by building models that classify students into "Pass" and "Fail" classes. EWS receives the training datasets and the dataset which needs classification from the dataset selector. EWS learns models from the training datasets, which are used to determine, for each course the student is enrolled in, whether the student is likely to fail that course. A list of student IDs and their corresponding course IDs for students who are predicted to fail is created; this list is displayed on the instructor's dashboard by the application layer.
3.3 Application Layer
In the Application Layer, the presentation component is responsible for visualizing the outputs of the MLDA Engine and displaying them on dashboards, while the what-if analysis component allows students to enter their future scores and see how they affect the predicted grade. The student dashboard displays the student's grade predicted by PAS, summaries of each course element (such as the mean, median score, and standard deviation), and the individual probabilities of a student passing the course given his or her gender, race, major, and age group, as computed by RGS. Figure 3.2 illustrates an example display of the student dashboard. The instructor dashboard shows the list of students identified as poor-performing; the instructor can click on a student from the list and view information about the student such as the student's name, description, grade predicted by PAS, and summaries generated by RGS.
Figure 3.2: Example display of the student dashboard
The student dashboard supports what-if analysis: students can use sliders to change the scores they obtained or can input future scores; this information is then used to predict their final grade. In summary, the role of the application layer is to create specific dashboards for students and instructors. These dashboards visualize and display information such as the predictions and summaries from PAS, EWS, and RGS, which help students and instructors act in a timely and well-informed manner.
4. Design of Relational Database to Store
Student Data
We collect a wide range of student information, from survey responses to student scores in the courses they take. The RSPEF system continually collects and updates data from students when they submit surveys and from instructors when they upload student scores through an interface. The relational database stores all the information necessary to build models; it is provided with interfaces to extract relevant information from students and to allow instructors to upload and save course and student-performance information, and a dataset selector interface on top of the relational database extracts the relevant datasets that are required to build models. Moreover, a schema is designed that provides a unified representation of student and course data that can be used across courses. The data we collect is naturally relational and involves multiple related entities. Therefore, there is a need to store them in independent tables which share relationships. A relational database best fits our needs as it is a well-understood approach, maps well to data with multiple relationships between multiple entities, and supports extensive join and aggregation operations.
4.1 Data Collection
The primary sources of data are students and instructors. The data is collected via a web application through which students are asked to fill out surveys and instructors use an interface to upload data along with metadata (course name, semester, etc.) into the database. Instructors upload old labeled data from previous semesters and new data from the current semester.
Students are asked to register themselves on our web application. During the process of registration, they are asked to answer a 35-question questionnaire, and their responses are recorded in our database; we will refer to this data as student survey data. Instructors upload and update student scores continuously during the semester as and when they are available; we will refer to this data as student-performance data. Sections 4.1.2 and 4.1.3 describe these two datasets in detail.
The survey asks students about factors that may influence their academic success. Students fill out this survey on a web application. We will use this data to build machine learning models to classify students into pass and fail categories; students classified as fail are detected as poor-performing / failing students. All questions in the survey fall into the following categories:
1. Demographic & Personal Questions
2. Financial Questions
3. Course-Specific Questions
4. Student-Specific Questions
5. Student-Institution Involvement Questions
Table 4.1 through table 4.5 list each of the questions, their appropriate response range, and which category they come under. For example, the question "What is your gender?" can accept either male or female as a response; other responses are strictly not allowed, so the response range for this question contains only male and female. Questions like "Which high school did you graduate from?" have a textbox to allow students to type the name of the high school they graduated from. The response range for this question is, however, very large, as there are over 36,000 high schools in the USA; this is dealt with by an auto-complete feature described later in this chapter.
Table 4.1 lists thirteen questions which collect demographics (age, gender, ethnicity), and some questions cover personal information (major, job, health). All questions are encoded with a unique code: the first letter of this code represents the category they belong to, and the following digit is the identifier of the question. D1 asks students to enter their ID, which will be used to associate student scores with this survey. D2 & D3 are age and gender. D4 is the student's major, for example, computer science. D5 & D6 are the racial ethnicity and the high school the student graduated from. D7, D8 & D11 evaluate the quantity and quality of time a student spends every day. D7 asks if the student's daytime is occupied by a job; if the answer to this question is no, it means the student has more time in hand which he or she could utilize on something more productive. D8 asks if the job is major-related; if yes, then the time spent on the job is productive in building the student's career. Similarly, D11, the number of hours the student spends on a job, determines the remaining hours in hand that could be used to prepare for homework or an exam for the current course. D9, D10, D12 & D13 are useful to determine other aspects of the student's personal situation, such as health.
Table 4.1: Demographic & personal questions and their respective response ranges
Table 4.2 lists five questions which focus on financial standing. F1 asks students whether their parents approve of the major they have taken in the university; the student's response indicates how much their parents support their education, morally and financially. F2 and F3 are follow-up questions to F1. These questions have an important effect on academic success. F4 & F5 concern financial aid, which reduces a student's burden of paying tuition or a loan. Students may work overtime or night shifts to earn their tuition; financial aid can help such students stop working overtime or night shifts and spend more time on their education. The amount of scholarship or aid received is proportional to the burden it relieves.
Category: Financial Questions (F)
F1. Do your parents approve of your major you chose in the university? {yes, somewhat, no, don't know}
F2. Please estimate your household's annual income in USD. {Positive Integers}
F3. What percent does your parent support financially towards your study? {0 to 100}
F4. In the previous semester, did you receive any financial aid, grants, or scholarships? {yes, no, don't know}
F5. How much did you receive as aid, grant, or scholarship? {Positive Integers}
Table 4.2: Financial questions and their respective response ranges
Table 4.3 lists four course-specific questions. As a student cannot evaluate a course before it begins, we conduct this part of the survey towards the end of the course. This category of the survey is summarized and presented to the course preparers and instructors, which in turn helps them to improve the course. Other than using this survey as feedback to instructors, we use it to identify the students who really love the course and enjoy it, and the ones who do not like the course and are disappointed with it. We assume that a student's satisfaction with the course has an influence on student achievement, and we therefore include these responses as features.
Table 4.3: Course specific questions and their respective response ranges
Table 4.4 lists six student-specific questions. S1 & S2 record the student's proficiency and experience, while S3 records whether the student is getting any extra help, for example from a study group. S4 & S5 assess students based on the time they take to complete tasks or to prepare for exams.
Table 4.4: Student specific questions and their respective response ranges
Table 4.5 lists seven questions identifying the student's involvement in institutional support and other activities. The University of Houston supports academic achievement in students by establishing tutoring and support centers with advisors and counselors to advise them; these centers also provide tutoring to students for several courses. I1 & I2 ask how often the student makes use of these resources. I3 asks how often the student skips classes: attendance is important to get the maximum knowledge out of a course, and it is assumed that a student with low average attendance will grasp less information or knowledge of the course, which is why this feature is considered important. I4 asks students the number of years they believe it is going to take them to graduate with a degree; I4 reveals the confidence of the student towards his or her academics and is an important factor. I5, I6 & I7 ask students whether they are involved in a student organization and whether it is a major-related one. These questions are reasonably good factors in determining student success as they add to the student's holistic experience in the major they have chosen.
Category: Student-Institution Involvement Questions (I)
I1. How often do you meet a counselor/advisor? {weekly, monthly, quarterly, never}
I2. How often do you go to tutoring centers like CASA or Learning Support Service? {weekly, monthly, quarterly, never}
I3. How often do you skip classes? {weekly, monthly, quarterly, never}
I4. How many years will it take you to complete your degree requirements? {0 to 20}
I5. Are you a member of a student organization at the university? {yes, no}
I6. Are you a member of a student organization at the university related to your major? {yes, no}
I7. What is the name of the major related organization? {text}
Table 4.5: Student-Institution involvement questions and their respective response ranges
A student-performance dataset is used to predict a student's final grade. This dataset contains the scores students obtain in a course. During a course, instructors conduct evaluation procedures, both subjective and objective, like exams, quizzes, assignments, or projects; these are referred to as course elements. Each of these elements is given scores based on how students have performed. Usually, the instructors have a certain weight for each of the course elements, which is later used to calculate weighted scores. In our tables, these weights are written next to the course element in brackets. The instructor records the scores in an EXCEL or a CSV file. There are two versions of this dataset: labeled, and unlabeled or new data. Labeled data is older data with the final grades and is used to train and evaluate ML models; these models are then used to predict labels, in this case grades, for new data. These datasets are uploaded to the database by instructors using a web application interface. A typical student-performance dataset is illustrated in Table 4.6.
Table 4.6: Student performance dataset with labels, and weight of each course element
indicated in brackets as percentages (they add up to 100%)
4.2 Data Pre-processing
The collected data must be pre-processed before it can be used to build models with machine learning algorithms. Real-world data, specifically data collected from surveys, can be dirty and can render the process of building ML models useless. This is mainly because survey data can be incomplete, noisy, or inconsistent. When responses are missing, incomplete datasets are created; in such cases, data values are imputed where possible.
Noise is dealt with in the data collection stage. Rather than using text boxes to capture responses, we use drop-down lists, sliding bars, and radio buttons to eliminate noise. For example, a question in the survey asks students for the high school they graduated from, which can have an enormous number of possible inputs. To avoid incorrect or non-standard inputs, we use an auto-complete web feature: when the student starts to type the name of the high school in the input box, the browser sends a request to a database server to look for all possible matches for the partial word (this involves Natural Language Processing & Hidden Markov Models), for which plugins are already available (Flexselect, [Link]), and returns a drop-down menu with a few suggested high schools. The student then only has to select from the drop-down menu.
Inconsistent data here refers to sparseness in the data, which can lead to learning wrong or biased models; we normalize student scores to keep the data consistent within a small range. In student-performance data, student scores can be sparse: for example, one student has a final exam score of 100, the best score, the second-best score is 55, and the rest of the scores are distributed below 55. In this case, we could learn a biased model; to avoid this situation, we normalize the scores using the z-score procedure. Equation 4.2.1 gives the formula to calculate the z-score of a particular value X in a data column with mean μ and standard deviation σ:
\[ X_{\text{new}} = \frac{X - \mu}{\sigma} \qquad (4.2.1) \]
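A minimal sketch of z-score normalization of one score column, following Equation 4.2.1; the example scores below are illustrative values echoing the sparse final-exam example above, not real course data.

public class ZScoreNormalization {

    // Returns the z-score normalized copy of a score column.
    public static double[] zScore(double[] scores) {
        double mean = 0.0;
        for (double s : scores) {
            mean += s;
        }
        mean /= scores.length;

        double variance = 0.0;
        for (double s : scores) {
            variance += (s - mean) * (s - mean);
        }
        double sd = Math.sqrt(variance / scores.length);

        double[] normalized = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = (scores[i] - mean) / sd;   // Equation 4.2.1
        }
        return normalized;
    }

    public static void main(String[] args) {
        double[] finalExam = {100, 55, 50, 45, 40};    // one very high score, the rest clustered low
        for (double z : zScore(finalExam)) {
            System.out.printf("%.2f ", z);
        }
        System.out.println();
    }
}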
Once we have pre-processed the datasets, we pass them to the next stage of building ML models. Next, we describe the interfaces for getting data into the relational database and the role of the database in providing datasets to learn models. The student survey form is a web-based questionnaire where students enter responses to the questions mentioned in section 4.1.1. The instructor's upload interface is also web-based and lets instructors upload CSV (comma-separated values) or EXCEL (usually tab-separated values) files to the database. The dataset selector identifies datasets which need predictions and determines the appropriate training sets. These datasets are preprocessed and further divided into smaller datasets. If a course is identified for prediction, the dataset selector extracts datasets of that course from previous semesters to represent the training data. From both the current and training datasets, the student-performance dataset and the student survey datasets are extracted. These datasets are then fed as input to PAS, EWS, and RGS.
4.4 Entity-Relationship Model
We use an entity-relationship (ER) model to design the relational database, which allows us to perform complex join operations between related tables. In the following sections, we will explain the entities and relationships in our ER model and the relational schema derived from it.
4.4.1 Entity
Our model has seven entities,
Student
Professor
Course
Section
Course Element
Survey
Performance Summary
Figure 4.2: Student and professor entities and their associated properties
Student entity has an identifier student_id; the name is a composite property and comprises
two individual properties: first and last name; the address is a composite property
comprising: street number, name, apartment number, city, state, and zip code; ethnicity,
gender, and date of birth. Age is a derived property (from the date of birth). Professor entity
type has an identifier professor_id, name, and email address. See figure 4.2.
Figure 4.3: Course and section entities and their associated properties
Course entity has an identifier course_id, the name of the course, optional description, term
in which it is being offered, department and the number of credits awarded. Section entity
has an identifier section_id, room number, and timings. See figure 4.3.
Figure 4.4: Course element and survey entities and their associated properties
Course element has an identifier element_id, the name of the element, weight, and
maximum possible score. Survey entity has an identifier survey_id and question. This table
has only 35 instances, as there are 35 survey questions. The table remains the same throughout unless the administrator decides to add or remove questions. See figure 4.4.
The performance summary entity stores the results of students in courses, both actual and predicted. It has a summary_id, the actual grade obtained, and the grade predicted by PAS. See figure 4.5.
4.4.2 Relationship
A relationship defines how one entity depends on another. We have identified eight relationships in our model.
The "student takes course" relationship has many-to-many cardinality: one or more students can enroll in one or more courses. The table "takes" has student_id and course_id as foreign keys, which represent a list of student and course mappings. See figure 4.6.
Figure 4.7: Illustration of “course has section” and “professor teaches section”
relationship
The "course has section" relationship is a special kind of relationship: unlike the prior relationship, there is no explicit table for 'has'; instead, section_id is placed into the course table as a foreign key. It follows one-to-many cardinality: one course can have one or more sections. The "professor teaches section" relationship has many-to-many cardinality: one or more professors can teach one or more sections. The table "teaches" has professor_id and section_id as foreign keys, which represent a list of professor and section mappings. See figure 4.7.
The "student fills_out survey" relationship is many-to-many: every student must answer every question in the survey. Responses are recorded in the "fills_out" table under the column 'student_response'.
Figure 4.9: Illustration of “course has course element” and “course element scores
students” relationships
The "course has course element" and "course element scores students" relationships are one-to-many and many-to-many cardinalities respectively: every course has more than one course element, while every student enrolled in these courses is scored for all the associated course elements. For example, a student's score in the final exam for the enrolled course 0001 is recorded under the 'score' column in the 'scores' table.
Figure 4.10: Illustration of “course has performance summary” and “student has
performance summary” relationships
The "course has performance summary" and "student has performance summary" relationships are one-to-many cardinalities: each course and each student can have one or more performance summaries. See figure 4.10.
For a given problem, there can be many possible ER representations; we try to reduce the set of all possible ER diagrams to one good diagram or model which satisfies our goal [22]. We have used an online tool called ERDPlus to design our model. Figure 4.11 shows the resulting entity-relationship diagram.
Figure 4.11: Entity-Relationship diagram for storing student data
This model describes the structure of the database. It captures student data from more than one course, where each course has one or more sections taught by one or more professors. The student, course, section, and professor entities can have specific information associated with them. The survey entity is a collection of 35 mandatory questions, and each student must be associated with all of them; the responses go into the relationship 'fills_out'. A course has more than one course element associated with it, and every student is scored for the course elements which belong to the courses they have taken. The performance summary shares a similar relationship with student and course; therefore, information like final grades and predicted grades, which are specific to a student and a course, is recorded there.
ERDPlus has a feature to convert ER diagrams into a relational schema diagram. All entities and relationships are converted into tables. The relationship 'has' is not converted into a table; rather, these relations are treated specially, and a foreign key is placed in one of the participating tables. In the "student takes course" and "professor teaches section" relations, 'takes' and 'teaches' become their own tables with the corresponding foreign keys. For "course has section", which has many-to-one cardinality, section_id is put in the course table as a foreign key.
Figure 4.12: Relational schema for storing student data
In the "student has performance summary" and "course has performance summary" relationships, which are one-to-many cardinalities, student_id and course_id are placed in the performance summary table as foreign keys. To retrieve the performance summary of a student in a particular course, we join three tables (student, course, and performance summary) and filter by the specific student_id and course_id. Similarly, "course has course element" puts course_id in the course element table as a foreign key. Both "course element scores student" and "student fills_out survey" are many-to-many cardinalities: the table 'scores' has both associated foreign keys and the column 'score', and the table 'fills_out' has both associated foreign keys and the column 'student_response'.
By following these standard procedures to convert an ER diagram, we generate a reasonable relational schema diagram from which one can easily write DDL (Data Definition Language) statements in SQL to create the tables in any relational database on the market.
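As a hedged illustration of the kind of DDL that can be derived from the relational schema in Figure 4.12, the sketch below creates student, course, and takes tables through JDBC; the column types and the MySQL connection details are assumptions, not the DDL actually used for RSPEF.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateCoreTables {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/rspef", "user", "password");
             Statement st = con.createStatement()) {
            st.executeUpdate("CREATE TABLE student ("
                    + " student_id INT PRIMARY KEY,"
                    + " first_name VARCHAR(50), last_name VARCHAR(50),"
                    + " gender VARCHAR(10), ethnicity VARCHAR(40), date_of_birth DATE)");
            st.executeUpdate("CREATE TABLE course ("
                    + " course_id INT PRIMARY KEY,"
                    + " name VARCHAR(100), term VARCHAR(20), department VARCHAR(50), credits INT)");
            // The many-to-many "takes" relationship becomes its own table with two foreign keys.
            st.executeUpdate("CREATE TABLE takes ("
                    + " student_id INT, course_id INT,"
                    + " PRIMARY KEY (student_id, course_id),"
                    + " FOREIGN KEY (student_id) REFERENCES student(student_id),"
                    + " FOREIGN KEY (course_id) REFERENCES course(course_id))");
        }
    }
}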
Archived data is the major source of training and test data. We only store current-semester data in our database; once the data is more than 12 months old, we move it into an archive database, which is a replica of our current database. When building ML models, data from the archive is requested. This data is partitioned into training and test sets, and a model is built and evaluated using the archive data itself. During deployment, current data is passed into the model to make predictions for the current or new data. See figure 4.13.
5. Learning Models that Predict Student
Performance
Various models are learned by Predictive Analysis System (PAS) and the Emergency
Warning System (EWS) to predict student grades and to identify poor-performing students
in a course. In this chapter, we will briefly discuss the different approaches that were used
to obtain these models. Moreover, we will discuss the results obtained by using various
datasets to train these models and will compare the performance of different types of
models using a set of evaluation measures. Sections 5.1 and 5.2 will discuss the datasets
used in the experiments and introduce naming conventions; sections 5.3 and 5.4 will
discuss the machine learning tools used to build and evaluate models and sections 5.5 and
5.6 will give an overview of experiments and compare the obtained results.
5.1 Datasets
Over the period of two semesters (FALL 2015 & SPRING 2016) we collected survey and performance data for the course COSC1430 (Introduction to Programming). The data for this course in the fall semester of 2015 has 104 instances and is used to perform the experiments; we will refer to this dataset as COSC1430 (FALL 2015). Models are trained on two kinds of data. The survey dataset has the structure
Student_id, response1, response2, ..., response35, grade
(see section 4.1.2 for the questions and their possible responses); the target class here is the grade obtained by the student. The performance dataset is likewise derived from COSC1430 (FALL 2015). This course has the following course elements: attendance, homework, class project, exam, and quiz. A final grade is one of 12 grades, which is recorded in the column called "grade I". The columns grade II (see table 5.6 for the mapping between grade I and grade II) and grade point can be derived from grade I. See tables 5.1 and 5.2, illustrated by rows such as:
Identifier   Course_Element1   Course_Element2   Course_Element3
0001         50                100               70
0002         100               100               80
0003         25                25                25
Datasets to Predict Grade I
This dataset is a subset of table 5.1: all columns except the grade II and grade point columns are retained. At the University of Houston, a student is awarded a grade from a pool of 12 grades: A, A-, B+, B, B-, C+, C, C-, D+, D, D- and F. Special grades such as I (Incomplete), W (Withdraw) and NO-grade are removed from the dataset. We will call this dataset "dataset-o1" (o1 stands for output 1), indicating that our target class is the column "grade I". Example rows:
Identifier   Course_Element1   Course_Element2   Course_Element3   GRADE I
0001         50                100               70                B+
0002         100               100               80                A-
0003         25                25                25                F
Datasets to Predict Grade II
Grade II is a simplified grading scheme using five grades: A, B, C, D, and F. From COSC1430 (FALL 2015), a subset of the data is extracted by leaving out the "grade I" and "grade point" columns. We will call this dataset "dataset-o2" (o2 stands for output 2), indicating that our target class is the column "grade II". Example rows:
Identifier   Course_Element1   Course_Element2   Course_Element3   GRADE II
0001         50                100               70                B
0002         100               100               80                A
0003         25                25                25                F
Datasets to Predict Grade Point
A third dataset maps "grade I" to the respective grade points. The column "grade point" in the COSC1430 (FALL 2015) dataset is derived from "grade I"; see table 5.5. We will call this dataset "dataset-o3" (o3 stands for output 3), indicating that our target class is the column "grade point". Table 5.5 shows the mapping:
grade I       A     A-    B+    B     B-    C+    C     C-    D+    D     D-    F
grade point   4.00  3.66  3.33  3.00  2.66  2.33  2.00  1.66  1.33  1.00  0.66  0.00
grade II      A  B  C  D  F
Table 5.5: Mapping between grades I, grade II and their respective numeric grade points
5.2 Dataset Representations
In the previous sections, we discussed the structure of the dataset COSC1430 (FALL 2015) and extracted three datasets: dataset-o1, dataset-o2, and dataset-o3. In our experimental evaluation we will use different subsets of dataset-o1, dataset-o2, and dataset-o3; therefore, we introduce notations for the datasets so that we can easily describe their subsets. See table 5.7. These notations start by mentioning all attributes, beginning with the character 'a' followed by a sequence of the respective column indices, and then the character 'o' followed by the index of the output (target) column.
When making predictions, we frequently do not have data for all the course elements. For example, early in the semester, we might have data only for attendance and homework. But we still need to generate prediction models for these partial datasets. We use the following nomenclature to refer to such datasets: the course elements are numbered sequentially from left to right. To represent columns which are attributes of dataset-o1 (see table 5.3), we write the letter 'a' before "o1" and list all the columns by their numbers after 'a'. For example, dataset a12o1 has the attendance, homework, and grade I attributes. If two course elements are the same but are added to the dataset at a later time, they are merged into one in the pre-processing step. For example, if dataset-o1 has two quiz columns (say, quiz 1 and quiz 2) that are added to the dataset on different dates, they are merged into one course element, quiz, and the scores are averaged. A sketch of how such column subsets can be extracted programmatically is given after the example tables below.
Identifier   Course_Element1   Course_Element2   Course_Element3   GRADE
0001         50                100               70                B+
0002         100               100               80                A-
0003         25                25                25                F

Identifier   Course_Element1   Course_Element2   GRADE
0001         50                100               B
0002         100               100               A
0003         25                25                F
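The following sketch shows one way such a column subset (here, a hypothetical a12o1 selection) could be extracted with Weka's Remove filter; the ARFF file name and the attribute positions are assumptions for illustration, not the actual files used in this work.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DatasetSubsetSketch {
    public static void main(String[] args) throws Exception {
        Instances full = DataSource.read("cosc1430_fall2015_performance.arff");  // hypothetical file
        Remove keep = new Remove();
        keep.setAttributeIndices("1,2,3,7");   // identifier, first two course elements, grade I (assumed 1-based positions)
        keep.setInvertSelection(true);         // keep the listed attributes, remove all others
        keep.setInputFormat(full);
        Instances a12o1 = Filter.useFilter(full, keep);
        a12o1.setClassIndex(a12o1.numAttributes() - 1);
        System.out.println(a12o1.numAttributes() + " attributes, " + a12o1.numInstances() + " instances");
    }
}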
5.3 Machine Learning Tools
In this research, we will be using the Naïve Bayes Classifier, Logistic Regression Classifier, and Support Vector Regression, among other algorithms. A data-mining and machine learning tool called Weka, developed at the University of Waikato in New Zealand, is used in this research. Java code is written using the Weka API to learn models. A notation which defines the dataset is passed as input to the Java code; the code then loads the dataset corresponding to the entered notation into system memory. By analyzing the dataset, the program decides whether the task is classification or regression: if we are classifying students into pass or fail categories, it is a classification problem and we use classification algorithms; otherwise (when predicting a numeric grade point), we use Linear Regression and SMOreg to build regression models. Then we select the parameters of each algorithm.
The Naïve Bayes Classifier has two parameters which can be toggled based on the algorithm and user requirements: "-K", which makes the model use a kernel density estimator rather than a normal distribution for numeric attributes, and "-D", which allows the use of supervised discretization to process numeric attributes. Both parameters are turned on in our approach. The Logistic Regression Classifier has two parameters, "-R" and "-M": the first enables the ridge penalty in the log-likelihood, and the latter represents the maximum number of iterations before convergence, which is set to 10000. The Random Forest uses the default parameters, which are: "P", the size of each bag, set to 100; "I", the number of iterations, set to 100; "S", the seed for the random number generator, set to 1; and "B", to break ties randomly when several attributes look equally good. The Multi-Layer Neural Network uses 0.3 for the learning rate, the momentum rate for the back-propagation algorithm is set to the default 0.2, and the seed value for the random number generator is set to 0.
Linear Regression in Weka uses the Akaike criterion for model selection; the selection method is set to greedy and the ridge parameter is set to 1.0e-8. For regression problems, Weka provides SMOreg, a Support Vector Regression implementation; its parameters are learned using the optimization algorithm RegSMOImproved. The model parameters are set to their defaults, which are: "C", the complexity constant, set to 1; "T", the tolerance parameter for checking the stopping criterion, set to 0.001; "P", the epsilon for round-off error, set to 1.0e-12; and "L", the epsilon parameter in the epsilon-insensitive loss function, set to 1.0e-3. We build the models using these parameters, output evaluation metrics for the different algorithms used, and can then compare them.
to other models and how credible it is to be utilized in the real world. Error on the training
data is not a good indicator of performance on future data. Because the model learns from
the training data, estimates made to this data by the model are usually optimistic. In this
section, we will discuss various metrics of evaluation used to measure a model’s
performance.
5.4.1 Confusion Matrix
A confusion matrix describes the performance of a classification model on a set of test data for which the true values are known. There are four basic terms:
True Positive (TP): These are cases which are predicted to be true and are actually true.
True Negative (TN): These are cases which are predicted false and they are actually false.
False Positive (FP): These are predicted true but are actually false. (Type I error)
False Negative (FN): These cases are predicted false but are actually true. (Type II error)
5.4.2 Accuracy
For classification, accuracy is straightforward: it is defined as the percentage of instances that are correctly classified, summed over all K class values. The formula for accuracy is given in [Link].
$\mathrm{Accuracy} = \frac{\sum_{i=1}^{K} \text{Correctly Classified Instances}_i}{\text{Total Number of Instances}}$   ([Link])
When calculating accuracy for a regression model, we first convert the predicted grade points to the corresponding letter grades (from the pool of 12 grades) and then use the above formula to compute accuracy.
5.4.3 Precision
Precision is the fraction of retrieved instances that are relevant. It is the ratio of correctly predicted positive instances to all instances that are predicted positive:

$\mathrm{Precision} = \frac{TP}{TP + FP}$   ([Link])
5.4.4 Recall
Recall is the fraction of relevant instances that are retrieved. It is the ratio of correctly predicted positive instances to all instances that are actually positive:

$\mathrm{Recall} = \frac{TP}{TP + FN}$   ([Link])
The F-1 score takes both false positives and false negatives into account. Intuitively it is not as easy to
understand as accuracy, but the F-1 score is usually more useful than accuracy. From another perspective, the F-1 score is the harmonic mean of precision and recall:

$F\text{-}1\ \mathrm{score} = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}$   ([Link])
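As a small worked example of the three measures above, the following sketch computes precision, recall, and the F-1 score from hypothetical confusion-matrix counts.

// Worked example: precision, recall, and F-1 score from hypothetical confusion-matrix counts.
public class ClassificationMetrics {

    public static void main(String[] args) {
        double tp = 40, fp = 10, fn = 20;                              // hypothetical counts

        double precision = tp / (tp + fp);                             // 40 / 50 = 0.80
        double recall = tp / (tp + fn);                                // 40 / 60 = 0.67 (rounded)
        double f1 = 2 * precision * recall / (precision + recall);     // approx. 0.73

        System.out.printf("precision=%.2f recall=%.2f f1=%.2f%n", precision, recall, f1);
    }
}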
The Mean Absolute Error (MAE) measures the average magnitude of the prediction errors, where $f_i$ is the predicted value, $y_i$ is the true value, and $e_i = f_i - y_i$:

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |f_i - y_i| = \frac{1}{n}\sum_{i=1}^{n} |e_i|$   ([Link])
MAE cannot be computed directly for data with nominal or categorical labels, but our workaround lets us compute MAE for such datasets: since the grades can be sorted from highest to lowest (A to F), we assign each grade a number from 1 to 12 or 1 to 5, depending on the outcome variable. After this conversion, we calculate MAE using the formula in [Link].
The Root Mean Squared Error (RMSE) is similar to MAE, but instead of taking the absolute value of the error it takes the square root of the squared error, which will
effectively amplify large errors and dampen small errors. The formula to compute RMSE is given in [Link].

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (f_i - y_i)^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{\,2}}$   ([Link])
RMSE cannot be computed directly for data with nominal or categorical labels, but the same workaround applies: since the grades can be sorted from highest to lowest (A to F), we assign each grade a number from 1 to 12 or 1 to 5, depending on the outcome variable. After this conversion, we calculate RMSE using the formula in [Link].
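The grade-to-rank workaround used for MAE and RMSE on letter grades can be sketched as follows; the grade ordering matches the 12-grade scale described above, while the predicted and actual grades are purely illustrative.

import java.util.Arrays;
import java.util.List;

public class GradeErrorMetrics {

    // Grades ordered from highest to lowest, mapped to ranks 1..12 as described above.
    static final List<String> GRADES = Arrays.asList(
            "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F");

    static int rank(String grade) {
        return GRADES.indexOf(grade) + 1;
    }

    public static void main(String[] args) {
        // Purely illustrative predicted and actual letter grades for five students.
        String[] predicted = { "B+", "A-", "C", "B", "F" };
        String[] actual    = { "B",  "A",  "C+", "B-", "D-" };

        double sumAbs = 0, sumSq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double e = rank(predicted[i]) - rank(actual[i]);
            sumAbs += Math.abs(e);
            sumSq += e * e;
        }
        System.out.printf("MAE=%.2f RMSE=%.2f%n",
                sumAbs / predicted.length, Math.sqrt(sumSq / predicted.length));
    }
}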
The datasets which were described in chapter 5.1 were used as training data to build prediction and classification models; next, the obtained models were evaluated and compared using metrics such as accuracy, MAE, MAE 2, RMSE, and RMSE 2 for the performance-dataset-based models, whereas accuracy, precision, and recall were used for the survey-dataset-based models. To perform the above-stated task we have written software in Java which operates on top of the Weka API (Application Programming Interface). This code is customized to compute MAE 2 and RMSE 2, which are not part of Weka. We used 10-fold cross-validation to evaluate the different models. In the first experiment, the students' final grades are divided into 12 classes: A, A-, B+, B, B-, C+, C, C-, D+, D, D-, and F. In this experiment we used Naïve Bayes, Classification via Logistic Regression, Random Forests, and Multi-Layered Neural Networks as models. We run these four algorithms on five of our extracted datasets; the results can be found in tables 5.13, 5.14, 5.15, 5.16, and 5.17.
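A minimal sketch of such an evaluation run is given below. It assumes one ARFF file per extracted dataset (the file name is a placeholder), that Weka retains the cross-validation predictions (the default in recent versions), and that the grade labels are declared in order from A to F so that class indices can serve as grade ranks.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.Prediction;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationRun {

    public static void main(String[] args) throws Exception {
        // Placeholder file name for one of the extracted 12-class (grade I) datasets.
        Instances data = DataSource.read("dataset-a12345o1.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Logistic(), data, 10, new Random(1));   // 10-fold cross-validation
        System.out.println("Accuracy (%): " + eval.pctCorrect());

        // Grade-distance MAE: if the grade labels are declared in order (A, A-, ..., F),
        // the class indices act as ranks and their absolute difference is the grade distance.
        double sumAbs = 0;
        for (Prediction p : eval.predictions()) {
            sumAbs += Math.abs(p.predicted() - p.actual());
        }
        System.out.println("MAE (in grades): " + sumAbs / eval.predictions().size());
    }
}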
Notation            Schema
Dataset-o1          Id, attendance, homework, class project, exam, quiz, grade I
Dataset-a1o1        Id, attendance, grade I
Dataset-a12o1       Id, attendance, homework, grade I
Dataset-a123o1      Id, attendance, homework, class project, grade I
Dataset-a1234o1     Id, attendance, homework, class project, exam, grade I
Dataset-a12345o1    Id, attendance, homework, class project, exam, quiz, grade I
Table 5.11: Review of various dataset nomenclature used and the schema they represent
The second experiment uses the performance dataset from which we extract dataset-o2. This dataset-o2 has five attributes: attendance, homework, class project, exam, and quiz. Therefore, we extract 5 further datasets: dataset-a1o2, dataset-a12o2, dataset-a123o2, dataset-a1234o2, and dataset-a12345o2. We are classifying students in these datasets into 5 classes (A, B, C, D, and F).
It is a multi-class classification problem. Therefore, from our bag of algorithms, we use Naïve Bayes, Classification via Logistic Regression, Random Forests, and Multi-Layer Neural Networks. We run these four algorithms on the five extracted datasets and compare the results.
The third experiment uses the performance dataset from which we extract dataset-o3. This dataset-o3 has five attributes: attendance, homework, class project, exam, and quiz. Therefore, we extract 5 further datasets: dataset-a1o3, dataset-a12o3, dataset-a123o3, dataset-a1234o3, and dataset-a12345o3. This is a regression problem, as we are predicting a grade point which ranges from 0 to 4 and can take decimal values. Therefore, from our bag of algorithms, we use Linear Regression and Support Vector Regression, popularly called SMOreg. We run both algorithms on the five extracted datasets and compare the results.
The fourth experiment uses the student survey dataset, in which we convert the attribute which records the final letter grade into either "pass" or "fail" using the following logic: all grades from 'A' through 'D-' are replaced with "pass" and 'F' is replaced with "fail"; that is, we are dealing with a binary classification problem. In this experiment, we used Naïve Bayes, Classification via Logistic Regression, Random Forests, and Multi-Layered Neural Networks as models.
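A minimal sketch of this conversion (class and method names are illustrative):

// Illustrative conversion of letter grades to the binary pass/fail outcome of experiment 4.
public class PassFailConversion {

    static String toPassFail(String grade) {
        // Every grade from 'A' down to 'D-' counts as "pass"; only 'F' counts as "fail".
        return "F".equals(grade) ? "fail" : "pass";
    }

    public static void main(String[] args) {
        System.out.println(toPassFail("B-"));   // pass
        System.out.println(toPassFail("F"));    // fail
    }
}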
5.6 Experimental Results & Model-Performance Evaluation
In this section, we present and discuss the results of our experimentation. The goal of this experimentation is to determine the most suitable and credible models for each of our problems so that they can be applied in real-world scenarios. Table 5.12 gives the notations used; each of the following tables compares the algorithms over the five datasets that use 1, 2, 3, 4, and 5 attributes, respectively. Also, color coding is applied to visualize the performance of each algorithm for a dataset: in each column, the cell colored in green is the best result and the one in red is the worst result.
5.6.1 Discussion of Results: Predicting Student Grade I
For the single-attribute dataset, Naïve Bayes has an accuracy of 16.2%, which is the lowest among all four classifiers, whereas the logistic regression model is the "best" method with an accuracy of 22.1%. The reason for such low accuracy is that the dataset has only 104 instances from which to learn models that need to distinguish between 12 classes. Additionally, if there is only one attribute to learn from, we cannot expect good results. However, as we go from left to right in table 5.13, accuracy increases for all algorithms. When attribute "a2" (see 5.12) is added there is a leap in accuracy for Naïve Bayes and logistic regression classification. This kind of leap is seen again when attribute "a4" is added. It can be inferred that attributes "a2" and "a4" positively affect the accuracy of predicting student grades. Looking at the green cells, it can be observed that logistic regression consistently outperforms the other three algorithms. Random forests have lower accuracy in most cases. For multi-layer Neural Networks, accuracy steadily increases with the number of attributes involved.
ACCURACY                             dataset-a1o1   dataset-a12o1   dataset-a123o1   dataset-a1234o1   dataset-a12345o1
Naïve Bayes                          16.20          31.42           32.85            43.57             46.42
Logistic Regression Classification   22.14          31.42           32.14            55.71             57.14
Random Forests                       21.43          24.28           20.71            47.14             46.42
Multi-Layer Neural Network           21.43          27.85           28.57            52.14             52.14
MAE, MAE 2, RMSE, and RMSE 2 tell us how close the model's predictions are to the actual values. MAE and MAE 2 offer insights similar to accuracy. For the one-attribute datasets the predictions are about 3 grades off, for the two- and three-attribute datasets the predictions are about 2 grades
off, and for more than 3 attributes the predictions are about one grade off. In general, MAE gives us a better understanding of how far off our predictions are: in the worst case, for a student who actually receives a 'B' grade the model might predict an 'A' or a 'C+', while in the best case the model will predict a 'B+', a 'B-', or even a 'B'.
MAE                                  dataset-a1o1   dataset-a12o1   dataset-a123o1   dataset-a1234o1   dataset-a12345o1
Naïve Bayes                          3.87           2.52            2.61             1.79              1.64
Logistic Regression Classification   3.36           2.54            2.44             1.32              1.23
Random Forests                       3.59           2.67            3.20             1.56              1.48
Multi-Layer Neural Network           3.45           2.61            2.72             1.38              1.41
MAE 2                                dataset-a1o1   dataset-a12o1   dataset-a123o1   dataset-a1234o1   dataset-a12345o1
Naïve Bayes                          1.06           0.57            0.54             0.31              0.28
Logistic Regression Classification   0.90           0.58            0.56             0.21              0.20
Random Forests                       1.01           0.65            0.70             0.28              0.28
Multi-Layer Neural Network           1.03           0.63            0.66             0.24              0.23
MAE 2 is simply MAE computed after converting all the grades into grade points, which range from 0 to 4; the number reported by MAE 2 is therefore the distance from the grade point actually achieved by the student. The worst MAE 2 is 1.06, obtained by Naïve Bayes with only one attribute, and the best is 0.20, obtained by Logistic Regression Classification involving all
attributes. See table 5.14 for the MAE comparison and table 5.15 for the MAE 2 comparison. The worst RMSE 2 is 1.37 (multi-layer Neural Network, one attribute) and the best is 0.34 (Logistic Regression Classification, all attributes). Tables 5.16 and 5.17 show RMSE and RMSE 2, respectively.
RMSE                                 dataset-a1o1   dataset-a12o1   dataset-a123o1   dataset-a1234o1   dataset-a12345o1
Naïve Bayes                          4.90           3.75            4.05             3.17              3.15
Logistic Regression Classification   4.52           3.88            3.74             2.70              2.59
Random Forests                       4.71           3.96            4.52             2.90              2.85
Multi-Layer Neural Network           4.56           3.86            4.02             2.75              2.80
RMSE 2                               dataset-a1o1   dataset-a12o1   dataset-a123o1   dataset-a1234o1   dataset-a12345o1
Naïve Bayes                          1.34           0.78            0.75             0.46              0.43
Logistic Regression Classification   1.18           0.80            0.76             0.34              0.34
Random Forests                       1.33           0.86            0.91             0.42              0.42
Multi-Layer Neural Network           1.37           0.85            0.92             0.38              0.37
5.6.2 Discussion of Results: Predicting Student Grade II
Hoping to improve performance and allow the algorithms to learn more reliable models, we reduced the number of classes to be predicted from 12 to 5. Tables 5.18, 5.19, 5.20, 5.21, and 5.22 show accuracy, MAE, MAE 2, RMSE, and RMSE 2 for predicting grade II. Accuracy improved by a factor of about 2 for the two best-performing algorithms, Naïve Bayes and logistic regression classification. Logistic regression classification learns a better model than the other classifiers, with 78.57% accuracy when using all five attributes and 38.57% when using only a single attribute. Naïve Bayes underperformed in comparison when attributes "a4" and "a5" are added. Random forests consistently underperformed throughout. The multi-layer Neural Network had its best result when attribute "a4" was added.
ACCURACY                             dataset-a1o2   dataset-a12o2   dataset-a123o2   dataset-a1234o2   dataset-a12345o2
Naïve Bayes                          38.57          47.14           47.14            70.71             75.00
Logistic Regression Classification   38.57          47.86           49.29            77.14             78.57
Random Forests                       35.00          43.57           37.86            71.43             76.42
Multi-Layer Neural Network           36.42          49.28           44.29            76.43             75.71
MAE and MAE 2 show that logistic regression classification does the best job of predicting the actual results. The best MAE, 0.45, is observed for logistic regression classification when all five attributes are considered. When only one attribute is used to build the model there is a tie between Naïve Bayes and the multi-layer Neural Network, both having an MAE of 1.33. In that case, if we use MAE 2 to break the tie, Naïve Bayes
has 0.90 and the Neural Network has 0.96 for MAE 2. Tables 5.19 and 5.20 show MAE and MAE 2, respectively.
MAE                                  dataset-a1o2   dataset-a12o2   dataset-a123o2   dataset-a1234o2   dataset-a12345o2
Naïve Bayes                          1.33           1.09            1.12             0.68              0.53
Logistic Regression Classification   1.34           1.04            1.06             0.48              0.45
Random Forests                       1.92           1.14            1.36             0.58              0.53
Multi-Layer Neural Network           1.33           1.02            1.17             0.49              0.56
MAE 2                                dataset-a1o2   dataset-a12o2   dataset-a123o2   dataset-a1234o2   dataset-a12345o2
Naïve Bayes                          0.90           0.67            0.66             0.29              0.25
Logistic Regression Classification   0.89           0.65            0.21             0.24              0.21
Random Forests                       0.96           0.71            0.80             0.29              0.24
Multi-Layer Neural Network           0.96           0.64            0.71             0.24              0.25
For the dataset with one attribute, the Neural Network works best according to RMSE, while logistic regression classification works best according to RMSE 2. To determine the better algorithm we need to consider the differences in both RMSE and RMSE 2. The difference between the logistic regression and Neural Network errors is 0.04 for RMSE and -0.08 for RMSE 2. Choosing the "wrong" algorithm would therefore shift the error by only 0.04, which is an insignificant change; the RMSE 2 difference of -0.08 is also
small in absolute terms, but when the grade points are converted back to grades the prediction may be off by one grade, which is significant. See table 5.21 for the RMSE comparison and table 5.22 for the RMSE 2 comparison.
RMSE                                 dataset-a1o2   dataset-a12o2   dataset-a123o2   dataset-a1234o2   dataset-a12345o2
Naïve Bayes                          1.89           1.67            1.73             1.40              1.15
Logistic Regression Classification   1.91           1.62            1.67             1.10              1.04
Random Forests                       1.92           1.74            1.96             1.17              1.19
Multi-Layer Neural Network           1.87           1.62            1.78             1.11              1.21
RMSE 2                               dataset-a1o2   dataset-a12o2   dataset-a123o2   dataset-a1234o2   dataset-a12345o2
Naïve Bayes                          1.26           0.98            0.96             0.53              0.49
Logistic Regression Classification   1.23           0.93            0.91             0.49              0.46
Random Forests                       1.33           1.00            1.10             0.53              0.50
Multi-Layer Neural Network           1.31           0.92            1.02             0.47              0.53
In summary, reducing the number of classes from 12 to 5 has shown a significant increase in accuracy. Logistic regression classification performed best, multi-layer Neural Networks performed second best, whereas Naïve Bayes performs moderately but better than random forests. It is also observed that the exam attribute (a4) is essential to predicting student grades accurately.
5.6.3 Discussion of Results: Predicting Grade Points
In the third experiment, we ran Linear Regression and SMO regression on the 5 datasets extracted from dataset-o3. The metrics used to compare the two algorithms were accuracy, MAE 2, and RMSE 2.
ACCURACY            dataset-a1o3   dataset-a12o3   dataset-a123o3   dataset-a1234o3   dataset-a12345o3
Linear Regression   35.46          41.13           41.13            69.50             69.50
SMO Regression      34.75          41.13           42.55            73.75             73.04
MAE 2               dataset-a1o3   dataset-a12o3   dataset-a123o3   dataset-a1234o3   dataset-a12345o3
Linear Regression   0.89           0.61            0.62             0.30              0.30
SMO Regression      0.95           0.63            0.64             0.29              0.28
RMSE 2              dataset-a1o3   dataset-a12o3   dataset-a123o3   dataset-a1234o3   dataset-a12345o3
Linear Regression   0.96           0.65            0.65             0.30              0.31
SMO Regression      0.93           0.66            0.67             0.29              0.29
In terms of accuracy, SMO regression performed slightly better overall than linear regression. In terms of MAE 2, the two algorithms are very close: linear regression is marginally better until attribute "a4" is added, after which SMO regression is marginally better, but even then there is not much difference between the two algorithms. RMSE 2 shows a similar picture. Overall, we consider SMO regression the more suitable model for predicting grade points.
5.6.4 Discussion of Results 4: Detecting Poor-performing
Student
This section presents the performance metrics for experiment 4, in which we used Naïve Bayes, logistic regression, random forests, and the multi-layer Neural Network on the student survey dataset obtained from the teaching of COSC1430 in Fall 2015. The metrics we used to compare the four approaches are accuracy, precision, recall, and F-1 score; Table 5.26 summarizes the results of this experiment.
Table 5.26: Metric comparison for detecting poor-performing student using student
survey dataset of COSC1430 (FALL 2015)
LOGISTIC REGRESSION   Predicted TRUE   Predicted FALSE
Actual TRUE           43               25
Actual FALSE          21               14
Table 5.27: Comparison of confusion matrices across various algorithms for COSC1430
(FALL 2015) survey dataset
In this experiment, random forests outperform the other algorithms significantly in terms of accuracy, precision, and recall, although their F-1 score is similar to that of Naïve Bayes and Logistic Regression. However, our goal is not only to predict accurately but to precisely detect poor-performing students. The confusion matrix for logistic regression shows that it was able to detect 14 of the 35 students who actually failed. Although the accuracy in this case is low, on a positive note the algorithm is able to detect those 14 students accurately. The random forest algorithm should be able to learn stronger models to classify students when trained on more data. We believe better models can be built when we use larger datasets to make predictions; moreover, a good feature selection algorithm would be able to determine the worthy features in the dataset and use only those to make predictions. Therefore, logistic regression can be used as an initial model in the proposed warning system.
6. Conclusion and Future Work
The overall goal of this research is to design and implement a system which provides instructors and students with valuable feedback in order to reduce the dropout rate. The Real-time Student Performance Evaluation and Feedback system is a tool which collects both student-performance data and personal data, builds models which can predict students' future grades from their performance as early as the first or second week of the course, and identifies poor-performing students very early in the course. This tool will help instructors identify poor-performing students early enough to take necessary action, such as mentoring the student or conducting review sessions. Moreover, students can use this tool to view their predicted grades and act without waiting for the instructor to intervene. This research has demonstrated that, by using classification models learned from past data, we can predict student grades and identify poor-performing students efficiently and very early in the course. Moreover, RSPEF provides a framework for an academic environment that motivates students and supports instructors.
In this research, four classification models, Naïve Bayes, Logistic Regression, Random Forests, and Multi-Layered Neural Networks, were used, compared, and evaluated for predicting student performance. Real-world datasets were collected from a computer science course at the University of Houston to assess and compare the classification models. The performance of these machine learning algorithms was evaluated based on different performance measures such as
accuracy, precision, recall, MAE, RMSE, MAE 2, and RMSE 2. The results show that Logistic Regression performed better for predicting student grades, while Random Forest performed better for identifying poor-performing students.
For future work, we aim to carry out more experiments using larger datasets consisting of thousands of records of student-performance and survey data. The approach to identifying failing students achieved quite a low accuracy; a larger survey dataset and a feature selection algorithm should improve it. Also, approaches will be developed to deliver feedback and predictions to students using the web-application. Additionally, the achieved accuracy and MAE of the current models could be improved by exploring approaches such as genetic algorithms and Bayesian belief networks. Finally, we are planning to build and deploy this system during the first week of fall 2017.