CHAPTER 1
INTRODUCTION
Prediction of student performance has become an urgent need in most
educational institutes. It is essential for helping at-risk students and
preventing them from being held back, for providing them with learning
resources and a better educational experience, and for improving the
university's ranking and reputation.
This research also explores ways of identifying the key indicators
present in the dataset that are used in creating the most accurate
prediction model, using different types of visualization, clustering, and
regression algorithms. The best indicators were chosen to feed the
various machine learning algorithms, which were then evaluated to find
the most accurate model. Among the algorithms used, the results show
the ability of each algorithm to detect key parameters in small datasets.
Student performance prediction poses a crucial challenge within the
realm of education. Various factors influence students' academic
outcomes, consequently impacting the reputation of universities.
Sustaining high educational standards in institutions becomes intricate
when dealing with underperforming students. Extensive research offers
diverse avenues for enhancing the educational system to cater to
students' individual requirements. The complexity of educational data
mining presents novel research opportunities. The techniques employed in
education that fall under educational data mining involve the
automatic extraction of valuable insights from raw data.
LITERATURE
2.1 ALGORITHMS
In machine learning (ML), algorithms are categorized based on
the type of task they are designed to solve. Here are the
main types of ML algorithms:
1. Supervised Learning Algorithms:
o Classification: These algorithms are used when the
target variable is categorical.
They predict the class or category of new instances
based on past observations.
o Regression: Regression algorithms are used when the
target variable is continuous. They predict a numerical
value based on input features.
2. Unsupervised Learning Algorithms:
o Clustering: Clustering algorithms group similar
instances together based on their
features. They are used for tasks such as customer
segmentation, anomaly detection, and pattern
recognition.
o Dimensionality Reduction: These algorithms reduce
the number of features in the dataset while preserving
the most important information. They are
used for visualization, feature
extraction, and data compression.
3. Semi-Supervised Learning Algorithms:
o Semi-supervised learning algorithms work with
partially labeled data, where only
a small portion of the data has
labels. They combine aspects of supervised and
unsupervised learning to make predictions.
4. Reinforcement Learning Algorithms:
o Reinforcement learning algorithms learn through trial
and error by interacting with an environment. They
aim to maximize cumulative reward by taking actions
that lead to desirable outcomes.
5. Deep Learning Algorithms:
o Deep learning algorithms are neural networks with
multiple layers (deep architectures).
They are capable of learning complex patterns and
representations directly from raw data and have
achieved state-of-the-art performance in various
domains such as image recognition, natural language
processing, and speech recognition.
6. Ensemble Learning Algorithms:
o Ensemble learning algorithms combine the predictions
of multiple base learners to improve
performance. Examples include bagging (e.g.,
Random Forest), boosting (e.g., Gradient Boosting
Machines), and stacking.
7. Instance-Based Learning Algorithms:
o Instance-based learning algorithms make predictions
by comparing new instances to instances seen during
training. K-Nearest Neighbors (KNN) is a popular
example of an instance-based learning algorithm.
8. Anomaly Detection Algorithms:
o Anomaly detection algorithms identify outliers or
anomalies in data. They are used to detect unusual
patterns that do not conform to expected behavior.
These are the main types of ML algorithms, each suited to
different types of data and tasks.
Choosing the right algorithm depends on factors such as the
nature of the problem, the type of data available, the desired
output, and computational resources.
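To make the distinction between the supervised and unsupervised families concrete, the following is a minimal sketch using scikit-learn on a small synthetic dataset (not the project's data); the dataset, the choice of LogisticRegression and KMeans, and the parameter values are illustrative assumptions only.

# Minimal sketch: a supervised classifier and an unsupervised clustering
# algorithm from scikit-learn, shown on synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Synthetic data: 200 samples, 4 features, binary target
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Supervised learning: the model is trained on labelled examples
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Unsupervised learning: the model groups instances without using labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Cluster assignments (first 10):", kmeans.labels_[:10])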
2.2 TYPES OF REGRESSION ALGORITHMS
Regression algorithms are used in supervised learning tasks
where the goal is to predict a continuous target variable based on
input features. There are several types of regression algorithms,
each with its own characteristics and assumptions. Here are
some common types of regression
algorithms:
1. Linear Regression:
o Linear regression is one of the simplest and most
widely used regression algorithms.
o It assumes a linear relationship between the input
features and the target variable.
o Linear regression models the relationship using a
linear equation of the form y = β0 + β1x1 + β2x2 + … + βnxn + ε,
where y is the target variable, x1, x2, …, xn are the input
features, β0, β1, …, βn are the coefficients, and ε is the
error term.
2. Polynomial Regression:
o Polynomial regression extends linear regression by
allowing for non-linear relationships between the input
features and the target variable.
o It models the relationship using a polynomial equation
of a specified degree, such as quadratic (degree 2) or
cubic (degree 3).
3. Ridge Regression (L2 Regularization):
o Ridge regression is a linear regression technique that
incorporates L2 regularization to prevent overfitting.
o It adds a penalty term to the loss function that
penalizes large coefficient values, encouraging simpler
models.
o Ridge regression is useful when multicollinearity is
present among the input features.
4. Lasso Regression (L1 Regularization):
o Lasso regression is similar to ridge regression but uses
L1 regularization instead of L2 regularization.
o It adds a penalty term to the loss function that
encourages sparsity by promoting some coefficients to
be exactly zero.
o Lasso regression is useful for feature selection and
building more interpretable models.
5. ElasticNet Regression:
o ElasticNet regression combines both L1 and L2
regularization penalties.
o It aims to leverage the benefits of both ridge and lasso
regression, allowing for variable selection and
regularization.
6. Support Vector Regression (SVR):
o Support vector regression is a regression variant of
support vector machines (SVMs), a powerful
algorithm for classification tasks.
o SVR finds the hyperplane that best fits the data while
maximizing the margin between the data points and
the hyperplane.
o It can handle non-linear relationships through the use
of kernel functions.
7. Decision Tree Regression:
o Decision tree regression models the relationship
between input features and the target variable using a
tree structure.
o It recursively splits the feature space into regions, with
each region associated with a predicted value.
o Decision tree regression is flexible and can capture
non-linear relationships but is prone to overfitting.
8. Random Forest Regression:
o Random forest regression is an ensemble learning
technique that combines multiple decision trees to
make predictions.
o It builds a forest of trees by training
each tree on a bootstrapped sample of the training data
and using random feature subsets for splitting.
o Random forest regression is robust, accurate, and less
prone to overfitting compared to individual decision
trees.
9. Gradient Boosting Regression:
o Gradient boosting regression is another ensemble
learning technique that sequentially adds decision
trees to the ensemble.
o It optimizes a loss function by fitting each tree to the
negative gradient of the loss function with respect to
the current ensemble prediction.
o Gradient boosting regression is known for its high
predictive accuracy and ability to handle complex
relationships in the data.
These are just a few examples of regression algorithms
commonly used in machine learning. The choice of algorithm
depends on factors such as the nature of the data, the complexity
of the problem, and the specific requirements of the task. It's
often beneficial to experiment with multiple algorithms and
compare their performance empirically before selecting the best
one for a particular application.
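In that spirit, the following is a minimal sketch of such an empirical comparison using scikit-learn; the synthetic dataset, the subset of algorithms, and the hyperparameter values are illustrative assumptions rather than the project's actual experiment.

# Minimal sketch: fitting several of the regression algorithms listed above on
# the same synthetic data and comparing their test-set R^2 scores.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear regression": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=0.1),
    "Decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the R^2 coefficient of determination on the held-out data
    print(f"{name:18s} R^2 = {model.score(X_test, y_test):.3f}")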
2.3 GBM ALGORITHMS
Choosing gradient boosting over other algorithms depends on
various factors, including the nature of the dataset, the
complexity of the problem, computational resources, and
specific requirements of the task. Here are several reasons why
gradient boosting may be preferred over other algorithms in
certain scenarios:
1. High Predictive Accuracy:
o Gradient boosting often produces highly accurate
predictions compared to other algorithms, especially
when the dataset is large, complex, or noisy.
o Its ability to sequentially add weak learners to the
ensemble and optimize the loss function leads to
improved predictive performance.
2. Robustness to Overfitting:
o Gradient boosting incorporates regularization
techniques, such as tree pruning, shrinkage, and
stochastic gradient boosting, to mitigate the risk of
overfitting.
o By sequentially fitting trees to the residuals of the
previous models, gradient boosting can capture
complex patterns in the data while maintaining
robustness against overfitting.
3. Handling Complex Relationships:
o Gradient boosting is capable of capturing complex
nonlinear relationships between
features and the target variable.
o Its flexibility in modeling interactions and
dependencies between variables makes it suitable for
tasks where the underlying relationships are intricate
or difficult to specify.
4. Feature Importance Analysis:
o Gradient boosting provides insights into feature
importance, allowing users to identify
the most influential predictors in the dataset.
o Feature importance scores can help interpret the
model, understand the drivers of predictions, and
guide feature selection or engineering efforts.
5. Flexibility and Customization:
o Gradient boosting frameworks like XGBoost,
LightGBM, and CatBoost offer a wide range of
hyperparameters that can be tuned to optimize model
performance.
o Users can customize various aspects of the algorithm,
including tree structure, learning rate, regularization
parameters, and handling of categorical features.
6. Efficiency and Scalability:
o While gradient boosting can be computationally
intensive, recent advancements in algorithmic
optimizations and parallel computing techniques have
improved its efficiency and scalability.
o Implementations like LightGBM and CatBoost are
designed for distributed and efficient training, making
them suitable for large-scale datasets and high-
dimensional feature spaces.
7. Wide Adoption and Community Support:
o Gradient boosting algorithms, particularly XGBoost
and LightGBM, have gained widespread adoption in
both academia and industry.
o They have active developer communities, extensive
documentation, and a wealth of resources, including
tutorials, case studies, and pre-trained models.
Despite these advantages, it's essential to note that gradient
boosting may not always be the best choice for every problem.
Depending on the specific requirements and characteristics of
the dataset, other algorithms like random forests, support vector
machines, or neural networks may offer competitive
performance or better suitability for certain tasks. Therefore, it's
recommended to experiment with multiple algorithms and
compare their performance empirically before making a final
decision.
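To illustrate two of the points above, shrinkage and stochastic sampling as regularization, and feature importance analysis, here is a minimal sketch using scikit-learn's GradientBoostingRegressor on synthetic data; the data, parameter values, and feature names are illustrative assumptions, not the project's configuration.

# Minimal sketch: a regularized gradient boosting model and its feature
# importances, shown on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=6, n_informative=3, random_state=1)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

gbm = GradientBoostingRegressor(
    n_estimators=300,    # number of boosting stages (trees added sequentially)
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=3,         # depth of the individual trees (weak learners)
    subsample=0.8,       # stochastic gradient boosting: sample rows per tree
    random_state=1,
).fit(X, y)

# Rank features by their importance in the fitted ensemble
for idx in np.argsort(gbm.feature_importances_)[::-1]:
    print(f"{feature_names[idx]}: {gbm.feature_importances_[idx]:.3f}")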
2.4 LGBM
LightGBM (Light Gradient Boosting Machine) is a popular machine
learning algorithm that falls under the category of gradient boosting
frameworks. It is specifically designed for speed and
efficiency, making it a powerful choice for handling large datasets and
achieving high performance in various tasks such as classification,
regression, and ranking.
Here are some key features and characteristics of the LightGBM
algorithm:
Gradient Boosting Algorithm: LightGBM implements the gradient
boosting framework, which is an ensemble learning
technique that builds strong predictive models by combining multiple
weak learners (usually decision trees) sequentially. It
uses gradient descent optimization to minimize the loss function
at each iteration.
Leaf-Wise Tree Growth: LightGBM
grows trees in a leaf-wise manner rather than level-wise. This strategy
selects the leaf with the maximum delta loss to grow the tree, resulting
in faster convergence and potentially better accuracy.
Gradient-Based Learning: LightGBM uses gradient-based techniques
to optimize the loss function. It computes gradients of the loss with
respect to predictions and employs these gradients to guide the tree
construction process, making it more efficient in finding informative
splits.
Histogram-Based Learning: One of LightGBM's unique features is its
histogram-based approach for computing gradients. It bins the features
into discrete bins and constructs histograms for efficient computation of
gradients and splits. This speeds up the training process and reduces
memory usage.
Categorical Feature Handling: LightGBM has built-in support for
categorical features. It converts categorical features into integers and
then treats them as ordinal variables. This allows the algorithm to
efficiently handle categorical data without the need for one-hot
encoding.
Regularization: LightGBM provides several options for regularization
to prevent overfitting. These include parameters like max_depth,
min_child_samples, and lambda (L2 regularization).
Lightweight and Fast: As the name suggests, LightGBM is designed to
be lightweight and fast. Its efficient tree construction and histogram-
based approach contribute to its speed advantage over other gradient
boosting implementations.
Parallel and GPU Learning: LightGBM supports parallel and GPU
learning, further enhancing its training speed, especially when dealing
with large datasets.
Tuning and Hyperparameters: Like any machine learning algorithm,
LightGBM has several hyperparameters that can be tuned to achieve
optimal performance. Common hyperparameters include learning rate,
number of trees (boosting rounds), tree depth, and more.
Python Interface: LightGBM provides a Python interface along with
APIs for other languages like R, Java, and C++. This makes it accessible
and usable within a wide range of programming environments.
LightGBM has gained popularity in machine learning competitions and
real-world applications due to its speed, efficiency, and strong predictive
performance. However, as with any algorithm, it's important to
experiment with different hyperparameters and settings to find the
configuration that works best for your specific task and
dataset.
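The following is a minimal sketch of configuring and training an LGBMRegressor, assuming the lightgbm package is installed; the synthetic data and the hyperparameter values (n_estimators, learning_rate, num_leaves, min_child_samples, reg_lambda) are illustrative, not tuned values from this project.

# Minimal sketch: configuring and training a LightGBM regressor.
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = LGBMRegressor(
    n_estimators=500,      # number of boosting rounds (trees)
    learning_rate=0.05,    # shrinkage per tree
    num_leaves=31,         # leaf-wise growth: maximum leaves per tree
    min_child_samples=20,  # minimum samples per leaf (regularization)
    reg_lambda=1.0,        # L2 regularization
)
model.fit(X_train, y_train)
print("Test R^2:", model.score(X_test, y_test))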
CHAPTER 2
SOFTWARE ENVIRONMENT
2.1 PYTHON: Python is a high-level, interactive, object-oriented, and
interpreted scripting language. Python has been designed to be very
readable. It has fewer syntactical constructions than other languages and
generally employs English keywords rather than punctuation. The main
justification for using Python to carry out the data gathering and
processing is given below.
Increased programmer productivity: The language boosts the output
of the programmer through its sizable standard library and clean
object-oriented design.
Integration features: Python supports CORBA and COM
components to enable Enterprise Application Integration, which
enhances the development of online services. It also has strong control
capabilities, since Python can call into C, C++, or Java directly. Python
can likewise process XML and other markup languages, and it runs on
all modern operating systems using the same byte code.
Extensive Support Libraries: Python ships with a vast standard library.
Examples include the implementation of internet services, OS
commands and interfaces, and many other operations backed by
high-quality libraries. A major portion of the frequently
used programming functions is already available, which limits the
amount of code that has to be written in Python. In this project, the
classifiers are called through built-in functions of the Python sklearn
(scikit-learn) package for machine learning.
GUI Programming: Python allows GUI applications to be created and
ported to a variety of system calls, libraries, and
windowing platforms, including Windows MFC, Macintosh, and the X
Window System on Unix.
Scalable: Python offers larger projects better
structure and support than shell scripting.
2.1.2 NumPy:
NumPy is a general-purpose library for managing arrays. It provides an
extremely fast multidimensional array object along with tools for
working with these arrays. Put simply, it is the foundational Python
module for scientific computing, and the software is freely available.
It has a variety of features, of which the
following are the most important ones:
• Integration tools for C/C++ and Fortran; practical Fourier transform,
random number, and linear algebra capabilities;
sophisticated (broadcasting) functions; a robust N-dimensional array
object.
NumPy is a powerful multi-dimensional data container with a wide range
of non-scientific uses. Because arbitrary data types can be defined,
NumPy can connect quickly and efficiently with a wide variety of databases.
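The following is a minimal sketch of the kind of NumPy array operations used in this project; the example values are made up for illustration.

# Minimal sketch: basic NumPy array operations (N-dimensional array,
# axis-wise means, and broadcasting).
import numpy as np

scores = np.array([[72, 81, 65], [90, 78, 88], [60, 95, 70]])  # 3 students x 3 tests
print(scores.shape)         # (3, 3) -- an N-dimensional array object
print(scores.mean(axis=1))  # average score per student
print(scores.mean(axis=0))  # average score per test
print(scores + 5)           # broadcasting: add 5 marks to every score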
2.1.3 Seaborn:
Python's Seaborn visualization module is excellent for plotting
statistical visualizations. It offers attractive default styles and color
schemes that enhance the appeal of statistical charts. It is built directly
on top of the matplotlib library. With Seaborn, visualization becomes a
central part of data exploration and understanding. It
offers dataset-oriented APIs, allowing us to move between different
visual representations of the same variables for a better understanding
of the dataset.
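A minimal sketch of such dataset-oriented plotting follows; the file name "StudentsPerformance.csv" and the column names ("math score", "gender") are assumptions based on the dataset described later and may need adjusting to the actual CSV file.

# Minimal sketch: Seaborn plots for exploring the student data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("StudentsPerformance.csv")       # assumed file name
sns.histplot(data=df, x="math score", kde=True)   # distribution of math scores
plt.show()
sns.boxplot(data=df, x="gender", y="math score")  # scores split by gender
plt.show()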
2.1.4 Matplotlib:
Matplotlib is an excellent visualization package in Python for 2D array
displays. It is a cross-platform data visualization library built on
NumPy arrays and designed to work with the broader SciPy stack.
It was first introduced by John Hunter in 2002. One
of visualization's main benefits is that it gives us visual
access to enormous amounts of data in easily understandable forms.
Matplotlib offers a wide variety of graphs, including line, bar, scatter,
and histogram.
Plotting Functions: Matplotlib offers a wide range of functions for
creating different types of plots, including line plots,
scatter plots, bar plots, histogram plots, pie charts, 3D plots, and more.
These functions provide a simple and intuitive interface for visualizing
data in various formats.
Integration with NumPy: Matplotlib seamlessly integrates with NumPy
arrays, making it easy to visualize data stored in NumPy
arrays. This integration enables users to plot data directly without the
need for extensive data format conversions.
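A minimal sketch of plotting NumPy data directly with Matplotlib follows; the study-hours example and its values are invented purely for illustration.

# Minimal sketch: a line plot built directly from NumPy arrays.
import numpy as np
import matplotlib.pyplot as plt

study_hours = np.arange(1, 11)          # 1..10 hours (illustrative)
predicted_score = 40 + 5 * study_hours  # illustrative values only

plt.plot(study_hours, predicted_score, marker="o")  # line plot from NumPy arrays
plt.xlabel("Hours studied")
plt.ylabel("Predicted score")
plt.title("Example line plot")
plt.show()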
2.1.5 Sklearn:
The most effective and dependable Python machine learning library is
Sklearn (scikit-learn). Through a consistent Python interface, it
offers a variety of efficient tools for statistical modeling and machine
learning, including classification, regression, clustering, and
dimensionality reduction. This library is built on NumPy, SciPy,
and Matplotlib and was written primarily in Python.
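The following is a minimal sketch of the scikit-learn utilities typically used around a regression model, splitting the data and scoring predictions; the synthetic data and the LinearRegression model are stand-ins for illustration, not the project's actual pipeline.

# Minimal sketch: scikit-learn's train/test split and regression metrics.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=300, n_features=5, noise=8.0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))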
CHAPTER 3
SYSTEM ANALYSIS
Analysis is about finding the optimal solution to a problem. A system must
be developed after careful analysis of a business or operation to determine
its goals and objectives so that they can be readily accomplished.
The process of learning about current issues, defining objects and
requirements, and assessing potential solutions is known as system
analysis. It is a way of thinking about the organization and the issues it
faces, as well as a set of techniques that aid in problem solving. In
system analysis, which provides the aim for design and development,
feasibility studies are crucial.
Requirements analysis is an important step of systems engineering and
software engineering. It concentrates on tasks such as analyzing,
validating, documenting, and managing the software or system
requirements while taking into account the potentially conflicting
requirements of various stakeholders.
A systems or software project's success or failure depends on the
results of the requirements analysis. The requirements
ought to be well-documented, usable, quantifiable, testable, traceable,
tied to recognized business opportunities or needs, and sufficiently
defined for system design.
3.1 Proposed System
• In this system, we will manage the various features that are to
be considered for the prediction of performance. By reading
the information about the various students, we will analyze
their performance using the features available in the dataset.
• We will train the machine learning model using the student
performance dataset obtained from Kaggle.
• We will use the Light Gradient Boosting Machine (LGBM)
Regressor algorithm to predict the students' performance.
3.2 Software Requirements
The Software Requirements specify the logical characteristics of each
interface and software component of the system.
The following are the required software specifications
• Operating system : Windows 8, 10
• Languages : Python
• Back end : Machine Learning
• IDE : Jupyter
3.3 Hardware Requirements:
The Hardware interfaces specify the logical characteristics of each
interface between the software program and the hardware components
of the system.
The following are the required hardware specifications
• Processor : Intel Dual Core @ CPU 2.90 GHz.
• Hard disk : 500GB and Above
• RAM : 4GB and Above
CHAPTER 4
IMPLEMENTATION
4.1 Introduction
The implementation of Student Performance Prediction is done using one
algorithm named LGBM (Light Gradient Boosting Machine). The IDE that
we are using for this implementation is Jupyter Notebook.
4.2 Jupyter Notebook
Step 1: To install Jupyter Notebook, first open the command prompt and
check the version of Python using the "python --version" command. It
should show something like Python 3.7.8; otherwise, we have to install
Python first.
Step 2: After that, check the pip version as well using "pip --version".
Step 3: Now install JupyterLab using the "pip install jupyterlab"
command in the command prompt.
Step 4: After the successful installation of JupyterLab, install Jupyter
Notebook using the command "pip install jupyter notebook". If any
upgrade is required, then update.
Step 5: Create one folder named test jupyter and open that folder path in
the command prompt.
Step 6: Now the Jupyter home page will appear as shown in the figure.
Using this we can create Python files for our model.
4.3 Data set
This dataset contains information on high school students and their
performance in mathematics, including their grades and
demographic information. The data was acquired from three high
schools in the United States of America.
Columns:
• Gender: The gender of the student (male/female)
• Race/ethnicity: Racial or ethnic background of the students (Asian,
African-American, Hispanic, etc.)
• Parental level of education: The highest level of education attained
by the student's parent(s) or guardian(s)
• Lunch: Whether the student is receiving
reduced-price or free lunch (yes/no)
• Test preparation course: If the student has completed a test
preparation course (yes/no)
• Math score: The score of the students in the standardized mathematics
test
• Reading score: The score of the students in the standardized reading
test
• Writing score: The score of the students in the
standardized writing test
This dataset could be used for various research questions related to
education, such as examining the impact of parental education or test
preparation courses on student performance. It could also be used to
develop machine learning models to predict student
performance based on demographic and other factors.
4.4 Data Visualization
Importing libraries and reading the dataset path: The following required
libraries are imported to perform the operations and analysis in this
Python project by executing the following commands in the Jupyter
notebook, and the path of the dataset, which is stored in a CSV file, is set.
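A minimal sketch of these commands is shown below; the file name "StudentsPerformance.csv" is an assumption based on the Kaggle dataset described above and should be adjusted to the actual path.

# Minimal sketch: importing the project libraries and reading the dataset.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("StudentsPerformance.csv")  # set the dataset path here
print(df.shape)    # number of rows and columns
print(df.head())   # preview the first five records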
4.5 Data Preprocessing
Data preprocessing is the cleaning of the data, which is the next step
involved in the execution of our project. This technique is applied at an
early stage of machine learning. Through this technique we can
improve the quality of the data. Data preprocessing transforms the
raw data into an easier and more efficient format.
Handling missing values:
To handle the missing values in our data we have two methods: 1.
removing the records with null values from the dataset, or 2. filling the
null values with the mean value of the column. Here we will fill the null
values with their mean, because deleting the records would make the
dataset smaller. The missing values are filled with their mean value by
executing the following command.
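A minimal sketch of this step follows, continuing from the DataFrame df loaded above; the score column names are assumptions based on the dataset description in Section 4.3.

# Minimal sketch: filling missing numeric values with the column mean.
for col in ["math score", "reading score", "writing score"]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum())  # confirm there are no remaining null values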
CHAPTER 5
OUTPUT
5.1 Applying the LGBM Regressor Algorithm to get the result:
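The following is a minimal sketch of training the LGBM Regressor on the student data; the file name, the use of "math score" as the target, and the one-hot encoding of the categorical columns are assumptions for illustration, not the project's exact code.

# Minimal sketch: training an LGBM regressor to predict the math score.
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("StudentsPerformance.csv")

# One-hot encode the categorical columns (gender, race/ethnicity, etc.)
X = pd.get_dummies(df.drop(columns=["math score"]))
X.columns = [c.replace(" ", "_") for c in X.columns]  # simpler feature names
y = df["math score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LGBMRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred[:5])  # predicted math scores for the first five test students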
5.2 Performance Evaluation of the LGBM Model:
For regression tasks like student performance prediction, accuracy is
typically measured using metrics that evaluate the model's ability to
predict continuous numerical values. The most commonly used metrics
for regression tasks are:
1. Mean Squared Error (MSE): This metric calculates the average
of the squared differences between the predicted
values and the actual values. Lower
MSE values indicate better model performance.
MSE = (1/n) Σ (yi − ŷi)²
where n is the number of samples, yi is
the actual value, and ŷi is the predicted value for the i-th
sample.
2. R-squared (R²): R-squared measures the proportion of the
variance in the target variable that is explained by the model. It
ranges from 0 to 1, with higher values indicating better
model fit.
R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²
where ȳ is the mean of the actual values.
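A minimal sketch of computing these two metrics for the fitted model follows, reusing y_test and y_pred from the LGBM training snippet above.

# Minimal sketch: evaluating the LGBM predictions with MSE and R^2.
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R^2: {r2:.3f}")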
CHAPTER 6
CONCLUSION
In this project we have used the Light Gradient Boosting
Machine (LGBM) Regressor algorithm to obtain accurate
results. The LGBM algorithm gives high accuracy and
can handle large amounts of data. Using this algorithm, we
have predicted the marks of the students in the sample data,
i.e., how many marks each student would get.