


Comparative Analysis of Machine Learning Algorithms: Random Forest Algorithm, Naive Bayes Classifier and KNN - A Survey

Akshay Gole, Sankalp Singh, Prathmesh Kanherkar, P. R. Abhishek
Department of Computer Engineering,
St. Vincent Pallotti College of Engineering & Technology,
Nagpur, Maharashtra, India.

Prof. Pallavi Wankhede
Assistant Professor, Department of Computer Engineering,
St. Vincent Pallotti College of Engineering & Technology,
Nagpur, Maharashtra, India.

Abstract— Machine learning is a branch of computer science in which a computer predicts the next task to be performed by analysing the data provided to it. The computer can access data in the form of digitised training sets or through interaction with the environment. The primary goal of this paper is to provide a general comparison of the Random Forest algorithm, the Naive Bayes Classifier, and the KNN algorithm across all aspects. The Random Forest Classifier is made up of many decision trees: to promote an uncorrelated forest, the algorithm leverages randomization when forming each individual tree, and then uses the forest's collective predictive power to make accurate decisions. The Naive Bayes Classifier is a simple and effective classification method that aids in the development of fast machine learning models capable of making quick predictions. The K-Nearest Neighbour (KNN) algorithm can be used to handle both classification and regression problems. These algorithms are surveyed on the basis of aim, methodology, advantages and disadvantages.

1. INTRODUCTION

Machine learning, in short, is the science of getting computers to act automatically without explicit programming. We've been able to use machine learning for many things over the past decade, from self-driving cars to speech recognition and web search, as well as a vastly better understanding of our genomes. There are a lot of things you probably do every day that use machine learning without your even knowing it.[1] Often, machine learning is classified by how an algorithm improves its ability to make predictions. We can categorize learning approaches into five groups: supervised, unsupervised, semi-supervised, reinforcement, and ensemble learning. Based on what they want to predict, data scientists choose which algorithm to use.

2. LITERATURE REVIEW

This section covers the relevance of machine learning, as well as the Random Forest method, the Naive Bayes Classifier, and the KNN algorithm.

2.1 Machine learning:

As noted in the introduction, machine learning is the science of getting computers to act automatically without explicit programming, and its learning approaches fall into the supervised, unsupervised, semi-supervised, reinforcement, and ensemble categories.

Ensemble learning: To understand the Random Forest machine learning algorithm, we first need to understand ensemble learning. Ensemble learning refers to a method of making predictions based on the predictions of several different models.[2] Because they combine individual models, ensemble models are more flexible and less sensitive to the data.

Bagging and boosting are the most popular ensemble learning methods:

Bagging: A set of individual models is trained simultaneously, each on a random subset of the data.

Boosting: Individual models are trained sequentially; throughout the learning process, each new model learns from the mistakes of the previous one.[2]
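To make the distinction concrete, here is a minimal illustrative sketch (ours, not the paper's) of bagging and boosting using scikit-learn; the synthetic dataset, the choice of AdaBoost as the boosting example, and the parameter values are assumptions for demonstration only:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data, just for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: independent models (decision trees by default) trained in
# parallel on bootstrap samples of the data.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: models trained sequentially, each one focusing on the
# mistakes of its predecessor.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())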
2.2 Random Forest:

Random forests are ensemble models that use bagging as the ensemble method and decision trees as the individual models. By averaging the predictions of the trees, the model performs better than any single decision tree alone.[2] The prediction is formed differently for the two problem types:
● Regression: the prediction is the average of the predictions of all the decision trees.
● Classification: the prediction is the class label with the most votes across all decision trees.

A random forest is constructed by fitting a large number of decision trees to bootstrap samples of the training dataset. Unlike plain bagging, random forest also selects a random subset of the input features (columns or variables) at each split point during tree construction. Building a decision tree involves selecting a split point based on the values of the input variables, so restricting the features considered at each split to a random subset makes the trees of the ensemble more diverse. This results in less correlated predictions and errors from the individual trees, and these less correlated trees often perform better than bagged decision trees when their predictions are averaged.[2]

The number of random features to consider at each split point is probably the most important hyperparameter for tuning random forests. As a heuristic, for regression it should be set to one third of the number of input features [3]:

num_features_for_split = total_input_features / 3

and for classification to the square root of the number of input features [4]:

num_features_for_split = sqrt(total_input_features)

The depth of the decision trees is another important hyperparameter. Deeper trees overfit more, but they are also less correlated with one another, which may enhance ensemble performance; depths of 1 to 10 levels may be effective.[2] As a last step, you can choose how many decision trees to include in the ensemble; this number is often increased until no further improvement is observed.
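As a concrete companion to these tuning notes, the sketch below uses scikit-learn's RandomForestClassifier; the dataset is synthetic and the hyperparameter values simply follow the heuristics above, not any result from the paper:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,     # trees in the ensemble; raise until scores plateau
    max_features="sqrt",  # sqrt(total_input_features) heuristic for classification
    max_depth=10,         # tree depth; the text suggests 1 to 10 levels
    random_state=0,
)
print(cross_val_score(model, X, y, cv=5).mean())

# For regression, RandomForestRegressor(max_features=1/3) matches the
# one-third-of-features heuristic (a float means a fraction of features).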
Advantages of Random Forest [5]:
1. It reduces the overfitting of decision trees and improves accuracy.
2. It is flexible enough to be used for both classification and regression problems.
3. It can be used for both categorical and continuous values.
4. It automates the process of filling in missing values in the data.
5. It uses a rule-based approach, so it does not require data normalization.

Disadvantages of Random Forest [5]:
1. Because it builds numerous trees and combines their outputs, it requires a lot of computational power and resources.
2. It requires a great deal of training time, since it combines many decision trees.
3. Because it uses an ensemble of decision trees, it is hard to interpret and does not directly indicate the significance of each variable.

Applications of Random Forest:

Banking analysis carries a high risk of profit and loss and therefore requires a lot of effort; customer analysis is one of the most common studies in the banking industry. Random forests are well suited to detecting fraudulent transactions and to problems such as calculating the likelihood of a customer defaulting on a loan.[5]
1. Random forests can be used in the pharmaceutical industry to assess the potential of a particular medicine or to identify the chemical composition needed for a medicine.
2. Hospitals can use them to identify the illnesses a patient suffers from, a patient's cancer risk, and many other diseases whose treatment depends on early diagnosis and research.

2.3 Naïve Bayes

Naïve Bayes is a classification algorithm that applies Bayes' theorem of probability to predict the classes of an unknown dataset. In simpler terms, the Naïve Bayes algorithm treats each feature of the given dataset as independent, irrespective of its relation to any of the other features.
Bayes' theorem provides a way to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) × P(c) / P(x)

where
• P(c|x) is the posterior probability of the class given the predictor,
• P(c) is the prior probability of the class,
• P(x|c) is the likelihood of the predictor given the class,
• P(x) is the prior probability of the predictor.

There are three types of Naïve Bayes classifiers:

Multinomial Naïve Bayes: generally used when the task at hand is document classification, for example classifying documents into types such as sports magazines or political magazines.

Bernoulli Naïve Bayes: similar to the Multinomial Naïve Bayes classifier, with the difference that the Bernoulli classifier's predictors are boolean variables, i.e. they take values only of the form yes or no.

Gaussian Naïve Bayes: the predictors take continuous values instead of discrete values, and the conditional probability becomes the Gaussian density

P(x_i|y) = (1 / sqrt(2π σ_y²)) exp(−(x_i − μ_y)² / (2σ_y²))

where μ_y and σ_y² are the mean and variance of feature x_i within class y.
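A brief illustrative sketch (ours; the Iris dataset is only an example) of Gaussian Naïve Bayes with scikit-learn, whose GaussianNB estimator fits the per-class mean μ_y and variance σ_y² used in the formula above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # MultinomialNB / BernoulliNB also exist

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("accuracy:", nb.score(X_test, y_test))
# Posterior probabilities P(c|x) for the first test sample.
print("P(c|x):", nb.predict_proba(X_test[:1]))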
Advantages
• It is much faster and easier to predict classes for a dataset.
• When the independence assumptions hold, the Naïve Bayes classifier performs better than most other existing models.
• It performs better with categorical variables than with numerical variables.

Disadvantages
• The independence assumptions are a big factor in its predictions; if they do not hold, Naïve Bayes classifiers fail to give the correct output.
• If a label is observed in the test data but not in the training data, the model assigns it a probability of zero and is unable to make predictions for it.

Applications
• Document classification
• Email filtering and spam filtering
• Construction of recommendation systems used in data mining
• Real-time predictions

2.4 K-Nearest Neighbours (KNN Algorithm)

The k-nearest neighbours algorithm is a non-parametric supervised learning method first developed in 1951 by Joseph Hodges and Evelyn Fix, and later expanded upon by Thomas Cover.

The KNN algorithm is used to solve classification as well as regression problems. It works by finding the distances between a query and all the examples in the data and selecting the specified number of examples (K) closest to the query; it then votes for the most frequent label (in the case of classification) or averages the labels (in the case of regression).

For both classification and regression, choosing the right K for the data is done by trying several different values of K and picking the one that works best for our needs, as sketched below.
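This search over K can be written in a few lines; the sketch below (ours, with the Iris dataset as an arbitrary stand-in) tries several values of K and keeps the one with the best cross-validated score:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in range(1, 16):  # candidate values of K to try
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print("best K:", best_k, "score:", best_score)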

Advantages

KNN is widely used because of the advantages it offers. It is very simple to implement and understand. The KNN algorithm has no explicit training step; all the work happens during prediction itself. As new data is added to the dataset, the prediction adjusts without a new model having to be retrained. Also, since there is only a single hyperparameter, the value of K, hyperparameter tuning is quite easy.

Disadvantages

Like all other algorithms, the KNN algorithm isn't perfect. When there is a large amount of data to process, the prediction complexity becomes very high, and the same holds for high-dimensional data. The KNN algorithm is also sensitive when features have different ranges, and noisy data can result in over-fitting or under-fitting.

Application

The KNN algorithm has applications in various fields. Some of its common applications are:
1. Facial recognition systems.
2. Recommendation systems.
3. Predicting various factors in the agricultural sector.

The KNN algorithm is used on platforms such as Netflix and Amazon, where the user or customer is given recommendations for movies, series, products, etc., based on their previous searches or watch history.

3. COMPARATIVE ANALYSIS

This section compares the above-mentioned algorithms with respect to a few important parameters and, at the end, evaluates their overall accuracy. The analysis is based on the dataset used in [6].

Table 1. An analysis of three widely used supervised classification algorithms.[6]

Parameter                                  | Random Forest | Naive Bayes    | k-NN
Speed of learning                          | Average       | Best           | Best
Classification speed                       | Best          | Best           | Worst
Performance with missing values            | Average       | Best           | Worst
Performance with irrelevant features       | Average       | Good           | Good
Noise tolerance                            | Good          | Average        | Average
Performance on discrete/binary attributes  | Good          | Average        | Average
Clarity of classification prediction       | Best          | Best           | Average
Parameter handling for the model           | Average       | Best           | Average
Overall accuracy                           | Best (84.13%) | Worst (80.14%) | Good (83.65%)

The table above compares these widely used and popular supervised classification algorithms. Accuracy is determined by comparing confusion matrices: as a measure of performance, accuracy is the ratio of correct predictions to all observations, and it is the most intuitive measure. The accuracies are obtained by applying the algorithms to the dataset.[6]
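As a small illustration of how such accuracy figures are read off a confusion matrix (our sketch, with made-up labels, not the data behind Table 1):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]   # hypothetical true labels
y_pred = [0, 1, 1, 1, 2, 2, 0]   # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)
# Accuracy = correct predictions (matrix diagonal) / all observations.
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())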
4. CONCLUSION

In this survey we examined three basic algorithms in depth: the Random Forest method, the Naive Bayes Classifier, and the KNN algorithm, and compared them on a number of parameters. This paper will aid researchers in determining which of these three algorithms is the best to use in their future research.

REFERENCES

[1] https://www.geeksforgeeks.org/machine-learning/
[2] https://machinelearningmastery.com/random-forest-ensemble-in-python/
[3] Page 199, Applied Predictive Modeling, 2013.
[4] Page 387, Applied Predictive Modeling, 2013.
[5] https://www.mygreatlearning.com/blog/random-forest-algorithm/#AdvantagesandDisadvantagesofRandomForest
[6] Sen, Pratap, Hajra, Mahimarnab & Ghosh, Mitadru (2020). Supervised Classification Algorithms in Machine Learning: A Survey and Review. 10.1007/978-981-13-7403-6_11.