Comparison of machine learning algorithms for
animal classification and clustering using the Orange
data mining tool
M A Aditya
School of Science and Engineering
SRM University
Chennai, India
am0307@[Link]

M Jeevan Pranav
School of Science and Engineering
SRM University
Chennai, India
jm4192@[Link]
Abstract—This study explores the use of machine learning algorithms for the classification and clustering of an animal dataset using the Orange data mining tool. The results indicate significant improvements in accuracy and efficiency compared to traditional methods. Furthermore, the findings suggest that integrating these algorithms can enhance predictive modelling in wildlife conservation efforts. Additionally, the analysis highlights the potential for real-time data processing, enabling quicker responses to environmental changes and better-informed decision-making in conservation strategies.

INTRODUCTION

This paper explores the application of machine learning to classification and clustering tasks using the Orange data mining tool, highlighting its effectiveness in simplifying data analysis through visual programming and evaluating model performance on a selected dataset. The study emphasizes the user-friendly interface of Orange, which allows researchers and practitioners to easily manipulate data and visualize results, thereby enhancing their understanding of complex datasets. Furthermore, the integration of various machine learning algorithms within Orange enables users to experiment with different approaches, facilitating deeper insight into the strengths and weaknesses of each ML algorithm.

IMAGE CLASSIFICATION AND CLUSTERING

This paper classifies a dataset of animal images with different ML algorithms, using image classification and image clustering. The algorithms are compared to determine which is best suited, and fastest, for image classification on the given dataset.

I. HOW IMAGE CLASSIFICATION WORKS

Image classification in the Orange data mining tool is a powerful feature that allows users to categorize images based on their content, enabling efficient analysis and decision-making. Classification is performed by various machine learning algorithms that learn from labelled datasets, making it possible to identify patterns and features within the images.

II. HOW IMAGE CLUSTERING WORKS

Image clustering in the Orange data mining tool involves grouping similar images together based on their visual features. The process typically begins with feature extraction, where algorithms analyze each image to identify key characteristics such as color distribution, texture patterns, and shape information. These extracted features are then represented as numerical vectors in a high-dimensional space. Orange employs various clustering algorithms, such as k-means, hierarchical clustering, or DBSCAN, to group these feature vectors into clusters.

LITERATURE REVIEW

A. Classification of Supervised Learning Algorithms:

The supervised machine learning algorithms that deal mainly with classification include the following: Linear Classifiers, Logistic Regression, Naïve Bayes Classifier, Perceptron, Support Vector Machine; Quadratic Classifiers, K-Means Clustering, Boosting, Decision Tree, Random Forest (RF); Neural Networks, Bayesian Networks, and so on. [1]

1) Logistic regression: This is a classification function that builds a single multinomial logistic regression model with a single estimator. Logistic regression states where the boundary between the classes exists, and also states that the class probabilities depend on the distance from the boundary in a specific way: they move towards the extremes (0 and 1) more rapidly when the dataset is larger. These statements about probabilities make logistic regression more than just a classifier: it makes stronger, more detailed predictions, and can be fit in a different way, but those strong predictions could be wrong. Logistic regression is an approach to prediction, like Ordinary Least Squares (OLS) regression; however, with logistic regression, prediction results in a dichotomous outcome. Logistic regression is one of the most commonly used tools in applied statistics and discrete data analysis; in effect, it performs linear interpolation on the log-odds scale. [2]
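To make the boundary-distance behaviour concrete, here is a minimal sketch of ours (not from the paper) using scikit-learn, which Orange 3's learners largely wrap; the one-dimensional two-class data is synthetic:

```python
# Minimal sketch (ours): logistic regression yields class probabilities
# that sharpen toward 0 and 1 with distance from the decision boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two synthetic 1-D classes separated around x = 0.
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X, y)

# Probabilities move toward the extremes as points move away from the boundary.
for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    p = clf.predict_proba([[x]])[0, 1]
    print(f"x = {x:+.1f}  ->  P(class 1) = {p:.3f}")
```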
2) Support Vector Machines (SVMs): These are among the most recent supervised machine learning techniques. Support Vector Machine (SVM) models are closely related to classical multilayer perceptron neural networks. SVMs revolve around the notion of a margin on either side of a hyperplane that separates two data classes. Maximizing the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on either side of it, has been proven to reduce an upper bound on the expected generalisation error. [3]
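A small illustrative sketch of the margin idea, again ours and using scikit-learn's SVC on made-up two-blob data; with a linear kernel the margin width is 2/||w||:

```python
# Sketch (ours): a linear SVM and its margin. Maximizing the margin
# (2 / ||w||) is what bounds the expected generalisation error.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

w = svm.coef_[0]
margin = 2.0 / np.linalg.norm(w)   # distance between the two margin planes
print(f"margin width: {margin:.3f}")
print(f"support vectors: {len(svm.support_vectors_)}")
```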
3) K-means: K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given dataset through a certain number of clusters (assume k clusters) fixed a priori. The K-means algorithm is employed when labeled data is not available. Boosting, by contrast, is a general method of converting rough rules of thumb into a highly accurate prediction rule: given a weak learning algorithm that can consistently find classifiers (rules of thumb) at least slightly better than random, say with 55% accuracy, a boosting algorithm can provably construct a single classifier with very high accuracy, say 99%. [4]
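The following sketch of ours runs k-means with k fixed a priori on synthetic blobs standing in for image feature vectors (scikit-learn's KMeans, not the paper's exact setup):

```python
# Illustrative k-means run (ours): k is fixed a priori and each point
# is assigned to the nearest of the k centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three blobs standing in for feature vectors extracted from images.
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in (-3, 0, 3)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("first ten labels:", km.labels_[:10])
```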
4) Neural Networks: Neural networks (NNs) can perform a number of regression and/or classification tasks at once, although commonly each network performs only one. In the vast majority of cases, therefore, the network has a single output variable, although in the case of many-state classification problems this may correspond to a number of output units (the post-processing stage takes care of the mapping from output units to output variables). The behavior of an Artificial Neural Network (ANN) depends on three fundamental aspects: the input and activation functions of the units, the network architecture, and the weight of each input connection. Given that the first two aspects are fixed, the behavior of the ANN is defined by the current values of the weights. The weights of the net to be trained are initially set to random values, and then instances of the training set are repeatedly exposed to the net. The values for the input of an instance are placed on the input units, and the output of the net is compared with the desired output for this instance. Then, all the weights in the net are adjusted slightly in the direction that would bring the output values of the net closer to the desired output. There are several algorithms with which a network can be trained. [5]
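As an illustration of this training loop, a sketch of ours using scikit-learn's MLPClassifier on synthetic data; the layer size and iteration count are arbitrary choices, not the paper's:

```python
# Toy sketch (ours) of the training loop described above: weights start
# random and are nudged toward the desired outputs on each pass over the data.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(X, y)  # backpropagation repeatedly adjusts the connection weights

print(f"training accuracy: {net.score(X, y):.3f}")
print(f"loss after training: {net.loss_:.4f}")
```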
5) Hierarchical Clustering: Hierarchical clustering is a popular unsupervised learning strategy for grouping similar data elements. It creates a hierarchical cluster structure by repeatedly merging and dividing clusters based on similarity and dissimilarity. The core concept underlying hierarchical clustering is to construct a dendrogram, a tree-like framework that depicts the connections between data points and clusters. The dendrogram begins with every data point as a separate cluster and eventually combines related groups according to a similarity or distance value. The algorithm assesses the closeness of clusters and picks the clusters to combine at each stage. Depending on the data and the problem domain, the similarity and distance metrics employed in hierarchical clustering might differ. Commonly employed distance measures include Euclidean distance, Manhattan distance, and correlation distance. These measurements assess the dissimilarity and similarity of data points and serve as a basis for the clustering method. [6]

6) Louvain Clustering: The Louvain method is an algorithm to detect communities in large networks. It maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. This means evaluating how much more densely connected the nodes within a community are, compared to how connected they would be in a random network. The Louvain algorithm is a hierarchical clustering algorithm that recursively merges communities into a single node and executes the modularity clustering on the condensed graphs. [7]

7) Random Forest: Random forest is a classification technique proposed by Breiman (2001). When given a set of class-labeled data, random forest builds a set of classification trees. Each tree is developed from a bootstrap sample of the training data. When developing individual trees, an arbitrary subset of attributes is drawn (hence the term "random"), from which the best attribute for the split is selected. Classification is based on the majority vote of the individually developed tree classifiers in the forest. [8]

8) KNN Algorithm: The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. It is one of the most popular and simplest classification and regression classifiers used in machine learning today. [9]

For classification problems, a class label is assigned on the basis of a majority vote, i.e. the label that is most frequently represented around a given data point is used. While this is technically considered "plurality voting", the term "majority vote" is more commonly used in the literature. The distinction between these terminologies is that "majority voting" technically requires a majority of greater than 50 percent, which primarily works when there are only two categories. When there are multiple classes, e.g. four categories, you do not necessarily need 50 percent of the vote to make a conclusion about a class; you could assign a class label with a vote of greater than 25 percent, as the sketch below shows.
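The sketch below (ours, on synthetic four-class data) makes the plurality vote visible by counting the votes among the k = 8 nearest neighbours:

```python
# Sketch (ours) of the plurality vote: with four classes and k = 8
# neighbours, a class can win with well under 50% of the votes.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=400, centers=4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=8).fit(X, y)

query = np.array([[0.0, 2.0]])
_, idx = knn.kneighbors(query)               # indices of the 8 nearest points
votes = np.bincount(y[idx[0]], minlength=4)
print("votes per class:", votes)             # e.g. [3 2 2 1] -> class 0 wins
print("prediction:", knn.predict(query))
```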
METHODOLOGY

A. Kaggle: Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle enables users to find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and ML professionals, and compete to solve data science challenges.
B. Dataset: In this paper, we have chosen the Animal dataset, which was provided on Kaggle (dataset as of 4 June 2020). With this dataset we are able to generate graphs that reveal the hidden patterns existing within it and the relations between the animal images and their classes.
C. Orange: Orange is built around a C++ core object and routine library that incorporates a large collection of standard and non-standard machine learning and data mining algorithms. It is an open-source data visualization, data mining, and machine learning tool. Orange is a scriptable environment for fast prototyping of new algorithms and testing schemes. It is an assembly of Python-based modules that sit on top of the core library; functionality for which run time is not critical is implemented in Python. Orange is also a set of graphical widgets (Fig. 2) that use methods from the core library and Orange modules and provide a friendly user interface (UI).

Orange is intended both for experienced users and researchers in data mining and ML who want to develop and test their own algorithms while reusing as much of the existing code as possible, and for those just entering the field who can write short Python scripts for data analysis.
ANALYSIS OF DATASET

A. Classification of Animal Dataset using different ML algorithms: [10] Import the image dataset into the Orange data mining tool, apply Image Embedding to obtain a data table, and use different machine learning algorithms for classification. Then review and compare all the models with the Test and Score widget by connecting all the ML algorithms to it. Test and Score takes a number of folds; the minimum number of folds is given as 5 and the maximum as 20, so there are three different comparisons as the number of folds varies over 5, 10, and 20. The Confusion Matrix widget detects the errors and predictions, showing which are correct and which are wrong, and the individual images can be viewed with the Image Viewer widget. A scripting sketch of this workflow is given below.

Fig. 1. Image Classification in Orange tool
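Since Orange is itself a Python library, the Test and Score comparison can be approximated from a script. The sketch below is ours: animals_embedded.tab is a hypothetical export of the image-embedding table, and the exact CrossValidation call signature varies across Orange 3 versions.

```python
# Hedged sketch of the Test and Score workflow in Orange 3 scripting.
# "animals_embedded.tab" is a hypothetical export of the embedded image
# table; the CrossValidation signature differs between Orange 3 versions.
import Orange

data = Orange.data.Table("animals_embedded.tab")

learners = [
    Orange.classification.LogisticRegressionLearner(),
    Orange.classification.SVMLearner(),
    Orange.classification.KNNLearner(),
    Orange.classification.RandomForestLearner(),
]

for k in (5, 10, 20):                      # the three fold settings compared
    results = Orange.evaluation.CrossValidation(data, learners, k=k)
    cas = Orange.evaluation.CA(results)    # classification accuracy per learner
    print(f"k = {k}:", [f"{ca:.3f}" for ca in cas])
```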
Comparing the machine learning algorithms by cross-validation (number of folds):

1. No. of folds = 5

After giving the number of folds as 5, we get the comparison between the different algorithms used. Logistic Regression has the best classification accuracy, which is 0.994.

Fig. 2. Test and Score for 5 folds

Confusion matrix for the trained dataset: in the confusion matrix we can detect 3 errors, which are predicted wrong, and the classification accuracy is equal to 0.994 for the best algorithm.

Fig. 3. Confusion Matrix for Animal Classification (5 folds)

Comparison for 5 folds:
Logistic Regression - 0.994 (highest classification accuracy)
SVM - 0.975
KNN - 0.994 (same CA as Logistic Regression)
Random Forest - 0.960

2. No. of folds = 10

After giving the number of folds as 10, we get the comparison between the different algorithms used. As the number of folds changes to 10, the classification accuracy changes for every algorithm used, as can be observed in the figure below.

Fig. 4. Test and Score Results for 10 folds

Confusion matrix for the trained dataset:

Fig. 5. Confusion Matrix for 10 folds

Comparison for 10 folds:
Logistic Regression - 0.994 (highest classification accuracy)
SVM - 0.973
KNN - 0.994 (same CA as Logistic Regression)
Random Forest - 0.973

Compared with 5 folds, Logistic Regression and KNN keep the same classification accuracy.

3. No. of folds = 20

After giving the number of folds as 20, we get the comparison between the different algorithms used. We change the number of folds because providing more folds varies the performance and affects the computation time. The classification accuracy changes for every algorithm used, and the changed values can be seen in the figure below.

Fig. 6. Test and Score Results with 20 Folds Cross-Validation

Confusion matrix for the trained dataset:

Fig. 7. Confusion Matrix for 20 folds

Comparison for 20 folds:
Logistic Regression - 0.994 (highest classification accuracy)
SVM - 0.973
KNN - 0.994 (same CA as Logistic Regression)
Random Forest - 0.963

Compared with 5 and 10 folds, Logistic Regression and KNN keep the same classification accuracy, while Random Forest changes to 0.963. Random Forest shows the largest variation in classification accuracy across the different numbers of folds.
B. Clustering of Animal Dataset using different ML algorithms: Clustering requires a separate dataset because clustering itself organizes the types of images without their being classified into separate folders or branches; we simply upload the images in random order, without labels. [11]

Fig. 8. Image Clustering in Orange tool

It will automatically group the images into different clusters based on their features. We use unsupervised models for the image clustering process.

Fig. 9. Example for Clustering

Here we can observe how clustering works on a sample dataset and divides it into different clusters.

1. Hierarchical Clustering:

Hierarchical clustering in the Orange data mining tool is a method used to group data points into clusters by building a hierarchy based on their similarities. It works by initially treating each data point as a separate cluster and then progressively merging similar clusters. This process is visualized through a dendrogram, a tree-like diagram where the height of each merge represents the distance between the clusters. [12]

Image showing the hierarchical clustering on the implemented Animal dataset:

Fig. 10. Hierarchical Clustering in Orange tool
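To show what a dendrogram like the one in Fig. 10 encodes, here is a sketch of ours using SciPy in place of Orange's widget, on made-up two-blob data; the merge height is the cluster distance:

```python
# Dendrogram sketch (ours, SciPy standing in for Orange's widget):
# each point starts as its own cluster and similar clusters merge upward.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2))])

Z = linkage(X, method="ward", metric="euclidean")  # merge history
dendrogram(Z)                                      # height = merge distance
plt.xlabel("data points")
plt.ylabel("cluster distance")
plt.show()
```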
2. K-means Clustering:

K-means clustering in the Orange data mining tool is a method of partitioning data into k distinct clusters. It groups data points based on their similarity, minimizing the variance within each cluster. In this process, the algorithm assigns data points to the nearest cluster by calculating the distance from the cluster's centroid (the mean of its points). The centroids are recalculated iteratively until the data points' assignments no longer change significantly, resulting in clusters where each data point is closer to its own cluster centroid than to the others. [13]

Image showing the K-means clustering on the implemented Animal dataset:

Fig. 11. K-Means Clustering (Scatter Plot) in Orange tool

Fig. 12. K-Means Clustering (Silhouette Plot) in Orange tool

In k-means, images whose features are similar are plotted near one another, while dissimilar images are plotted far apart; in hierarchical clustering this structure is easy to view in the dendrogram. A silhouette check of this kind of clustering is sketched below.
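The silhouette view in Fig. 12 can be approximated with the following sketch of ours (scikit-learn on synthetic blobs): scores near 1 indicate a sample that sits close to its own cluster and far from the others.

```python
# Sketch (ours) of the silhouette check behind a plot like Fig. 12.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(f"mean silhouette: {silhouette_score(X, km.labels_):.3f}")
print("worst five samples:", np.sort(silhouette_samples(X, km.labels_))[:5])
```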
CONCLUSION

This paper compared machine learning algorithms for animal classification and clustering using the Orange data mining tool. The study explores various algorithms, including decision trees, support vector machines, and k-means clustering, to evaluate their effectiveness in accurately classifying and grouping different animal species based on their features. The conclusion of this paper is that, while all algorithms demonstrated varying degrees of success, the support vector machine and Logistic Regression outperformed the others in terms of accuracy and precision, making them the most suitable choices for this classification task. Furthermore, the decision tree algorithm provided valuable insights into feature importance, allowing for a better understanding of the characteristics that distinguish each species.

Additionally, k-means clustering revealed interesting patterns in the data, highlighting potential relationships between species that may not have been immediately apparent through traditional clustering methods. These findings suggest that a hybrid approach, combining the strengths of these algorithms, could lead to even more robust classification models in future research. Moreover, incorporating ensemble methods could enhance model performance by leveraging the diverse strengths of individual algorithms, ultimately leading to more accurate predictions and deeper insights into the underlying data structure. Furthermore, exploring the integration of deep learning techniques may offer new avenues for uncovering complex patterns and interactions within the dataset, potentially revolutionizing our understanding of species classification. As researchers continue to refine these methodologies, it will be crucial to validate the results through extensive field studies and cross-validation with existing taxonomic frameworks.

The final conclusion is that Logistic Regression and Support Vector Machine (SVM) are the best machine learning algorithms for image classification. For image clustering, we supply a separate, unclassified dataset and use both the hierarchical and k-means clustering unsupervised models. This comparison between different ML algorithms is made by changing the number of folds for the different trained models.
REFERENCES

[1] T. O. Ayodele, "Types of machine learning algorithms," New Advances in Machine Learning, vol. 3, pp. 19-48, 2010.
[2] R. Reddy and U. A. Kumar, "Classification of user's review using modified logistic regression technique," International Journal of System Assurance Engineering and Management, vol. 15, no. 1, pp. 279-286, 2024.
[3] Rangayya, Virupakshappa, and N. Patil, "Improved face recognition method using SVM-MRF with KTBD based KCM segmentation approach," International Journal of System Assurance Engineering and Management, vol. 15, no. 1, pp. 1-12, 2024.
[4] L. Sun, Q. Zhang, W. Ding, T. Wang, and J. Xu, "FCPFS: Fuzzy granular ball clustering-based partial multilabel feature selection with fuzzy mutual information," IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
[5] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Real-time learning capability of neural networks," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 863-878, 2006.
[6] I. B. Prasad, S. Gangwar, Yogita, S. S. Yadav, and V. Pal, "HCM: a hierarchical clustering framework with MOORA based cluster head selection approach for energy efficient wireless sensor networks," Microsystem Technologies, vol. 30, no. 4, pp. 393-409, 2024.
[7] H. Mardiansyah, S. Suwilo, E. B. Nababan, and S. Efendi, "The role of Louvain-coloring clustering in the detection of fraud transactions," International Journal of Electrical and Computer Engineering (IJECE), vol. 14, no. 1, pp. 608-616, 2024.
[8] R. Iranzad and X. Liu, "A review of random forest-based feature selection methods for data science education and applications," International Journal of Data Science and Analytics, pp. 1-15, 2024.
[9] A.-Y. Guerrero-Estrada, L. Quezada, and G.-H. Sun, "Benchmarking quantum versions of the kNN algorithm with a metric based on amplitude-encoded features," Scientific Reports, vol. 14, no. 1, p. 16697, 2024.
[10] J. Chappidi and D. M. Sundaram, "A comparative study of animal detection and classification algorithms, applications and challenges," in AIP Conference Proceedings, vol. 3112, no. 1, AIP Publishing, 2024.
[11] T. Waele, A. Shahid, D. Peralta, F. Tuyttens, and E. Poorter, "Towards unsupervised animal activity recognition: a deep learning based clustering algorithm for equine gait classification," 2024.
[12] S. Stiller, J. F. Dueñas, S. Hempel, M. C. Rillig, and M. Ryo, "Deep learning image analysis for filamentous fungi taxonomic classification: Dealing with small datasets with class imbalance and hierarchical grouping," Biology Methods and Protocols, vol. 9, no. 1, p. bpae063, 2024.
[13] J. Liu, D. W. Bailey, H. Cao, T. C. Son, and C. T. Tobin, "Development of a novel classification approach for cow behavior analysis using tracking data and unsupervised machine learning techniques," Sensors, vol. 24, no. 13, p. 4067, 2024.