Advances in Intelligent Systems Research, volume 168
International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)
Gaussian Naive Bayesian Data Classification Model Based on Clustering Algorithm
Zeng-jun BI*, Yao-quan HAN, Cai-quan HUANG and Min WANG
Air Force Early Warning Academy, Wuhan, China
*Corresponding author
Keywords: Clustering algorithm, Naive Bayesian algorithm, Classification model.
Abstract. A Gaussian Naive Bayesian data classification model based on a clustering algorithm is proposed for the fast recognition and classification of unknown continuous data for which little prior knowledge is available. First, representative samples are extracted from the unknown data according to an information entropy measure and clustered to generate class labels. Then, the mapping relationship between the data and the class labels is established using the Gaussian Naive Bayes algorithm, and the classification model is obtained through training. Simulation results show that this unsupervised analysis process classifies new data well.
Introduction
Classification is an important part of data mining. By learning from training data, the mapping relationship between the training data and predefined classes can be established [1]. To enable traditional classification algorithms to classify data well when no predefined classes are available for learning, semi-supervised or even unsupervised methods are used to improve them [2]. Reference [3] uses a semi-supervised Naive Bayesian classification algorithm: an initial classifier is built from a small set of labeled data, and data classified with high confidence are continuously added to the training set while the unlabeled data are being predicted and classified, thus realizing semi-supervised learning for data classification. However, this algorithm does not fundamentally achieve unsupervised generation of the class labels of the data to be classified, and prior knowledge still plays a crucial role in training the classifier. Clustering is an unsupervised process in which the most similar objects are grouped into a class based on the objects found in the data and their relationships [4,5]. Reference [6] applies unsupervised clustering to text clustering and constructs an automatic text classification model based on the vector space model; however, that model is not suitable for classifying continuous variables.
Therefore, this paper combines a clustering algorithm with the Gaussian Naive Bayes classification algorithm and proposes an unsupervised classification model suitable for continuous-variable data. In this method, a small representative sample is extracted from the large sample using information entropy theory, and the classes predicted by the clustering algorithm on the observed data serve as the predefined target classes of the classification algorithm, so that the data are classified and a prediction model is established without prior knowledge. Simulation results show that the model classifies and processes new data efficiently, and that only a small part of the sample needs to be extracted to train a classification model for the whole data set, which greatly saves computing resources and time.
Selection of Clustering Algorithm
Classical clustering algorithms can be divided into hierarchical, partition-based, and density-based algorithms, with agglomerative hierarchical clustering, k-means, and DBSCAN as the corresponding representative classical algorithms. Clustering performance measures evaluate clustering algorithms in different environments according to indicators such as the accuracy and consistency of the sample partitions they produce. The Adjusted Rand Index (ARI) is used here to measure the consistency between the labels computed by a clustering algorithm and the original labels. Its expression is:
\[
\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]} \tag{1}
\]

where \( \mathrm{RI} = (a + b)/C_n^2 \), \( E[\mathrm{RI}] \) is the expected value of RI, a is the number of data pairs that are in the same class both before and after clustering, and b is the number of data pairs that are in different classes both before and after clustering. The ARI reflects the consistency between the class tags obtained by the clustering algorithm and the class tags carried by the data; a higher ARI value reflects more accurate clustering performance.
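As a concrete illustration, the ARI of Eq. (1) can be computed directly; the following is a minimal sketch assuming the scikit-learn library, whose adjusted_rand_score implements this index.

from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]  # class tags carried by the data
labels_pred = [1, 1, 0, 0, 2, 2]  # tags produced by a clustering algorithm

# The ARI is invariant to a permutation of the label names, so a
# perfect but renamed clustering still scores 1.0.
print(adjusted_rand_score(labels_true, labels_pred))  # -> 1.0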
The ARI was used to measure the clustering effect of the k-means, hierarchical clustering, and DBSCAN algorithms on sample sizes of 50 to 100. Three two-dimensional groups of random points following Gaussian distributions were generated, centered at (1.5, 1.5), (2, 2), and (3, 3), with the sample size ranging from 50 to 100. The relation between the ARI of the three clustering algorithms and the sample size is shown in Fig. 1.
Figure 1. Relation between ARI index and sample size
As can be seen from Fig. 1, the ARI of the hierarchical clustering algorithm is the highest and relatively stable for sample sizes between 65 and 80, which shows that hierarchical clustering has the advantage of high accuracy in small-sample environments. Therefore, this paper exploits this property and uses the hierarchical clustering algorithm to generate the class tags of the sample data.
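For reference, a comparison of this kind could be reproduced along the following lines; this is a sketch under assumptions (scikit-learn implementations, an assumed cluster spread, and a hypothetical DBSCAN eps), not the exact experimental code behind Fig. 1.

from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

centers = [(1.5, 1.5), (2, 2), (3, 3)]
for n in range(50, 101, 10):
    # Three 2-D Gaussian point groups; cluster_std is an assumed value.
    X, labels_true = make_blobs(n_samples=n, centers=centers,
                                cluster_std=0.3, random_state=0)
    algorithms = {
        "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
        "hierarchical": AgglomerativeClustering(n_clusters=3),
        "DBSCAN": DBSCAN(eps=0.3),  # eps is a hypothetical choice
    }
    for name, algorithm in algorithms.items():
        ari = adjusted_rand_score(labels_true, algorithm.fit_predict(X))
        print(n, name, round(ari, 3))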
Gaussian Naive Bayes Model Combined with Clustering Algorithm
This paper proposes a data classification model combining a clustering algorithm with the Gaussian Naive Bayes classification algorithm. With no prior knowledge, a classification model is generated that assigns an appropriate class tag Y to the sample DATA, and data of the same type, NEW_DATA_i, are classified according to the same standard. The model building process is shown in Fig. 2.
Figure 2. Model building process
The DATA_sample was extracted from the large-scale DATA according to the information entropy as the learning data set of the model, and the DATA_sample was input into the clustering algorithm
to obtain the class tags y of the DATA_sample. The correspondence between the DATA_sample and the class tags y can be considered to represent the correspondence between DATA and Y. The DATA_sample was randomly split into a training set (x_train) and a test set (x_test) according to a certain ratio; correspondingly, y was split into y_train and y_test. Then x_train and y_train were input into the Gaussian Naive Bayes algorithm to train the classifier. The classifier was tested with x_test and y_test to check its accuracy, computed as the percentage of samples in x_test whose class labels predicted by the classifier are consistent with y_test. The classifier is encapsulated to generate a classification model that captures the distribution and classification rules of DATA. After the model is deployed, the corresponding class tag can be obtained, according to the same principle, for all of the sample data DATA or for new data NEW_DATA with the same distribution that is input into the model.
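The pipeline described above can be sketched as follows, assuming scikit-learn; the function name build_classifier, the fixed split ratio, and the uniform random stand-in for the entropy-based extraction are illustrative assumptions, not the paper's exact implementation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def build_classifier(DATA, sample_size, n_classes):
    # 1. Extract a small representative DATA_sample from the unlabeled
    #    DATA (a uniform random draw stands in for the entropy-based
    #    extraction described in the text).
    idx = np.random.default_rng(0).choice(len(DATA), size=sample_size,
                                          replace=False)
    DATA_sample = DATA[idx]
    # 2. Generate the class tags y by hierarchical clustering.
    y = AgglomerativeClustering(n_clusters=n_classes).fit_predict(DATA_sample)
    # 3. Split into training and test sets.
    x_train, x_test, y_train, y_test = train_test_split(
        DATA_sample, y, test_size=0.3, random_state=0)
    # 4. Train the Gaussian Naive Bayes classifier and check its accuracy.
    classifier = GaussianNB().fit(x_train, y_train)
    print("accuracy on x_test:", classifier.score(x_test, y_test))
    return classifier

# After deployment the model labels DATA, or NEW_DATA of the same
# distribution, via: new_y = classifier.predict(NEW_DATA)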
Experimental Simulation
This experiment was conducted on a 64-bit Windows 10 operating system with 8GB of computer
memory. The algorithm uses Python language to compile and run on Jypyter software.
2000 sample points following Gaussian distributions and containing four classes were generated, and each point carried its original class label, labels_true, as the basis for measuring model performance. The sample distribution is shown in Fig. 3.
Figure 3. Sample distribution diagram
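Such a four-class Gaussian sample can be generated, for example, with scikit-learn's make_blobs; the spread and random seed below are illustrative choices, since the paper does not report them.

from sklearn.datasets import make_blobs

# 2000 two-dimensional points in four Gaussian classes; labels_true
# carries the original class label of each point.
DATA, labels_true = make_blobs(n_samples=2000, centers=4,
                               cluster_std=0.6, random_state=0)
print(DATA.shape, set(labels_true))  # (2000, 2) {0, 1, 2, 3}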
The information entropy and the accuracy of the corresponding classification algorithm under training sets of different sizes were calculated, as shown in Fig. 4.
Figure 4. Information entropy and accuracy of corresponding classification algorithm under training sets of different
scales
By calculation, the second derivative of the information entropy curve is zero when the training set size is 126, at which point the accuracy of the classification algorithm is 0.74.
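The paper does not spell out its entropy measure, so the following sketch assumes the Shannon entropy of a two-dimensional histogram of each candidate training set (using DATA from the generation sketch above) and approximates the second derivative of the entropy curve numerically; the bin count and size grid are assumptions.

import numpy as np

def subsample_entropy(DATA, m, bins=10, rng=np.random.default_rng(0)):
    # Shannon entropy (in bits) of a 2-D histogram of an m-point subsample.
    idx = rng.choice(len(DATA), size=m, replace=False)
    hist, _, _ = np.histogram2d(DATA[idx, 0], DATA[idx, 1], bins=bins)
    p = hist.ravel() / m
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

sizes = np.arange(20, 401, 10)
H = np.array([subsample_entropy(DATA, m) for m in sizes])
# Numerical second derivative of the entropy curve over the size axis.
d2H = np.gradient(np.gradient(H, sizes), sizes)
# Take the first size at which the second derivative changes sign.
knee = sizes[np.argmax(np.diff(np.sign(d2H)) != 0)]
print("chosen training-set size:", knee)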
Accordingly, 126 sample points were extracted from the 2000 sample data as the training set of the model. After clustering, the 126 sample points were assigned corresponding class labels. The distribution of the training set before and after clustering is shown in Fig. 5, where sample points of different classes are drawn with different shapes after clustering.
Figure 5. Training set distribution before and after clustering
The training set with the class tags obtained by clustering is input into the Gaussian Naive Bayes algorithm for training, and the accuracy of the resulting model on the test set (x_test) is 0.75.
Further calculation shows that as the overall sample size increases, the advantages of this method in processing large samples become more obvious, as shown in Fig. 6.
Figure 6. Relationship between sample size and (a) model accuracy and (b) information entropy
As can be seen from Fig. 6(a), as the sample size grows, the accuracy of the model keeps improving and remains high. As can be seen from Fig. 6(b), information entropy, as a criterion for selecting the training set size, allows samples of different sizes to select training sets that reasonably reflect the sample distribution, which not only ensures the accuracy of the classification algorithm but also keeps the training set small relative to the overall sample.
Conclusion
To realize unsupervised classification of unknown data, this paper proposes a continuous-data analysis model combining a clustering algorithm with the Naive Bayesian classification algorithm. A small data set is drawn from the larger data sample using information entropy theory for model learning; exploiting the high accuracy of the hierarchical clustering algorithm on small data sets, the clustering result is used to generate the target classes for the Bayesian classification algorithm, so that the probabilistic rules of the Bayes algorithm effectively capture the data. Tests on simulated data show that the method needs to extract only a small part of the sample to train a classification model for the overall data, which greatly saves machine computing resources and model training time.
References
[1] Le Mingming. Research and Application of Data Mining Classification Algorithms [D]. Chengdu: University of Electronic Science and Technology of China, 2017: 12-16.
[2] Kong Yiqing. Semi-supervised Learning and Its Application Research [D]. Wuxi: Jiangnan University, 2009: 33-39.
[3] Dong Liyan, Sui Peng, Sun Peng, Li Yongli. A New Naive Bayesian Algorithm Based on Semi-supervised Learning [J]. Journal of Jilin University (Engineering Science Edition), 2016, 46(3): 884-889.
[4] IEEE Transactions on Power Systems, 2006, 21(2): 933-940.
[5] Zhang Bin, Zhuang Chijie, Hu Jun, et al. Power Load Curve Integrated Clustering Algorithm Combined with Dimension Reduction Technology [J]. Chinese Journal of Electrical Engineering, 2015, 35(15): 3741-3749.
[6] Zhu Cuiling. Research on Text Classification Methods Based on Unsupervised Clustering and Naive Bayes Classification [D]. Jinan: Shandong University, 2005: 33-39.