
electronics

Article
Using Medical Data and Clustering Techniques for a Smart
Healthcare System
Wen-Chieh Yang 1, Jung-Pin Lai 2, Yu-Hui Liu 3, Ying-Lei Lin 3, Hung-Pin Hou 1 and Ping-Feng Pai 3,4,*

1 Puli Christian Hospital, Puli 54546, Taiwan


2 Department of Multimedia Game Development and Application, HungKuang University,
Taichung 43302, Taiwan
3 PhD Program in Strategy and Development of Emerging Industries, National Chi Nan University,
Nantou 54561, Taiwan
4 Department of Information Management, National Chi Nan University, Nantou 54561, Taiwan
* Correspondence: [email protected]

Abstract: With the rapid advancement of information technology, both hardware and software, smart
healthcare has become increasingly achievable. The integration of medical data and machine-learning
technology is the key to realizing this potential. The quality of medical data influences the results of a
smart healthcare system to a great extent. This study aimed to design a smart healthcare system based
on clustering techniques and medical data (SHCM) to analyze potential risks and trends in patients in
a given time frame. Evidence-based medicine was also employed to explore the results generated by
the proposed SHCM system. Thus, similar and different discoveries examined by applying evidence-
based medicine could be investigated and integrated into the SHCM to provide personalized smart
medical services. In addition, the presented SHCM system analyzes the relationship between health
conditions and patients in terms of the clustering results. The findings of this study show the
similarities and differences in the clusters obtained between indigenous patients and non-indigenous
patients in terms of diseases, time, and numbers. Therefore, the analyzed potential health risks
could be further employed in hospital management, such as personalized health education control,
personal healthcare, improvement in the utilization of medical resources, and the evaluation of
medical expenses.

Keywords: clustering; medical data; smart healthcare

Citation: Yang, W.-C.; Lai, J.-P.; Liu, Y.-H.; Lin, Y.-L.; Hou, H.-P.; Pai, P.-F. Using Medical Data and Clustering Techniques for a Smart Healthcare System. Electronics 2024, 13, 140. https://doi.org/10.3390/electronics13010140

Academic Editors: Antoni Morell and Chunping Li
Received: 15 November 2023; Revised: 18 December 2023; Accepted: 27 December 2023; Published: 28 December 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Due to the progress and advantages of information technology and data analysis techniques, smart medical care plays an important role in the modern medical field. Machine-learning and data-mining techniques have provided hospital practitioners with more effective and efficient medical solutions in personalized medicine and led to disease predictions, medical efficiency improvement, and medical resource optimization. To identify similarities among patients, grouping patients into clinically meaningful clusters is essential [1]. Healthcare organizations and physicians take advantage of clustering results to analyze similarities among patients. By clustering patients in terms of diseases, risk factors, lifestyles, or other relevant factors, clustering results can help physicians gain insights into patients’ needs and provide personalized treatments.

Previous studies have pointed out the importance of using medical management databases to analyze patient clusters to learn trends of diseases according to clustering results [2]. The clustering technique is one of the most useful methods for analyzing patient similarities for precision medicine [1]. Analyzing a patient’s potential risks and trends requires a lot of patient-related data, which are recorded every time a patient visits a hospital for medical treatment. In the era of big data, electronic records include a large amount of text, such as the clinical narration of doctors’ advice. Thus, the analysis of electronic records has become more complex than before. In addition, due to the high
dimensions of input data, the reduction in dimensions or feature selection can improve
model efficiency and the performance of clustering tasks. Zelina et al. [3] proposed a
natural language processing (NLP) method to investigate the clinician dataset of Czech
breast cancer patients. The developed RobeCzech model is a general-purpose Czech
transformer language model and is used for the unsupervised extraction, labeling, and
clustering of fragments from clinical records. This study indicated the feasibility as well
as the possibility of dealing with unstructured Czech clinical records in a non-supervised
training manner. Irving et al. [4] employed electronic medical record (EMR) data to enhance
the detection and prediction of psychosis risk in South London. In addition to basic patient
information, clinical characteristics, symptoms, and substances, the EMR data included
NLP predictions. The authors reported that using NLP to cope with EMRs can significantly
improve the prognostic accuracy of psychosis risk.
Issues of concern in existing electronic medical records and eHealth systems include
technical aspects, managerial factors, and particularly the quality of data in systems [5].
Additionally, as previously pointed out, the quality of the data is essential for healthcare
systems [6]. Thus, this study aimed to deal with various data types by applying data
preprocessing with data merging, data conversion, data cleaning, data selection, and data
normalization. Then, clustering techniques were employed to group patients with similar
medical features to improve the data quality for the healthcare system.
This investigation used demographic information, drug items, doctors’ advice, and
exam items to perform clustering tasks and then to analyze the results in terms of in-
digenous people and non-indigenous people. Four clustering methods were used in this
study, namely, K-means, hierarchical clustering, autoencoder, and SOM-KM. The clustering
performance was evaluated through three indicators: the Calinski–Harabasz index (CH),
Davies–Bouldin index (DB), and Silhouette Coefficient (SC). For most cases and indices,
K-means outperformed the other methods. Therefore, K-means was used to analyze the
clustering results. The rest of this study is organized as follows. Section 2 illustrates
the clustering methods and applications in medical data analysis. The presented smart
healthcare system based on clustering techniques and big data is introduced in Section 3.
Section 4 depicts numerical examples. Finally, conclusions are presented in Section 5.

2. Clustering Techniques and Applications for Medical Data Analysis


Ezugwu et al. [7] and Saxena et al. [8] reported that clustering techniques can be
divided into two major categories, namely, hierarchical clustering algorithms and parti-
tion clustering algorithms. More clustering categories, including grid clustering, density
clustering, and model clustering, were proposed by Chaudhry et al. [9] and Oyewole
and Thopil [10]. K-means and hierarchical clustering techniques are the most widely
used algorithms in the literature. K-means clustering is one of the partition clustering
methods. Applications of clustering approaches in medical data analysis include dis-
ease nosology [11], early diagnosis of diseases [12,13], predictions of diseases [14,15], etc.
The clustering of diseases is mostly for chronic diseases and severe illnesses, for exam-
ple, diabetes [13,16,17], heart failure [18,19], cancer [20,21], stroke [22,23], and COVID-19
cases [24,25]. Arora et al. [16] used K-means clustering for the prediction of diabetes.
Jasinska-Piadlo et al. [18] employed K-means to cluster emergency readmissions of heart
failure patients by using a data-driven approach and domain-leading methods. Heart
failure patients usually have various characteristics at the physiological level. The study
indicated that the K-means clustering algorithm could identify patients with heart failure
very well. Ilbeigipour et al. [25] used SOM (self-organizing map) neural networks and
the K-means technique to cluster COVID-19 patients and investigated the relationships
between different symptoms of cases. The findings of this study could help health special-
ists improve their services by considering other important factors in treating COVID-19
patients in different ethnic groups. Table 1 lists a summary of recent clustering approaches
for medical data. It can be observed that most studies dealt with a single disease, and
K-means was commonly used as a popular clustering technique for analyzing medical
data. The clustering approaches can be generally classified into two categories: hierarchical
clustering algorithms and partition clustering algorithms [7–10]. This study employed four
clustering methods, K-means (KM), hierarchical clustering (HC), the K-means autoencoder
(AEKM), and the K-means self-organizing map (SOMKM), to analyze medical data.

Table 1. Recent clustering methods for medical data.

References | Years | Applications | Methods of Clustering
Santamaría et al. [11] | 2020 | Analysis of new nosological models | DBSCAN *
Farouk and Rady [12] | 2020 | Early diagnosis of Alzheimer’s disease | K-means, K-medoids
Hassan et al. [13] | 2022 | As a feature-grouping model for early diabetes detection | K-means
Antony et al. [14] | 2021 | Chronic kidney disease prediction | K-means, DBSCAN *, I-Forest *, Autoencoder
Enireddy et al. [15] | 2021 | Prediction of diseases | K-means, Agglomerative, Fuzzy C-means
Arora et al. [16] | 2022 | As a feature-extraction tool for diabetes patient prediction | K-means
Parikh et al. [17] | 2023 | Discovering and clustering phenotypes of atypical diabetes | K-means
Jasinska-Piadlo et al. [18] | 2023 | Clustering heart failures | K-means
Mpanya et al. [19] | 2023 | Clustering heart failure phenotypes | K-prototype, K-means, Agglomerative, BIRCH *, OPTICS *, DBSCAN *, GMM *
Florensa et al. [20] | 2022 | Exploring associations between risk factors and likelihood of colorectal cancer | K-means
Koné et al. [21] | 2023 | Exploring the clustering of 17 chronic conditions with cancer | K-means
Chantraine et al. [22] | 2022 | Classification of stiff-knee gait kinematic severity after stroke | K-means
Yasa et al. [23] | 2022 | Classification of stroke | K-means
Al-Khafaji and Jaleel [24] | 2022 | Controlling and forecasting COVID-19 cases | K-Efficient (a hybrid of K-medoids and K-means)
Ilbeigipour et al. [25] | 2022 | The analysis of COVID-19 cases | SOM, K-means

Note: * I-Forest = Isolation Forest; BIRCH = Balanced Iterative Reducing and Clustering Hierarchies; OPTICS = Ordering Points to Identify the Clustering Structure; DBSCAN = Density-Based Spatial Clustering of Applications with Noise; GMM = Gaussian Mixture Model.

The K-means method [26] involves dividing a sample dataset into k subsets, forming
k clusters, and assigning n data points to these k clusters, with each data point exclusively
belonging to one cluster. The K-means algorithm is an iterative process that consists of two
primary steps. Initially, it selects k cluster centers, and subsequently, it assigns data points
to the nearest center to obtain an initial result. Following this, the centroids of each cluster
are updated as new centers, and these two steps are repeated iteratively. The objective of
the clustering results is to minimize the distance between data points and their respective
cluster centers. The objective function of the K-means algorithm is shown in the following equations. Equation (1) employs the Euclidean distance to ensure that data point xi is closest to its assigned center, while Equation (2) is used to update the center as the mean value [27–30]:

Obj = Σ_{i=1}^{k} Σ_{j=1}^{N} ||xi − xj||^2   (1)

Xk = (1/N) Σ_{i=1}^{N} xi   (2)

where k is the number of cluster centers, N is the number of data points in the ith cluster, xj is the cluster mean, and xi is the ith point in the dataset.
For the K-means clustering algorithm, it is necessary to pre-specify the number of
clusters denoted by K. This is an important hyperparameter of the algorithm. To determine
the most suitable number of clusters for the experimental data, the Elbow Method is
employed in this approach [31].
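The K-means procedure and the Elbow Method described above can be sketched with scikit-learn as follows; the toy data and the simple "largest drop" elbow check are illustrative assumptions, not the study's hospital dataset or configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated groups standing in for patient feature vectors.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Run K-means for several candidate K values and record the inertia, i.e. the
# within-cluster sum of squared distances (the objective of Equation (1)).
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# The "elbow" is where adding one more cluster stops reducing inertia sharply.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 8)}
print(inertias)
```

On data like this, the drop in inertia from K = 2 to K = 3 is far larger than any later drop, which is exactly the elbow the method looks for.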
Hierarchical clustering (HC) constructs a hierarchy of clusters by iteratively merging or
dividing clusters based on a distance metric. This method provides a visual representation
of the data structure through dendrogram plots. There are two main types of hierarchical
clustering: agglomerative and divisive. In our study, we employed the agglomerative
clustering approach because our data samples were generated from patient records. This
method begins with each sample being treated as an individual cluster and then progres-
sively merges clusters that are close in proximity until a certain termination condition is
met. For hierarchical clustering, three essential elements, the similarity distance, merging
rules, and termination conditions, need to be considered [32]. The hierarchical clustering
process is irreversible, and due to its consideration of each individual data point, it can be
computationally time-consuming.
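A minimal sketch of the agglomerative variant described above, assuming scikit-learn's AgglomerativeClustering with Ward linkage and a fixed cluster count as the termination condition; both choices are illustrative, not the study's settings.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy samples standing in for patient records.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.7, random_state=0)

# Each sample starts as its own cluster; Ward linkage repeatedly merges the
# pair of clusters whose merge least increases within-cluster variance,
# stopping once n_clusters remain.
hc = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = hc.fit_predict(X)
print(sorted(set(labels)))
```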
Developed in the 1980s by Hinton and the PDP group [33], the autoencoder is an artifi-
cial neural network with an input layer, a hidden layer, and an output layer. The main pur-
pose of the autoencoder is to perform representation learning on the input data and make
the output and input have the same meaning. Autoencoders have been widely used in fea-
ture extraction [29,34–36]. An m-dimensional dataset is considered as X = {X1 , X2 , . . . , Xm }.
The compressed data features are generated by the encoder E, and following that, the
output X∗ is generated by the decoder D, which can be expressed by Equation (3):

X∗ = D(E(x)) (3)

The training goal of the autoencoder is to minimize the error. The loss function can be
expressed as Equation (4):

Loss function(X, X∗ ) = (X − X∗ )2 (4)

After establishing the autoencoder model, the K-means method is then used because
the autoencoder is not a clustering tool [35].
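A toy sketch of this autoencoder-then-K-means (AEKM) idea: a one-hidden-layer linear autoencoder is trained by gradient descent on the reconstruction loss of Equation (4), and K-means then clusters the compressed codes. The dimensions, learning rate, and data below are illustrative assumptions, not the architecture used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize the features

d, h = X.shape[1], 2                          # input dimension m, code dimension
W_enc = rng.normal(0.0, 0.1, (d, h))          # encoder weights (E)
W_dec = rng.normal(0.0, 0.1, (h, d))          # decoder weights (D)

lr, losses = 0.01, []
for _ in range(1000):
    Z = X @ W_enc                             # code: E(X)
    X_star = Z @ W_dec                        # reconstruction: X* = D(E(X))
    err = X_star - X
    losses.append(float((err ** 2).mean()))   # loss of Equation (4)
    # gradient-descent updates of the mean squared reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ err @ W_dec.T / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# The autoencoder itself does not cluster, so K-means is run on the codes.
codes = X @ W_enc
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codes)
print(losses[0], losses[-1], len(set(labels)))
```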
The self-organizing map (SOM) [37] is a method consisting of a two-dimensional
grid used for mapping input data. During the training process, the SOM forms an elastic
grid to envelop the distribution of input data, mapping adjacent input data to nearby
grid units. SOM training is an iterative process that adjusts the positions of grid units by
computing distances and finding the Best-Matching Unit (BMU) with prototype vectors.
Furthermore, the SOM’s computational complexity scales linearly with the number of
data samples, making it memory-efficient, but scales quadratically with the number of
map units. Training large maps can be time-consuming, although it can be expedited with
specialized techniques. Apart from the SOM, alternative variants are available, though they
may require more complex visualization methods. In summary, the SOM is an effective
approach for processing large datasets while preserving the topological characteristics of
the input space [38].
SOM training is conducted iteratively. In each training step, a sample vector is randomly chosen from the input dataset. Distances between this sample vector and all prototype vectors are computed. The Best-Matching Unit (BMU) is the map unit whose prototype vector is closest to the sample vector. Subsequently, the prototype vectors are updated: the BMU and its topological neighbors are adjusted toward the sample vector in the input space. The update rule for the prototype vector of unit i is expressed in Equation (5):

vi(t + 1) = vi(t) + α(t) · hij(t) · [x(t) − vi(t)]   (5)

where
vi(t + 1) is the updated prototype vector for unit i at time t + 1;
vi(t) is the current prototype vector for unit i at time t;
x(t) is the sample vector randomly chosen at time t;
α(t) is the adaptation coefficient at time t;
hij(t) is the neighborhood kernel centered on the winning unit at time t.
The SOM is commonly used for dimensionality reduction and data visualization; it
maps high-dimensional data into two- or three-dimensional spaces, providing a significant
advantage when dealing with complex data. In this study, we leveraged the strengths of
both models by first mapping the data into a two-dimensional representation through a
SOM and then performing clustering using K-means.
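The SOM-then-K-means (SOMKM) combination can be sketched by hand; the small grid, decay schedules, Gaussian neighborhood kernel, and toy data below are illustrative assumptions implementing the update rule of Equation (5), not the study's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(1)
X, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=1)

# A 6x6 map: grid holds each unit's 2-D coordinate, V its prototype vector v_i,
# initialized from randomly chosen data samples plus a little noise.
rows, cols = 6, 6
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
V = X[rng.integers(0, len(X), rows * cols)] \
    + rng.normal(0.0, 0.1, (rows * cols, X.shape[1]))

T = 2000
for t in range(T):
    x = X[rng.integers(len(X))]                 # random sample vector x(t)
    bmu = int(np.argmin(((V - x) ** 2).sum(axis=1)))  # Best-Matching Unit
    alpha = 0.5 * (1 - t / T)                   # decaying adaptation coefficient
    sigma = 3.0 * (1 - t / T) + 0.5             # shrinking neighborhood radius
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)  # squared grid distance to BMU
    h = np.exp(-d2 / (2 * sigma ** 2))          # neighborhood kernel h_ij(t)
    V += alpha * h[:, None] * (x - V)           # the update rule of Equation (5)

# Map every sample to its BMU's 2-D grid coordinate, then cluster with K-means.
bmus = np.array([grid[int(np.argmin(((V - x) ** 2).sum(axis=1)))] for x in X])
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(bmus)
print(bmus.shape, len(set(labels)))
```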

3. The Proposed SHCM System


After reviewing the clustering techniques and medical data analysis, the proposed
smart healthcare system based on clustering techniques and medical data (SHCM) is intro-
duced in this section. Figure 1 depicts the structure and procedures of the designed SHCM
system. The SHCM contains four parts: data preprocessing, clustering, performance evalu-
ation, and result analysis. The data were collected from the outpatient clinic database of
Puli Christian Hospital. Then, the data preprocessing process was conducted. Sequentially,
four clustering techniques were employed to perform grouping tasks. Three measurements,
namely, the Calinski–Harabasz index, the Davies–Bouldin Index, and the Silhouette Co-
efficient, were utilized to evaluate the performance of the clustering techniques. Based
on three measurements, K-means can mostly generate better results than the other three
clustering approaches in 12 months. Therefore, the clustering results provided by the
K-means methods were used to observe the grouping data and discuss them with medical
doctors. Finally, similar and different discoveries investigated by applying evidence-based
medicine could be identified and provided for further use in personalized health education
and healthcare. In addition, the utilization of medical resources and the evaluation of
medical expenses could possibly be improved.

3.1. Data and Data Preprocessing


The data were collected from the outpatient clinic database of Puli Christian Hospital
and included structured data and unstructured data from patient consultation information.
Because of the diversity of data formats and data structures, data preprocessing procedures
need to be conducted first. After the data preprocessing procedure, a total of 63,151 records
in this study contained patients who visited the outpatient clinic from 1 January to 31
December 2022. Four major attributes used in clustering model experiments include
demographic information, drug items, doctors’ advice, and exam items. Figure 2 illustrates
the data preprocessing steps with five stages: data merging, data conversion, data cleaning,
data selection, and data normalization.
Figure 1. The proposed smart healthcare system based on clustering techniques and medical data (SHCM).

The raw data collected were presented in four major categories: gender and age, doctors’ advice, drug descriptions, and exam items. The merged data included structured and unstructured data. In order to achieve numerical values that can be recognized by the clustering model, unstructured data such as text and symbols were converted into numerical forms. Table 2 shows the conversion methods according to the attributes, and following that are the details. Categorical data included gender and some of the exam items. Doctors’ advice, drug descriptions, and some contents of exam items were expressed as text. Thus, the bidirectional encoder representation transformer (BERT) was used to convert text into vectors, and a principal component analysis (PCA) was employed to reduce the high dimensions of the converted results into 10 dimensions. Age and most exam items were represented by numerical forms. Finally, normalization was performed. The MinMaxScaler was used for data normalization in this study and is represented by Equation (6). Table 3 shows the number of attributes and patient visits in 12 months.
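As a concrete illustration, the normalization of Equation (6) matches scikit-learn's MinMaxScaler; the small matrix below is invented for illustration, not hospital data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy feature columns (e.g. an age and a labeled exam value).
X = np.array([[20.0, 1.0],
              [45.0, 3.0],
              [70.0, 5.0]])

X_scaled = MinMaxScaler().fit_transform(X)   # library implementation
# Equation (6) computed by hand, column by column, for comparison.
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```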

Figure 2. Data preprocessing steps: (1) data merging (merge data from files in terms of demographic information, doctor's orders, drug codes, and exam items); (2) data conversion (demographic information to gender and age; doctors' advice and drug descriptions to data vectors; exam results to numerical data by labeling); (3) data cleaning (remove duplicate values and null values); (4) data selection (gender, age, doctors' advice, drug items, and exam items); (5) data normalization (MinMaxScaler).
X_MinMaxScaler = (X − X_min) / (X_max − X_min)   (6)

where X_MinMaxScaler is the normalized feature, and X_max and X_min are the maximum and minimum values of the feature X.

Table 2. Conversion methods for attributes.

Variables | Attributes | Conversion Methods
X1 | Gender | Labeling
X2 | Age | From birthdays to ages
X3~X12 | Drug items | BERT and PCA
X13~X22 | Doctors’ advice | BERT and PCA
X23~Xn | Exam items | Labeling

Table 3. Numbers of attributes and visits of patients in 12 months.

Datasets             Jan.  Feb.  Mar.  Apr.  May   Jun.
Number of attributes 456   394   418   447   417   416
Visits of patients   4765  4405  5667  5410  7593  5397

Datasets             Jul.  Aug.  Sep.  Oct.  Nov.  Dec.
Number of attributes 455   447   445   404   444   470
Visits of patients   5136  5341  5023  5233  4675  4506

3.2. Performance Measurements

Three measurements, the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC), were used in this study to evaluate the performance of the clustering techniques. The Calinski–Harabasz index (CH) [39] assesses the concentration of data in the clustering results by calculating the ratio of the sum of squared distances between clusters (BGSS) to the within-cluster sum of squares (WGSS). It is one of the commonly used evaluation metrics in K-means and hierarchical clustering. The CH index calculation formula, under the assumption of N data points divided into K clusters, is shown in Equation (7); with this calculation method, the larger the value, the better [40]:

CH = (BGSS / WGSS) × ((N − K) / (K − 1))   (7)

The DB index [41] evaluates the clustering results by considering both the similarity and separation between different clusters. It calculates the similarity (Ci) between two clusters based on the distances between data points within each cluster. It then identifies
the cluster (Cj ) with the highest similarity and divides it by the cluster’s dispersion (S),
which is computed by averaging the distances between data points within that cluster, and
can be expressed by Equation (8). The DB index is established by averaging these cluster
similarities across all clusters, and a smaller DB index value indicates better clustering
results [42].

DB = (1/k) Σ_{i=1}^{k} max_{j≠i} ((Ci + Cj) / Sij)   (8)

The Silhouette Coefficient (SC) [43] considers the similarity of each data point to others
within its cluster (A) and the dissimilarity to other clusters (B). The within-cluster similarity
(A) measures the distance between the data point and other data points within the same
cluster. The between-cluster dissimilarity (B) measures the distance between the data point
and data points in other clusters. The formula for calculating the Silhouette Coefficient is
in Equation (9):

SC = (B − A) / max(A, B)   (9)
The SC values range from −1 to 1, where a value close to 1 indicates that data points are very similar to others within their assigned cluster and dissimilar to data points in other clusters, while a value close to −1 suggests that data points are more likely to have been assigned to the wrong cluster [44].
The clustering and storing of data in this study were implemented in the Anaconda environment based on the Python programming language and the scikit-learn library. In
the next section, numerical data are employed to demonstrate the performance of the
SHCM system. Then, numerical results generated by the SHCM system are observed and
analyzed, and conclusions are drawn.
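All three indices are available directly in scikit-learn, which is a reasonable way to reproduce this evaluation step; the toy data below stands in for the clustered medical records.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

ch = calinski_harabasz_score(X, labels)  # Equation (7): larger is better
db = davies_bouldin_score(X, labels)     # Equation (8): smaller is better
sc = silhouette_score(X, labels)         # Equation (9): closer to 1 is better
print(ch, db, sc)
```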

4. Numerical Results
4.1. Clustering Performance with Three Measurements
Table 4 indicates the cluster numbers obtained by the four clustering methods in 12 months. Tables 5–7 list the three measurements used in this study to evaluate the performance of the clustering techniques: the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC). The number of suitable clusters falls between 3 and 5, and three clusters appeared the most frequently. Table 5 illustrates the CH indexes of the four clustering approaches; a larger CH value means a better clustering result. Table 6 shows the DB indicators of the different clustering methods; a smaller DB value implies a better clustering result. Table 7 depicts the SC values; an SC value close to 1 means a better clustering result. In summary, the K-means method is mostly superior to the other clustering methods for the 12 months of data. It has been pointed out that the K-means approach can provide quite satisfactory results compared to other clustering methods [15,30]. Harada et al. [45] reported that the advantages of K-means are its simple principle and high flexibility. Therefore, the clusters generated by K-means were used to illustrate the clustering results for this study.

Table 4. The numbers of clusters using four methods from January to December.

Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 4 4 3 3 5 3 3 3 3 3 3 3
AEKM 4 4 3 3 5 3 3 3 3 3 3 3
SOMKM 4 4 3 3 4 3 4 3 4 4 3 3
HC 4 4 3 3 3 3 3 3 3 3 3 4

Table 5. The clustering performance in terms of the Calinski–Harabasz index.

Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 1310.21 1431.42 1988.48 2107.83 1810.97 1892.51 1914.83 2082.41 1682.57 1731.59 1590.45 1323.84
AEKM 257.73 268.64 217.65 399.86 319.53 653.52 709.87 900.43 347.86 336.36 384.98 163.36
SOMKM 1281.33 1423.37 1988.48 1012.57 1377.30 1888.91 1432.10 1841.56 1391.69 1416.60 1590.45 1233.08
HC 1264.75 1368.76 1883.29 2008.26 2060.34 1847.02 1805.66 2035.00 1632.01 1690.22 1512.80 1112.01

Table 6. The clustering performance in terms of the Davies–Bouldin index.

Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 1.43 1.28 1.43 1.25 1.6 1.31 1.27 1.28 1.44 1.5 1.35 1.58
AEKM 4.59 3.68 5.55 6 5.33 4.02 3.12 3.74 3.58 4.34 3.11 6.18
SOMKM 1.42 1.29 1.43 2.19 1.96 1.31 1.91 1.5 1.57 1.67 1.35 1.7
HC 1.45 1.31 1.43 1.27 1.49 1.32 1.34 1.28 1.43 1.52 1.38 1.53

Table 7. The clustering performance in terms of the Silhouette Coefficient.

Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 0.27 0.29 0.27 0.32 0.24 0.31 0.31 0.32 0.27 0.25 0.29 0.23
AEKM 0.01 0.03 0.02 0.07 0.01 0.13 0.16 0.13 0.04 0.03 0.04 0.01
SOMKM 0.26 0.29 0.27 0.16 0.2 0.31 0.24 0.31 0.28 0.25 0.29 0.21
HC 0.27 0.29 0.25 0.31 0.26 0.3 0.3 0.32 0.26 0.25 0.29 0.25

4.2. Preliminary Analysis of ICD-10-CM Codes between Indigenous Patients and Non-Indigenous Patients
This study focuses on the analysis of disease codes among patients at Puli Christian
Hospital. Given the hospital’s service to many indigenous populations, patients were ini-
tially categorized into indigenous and non-indigenous groups. Subsequently, we collected
and analyzed the International Classification of Diseases (ICD-10-CM) codes assigned by
doctors and recorded the top ten most frequent disease codes each month for both groups.
Table A1 in Appendix A depicts ICD-10-CM codes and the corresponding diseases. Two
main trends were observed. Firstly, type 2 diabetes (E11) consistently ranked within the
top three for both indigenous and non-indigenous groups. This highlights a substantial
demand for medical care relating to metabolic diseases in the Puli regions. Thus, various
complications associated with diabetes underline the need for an in-depth understanding
of the medical requirements at different stages. Secondly, exposure to communicable
diseases (Z20) was initially not prominent in the rankings but escalated to the top position
in June. Notably, 2022 was a pandemic year for COVID-19 in Taiwan, and comparing this
trend with Taiwan's COVID-19 statistics reveals a similar pattern. These two disease codes represent
chronic and infectious diseases, respectively, and indicate a diversity and complexity of
health issues.
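
The monthly top-ten tally described above is not published as code. Under assumed column names ("month", "group", "icd10"; the real EMR schema is not disclosed), it could be sketched in pandas as follows.

```python
# Count ICD-10-CM codes per month for each patient group and keep the
# ten most frequent. Toy stand-in data, not the hospital records.
import pandas as pd

visits = pd.DataFrame({
    "month": ["Jan"] * 6,
    "group": ["indigenous"] * 4 + ["non-indigenous"] * 2,
    "icd10": ["E11", "E11", "E78", "I10", "E11", "Z20"],
})

counts = (visits.groupby(["month", "group", "icd10"]).size()
          .rename("visits").reset_index())
# Keep the ten most frequent codes per (month, group) pair.
top10 = (counts.sort_values("visits", ascending=False)
         .groupby(["month", "group"]).head(10))
print(top10)
```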

4.3. Analysis of ICD-10-CM after Clustering


To gain a deeper understanding of the differences in medical needs between indige-
nous and non-indigenous patients, this study employed clustering techniques to stratify
the patient population. This stratification facilitated an in-depth exploration of the predom-
inant health conditions within each cluster. Additionally, to compare the primary diseases
between indigenous and non-indigenous groups, the criteria for listing major diseases
included not only the frequency of disease codes but also proportional representations
within each cluster. Disease codes were ranked based on occurrences, and the top ten
diseases were selected for further analysis. The proportions of these top ten codes within
each cluster were calculated and are listed in Tables 8–15.
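
Reading Tables 8–15, the "Group %" columns appear to report the share of a code's total visits that falls in each cluster (e.g., 87/178 ≈ 49% for E11 in cluster 1 of Table 8), while the "Disease %" column is the code's share of all visits. A sketch of that computation, under assumed column names:

```python
# Reproduce the two kinds of proportions in Tables 8-15 from a toy
# visit table: one row per visit, with assigned cluster and ICD code.
import pandas as pd

visits = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "icd10":   ["E11", "E11", "E78", "E11", "I10",
                "E11", "E11", "Z20", "Z20", "Z20"],
})

per_cluster = (visits.groupby(["icd10", "cluster"]).size()
               .rename("visits").reset_index())
totals = visits["icd10"].value_counts()

# "Group %": share of a code's visits captured by each cluster.
per_cluster["group_pct"] = per_cluster.apply(
    lambda r: 100.0 * r["visits"] / totals[r["icd10"]], axis=1)
# "Disease %": share of all visits attributed to each code.
disease_pct = 100.0 * totals / len(visits)
print(per_cluster)
print(disease_pct)
```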

Table 8. Clustering analysis of indigenous patients in January.

Total Cluster 1 Cluster 2 Cluster 3 Cluster 4


Ranks Disease Group Group Group Group
Codes Visits Codes Visits Codes Visits Codes Visits Codes Visits
% % % % %
1 E11 9% 178 E11 87 49% E11 24 13% E11 62 35% Z20 65 94%
2 E78 5% 112 I10 45 50% E78 17 15% E78 51 46% J18 38 46%
3 I10 4% 90 E78 44 39% N39 17 50% I11 28 35% J01 16 36%
4 J18 4% 82 I11 43 53% I10 16 18% I10 27 30% R50 14 45%
5 I11 4% 81 J18 22 27% N18 12 32% I25 21 45% E11 5 3%
6 Z20 3% 69 B18 22 55% R50 10 32% M10 20 65% J20 5 18%
7 I25 2% 47 K21 21 54% I11 9 11% B18 17 43% J45 5 17%
8 J01 2% 44 Z34 18 60% Z34 8 27% J18 16 20% R10 5 19%
9 B18 2% 40 I25 15 32% I25 7 15% J01 15 34% I25 4 9%
10 K21 2% 39 J45 14 48% R10 7 26% K21 14 36% N18 4 11%
11 N18 2% 37 N20 7 47% J30 4 17%
12 N39 2% 34 M10 6 19% A09 4 21%
13 R50 2% 31 J18 6 7% R11 4 31%
14 M10 2% 31 E86 4 50%
15 Z34 1% 30 Z34 4 13%

Table 9. Clustering analysis of indigenous patients in March.

Total Cluster 1 Cluster 2 Cluster 3


Ranks
Group Group Group
Codes Disease % Visits Codes Visits Codes Visits Codes Visits
% % %
1 E11 8% 206 E11 30 15% E11 93 45% E11 83 40%
2 E78 6% 154 R50 25 45% Z20 79 74% E78 63 41%
3 I10 4% 108 N39 23 56% E78 69 45% I10 36 33%
4 Z20 4% 107 E78 22 14% I10 55 51% I11 34 34%
5 I11 4% 99 I10 17 16% I11 53 54% Z11 33 60%
6 J18 3% 80 N18 14 25% J18 41 51% J01 31 41%
7 J01 3% 76 I11 12 12% J01 39 51% M10 30 60%
8 B18 2% 56 J18 12 15% K21 29 63% Z20 28 26%
9 R50 2% 56 R10 11 35% Z34 28 82% J18 27 34%
10 N18 2% 56 I25 9 20% B18 26 46% B18 26 46%
11 Z11 2% 55 N18 26 46%
12 M10 2% 50
13 K21 2% 46
14 I25 2% 46
15 N39 2% 41

Table 10. Clustering analysis of indigenous patients in May.

Total Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5


Ranks Disease Group Group Group Group Group
Codes % Visits Codes Visits % Codes Visits % Codes Visits % Codes Visits % Codes Visits %
1 Z20 22% 660 E11 58 35% E11 83 50% Z20 240 36% N39 23 62% Z20 346 52%
2 E11 6% 167 E78 46 43% Z20 56 8% R05 56 45% R50 21 22% R05 66 53%
3 R05 4% 125 I11 27 34% E78 49 45% J06 30 37% E11 21 13% U07 40 48%
4 E78 4% 108 B18 20 47% I10 40 61% R50 29 31% I11 12 15% J06 39 48%
5 R50 3% 94 I10 19 29% I11 38 48% U07 25 30% E78 11 10% R50 29 31%
6 U07 3% 83 K21 14 42% Z34 30 65% J18 10 19% N18 10 20% J18 9 17%
7 J06 3% 81 J44 14 61% I25 23 50% J20 10 33% Z34 9 20% J00 8 50%
8 I11 3% 79 Z20 13 2% B18 20 47% J00 8 50% N40 9 64% Z34 7 15%
9 I10 2% 66 J18 13 24% N18 20 41% J01 7 39% I25 7 15% J20 7 23%
10 J18 2% 54 N18 12 24% K21 18 55% N18 6 12% M10 7 24% R10 5 23%
11 N18 2% 49 I25 12 26% J18 16 30% J30 5 26% I10 7 11% J02 5 63%
12 Z34 2% 46 M10 12 41% E11 4 2% J18 6 11%
13 I25 2% 46 I25 4 9% N20 6 55%
14 B18 1% 43 R51 4 20% R10 6 27%
15 N39 1% 37 R11 4 27% M54 6 30%

Table 11. Clustering analysis of indigenous patients in June.

Total Cluster 1 Cluster 2 Cluster 3


Ranks Disease Group Group Group
Codes Visits Codes Visits Codes Visits Codes Visits
% % % %
1 Z20 15% 309 E11 58 40% E11 83 57% Z20 240 78%
2 E11 7% 145 E78 46 47% Z20 56 18% R05 56 95%
3 E78 5% 97 I11 27 40% E78 49 51% J06 30 75%
4 I11 3% 67 B18 20 50% I10 40 68% R50 29 66%
5 I10 3% 59 I10 19 32% I11 38 57% U07 25 64%
6 R05 3% 59 K21 14 44% Z34 30 100% J18 10 26%
7 R50 2% 44 J44 14 78% I25 23 59% J20 10 56%
8 B18 2% 40 Z20 13 4% B18 20 50% J00 8 100%
9 J06 2% 40 J18 13 33% N18 20 53% J01 7 44%
10 J18 2% 39 I25 12 31% K21 18 56% N18 6 16%
11 I25 2% 39 N18 12 32% J18 16 41% J30 5 33%
12 U07 2% 39 M10 12 55% E11 4 3%
13 N18 2% 38 Z11 11 48% I25 4 10%
14 K21 2% 32 K25 11 50% R51 4 24%
15 Z34 1% 30 R50 9 20% R11 4 40%

Table 12. Clustering analysis of non-indigenous patients in January.

Total Cluster 1 Cluster 2 Cluster 3 Cluster 4


Ranks Disease Group Group Group Group
Codes % Visits Codes Visits % Codes Visits % Codes Visits % Codes Visits %
1 E11 8% 985 E11 364 37% E11 176 18% E11 414 42% Z20 412 96%
2 E78 6% 790 E78 316 40% N39 135 73% E78 349 44% J18 50 32%
3 I11 5% 680 I11 303 45% E78 121 15% I11 286 42% E11 31 3%
4 Z20 3% 429 I10 149 38% N18 81 20% I10 169 43% N18 30 7%
5 N18 3% 407 N18 142 35% I11 80 12% N18 154 38% R50 21 24%
6 I10 3% 391 K21 109 51% N20 72 51% I25 114 48% J01 18 21%
7 I25 2% 237 B18 88 43% I10 59 15% B18 104 51% I10 14 4%
8 K21 2% 212 G47 88 55% N40 54 42% K21 89 42% R10 14 11%
9 B18 2% 204 I25 82 35% R10 44 34% M10 77 66% K92 13 21%
10 N39 1% 184 K59 68 46% R50 35 41% N40 72 55% A09 13 19%
11 G47 1% 161 I25 30 13% E86 13 24%
12 J18 1% 158 Z34 26 28%
13 K59 1% 147 R31 25 78%
14 N20 1% 140
15 N40 1% 130

Table 13. Clustering analysis of non-indigenous patients in March.

Total Cluster 1 Cluster 2 Cluster 3


Ranks Disease Group Group Group
Codes Visits Codes Visits Codes Visits Codes Visits
% % % %
1 E11 7% 953 N39 177 71% E11 406 43% E11 393 41%
2 E78 6% 827 E11 154 16% Z20 394 58% E78 351 42%
3 I11 5% 676 E78 109 13% E78 367 44% I11 284 42%
4 Z20 5% 674 N18 103 23% I11 313 46% Z20 280 42%
5 I10 3% 464 I11 79 12% I10 184 40% I10 210 45%
6 N18 3% 450 R50 75 50% N18 154 34% N18 193 43%
7 I25 2% 283 I10 70 15% G47 111 60% I25 156 55%
8 N39 2% 251 R10 58 43% K21 104 49% J44 106 70%
9 K21 1% 213 N20 47 38% I25 98 35% J18 105 53%
10 J18 1% 199 N40 36 27% K59 80 53% B18 100 54%
11 B18 1% 186 M10 32 26% Z34 78 71% N40 97 73%
12 G47 1% 186 Z34 32 29% J18 78 39% K21 95 45%
13 J44 1% 152 R31 30 70% J30 76 51% M10 72 58%
14 K59 1% 151 I25 29 10% N20 66 53%
15 R50 1% 151

Table 14. Clustering analysis of non-indigenous patients in May.

Total Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5


Ranks Disease Group Group Group Group Group
Codes % Visits Codes Visits % Codes Visits % Codes Visits % Codes Visits % Codes Visits %
1 Z20 19% 2922 E11 367 41% E11 354 40% Z20 1029 35% E11 151 17% Z20 1343 46%
2 E11 6% 888 E78 311 42% Z20 325 11% R05 242 44% N39 131 72% R05 301 54%
3 E78 5% 740 I11 285 44% E78 308 42% U07 129 37% E78 111 15% U07 153 43%
4 I11 4% 648 Z20 212 7% I11 279 43% J06 120 45% N18 92 22% J06 111 42%
5 R05 4% 553 N18 171 41% I10 149 39% R50 88 31% I11 78 12% R50 75 27%
6 N18 3% 420 I10 156 41% N18 138 33% J00 38 54% R50 73 26% J00 31 44%
7 I10 2% 382 I25 109 44% K21 102 55% R11 16 29% I10 62 16% Z34 20 16%
8 U07 2% 352 N40 84 64% I25 101 41% J20 16 29% N20 48 44% J20 16 29%
9 R50 2% 281 B18 78 49% G47 87 54% E11 14 2% N40 43 33% R11 13 24%
10 J06 2% 265 K21 72 39% Z34 79 64% N18 14 3% R10 32 31% R07 11 16%
11 I25 2% 248 B18 74 46% R35 30 67% J03 10 34%
12 K21 1% 186 C50 58 91% R51 10 25%
13 N39 1% 181 K25 55 54% O47 10 77%
14 G47 1% 162 R00 54 62%
15 B18 1% 160

Table 15. Clustering analysis of non-indigenous patients in June.

Total Cluster 1 Cluster 2 Cluster 3


Ranks Disease Group Group Group
Codes Visits Codes Visits Codes Visits Codes Visits
% % % %
1 Z20 11% 1440 Z20 867 60% Z20 558 39% E11 140 17%
2 E11 6% 822 E11 342 42% E11 340 41% N39 111 66%
3 E78 5% 681 E78 294 43% E78 285 42% E78 102 15%
4 I11 5% 588 I11 259 44% I11 250 43% N18 89 22%
5 N18 3% 399 N18 147 37% I10 165 45% I11 79 13%
6 I10 3% 365 I10 145 40% N18 163 41% R50 69 29%
7 U07 2% 267 U07 124 46% I25 115 54% I10 55 15%
8 R50 2% 238 Z34 102 83% U07 106 40% U07 37 14%
9 I25 2% 214 I25 80 37% J44 96 69% N20 37 44%
10 N39 1% 168 G47 77 53% R50 93 39% R10 31 34%
11 B18 1% 148 J18 74 54% N40 26 27%
12 K21 1% 147 M10 71 63% R31 25 86%
13 G47 1% 145 N40 70 73%
14 J06 1% 143 B18 67 45%
15 J44 1% 140 K21 65 44%

4.4. Analysis of Major Disease Codes over 12 Months in Indigenous Groups


An overview of the data from the clustering results shows that in January, the disease
codes among indigenous patients were categorized into four main groups. The prevalent
conditions included hypertension (I10), hypertensive heart disease (I11), chronic viral
hepatitis (B18), gastroesophageal reflux disease (K21), and pregnancy-related care (Z34).
As the months progressed to February, new codes emerged, such as lipid disorders (E78),
chronic ischemic heart disease (I25), asthma (J45), and sleep disorders (G47), indicating
shifts in health issues among indigenous populations. By March, the groupings were
reduced to three, with an increased frequency of urinary system diseases (N39), suggesting
that this is an area of health concern worth further exploration. In April, the grouping
included hypertensive heart disease (I11), gastroesophageal reflux disease (K21), and
pregnancy-related care (Z34), alongside infectious diseases (Z20) and chronic viral hepatitis
(B18), hinting at possible commonalities among these conditions. In May, the number
of groups increased to five, with respiratory diseases like chronic obstructive pulmonary
disease (J44) and acute rhinitis (J00) becoming more prominent, potentially relating to
prevalent diseases at the time. From June to December, the groupings remained consistent
at three, with the recurrent appearances of codes like gout (M10) and pneumonia (J18)
often in the same group, highlighting the significance of urinary (e.g., N39) and respiratory
diseases (e.g., J18, J20).
Throughout the year, hypertension- and heart-disease-related codes (such as I10 and
I11) were almost consistently present, while infectious diseases (like Z20) and seasonal
illnesses (such as J18) showed an increase in specific months. Particular conditions like
gout (M10) and urinary system diseases (e.g., N39) were especially pronounced among the
indigenous population, suggesting possible environmental or physiological factors.

4.5. Analysis of Major Disease Codes in Non-Indigenous Groups


According to the clustering results, in January, non-indigenous patients were grouped
into four categories focused on digestive system diseases (K21), sleep disorders (G47),

urinary system diseases (N39, N20), gout (M10), chronic viral hepatitis (B18), blood dis-
orders (R31), and infectious diseases (Z20). The grouping pattern continued similarly in
February. In March and April, the number of groups was reduced to three, combining
infectious diseases (Z20), pregnancy-related care (Z34), digestive system diseases (K21), and
sleep disorders (G47), along with groups featuring gout (M10) and chronic viral hepatitis
(B18) that now included respiratory disease codes (I25, J44, J18). In April, the group with
pregnancy-related care (Z34) also showed occurrences of breast cancer (C50). By May, the
number of groups increased to five, including groups with respiratory disease codes (J00)
and respiratory symptoms (R05, O47). From June to December, the groups were consis-
tently divided into three: a group focusing on infectious diseases (Z20), pregnancy-related
care (Z34), and sleep disorders (G47); a group with urinary system diseases (N39), often
accompanied by hematuria (R31) and kidney stones (N20); and a group dominated by gout
(M10) and chronic viral hepatitis (B18), frequently associated with respiratory diseases (J44,
J18) and chronic ischemic heart disease (I25).
Observations throughout the year show that urinary diseases (N class) and infectious
diseases (Z20) were almost constantly present, along with the consistent appearance of
respiratory diseases (J class), indicating these as the main health concerns affecting the
non-indigenous population. The presence of chronic ischemic heart disease (I25) may relate
to lifestyle factors in the non-indigenous community.

4.6. Comparative Analysis of Major Disease Codes among Indigenous and Non-Indigenous
Patients
Utilizing K-means clustering, which resulted in groups of three, four, and five, an
examination of the primary disease codes was conducted. Across eight months, namely,
March, April, and July to December, where the data were clustered into three groups, the
same disease codes were consistently observed each month in the indigenous population.
To facilitate a clearer observation and comparison of similar codes across different groups,
the results are documented in Table 16. Consequently, the outcomes were categorized
into three classes: class A, predominantly featuring codes I10, I11, K21, and Z34; class B,
primarily centered around code N39; and class C, focusing on code M10.
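
The class assignment described above amounts to intersecting the code sets of the same class across months (the "SAME CODE" row at the bottom of Table 16). A sketch with abbreviated class-A sets transcribed from Table 16 (not the complete table, so the result here also retains E11, which the full eight-month intersection narrows out):

```python
# Intersect the class-A code sets of several three-cluster months to
# find codes observed in class A every month. Sets are abbreviated
# excerpts from Table 16, for illustration only.
class_a_by_month = {
    "Mar": {"E11", "I10", "I11", "J18", "J01", "K21", "Z34"},
    "Jul": {"Z20", "E11", "E78", "I11", "I10", "Z34", "K21", "B18"},
    "Dec": {"E11", "I10", "I11", "K21", "P59", "J45", "Z34", "J20", "J30"},
}

# Codes present in class A in every listed month.
same_codes = set.intersection(*class_a_by_month.values())
print(sorted(same_codes))
```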

Table 16. Clusters of diseases.

Clusters of Indigenous Patients Clusters of Non-Indigenous Patients


Months
Class A Class B Class C Class D Class E Class A Class B Class C Class D Class E
Jan. I10, I11, N39 M10 Z20, E86 K21, G47 N39, N20, B18, M10, Z20
B18, K21, R31 N40
Z34
Feb. E78, Z34, N39, N20 M10, B18, Z20, R07, G47, Z34 N39 B18, N40, Z20
I25, K21, Z11 L08, B08 M10
J45, G47,
M19
Mar. E11, I10, N39 Z11, M10 Z20, G47, N39, R50, I25, J44,
I11, J18, K59, Z34, R31 J18, B18,
J01, K21, J30 N40, M10,
Z34 N20
Apr. Z20, I11, N39, N20 J01, N18, Z20, K21, N39, R35 I25, B18,
K21, Z34, Z11 G47, Z34, J44, J18,
B18 C50 M10, N40,
K25, I20
May E11, E78, N39, Z34, E78, B18, Z20, R05, J00, R05 E78, I11, N39, N20, E11, E78, Z20, R05, J00, R05,
I10, I11, N20 K21, J44, U07, J06, K21, I25, R35 I11, N18, U07, J06, J06
Z34, I25, M10 J00, J02 G47, Z34, I10, I25, J00, O47
B18, N18, B18, C50, N40, B18
K21 K25, R00
Jun. E11, E78, B18, J44, Z20, R05, Z20, Z34, N39, R31 I25, J44,
I10, I11, M10, K25 J06, R50, G47 J18, M10,
Z34, I25, U07, J20, N40
B18, N18, J00
K21
Jul. Z20, E11, N20 M10 Z20, Z34, N39 I25, B18,
E78, I11, K21, G47, J44, N40,
I10, Z34, K59, R42, J18, M10
K21, B18 R00
Aug. Z20, E11, N39 J18, I25, Z20, K21, N39, R31 I25, B18,
E78, I10, M10, J44 Z34 J44, N40,
I11, Z34, M10, J18
K21, J01,
B18, J20,
K59
Sep. Z20, E78, N39 J18, M10 Z20, Z34, N39, R31 I25, B18,
I11, B18, K21, K59, J19, N40,
K21, Z34 G47, R42 J44, M10
Oct. E11, Z20, N39, R10, R50, J18, Z20, Z34, N39, N20, R50, N18,
I11, K21, N13 M10, N18, G47 R31 J18, Z20,
Z34, J45 J20, J01, I25, B18,
J06 U07, J20,
M10, N40
Nov. I10, Z34, N39 J18, M10, Z34, K21, N39, R31, J18, I25,
J12, Z20, J45, J20 R42, Z20, R80 J44, B18,
P59 G47 N40
Dec. E11, I10, N39, R10, J18, M10 Z34, G47, N39, R31, I25, J18,
I11, K21, N20, J03 K59, C50, R35 J44, M10,
P59, J45, R42 N40, I20
Z34, J20,
J30
SAME I10, I11, N39 M10 Z20 J00 K21, G47 N39 M10, N40 Z20 J00
CODE K21, Z34

In the non-indigenous groups, similar patterns were observed. In class A, the codes
included K21 and G47, sharing K21 with the indigenous group but adding G47 while
lacking I10, I11, and Z34. Both indigenous and non-indigenous groups had N39 as the
main code in class B. In class C, both groups shared the M10 code, but an additional N40
code was observed in the non-indigenous group. During the same period, when the data
were categorized into three groups, a unique situation was noted in June for the indigenous
population. They had groups belonging to class A and class C, but not class B. Instead,
there was an additional group, classified as class D, characterized by Z20 as the primary
code, accompanied by respiratory diseases (U07, J00, J02, J06, R05). In contrast, the non-
indigenous population continued to be categorized under the original classes A, B, and C.
This variance in June, particularly considering the specific impact of COVID-19 (U07) in
2022, suggests that the indigenous population was more significantly affected during this
month, leading to a different clustering trend. After classifying the main diseases based
on their similarities, it becomes easier to observe the differing trends between infectious
and chronic diseases over the 12 months. Consequently, the subsequent analysis focuses on
infectious diseases and chronic diseases.

4.6.1. Impacts of Infectious Diseases


In the cluster analysis conducted in January and February, when the data were seg-
mented into four groups, both indigenous and non-indigenous populations exhibited
grouping patterns similar to the three groups observed in other months. This included
class A, predominantly characterized by gastroesophageal reflux disease (K21), class B,

centered around urinary system infections (N39), and class C, led by gout (M10). However,
a category similar to that observed in June for the indigenous population emerged: class
D. Unlike in June, the frequency of respiratory disease codes was lower in January and
February, with pneumonia (J18) notably being the second most common condition. Given
the outbreak of COVID-19 in Taiwan during these months, the appearance of class D could
serve as an early warning signal of the epidemic.
In the analysis of May, considering the greater number of assigned groups due to
potential overlapping diseases in patients, primary diseases were selected based on a
threshold of 40%. In addition to the similar classes A, B, C, and D, an additional class E
was identified, which is primarily associated with acute upper respiratory infections (J00).
In this group, the specific COVID-19 disease code (U07) exceeded 43% in both indigenous
and non-indigenous populations, aligning with the surge in COVID-19 cases in Taiwan
in May. This finding indicates the emergence of a new group of patients seeking medical
assistance due to the epidemic, in addition to regular medical patients. Among groups in
May, 48% of the indigenous population had the U07 code, compared to 43% in the non-
indigenous population. The disparity in the proportion of disease codes suggests a greater
impact on indigenous communities. Considering lifestyle factors, indigenous communities,
often residing in closely knit tribes, have a higher interaction frequency compared to
non-indigenous populations. In addition, disease information and preventive measures
tend to reach indigenous communities later than urban areas.

4.6.2. Impacts of Type 2 Diabetes


In the analysis of major diseases, type 2 diabetes (code E11), initially ranked in the
top three by patient frequency, did not consistently feature as a primary disease when the
concept of proportionality was adopted. Unlike infectious diseases, which were frequently
observed in more than four groups in January, February, and May, chronic diseases were
consistently presented in all monthly groupings. This illustrates that after the grouping,
each cluster maintained a certain proportion of chronic disease codes, which were differentiated
by the accompanying comorbidities. Additionally, since chronic diseases are closely
linked to time and disease progression, age was factored into the analysis. A comparison
was made across the groups in terms of age, which showed that each group appeared to
have a lag of ten years. Figures 3–5 illustrate the observation of the occurrences in March
as an example.

Figure 3. The age distribution of patients with E11 disease code of class A in March.

Figure 4. The age distribution of patients with E11 disease code of class B in March.

Figure 5. The age distribution of patients with E11 disease code of class C in March.

The bar chart displays the number of individuals with the E11 code in terms of age, and
the line graph represents cumulative cases. Observing the cumulative cases, it is evident
that the proportion of the E11 code in the indigenous population is mostly higher than in
the non-indigenous groups. The bar chart reveals similar trends in both groups, but with
different age brackets. The trend for the non-indigenous group appears a decade later than
that for the indigenous group, indicating an earlier onset of E11-related health impacts in
the indigenous population.
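
The figures bin patients into 10-year age groups (bars) and overlay the cumulative share of cases (line). A numpy sketch of that computation, using hypothetical ages since the per-class age data behind Figures 3–5 are not published:

```python
# Bin hypothetical patient ages into 10-year brackets and compute the
# cumulative percentage of cases, as plotted in Figures 3-5.
import numpy as np

ages = np.array([34, 41, 45, 52, 55, 58, 61, 63, 67, 72, 78, 85])

bins = np.arange(0, 101, 10)  # 10-year brackets: 0-9, 10-19, ..., 90-99
counts, _ = np.histogram(ages, bins=bins)
cumulative_pct = 100 * np.cumsum(counts) / counts.sum()

for lo, n, c in zip(bins[:-1], counts, cumulative_pct):
    if n:
        print(f"{lo}-{lo + 9}: {n} patients, cumulative {c:.0f}%")
```

Comparing where the cumulative curve rises for each class is what reveals the roughly ten-year lag between the indigenous and non-indigenous groups.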

4.6.3. Impacts of Essential Hypertension and Hypertensive Heart Diseases


Observations from the analysis of major diseases reveal distinct patterns within chronic
heart diseases of the I class. Among the indigenous population, the prevalent codes include
I10 (essential hypertension) and I11 (hypertensive heart diseases), predominantly occurring
in disease groups associated with K21 (gastroesophageal reflux disease). In contrast, the
non-indigenous population predominantly showed the presence of I25 (chronic ischemic
heart disease). The prevalence of the I10 code, which is linked to genetic factors, indicates
a heightened need for preventive measures against heart disease risks in the indigenous
population.
Upon categorizing the clustering results based on the main diseases, distinct trends
were observed between chronic and infectious diseases. Chronic diseases displayed a more
consistent distribution across all 12 months. However, infectious diseases exhibited greater
variability. A month-by-month analysis was performed to track the dynamics. Therefore,
using clustering methods to differentiate between chronic diseases and infectious diseases

provides a clearer outline of regional disease patterns and healthcare needs. This approach
allows for a better understanding of medical requirements in different areas and facilitates
the provision of appropriate medical assistance tailored to the needs of specific subgroups.
Additionally, in the context of epidemic prevention and control, this method can enable
early prevention and management based on local infection trends.

5. Conclusions
This study employed clustering techniques to group and then analyze diseases in the
indigenous population and the non-indigenous population. K-means clustering obtained
better results than the other three clustering techniques in terms of three measurements.
The developed model can learn distances between clusters and further investigate relations
among diseases in patients through the features of clusters. Although the medical condi-
tions of patient groups vary each month, a consistent clustering trend is observed overall.
This trend is particularly pronounced in cases where the primary disease is the same,
indicating a higher probability of certain disease codes appearing together. This result lays
the foundation for a deeper exploration of potential correlations between different diseases.
From the perspectives of chronic diseases and bacterial or viral infections, we noted
distinct clustering behaviors between the two. The chronic disease group exhibited consis-
tency in the monthly analyses, while the clustering of bacterial or viral infections showed a
close correlation with the stages of epidemic development. This was particularly evident in
the context of the 2022 epidemic trends in Taiwan, where changes in the number of clusters
were highly correlated with different stages of the epidemic.
The unsupervised clustering method helps in identifying correlations in complex and
varied data that are not readily observable. However, the diversity of the data, along with
the varying medical needs of patients at the time of consultation, poses challenges in data
preprocessing. Moreover, challenges arise due to the large data scales, diversities, and
complexity. Both structured and unstructured data are included in the dataset. In addition
to medical treatment, there may be return visits or patients with chronic diseases who
only receive medicine without treatment. Thus, the presentation of data should not only
consider the identity of the patient but also consider the timeliness of the patient’s visits.
Only data from 2022 were employed in this study. Data gathered in other years could
be used to examine the feasibility of the proposed SHCM system. In addition, only data
collected from Puli Christian Hospital served as input to the SHCM system. Data collected
from other hospitals could be utilized to investigate the generalization ability of the
developed system. Finally, deep clustering techniques could be employed to handle the
clustering tasks of the presented system.

Author Contributions: Conceptualization, P.-F.P., W.-C.Y. and H.-P.H.; methodology, J.-P.L. and
P.-F.P.; software, J.-P.L., Y.-H.L. and Y.-L.L.; formal analysis, Y.-H.L.; writing—original draft prepa-
ration, Y.-H.L., Y.-L.L. and P.-F.P.; writing—review and editing, P.-F.P.; visualization, Y.-H.L. and
Y.-L.L.; supervision, P.-F.P. and W.-C.Y. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was supported by funding from Puli Christian Hospital/Chi Nan National
University Joint Research Program under grant number 112-PuChi-AIR-001.
Institutional Review Board Statement: Ethical review and approval were waived for this study, due
to the use of a database with data aggregated by age (10-year age-groups) and diagnosis categories.
Informed Consent Statement: Informed consent was not required as cohort members were unidentifiable.
Data Availability Statement: The data presented in this study are available on reasonable request
from the corresponding author. The data are not publicly available due to privacy restrictions.
Acknowledgments: This work was supported by Kai Yen, Bing-Cheng Chiu, and Yan-Song Chang,
who assisted in data analysis.
Conflicts of Interest: The authors declare no conflicts of interest.

Appendix A
In this study, the analysis of electronic medical record (EMR) data involved labeling
patients’ diseases according to the International Statistical Classification of Diseases and Related
Health Problems, 10th Revision (ICD-10 CM). The ICD-10, established by the World Health
Organization (WHO), categorizes diseases based on their characteristics and represents
them using a coding system. This system is crucial for accurately and systematically
recording cases, and it plays a significant role in clinical diagnosis, epidemiological research,
health management, and data collection. This appendix only includes disease codes directly
relevant to our study. These codes represent specific disease types involved in our research
and are intended to help readers better understand the scope and focus of our study. Each
code is accompanied by a brief description of the disease, making it accessible to readers
who are not specialists in the field.

Table A1. The ICD-10 CM and brief descriptions.

ICD-10 CM Diseases
A09 Infectious gastroenteritis and colitis, unspecified
B08 Other viral infections characterized by skin and mucous membrane lesions, not elsewhere classified
B18 Chronic viral hepatitis
C50 Malignant neoplasm of breast
D64 Other anemias
E11 Type 2 diabetes mellitus
E78 Disorders of lipoprotein metabolism and other lipidemias
E86 Volume depletion
E87 Other disorders of fluid, electrolyte and acid-base balance
G47 Sleep disorders
I10 Essential (primary) hypertension
I11 Hypertensive heart disease
I20 Angina pectoris
I25 Chronic ischemic heart disease
I50 Heart failure
J00 Acute nasopharyngitis [common cold]
J01 Acute sinusitis
J02 Acute pharyngitis
J03 Acute tonsillitis
J06 Acute upper respiratory infections of multiple and unspecified sites
J12 Viral pneumonia, not elsewhere classified
J18 Pneumonia, unspecified organism
J20 Acute bronchitis
J30 Vasomotor and allergic rhinitis
J44 Other chronic obstructive pulmonary disease
J45 Asthma
K21 Gastroesophageal reflux disease
K25 Gastric ulcer
K29 Gastritis and duodenitis
K59 Other functional intestinal disorders
K92 Other diseases of digestive system
L03 Cellulitis and acute lymphangitis
L08 Other local infections of skin and subcutaneous tissue
M10 Gout
M19 Other and unspecified osteoarthritis
M54 Dorsalgia
N13 Obstructive and reflux uropathy
N18 Chronic kidney disease (CKD)
N20 Calculus of kidney and ureter
N39 Other disorders of urinary system
N40 Benign prostatic hyperplasia
O47 False labor
P59 Neonatal jaundice from other and unspecified causes
R00 Abnormalities of heart beat
R05 Cough
R07 Pain in throat and chest
R10 Abdominal and pelvic pain
R11 Nausea and vomiting
R31 Hematuria
R35 Polyuria
R42 Dizziness and giddiness
R50 Fever of other and unknown origin
R51 Headache
R80 Proteinuria
U07 Emergency use of U07 (COVID-19 is coded U07.1)
Z11 Encounter for screening for infectious and parasitic diseases
Z20 Contact with and (suspected) exposure to communicable diseases
Z34 Encounter for supervision of normal pregnancy

Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 183–188.
12. Farouk, Y.; Rady, S. Early diagnosis of Alzheimer’s disease using unsupervised clustering. Int. J. Intell. Comput. Inf. Sci. 2020, 20,
112–124. [CrossRef]
13. Hassan, M.M.; Mollick, S.; Yasmin, F. An unsupervised cluster-based feature grouping model for early diabetes detection. Healthc.
Anal. 2022, 2, 100112. [CrossRef]
14. Antony, L.; Azam, S.; Ignatious, E.; Quadir, R.; Beeravolu, A.R.; Jonkman, M.; De Boer, F. A comprehensive unsupervised
framework for chronic kidney disease prediction. IEEE Access 2021, 9, 126481–126501. [CrossRef]
15. Enireddy, V.; Anitha, R.; Vallinayagam, S.; Maridurai, T.; Sathish, T.; Balakrishnan, E. Prediction of human diseases using
optimized clustering techniques. Mater. Today Proc. 2021, 46, 4258–4264. [CrossRef]
16. Arora, N.; Singh, A.; Al-Dabagh, M.Z.N.; Maitra, S.K. A novel architecture for diabetes patients’ prediction using k-means
clustering and SVM. Math. Probl. Eng. 2022, 2022, 4815521. [CrossRef]
17. Parikh, H.M.; Remedios, C.L.; Hampe, C.S.; Balasubramanyam, A.; Fisher-Hoch, S.P.; Choi, Y.J.; Patel, S.; McCormick, J.B.;
Redondo, M.J.; Krischer, J.P. Data mining framework for discovering and clustering phenotypes of atypical diabetes. J. Clin.
Endocrinol. Metab. 2023, 108, 834–846. [CrossRef] [PubMed]
18. Jasinska-Piadlo, A.; Bond, R.; Biglarbeigi, P.; Brisk, R.; Campbell, P.; Browne, F.; McEneaney, D. Data-driven versus a domain-led
approach to k-means clustering on an open heart failure dataset. Int. J. Data Sci. Anal. 2023, 15, 49–66. [CrossRef]
19. Mpanya, D.; Celik, T.; Klug, E.; Ntsinjana, H. Clustering of heart failure phenotypes in Johannesburg using unsupervised machine
learning. Appl. Sci. 2023, 13, 1509. [CrossRef]
20. Florensa, D.; Mateo-Fornés, J.; Solsona, F.; Pedrol Aige, T.; Mesas Julió, M.; Piñol, R.; Godoy, P. Use of multiple correspondence
analysis and k-means to explore associations between risk factors and likelihood of colorectal cancer: Cross-sectional study. J.
Med. Internet Res. 2022, 24, e29056. [CrossRef]
21. Koné, A.P.; Scharf, D.; Tan, A. Multimorbidity and complexity among patients with cancer in Ontario: A retrospective cohort
study exploring the clustering of 17 chronic conditions with cancer. Cancer Control 2023, 30, 10732748221150393. [CrossRef]
22. Chantraine, F.; Schreiber, C.; Pereira, J.A.C.; Kaps, J.; Dierick, F. Classification of stiff-knee gait kinematic severity after stroke
using retrospective k-means clustering algorithm. J. Clin. Med. 2022, 11, 6270. [CrossRef]
23. Yasa, I.; Rusjayanthi, N.; Luthfi, W.B.M. Classification of stroke using k-means and deep learning methods. Lontar Komput. J. Ilm.
Teknol. Inf. 2022, 13, 23. [CrossRef]
24. Al-Khafaji, H.M.R.; Jaleel, R.A. Adopting effective hierarchal IoMTs computing with k-efficient clustering to control and forecast
COVID-19 cases. Comput. Electr. Eng. 2022, 104, 108472. [CrossRef] [PubMed]
25. Ilbeigipour, S.; Albadvi, A.; Noughabi, E.A. Cluster-based analysis of COVID-19 cases using self-organizing map neural network
and k-means methods to improve medical decision-making. Inform. Med. Unlocked 2022, 32, 101005. [CrossRef] [PubMed]
26. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 18–21 June 1965 and 27 December 1965–7 January
1966; University of California Press: Oakland, CA, USA, 1967; pp. 281–297.
27. Na, S.; Xumin, L.; Yong, G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In Proceedings
of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jinan, China, 2–4
April 2010; pp. 63–67.
28. Alam, M.S.; Rahman, M.M.; Hossain, M.A.; Islam, M.K.; Ahmed, K.M.; Ahmed, K.T.; Singh, B.C.; Miah, M.S. Automatic human
brain tumor detection in mri image using template-based k means and improved fuzzy c means clustering algorithm. Big Data
Cogn. Comput. 2019, 3, 27. [CrossRef]
29. Lee, H.; Choi, Y.; Son, B.; Lim, J.; Lee, S.; Kang, J.W.; Kim, K.H.; Kim, E.J.; Yang, C.; Lee, J.-D. Deep autoencoder-powered pattern
identification of sleep disturbance using multi-site cross-sectional survey data. Front. Med. 2022, 9, 950327. [CrossRef] [PubMed]
30. Setiawan, K.E.; Kurniawan, A.; Chowanda, A.; Suhartono, D. Clustering models for hospitals in Jakarta using fuzzy c-means and
k-means. Procedia Comput. Sci. 2023, 216, 356–363. [CrossRef]
31. Yuan, C.; Yang, H. Research on k-value selection method of k-means clustering algorithm. J 2019, 2, 226–235. [CrossRef]
32. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov.
2012, 2, 86–97. [CrossRef]
33. Rumelhart, D.; Hinton, G.; Williams, R. Learning internal representations by error propagation. In Parallel Distributed Processing:
Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986; Chapter 8; Volume 1, pp. 318–362.
34. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised
and Transfer Learning, Bellevue, WA, USA, 2 July 2011; JMLR Workshop and Conference Proceedings. ML Research Press:
London, UK, 2012; pp. 37–49.
35. Zhang, L.; Lv, C.; Jin, Y.; Cheng, G.; Fu, Y.; Yuan, D.; Tao, Y.; Guo, Y.; Ni, X.; Shi, T. Deep learning-based multi-omics data
integration reveals two prognostic subtypes in high-risk neuroblastoma. Front. Genet. 2018, 9, 477. [CrossRef]
36. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. In Machine Learning for Data Science Handbook: Data Mining and Knowledge
Discovery Handbook; Springer: New York, NY, USA, 2023; pp. 353–374.
37. Kohonen, T. The self-organizing map. Proc. IEEE 1990, 78, 1464–1480. [CrossRef]
38. Vesanto, J.; Alhoniemi, E. Clustering of the self-organizing map. IEEE Trans. Neural Netw. 2000, 11, 586–600. [CrossRef] [PubMed]
39. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27. [CrossRef]
40. Desgraupes, B. Clustering indices. Univ. Paris Ouest-Lab Modal’X 2013, 1, 34.
41. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
[CrossRef]
42. Xiao, J.; Lu, J.; Li, X. Davies–Bouldin index based hierarchical initialization k-means. Intell. Data Anal. 2017, 21, 1327–1338.
[CrossRef]
43. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987,
20, 53–65. [CrossRef]
44. Shahapure, K.R.; Nicholas, C. Cluster quality analysis using silhouette score. In Proceedings of the 2020 IEEE 7th International
Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 747–748.
45. Harada, D.; Asanoi, H.; Noto, T.; Takagawa, J. Different pathophysiology and outcomes of heart failure with preserved ejection
fraction stratified by k-means clustering. Front. Cardiovasc. Med. 2020, 7, 607760. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.