Article

Using Medical Data and Clustering Techniques for a Smart Healthcare System
Wen-Chieh Yang 1, Jung-Pin Lai 2, Yu-Hui Liu 3, Ying-Lei Lin 3, Hung-Pin Hou 1 and Ping-Feng Pai 3,4,*
Abstract: With the rapid advancement of information technology, both hardware and software, smart
healthcare has become increasingly achievable. The integration of medical data and machine-learning
technology is the key to realizing this potential. The quality of medical data influences the results of a
smart healthcare system to a great extent. This study aimed to design a smart healthcare system based
on clustering techniques and medical data (SHCM) to analyze potential risks and trends in patients in
a given time frame. Evidence-based medicine was also employed to explore the results generated by
the proposed SHCM system. Thus, similar and different discoveries examined by applying evidence-
based medicine could be investigated and integrated into the SHCM to provide personalized smart
medical services. In addition, the presented SHCM system analyzes the relationship between health
conditions and patients in terms of the clustering results. The findings of this study show the
similarities and differences in the clusters obtained between indigenous patients and non-indigenous
patients in terms of diseases, time, and numbers. Therefore, the analyzed potential health risks
could be further employed in hospital management, such as personalized health education control,
personal healthcare, improvement in the utilization of medical resources, and the evaluation of
medical expenses.
Keywords: clustering; medical data; smart healthcare

Citation: Yang, W.-C.; Lai, J.-P.; Liu, Y.-H.; Lin, Y.-L.; Hou, H.-P.; Pai, P.-F. Using Medical Data and Clustering Techniques for a Smart Healthcare System. Electronics 2024, 13, 140. https://doi.org/10.3390/electronics13010140

Academic Editors: Antoni Morell and Chunping Li

Received: 15 November 2023; Revised: 18 December 2023; Accepted: 27 December 2023; Published: 28 December 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Due to the progress and advantages of information technology and data analysis techniques, smart medical care plays an important role in the modern medical field. Machine-learning and data-mining techniques have provided hospital practitioners with more effective and efficient medical solutions in personalized medicine and led to disease predictions, medical efficiency improvement, and medical resource optimization. To identify similarities among patients, grouping patients into clinically meaningful clusters is essential [1]. Healthcare organizations and physicians take advantage of clustering results to analyze similarities among patients. By clustering patients in terms of diseases, risk factors, lifestyles, or other relevant factors, clustering results can help physicians gain insights into patients' needs and provide personalized treatments.

Previous studies have pointed out the importance of using medical management databases to analyze patient clusters to learn trends of diseases according to clustering results [2]. The clustering technique is one of the most useful methods for analyzing patient similarities for precision medicine [1]. Analyzing a patient's potential risks and trends requires a lot of patient-related data, which are recorded every time a patient visits a hospital for medical treatment. In the era of big data, electronic records include a large amount of text, such as the clinical narration of doctors' advice. Thus, the analysis of
electronic records has become more complex than before. In addition, due to the high
dimensions of input data, the reduction in dimensions or feature selection can improve
model efficiency and the performance of clustering tasks. Zelina et al. [3] proposed a
natural language processing (NLP) method to investigate the clinician dataset of Czech
breast cancer patients. The developed RobeCzech model is a general-purpose Czech
transformer language model and is used for the unsupervised extraction, labeling, and
clustering of fragments from clinical records. This study indicated the feasibility as well
as the possibility of dealing with unstructured Czech clinical records in a non-supervised
training manner. Irving et al. [4] employed electronic medical record (EMR) data to enhance
the detection and prediction of psychosis risk in South London. In addition to basic patient
information, clinical characteristics, symptoms, and substances, the EMR data included
NLP predictions. The authors reported that using NLP to cope with EMRs can significantly
improve the prognostic accuracy of psychosis risk.
Issues of concern in existing electronic medical records and eHealth systems include
technical aspects, managerial factors, and particularly the quality of data in systems [5].
Additionally, as previously pointed out, the quality of the data is essential for healthcare
systems [6]. Thus, this study aimed to deal with various data types by applying data
preprocessing with data merging, data conversion, data cleaning, data selection, and data
normalization. Then, clustering techniques were employed to group patients with similar
medical features to improve the data quality for the healthcare system.
This investigation used demographic information, drug items, doctors’ advice, and
exam items to perform clustering tasks and then to analyze the results in terms of in-
digenous people and non-indigenous people. Four clustering methods were used in this
study, namely, K-means, hierarchical clustering, autoencoder, and SOM-KM. The clustering
performance was evaluated through three indicators: the Calinski–Harabasz index (CH),
Davies–Bouldin index (DB), and Silhouette Coefficient (SC). For most cases and indices,
K-means outperformed the other methods. Therefore, K-means was used to analyze the
clustering results. The rest of this study is organized as follows. Section 2 illustrates
the clustering methods and applications in medical data analysis. The presented smart
healthcare system based on clustering techniques and big data is introduced in Section 3.
Section 4 depicts numerical examples. Finally, conclusions are presented in Section 5.
K-means has been a popular clustering technique for analyzing medical data. Clustering approaches can generally be classified into two categories: hierarchical clustering algorithms and partition clustering algorithms [7–10]. This study employed four
clustering methods, K-means (KM), hierarchical clustering (HC), the K-means autoencoder
(AEKM), and the K-means self-organizing map (SOMKM), to analyze medical data.
The K-means method [26] involves dividing a sample dataset into k subsets, forming
k clusters, and assigning n data points to these k clusters, with each data point exclusively
belonging to one cluster. The K-means algorithm is an iterative process that consists of two
primary steps. Initially, it selects k cluster centers, and subsequently, it assigns data points
to the nearest center to obtain an initial result. Following this, the centroids of each cluster
are updated as new centers, and these two steps are repeated iteratively. The objective of
the clustering results is to minimize the distance between data points and their respective cluster centers. The objective function of the K-means algorithm is shown in the following equations. Equation (1) employs the Euclidean distance to ensure that data point $x_i$ is closest to its assigned center, while Equation (2) is used to update the center as the mean value [27–30].

$$\mathrm{Obj} = \sum_{j=1}^{k} \sum_{i=1}^{N} \lVert x_i - \bar{x}_j \rVert^2 \quad (1)$$

$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_i \quad (2)$$

where $k$ is the number of cluster centers, $N$ is the number of data points in the $j$th cluster, $\bar{x}_j$ is the mean of the $j$th cluster, and $x_i$ is the $i$th data point in that cluster.
For the K-means clustering algorithm, it is necessary to pre-specify the number of
clusters denoted by K. This is an important hyperparameter of the algorithm. To determine
the most suitable number of clusters for the experimental data, the Elbow Method is
employed in this approach [31].
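The Elbow Method described above can be sketched with scikit-learn, which the study's implementation uses. The synthetic data below are an illustrative stand-in for the preprocessed patient features; the inertia reported by `KMeans` is the within-cluster sum of squared distances, i.e., the objective in Equation (1):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs stand in for the preprocessed patient features.
X = np.vstack([rng.normal(m, 0.5, (60, 4)) for m in (0, 4, 8)])

# Elbow Method: run K-means for several values of k and record the inertia
# (the within-cluster sum of squared distances minimized by Eq. (1)).
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# The "elbow" is the k at which the inertia curve flattens; for this
# synthetic data it occurs at k = 3 by construction.
print({k: round(v, 1) for k, v in inertias.items()})
```

In practice the inertia values would be plotted against k and the elbow chosen visually, as the authors do for each month of data.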
Hierarchical clustering (HC) constructs a hierarchy of clusters by iteratively merging or
dividing clusters based on a distance metric. This method provides a visual representation
of the data structure through dendrogram plots. There are two main types of hierarchical
clustering: agglomerative and divisive. In our study, we employed the agglomerative
clustering approach because our data samples were generated from patient records. This
method begins with each sample being treated as an individual cluster and then progres-
sively merges clusters that are close in proximity until a certain termination condition is
met. For hierarchical clustering, three essential elements, the similarity distance, merging
rules, and termination conditions, need to be considered [32]. The hierarchical clustering
process is irreversible, and due to its consideration of each individual data point, it can be
computationally time-consuming.
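A minimal sketch of the agglomerative variant used here, with scikit-learn and SciPy (the dendrogram mentioned above can be drawn from the SciPy linkage matrix); the data are synthetic placeholders, and Ward linkage is one common choice of merging rule:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 4)), rng.normal(5, 0.5, (50, 4))])

# Agglomerative clustering: start with every sample as its own cluster and
# repeatedly merge the closest pair until the requested number remains.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# The SciPy linkage matrix encodes the full merge hierarchy; a dendrogram
# can be drawn from it with scipy.cluster.hierarchy.dendrogram.
Z = linkage(X, method="ward")
labels_cut = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(len(set(labels)), len(set(labels_cut)))  # 2 2
```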
Developed in the 1980s by Hinton and the PDP group [33], the autoencoder is an artifi-
cial neural network with an input layer, a hidden layer, and an output layer. The main pur-
pose of the autoencoder is to perform representation learning on the input data and make
the output and input have the same meaning. Autoencoders have been widely used in fea-
ture extraction [29,34–36]. An m-dimensional dataset is considered as $X = \{X_1, X_2, \ldots, X_m\}$. The compressed data features are generated by the encoder $E$, and following that, the output $X^*$ is generated by the decoder $D$, which can be expressed by Equation (3):

$$X^* = D(E(X)) \quad (3)$$

The training goal of the autoencoder is to minimize the reconstruction error between $X$ and $X^*$. The loss function can be expressed as Equation (4), written here as the mean squared reconstruction error, a common choice:

$$L(X, X^*) = \frac{1}{m} \sum_{i=1}^{m} \lVert X_i - X_i^* \rVert^2 \quad (4)$$
After establishing the autoencoder model, the K-means method is then used because
the autoencoder is not a clustering tool [35].
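As a rough illustration of the AEKM idea (not the authors' exact architecture), a single-hidden-layer network trained to reproduce its own input acts as a crude autoencoder; the bottleneck activations are then clustered with K-means. The network size, data, and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for the medical features

# Train a network to reproduce its input (X -> X); the 3-unit hidden layer
# is the bottleneck that produces the compressed representation E(X).
ae = MLPRegressor(hidden_layer_sizes=(3,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Manually compute the hidden-layer (ReLU) activations: the encoded features.
Z = np.maximum(X @ ae.coefs_[0] + ae.intercepts_[0], 0.0)

# The autoencoder itself does not cluster, so K-means is applied to the
# compressed features (the AEKM combination described in the text).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print(labels.shape)  # (200,)
```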
The self-organizing map (SOM) [37] is a method consisting of a two-dimensional
grid used for mapping input data. During the training process, the SOM forms an elastic
grid to envelop the distribution of input data, mapping adjacent input data to nearby
grid units. SOM training is an iterative process that adjusts the positions of grid units by
computing distances and finding the Best-Matching Unit (BMU) with prototype vectors.
Furthermore, the SOM’s computational complexity scales linearly with the number of
data samples, making it memory-efficient, but scales quadratically with the number of
map units. Training large maps can be time-consuming, although it can be expedited with
specialized techniques. Apart from the SOM, alternative variants are available, though they
may require more complex visualization methods. In summary, the SOM is an effective
approach for processing large datasets while preserving the topological characteristics of
the input space [38].
SOM training is conducted iteratively. In each training step, a sample vector is ran-
domly chosen from the input dataset. Distances between this sample vector and all proto-
type vectors are computed. The Best-Matching Unit (BMU) is the map
unit whose prototype vector is closest to the sample vector. Subsequently, the prototype
vectors are updated. The BMU and its topological neighbors are adjusted toward the
sample vector in the input space. The rule for updating the prototype vector of unit $i$ is expressed in Equation (5):

$$v_i(t+1) = v_i(t) + \alpha(t)\, h_{ij}(t)\, [x(t) - v_i(t)] \quad (5)$$

where $v_i(t+1)$ is the updated prototype vector for unit $i$ at time $t+1$, $v_i(t)$ is the current prototype vector for unit $i$ at time $t$, $x(t)$ is the sample vector chosen at time $t$, $\alpha(t)$ is the adaptation coefficient at time $t$, and $h_{ij}(t)$ is the neighborhood kernel centered on the winning unit at time $t$.
The SOM is commonly used for dimensionality reduction and data visualization; it
maps high-dimensional data into two- or three-dimensional spaces, providing a significant
advantage when dealing with complex data. In this study, we leveraged the strengths of
both models by first mapping the data into a two-dimensional representation through a
SOM and then performing clustering using K-means.
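The SOMKM combination can be sketched with a minimal NumPy self-organizing map: each training step finds the BMU and applies an Equation (5)-style update, then every sample is mapped to its BMU's two-dimensional grid coordinates and clustered with K-means. Grid size, learning-rate schedule, and data are illustrative assumptions, not the authors' configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # synthetic stand-in for the medical features

# --- Minimal SOM ---
rows, cols, dim = 6, 6, X.shape[1]
W = rng.normal(size=(rows * cols, dim))                 # prototype vectors
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

n_steps = 2000
for t in range(n_steps):
    x = X[rng.integers(len(X))]                         # random sample vector
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))         # Best-Matching Unit
    alpha = 0.5 * (1 - t / n_steps)                     # decaying adaptation coefficient
    sigma = 3.0 * (1 - t / n_steps) + 0.5               # decaying neighborhood width
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian neighborhood kernel
    W += alpha * h[:, None] * (x - W)                   # Eq. (5)-style update

# Map each sample to its BMU's 2-D grid coordinates, then cluster with
# K-means (the SOMKM approach described in the text).
bmus = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
coords = grid[bmus]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
print(labels.shape)  # (300,)
```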
Figure 1. The proposed smart healthcare system based on clustering techniques and medical data (SHCM).
was used for data normalization in this study and is represented by Equation (6). Table 3
shows the number of attributes and patient visits in 12 months.
(1) Data merging; (2) Data conversion; (3) Data cleaning; (4) Data selection; (5) Data normalization.
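The five preprocessing steps can be sketched with pandas; the tables, column names, and values below are hypothetical stand-ins for the hospital records, and min-max scaling is used here as one common normalization choice:

```python
import pandas as pd

# Hypothetical patient-visit and drug tables; all names are illustrative.
visits = pd.DataFrame({"patient_id": [1, 1, 2, 3],
                       "icd10": ["E11", "I10", "M10", "N39"],
                       "age": [54, 54, 37, None]})
drugs = pd.DataFrame({"patient_id": [1, 2, 2],
                      "drug_item": ["metformin", "colchicine", "allopurinol"]})

# (1) Data merging: combine visit records with drug items.
df = visits.merge(drugs, on="patient_id", how="left")

# (2) Data conversion: one-hot encode the categorical codes.
df = pd.get_dummies(df, columns=["icd10", "drug_item"], dummy_na=True)

# (3) Data cleaning: drop records with a missing age.
df = df.dropna(subset=["age"])

# (4) Data selection: keep the numeric feature columns only.
features = df.drop(columns=["patient_id"]).astype(float)

# (5) Data normalization: min-max scaling to [0, 1]; constant columns get
# a divisor of 1 to avoid division by zero.
span = (features.max() - features.min()).replace(0, 1)
normalized = (features - features.min()) / span
print(normalized.shape)
```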
clusters based on the distances between data points within each cluster. It then identifies
the cluster (Cj ) with the highest similarity and divides it by the cluster’s dispersion (S),
which is computed by averaging the distances between data points within that cluster, and
can be expressed by Equation (8). The DB index is established by averaging these cluster
similarities across all clusters, and a smaller DB index value indicates better clustering
results [42].

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{C_i + C_j}{S_{ij}} \right) \quad (8)$$
The Silhouette Coefficient (SC) [43] considers the similarity of each data point to others
within its cluster (A) and the dissimilarity to other clusters (B). The within-cluster similarity
(A) measures the distance between the data point and other data points within the same
cluster. The between-cluster dissimilarity (B) measures the distance between the data point
and data points in other clusters. The formula for calculating the Silhouette Coefficient is
in Equation (9):
$$SC = \frac{B - A}{\max(A, B)} \quad (9)$$
The SC values range from −1 to 1, where a value close to 1 indicates that data
points within their assigned cluster are very similar and dissimilar to data points in other
clusters, while a value close to −1 suggests that data points are more likely to be assigned
to the wrong cluster [44].
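All three indicators are available directly in scikit-learn, which the study's implementation uses; the two synthetic, well-separated blobs below stand in for the clustered medical features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs stand in for the medical features.
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(6, 1, (100, 5))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ch = calinski_harabasz_score(X, labels)   # larger is better
db = davies_bouldin_score(X, labels)      # smaller is better
sc = silhouette_score(X, labels)          # closer to 1 is better
print(round(sc, 2))
```

Comparing these three scores across candidate methods and cluster counts mirrors how Tables 5–7 are used to select K-means in Section 4.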
The clustering and storing of data in this study were implemented in the Anaconda
environment based on the Python programming language and the scikit-learn library. In
the next section, numerical data are employed to demonstrate the performance of the
SHCM system. Then, numerical results generated by the SHCM system are observed and
analyzed, and conclusions are drawn.
4. Numerical Results
4.1. Clustering Performance with Three Measurements
Table 4 indicates the cluster numbers obtained by the four clustering methods over the 12 months. Tables 5–7 list the three measurements used in this study to evaluate the performance of the clustering techniques: the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC). The suitable number of clusters falls between three and five, with three clusters appearing most frequently. Table 5
illustrates the CH indexes of the four clustering approaches. A larger CH value means a
better clustering result. Table 6 shows the DB indicators of the different clustering methods.
A smaller DB value implies a better clustering result. Table 7 depicts the SC coefficients. An SC value close to 1 means a better clustering result. In summary, the K-means method is
mostly superior to the other clustering methods for 12 months of data. It has been pointed
out that the K-means approach can provide quite satisfactory results compared to the other
clustering methods [15,30]. Harada et al. [45] reported that the advantages of K-means
are its simple principle and high flexibility. Therefore, the clustering results generated by K-means were used for the subsequent analyses in this study.
Table 4. The numbers of clusters using four methods from January to December.
Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 4 4 3 3 5 3 3 3 3 3 3 3
AEKM 4 4 3 3 5 3 3 3 3 3 3 3
SOMKM 4 4 3 3 4 3 4 3 4 4 3 3
HC 4 4 3 3 3 3 3 3 3 3 3 4
Table 5. The CH index values of the four clustering methods from January to December.
Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 1310.21 1431.42 1988.48 2107.83 1810.97 1892.51 1914.83 2082.41 1682.57 1731.59 1590.45 1323.84
AEKM 257.73 268.64 217.65 399.86 319.53 653.52 709.87 900.43 347.86 336.36 384.98 163.36
SOMKM 1281.33 1423.37 1988.48 1012.57 1377.30 1888.91 1432.10 1841.56 1391.69 1416.60 1590.45 1233.08
HC 1264.75 1368.76 1883.29 2008.26 2060.34 1847.02 1805.66 2035.00 1632.01 1690.22 1512.80 1112.01
Table 6. The DB index values of the four clustering methods from January to December.
Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 1.43 1.28 1.43 1.25 1.6 1.31 1.27 1.28 1.44 1.5 1.35 1.58
AEKM 4.59 3.68 5.55 6 5.33 4.02 3.12 3.74 3.58 4.34 3.11 6.18
SOMKM 1.42 1.29 1.43 2.19 1.96 1.31 1.91 1.5 1.57 1.67 1.35 1.7
HC 1.45 1.31 1.43 1.27 1.49 1.32 1.34 1.28 1.43 1.52 1.38 1.53
Table 7. The SC values of the four clustering methods from January to December.
Methods Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec.
KM 0.27 0.29 0.27 0.32 0.24 0.31 0.31 0.32 0.27 0.25 0.29 0.23
AEKM 0.01 0.03 0.02 0.07 0.01 0.13 0.16 0.13 0.04 0.03 0.04 0.01
SOMKM 0.26 0.29 0.27 0.16 0.2 0.31 0.24 0.31 0.28 0.25 0.29 0.21
HC 0.27 0.29 0.25 0.31 0.26 0.3 0.3 0.32 0.26 0.25 0.29 0.25
urinary system diseases (N39, N20), gout (M10), chronic viral hepatitis (B18), blood dis-
orders (R31), and infectious diseases (Z20). The grouping pattern continued similarly in
February. In March and April, the number of groups was reduced to three, combining
infectious diseases (Z20), pregnancy-related care (Z34), digestive system diseases (K21), and
sleep disorders (G47), along with groups featuring gout (M10) and chronic viral hepatitis
(B18) that now included respiratory disease codes (I25, J44, J18). In April, the group with
pregnancy-related care (Z34) also showed occurrences of breast cancer (C50). By May, the
number of groups increased to five, including groups with respiratory disease codes (J00)
and respiratory symptoms (R05, O47). From June to December, the groups were consis-
tently divided into three: a group focusing on infectious diseases (Z20), pregnancy-related
care (Z34), and sleep disorders (G47); a group with urinary system diseases (N39), often
accompanied by hematuria (R31) and kidney stones (N20); and a group dominated by gout
(M10) and chronic viral hepatitis (B18), frequently associated with respiratory diseases (J44,
J18) and chronic ischemic heart disease (I25).
Observations throughout the year show that urinary diseases (N class) and infectious
diseases (Z20) were almost constantly present, along with the consistent appearance of
respiratory diseases (J class), indicating these as the main health concerns affecting the
non-indigenous population. The presence of chronic ischemic heart disease (I25) may relate
to lifestyle factors in the non-indigenous community.
4.6. Comparative Analysis of Major Disease Codes among Indigenous and Non-Indigenous
Patients
Utilizing K-means clustering, which resulted in groups of three, four, and five, an
examination of the primary disease codes was conducted. Across eight months, namely,
March, April, and July to December, where the data were clustered into three groups, the
same disease codes were consistently observed each month in the indigenous population.
To facilitate a clearer observation and comparison of similar codes across different groups,
the results are documented in Table 16. Consequently, the outcomes were categorized
into three classes: class A, predominantly featuring codes I10, I11, K21, and Z34; class B,
primarily centered around code N39; and class C, focusing on code M10.
In the non-indigenous groups, similar patterns were observed. In class A, the codes
included K21 and G47, sharing K21 with the indigenous group but adding G47 while
lacking I10, I11, and Z34. Both indigenous and non-indigenous groups had N39 as the
main code in class B. In class C, both groups shared the M10 code, but an additional N40
code was observed in the non-indigenous group. During the same period, when the data
were categorized into three groups, a unique situation was noted in June for the indigenous
population. They had groups belonging to class A and class C, but not class B. Instead,
there was an additional group, classified as class D, characterized by Z20 as the primary
code, accompanied by respiratory diseases (U07, J00, J02, J06, R05). In contrast, the non-
indigenous population continued to be categorized under the original classes A, B, and C.
This variance in June, particularly considering the specific impact of COVID-19 (U07) in
2022, suggests that the indigenous population was more significantly affected during this
month, leading to a different clustering trend. After classifying the main diseases based
on their similarities, it becomes easier to observe the differing trends between infectious
and chronic diseases over the 12 months. Consequently, the subsequent analysis focuses on
infectious diseases and chronic diseases.
centered around urinary system infections (N39), and class C, led by gout (M10). However,
a category similar to that observed in June for the indigenous population emerged: class
D. Unlike in June, the frequency of respiratory disease codes was lower in January and
February, with pneumonia (B18) notably being the second most common condition. Given
the outbreak of COVID-19 in Taiwan during these months, the appearance of class D could
serve as an early warning signal of the epidemic.
In the analysis of May, considering the greater number of assigned groups due to
potential overlapping diseases in patients, primary diseases were selected based on a
threshold of 40%. In addition to the similar classes A, B, C, and D, an additional class E
was identified, which is primarily associated with acute upper respiratory infections (J00).
In this group, the specific COVID-19 disease code (U07) exceeded 43% in both indigenous
and non-indigenous populations, aligning with the surge in COVID-19 cases in Taiwan
in May. This finding indicates the emergence of a new group of patients seeking medical
assistance due to the epidemic, in addition to regular medical patients. Among groups in
May, 48% of the indigenous population had the U07 code, compared to 43% in the non-
indigenous population. The disparity in the proportion of disease codes suggests a greater
impact on indigenous communities. Considering lifestyle factors, indigenous communities,
often residing in closely knit tribes, have a higher interaction frequency compared to
non-indigenous populations. In addition, indigenous people are slower to receive disease
information and initiate preventive measures compared to people in urban areas.
Figure 3. The age distribution of patients with E11 disease code of class A in March.
Figure 4. The age distribution of patients with E11 disease code of class B in March.
Figure 5. The age distribution of patients with E11 disease code of class C in March.
The bar chart displays the number of individuals with the E11 code in terms of age, and the line graph represents the cumulative cases. Observing the cumulative cases, it is evident that the proportion of the E11 code in the indigenous population is mostly higher than in the non-indigenous groups. The bar chart reveals similar trends in both groups, but with different age brackets. The trend for the non-indigenous group appears a decade later than that for the indigenous group, indicating an earlier onset of E11-related health impacts in the indigenous population.

4.6.3. Impacts of Essential Hypertension and Hypertensive Heart Diseases

Observations from the analysis of major diseases reveal distinct patterns within chronic heart diseases of the I class. Among the indigenous population, these diseases are mostly more prevalent than in the non-indigenous groups.
provides a clearer outline of regional disease patterns and healthcare needs. This approach
allows for a better understanding of medical requirements in different areas and facilitates
the provision of appropriate medical assistance tailored to the needs of specific subgroups.
Additionally, in the context of epidemic prevention and control, this method can enable
early prevention and management based on local infection trends.
5. Conclusions
This study employed clustering techniques to group and then analyze diseases in the
indigenous population and the non-indigenous population. K-means clustering obtained
better results than the other three clustering techniques in terms of three measurements.
The developed model can learn distances between clusters and further investigate relations
among diseases in patients through the features of clusters. Although the medical condi-
tions of patient groups vary each month, a consistent clustering trend is observed overall.
This trend is particularly pronounced in cases where the primary disease is the same,
indicating a higher probability of certain disease codes appearing together. This result lays
the foundation for a deeper exploration of potential correlations between different diseases.
From the perspectives of chronic diseases and bacterial or viral infections, we noted
distinct clustering behaviors between the two. The chronic disease group exhibited consis-
tency in the monthly analyses, while the clustering of bacterial or viral infections showed a
close correlation with the stages of epidemic development. This was particularly evident in
the context of the 2022 epidemic trends in Taiwan, where changes in the number of clusters
were highly correlated with different stages of the epidemic.
The unsupervised clustering method helps in identifying correlations in complex and
varied data that are not readily observable. However, the diversity of the data, along with
the varying medical needs of patients at the time of consultation, poses challenges in data
preprocessing. Moreover, challenges arise due to the large data scales, diversities, and
complexity. Both structured and unstructured data are included in the dataset. In addition
to medical treatment, there may be return visits or patients with chronic diseases who
only receive medicine without treatment. Thus, the presentation of data should not only
consider the identity of the patient but also consider the timeliness of the patient’s visits.
Only data from 2022 were employed in this study. Data gathered in other years could
be employed to examine the feasibility of the proposed SHCM system. In addition, only
data collected from the Puli Christian Hospital served as data for the SHCM system. Data
collected from other hospitals could be utilized to investigate the generalization ability of
the developed system. Finally, some deep clustering techniques could be employed to deal
with the clustering tasks for the presented system.
Author Contributions: Conceptualization, P.-F.P., W.-C.Y. and H.-P.H.; methodology, J.-P.L. and
P.-F.P.; software, J.-P.L., Y.-H.L. and Y.-L.L.; formal analysis, Y.-H.L.; writing—original draft prepa-
ration, Y.-H.L., Y.-L.L. and P.-F.P.; writing—review and editing, P.-F.P.; visualization, Y.-H.L. and
Y.-L.L.; supervision, P.-F.P. and W.-C.Y. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was supported by funding from Puli Christian Hospital/Chi Nan National
University Joint Research Program under grant number 112-PuChi-AIR-001.
Institutional Review Board Statement: Ethical review and approval were waived for this study, due
to the use of a database with data aggregated by age (10-year age-groups) and diagnosis categories.
Informed Consent Statement: Informed consent was not required as cohort members were unidentifiable.
Data Availability Statement: The data presented in this study are available on reasonable request from the corresponding author. The data are not publicly available due to privacy concerns.
Acknowledgments: This work was supported by Kai Yen, Bing-Cheng Chiu, and Yan-Song Chang,
who assisted in data analysis.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A
In this study, the analysis of electronic medical record (EMR) data involved labeling
patients’ diseases according to the International Statistical Classification of Diseases and Related
Health Problems, 10th Revision (ICD-10 CM). The ICD-10, established by the World Health
Organization (WHO), categorizes diseases based on their characteristics and represents
them using a coding system. This system is crucial for accurately and systematically
recording cases, and it plays a significant role in clinical diagnosis, epidemiological research,
health management, and data collection. This appendix only includes disease codes directly
relevant to our study. These codes represent specific disease types involved in our research
and are intended to help readers better understand the scope and focus of our study. Each
code is accompanied by a brief description of the disease, making it accessible to readers
who are not specialists in the field.
ICD-10 CM Diseases
A09 Infectious gastroenteritis and colitis, unspecified
B08 Other viral infections characterized by skin and mucous membrane lesions, not elsewhere classified
B18 Chronic viral hepatitis
C50 Malignant neoplasm of breast
D64 Other anemias
E11 Type 2 diabetes mellitus
E78 Disorders of lipoprotein metabolism and other lipidemias
E86 Volume depletion
E87 Other disorders of fluid, electrolyte and acid-base balance
G47 Sleep disorders
I10 Essential (primary) hypertension
I11 Hypertensive heart disease
I20 Angina pectoris
I25 Chronic ischemic heart disease
I50 Heart failure
J00 Acute nasopharyngitis [common cold]
J01 Acute sinusitis
J02 Acute pharyngitis
J03 Acute tonsillitis
J06 Acute upper respiratory infections of multiple and unspecified sites
J12 Viral pneumonia, not elsewhere classified
J18 Pneumonia, unspecified organism
J20 Acute bronchitis
J30 Vasomotor and allergic rhinitis
J44 Other chronic obstructive pulmonary disease
J45 Asthma
K21 Gastroesophageal reflux disease
K25 Gastric ulcer
K29 Gastritis and duodenitis
K59 Other functional intestinal disorders
K92 Other diseases of digestive system
L03 Cellulitis and acute lymphangitis
L08 Other local infections of skin and subcutaneous tissue
M10 Gout
M19 Other and unspecified osteoarthritis
M54 Dorsalgia
N13 Obstructive and reflux uropathy
N18 Chronic kidney disease (CKD)
N20 Calculus of kidney and ureter
N39 Other disorders of urinary system
N40 Benign prostatic hyperplasia
O47 False labor
P59 Neonatal jaundice from other and unspecified causes
R00 Abnormalities of heart beat
R05 Cough
R07 Pain in throat and chest
R10 Abdominal and pelvic pain
R11 Nausea and vomiting
R31 Hematuria
R35 Polyuria
R42 Dizziness and giddiness
R50 Fever of other and unknown origin
R51 Headache
R80 Proteinuria
U07 Emergency use of U07
Z11 Encounter for screening for infectious and parasitic diseases
Z20 Contact with and (suspected) exposure to communicable diseases
Z34 Encounter for supervision of normal pregnancy
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.