Medical Informatics and Data Analysis
Editor
Pentti Nieminen
University of Oulu, Finland
www.mdpi.com/journal/applsci
MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin
Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland
This is a reprint of articles from the Special Issue published online in the open access journal
Applied Sciences (ISSN 2076-3417) (available at: https://www.mdpi.com/journal/applsci/special_issues/Medical_Informatics_Data_Analysis).
For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:
LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Volume Number,
Page Range.
© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.
The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.
Contents
Pentti Nieminen
Applications of Medical Informatics and Data Analysis Methods
Reprinted from: Appl. Sci. 2020, 10, 7359, doi:10.3390/app10207359
Pentti Nieminen
Ten Points for High-Quality Statistical Reporting and Data Presentation
Reprinted from: Appl. Sci. 2020, 10, 3885, doi:10.3390/app10113885
Jihwan Park, Mi Jung Rho, Hyong Woo Moon and Ji Youl Lee
Castration-Resistant Prostate Cancer Outcome Prediction Using Phased Long Short-Term
Memory with Irregularly Sampled Serial Data
Reprinted from: Appl. Sci. 2020, 10, 2000, doi:10.3390/app10062000
Daniel Clavel, Cristian Mahulea, Jorge Albareda, and Manuel Silva
A Decision Support System for Elective Surgery Scheduling under Uncertain Durations
Reprinted from: Appl. Sci. 2020, 10, 1937, doi:10.3390/app10061937
About the Editor
Pentti Nieminen is a senior scientist at the University of Oulu. He completed his Ph.D. degree
in 1996 and serves as professor emeritus in medical informatics and data analysis at the
University of Oulu. He has worked for over 40 years as an academic teacher in knowledge management
and data analysis. Much of his teaching and research work has been conducted within the following
fields: biostatistics, data analysis methods in health care and medicine, data informatics, statistics
in scientific journals, statistical modeling, publications, bibliometrics, information retrieval, and
educational practices in teaching scientific research and communication. To date, he has published
over 240 scientific articles. His current research projects include studies of statistical intensity,
statistical reporting, and quality of data presentation in medical articles. His goal is to improve the
quality of published research papers and thus to contribute to societal welfare and human well-being
through his experience in data analysis. Outside of professional interests, he enjoys orienteering,
hiking, and traveling.
Editorial
Applications of Medical Informatics and Data
Analysis Methods
Pentti Nieminen
Medical Informatics and Data Analysis Research Group, University of Oulu, 90014 Oulu, Finland;
pentti.nieminen@oulu.fi
1. Introduction
The science of statistics contributes to the development and application of tools for the design,
analysis, and interpretation of empirical medical studies. The development of new statistical tools for
medical applications depends on the innovative use of statistical inference theory, good understanding
of clinical and epidemiological research questions, and an understanding of the importance of statistical
software. First, statisticians develop a method in response to a need felt in a particular field of the
health sciences, after which the new method is disseminated in the form of presentations, reports,
and publications. It is also necessary to develop tools for implementing the method: software and
manuals. From this point onwards, the extent to which the procedure is adopted will depend on its
usefulness. The broader introduction and acceptance of a new analysis method (as useful as the method
might be) into medical and health care publications seems to require the method being incorporated
into the standard statistical packages generally used by researchers. In addition, if readers do not
understand the mathematics or reporting style, or if the conclusions have been drawn on the basis of
advanced mathematics or computationally complex procedures not visible in the data (tables or graphs)
presented, then clinicians may not be convinced of the results. The lead time from the description of a
new technique to its entry into the practice of medical investigators is long [1].
Unsustainable promises and unfulfillable expectations should be avoided in the context of data
mining and machine learning [2]. The broader introduction and expansion of a new analysis method
to medical publication seems to require that the method helps to solve a data analysis problem,
where basic statistical methods have not been useful or applicable. Simpler classical approaches can
often provide elegant and sufficient answers to important questions.
This Special Issue on Medical Informatics and Data Analysis was an opportunity for the scientific
community to present research on the application and complexity of data analytical methods, and to
give insight into new challenges in biostatistics, epidemiology, health sciences, dentistry, and clinical
medicine. The 13 contributed articles belong to four broad groups: (i) basic statistical methods,
(ii) data-oriented practical approaches, (iii) complex machine learning and deep learning predictive
algorithms, (iv) medical informatics.
In their contribution on handling skewed data, Hammouri and colleagues [4] conclude that the
t-test with log transformation has superior performance over the GLM method for any data that are
not normal and follow beta or gamma distributions. Alternatively, for exponentially distributed data,
the GLM method has superior performance over the t-test with log transformation.
Several findings have demonstrated that too many medical articles do not provide a sufficiently
clear, accurate, or complete account of what was done and what was found. In his article, Ten Points
for High-Quality Statistical Reporting and Data Presentation [5], Pentti Nieminen proposes an applicable
checklist for quickly testing the statistical reporting quality of manuscripts and published research
papers. The developed instrument is applicable for a wide variety of medical and health care research
forums, including both clinical and basic sciences. Editors and reviewers could use the short quality test
proposed in this paper for deciding when the presentation in a manuscript is clearly inadequate. If the
reviewer cannot find the basic information and description related to the data analysis, the reviewer
does not need to read the whole article. After checking tables and figures and reading through
the statistical analysis subsection in the methods section, the reviewer can reject the manuscript on
good grounds. When the proposed simple quality test shows that the statistical reporting and data
presentation are appropriate, the whole article needs to be read and further reviewed [5].
the maps searching for factors such as gender, marital status, kind of work, and economic situation,
which are essential at the retirement threshold.
5. Medical Informatics
Medical informatics focuses on the information technology that enables the effective collection of
data using technology tools to develop medical knowledge and to facilitate the delivery of patient
medical care [16]. The goal of medical informatics is to ensure access to critical patient medical
information at the precise time and place it is needed to make medical decisions. Medical informatics
also focuses on the management of medical data for research and education. Three papers in this
Special Issue present applications for clinical decision making.
Daniel Clavel and his co-authors [17] present a decision support system to organize and order
possible surgeries. Their study has the potential to reduce the workload of the healthcare system in
scheduling, which is very labor-intensive work. A heuristic algorithm is proposed and included in
the decision support system. Different features are implemented in a software tool with a user-friendly
interface. A simulation comparison of the scheduling obtained using the approach presented in this
paper and other similar approaches is shown and analyzed. In addition, the impact of the software
tool on the efficiency and quality of surgical services is studied in one hospital setting.
In their paper, A Prostate MRI Segmentation Tool Based on Active Contour Models Using a Gradient
Vector Flow [18], Joaquín Rodríguez, Gilberto Ochoa-Ruiz, and Christian Mata describe in full
detail a new GUI tool based on semi-automated prostate segmentation. The purpose is to
facilitate the time-consuming segmentation process used for annotating images in clinical practice.
To support the efficiency of their method, the authors describe an experimental case.
The paper entitled Medical Assistant Mobile Application for Diabetes Control by Simulating a
Compartmental Model authored by Hernandez-Ordonez and his coworkers [19] is very interesting
and innovative. The authors present an application for mobile phones to assist patients with
type 1 diabetes. The proposed application is based on four mathematical models that describe
glucose–insulin–glucagon dynamics using a compartmental model, with additional equations to
reproduce aerobic exercise, gastric glucose absorption by the gut, and subcutaneous insulin absorption.
Such developments are always welcome, since diabetes has become a disease of civilization that affects
an increasing number of people every year.
References
1. Nieminen, P.; Miettunen, J.; Koponen, H.; Isohanni, M. Statistical methodologies in psychopharmacology:
A review. Hum. Psychopharmacol. Exp. 2006, 21, 195–203. [CrossRef]
2. Caliebe, A.; Leverkus, F.; Antes, G.; Krawczak, M. Does big data require a methodological change in medical
research? BMC Med. Res. Methodol. 2019, 19, 125. [CrossRef]
3. Indrayan, A. Reporting of Basic Statistical Methods in Biomedical Journals: Improved SAMPL Guidelines.
Indian Pediatr. 2020, 57, 43–48. [CrossRef]
4. Hammouri, H.M.; Sabo, R.T.; Alsaadawi, R.; Kheirallah, K.A. Handling Skewed Data: A Comparison of Two
Popular Methods. Appl. Sci. 2020, 10, 6247. [CrossRef]
5. Nieminen, P. Ten points for high-quality statistical reporting and data presentation. Appl. Sci. 2020, 10, 3885.
[CrossRef]
6. Weon, B.M. Stretched exponential survival analysis for South Korean females. Appl. Sci. 2019, 9, 4230.
[CrossRef]
7. Chen, S.H.; Pai, F.Y.; Yeh, T.M. Using the importance-satisfaction model and service quality performance
matrix to improve long-term care service quality in Taiwan. Appl. Sci. 2020, 10, 85. [CrossRef]
8. Lei, Z.; Fasanella, L.; Martignetti, L.; Li-Jessen, N.Y.K.; Mongeau, L. Investigation of vocal fatigue using a
dose-based vocal loading task. Appl. Sci. 2020, 10, 1192. [CrossRef] [PubMed]
9. Bielińska, A.; Bielińska-Waz, D.; Waz, P. Classification maps in studies on the retirement threshold. Appl. Sci.
2020, 10, 1282. [CrossRef]
10. Nieminen, P.; Kaur, J. Reporting of data analysis methods in psychiatric journals: Trends from 1996 to 2018.
Int. J. Methods Psychiatr. Res. 2019, 28, e1784. [CrossRef] [PubMed]
11. Nieminen, P.; Vähänikkilä, H. Use of data analysis methods in dental publications: Is there evidence of a
methodological change? Publications 2020, 8, 9. [CrossRef]
12. Park, J.; Rho, M.J.; Moon, H.W.; Lee, J.Y. Castration-resistant prostate cancer outcome prediction using
phased long short-term memory with irregularly sampled serial data. Appl. Sci. 2020, 10, 2000. [CrossRef]
13. Moons, E.; Khanna, A.; Akkasi, A.; Moens, M.F. A comparison of deep learning methods for ICD
coding of clinical records. Appl. Sci. 2020, 10, 5262. [CrossRef]
14. De la Torre, J.; Marin, J.; Ilarri, S.; Marin, J.J. Applying machine learning for healthcare: A case study on
cervical pain assessment with motion capture. Appl. Sci. 2020, 10, 5942. [CrossRef]
15. Alhassan, A.M.; Wan Zainon, W.M.N. Taylor Bird Swarm Algorithm Based on Deep Belief Network for
Heart Disease Diagnosis. Appl. Sci. 2020, 10, 6626. [CrossRef]
16. Melton, B.L. Systematic Review of Medical Informatics–Supported Medication Decision Making.
Biomed. Inform. Insights 2017, 9, 117822261769797. [CrossRef] [PubMed]
17. Clavel, D.; Mahulea, C.; Albareda, J.; Silva, M. A decision support system for elective surgery scheduling
under uncertain durations. Appl. Sci. 2020, 10, 1937. [CrossRef]
18. Rodríguez, J.; Ochoa-Ruiz, G.; Mata, C. A Prostate MRI Segmentation Tool Based on Active Contour Models
Using a Gradient Vector Flow. Appl. Sci. 2020, 10, 6163. [CrossRef]
19. Hernández-Ordoñez, M.; Nuño-Maganda, M.A.; Calles-Arriaga, C.A.; Rodríguez-Leon, A.;
Ovando-Chacon, G.E.; Salazar-Hernández, R.; Montaño-Rivas, O.; Canseco-Cortinas, J.M. Medical Assistant
Mobile Application for Diabetes Control by Simulating a Compartmental Model. Appl. Sci. 2020, 10, 6846.
[CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional
affiliations.
© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Article
Handling Skewed Data: A Comparison of Two
Popular Methods
Hanan M. Hammouri 1,*, Roy T. Sabo 2, Rasha Alsaadawi 1 and Khalid A. Kheirallah 3
1 Department of Mathematics and Statistics, Faculty of Arts and Science,
Jordan University of Science and Technology, Irbid 22110, Jordan; [email protected]
2 Department of Biostatistics, School of Medicine, Virginia Commonwealth University,
Richmond, VA 23298, USA; [email protected]
3 Department of Public Health, Faculty of Medicine, Jordan University of Science and Technology,
Irbid 22110, Jordan; [email protected]
* Correspondence: [email protected]
Abstract: Scientists in biomedical and psychosocial research need to deal with skewed data all the
time. In the case of comparing means from two groups, the log transformation is commonly used as a
traditional technique to normalize skewed data before utilizing the two-group t-test. An alternative
method that does not assume normality is the generalized linear model (GLM) combined with
an appropriate link function. In this work, the two techniques are compared using Monte Carlo
simulations, each consisting of many iterations that simulate two groups of skewed data for three different
sampling distributions: gamma, exponential, and beta. Afterward, both methods are compared
regarding Type I error rates, power rates and the estimates of the mean differences. We conclude that
the t-test with log transformation had superior performance over the GLM method for any data that
are not normal and follow beta or gamma distributions. Alternatively, for exponentially distributed
data, the GLM method had superior performance over the t-test with log transformation.
Keywords: biostatistics; GLM; skewed data; t-test; Type I error; power simulation; Monte Carlo
1. Introduction
In the biosciences, with the escalating numbers of studies involving many variables and subjects,
there is a belief among non-biostatistician scientists that the sheer amount of data will simply reveal all
there is to understand from it. Unfortunately, this is not always true. Data analysis can be significantly
simplified when the variable of interest has a symmetric distribution (preferably a normal distribution)
across subjects, but usually this is not the case. The need for this desirable property can be avoided by
using very complex modeling, but such models might give results that are harder to interpret and
inconvenient to generalize, so a high level of expertise in data analysis becomes a necessity.
As biostatisticians with the main responsibility for collaborative research in many biosciences’
fields, we are commonly asked the question of whether skewed data should be dealt with using
transformation and parametric tests or using nonparametric tests. In this paper, the Monte Carlo
simulation is used to investigate this matter in the case of comparing means from two groups.
Monte Carlo simulation is a systematic method of doing what-if analysis that is used to assess the
reliability of the results of different analyses and to draw inferences about the relationship between
variation in the input criteria and variation in the conclusions [1]. Monte Carlo simulation,
which is a handy statistical tool for analyzing uncertain scenarios by providing in-depth evaluations of multiple
different scenarios, was first used by John von Neumann and Stanislaw Ulam in the 1940s. Nowadays,
Monte Carlo simulation describes any simulation that includes repeated random generation of samples
and the study of the performance of statistical methods over population samples [2]. Information obtained
from random samples is used to estimate the distributions and obtain statistical properties for different
situations. Moreover, simulation studies, in general, are computer experiments that are associated with
creating data by pseudo-random sampling. An essential asset of simulation studies is the capability
to understand and study the performance of statistical methods because parameters of distributions
are known in advance from the process of generating the data [3]. In this paper, the Monte Carlo
simulation approach is applied to find the Type I error and power for both statistical methods that we
are comparing.
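To make the idea concrete, the sketch below illustrates how such a simulation estimates a Type I error rate: two groups are repeatedly drawn from the same gamma distribution (so the null hypothesis of equal means is true), the two-group t-test is applied to the log-transformed values, and the proportion of rejections at the 5% level is recorded. The original analyses were carried out in SAS; this Python version, including the function name and parameter values, is only an illustrative analogue of the procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)

def type_one_error_log_t(shape, scale, n, n_sim=10_000, alpha=0.05):
    """Estimate the Type I error of the two-group t-test on log-transformed
    gamma samples when both groups share the same parameters (H0 is true)."""
    rejections = 0
    for _ in range(n_sim):
        group1 = rng.gamma(shape, scale, size=n)
        group2 = rng.gamma(shape, scale, size=n)  # same distribution -> equal means
        _, p_value = stats.ttest_ind(np.log(group1), np.log(group2))
        rejections += p_value < alpha
    return rejections / n_sim

# With 10,000 iterations, the estimate should land near the nominal 0.05 level.
print(type_one_error_log_t(shape=2, scale=1, n=25))
```

Estimating power follows the same pattern, except that the two groups are drawn from distributions with different means and the rejection proportion is then the empirical power.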
Now, it is necessary to explain the aspects of the problem we are investigating. First, the normal
distribution holds a central place in statistics, with many classical statistical tests and methods requiring
normally or approximately normally distributed measurements, such as the t-test, ANOVA, and linear
regression. As such, before applying these methods or tests, the normality of the measurements should be
assessed using visual tools such as the Q–Q plot, P–P plot, histogram, or boxplot, or statistical tests such as the
Shapiro–Wilk, Kolmogorov–Smirnov, or Anderson–Darling tests. Some work has been done to compare
formal statistical tests with Q–Q plot visualization using simulations [4,5].
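As a minimal illustration of such checks (assuming Python with scipy rather than the SAS/JMP tools used in the paper, and an arbitrary simulated sample), both a formal test and the numerical ingredients of a Q–Q plot can be obtained in a few lines:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.gamma(2.0, 1.0, size=100)      # right-skewed example data

# Formal tests: small Shapiro-Wilk p-values, or Anderson-Darling statistics that
# exceed the tabulated critical values, indicate departure from normality.
print("Shapiro-Wilk p-value:", stats.shapiro(sample).pvalue)
print("Anderson-Darling statistic:", stats.anderson(sample, dist="norm").statistic)

# Q-Q plot ingredients: theoretical vs. ordered sample quantiles and their correlation.
(theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print("Q-Q correlation:", r)                # values well below 1 suggest non-normality
```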
When testing the difference between two population means with a two-sample t-test, normality
of the data is assumed. Therefore, actions that improve the normality of such data must be taken
before utilizing the t-test. One suggested method for right-skewed measurements is the logarithmic
transformation [6]. For example, measurements in biomedical and psychosocial research can often
be modelled with log-normal distributions, meaning the values are normally distributed after log
transformation. Such log transformations can help to meet the normality assumptions of parametric
statistical tests, which can also improve graphical presentation and interpretability (Figure 1a,b).
The log transformation is simple to implement, requires minimal expertise to perform, and is available
in basic statistical software [6].
Figure 1. Simulated data from a gamma distribution before and after log transformation.
(a) The histogram of the sample before the application of the log transformation with fitted normal
and kernel curves; (b) the histogram of the sample after the application of the log transformation with
fitted normal and kernel curves.
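A short sketch of the intended effect is given below (again in Python rather than the software used in the paper; the log-normal sample is an assumption chosen so that the transformation works by construction), together with the re-check of normality that should follow any transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.lognormal(mean=1.0, sigma=0.8, size=100)   # right-skewed, log-normal by construction

print("skewness before:", stats.skew(x))                     # strongly positive
print("Shapiro p-value before:", stats.shapiro(x).pvalue)    # essentially always < 0.05 here

y = np.log(x)                                       # normal after the log transform
print("skewness after:", stats.skew(y))                      # close to zero
print("Shapiro p-value after:", stats.shapiro(y).pvalue)     # typically > 0.05

# For data that are not truly log-normal, the transformed values can remain skewed,
# which is why normality must be examined again after the transformation.
```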
However, while the log transformation can decrease skewness, log-transformed data are not
guaranteed to satisfy the normality assumption [7]. Thus, the normality of the data should also be
checked after transformation. In addition, the use of log transformations can lead to mathematical
errors and misinterpretation of results [6,8].
Similarly, the attitudes of regulatory authorities profoundly influence the trials performed by
pharmaceutical companies; Food and Drug Administration (FDA) guidelines state that unnecessary data
transformation should be avoided, raising doubts about using transformations. If data transformation
is performed, a justification for the optimal data transformation, aside from the interpretation of the
estimates of treatment effects based on transformed data, should be given. An industry statistician
should not analyze the data using several transformations and choose the transformation that yields
8
Appl. Sci. 2020, 10, 6247
the most satisfactory results. Unfortunately, the guideline includes the log transformation with all
other kinds of transformation and gives it no special status [9].
An alternative approach is the generalized linear model (GLM), which does not require the
normality of the data to test for differences between two populations. The GLM refers to a broad class of models
first introduced by Nelder and Wedderburn in 1972 and further developed by McCullagh and Nelder in 1989 [10,11].
The GLM was presented as a general framework for dealing with a variety of standard statistical models
for both normal and non-normal data, like ANOVA, logistic regression, multiple linear regression,
log-linear models, and Poisson regression. The GLM can be considered as a flexible generalization
of ordinary linear regression, which extends the linear modeling framework to response variables
that have non-normal error distributions [12]. It generalizes linear regression by connecting the linear
model to the response variable via a link function, and by permitting the magnitude of the variance of
each measurement to be a function of its expected value [10].
The GLM consists of:
i A linear predictor
ηi = β0 + β1 x1i + · · · + βp xpi = Xβ, (1)
ii A link function g that describes how the mean, E( yi ) = μi , depends on the linear predictor, g(μi ) = ηi
iii A variance function that defines how the variance, Var( yi ), depends on the mean
Var( yi ) = φV (μi ), where the dispersion parameter φ is a constant. Replacing the μi in V (μi ) with
g−1 (ηi ) also makes the variance a function of the linear predictor.
In the GLM, the form of E( yi ) and Var( yi ) are determined by the distribution of the dependent
variable yi and the link function g. Furthermore, no normality assumption is required [13,14]. All the
major statistical software platforms such as STATA, SAS, R and SPSS include facilities for fitting GLMs
to data [15].
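As an informal sketch of how a two-group comparison looks in this framework (using Python's statsmodels as a stand-in for the SAS procedures used in the paper; the simulated data and the choice of a gamma family with a log link are illustrative assumptions), the group effect is tested directly on the untransformed responses:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Two skewed groups of 50 observations each; the second group has a larger mean.
y = np.concatenate([rng.gamma(2.0, 1.0, size=50), rng.gamma(2.0, 1.5, size=50)])
group = np.repeat([0, 1], 50)

X = sm.add_constant(group)                  # intercept + group indicator
gamma_log = sm.families.Gamma(link=sm.families.links.Log())
fit = sm.GLM(y, X, family=gamma_log).fit()

# The Wald p-value for the group coefficient tests the group difference on the
# untransformed scale, without assuming normality of y.
print(fit.pvalues[1])
```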
Because finding appropriate transformations that simultaneously provide constant variance and
approximate normality can be challenging, the GLM becomes a more convenient choice, since the
choice of the link function and the choice of the random component (which specifies the probability distribution
for the response variable Y) are separated. If a link function is convenient in the sense that the inverse-linked
linear model of explanatory variables adheres to the support of the expected value for that outcome,
it does not also need to stabilize variance or produce normality; this is because the fitting process
maximizes the likelihood for the chosen probability distribution of Y, and that choice is not limited
to normality [16]. Moreover, the transformations used on data are often undefined on the boundary
of the sample space, like the log transformation with a zero-valued count or a proportion. Generalized
linear models are now pervasive in much of applied statistics and are valuable in environmetrics,
where non-normal data, such as counts or skewed frequency distributions, are frequently encountered [17].
Lastly, it is worth mentioning that the two methods discussed here are not the only methods
available to handle skewed data. Many nonparametric tests can be used, though their use requires
the researcher to re-parameterize or reformat the null and alternative hypotheses. For example,
the Wilcoxon–Mann–Whitney (WMW) test is an alternative to the t-test. Yet, the two have quite different
hypotheses; whereas the t-test compares population means under the assumption of normality, the WMW
test compares medians, regardless of the underlying distribution of the outcome; the WMW test can
also be thought of as comparing distributions transformed to the rank-order scale [18]. Although
the WMW and other tests are valid alternatives to the two-sample t-test, we will not consider them
further here.
In this work, the two-group t-test on log-transformed measures and the generalized linear model
(GLM) on the un-transformed measures are compared. Through simulation, we study skewed data
from three different sampling distributions to test the difference between two-group means.
means difference of the original untransformed data (which should be close to zero under H0) to see
which of the two methods gave mean difference estimates that are not significantly different from
the estimates of the actual mean difference. We also compared the estimates of the difference of the
standard deviations between the log-transformed and the original data under the assumption that H0
is true (while testing Type I error), so that we could use the pooled standard deviation.
Figure 2. Simulation and Hypothesis Testing algorithm (for Type I error, we simulated data from
distributions with the same means, and for power, we simulated data from distributions with
different means).
In three applications to real-life data, we applied the two methods to determine whether the
methods give consistent or contradicting results. Using visual inspection, as in the simulation study,
Q–Q plots were used to test the normality of the data before and after the application of the log
transformation, to make sure that the targeted variables were not normal before transformation and
became normal after transformation. After that, we used the t-test. Then, we used the bias-corrected
Akaike information criterion (AICc) after fitting different continuous distributions to determine which
distribution and link function to use with the GLM model [19,20]. Finally, we compared the results
from both models. We generated all simulated data and performed all procedures using SAS codes.
Moreover, the data that support the findings in the real-life applications of this study are openly
available from the JMP software.
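A condensed sketch of this workflow is given below (Python with scipy and statsmodels rather than the SAS/JMP tools actually used; the candidate distributions, helper names, and simulated data are illustrative assumptions). It fits gamma and exponential candidates by maximum likelihood, compares them with AICc, and then runs both the t-test on log-transformed data and the GLM on the original data:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def aicc(loglik, k, n):
    """Bias-corrected Akaike information criterion."""
    return 2 * k - 2 * loglik + (2 * k * (k + 1)) / (n - k - 1)

def pick_family(y):
    """Compare gamma and exponential fits by AICc (a simplified stand-in
    for the distribution fitting done in the paper)."""
    n = len(y)
    a, _, scale_g = stats.gamma.fit(y, floc=0)              # 2 free parameters
    ll_gamma = stats.gamma.logpdf(y, a, loc=0, scale=scale_g).sum()
    _, scale_e = stats.expon.fit(y, floc=0)                 # 1 free parameter
    ll_expon = stats.expon.logpdf(y, loc=0, scale=scale_e).sum()
    scores = {"gamma": aicc(ll_gamma, 2, n), "exponential": aicc(ll_expon, 1, n)}
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(11)
g1, g2 = rng.gamma(3, 2, 200), rng.gamma(3, 2.4, 200)

best, scores = pick_family(np.concatenate([g1, g2]))
print("best-fitting candidate:", best, scores)

# Method 1: two-group t-test on the log-transformed data.
print("t-test p-value:", stats.ttest_ind(np.log(g1), np.log(g2)).pvalue)

# Method 2: GLM on the original data with the selected family and a log link.
y = np.concatenate([g1, g2])
X = sm.add_constant(np.repeat([0, 1], [len(g1), len(g2)]))
glm_fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print("GLM p-value:", glm_fit.pvalues[1])
```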
3. Results
Table 1. Alpha values for gamma-distributed data tested using t-test and GLM.
Example   Sample Size   Parameters (Shape, Scale) 1   t-Test Alpha (α1)   GLM Alpha (α2)   Diff. = α1 − α2
1 25 2, 1 0.0501 0.0490 0.0011
2 50 2, 1 0.0502 0.0499 0.0003
3 100 2, 1 0.0502 0.0499 0.0003
4 25 3, 2 0.0501 0.0526 −0.0024
5 50 3, 2 0.0502 0.0512 −0.0010
6 100 3, 2 0.0500 0.0507 −0.0007
7 25 5, 1 0.0501 0.0555 −0.0054
8 50 5, 1 0.0499 0.0526 −0.0027
9 100 5, 1 0.0500 0.0517 −0.0018
10 25 4, 0.25 0.0503 0.0549 −0.0046
11 50 4, 0.25 0.0503 0.0536 −0.0033
12 100 4, 0.25 0.0502 0.0516 −0.0014
13 25 6, 3 0.0501 0.0562 −0.0062
14 50 6, 3 0.0501 0.0533 −0.0031
15 100 6, 3 0.0500 0.0517 −0.0017
16 25 9, 0.5 0.0502 0.0577 −0.0075
17 50 9, 0.5 0.0500 0.0535 −0.0035
18 100 9, 0.5 0.0500 0.0516 −0.0016
19 25 3, 0.5 0.0501 0.0529 −0.0029
20 50 3, 0.5 0.0501 0.0514 −0.0012
21 100 3, 0.5 0.0501 0.0511 −0.0010
1 The two simulated groups have the same parameters.
Table 2 contains summary information for the Type I error results for the gamma, exponential and
beta distributed data. For the exponentially distributed data examples, the Type I error rates for the
GLM across all parameter values were lower than the t-test rates in all of the settings, though the t-test
Type I error rates did not exceed 0.0504 (0.0497 to 0.0504 for the t-test; 0.0367 to 0.0477 for the GLM).
The t-test Type I error rates for beta distributed outcomes were lower than those from the GLM in all
settings (0.0497 to 0.0511 for the t-test; 0.0518 to 0.0589 for the GLM). Figure 3 shows the Type I error
for both methods, which are compared to 0.05 for the three distributions.
Table 2. Summary of Type I error rates for gamma, exponential and beta distributed data tested using
the t-test and GLM.
Dist.   t-Test Average α   t-Test Min α   t-Test Max α   GLM Average α   GLM Min α   GLM Max α
Gamma 0.0501 0.0499 0.0503 0.0525 0.0490 0.0577
Exponential 0.0502 0.0497 0.0504 0.0432 0.0367 0.0477
Beta 0.0503 0.0497 0.0511 0.0544 0.0518 0.0589
Figure 3. Type I error rates for the t-test and GLM compared to 0.05 for the three distributions.
Table 3. Power values for beta distributed data tested using the t-test and GLM: P0 is the power value
of the two-group t-test calculated prior to the simulations, P1 is the empirical power of the two-group
t-test conducted on the log-transformed groups, and P2 is the empirical power of the GLM-fitted data.
Example   Sample Size   Group1 Shape Parameters (α, β)   Group2 Shape Parameters (α, β)   Original t-Test Power (P0)   t-Test Power (P1)   GLM Power (P2)   Power Diff. = P1 − P2
1 25 5, 3 5, 5 0.792 0.8217 2 0.8531 2 −0.0314
2 50 5, 3 5, 5 0.977 0.9815 2 0.9858 2 −0.0043
3 100 5, 3 5, 5 >0.999 0.9998 2 0.9999 2 −0.0001
4 25 2, 3 0.5, 0.5 0.224 0.1158 0.2042 −0.0884
5 25 2, 3 1, 3 0.75 0.7633 2 0.7532 0.0101
6 50 2, 3 1, 3 0.965 0.9552 0.9456 0.0095
7 25 2, 5 1, 3 0.108 0.1262 2 0.1142 2 0.0120
8 50 2, 5 1, 3 0.17 0.2071 2 0.1718 2 0.0352
9 25 2, 2 2, 5 0.967 0.9637 0.9739 2 −0.0102
10 50 2, 2 2, 5 >0.999 0.9995 2 0.9997 2 −0.0002
11 100 2, 2 2, 5 >0.999 1.0000 2 1.0000 2 0.0000
12 25 2, 3 2, 2 0.371 0.3909 2 0.4318 2 −0.0409
13 50 2, 3 2, 2 0.645 0.6374 0.6723 2 −0.0350
14 100 2, 3 2, 2 0.912 0.8824 0.9021 −0.0197
15 25 2, 3 2, 5 0.586 0.5948 2 0.6321 2 −0.0373
16 50 2, 3 2, 5 0.876 0.8599 0.8788 2 −0.0189
17 100 2, 3 2, 5 0.993 0.9874 0.9903 −0.0029
18 25 1, 3 2, 2 0.985 0.9822 0.9828 −0.0005
19 50 1, 3 2, 2 >0.999 0.9998 2 0.9998 2 0.0000
2 Power exceeds the original t-test expected power P0 .
Figure 4. Power rates for the t-test and GLM compared to the computed power rates for the
three distributions.
Table 4. Summary of power values for exponential and gamma-distributed data tested using the
t-test and GLM: P0 is the power value of the two-group t-test calculated prior to the simulations, P1 is
the empirical power of the two-group t-test conducted on the log-transformed groups, and P2 is the
empirical power of the GLM-fitted data.
Table 5. Mean differences estimates’ values for original, log-transformed and GLM-fitted exponential
distributed data.
Finally, we compared the estimates of the standard deviation between the log-transformed data
and the original skewed data. According to the resulting p-values, the estimates of the pooled standard
deviation of the log-transformed data were significantly smaller than the estimates of the pooled
standard deviation of the original skewed data in all the tested examples, as expected. Table 6 presents
the parameters and the estimated values of the standard deviation for the original and log-transformed
gamma-distributed data. This component, however, was not provided by the GLM procedure.
Table 6. Standard deviation estimates for the original and log-transformed gamma-distributed data.
Example   Sample Size   Parameters (Shape, Scale)   Trans. Mean   Actual Mean   Two-Group t-Test between SD(x1) and SD(y1)
1 25 2, 1 0.8839 1.4017
2 50 2, 1 0.8767 1.3931
3 100 2, 1 0.87 1.3878
4 25 3, 2 1.0554 3.518
5 50 3, 2 1.0456 3.4678
6 100 3, 2 1.045 3.4814
7 25 5, 1 0.72 2.313
8 50 5, 1 0.7135 2.2487
9 100 5, 1 0.7116 2.2406
10 25 4, 0.25 0.4167 0.5086
11 50 4, 0.25 0.4088 0.4941
12 100 4, 0.25 0.4062 0.4898
13 25 6, 3 0.7747 7.6397
14 50 6, 3 0.7695 7.4241
15 100 6, 3 0.7657 7.4052
16 25 9, 0.5 0.487 1.567
17 50 9, 0.5 0.484 1.5254
18 100 9, 0.5 0.482 1.5037
19 25 3, 0.5 0.6107 0.8733
20 50 3, 0.5 0.6023 0.8558
21 100 3, 0.5 0.6003 0.8534
Figure 5. The normal Q–Q plots (top) for the Triglycerides variable before and after transformation,
and the frequency histogram of the Triglycerides variable (bottom left) fitted with a gamma
distribution (green).
Figure 6. The normal Q–Q plots (top) for the weight variable before and after transformation, and the
frequency histogram of the weight variable (bottom left) fitted to a gamma distribution (red).
Then, the t-test after log transformation was used; we obtained t(394.29) = −3.39 with a p-value = 0.0004,
indicating a significant difference between the two groups; the average weight for the children who lost
the IV was lower than that of the other group, who did not lose the IV. An application of the
GLM to the original data with a gamma distribution and log link function returned a p-value = 0.0042.
For these data, the results from the t-test with log-transformed data and the GLM indicated evidence of
a difference in mean for the weight variable.
A p-value of 0.4908 indicates that there is no significant difference between the assets' means of the
two types of companies. So, the two methods gave contradicting results, and this puts us at a
crossroads: we need to decide which result to adopt.
Figure 7. The normal Q–Q plots (top) for the Assets variable before and after transformation, and the
frequency histogram of the Assets variable (bottom left) fitted to an exponential distribution (red).
4. Discussion
In this work, we compared the use of a t-test on log-transformed data and the use of GLM on
untransformed skewed data. The log transformation was studied because it is one of the most common
transformations used in biosciences research. If such an approach is used, the scientist must be careful
about its limitations, especially when interpreting the significance of the analysis of transformed
data for the hypothesis of interest about the original data. Moreover, a researcher who uses the log
transformation should know how to use log-transformed data to draw inferences concerning
the original data. Furthermore, the log transformation does not always help make data less variable
or more normal and may, in some situations, make data more variable and more skewed. For that,
the variability and normality should always be examined after applying the log transformation.
On the other hand, the GLM was used because the inferences of other nonparametric tests concern
medians and not means. In addition, GLM models deal differently with response variables depending on
their population distributions, which provides the scientist with flexibility in modeling; the GLM allows
response variables to have different distributions, and each distribution has an appropriate link
function through which the mean varies linearly with the predictors.
Each comparison was made for two simulated groups from several sampling distributions with
varying sample sizes. The comparisons regarded Type I error rates, power rates and estimates of the
means. Overall, the t-test method with transformed data produced smaller Type I error rates and closer
estimations. The GLM method, however, produced a higher power rate compared to t-test methods,
though both reported acceptable power rates.
For gamma distribution, Type I error rates in the t-test case were very close to 0.05 (0.0497 to
0.0504), while the Type I error rates of the GLM method had a wider range (0.0490 to 0.0577). For most
examples in the gamma-distributed data, the Type I error rates of the t-test method were smaller than
the respective rates in the GLM method. Regarding power, the GLM rates were higher in about 85%
of the settings than those using the t-test, with absolute values of differences ranging from 0.000
to 0.363.
The back-transformed estimates of the mean differences in the t-test case were not significantly
different from the estimates of the original data mean differences in the t-test method. The GLM
estimates, in contrast, were significantly different from the estimates of the original data. So, if we are
looking for lower Type I error and closer estimates, we can use the t-test method with transformed
data. However, if we are looking for a method with higher power rates, we recommend choosing the
GLM method.
In the exponentially distributed data, the GLM method achieved a noticeably lower Type I
error rate and higher power than the t-test method in most of the settings. Regarding the estimates,
the t-test method gave closer estimates. Despite the closer estimates for the t-test method, our advice is
to use the GLM method.
For beta distributed data, Type I error rates seem to favor the t-test method with transformed
data. The power rates of the GLM method were higher than the power rates in the t-test method,
with absolute values of differences ranging from 0 to 0.0884. Furthermore, by looking at Figure 4,
we can see that the two methods have very close power rates. So, both methods seem to be good
enough in this matter. Nevertheless, since the t-test method has lower Type I error rates and closer estimates
in the beta distributed data, we recommend it over GLM.
The missing rates for some of the parameters’ combinations, especially in calculating power
rates, are due to two reasons. First, in most cases, rates were not missing, but the counts for accepted
simulated samples were less than 10,000. That caused the setting to be rejected. Particularly in the case
of calculating power rates, the two groups are from the same distribution with different parameters,
which made it harder to apply the two filters (both groups should not be normally distributed before
the use of log transformation and normally distributed after the application of log transformation).
Although having fewer than 10,000 accepted samples caused the estimates to vary in response to changes in the random
seed, the conclusion remained the same. For example, if the GLM had a higher power rate in one setting of
parameters, it kept having a higher power rate even if we changed the seed. Yet, we preferred not to
include these settings because we needed the estimates to be reproducible.
Second, in rare cases, none of the samples' normality issues were resolved with the log transformation,
so we had zero accepted simulated samples. As a result, our comparison does not apply to those
parameter combinations. In conclusion, we did not consider the missing rates an issue, since the GLM had a
higher power rate and the t-test had closer mean difference estimates in all other accepted settings.
Our results were consistent across parameter settings and sample sizes (N = 25, 50, 100), so we
expect that differences in sample size will not affect the choice of method, whatever the effect of
sample size on the performance of the methods themselves.
After analyzing the data from our real-life examples, we recommend reporting the results from
the t-test with log-transformed data for the lipid data example and catheters data example, since the
best fit for both examples was a gamma distribution. Because the catheters data example had a larger
sample size (n = 592) than used in our simulations, we conducted additional simulations with a sample
size of 296 per group. Though not reported, these results concurred with our reported findings at
lower sample sizes.
We followed the same steps for the third example (the pharmaceutical and computer data
example), which had the exponential fit as the best fit. We conducted additional (though not reported)
simulations with 16 subjects per group, and again observed results that agreed with our reported
findings. Thus, our recommendation is to report the GLM method results.
Therefore, for any bioscience application, studying the statistical distribution that best fits
the dependent variable can help determine whether a parametric model can reasonably test the
data after the log transformation is used. Alternatively, it may be better to abandon the classic
approach and switch to the GLM method.
Author Contributions: Conceptualization, H.M.H.; Investigation, H.M.H., R.T.S., R.A. and K.A.K.; Methodology,
H.M.H.; Software, H.M.H. and R.A.; Writing—original draft, H.M.H., R.T.S., R.A. and K.A.K.; Writing—review
and editing, H.M.H., R.T.S., R.A. and K.A.K. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Al Garni, H.Z.; Awasthi, A. A Monte Carlo approach applied to sensitivity analysis of criteria impacts
on solar PV site selection. In Handbook of Probabilistic Models, 1st ed.; Samui, P., Bui, D.T., Chakraborty, S.,
Deo, R.C., Eds.; Butterworth-Heinemann: Oxford, UK, 2020.
2. Martinez, W.L.; Martinez, A.R. Computational Statistics Handbook with Matlab, 2nd ed.; CRC Press:
Boca Raton, FL, USA, 2007.
3. Morris, T.P.; White, I.R.; Crowther, M.J. Using simulation studies to evaluate statistical methods. Stat. Med.
2019, 38, 2074–2102. [CrossRef] [PubMed]
4. Ghasemi, A.; Zahediasl, S. Normality tests for statistical analysis: A guide for non-statisticians. Int. J.
Endocrinol. Metabol. 2012, 10, 486. [CrossRef] [PubMed]
5. Wang, C.C.; Lee, W.C. Evaluation of the normality assumption in meta-analyses. Am. J. Epidemiol. 2020, 189,
235–242. [CrossRef] [PubMed]
6. Feng, C.; Wang, H.; Lu, N.; Chen, T.; He, H.; Lu, Y. Log-transformation and its implications for data analysis.
Shanghai Arch. Psychiatry 2014, 26, 105. [CrossRef] [PubMed]
7. Curran-Everett, D. Explorations in statistics: The log transformation. Adv. Physiol. Educ. 2018, 42, 343–347.
[CrossRef] [PubMed]
8. Hassani, H.; Yeganegi, M.R.; Khan, A.; Silva, E.S. The effect of data transformation on singular spectrum
analysis for forecasting. Signals 2020, 1, 2. [CrossRef]
9. Keene, O.N. The log transformation is special. Stat. Med. 1995, 14, 811–819. [CrossRef] [PubMed]
10. Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A 1972, 135, 370–384.
[CrossRef]
11. McCullagh, P.; Nelder, J.A. Generalized Linear Models; CRC Press: Boca Raton, FL, USA, 1989.
12. Song, L.; Langfelder, P.; Horvath, S. Random generalized linear model: A highly accurate and interpretable
ensemble predictor. BMC Bioinf. 2013, 14, 5. [CrossRef]
13. Nelder, J.A.; Baker, R.J. Generalized Linear Models; Wiley Online Library: Hoboken, NJ, USA, 2006.
14. Dobson, A.J.; Barnett, A. An Introduction to Generalized Linear Models; CRC Press: Boca Raton, FL, USA, 2008.
15. Müller, M. Generalized linear models. In Handbook of Computational Statistics; Gentle, J., Härdle, W.,
Mori, Y., Eds.; Springer: Berlin, Germany, 2012; pp. 681–709.
16. Agresti, A. Foundations of Linear and Generalized Linear Models; John Wiley & Sons: Hoboken, NJ, USA, 2015.
17. Jørgensen, B. Generalized linear models. In Encyclopedia of Environmetrics, 2nd ed.; El-Shaarawi, A.H.,
Piegorsch, W.W., Eds.; Wiley: Hoboken, NJ, USA, 2013; Volume 3.
18. Fay, M.P.; Proschan, M.A. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and
multiple interpretations of decision rules. Stat. Surv. 2010, 4, 1. [CrossRef]
19. Claeskens, G.; Hjort, N.L. Model Selection and Model Averaging; Cambridge University Press: Cambridge,
UK, 2008.
20. Lindsey, J.K.; Jones, B. Choosing among generalized linear models applied to medical data. Stat. Med. 1998,
17, 59–68. [CrossRef]
21. Mann, J.; Larsen, P.; Brinkley, J. Exploring the use of negative binomial regression modeling for pediatric
peripheral intravenous catheterization. J. Med. Stat. Inf. 2014, 2, 6. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Commentary
Ten Points for High-Quality Statistical Reporting and
Data Presentation
Pentti Nieminen
Medical Informatics and Data Analysis Research Group, University of Oulu, 90014 Oulu, Finland;
pentti.nieminen@oulu.fi
Featured Application: In this work, an applicable instrument is proposed to quickly evaluate the
quality of the statistical reporting and data presentation in research papers.
Abstract: Background: Data analysis methods have become an essential part of empirical research
papers, especially in health sciences and medical research. It has previously been reported that
a noteworthy percentage of articles have flaws in their statistical reporting. Reporting problems have
been a long-term issue, and despite continued efforts to improve the situation, improvements have
been far from satisfactory. One explanation is an inadequate assessment of statistical reporting during
peer review. This communication proposes a short instrument to assess the quality of data analysis
reporting in manuscripts and published papers. Method: A checklist-type instrument was developed
by selecting and refining items from previous reports about the quality of statistical reporting in
medical journals and from published guidelines for reporting and data presentation. Items were
pretested and modified during pilot studies. A total of 160 original medical research articles that were
published in 4 journals were evaluated to test the instrument. Interrater and intrarater agreements
were examined by comparing quality scores assigned to 40 articles published in a psychiatric journal.
Results: The data analysis reporting test consists of nine questions that assess the quality of health
research from a reader’s perspective. The composite scale has a total score ranging from 0 to 10 and
discriminated between journals and study designs. A high score suggested that an article had a good
presentation of findings in tables and figures and that the description of analysis methods was helpful
to readers. Interrater and intrarater agreements were high. Conclusion: An applicable checklist for
quickly testing the statistical reporting quality of manuscripts and published research papers was
developed. This instrument aims to improve the quality of empirical research in scientific fields
where statistical methods play an important role.
Keywords: data analysis; statistics; reporting; data presentation; publications; medicine; health care
1. Introduction
Statistical reporting plays an important role in medical publications [1]. A high proportion of
medical articles are essentially statistical in their presentation. For the reader, the main outward
demonstration of a publication consists of statistical expressions that summarize the raw data used
in the research. Regardless of the quality of the data and the variables chosen to express the results,
the overt evidence of the research is produced as the lists of numbers, tables, plots, graphs, and other
displays, i.e., the descriptive statistics. The communication of the descriptive statistics may also be
combined with statistical inference procedures, such as test statistics and p-values [2].
The accurate communication of research results to the scientific community (i.e., other researchers)
is essential for providing reliable results [3,4]. Failure to report analysis methods or inadequate data
presentation could lead to inaccurate inferences, even if the statistical analyses were performed correctly.
Several findings have demonstrated that a noteworthy percentage of articles, even those published in
high-prestige journals, have flaws in their statistical reporting [3–8]. Methodological and reporting
problems have been reported over several decades, and despite continued efforts to improve the
situation [9], the improvements have been far from satisfactory. One explanation is the inadequate
assessment of statistical reporting during peer review [8].
Author instructions and editorials alone have been insufficient to change the quality of statistical
reporting. Numerous reporting guidelines offer tools or checklists to evaluate the reporting of health
research [10]. The EQUATOR Network listed 424 such guidelines in March 2020 [11]. What is often
problematic in these checklists is their extent. They try to evaluate all possible aspects of analysis
from study design, settings, participants, description of measurements and variables, sample sizes,
data analysis methods, specific methods, descriptive data, outcome data, and ancillary analysis. Often it
is impossible to evaluate all these points. When reviewing and editing health care and medical papers,
I have noticed that a few points will reveal the quality of reporting of data analysis. There is no
need to read the whole manuscript and check all the statistical details to ensure that the manuscript
contains all the information that readers need to assess the study’s methodology and validity of its
findings. Editorial processes of medical manuscripts often fail to identify obvious indicators of good
and poor data presentation and reporting of results. Editors, reviewers and readers need new tools
to quickly identify studies with inadequate descriptions of statistical methods and poor quality of
data presentation.
I took on the challenge of compacting and refining the available quality criteria for statistical
reporting and data presentation. I had three guiding starting points for assessing statistical reporting.
The first key point was that if the data analysis methods are described with enough details, readers
will be able to follow the flow of the analyses and to critically assess whether the data analyses provide
reliable results. My second key point was the use of tables and figures in data presentation. Tables
and figures are a fundamental means of scientific communication, and they are included in most
submitted and published medical and health care articles [12]. Most readers browse through tables and
illustrations, but few read the whole article. Well-prepared tables and figures with proper titles, clear
labeling, and optimally presented data will empower readers to scrutinize the data. I presume that the
quality of tables and figures indicates the overall quality of data presentation. My third starting key
point was that the new tool should be easy to use. I wanted to develop a short evaluation form with
a low number of items.
In short, my objective was to develop a handy and reliable instrument to assess the quality of
statistical reporting and data presentation that was applicable for a wide variety of medical and health
care research forums, including both clinical and basic sciences. In this article, I define nine items for
the proposed quality test. In addition, I report findings from an initial evaluation of its functionality
and reliability.
2.1. Prework
My intention was to develop a test or checklist that would complement and not replicate other
checklists. The construction of my quality evaluation tool began with a literature search. It included
the following steps:
• Search for papers evaluating the use and misuse of statistics in medical articles.
• Search for papers evaluating statistical reporting and data presentation.
• Review medical statistics textbooks.
• Review statistical reporting guidelines, including journal guidelines for authors.
• Assess my experience as a handling editor and referee for medical and medicine-related journals.
After reading and reviewing the existing body of evidence, I discovered that including
an assessment of statistical errors into a quality instrument would be very difficult. Several studies
have reviewed medical papers and tried to find errors in the selection of statistical procedures [5,8,13].
Most of the statistical problems in medical journals reported in these reviews are related to elementary
statistical techniques [14]. The errors in medical papers are probably more a matter of judgment.
There is no general agreement on what constitutes a statistical error [6,15,16]. The recommended
methods for a particular type of question may not be the only feasible methods, and may not be
universally agreed upon as the best methods [17]. In addition, I decided to avoid any judgment of the
content, such as the originality, ethics, or scientific relevance.
I found hundreds of journal guidelines for statistical reporting. I observed strong heterogeneity
among the guidelines. Some journals have kept these guidelines to the minimum required for a decent
manuscript presentation, whereas a large number of journals displayed too many instructions and
journal-specific recommendations. Some recommendations regarding the test statistics (reporting
of values, degrees of freedom and p-values) or the contents of tables were even contradictory
(between journals). Most reporting guidelines said more about the presentation of methods than
about the reporting of results. In addition, medical statistics textbooks did not cover the topic of
statistical reporting.
By the end of the prework process, I had identified and formulated 20 items phrased as questions
and grouped them into two domains (description of methods and data presentation). The draft
questions were as follows:
Statistical analysis (or Data analysis) subsection in the Material and Methods section:
Data presentation:
• Was a table included where the basic characteristics of the study subjects were summarized with
descriptive statistics?
• Were total and group sample sizes reported for each analysis in all tables and figures?
• Did all tables and figures have a clear and self-explanatory title?
• Were statistical abbreviations explained in all tables and figures?
• Were summary statistics, tests or methods identified and named in all tables and figures?
• Were p-values reported properly (e.g., no expressions like NS, p < 0.05, or p = 0.000)?
• Was the total number of reported p-values and confidence intervals in tables and figures less
than 100?
• Were all tables and figures appropriate?
In line with my original key principles, the evaluation form should include two sections:
(1) Material and Methods section and (2) tables and figures in Results section. In addition, the test form
should be only one page long. I developed several pilot versions to structure the assessment of the
quality of statistical reporting in an article or manuscript. Within these frames I reviewed the 20 draft
questions for their importance and uniqueness with my health care students, who tested the instrument
drafts during their studies. Further testing of the instrument, a review of the literature related to the
topic and expert opinions from my colleagues resulted in the generation of 9 items pertaining to the
description of statistical and data management procedures and the reporting of results in tables and
figures. Following the pilot studies, several items were also reworded, rearranged, and consolidated in
the test form for clarity. Because tables and figures should be self-explanatory, I decided to place the
assessment of tables and figures at the beginning of the evaluation form. Thus, the assessment of the
description of the statistical analysis in the Materials and methods section now follows the assessment
of the tables and figures. The updated version of the instrument is included as Table 1. In the
evaluation form, questions 1–4 relate to basic guidelines for preparing effective tables and figures.
Questions 5–9 evaluate the description of the data analysis methods.
2.2. Ten Items to Assess the Quality of Statistical Reporting and Data Presentation
In the following chapter, I present the items as a sequence of 9 numbered questions and explain
some of the reasoning behind the questions.
2.2.1. Item 1: Was a Table Included Where the Basic Characteristics of the Study Subjects Were
Summarized with Descriptive Statistics?
Most papers reporting an analysis of health care and medical data will at some point use statistics
to describe the sociodemographic characteristics, medical history and main outcome variables of the
study participants. An important motive for doing this is to give the reader some idea of the extent to
which study findings can be generalized to their own local situation [18]. The production of descriptive
statistics is a straightforward matter, but often authors need to decide which statistics to present.
The selected statistics should be included in a paper in a manner that is easy for readers to assimilate.
When many patient characteristics are being described, the details of the statistics used and the number
of participants contributing to analysis are best incorporated in tabular presentation [12,18,19].
I give one point if the main features of important characteristics of the participants were displayed
in a table.
2.2.2. Item 2: Was the Total Number of Participants Provided in all Tables and Figures?
The medical literature shows a strong tendency to accentuate significance testing, specifically
“statistically significant” outcomes. Most papers published in medical journals contain tables and
figures reporting p-values [16,20,21]. Finding statistically significant or nonsignificant results depends
on the sample size [22,23]. When evaluating the validity of the findings, the reader must know the
number of study participants. Sample size is an important consideration for research. A larger sample
size leads to a higher level of precision and thus a higher level of power for a given study to detect
an effect of a given size. An excessively large number of study participants may lead to statistically
significant results even when the findings have no clinical relevance; an inappropriately small number of study
participants can fail to reveal important and clinically significant differences. The total number of
participants or sample size of each group should be clearly reported.
I give one point if most of the tables and figures (at least 75%) have provided the total number of
participants or number of participants in each subgroup.
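To illustrate how strongly statistical significance depends on the sample size, here is a minimal sketch using statsmodels; the effect sizes, power target, and sample sizes below are illustrative assumptions, not values from the evaluated articles.

```python
# Illustration (assumed numbers): power and sample size for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Participants per group needed to detect a medium effect (d = 0.5) with 80% power.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)

# With 5000 participants per group, even a clinically trivial effect (d = 0.07)
# is detected with high probability, i.e., it will often be "statistically significant".
power_trivial = analysis.solve_power(effect_size=0.07, nobs1=5000, alpha=0.05)

print(round(n_per_group), round(power_trivial, 2))  # roughly 64 and 0.94
```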
2.2.3. Item 3: Were Summary Statistics, Tests, and Methods Identified and Named in all Tables
and Figures?
Tables and figures should be able to stand alone [24]. That is, all information necessary for
interpretation should be included within the table or figure, legend or footnotes [25]. This means
that descriptive statistics, significance tests, and multivariable modeling methods used are named.
Many readers will skim an article before reading it closely, and identifying the data analysis methods in the tables or figures allows them to understand the procedures immediately. It is not necessary to
define well known statistical abbreviations, such as SD, SE, OR, RR, HR, CI, r, R2, n, or NA, but the
methods producing these statistics should be named.
I give one point if the summary statistics, tests, and methods in the tables and figures were named.
2.2.4. Item 4: Were All Tables and Figures Appropriate?
The following presentation issues were typical in tables and figures:
• p-Values were denoted with asterisks or with a system of letters in tables or figures, and actual p-values were not reported. Actual p-values should be reported, without false precision, whenever feasible. Providing the actual p-values prevents problems of interpretation related to p-values close to 0.05 [12,22]. Very small p-values do not need exact representation, and p < 0.001 is usually sufficient (a small formatting helper is sketched at the end of this item).
• Numbers were not reported with an appropriate degree of precision in tables. Numbers presented with several decimals make it difficult for the reader to focus on the findings.
• The standard error of the mean (SE) was used to indicate the variability of a data set.
• Confidence intervals were not reported with the effect sizes (regression coefficients, ORs, HRs, or IRRs) in regression analyses or meta-analyses. The results of the primary comparisons should always be reported with confidence intervals [27].
• A table included only p-values. A p-value cannot tell readers the strength or size of an effect, change, or relationship. In the end, patients and physicians want to know the magnitude of the benefit, change, or association, not the statistical significance of individual studies [23,28].
I give zero points if more than 50% of the tables and figures included presentation issues.
I give one point if 50% or fewer of the tables and figures included presentation issues.
I give two points if all tables and figures were prepared efficiently and accurately, and the
presentation issues mentioned above were avoided.
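As a small illustration of the p-value guidance above (my own helper, not part of the proposed instrument), the following function reports actual p-values without false precision and reserves "p < 0.001" for very small values.

```python
# Hypothetical helper for reporting p-values as discussed above: no "NS",
# no "p = 0.000", actual values without false precision, "<0.001" when tiny.
def format_p(p: float) -> str:
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.3f}" if p < 0.1 else f"p = {p:.2f}"

print(format_p(0.000004))  # p < 0.001
print(format_p(0.0492))    # p = 0.049
print(format_p(0.37))      # p = 0.37
```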
2.2.5. Item 5: Was a Statistical Analysis (or Data Analysis) Subsection Provided in the
Methods Section?
Most general reporting guidelines and recommendations require that original research articles
include a Methods section [10,24,29]. In these recommendations, it is stated that the Methods section
should aim to be sufficiently detailed such that others with access to the data would be able to reproduce
the results. This section should include at least the following subsections: Selection and description of
participants, technical information about variables (primary and secondary outcomes, explanatory
variables, other variables) and statistical methods. To many researchers, it seems quite obvious that
a Statistical analysis (or Data analysis) subsection with a clear subheading should be provided in
the Materials and Methods section when the manuscript contains some elements of statistical data
analysis. However, in my experience, this is not obvious to all biomedical or health science researchers.
I have reviewed several biomedical manuscripts for journals wherein the laboratory experiments
were described in depth, but nothing was said about how the reported p-values were obtained.
When statistical methods are described with enough detail in the Statistical analysis subsection,
a knowledgeable reader can judge the appropriateness of the methods for the study and verify the reported results.
I therefore give one point if the Methods section included a subsection headed with Statistical
analysis (or Data analysis or Statistical methods).
2.2.6. Item 6: Did Authors Identify the Variables and Methods for Each Analysis?
Authors know their own work so well that some find it difficult to put themselves in the position
of a reader encountering the study for the first time. As a reader, I often request more information about
some aspect of the data analysis. Within the Statistical analysis section, authors need to explain which
statistical methods or tests were used for each analysis, rather than just listing all the statistical methods
used in one place [30]. In a well-written statistical analysis subsection, authors should also identify the
variables used in each analysis. Readers should be informed which variables were analyzed with each
method. Care must be taken to ensure that all methods are listed and that all tests listed are indeed
applied in the study [31]. The statistical section should be consistent with the Results section.
I give one point if authors describe the goal (research question), the variables and the method
used for each analysis done in the study.
2.2.7. Item 7: Was It Verified that the Data Conformed to the Assumptions and Preconditions of the
Methods Used to Analyze Them?
All basic data analysis methods and multivariable techniques depend on assumptions about
the characteristics of the data [32]. If an analysis is performed without satisfying the assumptions,
incorrect conclusions may be made on the basis of erroneous results. For example, a widely applied
analysis of variance depends on at least three assumptions [33]. In actual data analysis, it is unlikely
that all the assumptions for the analysis of variance will be satisfied. Some statistical assumptions are
essential, while some assumptions are quite lenient. A normal distribution of main variables is a strong
requirement in a number of statistical techniques and should be verified and reported. On the other
hand, the use of a nonparametric significance test instead of a more powerful parametric test should
be justified. If a brief justification is provided, readers may better understand why a specific data
analysis method has been applied [14]. Regression modeling has several constraints or preconditions,
such as linearity, independence between explanatory variables, and number of participants per variable.
In an excellent description of multivariable methods, authors should describe possible limitations
or preconditions.
I give one point if authors describe how the data satisfactorily fulfill the underlying assumptions
and preconditions of the main analysis methods.
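As a minimal sketch of the kind of check discussed in this item, assuming two independent groups of continuous measurements (the data below are synthetic), one can verify normality and variance homogeneity before choosing between a parametric and a nonparametric comparison.

```python
# Sketch with synthetic data: check assumptions, then justify the chosen test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(5.2, 1.1, 40)   # hypothetical outcome values, group A
group_b = rng.normal(5.9, 1.0, 42)   # hypothetical outcome values, group B

normal = all(stats.shapiro(g)[1] > 0.05 for g in (group_a, group_b))
equal_variance = stats.levene(group_a, group_b)[1] > 0.05

if normal and equal_variance:
    result = stats.ttest_ind(group_a, group_b)      # assumptions satisfied
else:
    result = stats.mannwhitneyu(group_a, group_b)   # fall back to a rank-based test
print(result)
```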
2.2.9. Item 9: Was the Statistical Software Used in the Analysis Reported?
Identifying the statistical software package used in the data analysis is important because not all statistical programs use the same algorithms or default options to compute the same statistics.
As a result, the findings may vary from package to package or from algorithm to algorithm. In addition,
privately developed algorithms may not be validated and updated [12].
I give one point if the name and version of the statistical program or package are provided.
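A short example of the kind of information this item asks for: printing the analysis environment so that program and package versions can be copied into the Statistical analysis subsection (the packages named here are only examples).

```python
# Example only: record the software versions actually used in the analysis.
import sys
import numpy, scipy
print(f"Python {sys.version.split()[0]}, NumPy {numpy.__version__}, SciPy {scipy.__version__}")
```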
3. Results
Table 2. Distribution of study design and sample size of the evaluated articles by the publication journal.
Figure 1. Column graph of the data analysis reporting quality score of 160 articles. Articles with poor quality (0–2) are shown in red, those with weak quality (3–4) in yellow, and at least acceptable articles (5–10) in green.
The distribution of the data analysis reporting score is summarized in Table 3 by article
characteristics. The mean (SD) of quality score in all 160 evaluated reports was 5.7 (SD 2.2). The reporting
quality was highest in the meta-analyses and lowest in laboratory works. The data analysis reporting
quality was associated with the sample size of the evaluated study.
Table 3. Mean and standard deviation of the data analysis reporting score from 160 articles by study
design and sample size.
                              Number of Articles    Mean    Standard Deviation
Study design
• Observational studies                78             5.5           2.2
• Experimental studies                 48             6.5           1.8
• Reliability                           5             5.4           2.6
• Laboratory works                     18             3.9           1.9
• Meta-analysis                        11             7.4           2.2
Sample size
• <99                                  37             4.7           2.2
• 100–499                              37             5.6           2.0
• 500–2999                             40             7.2           1.5
• >3000                                38             6.0           1.9
• Missing                               8             2.9           1.0
All articles                          160             5.7           2.2
p-Value of ANOVA < 0.001 for study design and sample size.
Table 4. The distributions of the data analysis reporting test items by the journals.
                                            Lancet       JAMA Psy     IJERPH       JD           All          p-Value
Materials and methods section
Statistical analysis subsection provided    37 (92.5)    37 (92.5)    30 (75.0)    28 (70.0)    132 (82.5)   0.011
Variables identified                        23 (57.5)    35 (87.5)    22 (55.0)    22 (55.0)    102 (63.7)   0.005
Assumptions verified                        14 (35.0)    14 (35.0)    11 (27.5)    15 (37.5)     54 (33.8)   0.858
Software reported                           32 (80.0)    31 (77.5)    29 (72.5)    31 (77.5)    123 (76.9)   0.911
References to statistical literature        14 (35.0)    25 (62.5)    14 (35.0)    11 (27.5)     64 (40.0)   0.008
Total number of articles                    40           40           40           40           160
JAMA Psy = JAMA Psychiatry, IJERPH = International Journal of Environmental Research and Public Health
and JD = Journal of Dentistry.
Table 4 also compares the prevalence of adequate description of methods in the four journals.
The failure to verify that the data conformed to the assumptions and preconditions of the methods
used was very common (66.2%). A total of 28 articles (17.5%) did not provide a statistical analysis
subsection; 60.0% (96 articles) did not provide references to statistical literature; and 23.1% (37 articles)
did not report the statistical software.
Figure 2 shows how the quality score varies by journal. The high-impact general medical journal Lancet had a mean score of 6.9 (SD 1.7), while the mean score in the leading psychiatric journal JAMA Psy was 6.1 (SD 1.4). The quality score also identified the journals (IJERPH and JD) that publish more laboratory studies with smaller sample sizes. Articles published in these journals had lower scores, with a mean score of 5.2 (SD 2.3) in IJERPH and a mean score of 5.0 (SD 2.3) in JD. The mean values were statistically significantly different (p-value of ANOVA < 0.001).
Figure 2. Box plot graph of the data analysis reporting quality score of 160 articles published in four
journals. There were 40 articles from each journal. JAMA Psy = JAMA Psychiatry, IJERPH = International
Journal of Environmental Research and Public Health, and JD = Journal of Dentistry.
the reliability measures provided complete agreement. This preliminary intrarater validation analysis
revealed that disagreement arose especially for item 6.
4. Discussion
My purpose was to help authors, editors, reviewers, and readers to evaluate the quality of statistical
reporting and data presentation in research papers when data analysis procedures are invoked to clarify
findings and to draw conclusions from raw data. I developed a test for quickly assessing the quality by
selecting the checklist items from readers’ and reviewers’ perspectives. The composite scale from 0 to 10 assessed the selected reporting characteristics in the reviewed articles. A high value suggested that an article had a good presentation of findings in tables and figures and that the description of analysis methods was helpful to readers. This reflects an emphasis on analysis and reporting in the study and suggests that the methodological quality is high and the article understandable. The items included in the index
also gave detailed information to reviewers and authors about the defects with respect to specific parts
of the reporting of data analysis findings.
The problems in the reproducibility of biomedical research have received considerable attention in
recent years [8,36–38]. One explanation is the poor presentation of statistical findings and inadequate
reporting of data analysis techniques by authors. Low quality has been a long-term issue in medical
research papers [4,8,36]. There seems to be a cultural component to these practices [39]. The reporting
practices of senior scientists are transmitted to junior scientists. Authors will often copy inadequate
reporting practices from previously published papers in their field and ignore the recommendations of
current general guidelines. Another contributing factor is the disregard for medical biostatistics in
biomedical and clinical education [36,40–43]. Many researchers clearly recognize the importance of
data analysis concepts, but they continue to misunderstand elementary data analysis techniques and
statistical concepts [44].
To clinicians and scientists, the literature is an important means for acquiring new information to
guide health care research and clinical decision making. Poor reporting is unethical and may have
serious consequences for clinical practice, future research, policy making, patient care, and ultimately
for patients [10]. Securing the quality of the publications is an important activity of journals in their
editorial policy. Journals need to be more proactive in providing information about the quality of what they publish [10,45]. Journals should consider strategies and actions to verify that authors take full responsibility for the statistical reporting quality of their manuscripts. Use of the short
checklist proposed in this paper with reporting guidelines would be an important step towards
quality-secured research.
Peer review provides the foundation for the scientific publishing system. During the editorial
process peer reviewers are required to comment on whether the methods and findings are clearly
reported [46]. This approach heavily relies on the statistical expertise of subject reviewers. In general,
peer reviewers are competent in a specific range of statistical methods, but they may not necessarily
be aware of more general statistical issues and best practices [16,36,47]. It has been claimed that due
to the inadequate assessment of statistical reporting and data presentation during peer review and
editorial processes, the quality of biomedical articles has not improved [8,36,47–49].
Medical journals often ask their subject reviewers if they are able to assess all statistical aspects of
the manuscript themselves or whether they recommend an additional statistical review [47]. Leading
medical journals, such as Lancet, BMJ, Annals of Medicine and JAMA have adopted statistical review.
Despite demonstration of widespread statistical and data presentation errors in medical articles,
increasing the use of statistical reviewers has been slow [50]. A recent survey found that only 23% of
the top biomedical journals reported that they routinely employed statistical review for all original
research articles [51]. Introduction of specialist statisticians to the peer review process has made peer
review more specialized. In addition, statistical reviewing is time intensive, limited by both reviewer
supply and expense.
In biomedical journals, there is no single model for statistical review in peer review strategies [51–53].
Some journals recruit statistical methodologists to the editorial board, some draw their statistical
reviewers from an external pool. If all papers cannot be statistically reviewed, editors have to select
which manuscripts should undergo statistical scrutiny. There are also models where subject reviewers
are assisted to comment on the statistical aspects of a manuscript [47,48,54]. However, these checklists cover all aspects of data analysis extensively, and it is not straightforward for non-statistical reviewers to use them to form an overall impression of the statistical quality. Some form of simplified checklist could be handy
for editors and reviewers to spot the issues that might indicate more serious problems in the reporting
of scientific articles. I hope that editors and reviewers could use the short quality test proposed in
this paper for deciding when the presentation in a manuscript is clearly inadequate and they should
recommend rejecting the manuscript. I agree with Kyrgidis and Triaridis [48] that if the reviewer
cannot find the basic information and description related to the data analysis, the reviewer does not
need to read the whole article. After checking tables and figures and reading through the statistical
analysis subsection in the methods section, the reviewer can reject the manuscript on good grounds.
When the proposed simple quality test shows that the statistical reporting and data presentation are
appropriate, the whole article needs to be read and further reviewed.
In recent years, several journals have tried to improve peer review processes [55]. Their efforts
have been focused on introducing openness and transparency to the models of peer review [56].
New strategies in peer review might help to address persistent statistical reporting and data presentation
issues in the medical literature [55]. Software algorithms and scanners have been developed to assess
internal consistency and validity of statistical tests in academic writing [57]. However, their use is still rare and limited to flagging specific potential errors. Open peer review, where all peer reviews are made openly available, introduces new models in which the quality of a paper may be assessed by the
whole scientific community. The pre- and post-publication peer review models include commenting
systems for the readership. Readers could use tools, such as proposed in this paper, to give feedback to
authors. Subsequently, authors prepare a second version of the manuscript reflecting the comments
and suggestions proposed by the scientific community.
My validation set of articles included four journals with different levels of prestige and visibility.
The reporting of data analysis information was more detailed and useful for the reader in the more
visible journals (Lancet and JAMA Psychiatry). This is in line with previous studies [5,6]. High-impact
journals have a more rigorous review process, including extensive statistical reviews. Several researchers
in scientific communication have recommended sending out all manuscripts with numerical data
for statistical review [8,58,59]. In addition, the strict monitoring of revisions made in manuscripts
is important.
The proposed quality test is short. It includes only nine items. These items measure the data
presentation quality and the description of data analysis methods. It does not include items about
study design, justification of sample size, potential sources of bias, common statistical methodological
issues, or interpretation of results. Several reporting guidelines and checklists are available to ensure
that authors have paid attention to these other aspects [10,11]. However, these checklists have had
little impact on the quality of statistical reporting, and they have not been sufficient for ensuring
that issues related to poor statistical reporting and inadequate data presentation are eliminated from
manuscripts [3,8,60]. In addition, the general checklists do not include detailed recommendations for
reporting how statistical analyses were performed and how to present data.
The article set for pilot testing the proposed instrument included only published studies. Editors
and reviewers can use the proposed tool to judge whether manuscripts should be accepted for
publication. Educators can utilize it when educating researchers about how to improve statistical
reporting and data presentation. A future study is required to apply the instrument at the initial
submission of manuscripts to scientific journals and to test how applicable it is in the peer review
process. In addition, content and construct validity properties [61] need to be evaluated in future studies.
It should be noted that the instrument was specifically developed for health care, biomedical and
clinical studies, and limitations may arise if it is used in nonmedical fields that apply statistical methods.
5. Conclusions
In summary, I have developed an applicable checklist for quickly testing the statistical reporting
quality of manuscripts and published research papers. One aim of this work was to help authors
prepare manuscripts. Good research deserves to be presented well, and good presentation is as much
a part of the research as the collection and analysis of the data. With this instrument authors could test
the effectiveness of their presentations (tables and figures) for their readers. In addition, they could
check that they have not described statistical methods too briefly or superficially.
Competent editors and scientific reviewers should be able to identify errors in basic statistical
analysis and reporting. However, there is accumulating evidence that inadequate reporting of basic
statistics is a persistent and ubiquitous problem in published medical studies. Editors and reviewers
can improve the quality of reporting by identifying submitted manuscripts with poor-quality statistical
reporting and data presentation. I hope that the editors and reviewers of scientific journals will
incorporate the proposed test instrument in their editorial process. For a manuscript to be considered
for full review, its statistical presentation has to be good enough for readers, subject reviewers,
and statistical reviewers to understand what statistical methods have been applied and what data are
presented in tables and figures. Manuscripts with poor-quality statistical reporting (total score ranges
from 0 to 2) should be rejected. Giving detailed comments in review reports to improve the reporting
may not be straightforward in those cases.
Readers can use the proposed data analysis reporting test as an initial indicator of research quality.
They can easily check whether, based on numerical data analyses, a published research article is
readable, understandable and accessible to healthcare professionals who might wish to use similar
techniques in their future work or to those who are not experts in a particular subfield. A high score
indicates that the authors have focused on reporting and readers should be able to express an opinion
about the results after reading the main text.
Further refinement of the data analysis reporting instrument will continue, and I invite feedback.
Author Contributions: P.N. undertook all the work for this paper.
Funding: This research received no external funding.
Acknowledgments: I am grateful to Jasleen Kaur who provided useful comments on the draft of the instrument
items. I wish to thank all health science students at the University of Oulu who participated in the piloting of
the instrument.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Sato, Y.; Gosho, M.; Nagashima, K.; Takahashi, S.; Ware, J.H.; Laird, N.M. Statistical Methods in the
Journal—An Update. N. Engl. J. Med. 2017, 376, 1086–1087. [CrossRef] [PubMed]
2. Nieminen, P.; Miettunen, J.; Koponen, H.; Isohanni, M. Statistical methodologies in psychopharmacology:
A review. Hum. Psychopharmacol. Clin. Exp. 2006, 21, 195–203. [CrossRef] [PubMed]
3. Norström, F. Poor quality in the reporting and use of statistical methods in public health—The case of
unemployment and health. Arch. Public Health 2015, 73, 56. [CrossRef] [PubMed]
4. Diong, J.; Butler, A.A.; Gandevia, S.C.; Héroux, M.E. Poor statistical reporting, inadequate data presentation
and spin persist despite editorial advice. PLoS ONE 2018, 13, e202121. [CrossRef] [PubMed]
5. Altman, D.G. Poor-quality medical research: What can journals do? JAMA 2002, 287, 2765–2767. [CrossRef]
6. Nieminen, P.; Carpenter, J.; Rucker, G.; Schumacher, M. The relationship between quality of research and
citation frequency. BMC Med. Res. Methodol. 2006, 6, 42. [CrossRef]
7. McClean, M.; Silverberg, J. Statistical reporting in randomized controlled trials from the dermatology
literature: A review of 44 dermatology journals. Br. J. Dermatol. 2015, 173, 172–183. [CrossRef]
8. Dexter, F.; Shafer, S.L. Narrative Review of Statistical Reporting Checklists, Mandatory Statistical Editing,
and Rectifying Common Problems in the Reporting of Scientific Articles. Anesth. Analg. 2017, 124, 943–947.
[CrossRef]
9. Vandenbroucke, J.P.; Von Elm, E.; Altman, D.G.; Gøtzsche, P.C.; Mulrow, C.D.; Pocock, S.J.; Poole, C.;
Schlesselman, J.J.; Egger, M. Strengthening the Reporting of Observational Studies in Epidemiology
(STROBE): Explanation and elaboration. PLoS Med. 2007, 4, 1628–1654. [CrossRef]
10. Moher, D.; Altman, D.G.; Schultz, K.F.; Simera, I.; Wager, E. Guidelines for Reporting Health Research.
User’s Manual; John Wiley & Sons, Ltd.: Chichester, UK, 2014.
11. EQUATOR Network. Available online: https://www.equator-network.org/ (accessed on 30 March 2020).
12. Lang, T.A.; Secic, M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors and
Reviewers, 2nd ed.; American College of Physicians: Philadelphia, PA, USA, 2006.
13. Strasak, A.M.; Zaman, Q.; Pfeiffer, K.P.; Gobel, G.; Ulmer, H. Statistical errors in medical research—A review
of common pitfalls. Swiss Med. Wkly. 2007, 137, 44–49.
14. Lee, S. Avoiding negative reviewer comments: Common statistical errors in anesthesia journals.
Korean J. Anesthesiol. 2016, 69, 219–226. [CrossRef] [PubMed]
15. Altman, D.G. Statistical reviewing for medical journals. Stat. Med. 1998, 17, 2661–2674. [CrossRef]
16. Nieminen, P.; Virtanen, J.I.; Vähänikkilä, H. An instrument to assess the statistical intensity of medical
research papers. PLoS ONE 2017, 12. [CrossRef] [PubMed]
17. Bland, M. An Introduction to Medical Statistics; Oxford University Press: Oxford, UK, 2015; ISBN 0199589925.
18. Pickering, R.M. Describing the participants in a study. Age Ageing 2017, 46, 576–581. [CrossRef] [PubMed]
19. Lang, T.A.; Altman, D.G. Basic statistical reporting for articles published in Biomedical Journals:
“The Statistical analyses and methods in the published literature” or the SAMPL guidelines. Int. J.
Nurs. Stud. 2015, 52, 5–9. [CrossRef]
20. Nieminen, P.; Kaur, J. Reporting of data analysis methods in psychiatric journals: Trends from 1996 to 2018.
Int. J. Methods Psychiatr. Res. 2019, 28. [CrossRef]
21. Nieminen, P.; Vähänikkilä, H. Use of data analysis methods in dental publications: Is there evidence of
a methodological change? Publications 2020, 8, 9. [CrossRef]
22. Thiese, M.S.; Ronna, B.; Ott, U. P value interpretations and considerations. J. Thorac. Dis. 2016, 8, E928–E931.
[CrossRef]
23. Motulsky, H. Intuitive Biostatistics, 3rd ed.; Oxford University Press: New York, NY, USA, 2014;
ISBN 9780199946648.
24. American Medical Association. AMA Manual of Style: A Guide for Authors and Editors, 11th ed.;
Oxford University Press: New York, NY, USA, 2020; ISBN 978-0190246556.
25. Oberg, A.L.; Poland, G.A. The process of continuous journal improvement: New author guidelines for
statistical and analytical reporting in VACCINE. Vaccine 2012, 30, 2915–2917. [CrossRef]
26. American Psychological Association. Publication Manual of the American Psychological Association, 6th ed.;
American Psychological Association: Washington, DC, USA, 2017; ISBN 978-1-4338-0561-5.
27. Gardner, M.J.; Altman, D.G. Confidence intervals rather than P values: Estimation rather than hypothesis
testing. Br. Med. J. (Clin. Res. Ed.) 1986, 292, 746–750. [CrossRef]
28. Spector, R.; Vesell, E.S. Pharmacology and statistics: Recommendations to strengthen a productive partnership.
Pharmacology 2006, 78, 113–122. [CrossRef] [PubMed]
29. ICMJE. Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical
Journals. Available online: http://www.icmje.org/recommendations/ (accessed on 25 May 2020).
30. Lang, T.A.; Altman, D.G. Statistical Analyses and Methods in the Published Literature: The SAMPL
Guidelines. In Guidelines for Reporting Health Research: A User’s Manual; John Wiley & Sons, Ltd.: Oxford, UK,
2014; pp. 264–274. ISBN 9780470670446.
31. Simundic, A.-M. Practical recommendations for statistical analysis and data presentation in Biochemia
Medica journal. Biochem. Med. 2012, 22, 15–23. [CrossRef] [PubMed]
32. Mayo, D. Statistical Inference as Severe Testing. How to Get beyond the Statistical Wars; Cambridge University
Press: Cambridge, UK, 2018; ISBN 978-1-107-66464-7.
33. Indrayan, A.; Malhotra, R.K. Medical Biostatistics, 4th ed.; CRC Press: Cambridge, UK, 2018; ISBN 1498799531.
34. Indrayan, A. Reporting of Basic Statistical Methods in Biomedical Journals: Improved SAMPL Guidelines.
Indian Pediatr. 2020, 57, 43–48. [CrossRef] [PubMed]
35. Gwet, K.L. Handbook of Inter-Rater Reliability, 4th ed.; Advanced Analytics, LLC: Gaithersburg, MD, USA,
2014; ISBN 9780970806284.
36. Glasziou, P.; Altman, D.G.; Bossuyt, P.; Boutron, I.; Clarke, M.; Julious, S.; Michie, S.; Moher, D.; Wager, E.
Reducing waste from incomplete or unusable reports of biomedical research. Lancet 2014, 383, 267–276.
[CrossRef]
37. The Academy of Medical Sciences. Reproducibility and Reliability of Biomedical Research: Improving Research Practice;
The Academy of Medical Sciences: London, UK, 2015.
38. Vetter, T.R.; McGwin, G.; Pittet, J.F. Replicability, Reproducibility, and Fragility of Research
Findings-Ultimately, Caveat Emptor. Anesth. Analg. 2016, 123, 244–248. [CrossRef]
39. Smaldino, P.E.; McElreath, R. The natural selection of bad science. R. Soc. Open Sci. 2016, 3, 9. [CrossRef]
40. Shetty, A.C.; Al Rasheed, N.M.; Albwardi, S.A. Dental professionals’ attitude towards biostatistics. J. Dent.
Oral Hyg. 2015, 7, 113–118.
41. Hannigan, A.; Hegarty, A.C.; McGrath, D. Attitudes towards statistics of graduate entry medical students:
The role of prior learning experiences. BMC Med. Educ. 2014, 14, 70. [CrossRef]
42. Batra, M.; Gupta, M.; Dany, S.S.; Rajput, P. Perception of Dental Professionals towards Biostatistics. Int. Sch.
Res. Not. 2014, 2014, 1–6. [CrossRef]
43. Altman, D.G.; Goodman, S.N.; Schroter, S. How statistical expertise is used in medical research. JAMA 2002,
287, 2817–2820. [CrossRef]
44. Belia, S.; Fidler, F.; Williams, J.; Cumming, G. Researchers misunderstand confidence intervals and standard
error bars. Psychol. Methods 2005, 10, 389–396. [CrossRef]
45. Moher, D. Reporting guidelines: Doing better for readers. BMC Med. 2018, 16, 18–20. [CrossRef]
46. Glonti, K.; Cauchi, D.; Cobo, E.; Boutron, I.; Moher, D.; Hren, D. A scoping review on the roles and tasks of
peer reviewers in the manuscript review process in biomedical journals. BMC Med. 2019, 17, 1–14. [CrossRef]
[PubMed]
47. Greenwood, D.C.; Freeman, J.V. How to spot a statistical problem: Advice for a non-statistical reviewer.
BMC Med. 2015, 13, 1–3. [CrossRef] [PubMed]
48. Kyrgidis, A.; Triaridis, S. Methods and Biostatistics: A concise guide for peer reviewers. Hippokratia 2010, 14,
13–22. [PubMed]
49. Patel, J. Why training and specialization is needed for peer review: A case study of peer review for
randomized controlled trials. BMC Med. 2014, 12, 1–7. [CrossRef] [PubMed]
50. Horbach, S.P.J.M.; Halffman, W. The changing forms and expectations of peer review. Res. Integr. Peer Rev.
2018, 3, 1–15. [CrossRef]
51. Hardwicke, T.E.; Goodman, S.N. How often do leading biomedical journals use statistical experts to evaluate
statistical methods? The results of a survey. MetArXiv Prepr. 2020, 1–29.
52. Hardwicke, T.E.; Frank, M.C.; Vazire, S.; Goodman, S.N. Should Psychology Journals Adopt Specialized
Statistical Review? Adv. Methods Pract. Psychol. Sci. 2019, 2, 240–249. [CrossRef]
53. Overgaard, K.; Roberts, J.; Schaeffer, M.B.; Pessin, R. the Trenches. Sci. Ed. 2015, 38, 2015.
54. Curtis, M.J.; Alexander, S.; Cirino, G.; Docherty, J.R.; George, C.H.; Giembycz, M.A.; Hoyer, D.; Insel, P.A.;
Izzo, A.A.; Ji, Y.; et al. Experimental design and analysis and their reporting II: Updated and simplified
guidance for authors and peer reviewers. Br. J. Pharmacol. 2018, 175, 987–993. [CrossRef]
55. Fresco-Santalla, A.N.A.; Hernández-Pérez, T. Current and evolving models of peer-review. Ser. Libr. 2014, 67,
373–398. [CrossRef]
56. Vercellini, P.; Buggio, L.; Vigano, P.; Somigliana, E. Peer review in medical journals: Beyond quality of reports
towards transparency and public scrutiny of the process. Eur. J. Intern. Med. 2016, 31, 15–19. [CrossRef]
57. Nuijten, M.B.; Hartgerink, C.H.J.; van Assen, M.A.L.M.; Epskamp, S.; Wicherts, J.M. The prevalence of
statistical reporting errors in psychology (1985–2013). Behav. Res. Methods 2016, 48, 1205–1226. [CrossRef]
[PubMed]
58. Vahanikkila, H.; Tjaderhane, L.; Nieminen, P. The statistical reporting quality of articles published in 2010 in
five dental journals. Acta Odontol. Scand. 2015, 73, 76–80. [CrossRef] [PubMed]
59. Lukic, I.; Marusic, M. Appointment of statistical editor and quality of statistics in a small medical journal.
Croat. Med. J. 2001, 42, 500–503. [PubMed]
60. Pouwels, K.B.; Widyakusuma, N.N.; Groenwold, R.H.H.; Hak, E. Quality of reporting of confounding
remained suboptimal after the STROBE guideline. J. Clin. Epidemiol. 2016, 69, 217–224. [CrossRef]
61. Mokkink, L.B.; Terwee, C.B.; Patrick, D.L.; Alonso, J.; Stratford, P.W.; Knol, D.L.; Bouter, L.M.; de Vet, H.C.W.
The COSMIN study reached international consensus on taxonomy, terminology, and definitions of
measurement properties for health-related patient-reported outcomes. J. Clin. Epidemiol. 2010, 63, 737–745.
[CrossRef]
© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Stretched Exponential Survival Analysis
for South Korean Females
Byung Mook Weon 1,2,3
1 Soft Matter Physics Laboratory, School of Advanced Materials Science and Engineering, SKKU Advanced
Institute of Nanotechnology (SAINT), Sungkyunkwan University, Suwon 16419, Korea; [email protected]
2 Research Center for Advanced Materials Technology, Sungkyunkwan University, Suwon 16419, Korea
3 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
Featured Application: The stretched exponential survival analysis enables quantitative and
comparative study of female lifespan between South Korea and other countries, particularly
quantifying the survival rate, the characteristic life, and the maximum lifespan.
Abstract: South Korea has recently exhibited a remarkably rapid increase in female lifespan.
Here, a mathematical analysis is suggested for a clear interpretation of current trends in female
lifespan in South Korea. To mathematically analyze life tables, a modified stretched exponential
function is employed and demonstrated to estimate current trends of female lifespan in South
Korea based on reliable life tables from 1987 to 2016 taken from the Korean Statistical Information
Service. This methodology enables us to perform quantitative and comparative analyses of female
lifespan in South Korea with representative industrialized countries such as Japan, France, Australia,
Switzerland, UK, Sweden, and USA. This analysis provides quantitative and comparative evidence
that South Korea has the highest increase rate of female lifespan over the past three decades. Further
application would be feasible for a better estimation of human aging statistics.
1. Introduction
Statistical analysis of human lifespan is a central topic in demography and gerontology. Life
expectancy of humans has risen steadily and rapidly over the past 150 years in most countries and
keeps increasing in the 21st century [1–3]. A recent study of 35 industrialized countries shows an
interesting projection: there is more than 50% probability that by 2030, female life expectancy is likely
to break the 90-year barrier [4]. Interestingly, South Korea is predicted to have the highest female life
expectancy at birth by 2030 [4]. This prediction is achieved by applying a mathematical ensemble of
21 forecasting models to reliable life table statistics for humans [4]. South Korean females are an important model for understanding current trends in human lifespan, and an improved survival analysis is required for a better understanding of these trends.
To achieve a more reliable prediction, a mathematical model of human survival curves is helpful in
quantifying current trends in human lifespan for human aging and demography research [5–8]. In fact,
many mathematical models for survival curves have been proposed, including the Gompertz, Weibull,
Heligman-Pollard, Kannisto, quadratic, and logistic models [5]. In a recent work, a mathematical model has been suggested owing to its usefulness in describing survival curves with complex shapes by adopting a modified function from the stretched exponential (Weibull or Kohlrausch–Williams–Watts)
function [9–14]. Flexibility of a modified stretched exponential function, originating from an
age-dependent stretched exponent, enables us to mathematically describe plasticity and rectangularity
of survival curves [5,13]. In this work, by applying this mathematical model to the reliable life
table datasets taken from the Korean Statistical Information Service (KOSIS, http://kosis.kr/eng),
a practical application is demonstrated to quantify current trends in female lifespan for South Korea
over three recent decades, from 1987 to 2016. This analysis for South Korea is compared with that for
Japan, France, Australia, Switzerland, UK, Sweden, and USA, which are the representative leading
countries in longevity [4].
In this study, a mathematical analysis with a modified stretched exponential model is
demonstrated to interpret the recent survival curves for South Korean female, taken from the Korean
Statistical Information Service. The current trends for 1987–2016 are estimated from the female survival
curves in South Korea based on the estimates of the characteristic life and the maximum lifespan.
2. Methods
The stretched exponential survival analysis is summarized as follows. The methodology is based
on the modified stretched exponential function to describe the human survival curves, which has been
developed over the past decade [5,12–14]. This section briefly introduces the mathematical approach.
The survival rate, s(x), is a function of age, x. Mathematically, the mortality rate (equivalently the hazard function or the force of mortality), μ(x) = −d ln(s(x))/dx, is the negative derivative of the logarithm of the survival rate with respect to age. As a mathematical constraint, s(x) is a monotonically decreasing function of age, starting from 1 and ending at 0. To describe the survival curves, a modified stretched exponential function is quite flexible, taking the form s(x) = exp(−(x/α)^β(x)), where the stretched exponent, β(x), is an age-dependent term, β(x) = ln[−ln(s(x))]/ln(x/α), and the characteristic life, α, is a reference age determined at s(α) = exp(−1), that is, the age at which s(x) crosses the level exp(−1) [5,12–14]. The characteristic life is an alternative to the life
expectancy at birth, ε [14,15]. The age dependence of the stretched exponent is the critical feature
of the modified stretched exponential function, which is fundamentally different from the classical
stretched exponential function [5,12–14]. The characteristic life and the stretched exponent are useful
to respectively describe the scale effect (associated with living longer, characterized by α) and the
shape effect (associated with growing older, characterized by β( x )) of the individual survival curve [14].
For old-age patterns, the quadratic formula β(x) = β0 + β1x + β2x^2 is applicable [5,12], which leads to the occurrence of the (mathematical) maximum lifespan, ω, obtained at the age for which β(x) = −x ln(x/α) dβ(x)/dx, as imposed by the mathematical constraint ds(x)/dx → 0 [12]. This methodology is beneficial in modeling the human mortality curves in very old age [5].
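The maximum-lifespan condition just quoted follows directly from the model. The short derivation below is my own reading, using only the definitions given in this section, and shows how the constraint ds(x)/dx → 0 leads to β(x) = −x ln(x/α) dβ(x)/dx.

```latex
% Sketch of the derivation (not reproduced from the article itself).
\begin{align*}
s(x) &= \exp\!\big[-(x/\alpha)^{\beta(x)}\big],
\qquad
\ln s(x) = -\exp\!\big[\beta(x)\ln(x/\alpha)\big],\\[2pt]
\frac{d s(x)}{dx}
&= s(x)\,\ln s(x)\,
\left[\frac{d\beta(x)}{dx}\,\ln(x/\alpha) + \frac{\beta(x)}{x}\right].
\end{align*}
```

For 0 < s(x) < 1 the prefactor s(x) ln s(x) is nonzero, so requiring ds(x)/dx → 0 amounts to the bracket vanishing, dβ(x)/dx · ln(x/α) + β(x)/x = 0, which rearranges to β(x) = −x ln(x/α) dβ(x)/dx.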
[Figure 1 (panels a–d): survival rate s(x) versus age; scale effect (characteristic life) versus calendar year; shape effect β(x) versus age; old-age β(x) patterns versus age.]
Figure 1. Survival analysis. (a) The survival curves for Korean female in recent decades, 1987–2016, were obtained from the complete life tables taken from the Korean Statistical Information Service (http://kosis.kr/eng). (b) The characteristic life of the given survival curve. (c) The age-dependent
stretched exponent. (d) The old-age patterns of the stretched exponent at ages over 60 years could
be fitted with the quadratic formula of age. The phase transition in the stretched exponent with age
appears for 1995–2016 but not for 1987–1994.
The characteristic life, α, indicating the scale effect of the individual survival curve, was measured
for each survival curve through the graphical analysis at s(α) = exp(−1) ≈ 0.367879. The recent
historic trends in the α values were summarized in Figure 1b. The characteristic life for Korean
female was depicted as a function of the calendar year (1987–2016) in Figure 1b. This result shows
that the α value linearly increases at a constant rate of ∼2.88 years per decade (as marked by the
solid line in Figure 1b). This steady increase rate for South Korean female is much higher than the
rate of ∼1.20 years per decade for Swedish female in the previous analyses (data from the Human
Mortality Database, http://www.mortality.org) [5,14]. The historic trends in the characteristic life
clearly demonstrate that South Korean women have lived longer in the past three decades.
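A minimal sketch of this graphical measurement, assuming the survival curve s(x) is tabulated on integer ages in a NumPy array (the curve below is synthetic, not the KOSIS data):

```python
# Interpolate the age alpha at which s(x) crosses exp(-1), as described above.
import numpy as np

def characteristic_life(age: np.ndarray, s: np.ndarray) -> float:
    target = np.exp(-1.0)                      # ~0.367879
    i = int(np.argmax(s < target))             # first age index below the threshold
    x0, x1, s0, s1 = age[i - 1], age[i], s[i - 1], s[i]
    return x0 + (s0 - target) * (x1 - x0) / (s0 - s1)

age = np.arange(0, 111)
s = np.exp(-(age / 88.0) ** 9)                 # synthetic survival curve
print(round(characteristic_life(age, s), 1))   # ~88.0 for this synthetic curve
```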
The β(x) value, indicating the shape effect of the individual survival curve, was obtained from β(x) = −x ln(x/α) dβ(x)/dx, as shown in Figure 1c. Particularly, as demonstrated in Figure 1d, the old-age patterns of β(x) at ages over 60 years could be empirically fitted with the quadratic formula β(x) = β0 + β1x + β2x^2, as suggested previously [5,12,13]. The modified stretched exponential function
is successfully applicable to the survival curves and to the age-dependent stretched exponents for all
ages (Figure 1c) and for old ages over 60 years (Figure 1d). Most remarkably, the β( x ) curve smoothly
varies with age at old ages over 60 years and old-age β( x ) patterns over 90 years can be depicted with
the quadratic function of age. As found previously, the β( x ) patterns for old-age populations undergo
the phase transition with age: β( x ) increases with age at x < α and decreases with age at x > α [12].
The phase transition appears in South Korean female datasets after the calendar year of 1995: there are
phase transitions for 1995–2016 and no transitions for 1987–1994, as separately marked in Figure 1d.
Considering old-age β( x ) patterns, the female lifespan of South Korea would experience the critical
phase change through the calendar year of 1995. The historic trends in age dependence in β( x ) clearly
show how the female lifespan in South Korea has been growing older during 1987–2016.
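The computation of the stretched exponent and the old-age quadratic fit can be sketched as follows (synthetic survival curve and coefficients, not the KOSIS life tables):

```python
# Compute the age-dependent stretched exponent from s(x) and fit ages 60+ with a quadratic.
import numpy as np

age = np.arange(1, 106, dtype=float)
alpha = 88.0
s = np.exp(-(age / alpha) ** (8.0 + 0.02 * age))            # synthetic survival curve

valid = (s > 0) & (s < 1) & (age != alpha)                   # where beta(x) is defined
beta = np.log(-np.log(s[valid])) / np.log(age[valid] / alpha)

old = age[valid] >= 60
b2, b1, b0 = np.polyfit(age[valid][old], beta[old], deg=2)   # beta(x) ~ b0 + b1*x + b2*x^2
print(f"beta(x) = {b0:.2f} + {b1:.3f} x + {b2:.5f} x^2")
```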
[Figure 2: bar chart of the increase of the characteristic life per decade for UK, South Korea, USA, Japan, France, Australia, Switzerland, and Sweden.]
Figure 2. Comparison in the increase rate of the characteristic life per decade among representative
industrialized countries. The data were taken for South Korea (1987–2016) from the Korean Statistical
Information Service (http://kosis.kr/eng) and for other countries (1980–2010) from the Human
Mortality Database (http://www.mortality.org).
To demonstrate the association of the characteristic life (α), taken from Figure 2, with the life
expectancy at birth (ε), the annual increase rates of α and ε are compared in Figure 3. The ε values for
1980–2015 were taken from the Organization for Economic Cooperation and Development [16] and
the increase rates of the ε values were obtained by linear fits (the error bars are taken from the standard errors; some of the error bars are smaller than the symbol sizes). The increase rates of the ε values per
decade were evaluated as ∼4.15 years for South Korea, ∼2.33 years for Japan, ∼2.15 years for France,
∼1.96 years for Australia, ∼1.78 years for Switzerland, ∼1.93 years for UK, ∼1.43 years for Sweden,
and ∼1.06 years for USA. The annual increase rate of the ε values (marked by m) is proportionally correlated with the annual increase rate of the α values (marked by n) as n ≈ 1.173m − 0.046 (adjusted R^2 ∼ 0.85243), except for South Korea. This analysis suggests that the annual increase rate of the ε values for South Korea appears to be an extraordinary case among the representative high-income countries.
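The reported relation n ≈ 1.173m − 0.046 is a simple linear fit of the α rates against the ε rates across the non-Korean countries; a sketch of such a fit is below (only the ε rates quoted in the text are used, and the α rates are illustrative placeholders, since they are not all listed here).

```python
# Illustrative fit only: per-year increase rates of life expectancy (m) and of the
# characteristic life (n) for seven countries; the n values are placeholders.
import numpy as np

m = np.array([0.233, 0.215, 0.196, 0.178, 0.193, 0.143, 0.106])  # eps rates from the text (excl. South Korea)
n = np.array([0.228, 0.205, 0.182, 0.165, 0.178, 0.120, 0.075])  # hypothetical alpha rates

slope, intercept = np.polyfit(m, n, deg=1)
print(f"n ~ {slope:.3f} m + {intercept:+.3f}")
```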
[Figure 3: scatter plot of the increase of the characteristic life per decade against the increase of life expectancy at birth per decade.]
Figure 3. Association of the characteristic life, taken from Figure 2, with the life expectancy at birth
per decade in representative industrialized countries, taken from the Organization for Economic
Cooperation and Development [16].
The highest increase rate in the female lifespan (in both α and ε) for South Korea in Figure 3 is consistent with the projection that the female life expectancy at birth in South Korea, ranked sixth in 2010, will rank first in 2030 among the 35 industrialized countries [4]. So far, little attention
has been given to South Korea in research on human health and longevity except for a few reports [4].
This analysis supports that South Korea has the highest increase rate in the female lifespan in the
world in the past three decades.
The estimates of the maximum lifespan, ω, and the characteristic life, α, are summarized in Figure 4. The estimates of ω are obtained by the mathematical constraint of ds(x)/dx → 0 [5,14]. The fitting of β(x) with the quadratic formula enables us to determine the estimates of ω at the specific age where β(x) = γ(x), with γ(x) = −x ln(x/α) dβ(x)/dx [5]. This estimation is practically available if and only if the phase transition exists in the old-age β(x) curves. Indeed, the estimates of ω are significant over the calendar year of 2002, as shown in Figure 4.
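A minimal sketch of this intercept estimation, assuming illustrative quadratic coefficients for the old-age β(x) and a characteristic life of 90 years (not the fitted Korean values):

```python
# Locate omega as the age where beta(x) = gamma(x), with
# gamma(x) = -x * ln(x/alpha) * d(beta)/dx, for a quadratic beta(x).
import numpy as np
from scipy.optimize import brentq

alpha = 90.0                      # hypothetical characteristic life
b0, b1, b2 = -20.4, 0.72, -0.004  # hypothetical quadratic coefficients of beta(x)

def beta(x):
    return b0 + b1 * x + b2 * x**2

def gamma(x):
    return -x * np.log(x / alpha) * (b1 + 2 * b2 * x)

omega = brentq(lambda x: beta(x) - gamma(x), alpha + 1.0, 140.0)
print(round(omega, 1))            # roughly 120 years for these inputs
```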
[Figure 4: maximum lifespan ω (years) and characteristic life α (years) versus calendar year, approaching ≈125 years and ≈95 years, respectively.]
Looking at the recent trends in ω and α for the South Korean female in Figure 4, the estimates of ω
gradually decrease at a constant rate of ∼2.99 years per decade and the estimates of α gradually increase
at a constant rate of ∼3.30 years per decade. Assuming continuity in the increments, the estimates
of ω and α are predicted to reach ∼125 years and ∼95 years respectively around 2030. The change
rates for South Korean female are much higher than those of ∼1.6 years per decade in the ω values
and ∼1.2 years per decade in the α values for Swedish female [5]. The estimates of ω and α eventually
become closer together over time: this tendency for South Korean female is consistent with Swedish
female [5]. This result indicates that the survival curves have been increasingly concentrated at old
ages, demonstrating the population aging [17,18]. As α becomes closer to ω (Figure 4), the survival
curves tend to become strongly concentrated at very old ages, corresponding to the rectangularization
of survival curves [13,19]. This result would be relevant to the compression of morbidity, which is
responsible for the accumulation of the very old populations in the population configuration [5].
3.4. Discussion
Now, it is noteworthy that the stretched exponential survival analysis enables the quantitative
and comparative study of female lifespan between South Korea and other countries, particularly
quantifying the survival rate, the characteristic life, and the maximum lifespan. All evidence
suggests that South Korea has exhibited a remarkably rapid increase in female lifespan during the three recent
decades (from 1987 to 2016), as demonstrated from the characteristic life analysis in Figures 2 and 3.
An interesting question is why South Korea exhibits the highest increase rate in female lifespan.
Economic development in South Korea is supposed to be beneficial to health care improvement,
presumably provoking many potential social or cultural changes that are favorable for healthy
daily habits and welfare. Obviously, health is the result of a combination of individual, social, and
environmental factors [20]. In general, the recent increase in the numbers of older people in developed
countries is a consequence of advances in hygiene and biomedicine as well as an artefact of human
civilization [21]. In particular, South Korea has experienced rapid economic development and a
substantial increase in life expectancy in an extremely short period, between 1970 and 2010, mainly
due to the rapid processes of industrialization and urbanization [22,23].
Specifically, South Korea is expected to become the first country where the life expectancy at birth
will exceed 90 years for women [24]. The recent industrialization and the subsequent improvements
in living standards, nutrition, and health care have often been cited as major contributions to the
remarkable improvements in health for South Korea [25]. The rapid increases in life expectancy in
South Korea could be mostly achieved by reductions in infant mortality and in decreases related
to infections and blood pressure [25]. More recent gains in South Korea have been largely due to
postponement of death from chronic diseases [25]. These gains were mainly due to broad-based
inclusive improvements in economic status and social capital (including education), which improved
childhood and adolescent nutrition [26], expanded access to primary and secondary health care,
and facilitated rapid scale-up of new medical technologies [25]. South Korea has also maintained
lower body-mass index and blood pressure than other industrialized countries [27,28], and lower
smoking in women. Additionally, South Korea might have lower health inequalities (for cancer
and cardiovascular disease mortality, and for self-reported health status) than some of their western
counterparts, especially for women [24,29,30]. The death rates from all causes in South Korea decreased
significantly in both genders in the last three decades except for a period following the economic crisis
in the late 1990s [31]. Looking at the trends in infectious disease mortality for South Korea, 1983−2015,
infant mortality caused by infectious diseases has substantially decreased, while death rates from
infectious disease for elderly populations with lower education levels and subgroups susceptible to respiratory infections and sepsis have not decreased overall [32]. The potential reasons discussed here may be plausible for South Korea but not necessarily for other countries.
Finally, the lifespan trends for Korean female may give a clue to the question of whether there is a biological or social limit to human lifespan [3,33–36]. The heterogeneity in population and the dynamics of heterogeneous populations would become important in interpreting human lifespan [37,38]. The question of whether the upper limit to the maximum lifespan is fixed or flexible has been a topic of continuous debate [34,39–45]. The decrease of the ω estimate in Figure 4 means that
more people can reach the biological lifespan limit as expected to be 125 years [12]. The mathematical
evaluation for female survival curves would be useful in achieving reliability and reproducibility
in aging research [6–8] as well as in demonstrating the effectiveness of anti-aging therapies to slow
the aging process [46]. Finding the appropriate mathematical model is quite important to accurately
evaluate the human survival curves that exhibit a variety of heterogeneity and complexity. The practical implication of this study is that the potential reasons behind South Korea's gains may be useful for improving human lifespan, suggesting a further application of the stretched exponential survival analysis to identify personal, social, and environmental factors affecting human lifespan.
4. Conclusions
A steady increase in female lifespan is obvious in recent decades for most industrialized countries.
This study shows quantitative and comparative evidence for a remarkably rapid increase of female
lifespan in South Korea during three recent decades, from 1987 to 2016, by applying the stretched
exponential survival analysis. Based on the estimation of the increase rate of female lifespan for South
Korea, the comparative analysis with Japan, France, Australia, Switzerland, UK, Sweden, and USA
implies that South Korea would have the highest increase rate of female lifespan during the three recent decades. This result suggests that Korean female datasets would be a representative population dataset
for further study on what contributions would be essential to human longevity. Further application
would be feasible for a better estimation of human aging statistics, which will be useful for future
population prediction and policy making.
Funding: This research was supported by Basic Science Research Program through the National Research
Foundation of Korea (NRF) funded by the Ministry of Education (Grant No. NRF-2016R1D1A1B01007133 and
Grant No. 2019R1A6A1A03033215).
Conflicts of Interest: The author declares no conflict of interest. The funders had no role in the design of the
study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to
publish the results.
References
1. Wilmoth, J.R.; Deegan, L.J.; Lundström, H.; Horiuchi, S. Increase of maximum life-span in Sweden, 1861–1999.
Science 2000, 289, 2366–2368. [CrossRef] [PubMed]
2. Oeppen, J.; Vaupel, J.W. Broken limits to life expectancy. Science 2002, 296, 1029–1031. [CrossRef] [PubMed]
3. Olshansky, S.J. Ageing: Measuring our narrow strip of life. Nature 2016, 538, 175–176. [CrossRef] [PubMed]
4. Kontis, V.; Bennett, J.E.; Mathers, C.D.; Li, G.; Foreman, K.; Ezzati, M. Future life expectancy in 35
industrialised countries: Projections with a Bayesian model ensemble. Lancet 2017, 389, 1323–1335. [CrossRef]
5. Weon, B.M. A solution to debates over the behavior of mortality at old ages. Biogerontology 2015, 16, 375–381.
[CrossRef] [PubMed]
6. Petrascheck, M.; Miller, D.L. Computational analysis of lifespan experiment reproducibility. Front. Genet.
2017, 8, 92. [CrossRef] [PubMed]
7. Ruby, J.G.; Smith, M.; Buffenstein, R. Naked mole rat mortality rates defy Gompertzian laws by not increasing
with age. eLife 2018, 7, e31157. [CrossRef] [PubMed]
8. Beltrán-Sánchez, H.; Finch, C. Life expectancy: Age is just a number. eLife 2018, 7, e34427. [CrossRef]
[PubMed]
9. Weibull, W.A. A statistical distribution function of wide applicability. J. Appl. Mech. 1951, 18, 293–297.
10. Kohlrausch, R. Theorie des elektrischen rückstandes in der leidener flasche. Pogg. Ann. Phys. Chem. 1854, 91,
179–214. [CrossRef]
11. Williams, G.; Watts, D.C. Non-symmetrical dielectric relaxation behavior arising from a simple empirical
decay function. Trans. Faraday Soc. 1970, 66, 80–85. [CrossRef]
12. Weon, B.M.; Je, J.H. Theoretical estimation of maximum human lifespan. Biogerontology 2009, 10, 65–71.
[CrossRef] [PubMed]
13. Weon, B.M.; Je, J.H. Plasticity and rectangularity in survival curves. Sci. Rep. 2011, 1, 104. [CrossRef]
[PubMed]
14. Weon, B.M.; Je, J.H. Trends in scale and shape of survival curves. Sci. Rep. 2012, 2, 504. [CrossRef] [PubMed]
15. Wrycza, T.; Baudisch, A. The pace of aging: Intrinsic time scales in demography. Demogr. Res. 2014, 30,
1571–1590. [CrossRef]
16. OECD. Health at a Glance 2017: OECD Indicators; OECD Publishing: Paris, France, 2017. [CrossRef]
17. Anderson, G.F.; Hussey, P.S. Population aging: A comparison among industrialized countries. Health Aff.
2000, 19, 191–203. [CrossRef] [PubMed]
18. Robine, J.M.; Michel, J.P. Looking forward to a general theory on population aging. J. Gerontol. A Biol. Sci.
Med. Sci. 2004, 59, 590–597. [CrossRef]
19. Fries, J.F. Aging, natural death, and the compression of morbidity. N. Engl. J. Med. 1980, 303, 130–135.
[CrossRef]
20. Shkrabtak, N.; Frolova, N.; Kiseleva, T.; Sergeeva, I.; Pomozova, V. Impact of environmental conditions on
the health of the Far Eastern region population. Appl. Sci. 2019, 9, 1354. [CrossRef]
21. Hayflick, L. The future of ageing. Nature 2000, 408, 267–269. [CrossRef]
22. Kim, I.K.; Liang, J.; Rhee, K.O.; Kim, C.S. Population aging in Korea: Changes since the 1960s. J. Cross
Cult. Gerontol. 1996, 11, 369–388. [CrossRef] [PubMed]
23. Bahk, J.; Lynch, J.W.; Khang, Y.H. Forty years of economic growth and plummeting mortality: The mortality
experience of the poorly educated in South Korea. J. Epidemiol. Community Health 2017, 71, 282–288.
[CrossRef] [PubMed]
24. Khang, Y.H.; Bahk, J.; Yi, N.; Yun, S.C. Age- and cause-specific contributions to income difference in life
expectancy at birth: Findings from nationally representative data on one million South Koreans. Eur. J.
Public Health 2016, 26, 242–248. [CrossRef] [PubMed]
25. Yang, S.; Khang, Y.H.; Harper, S.; Davey Smith, G.; Leon, D.A.; Lynch, J. Understanding the rapid increase in
life expectancy in South Korea. Am. J. Public Health 2010, 100, 896–903. [CrossRef] [PubMed]
26. NCD-RisC. A century of trends in adult human height. eLife 2016, 5, e13410. [CrossRef] [PubMed]
27. NCD-RisC. Trends in adult body-mass index in 200 countries from 1975 to 2014: A pooled analysis of
1698 population-based measurement studies with 19.2 million participants. Lancet 2016, 387, 1377–1396.
[CrossRef]
28. NCD-RisC. Worldwide trends in blood pressure from 1975 to 2015: A pooled analysis of 1479
population-based measurement studies with 19.1 million participants. Lancet 2017, 389, 37–55. [CrossRef]
29. Di Cesare, M.; Khang, Y.H.; Asaria, P.; Blakely, T.; Cowan, M.J.; Farzadfar, F.; Guerrero, R.; Ikeda, N.;
Kyobutungi, C.; Msyamboza, K.P.; et al. Inequalities in non-communicable diseases and effective responses.
Lancet 2013, 381, 585–597. [CrossRef]
30. OECD. Health at a Glance 2015: OECD Indicators; OECD Publishing: Paris, France, 2015. [CrossRef]
31. Lim, D.; Ha, M.; Song, I. Trends in the leading causes of death in Korea, 1983–2012. J. Korean Med. Sci. 2014,
29, 1597–1603. [CrossRef]
32. Choe, Y.J.; Choe, S.A.; Cho, S.I. Trends in infectious disease mortality, South Korea, 1983–2015.
Emerg. Infect. Dis. 2018, 24, 320–327. [CrossRef]
33. Carnes, B.A.; Witten, T.M. How long must humans live? J. Gerontol. A Biol. Sci. Med. Sci. 2014, 69, 965–970.
[CrossRef]
34. Dong, X.; Milholland, B.; Vijg, J. Evidence for a limit to human lifespan. Nature 2016, 538, 257–259. [CrossRef]
[PubMed]
35. Newman, S.; Easteal, S. The dynamic upper limit of human lifespan [version 1; referees: 1 approved,
1 approved with reservations]. F1000Research 2017, 6, 569. [CrossRef] [PubMed]
36. Marck, A.; Antero, J.; Berthelot, G.; Saulière, G.; Jancovici, J.M.; Masson-Delmotte, V.; Boeuf, G.; Spedding, M.;
Le Bourg, E.; Toussaint, J.F. Are we reaching the limits of homo sapiens? Front. Physiol. 2017, 8, 812.
[CrossRef] [PubMed]
37. Carnes, B.A.; Olshansky, S.J. Heterogeneity and its biodemographic implications for longevity and mortality.
Exp. Gerontol. 2001, 36, 419–430. [CrossRef]
38. Vaupel, J.W. Biodemography of human ageing. Nature 2010, 464, 536–542. [CrossRef]
39. Ben-Haim, M.S.; Kanfi, Y.; Mitchel, S.J.; Maoz, N.; Vaughan, K.; Amariglio, N.; Lerrer, B.; de Cabo, R.;
Rechavi, G.; Cohen, H.Y. Breaking the ceiling of human maximal lifespan. J. Gerontol. A Biol. Sci. Med. Sci.
2018, 73, 1465–1471. [CrossRef]
40. Brown, N.J.L.; Albers, C.J.; Ritchie, S.J. Contesting the evidence for limited human lifespan. Nature 2017, 546,
E6–E7. [CrossRef]
41. De Beer, J.; Bardoutsos, A.; Janssen, F. Maximum human lifespan may increase to 125 years. Nature 2017, 546,
E16–E17. [CrossRef]
42. Hughes, B.G.; Hekimi, S. Many possible maximum lifespan trajectories. Nature 2017, 546, E8–E9. [CrossRef]
43. Lenart, A.; Vaupel, J.W. Questionable evidence for a limit to human lifespan. Nature 2017, 546, E13–E14.
[CrossRef] [PubMed]
44. Rozing, M.P.; Kirkwood, T.B.L.; Westendorp, R.G.J. Is there evidence for a limit to human lifespan? Nature
2017, 546, E11–E12. [CrossRef] [PubMed]
45. Rootzen, H.; Zholud, D. Human life is unlimited–but short. Extremes 2017, 20, 713–728. [CrossRef]
46. De Magalhães, J.P.; Stevens, M.; Thornton, D. The business of anti-aging science. Trends Biotechnol. 2017, 35,
1062–1073. [CrossRef] [PubMed]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Using the Importance–Satisfaction Model and Service
Quality Performance Matrix to Improve Long-Term
Care Service Quality in Taiwan
Shun-Hsing Chen 1 , Fan-Yun Pai 2, * and Tsu-Ming Yeh 3, *
1 Department of Marketing and Distribution Management, Oriental Institute of Technology,
New Taipei City 220, Taiwan; [email protected]
2 Department of Business Administration, National Changhua University of Education, Changhua 500, Taiwan
3 Department of Industrial Engineering and Management, National Quemoy University, Kinmen 892, Taiwan
* Correspondence: [email protected] (F.-Y.P.); [email protected] (T.-M.Y.);
Tel.: +886-47-232105 (ext. 7415) (F.-Y.P.); +886-82-313585 (T.-M.Y.)
Abstract: The present study integrates the importance–satisfaction (I-S) model and service quality
performance matrix (SQPM) to examine long-term care (LTC) service demands and satisfaction
improvement. Many scholars have used a single model to explore project improvement. Each model
has advantages, but we think they are too subjective and suggest that it is best to integrate models to
determine what should be improved. We established quality attributes of service demands based
on more than two sessions of discussions and expert consultations with LTC service users (older
adults). The final questionnaire was divided into three parts: a demand survey, satisfaction survey,
and demographics survey, and 292 valid questionnaires were collected. The questionnaire items
were summarized with means and standard deviations. In this study, if only the I-S model was used
to examine LTC in Taiwan, then seven service elements of the system would need to be improved.
However, if only the SQPM method was used, then 16 service elements would need to be improved.
Only seven service elements were identified by both methods. When time and resources are limited,
it is not feasible to take comprehensiveness into account. When many projects must be improved and
it is impossible to implement them at the same time, improvement priorities need to be developed.
Taiwan lacks sufficient LTC resources, so it is impossible to provide enough resources for all those
who need care. To use resources efficiently, the I-S model and SQPM were integrated in this study to
identify areas for improvement.
Keywords: Long-term care (LTC); importance-satisfaction (I-S) model; performance evaluation matrix
(PEM); service quality performance matrix (SQPM); voice of customer (VOC)
1. Introduction
An aging population and changes in population structure directly impact social systems, such as
the labor market, retirement planning, and the medical care system, and can lead to an economic crisis,
such as a recession or government bankruptcy [1]. Therefore, population factors have become closely
watched important issues in developed countries [2]. Taiwan’s population of older adults exceeded 7%
in 1993, making it an aging society [3], exceeded 14% in 2018, categorizing it as an aged society [4], and
will exceed 20% in 2025, making Taiwan a super-aged society [5].
Long-term care (LTC) involves a variety of services designed to meet people’s health or personal
care needs during a short or long period of time. These services help people live as independently
and safely as possible when they can no longer perform everyday activities on their own [6]. LTC is a
very important issue for everyone in an aging society [5]. With the advancement of technology and
medical treatments, the global average age has been extended to 75 years [7]. Taiwan is no exception.
The continuous extension of life has resulted in changes in population structures [8]. The aging
phenomenon is a problem that developed countries must face, along with related problems regarding
the care of older adults [9]. In recent years, late marriage, bachelorism, and infertility have continued
to intensify. Coupled with extended average life expectancy, low fertility, and population aging, they
have also become concerning and unavoidable issues [10].
The implementation of LTC policies is based on the manager perspective rather than the user or
customer point of view. In other words, LTC implementation ignores the voice of the customer (VOC).
In the present study, the user point of view was used to explore service demands in LTC in Taiwan.
Those who require LTC were divided into two groups: healthy older people, who are only interested in
living happily, and older people with disabilities or dementia, who require medical interventions [11].
As the demands of these two groups are different, a service demand design would be ineffective,
and could even affect national life, if it was viewed from an official perspective. Furthermore, the
Taiwanese government has limited resources, and the needed resources would not necessarily be
utilized at the same time. To address these issues and prioritize the efficient use of LTC resources,
the importance-satisfaction (I-S) model and the revised performance evaluation matrix (PEM) can be
used. Such a practice could fulfill LTC service demands by building a complete set of service design
measurement methods.
Yang [12] developed the I-S model with reference to the importance performance analysis (IPA)
method. The I-S model is the best tool for quality improvement [12,13]. Lambert and Sharma [14]
proposed the performance evaluation matrix (PEM) to determine the importance of logistical service
quality factors and the performance of companies based on these factors. PEM-related applications
have been used in the machinery industry and in education; few applications have been used in
long-term care. The PEM can help industries find service items that need improvement. The present
study found no analyses on combining the I-S model and the PEM method for LTC service demands in
the relevant literature. Given the shortage of such studies and the fact that LTC service demand
policies are currently considered very important and urgent, the I-S model and PEM are introduced in the
present study. It is hoped that this will provide more accurate quality service design and improve
the essentials of LTC services to better meet the needs and expectations of the older adult population
in Taiwan.
We conducted in-depth interviews in more than two sessions of discussions with experts and
scholars through a focused discussion method to understand the LTC demands of the public and
further confirm the service demands of caregivers, and we distributed 350 questionnaires. We also
used the I-S model and the revised PEM method to find LTC users of corresponding management
strategies to serve as a transmission basis for decision makers to adjust their services. This was to
ensure that user service demands could be fully satisfied. In general, the purposes of this study can be
summarized as follows:
1. To perform in-depth interviews with experts and scholars in focused discussions to confirm user
service demands and service resources.
2. To use the I-S model and the revised PEM method to confirm the importance and priority levels
of various service demands in order to serve as a basis for decision makers to adjust LTC service
quality design and improve care for older adults.
The rest of this study is organized as follows: Section 2 reviews the literature related to long-term
care in Taiwan, long-term care service demands of older adults, the service quality evaluation model,
the I-S model, the service quality evaluation model, and related research methods. Section 3 presents
the research process, methodology, questionnaire design, data collection method, and statistical analysis
methods. The results of the analysis and discussion are given in Section 4. Finally, the conclusions of
this study and suggestions are presented in Section 5.
2. Literature Review
Figure 1. Medical/long-term/life support care model (medical, long-term care, and life support systems corresponding to sickness, dependent, and autonomous life statuses). Source: Ministry of Health and Welfare, Taiwan [20].
LTCA 2.0 is an extension of Taiwan's LTC policy. To address Taiwan's increasingly aging
society, LTCA 2.0 proposes improving the strategies of three projects [20].
Figure 2. ABC long-term community-based care model (including Tier B long-term care specialty stores). Source: Ministry of Health and Welfare, Taiwan [20].
older adults living in communities. Home-based services include the following 10 services: health
care, rehabilitation, physical care, housework, care visits, telephone-based care, catering, emergency
rescue, home environment improvement, and other. Community-based services include the following
15 services: health care, medical care, rehabilitation, psychological counseling, day care, catering,
family care, education, legal, transportation, retirement preparation, leisure, provision of information,
referrals, and other.
[Figure: importance-performance analysis (IPA) matrix, with importance on the vertical axis and performance on the horizontal axis; Quadrant A "keep up the good work", Quadrant B "concentrate here", Quadrant C "low priority", and Quadrant D "possible overkill".]
[Figure: importance-satisfaction (I-S) model, with importance on the horizontal axis and satisfaction on the vertical axis (low/mean/high); labeled areas include the "unimportant area" and the "to be improved area".]
[Figure: performance evaluation matrix (PEM), with performance on the horizontal axis (scale 1-7) and importance on the vertical axis; cells labeled with strategies such as "definitely improve", "improve", "maintain/improve", "maintain", and "reduce/maintain".]
From this matrix analysis, the current performance and competition status of the examined
organizations can serve as directions for future development. PEM provides a reference for whether or
not business owners should adjust their performance and response strategies [32]. Due to different
perspectives explored by scholars, the viewpoints vary. Chen and Yeh [13] proposed amending the
variables by placing importance variables on the horizontal axis and satisfaction variables on the
vertical axis. The revised PEM was developed from these changes (see Figure 6).
Figure 6. Revised performance evaluation matrix, with importance (PI) on the horizontal axis and satisfaction (PS) on the vertical axis (scales 1-7) and strategies labeled "maintain", "improve", and "prioritize improvement". Sources: Chen and Yeh [13].
[Figure: service quality performance matrix, with demand (PD) on the horizontal axis and satisfaction (PS) on the vertical axis, each divided at 1/3 and 2/3 into nine zones (Zones 1-9) labeled "maintain", "improve", or "prioritize improvement".]
Below, according to the theory of Hung et al. [33], the random variable D denotes demand, and S
denotes satisfaction. A 5-point scale was adopted to evaluate the demand and satisfaction of each item.
The indices of demand and satisfaction are defined as follows:
PD = (μD − min)/R    (1)
PS = (μS − min)/R    (2)
where PD is the index of demand, PS is the index of satisfaction, μD and μS are the means of demand (D)
and satisfaction (S), respectively, min = 1 is the minimum value of the k-point scale, and R = k − 1 is the
full range of the scale.
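As a concrete illustration of Equations (1) and (2), the following sketch computes the demand and satisfaction indices for a single item from 5-point ratings; the ratings shown are hypothetical.

```python
def demand_satisfaction_indices(demand_ratings, satisfaction_ratings, k=5):
    """Compute PD and PS as in Equations (1) and (2): (mean - min) / R, with min = 1 and R = k - 1."""
    min_scale, full_range = 1, k - 1
    mu_d = sum(demand_ratings) / len(demand_ratings)
    mu_s = sum(satisfaction_ratings) / len(satisfaction_ratings)
    return (mu_d - min_scale) / full_range, (mu_s - min_scale) / full_range

# Hypothetical 5-point ratings for one service item.
pd_index, ps_index = demand_satisfaction_indices([5, 4, 5, 4, 5], [2, 3, 2, 2, 3])
print(pd_index, ps_index)  # 0.90 and 0.35 for these example ratings
```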
Based on the literature review, we know that PEM has been used extensively in recent years and
can be applied to decision-making practices [32]. This method is mostly used in logistics and in the
semiconductor, financial, and service industries, and its application in relevant studies of LTC is still
rare. Hence, we chose PEM as one of the research methods. In this study, the resource allocation
for the LTC concept was used as the basis to explore service satisfaction levels for older people who
require care. The corresponding strategies of maintenance, improvement, and priority improvement
of the revised PEM put forward by Chen and Yeh [13] were re-proposed. This matrix is called the
service quality performance matrix (SQPM). Zones 1, 2, and 4 are areas where older adults’ cognitive
satisfaction exceeds demand. Therefore, service resources should be maintained at their current status,
and the strategy is to maintain service quality. Zones 3, 5, and 7 are areas where satisfaction is equal
to demand, indicating that the service does not exceed user expectations; therefore, organizations
must improve their existing service quality by investing in more resources, and the proposed strategy
is to improve. Zones 6, 8, and 9 are areas where demand exceeds satisfaction, indicating that older
adults are well aware of service projects, but they are not satisfied with the services provided. As a
result, governments and organizations must actively make improvements to correct these issues, and
the response strategy is to prioritize improvement. All factors that prioritize improvement are the
response strategies that this study puts forward and are regarded as factors that should be prioritized.
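One way to read the nine-zone matrix programmatically is to compare which third of the [0, 1] range each index falls into: a higher satisfaction band maps to "maintain", equal bands to "improve", and a higher demand band to "prioritize improvement". The sketch below is only an illustrative reading under that assumption and does not reproduce the exact zone numbering of the figure.

```python
def sqpm_strategy(pd_index: float, ps_index: float) -> str:
    """Map SQPM indices (each in [0, 1]) to a response strategy by comparing thirds of the scale."""
    def band(x: float) -> int:  # 0 = lower third, 1 = middle third, 2 = upper third
        return min(int(x * 3), 2)
    d_band, s_band = band(pd_index), band(ps_index)
    if s_band > d_band:
        return "maintain"            # satisfaction exceeds demand (zones 1, 2, and 4)
    if s_band == d_band:
        return "improve"             # satisfaction roughly matches demand (zones 3, 5, and 7)
    return "prioritize improvement"  # demand exceeds satisfaction (zones 6, 8, and 9)

print(sqpm_strategy(0.90, 0.35))  # -> "prioritize improvement"
```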
In the relevant literature, most PEM-related applications have been in the machinery industry [34] and
education [36]; few have been applied in the service industry, and PEM has not previously been formally
applied to LTC service demand and resource allocation. In this study, the literature and relevant practical
experience were used to identify the elements of LTC service demands, and the SQPM method was
used to determine the priority of these demands to optimize resource allocation and implement care
for older adults’ retirement.
3. Research Methods
The samples for this research were people older than 55 years (service targets of LTC) who
lived in a long-term institution or attended a senior citizens’ learning camp. Taiwanese older people
residing in Taipei and New Taipei were the participants in this study. In addition, in the process of
questionnaire distribution, some older people were found to be very old or illiterate. In these cases,
the questionnaire was administered verbally to make the questions more comprehensible.
4. Research Results
A service needs survey was conducted to examine the demand level of older adults for essential
elements. The higher the demand level, the more important these elements are to older adults; therefore,
the government should pay attention to them in its administrative measures. The results are shown in
Table 2. The average value of the demand survey was found to be 4.16, with a standard deviation of
0.99, showing that older adults have a very high awareness of all the elements, and each element is
vital to them. The three items with the highest demand were retirement pension or living allowance
subsidies, emergency medical care and rescue services, and retirement planning.
The survey was meant to understand the service demands of older adults and confirm whether
their demands were satisfied. The lower the satisfaction level, the higher the dissatisfaction with
services provided by the government and service providers, indicating that these items should be
improved. The average value of the satisfaction survey was found to be 2.93, with a standard
deviation of 0.78, indicating that older adults are less satisfied with all elements. The three items
with the lowest satisfaction levels were retirement pension or living allowance subsidies, emergency
medical care and rescue services, and family companionship. These elements must be listed in the
improvement priorities.
Table 2. The mean (ME) and standard deviation (S.D.) of the service demand and satisfaction surveys (N = 292).
Item / Demand ME (S.D.) / Satisfaction ME (S.D.)
1. Retirement pension or living allowance subsidies / 4.89 (0.95) / 2.19 (0.49)
2. Health-promoting physical fitness activities / 4.67 (0.81) / 2.67 (0.65)
3. Visits or telephone care services by social welfare agencies / 4.01 (1.01) / 2.70 (0.68)
The service elements in the surplus area were items 15, 17, 18, 19, and 20, indicating that service
demand is below the average, and customers are not concerned with these services. Service satisfaction
is higher than the average, indicating that resources are being misplaced or wasted. Therefore, surplus
resources should be reallocated to service items with high demand and low satisfaction.
The service elements in the unimportant area were items 3, 4, 9, 10, and 12, indicating that service
demand and satisfaction are lower than the average. Customers do not value these services but accept
the performance of service providers. Therefore, as long as these elements are provided appropriately,
it is not necessary to enhance the service quality. In fact, increasing resources in this area could be
considered a waste.
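In the I-S model, the cut points are the sample means of the demand and satisfaction scores rather than fixed thresholds. The sketch below classifies a single item against the overall means reported above; the label "excellent area" for the high-demand, high-satisfaction cell is an assumption, since only the "to be improved", "surplus", and "unimportant" areas are named explicitly in the text.

```python
def is_model_area(item_demand, item_satisfaction, mean_demand, mean_satisfaction):
    """Classify a service item into an I-S model area using the sample means as cut points."""
    high_demand = item_demand >= mean_demand
    high_satisfaction = item_satisfaction >= mean_satisfaction
    if high_demand and high_satisfaction:
        return "excellent area"       # assumed label for the remaining cell
    if high_demand:
        return "to be improved area"
    if high_satisfaction:
        return "surplus area"
    return "unimportant area"

# Item 1 (pensions/living allowances) against the reported overall means of 4.16 (demand) and 2.93 (satisfaction).
print(is_model_area(4.89, 2.19, 4.16, 2.93))  # -> "to be improved area"
```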
[Figure: I-S model analysis of the service elements, plotting satisfaction (PS) against demand (PD); labeled areas include the "unimportant area" and the "to be improved area".]
Table 3. Relative values for I-S model and service quality performance matrix (SQPM).
[Figure: SQPM analysis of the service elements, plotting satisfaction (PS) against demand (PD); zones labeled "maintain", "improve", and "prioritize improvement".]
4.5. Discussion
Many scholars have used a single model or tool, such as IPA, the I-S model, or the PEM, to
explore projects that organizations need to improve. Each model has its advantages, but we think
they are too subjective. Instead, it is best to use several models to repeatedly determine the areas
that need to be improved before proposing an improvement strategy. The present study integrates
the importance-satisfaction (I-S) model and service quality performance matrix (SQPM) to examine
long-term care (LTC) service demands and satisfaction improvement.
Japan’s LTC system was established in 2000 to support older people with LTC demands and
alleviate the burden of caregiving on family members [40]. People choosing to use LTC services,
including in-home, facility, and community services, reduce their financial expenses and improve
their quality of care [41]. Taiwan established LTCA 1.0 in 1998, and amended it to develop LTCA 2.0
in 2018. The implementation schedule in Taiwan was earlier than in Japan, but its effectiveness has
not been high. Therefore, the Taiwanese government and service providers must actively emphasize
implementing the LTC system and making improvements regarding older adults’ demands.
Generally speaking, when time and resources are limited, it is not feasible to take
comprehensiveness into account. When many projects must be improved and it is impossible
to implement them at the same time, improvement priorities are developed [42]. In this study, if only
the I-S model was used to examine LTC in Taiwan, then seven elements of the system would need to
be improved: items 1, 2, 5, 6, 7, 11, and 13. However, if only SQPM were used, 16 elements would
need to be improved: items 1–16 and 20. In other words, the outcomes are different when using
different models.
As mentioned earlier, Taiwan lacks sufficient LTC resources, so it is impossible to provide enough
resources for all those who need care. To use resources efficiently, the I-S model and SQPM were
integrated in this study to identify areas for improvement, and items 1, 2, 5, 6, 7, 11, and 13 were
identified by both methods.
From the empirical analysis, it was found that the average demand score is approximately 4 and
the average satisfaction score is approximately 3, showing that older adults have a very high demand
level, but their tendency for satisfaction is low. Regarding improvements, it was found that older
adults are most dissatisfied with pensions and living allowances. Therefore, the results of this study are
consistent with the viewpoints of other scholars [22,43]. After all, adequate pensions or government
subsidies would allow older adult retirees without income and/or in poor health to feel more assured.
Taiwan’s declining birthrate and aging population have become more severe. In 2018, more than
14% of the population were older adults, making the nation an aged society. This has forced some
people to rely on foreign workers to provide care, companionship, and other support for older
adults. However, an inadequate foreign labor management system and language barriers have led to
frequent incidents of abuse of older adults by foreign workers. The four service elements related to these
issues (items 5, 6, 11, and 13) therefore reflect older adults’ helplessness and loneliness.
In addition to material and financial satisfaction, spiritual sustenance and care for older adults
are equally crucial.
Of the items that need improvement, pensions and living allowances are difficult to improve because of
financial constraints. For emergency medical care and rescue services and for transportation services,
however, it is recommended that government agencies and hospitals provide transportation resources
for older adults, that private sector operators be encouraged to invest in transport services for older
adults, and that appropriate subsidies be provided to enhance emergency medical services.
The results also show that most of the older people living in residential institutions were male, had
mainly a university education, had better economic status, and had mainly worked in the military, as
government employees, or as teachers before retirement. In Chinese society, many older people are
unwilling to live in residential institutions; in addition to financial planning factors, they fear giving
others the impression that they have been abandoned by their children. Therefore, the proportion living
with their families was also high. The interviews found that older people living with their families or
alone had poorer financial support and needed government subsidies and transportation services. Older
people living in residential institutions had stable pensions or received financial assistance from their
children; their needs were therefore family companionship and health-promoting physical fitness activities.
Finally, this study attempted to analyze differences based on sample background information, such as
gender, age, education, occupation, living status, and marital status, to compare the LTC needs of
different groups. After analysis with the I-S model and SQPM, the results show no significant difference
between the different background samples and the overall sample. A possible reason is that the samples
for this research were Taiwanese older people residing in Taipei and New Taipei who lived in a long-term
care institution or attended a senior citizens’ learning camp. Taipei and New Taipei are the two largest
Taiwanese cities, the population is evenly distributed, and older adults’ views are converging, which
were also likely factors in the results.
5.1. Conclusions
The advancement of medical science and technology has extended the average life expectancy
and brought about changes in social structure and values and, together with the declining birthrate,
has accelerated the advent of aged societies. To cope with the current global aging population, LTC for
older adults is an essential goal of government administrations. To meet older people’s needs, it is
unwise to construct aid devices or service needs from a managerial perspective instead of listening to
them and customizing relevant services based on a user perspective. It is also important to encourage
healthy older people to continue being active, such as by performing volunteer work, participating in
community services, and acting as guides, to enrich their retirement lives and simultaneously increase
their contribution to society. Older adults who are in poor health, suffering from illnesses such as
dementia, or facing disabilities can be provided with medical care to reduce their pains, ensuring that
all older people receive proper care. The government is responsible for appropriately allocating the
necessary resources.
During the implementation process, the I-S model and SQPM were mainly used to evaluate
resource allocation, followed by a secondary consideration of service quality improvement. Seven
services were found to be in urgent need of improvement by this study after integrating the I-S model
and SQPM. These research results can serve as a basis for quality improvement for the Taiwanese
government and service providers when investing resources. If government resources are limited and
cannot be allocated simultaneously, it is suggested that the improvement projects be prioritized in order.
Future research can expand the sample size to compare LTC needs and may find significant differences
between samples with different backgrounds.
Author Contributions: S.-H.C. and T.-M.Y. developed the idea, motivation, and question of the paper and
contributed to the discussion. F.-Y.P. outlined and revised the manuscript and made substantial contributions to
the design of this study. All authors have read and agreed to the published version of the manuscript.
Funding: The Article Processing Charge was funded by the Ministry of Science and Technology of Taiwan under
grant number MOST 108-2221-E-507-004-MY3.
Acknowledgments: The authors would like to express their sincere gratitude to the editor and the anonymous
reviewers for their insightful and constructive comments.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lin, C.S.; Jeng, M.Y.; Yeh, T.M. The elderly perceived meanings and values of virtual reality leisure activities:
A means-end chain approach. Int. J. Environ. Res. Public Health 2018, 15, 663. [CrossRef] [PubMed]
2. Wang, Z.H.; Yang, Z.H.; Dong, T. A review of wearable technologies for elderly care that can accurately
track indoor position, recognize physical activities and monitor vital signs in real time. Sensors 2017, 17, 341.
[CrossRef] [PubMed]
3. Jeng, M.Y.; Pai, F.Y.; Yeh, T.M. The virtual reality leisure activities experience on elderly people. Appl. Res.
Qual. Life 2017, 12, 49–65. [CrossRef]
4. Lin, T.C. Discussion on the Health Promotion Management and the Demand of Life Support in Senior Citizen:
A Case study with Taiwan. CSR Summit Entrep. Cross Strait Coop. 2019, 4, 36–52.
5. Lee, I.; Chiou, C.J.; Su, H.S. Comparison of Caring Status and Needs between Elderly and Non-Elderly
Family Caregivers. J. Long-term Care 2017, 21, 149–164.
6. Lehnert, T.; Heuchert, M.; Hussain, K.; König, H. Stated preferences for long-term care: A literature review.
Ageing Soc. 2019, 39, 1873–1913. [CrossRef]
7. Huang, F.S. The Rise, Trends and Recent Developments of Elder Education in Taiwan. J. Adult Lifelong Educ.
2012, 43, 2–14.
8. Lin, Y.Y. Strategies of enhancing capacity of long-term care policy by government: A case study of taichung
city government. J. Long-Term Care 2017, 21, 19–25.
9. Kim, J.; Han, W. Improving Service Quality in Long-term Care Hospitals: National Evaluation on Long-term
Care Hospitals and Employees Perception of Quality Dimensions. Osong Public Health Res. Perspect. 2012,
3, 94–99. [CrossRef]
10. Chen, S.H. Determining the service demands of an aging population by integrating QFD and FMEA method.
Qual. Quant. 2016, 50, 283–298. [CrossRef]
11. Chen, S.F.; Teng, S.W. The Development of the Long-Term Care Service System in Taiwan. J. Nurs. 2010,
57, 5–10.
12. Yang, C.C. Improvement actions based on the customers’ satisfaction survey. Total Qual. Manag. Bus. Excell.
2003, 14, 919–930. [CrossRef]
13. Chen, S.H.; Yeh, T.M. Integration PEM and AHP methods to determine service quality improvement strategy
for the medical industry. Appl. Math. Inf. Sci. 2015, 9, 3073–3082.
14. Lambert, D.M.; Sharma, A. A customer-based competitive analysis for logistics decisions. Int. J. Phys. Distrib.
Logist. Manag. 1990, 20, 17–24. [CrossRef]
15. Shieh, M.D.; Hsiao, H.C.; Lin, Y.H.; Lin, J.Y. A study of the elderly people’s perception of wearable device
forms. J. Interdiscip. Math. 2017, 20, 789–804. [CrossRef]
16. Helbostad, J.L.; Vereijken, B.; Becker, C.; Todd, C.; Taraldsen, K.; Pijnappels, M.; Aminian, K.; Mellone, S.
Mobile health applications to promote active and healthy ageing. Sensors 2017, 17, 622. [CrossRef]
17. Huang, Y.C.; Chang, T.Y. A Study of Long-term Care Policy 2.0 Edition: An Example for Pingtung County.
J. Manag. Syst. 2019, 26, 241–269.
18. Chung, C.C. Reflect on for long-term care 2.0 from the view point of quality of medical life. Qual. Mag. 2017,
53, 5–11.
19. Lee, M.C.; Kuo, T.C.; Ueng, W.N. Perspectives on Long-Term Care in Taiwan-Integration of Health Care and
Welfare services. HOSPITAL 2019, 52, 1–5.
20. Ministry of Health and Welfare. Long Term Care Plan for Ten Years 2.0 Report; Ministry of Health and Welfare:
Taiwan, China.
21. Sun, W.J.; Chu, D.C.; Huang, S.J. Medical Home Care and National Long-Term Care Integration. J. Med.
2018, 22, 278–282.
22. Liu, Y.H. Older clinical psychology lecture (1). Consult. Couns. 2010, 289, 56–60.
23. Lin, Y.G.; Chen, P.W. The application of community consultation model in the service of the elderly. Couns. Q.
2009, 45, 49–60.
24. Shih, Y.; Lo, H.T.; Chen, C.Y. Study on Quality of Services and Satisfaction for Elderly Day Care Centers: A
Case Study of a Hospital in Southern Taiwan. J. Gerontechnol. Serv. Manag. 2018, 6, 121–136.
25. Martilla, J.A.; James, J.C. Importance-Performance Analysis. J. Mark. 1977, 41, 77–79. [CrossRef]
26. Sampson, S.E.; Showalter, M.J. The performance-importance response function: Observations and
implications. Serv. Ind. J. 1999, 19, 1–25. [CrossRef]
27. Guo, D.S.; Su, S.C. An IPA Analysis of Customer Value. Manag. Sci. Res. 2016, 10, 83–96.
28. Chen, S.H. A performance matrix for strategies to improve satisfaction among faculty members in higher
education. Qual. Quant. 2011, 45, 75–89. [CrossRef]
29. Cheng, C.C.; Tsai, M.C.; Lin, S.P. Developing strategies for improving the service quality of casual-dining
restaurants: New insights from integrating IPGA and QFD analysis. Total Qual. Manag. Bus. Excell. 2015,
26, 415–429. [CrossRef]
30. Wong, K.; Tunku, U.; Rahman, A. Constructing a survey questionnaire to collect data on service quality of
business academics. Eur. J. Soc. Sci. 2012, 29, 209–221.
31. Yeh, T.M.; Lai, H.P. Evaluating the effectiveness of implementing quality management practices in the
medical industry. J. Nutr. Health Aging 2015, 19, 102–112. [CrossRef]
32. Chen, S.H.; Liu, A.C.; Chen, F.Y. Using the PEM method to determine service quality improvement strategies
of medical industry. Int. J. Manag. Stud. Res. 2014, 2, 41–46.
33. Hung, Y.H.; Huang, M.L.; Chen, K.S. Service quality evaluation by service quality performance matrix. Total
Qual. Manag. Bus. Excell. 2003, 14, 79–89. [CrossRef]
34. Lin, W.T.; Chen, S.C.; Chen, K.S. Evaluation of performance in introducing CE marking on the European
market to the machinery industry in Taiwan. Int. J. Qual. Reliab. Manag. 2005, 22, 503–517. [CrossRef]
35. Wu, M.L. Paper Writing and Quantitative Research; Wu-Nan Publishing Company: Taipei, Taiwan, 2018.
36. Yu, C.M.; Chang, H.T.; Chen, K.S. Developing a performance evaluation matrix to enhance the learner satisfaction
of an e-learning system. Total Qual. Manag. Bus. Excell. 2018, 29, 727–745.
37. Kim, S.H.; Kim, D.H.; Kim, W.S. Long-Term Care Needs of the Elderly in Korea and Elderly Long-Term Care
Insurance. Soc. Work Public Health 2010, 25, 176–184. [CrossRef] [PubMed]
38. Nunnally, J.C. Psychometric Theory; McGraw-Hill: New York, NY, USA, 1987.
39. Cuieford, J.P. Fundamental Statistics in Psychology and Education, 4th ed.; McGraw Hill: New York, NY,
USA, 1965.
40. Iwagami, M.; Tamiya, N. The Long-Term Care Insurance System in Japan: Past, Present, and Future.
JMA J. 2019, 2, 67–69.
41. Arai, A.; Ozaki, T.; Katsumata, Y. Behavioral and psychological symptoms of dementia in older residents
in long term care facilities in Japan: A cross-sectional study. Aging Ment. Health 2017, 21, 1099–1105.
[CrossRef]
42. Chen, S.H.; Yang, C.C.; Shiau, J.Y.; Wang, H.H. The development of an employee satisfaction model for
higher education. TQM Mag. 2006, 18, 484–500. [CrossRef]
43. Liao, W.C.; Chiu, L.A.; Yueh, H.P. A study of rural elderly’s health information needs and seeking behavior.
J. Libr. Inf. Stud. 2012, 10, 155–204.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Investigation of Vocal Fatigue Using a Dose-Based
Vocal Loading Task
Zhengdong Lei 1, Laura Fasanella 1, Lisa Martignetti 2, Nicole Yee-Key Li-Jessen 2 and Luc Mongeau 1,*,†
1 Department of Mechanical Engineering, McGill University, Montreal, QC H3A 0C3, Canada;
[email protected] (Z.L.); [email protected] (L.F.)
2 School of Communication Sciences and Disorders, McGill University, Montreal, QC H3A 1G1, Canada;
[email protected] (L.M.); [email protected] (N.Y.-K.L.-J.)
* Correspondence: [email protected]
† Current address: 845 Sherbrooke Street West, Montreal, QC H3A 0G4, Canada.
Abstract: Vocal loading tasks are often used to investigate the relationship between voice use and
vocal fatigue in laboratory settings. The present study investigated the concept of a novel quantitative
dose-based vocal loading task for vocal fatigue evaluation. Ten female subjects participated in the
study. Voice use was monitored and quantified using an online vocal distance dose calculator during
six consecutive 30-min long sessions. Voice quality was evaluated subjectively using the CAPE-V and
SAVRa before, between, and after each vocal loading task session. Fatigue-indicative symptoms, such
as cough, swallowing, and voice clearance, were recorded. Statistical analysis of the results showed
that the overall severity, the roughness, and the strain ratings obtained from CAPE-V obeyed similar
trends as the three ratings from the SAVRa. These metrics increased over the first two thirds of the
sessions to reach a maximum, and then decreased slightly near the session end. Quantitative metrics
obtained from surface neck accelerometer signals were found to obey similar trends. The results
consistently showed that an initial adjustment of voice quality was followed by vocal saturation,
supporting the effectiveness of the proposed loading task.
1. Introduction
Vocal fatigue may be diagnosed through a series of voice symptoms, which include, for example,
hoarse and breathy vocal qualities, pitch breaks, reduced pitch and loudness ranges, throat discomfort,
and unsteady voice [1]. Vocal fatigue may be experienced by any individual during their lifetime, but
it is more frequently encountered by professional voice users in occupational settings. Vocal fatigue
increases vocal effort and decreases speaking stamina. Ultimately, vocal fatigue can lead to voice
disorders, such as vocal hyperfunction or vocal nodules. Vocal fatigue is difficult to define because
many factors, such as self-reported feelings, doctor-rated symptoms, and instrumental measures, could
be criteria for its determination. For example, a self-reported feeling of vocal fatigue might be due
to psychological stress, thereby not causing much detectable change in physiological measures [2].
Standards for assessing vocal fatigue are therefore difficult to establish. Nevertheless, most current
research adopted the definition of vocal fatigue as a sense of increased vocal effort [3].
Considerable progress has been made in the evaluation of voice quality. Perceptual, acoustic,
and aerodynamic measurements, along with self-administered tests, have been used to characterize
changes in voice quality and performance in laboratory settings [4,5]. The most commonly used
method of voice quality evaluation in clinics is auditory perception, which relies on listeners’ personal
experience and expertise. Commonly used subjective evaluation tools included the GRBAS (Grade,
Roughness, Breathiness, Asthenia and Strain) proposed by the Japan Society of Logopedics and
Phoniatrics, the CAPE-V (Consensus Auditory-Perceptual Evaluation of Voice) proposed by the
American Speech-Language and Hearing Association, and the SAVRa (Self-Administrated Voice
Rating) proposed by the National Center for Voice and Speech in the United States. These tools
require specific vocal stimuli. For example, the CAPE-V requires the completion of three defined
phonation tasks assessed through perceptual rating. This therefore limits the applicability of these
tools in situations where the vocal stimuli are varied or unspecified. Many studies have investigated
uncertainties in subjective judgment methodologies for voice quality evaluation. Kreiman and Gerratt
investigated the source of listener disagreement in voice quality assessment using unidimensional
rating scales, and found that no single metric from natural voice recordings allowed the evaluation of
voice quality [6]. Kreiman also found that individual standards of voice quality, scale resolution, and
voice attribute magnitude also significantly influenced intra-rater agreement [7]. Objective metrics
obtained using various acoustic instruments have been investigated, and attempts have been made to
correlate these with perceptual voice quality assessments [8–12].
A plethora of temporal, spectral, and cepstral metrics have been proposed to evaluate voice
quality [13,14]. Commonly used features or vocal metrics include fundamental frequency ( f 0),
loudness, jitter, shimmer, vocal formants, harmonic-to-noise ratio (HNR), spectral tilt (H1-H2, harmonic
richness factor), maximum flow declination rate (MFDR), duty ratio, cepstral peak prominence (CPP),
Mel-frequency cepstral coefficients (MFCCs), power spectrum ratio, and others [15–19]. Self-reported
feelings of decreased vocal functionality have been used as a criterion for vocal fatigue in many
previous studies [1,4,20–22]. Standard self-administered questionnaires, such as the SAVRa and the
Vocal Fatigue Index (VFI), have been used to identify individuals with vocal fatigue, and to characterize
their symptoms [23–25]. Hunter and Titze used the SAVRa to quantify vocal fatigue recovery based
on 86 participants’ tracking reports. The results showed a self-reported 50% recovery within 4–6 h,
and 90% recovery within 12–18 h [24]. Halpern et al. used one of the three dimensions in SAVRa,
i.e., the inability to produce soft voice (IPSV), to track vocal changes in school teachers. The SAVRa
scores were then compared with two clinicians’ ratings of the participants’ f 0, and loudness [26]. The
overall correlation between self-ratings and clinician ratings was not significant. The average absolute
difference score was 1.7. This showed that the clinicians and the teachers had different rating standards.
Prolonged or inappropriate voice use is commonly regarded as one of the causes of vocal fatigue.
Vocal loading tasks (VLTs) are often used to investigate the relationship between voice use and
vocal fatigue in laboratory settings [1]. Previous VLT studies have typically instructed participants
to complete standardized reading tasks of prescribed durations at specific loudness levels [1,4,24].
Perceptual, acoustic, and aerodynamic measurements have been used to evaluate changes in voice
quality and performance before and after VLTs [4]. Unfortunately, the findings from these studies were
often reported as inconsistent. Comparisons across different studies have been at times contradictory.
This may have been caused by multiple factors: (1) the prescribed VLT might not have induced a
detectable level of vocal fatigue across individuals, (2) the amount of vocal loading across participants
may not have been consistent due to the lack of a universal method to quantify vocal loading, and (3)
there may have been variability in experimental settings. A more robust vocal loading protocol for
VLT is therefore needed to improve consistency, and to allow comparisons across different studies.
Amongst methods of quantifying voice use for vocal fatigue assessment, the vocal distance dose,
Dd , first proposed by Titze et al., was adopted in the present study [27]. The vocal distance dose
attempts to approximately quantify the distance traveled by vocal folds during vocal oscillation [27,28].
It is usually calculated in terms of the fundamental frequency, the sound pressure level, the voicing
duration, and the vocal duty ratio. Whether Dd correspondes to the true cumulative vocal fold
displacement has not yet been verified. But, as a four-parameter estimates, vocal distance dose is more
comprehensive than other metrics that are based on one single parameter. Svec et al. described the
procedures to calculate the distance dose using synchronized microphone and EGG data [29]. The EGG
signal was used to locate the peak position for each vocal cycle in the time domain. The microphone
signal was used to quantify loudness. Carroll et al. used cumulative vocal dose data correlations
with subjective measurements in vocal fatigue experiments. An abrupt increase in vocal loading was
closely related to a harsher subjective self-reported rating [30]. Echternach et al. found that a 10-min
intensive VLT with a >80 dB loudness level was comparable to a 45-min teaching task in terms of
vocal dose [31]. Remacle et al. showed that kindergarten teachers had significantly greater distance
doses than elementary school teachers, based on an investigation of 12 kindergarten and 20 elementary
school female teachers [5]. Bottalico and Astolfi calculated the vocal distance dose and the sound
pressure level (SPL) for school teachers during their daily teaching assignments. They found that
female teachers had on average a higher (>3.4 dB) loudness level than male teachers, but vocal distance
doses did not differ very much between female and male teachers [32]. Morrow and Connor used an
ambulatory phonation monitor to record and calculate the SPL and the distance dose for elementary
music teachers with and without voice amplification [33]. The results showed that voice amplification
significantly decreased the average SPL and the distance dose. These studies used the vocal distance
dose as a quantitative measure of voice use in the VLTs or routine phonation tasks. Despite progress,
no definitive correlations have been yet made between subjective assessments and objective measures
in vocal fatigue studies. The distance dose prospectively offers a quantitative metric for vocal loading.
Such a framework is essential for cross-participant and cross-session comparisons to be meaningful
and reasonable.
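The bookkeeping behind such a cumulative distance dose can be sketched as a sum over short voiced frames, using the commonly cited approximation Dd ≈ Σ 4·A·f0·Δt (the folds traverse roughly four amplitudes per cycle). In the Python sketch below, the frame duration, the frame data, and the amplitude estimator estimate_amplitude_m are hypothetical placeholders; the published amplitude rules based on SPL and f0 [27] are not reproduced here.

```python
from typing import Iterable, Tuple

def estimate_amplitude_m(spl_db: float, f0_hz: float) -> float:
    """Placeholder vocal fold amplitude estimate in metres (hypothetical); a real implementation
    would use an empirical SPL/f0-based rule such as the one proposed for the distance dose [27]."""
    return 1e-3 * (spl_db / 80.0)

def distance_dose_m(frames: Iterable[Tuple[float, float, bool]], frame_s: float = 0.02) -> float:
    """Accumulate Dd = sum of 4 * A * f0 * dt over voiced frames (f0 in Hz, SPL in dB, voiced flag)."""
    dd = 0.0
    for f0_hz, spl_db, voiced in frames:
        if voiced:
            dd += 4.0 * estimate_amplitude_m(spl_db, f0_hz) * f0_hz * frame_s
    return dd

# Hypothetical 20 s of speech in 0.02 s frames, with every second frame voiced.
frames = [(220.0, 75.0, i % 2 == 0) for i in range(1000)]
print(f"distance dose over 20 s: {distance_dose_m(frames):.1f} m")
```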
In the present study, a uniquely designed VLT was investigated. Ten human subjects were
recruited and participated in the study. The vocal distance dose was used to quantify the participants’
vocal loading online during the experiment. Subjective and objective measures were used to assess
participants’ voice qualities. A cross-session comparison was made to investigate the relationship
between total distance dose and voice quality.
3. Participant Recruitment
The human research ethics protocol (A09-M46-11A) was approved by the Institutional Review
Board at McGill University. No occupational voice users such as singers, teachers, and voice actors
were recruited, because previous studies showed that their voices have greater endurance for vocal
loading than those of normal voice users [1,34]. The purpose of the study was not communicated to the
participants before the experiment was concluded. The participants were only informed that they
had to perform reading sessions. This was to help reduce the participants’ biases towards the SAVRa
ratings. For example, if the participants were aware that the study was measuring vocal fatigue, they
would expect their voice quality to degrade throughout the sessions. This may increase the risk of
biased ratings through the introduction of a psychological variable into the cross-session analysis of
the participants’ voice quality.
Participants were recruited in Montreal, Canada. The inclusion criteria were that the participants
should be female, native English speakers with no history of voice disorders. Sex and gender differences
in voice performance were found to be significant by Hunter et al. [35,36]. The present study only used
female participants to exclude sex as a variable in the pilot study. Male participants will be used in
future experiments. The participants’ demographic information is shown in Table 1. The experiments
were conducted in a sound-proof voice recording studio located in the Centre for Interdisciplinary
Research in Music Media and Technology (CIRMMT). All experiments took place in the morning,
around 9:00 a.m. The participants were instructed not to use their voice often over a period of 8 h
before the experiments. The participants were required to withdraw from the study if they were
found to have a voice problem, such as cough and cold, in the early morning. During the experiment,
participants who reported any severe physical discomforts, such as unceasing cough and voice loss,
were asked to withdraw from the recording session.
Table 1. Participants' demographic information.
ID Age Occupation
1 32 SLP
2 22 psychology student
3 38 psychology student
4 25 engineering student
5 23 SLP student
6 22 nutrition student
7 24 SLP student
8 26 engineering student
9 31 arts student
10 26 psychology student
Figure 1. Measurement protocol of the vocal loading task in the vocal fatigue study. QEi (i = 1, 2, .., 8):
Quality Evaluation; Si (i = 1, 2, .., 6) represents each vocal loading task session.
the VLT sessions, so that they could adjust their vocal effort online, in real time. The virtual vocal
distance dose monitoring tool prompted participants to read intensively throughout the VLT sessions.
The 20 s distance dose calculation algorithm is illustrated in Figure 3. The total distance dose was
calculated as the sum of all previous 20 s distance doses. The recording devices were a condenser
acoustic microphone (Type 4178, Brüel & Kjær, Denmark) and the NSA. The microphone sensitivity was
verified using a calibrated precision sound pressure level meter (Type 2250-L, Brüel & Kjær, Denmark).
McGill University. The CAPE-V recordings were blindly rated by the SLPs. There were eight audio
recordings for each participant. The order of within-participant files was randomized to reduce the
SLPs’ rating bias. The audio files were rated participant by participant, so that the SLPs could identify
the participant-specific baseline of voice quality for each rating task.
Voice features, such as f 0, SPL, duty ratio, CPP, spectral tilt, HRF, jitter, and shimmer extracted
from the recorded NSA and microphone signals, were labeled with the corresponding session numbers.
These voice features were compared across sessions to track the variations between the participants’
voice qualities. These vocal metrics were also compared with the CAPE-V and SAVRa ratings.
Table 2. List of features extracted from the microphone and NSA signals.
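Among the perturbation features listed in Table 2, jitter and shimmer quantify cycle-to-cycle variation of the glottal period and of the peak amplitude, respectively. The sketch below is a generic, minimal implementation of the local variants on hypothetical cycle data; it is not the exact extraction pipeline applied to the microphone and NSA signals in this study.

```python
import numpy as np

def jitter_local(periods_s: np.ndarray) -> float:
    """Local jitter: mean absolute difference of consecutive cycle periods, relative to the mean period."""
    return float(np.mean(np.abs(np.diff(periods_s))) / np.mean(periods_s))

def shimmer_local(amplitudes: np.ndarray) -> float:
    """Local shimmer: mean absolute difference of consecutive peak amplitudes, relative to the mean amplitude."""
    return float(np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes))

# Hypothetical glottal cycle periods (roughly a 250 Hz voice) and peak amplitudes.
periods = np.array([0.0040, 0.0041, 0.0039, 0.0040, 0.0042])
amplitudes = np.array([0.51, 0.49, 0.52, 0.50, 0.48])
print(jitter_local(periods), shimmer_local(amplitudes))
```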
The multivariate analysis in the present study proceeded as follows. A trend was first sought in the
SAVRa scores to establish a baseline for identifying vocal fatigue. A similar trend was then sought in
the CAPE-V scores to verify the consistency between the self-administered and SLP-rated methods.
Finally, the same trend was sought in the fifteen acoustic features to demonstrate that specific features
have the potential to indicate vocal fatigue.
Figure 4. Cross-session variations of the mean and standard deviation values of SAVRa ratings for all
participants (n = 10), plotted against the quality evaluation number: (a) speaking effort level (EFFT);
(b) laryngeal discomfort level (DISC); (c) inability to produce soft voice (IPSV).
The Pearson correlation coefficients between the overall severity trace and other dimensions in
the CAPE-V ratings are shown in Table 4. This indicated that the overall severity, roughness, and strain
variations were well correlated. The mean values of the overall severity, the roughness, and the strain
scores followed a trend similar to that of the SAVRa results, as shown in Figure 5. They increased
rapidly after S1 (QE1–QE3), remained constant for several sessions, and decreased over the remainder
of the session. This trend was generally consistent with those of all three dimensions in SAVRa ratings,
which showed a vocal ‘transition’ or ‘adjustment’ period in the first session and a vocal ‘recovery’
period during the final session. The maximum values of these three dimensions, shown in Figure 5,
occurred at QE6. The mean and deviation values of the three dimensions of the SAVRa ratings are
shown in Table 5. Other dimensions in the CAPE-V ratings followed different trends. The standard
deviations of the CAPE-V ratings across SLPs were larger than those of the SAVRa ratings.
Figure 5. Cross-session variations of the mean and standard deviation values of CAPE-V ratings for all
participants (n = 10), plotted against the quality evaluation number; panels include (c) breathiness,
(d) strain, (e) pitch, and (f) loudness.
Table 4. Pearson correlation coefficients between the overall severity and other dimensions.
Table 5. CAPE-V mean and deviation values in terms of VLT session. The display format was x (y),
where x was the mean and y was the standard deviation. OS: overall severity, RG: Roughness, BT:
Breathiness, ST: Strain, PT: Pitch, LD: Loudness.
OS RG BT ST PT LD
QE1 0.22 (0.20) 0.20 (0.24) 0.16 (0.23) 0.14 (0.21) 0.04 (0.14) 0.06 (0.18)
QE2 0.39 (0.21) 0.34 (0.23) 0.10 (0.20) 0.31 (0.23) 0.12 (0.25) 0.03 (0.12)
QE3 0.48 (0.23) 0.39 (0.26) 0.15 (0.25) 0.44 (0.27) 0.09 (0.21) 0.20 (0.30)
QE4 0.42 (0.24) 0.37 (0.25) 0.14 (0.23) 0.41 (0.25) 0.11 (0.24) 0.06 (0.13)
QE5 0.46 (0.20) 0.42 (0.26) 0.16 (0.27) 0.44 (0.22) 0.10 (0.20) 0.07 (0.17)
QE6 0.53 (0.24) 0.50 (0.29) 0.14 (0.24) 0.46 (0.23) 0.19 (0.33) 0.07 (0.15)
[Figure 6 plots, per session, the number of occurrences of each symptom: cough, swallowing, throat clear, activation, and voice break.]
Figure 6. Mean counts of vocal fatigue symptom appearances for all participants during the VLTs, by session. The vertical blue dashed lines separate the VLT sessions.
jitter_MIC increased rapidly from S1 to S4 by 10% but decreased by 5% afterwards, which was not
well correlated with the subjective rating results for the VLT sessions. The jitter_NSA showed a trend
similar to jitter_MIC, but with a larger decrease after S3. The mean values of the shimmer_MIC and
shimmer_NSA were well correlated (r = 0.91, p = 0.987). Their progressions were similar to that of the
f 0. This indicates that the loudness perturbation increased rapidly in the early sessions, and saturated
at a high level over the last sessions.
[Figure 7 shows one panel per acoustic feature plotted against session number (0–7); the visible panel titles include CPP_MIC, jitter_MIC, shimmer_MIC, CPP_NSA, jitter_NSA, shimmer_NSA, TILT_MIC, HRF_MIC, H1H2_MIC, TILT_NSA, HRF_NSA, and H1H2_NSA.]
Figure 7. Cross-session variations of the means and confidence levels of each feature for all participants.
A decrease in spectral tilt slope was found to correlate with stressed phonation by Sluijter and van Heuven [46]. A decrease in spectral tilt slope indicates that higher frequencies are increased more than lower frequencies. In Figure 7, the TILT_MIC and TILT_NSA slopes generally decreased from S1 to S6. Thus, the high frequencies of both the microphone and NSA spectra increased throughout the VLT sessions. Childers and Lee found that vocal fry had a higher HRF value than modal and breathy voice [47]. The HRF_MIC and HRF_NSA followed trends similar to those of TILT_MIC and TILT_NSA. The HRF_MIC was well correlated (r = 0.94, p = 0.98) with the HRF_NSA; these two features showed nearly identical trends. The HRF mean values decreased from S1 and converged by S6. The H1H2_NSA decreased by 50% from S1 to S6, remaining roughly constant from S2 to S5. A comprehensive analysis of the Pearson correlations between each pair of features is shown in Table 6. There are two groups of well-correlated (r > 0.90) features. The first group includes f0, SPL_MIC, duty ratio, shimmer_MIC, and shimmer_NSA; it shows a rapid increase in the early sessions followed by saturation. The second group includes TILT_MIC, TILT_NSA, HRF_MIC, and HRF_NSA; it shows a general decrease from S1 to S6.
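A Table 6-style matrix can be obtained in a few lines; the minimal sketch below assumes each feature's cross-session mean trace is available as one row of a 2D array (the placeholder data are random, not the study's measurements).

```python
import numpy as np

# Placeholder array: one row per feature (15 features), one column per session (S1-S6).
rng = np.random.default_rng(0)
feature_traces = rng.normal(size=(15, 6))   # random stand-in for the real traces

# np.corrcoef treats each row as a variable, giving the 15 x 15 Pearson matrix
# (Table 6 reports its lower triangle).
corr = np.corrcoef(feature_traces)

# Collect strongly correlated feature pairs (|r| > 0.90), as discussed in the text.
strong_pairs = [(i + 1, j + 1, round(corr[i, j], 2))
                for i in range(corr.shape[0]) for j in range(i)
                if abs(corr[i, j]) > 0.90]
print(strong_pairs)
```

With the actual feature traces in place of the random placeholder, the same call reproduces the grouping described above.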
Table 6. Pearson correlation coefficients for each pair of features. The fi (i = 1, ..., 15) represent f0, SPL, duty ratio, CPP_MIC, jitter_MIC, shimmer_MIC, CPP_NSA, jitter_NSA, shimmer_NSA, TILT_MIC, HRF_MIC, H1H2_MIC, TILT_NSA, HRF_NSA, and H1H2_NSA, respectively.
f3 0.97 0.86 1 * * * * * * * * * * * *
f4 −0.39 −0.20 −0.45 1 * * * * * * * * * * *
f5 0.61 0.69 0.44 −0.51 1 * * * * * * * * * *
f6 0.96 0.96 0.90 −0.40 0.74 1 * * * * * * * * *
f7 0.68 0.46 0.76 −0.91 0.46 0.63 1 * * * * * * * *
f8 −0.44 −0.24 −0.56 0.70 −0.15 −0.30 −0.78 1 * * * * * * *
f9 0.95 0.94 0.90 −0.28 0.59 0.91 0.57 −0.51 1 * * * * * *
f10 −0.96 −0.97 −0.87 0.37 −0.75 −0.96 −0.59 0.29 −0.89 1 * * * * *
f11 −0.95 −0.89 −0.89 0.62 −0.79 −0.95 −0.79 0.49 −0.87 0.95 1 * * * *
f12 −0.06 −0.20 0.08 −0.50 −0.04 −0.01 0.45 −0.54 0.03 0.22 −0.03 1 * * *
f13 −0.80 −0.63 −0.89 0.62 −0.30 −0.67 −0.86 0.87 −0.79 0.67 0.77 −0.26 1 * *
f14 −0.88 −0.72 −0.91 0.77 −0.59 −0.83 −0.94 0.69 −0.77 0.82 0.94 −0.21 0.90 1 *
f15 −0.82 −0.80 −0.83 0.03 −0.23 −0.73 −0.40 0.47 −0.92 0.68 0.63 −0.10 0.73 0.58 1
that normal swallowing on a daily basis was not measured, and thus the fatigue-indicative symptom
analysis lacked a baseline to refer to. This baseline could be established in future work.
Author Contributions: Conceptualization, Z.L., N.Y.-K.L.-J. and L.M. (Luc Mongeau); methodology, Z.L. and
L.M. (Luc Mongeau); software, Z.L.; validation, Z.L., N.Y.-K.L.-J. and L.M. (Luc Mongeau); formal analysis, Z.L.;
investigation, Z.L.; resources, L.M. (Lisa Martignetti), N.Y.-K.L.-J. and L.M. (Luc Mongeau); data curation, Z.L.,
L.F. and L.M. (Lisa Martignetti); writing—original draft preparation, Z.L.; writing—review and editing, Z.L., L.F.,
L.M. (Lisa Martignetti), N.Y.-K.L.-J. and L.M. (Luc Mongeau); visualization, Z.L.; supervision, N.Y.-K.L.-J. and
L.M. (Luc Mongeau); project administration, L.M. (Luc Mongeau); funding acquisition, N.Y.-K.L.-J. and L.M. (Luc
Mongeau). All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Institutes of Health (Grant R01 DC-005788) and the Canadian
Institutes of Health Research (388583).
Acknowledgments: The financial support of the Natural Sciences and Engineering Research Council of Canada is gratefully acknowledged.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Welham, N.V.; Maclagan, M.A. Vocal Fatigue: Current Knowledge and Future Directions. J. Voice 2003,
17, 21–30. [CrossRef]
2. Thomas, G.; de Jong, F.; Kooijman, P.G.C.; Donders, A.R.T.; Cremers, C.W. Voice complaints, risk factors
for voice problems and history of voice problems in relation to puberty in female student teachers. Folia
Phoniatr. Logop. 2006, 58, 305–322. [CrossRef]
3. Solomon, N.P. Vocal fatigue and its relation to vocal hyperfunction. Int. J. Speech-Lang. Pathol. 2008, 10, 254–266. [CrossRef]
4. Remacle, A.; Morsomme, D.; Berrué, E.; Finck, C. Vocal Impact of a Prolonged Reading Task in Dysphonic
Versus Normophonic Female Teachers. J. Voice 2012, 26, 820.e1–820.e13. [CrossRef]
5. Remacle, A.; Morsomme, D.; Finck, C. Comparison of Vocal Loading Parameters in Kindergarten and
Elementary School Teachers. J. Speech Lang. Hear. Res. 2014, 57, 406–415. [CrossRef]
6. Kreiman, J.; Gerratt, B.R. Sources of listener disagreement in voice quality assessment. J. Acoust. Soc. Am.
2000, 108, 1867–1876. [CrossRef] [PubMed]
7. Kreiman, J.; Gerratt, B.R.; Ito, M. When and why listeners disagree in voice quality assessment tasks.
J. Acoust. Soc. Am. 2007, 122, 2354–2364. [CrossRef] [PubMed]
8. Pabon, J.P.H. Objective acoustic voice-quality parameters in the computer phonetogram. J. Voice 1991,
5, 203–216. [CrossRef]
9. Roy, N.; Barkmeier-Kraemer, J.; Eadie, T.; Sivasankar, M.P.; Mehta, D.; Paul, D.; Hillman, R. Evidence-Based Clinical Voice Assessment: A Systematic Review. Am. J. Speech-Lang. Pathol. 2013, 22, 212–226. [CrossRef]
10. Awan, S.N.; Roy, N. Toward the development of an objective index of dysphonia severity: A four-factor
acoustic model. Clin. Linguist. Phon. 2006, 20, 35–49. [CrossRef] [PubMed]
11. Godino-Llorente, J.I.; Gómez-Vilda, P.; Cruz-Roldán, F.; Blanco-Velasco, M.; Fraile, R. Pathological Likelihood
Index as a Measurement of the Degree of Voice Normality and Perceived Hoarseness. J. Voice 2010,
24, 667–677. [CrossRef] [PubMed]
12. Halberstam, B. Acoustic and Perceptual Parameters Relating to Connected Speech Are More Reliable
Measures of Hoarseness than Parameters Relating to Sustained Vowels. Orl Otorhinolaryngol. Relat. Spec.
2004, 66, 70–73. [CrossRef] [PubMed]
13. Holmberg, E.B.; Hillman, R.E.; Perkell, J.S. Glottal airflow and transglottal air pressure measurements for
male and female speakers in soft, normal, and loud voice. J. Acoust. Soc. Am. 1988, 84, 511–529. [CrossRef]
[PubMed]
14. Holmberg, E.B.; Hillman, R.E.; Perkell, J.S.; Guiod, P.C.; Goldman, S.L. Comparisons Among Aerodynamic,
Electroglottographic, and Acoustic Spectral Measures of Female Voice. J. Speech Lang. Hear. Res. 1995,
38, 1212–1223. [CrossRef]
15. McAllister, A.; Sundberg, J.; Hertegård, S.R. Acoustic measurements and perceptual evaluation of hoarseness in children’s voices. Logop. Phoniatr. Vocology 1998, 23, 27–38. [CrossRef]
16. Heman-Ackah, Y.D.; Michael, D.D.; Baroody, M.M.; Ostrowski, R.; Hillenbrand, J.; Heuer, R.J.; Horman,
M.; Sataloff, R.T. Cepstral Peak Prominence: A More Reliable Measure of Dysphonia. Ann. Otol. Rhinol.
Laryngol. 2003, 112, 324–333. [CrossRef]
17. Godino-Llorente, J.I.; Gomez-Vilda, P.; Blanco-Velasco, M. Dimensionality Reduction of a Pathological Voice
Quality Assessment System Based on Gaussian Mixture Models and Short-Term Cepstral Parameters. IEEE
Trans. Biomed. Eng. 2006, 53, 1943–1953. [CrossRef]
18. Fang, S.H.; Tsao, Y.; Hsiao, M.J.; Chen, J.Y.; Lai, Y.H.; Lin, F.C.; Wang, C.T. Detection of Pathological Voice
Using Cepstrum Vectors: A Deep Learning Approach. J. Voice 2018. [CrossRef]
19. Klatt, D.H.; Klatt, L.C. Analysis, synthesis, and perception of voice quality variations among female and
male talkers. J. Acoust. Soc. Am. 1990, 87, 820–857. [CrossRef]
20. Kitch, J.A.; Oates, J. The perceptual features of vocal fatigue as self-reported by a group of actors and singers.
J. Voice 1994, 8, 207–214. [CrossRef]
21. Kitch, J.A.; Oates, J.; Greenwood, K. Performance effects on the voices of 10 choral tenors: Acoustic and
perceptual findings. J. Voice 1996, 10, 217–227. [CrossRef]
22. Welham, N.V.; Maclagan, M.A. Vocal fatigue in young trained singers across a solo performance: A
preliminary study. Logop. Phoniatr. Vocology 2004, 29, 3–12. [CrossRef] [PubMed]
23. Hunter, E.J. General Statistics of the NCVS Self-Administered Vocal Rating (SAVRa). Natl. Cent. Voice Speech
Online Tech. Memo 2008, 11, 1–10.
24. Hunter, E.J.; Titze, I.R. Quantifying Vocal Fatigue Recovery: Dynamic Vocal Recovery Trajectories after a
Vocal Loading Exercise. Ann. Otol. Rhinol. Laryngol. 2009, 118, 449–460. [CrossRef]
25. Nanjundeswaran, C.; Jacobson, B.H.; Gartner-Schmidt, J.; Abbott, K.V. Vocal Fatigue Index (VFI):
Development and Validation. J. Voice 2015, 29, 433–440. [CrossRef]
26. Halpern, A.E.; Spielman, J.L.; Hunter, E.J.; Titze, I.R. The inability to produce soft voice (IPSV): A tool to
detect vocal change in school-teachers. Logop. Phoniatr. Vocology 2009, 34, 117–127. [CrossRef]
27. Titze, I.R.; Svec, J.G.; Popolo, P.S. Vocal Dose Measures: Quantifying Accumulated Vibration Exposure in
Vocal Fold Tissues. J. Speech Lang. Hear. Res. 2003, 46, 919–932. [CrossRef]
28. Titze, I.R.; Hunter, E.J. Comparison of Vocal Vibration-Dose Measures for Potential-Damage Risk Criteria. J.
Speech Lang. Hear. Res. 2015, 58, 1425–1439. [CrossRef]
29. Švec, J.G.; Popolo, P.S.; Titze, I.R. Measurement of vocal doses in speech: experimental procedure and signal
processing. Logop. Phoniatr. Vocol. 2003, 28, 181–192. [CrossRef]
30. Carroll, T.; Nix, J.; Hunter, E.; Emerich, K.; Titze, I.; Abaza, M. Objective Measurement of Vocal Fatigue in
Classical Singers: A Vocal Dosimetry Pilot Study. Otolaryngol. Neck Surg. 2006, 135, 595–602. [CrossRef]
31. Echternach, M.; Nusseck, M.; Dippold, S.; Spahn, C.; Richter, B. Fundamental frequency, sound pressure
level and vocal dose of a vocal loading test in comparison to a real teaching situation. Eur. Arch. Oto-Rhino-Laryngol. 2014, 271, 3263–3268. [CrossRef] [PubMed]
32. Bottalico, P.; Astolfi, A. Investigations into vocal doses and parameters pertaining to primary school teachers
in classrooms. J. Acoust. Soc. Am. 2012, 131, 2817–2827. [CrossRef] [PubMed]
33. Morrow, S.L.; Connor, N.P. Voice Amplification as a Means of Reducing Vocal Load for Elementary Music
Teachers. J. Voice 2011, 25, 441–446. [CrossRef] [PubMed]
34. Titze, I.R.; Sundberg, J. Vocal intensity in speakers and singers. J. Acoust. Soc. Am. 1992, 91, 2936–2946.
[CrossRef]
35. Hunter, E.J.; Banks, R.E. Gender Differences in the Reporting of Vocal Fatigue in Teachers as Quantified by
the Vocal Fatigue Index. Ann. Otol. Rhinol. Laryngol. 2017, 126, 813–818. [CrossRef]
36. Hunter, E.J.; Tanner, K.; Smith, M.E. Gender differences affecting vocal health of women in vocally
demanding careers. Logop. Phoniatr. Vocology 2011, 36, 128–136. [CrossRef]
37. Rowling, J.K. Harry Potter and the Sorcerer’s Stone. 2001. Available online: http://www2.sdfi.edu.cn/
netclass/jiaoan/englit/download/Harry%20Potter%20and%20the%20Sorcerer’s%20Stone.pdf (accessed
on 4 February 2020).
38. Lei, Z.; Kennedy, E.; Fasanella, L.; Li-Jessen, N.Y.K.; Mongeau, L. Discrimination between Modal, Breathy and
Pressed Voice for Single Vowels Using Neck-Surface Vibration Signals. Appl. Sci. 2019, 9, 1505. [CrossRef]
39. Mehta, D.D.; Zañartu, M.; Feng, S.W.; Cheyne, H.A.; Hillman, R.E. Mobile Voice Health Monitoring Using a
Wearable Accelerometer Sensor and a Smartphone Platform. IEEE Trans. Biomed. Eng. 2012, 59, 3090–3096.
[CrossRef]
40. Švec, J.G.; Titze, I.R.; Popolo, P.S. Estimation of sound pressure levels of voiced speech from skin vibration
of the neck. J. Acoust. Soc. Am. 2005, 117, 1386–1394. [CrossRef]
41. Zanartu, M. Acoustic Coupling in Phonation and Its Effect On Inverse Filtering of Oral Airflow and Neck
Surface Acceleration. Ph.D. Thesis, Purdue University, West Lafayette, IN, USA, 2010.
42. Zraick, R.; Kempster, G.; Connor, N.; Thibeault, S.; Klaben, B.; Bursac, Z.; Thrush, C.; Glaze, L. Establishing validity of the consensus auditory-perceptual evaluation of voice (CAPE-V). Am. J. Speech-Lang. Pathol. 2011, 20, 14–22. [CrossRef]
43. Rabiner, L.; Schafer, R. Digital Processing of Speech Signals; Prentice-Hall Signal Processing Series; Prentice-Hall:
Englewood, NJ, USA, 1978.
44. Yiu, E.M.L.; Chan, R.M. Effect of Hydration and Vocal Rest on the Vocal Fatigue in Amateur Karaoke Singers.
J. Voice 2003, 17, 216–227. [CrossRef]
45. Solomon, N.P.; Glaze, L.E.; Arnold, R.R.; van Mersbergen, M. Effects of a Vocally Fatiguing Task and
Systemic Hydration on Men’s Voices. J. Voice 2003, 17, 31–46. [CrossRef]
46. Sluijter, A.M.C.; van Heuven, V.J. Spectral balance as an acoustic correlate of linguistic stress. J. Acoust. Soc.
Am. 1996, 100, 2471–2485. [CrossRef] [PubMed]
47. Childers, D.G.; Lee, C.K. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Am. 1991,
90, 2394–2410. [CrossRef]
48. Sivasankar, M.; Leydon, C. The role of hydration in vocal fold physiology. Curr. Opin. Otolaryngol. Head
Neck Surg. 2010, 18, 171–175. [CrossRef]
49. Gelfer, M.P.; Andrews, M.L.; Schmidt, C.P. Effects of prolonged loud reading on selected measures of vocal
function in trained and untrained singers. J. Voice 1991, 5, 158–167. [CrossRef]
50. Hunter, E.J.; Titze, I.R. Variations in Intensity, Fundamental Frequency, and Voicing for Teachers in
Occupational Versus Non-Occupational Settings. J. Speech Lang. Hear. Res. 2010, 53, 862–875. [CrossRef]
51. Laukkanen, A.M.; Ilomäki, I.; Leppänen, K.; Vilkman, E. Acoustic Measures and Self-reports of Vocal Fatigue
by Female Teachers. J. Voice 2008, 22, 283–289. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Classification Maps in Studies on the
Retirement Threshold
Agnieszka Bielińska 1 , Dorota Bielińska-Wa̧ż 2, * and Piotr Wa̧ż 3
1 Department of Quality of Life Research, Medical University of Gdańsk, 80-210 Gdańsk, Poland
2 Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210 Gdańsk, Poland
3 Department of Nuclear Medicine, Medical University of Gdańsk, 80-210 Gdańsk, Poland
* Correspondence: [email protected]
Abstract: The aim of this work is to present new classification maps in health informatics and to show that they are useful in data analysis. A statistical method, correspondence analysis, has been applied to obtain these maps. The approach has been applied to studies on expectations and worries related to the retirement threshold. For this purpose, two questionnaires of our own design were constructed. Groups of individuals and their answers to particular questions are represented by points in the classification maps. The distribution of these points reflects the psychological attitudes of the considered population. In particular, we compared the structures of the maps, searching for factors, such as gender, marital status, kind of work, economic situation, and intellectual activity related to attendance at the University of the Third Age, that are essential at the retirement threshold. Generally, in Polish society, retirement is evaluated as a positive experience, and the majority of retirees do not want to return to their professional work. This result is independent of the kind of work and of gender.
1. Introduction
Classification studies are a valuable source of information in various areas of science. The problem
of classification is related to the problem of similarity of objects. Objects arranged in simple,
one-dimensional sets may be classified in a unique way according to one, properly chosen, aspect of
similarity. The problem becomes more complicated if we consider multidimensional sets, i.e., objects
characterized by several different aspects. The degree of similarity depends on the selected aspects, on
the number of aspects considered and on the mathematical measure establishing the relations between
different properties.
One class of objects we have considered is biological sequences. Both graphical and numerical classification of these objects is possible using methods based on Graphical Representations [1,2]. Within these methods, one can create a large number of different types of numerical characteristics (descriptors) of the plots representing the sequences. One kind of descriptor we propose is the distribution moments of different statistical distributions describing the DNA sequences. We have shown that, using these descriptors, a pair of sequences that differ by only one base can be distinguished: the coordinates of the descriptors representing these sequences differ in the classification maps [1]. We have also introduced the distribution moments as new descriptors of another class of objects, molecular spectra [3,4]. The applications of the theory of molecular similarity are broad. Beyond the studies of the properties of the systems explicitly considered in our works, the new descriptors may have a broad range of interdisciplinary applications. For example, they may be applied in computational pharmacology and toxicology [5]. Our new descriptors have also found application in the classification of solutions of chaotic systems [6] and in the classification of stellar spectra [7,8]. Another kind of descriptor we propose comprises quantities used in classical dynamics, such as the coordinates of the center of mass or the moments of inertia. Examples of classification studies using these descriptors may be found in the theory of molecular similarity [9] and in bioinformatics [10].
The class of objects considered in this work comprises groups of individuals. The studies are focused on the retirement threshold. A graphical representation of the results known as Correspondence Analysis (CA) proves to be very useful in this kind of study [11]. Recently, CA has been applied to a variety of problems, for example, high school dropouts [12], archeology [13], and food science [14]. In CA, the information about the whole system is stored in maps in which the objects under consideration are represented by points located in a specific way. The classification of the objects is performed by studying the distances between the points and, in particular, by identifying clusters of points. Objects corresponding to points that form a cluster are similar in some way.
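As a rough sketch of how such a map can be computed (a minimal illustration with invented counts, not the study's data or the authors' implementation), CA can be carried out as a singular value decomposition of the standardized residuals of a contingency table of groups versus answers:

```python
import numpy as np

# Invented contingency table: rows = groups (employees, retirees1, retirees2),
# columns = answers A1..A5. These counts are placeholders only.
N = np.array([[ 5, 12, 20, 30,  8],
              [10, 25, 18, 12,  5],
              [ 8, 22, 15, 10, 15]], dtype=float)

P = N / N.sum()                                      # correspondence matrix
r = P.sum(axis=1)                                    # row masses
c = P.sum(axis=0)                                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates place groups (rows) and answers (columns) on the same map;
# clusters and nearly opposite directions are then read off as described in the text.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
inertia_share = sv**2 / (sv**2).sum()                # e.g. the "Dimension 1 (82%)" axis labels
print(np.round(row_coords[:, :2], 3))
print(np.round(col_coords[:, :2], 3))
print(np.round(inertia_share, 3))
```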
Progress in medicine and lower fertility rates have caused significant changes in the structure of modern societies. The number of seniors in developed countries is growing dynamically. In Poland in 2010, the percentage of people aged 65 and over was 19%. According to Eurostat forecasts, the ratio of elderly people to the population aged 15–64 will be 36% in 2030 and 56% in 2050 [15].
Due to the acceleration of the aging process, it seems reasonable to study the quality of life of older people. The retirement threshold plays an important role in shaping the quality of life of seniors; it is described in the literature as a symbolic moment, the start of a new chapter in life. It involves many negative changes, such as loss of professional status, deterioration of the economic situation, and a reduction in the number of social interactions. On the other hand, pensioners have much free time for family life and hobbies. Therefore, despite losing one of the most important roles in life, one can set new goals and develop non-professional passions. The change of social role from employee to retiree is a natural process, but such a big change in life may lead to negative psychological effects [16–19]. Changes in different aspects of life at the retirement threshold have been studied in many countries. For example, changes in sleep duration were studied in the United States [20] and Finland [21]; changes in physical activity were studied in Canada [22], Belgium [23], and Finland [24–26]; and changes in the body mass index were studied in the United States [27] and Finland [28]. A variety of changes in the quality of life have been observed in different domains, for example, in subjective wellbeing [29], in the use of time, activity patterns, health, and wellbeing [30], in health-related quality of life [31], in the enjoyment of everyday activities [32], and in mobility [33]. The observed changes at the retirement threshold are not uniform; different factors, e.g., sex, social background, and education level, may determine whether they are positive or negative. Education is one of the most important factors determining a worry-free retirement [34]. Recently, Universities of the Third Age (U3A) have become popular in many countries, and their positive influence has been broadly discussed [35–39]. The International Association of Universities of the Third Age (AU3A) is a global organization; its network includes institutions from Asia, both Americas, Europe, and Australia. U3A attendance is growing rapidly on a global scale. In China alone, the number of universities for senior citizens grew from 19,000 in 2002 to 70,000 in 2017, and the corresponding number of U3A students grew from 1.8 million to 8 million.
In the present work, we study the influence of factors such as gender, kind of work, marital status, intellectual activity related to attendance at the University of the Third Age, and economic situation on the expectations and worries related to the retirement threshold, from the Polish perspective. We have already published some pilot studies on changes in the quality of life related to the retirement threshold, using the World Health Organization Quality of Life-BREF (WHOQOL-BREF) questionnaire and this graphical representation of the results [40–45]. The WHOQOL-BREF questionnaire is a standard tool in quality of life research, and many versions of this questionnaire have been created in different countries, for example, the Polish version [46], the Bangla version [47], the Spanish version [48], and the Finnish version [49]. This questionnaire is composed of 26 questions. Two questions
are related to Overall Quality of Life and General Health. The remaining 24 questions concern four domains: Physical Health, Psychological, Social Relationships, and Environment. Using this questionnaire, we have shown that CA classification maps are a convenient tool for studying the role of different factors in changing the quality of life after the retirement threshold, such as gender [42] and marital status [43] in the four domains, and job position in the Physical Health and Psychological domains [44] and in the Social Relationships and Environment domains [45]. In most cases, these factors play an important role. The considered factors are particularly important in the Psychological and Social Relationships domains. The influence of different factors, such as age, education, marital status, and job position, on the Overall Quality of Life and General Health has also been studied by us using this graphical approach and the WHOQOL-BREF questionnaire [40,41].
• employees—A4,
• retirees2—A2.
The lengths of all of the vectors CP-employees, CP-A4, CP-retirees2, and CP-A2 are large; thus, the two associations are strong.
[Figure 1: correspondence analysis maps, four panels (males, females, married, others), with points for the groups employees, retirees1, and retirees2 and the answers A1–A5; Dimension 1 accounts for 82%, 87%, 83.1%, and 58.5% of the inertia in the four panels.]
For males (top left panel), the angles are nearly 180 degrees between the vectors CP-retirees2 and CP-A4 and between CP-retirees1 and CP-A5. Therefore, in this case, we have negative associations:
• retirees2—A4,
• retirees1—A5.
If the spread between the number of answers to different questions is large then the least common
answers usually correspond to the negative associations. The least common answer about emptiness
after retirement given by retirees2 is A4 “I was not afraid”. The least common answer about emptiness
after retirement given by retirees1 is A5 “I was not afraid at all”. Depending on the lengths of the
corresponding vectors, the strengths of the association may vary. In this case, the negative association
for males is weak since the lengths of these vectors are smaller than for females. The point representing
employees (males) is located close to the central point, so there are no strong associations of this group
with any of the answers, while the most frequent answer for the employees (females) is A4 “I am not
afraid” (cluster employees—A4).
[Figure 2: correspondence analysis maps, four panels (males, females, married, others), with points for employees, retirees1, retirees2 and the answers A1–A5; Dimension 1 accounts for 84.9%, 76.9%, 67.9%, and 78.1% of the inertia in the four panels.]
Transition to retirement is a key moment when an individual must redefine his or her social roles, which is not always successful. Women more often define their social role as a family member: housewife and mother. After the end of their professional activity, they find themselves in the role of grandmothers, participating in the family life of their children [51]. According to opinion polls, in Polish society men are less involved in family life and household duties, so after they end their professional life they may be left without activities, and many of them may feel emptiness [52].
[Figure 3: correspondence analysis maps, four panels (males, females, married, others), with points for employees, retirees1, retirees2 and the answers A1–A5; Dimension 1 accounts for 74.2%, 93.2%, 95.7%, and 67% of the inertia in the four panels.]
If we consider the marital status, we observe negative associations for married (Figure 1, bottom
left panel):
• employees—A1,
• employees—A2,
• retirees2—A4,
• retirees1—A5.
The angles between the vectors CP-employees and CP-A1, between CP-employees and CP-A2,
between CP-retirees2 and CP-A4, and between CP-retirees1 and CP-A5 are close to 180 degrees. The least
common answers for employees (married) about emptiness after the retirement are A1 “Yes, I am very
much afraid” and A2 “Yes, I am slightly afraid”. The least common answer to this question of retirees2
(married) is A4 “I was not afraid”. The least common answer to this question of retirees1 (married) is
A5 “I was not afraid at all”. As the lengths of CP-A5 and CP-retirees1 are smaller compared to the other vectors, the association for retirees1 is the weakest.
[Figure 4: correspondence analysis map titled "Return to work", with points for Manual (males), Manual (females), Intellectual (males), Intellectual (females) and the answers A1–A5; Dimension 1 accounts for 76.6% and Dimension 2 for 20.8% of the inertia.]
For others (Figure 1, bottom right panel), the structure of the map is different from that for married. We observe different negative associations:
• employees—A5,
• retirees2—A4,
• retirees1—A2.
Analogously to married, the least common answer to this question of retirees2 (others) is A4 “I was not afraid”. The marital status is not an important factor determining the kind of answer to this question for retirees2. For retirees1 (others) the least common answer is A2, whereas for married it is A5. For employees (others) the least common answer is A5, whereas for employees (married) the least common answers are A1 and A2. For employees and retirees1, the marital status changes the results of the classification.
Similar studies can be performed for other aspects, for example, satisfaction with retirement
(Appendix A, questions No. 15E and 15R). The results are shown in Figure 2: males (Figure 2, top left
panel), females (Figure 2, top right panel), married (Figure 2, bottom left panel), and others (Figure 2,
bottom right panel). The answer A1 corresponds to “Very unhappy”, whereas A5 corresponds to “Very
happy”. We observe different clusters for males and females, so gender is an important factor influencing
the kind of answer to this question. In particular, for males we observe the following clusters:
• retirees1—A2,
• employees—A5.
For females the clusters are:
• retirees1—A4,
• retirees2—A2.
The most frequent answer of females attending the U3A (retirees2) is the same as that of retirees1 (males), i.e., A2. There are also several negative associations for males:
• employees—A4,
• retirees1—A3.
The least common answer for retirees1 (males) is A3. For females the negative associations are:
• retirees2—A5,
• retirees1—A1.
The least common answer for retirees1 (females) is A1. A5 is located close to the central point so
the association for retirees2 (females) is weak.
If we consider the marital status, for this question, we also observe different clusters for married
and for others. The marital status is an important factor influencing the results of the classification in
this case. In particular, for married we observe the clusters (Figure 2, bottom left panel):
• employees—A5,
• retirees1—A4,
• employees—A3,
• retirees1—A5,
• retirees2—A1.
The associations retirees1—A5 and retirees2—A1 are weak for others. The most frequent answer
for employees (married) is A5 “Very happy”, while for employees (others) the least common answer is A5
(negative association).
Figure 3 concerns the question about having an adequate amount of money after retirement (Appendix A, questions No. 19E and 20R). Analogously to Figures 1 and 2, the top panels refer to males and females and the bottom ones to married and others. A1 corresponds to the answer “Not at all”, and A5 to “Quite enough”. For males (top left panel) the clusters are the following:
• employees—A5,
• retirees2—A3,
• retirees1—A1,
• retirees1—A2,
• retirees2—A5.
• employees—A4,
• retirees2—A2,
• employees—A4,
• retirees2—A1,
• retirees1—A5.
Employees and retirees assess the economic situation during retirement differently. There are also some differences between males and females. Considering the marital status, the clusters are also different. In particular, the group employees (married) clusters with A1, whereas employees (others) clusters with A3.
The summary is provided by the question about returning to professional work (Appendix A, question No. 25R). The results are shown in Figure 4. The groups are split according to the kind
of work—Manual labor and Intellectual labor (Appendix A, question No. 4ER)—and according to the
gender. Finally, four subgroups are considered and denoted in the figure as Manual (females), Manual
(males), Intellectual (females), and Intellectual (males). The answer A1 is “Yes, full-time”, and the answer
A5 is “Absolutely not”. We observe the cluster
• Manual (males)—A4.
• Manual (females)—A2,
• Manual (males)—A1,
• Intellectual (males)—A4.
The most frequent answer of Manual (males) is A4 “No”. The associations of Intellectual (females)
with A3 and A2 are weak. None of the groups clusters with A1 “Yes, full-time”. Our analysis shows
that in Poland in 2017, reaching the retirement threshold is rather a positive experience.
In line with social expectations, the retirement age was reduced to 60 for females and 65 for males
in October 2017.
4. Conclusions
Summarizing, we described a non-standard approach to deriving information about objects encountered in the medical sciences, based on an analysis of classification maps. We demonstrate that the graphical representation of the considered data (of the answers to some questions, in the case of groups of individuals) is useful in health informatics, i.e., a lot of information is stored on one map. By searching for so-called clusters of points, we can classify the objects and, in this way, discover new properties of the considered objects. In the case of groups of individuals, we search for factors, such as gender, marital status, kind of work, intellectual activity related to attendance at the University of the Third Age, and economic situation, that determine the psychological attitudes of the considered population. For the creation of the classification maps, a statistical method, Correspondence Analysis, is used. New applications of the method are proposed: studies on expectations and worries related to the retirement threshold. Using standard methods, a considerable part of the information may be lost. In particular, the commonly used Pearson's coefficients measure the strength of the linear correlation between variables. If the correlation is strong but nonlinear, for example, quadratic or exponential, then the standard methods, contrary to CA, may show that there is no correlation or that the correlation is weak. We are also going to apply similar classification maps in other medical informatics areas in forthcoming papers related to studies on the quality of life and to bioinformatics.
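As a small illustration of this last point (not part of the study), a perfect but purely quadratic dependence yields a Pearson coefficient close to zero even though the association is strong:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)
y = x**2                          # deterministic, strongly associated, but nonlinear
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))                # approximately 0: "no linear correlation"
```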
Author Contributions: Conceptualization, A.B., D.B.-W., and P.W.; methodology, A.B., D.B.-W., and P.W.; software, P.W.; formal analysis, D.B.-W.; data curation, A.B.; writing—original draft preparation, D.B.-W. and A.B.; visualization, P.W. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A
3ER. Education:
A1. Elementary school A2. Vocational education A3. High school A4. University education A5. Doctor's degree
13E. Do you feel that you have enough energy to perform your work?
A1. Not at all A2. Slightly enough A3. Moderately enough A4. Nearly enough A5. Quite enough
16E. Do you think that after the retirement you will have enough energy to implement your aims?
A1. Not at all A2. Slightly enough A3. Moderately enough A4. Nearly enough A5. Quite enough
17E. Are you afraid that after the retirement you may not be self-sufficient?
A1. Yes, I am very much afraid A2. Yes, I am slightly afraid A3. I do not think of it A4. I am not afraid A5. I am not afraid at all
18E. Are you afraid that after the retirement you will feel lonely?
A1. Yes, I am very much afraid A2. Yes, I am slightly afraid A3. I do not know yet A4. I am not afraid A5. I am not afraid at all
20E. Do you think that you will be satisfied having a lot of free time during retirement?
A1. Very dissatisfied A2. Dissatisfied A3. Neither satisfied nor dissatisfied A4. Satisfied A5. Very satisfied
21E. Are you afraid of emptiness after the retirement, because you will not be so active as before?
A1. Yes, I am very much afraid A2. Yes, I am slightly afraid A3. I do not think of it A4. I am not afraid A5. I am not afraid at all
22E. Are you afraid that during the next several years your health is going to deteriorate?
A1. Yes, I am very much afraid A2. Yes, I am slightly afraid A3. I do not think of it A4. I am not afraid A5. I am not afraid at all
12R. How did you rate the interpersonal relations in your work?
A1. Very bad A2. Bad A3. Neither bad nor good A4. Good A5. Very good
15R. Were you happy that in several years/months you would retire?
A1. Very unhappy A2. Unhappy A3. Neither happy nor unhappy A4. Happy A5. Very happy
16R. Were you afraid that you would have not enough retirement income to support yourself?
A1. Yes, I was very much afraid A2. Yes, I was slightly afraid A3. I did not think of it A4. I was not afraid A5. I was not afraid at all
17R. Were you satisfied that after the retirement you would have a lot of free time?
A1. Very dissatisfied A2. Dissatisfied A3. Neither satisfied nor dissatisfied A4. Satisfied A5. Very satisfied
18R. Were you afraid of emptiness after the retirement, because you would not be so active as before?
A1. Yes, I was very much afraid A2. Yes, I was slightly afraid A3. I did not think of it A4. I was not afraid A5. I was not afraid at all
21R. Are you afraid that due to your bad health you will not manage with your housework?
A1. Yes, I am very much afraid A2. Yes, I am slightly afraid A3. I do not think of it A4. I am not afraid A5. I am not afraid at all
24R. Are you satisfied with relations with your children and grandchildren (emotional relations,
frequency of visits)?
A1. Very dissatisfied A2. Dissatisfied A3. Neither satisfied nor dissatisfied A4. Satisfied A5. Very satisfied
References
1. Bielińska-Wa̧ż, D. Graphical and numerical representations of DNA sequences: Statistical aspects of
similarity. J. Math. Chem. 2011, 49, 2345–2407. [CrossRef]
2. Randić, M.; Novič, M.; Plavšić, D. Milestones in graphical bioinformatics. Int. J. Quant. Chem. 2013, 113,
2413–2446. [CrossRef]
3. Bielińska-Wa̧ż, D.; Wa̧ż, P.; Basak, S.C. Similarity studies using statistical and genetical methods. J. Math.
Chem. 2007, 42, 1003–1013. [CrossRef]
4. Bielińska-Wa̧ż, D.; Wa̧ż, P. Correlations in Spectral Statistics. J. Math. Chem. 2008, 43, 1287–1300. [CrossRef]
5. Bielińska-Wa̧ż, D.; Wa̧ż, P.; Jagiełło, K.; Puzyn, T. Spectral Density Distribution Moments as Novel
Descriptors for QSAR/QSPR. Struct. Chem. 2014, 25, 29–35. [CrossRef]
6. Wa̧ż, P.; Bielińska-Wa̧ż, D. Asymmetry Coefficients as Indicators of Chaos. Acta Phys. Pol. A 2009, 116,
987–991. [CrossRef]
7. Wa̧ż, P.; Bielińska-Wa̧ż, D.; Pleskacz, A.; Strobel, A. Identification of Stellar Spectra Using Methods of
Statistical Spectroscopy. Acta Phys. Polon. B 2008, 39, 1993–2001.
8. Wa̧ż, P.; Bielińska-Wa̧ż, D.; Strobel, A.; Pleskacz, A. Statistical indicators of astrophysical parameters.
Acta Astronom. 2010, 60, 283–293.
9. Wa̧ż, P.; Bielińska-Wa̧ż, D. Moments of Inertia of Spectra and Distribution Moments as Molecular Descriptors.
MATCH Commun. Math. Comput. Chem. 2013, 70, 851–865.
10. Wa̧ż, P.; Bielińska-Wa̧ż, D.; Nandy, A. Descriptors of 2D-dynamic graphs as a classification tool of DNA
sequences. J. Math. Chem. 2014, 52, 132–140. [CrossRef]
11. Beh, E.J.; Lombardo, R. Correspondence Analysis: Theory, Practice and New Strategies; John Wiley & Sons:
Oxford, UK, 2014.
12. Ziemer, K.S.; Pires, B.; Lancaster, V.; Keller, S.; Orr, M.; Shipp, S. A New Lens on High School Dropout:
Use of Correspondence Analysis and the Statewide Longitudinal Data System. Am. Stat. 2018, 72, 191–198.
[CrossRef]
13. Beh, E.J.; Lombardo, R.; Alberti, G. Correspondence analysis and the Freeman-Tukey statistic: A study of
archaeological data. Comput. Stat. Data Anal. 2018, 128, 73–86. [CrossRef]
14. Frost, S.C.; Blackman, J.W.; Ebeler, S.E.; Heymann, H. Analysis of temporal dominance of sensation data
using correspondence analysis on Merlot wine with differing maceration and cap management regimes.
Food Qual. Prefer. 2018, 64, 245–252. [CrossRef]
15. Parnowski, T. Medyczne i psychologiczne problemy wieku podeszłego. Probl. Rodz. 1996, 5, 41–50. (In Polish)
16. Skarborn, M.; Nicki, R. Worry in pre- and post-retirement persons. Int. J. Aging Hum. Dev. 2000, 50, 61–71. [CrossRef]
17. Hershey, D.A.; Henkens, K.; van Dalen, H.P. What drives retirement income worries in Europe? A multilevel
analysis. Eur. J. Ageing 2010, 7, 301–311. [CrossRef]
18. Gutierrez, H.C.; Hershey, D.A. Impact of Retirement Worry on Information Processing. J. Neurosci.
Psychol. Econ. 2013, 6, 264–277. [CrossRef]
19. Kail, B. The Mental and Physical Health Consequences of Changes in Private Insurance Before and After
Early Retirement. J. Gerontol. B Psychol. Sci. Soc. Sci. 2016, 71, 358–368. [CrossRef]
20. Hagen, E.W.; Barnet, J.H.; Hale, L.; Peppard, P.E. Changes in Sleep Duration and Sleep Timing Associated
with Retirement Transitions. Sleep 2016, 39, 665–673. [CrossRef]
21. Myllyntausta, S.; Salo, P.; Kronholm, E.; Aalto, V.; Kivimaki, M.; Vahtera, J.; Stenholm, S. Changes in Sleep
Duration During Transition to Statutory Retirement: A Longitudinal Cohort Study. Sleep 2017, 40. [CrossRef]
22. Regan, K.; Intzandt, B.; Swatridge, K.; Myers, A.; Roy, E.; Middleton, L.E. Changes in Physical Activity and
Function with Transition to Retirement Living: A Pilot Study. Can. J. Aging 2016, 35, 526–532. [CrossRef]
[PubMed]
23. Van Dyck, D.; Cardon, G.; De Bourdeaudhuij, I. Longitudinal changes in physical activity and sedentary time
in adults around retirement age: What is the moderating role of retirement status, gender and educational
level? BMC Public Health 2016, 16, 1125. [CrossRef] [PubMed]
24. Manty, M.; Kouvonen, A.; Lallukka, T.; Lahti, J.; Lahelma, E.; Rahkonen, O. Pre-retirement physical working
conditions and changes in physical health functioning during retirement transition process. Scand. J. Work
Environ. Health 2016, 42, 405–412. [CrossRef] [PubMed]
25. Stenholm, S.; Pulakka, A.; Kawachi, I.; Oksanen, T.; Halonen, J.I.; Aalto, V.; Kivimaki, M.; Vahtera, J. Changes
in physical activity during transition to retirement: A cohort study. Int. J. Behav. Nutr. Phys. Act. 2016, 13, 51.
[CrossRef]
26. Holstila, A.; Manty, M.; Rahkonen, O.; Lahelma, E.; Lahti, J. Statutory retirement and changes in self-reported
leisure-time physical activity: A follow-up study with three time-points. BMC Public Health 2017, 17, 528.
[CrossRef]
27. Gueorguieva, R.; Sindelar, J.L.; Wu, R.; Gallo, W.T. Differential changes in body mass index after retirement
by occupation: Hierarchical models. Int. J. Public Health 2011, 56, 111–116. [CrossRef]
28. Stenholm, S.; Solovieva, S.; Viikari-Juntura, E.; Aalto, V.; Kivimaki, M.; Vahtera, J. Change in body mass
index during transition to statutory retirement: An occupational cohort study. Int. J. Behav. Nutr. Phys. Act.
2017, 14, 85. [CrossRef]
29. Barrett, G.F.; Kecmanovic, M. Changes in subjective well-being with retirement: Assessing savings adequacy.
Appl. Econ. 2013, 45, 4883–4893. [CrossRef]
30. Maher, C.A.; Burton, N.W.; van Uffelen, J.G.Z.; Brown, W.J.; Sprod, J.A.; Olds, T.S. Changes in use of time,
activity patterns, and health and wellbeing across retirement: Design and methods of the life after work
study. BMC Public Health 2013, 13, 952. [CrossRef]
31. Vercambre, M.N.; Okereke, O.I.; Kawachi, I.; Grodstein, F.; Kang, J.H. Self-Reported Change in Quality
of Life with Retirement and Later Cognitive Decline: Prospective Data from the Nurses’ Health Study.
J. Alzheimers Dis. 2016, 52, 887–897. [CrossRef]
32. Olds, T.S.; Sprod, J.; Ferrar, K.; Burton, N.; Brown, W.; van Uffelen, J.; Maher, C. Everybody’s working for the
weekend: Changes in enjoyment of everyday activities across the retirement threshold. Age Ageing 2016, 45,
850–855. [CrossRef]
33. Berg, J. Mobility changes during the first years of retirement. Qual. Ageing Older Adults 2016, 17, 131–140.
[CrossRef]
34. Cho, H.; Suh, W.; Lee, J.; Jang, Y.; Kim, M. A worry-free retirement in Korea: Effectiveness of retirement
coaching education. Educ. Gerontol. 2016, 42, 785–794. [CrossRef]
35. Jun, S.K.; Evans, K. The learning cultures of Third Age participants: Institutional management and
participants’ experience in U3A in the UK and SU in Korea. KEDI J. Educ. Policy 2007, 4, 53–72.
36. Hebestreit, L. The role of the University of the Third Age in meeting needs of adult learners in Victoria,
Australia. Aust. J. Adult Learn. 2008, 48, 547–565.
37. Sonati, J.G.; Modeneze, D.M.; Vilarta, R.; Maciel É.S.; Boccaletto, E.M.A.; da Silva, C.C. Body composition
and quality of life (QoL) of the elderly offered by the “University Third Age” (UTA) in Brazil. Arch. Gerontol.
Geriatr. 2011, 52, e31–e35. [CrossRef]
38. Mackowicz, J.; Wnek-Gozdek, J. “It’s never too late to learn”—How does the Polish U3A change the quality
of life for seniors? Educ. Gerontol. 2016, 42, 186–197. [CrossRef]
39. Nascimento, M.D.; Giannouli, E. Active aging through the University of the Third Age: The Brazilian model.
Educ. Gerontol. 2019, 45, 11–21. [CrossRef]
40. Bielińska, A.; Majkowicz, M.; Wa̧ż, P.; Bielińska-Wa̧ż, D. Overall Quality of Life and General
Health—Changes Related to the Retirement Threshold. In Proceedings of the eTELEMED 2018, The Tenth
International Conference on eHealth, Telemedicine, and Social Medicine, Rome, Italy, 25–29 March 2018;
XPS IARIA Press: Copenhagen, Denmark, 2018; pp. 1–5.
41. Bielińska, A.; Majkowicz, M.; Bielińska-Wa̧ż, D.; Wa̧ż, P. Influence of the Education Level on Health of
Elderly People. In Proceedings of the eTELEMED 2018, The Tenth International Conference on eHealth,
Telemedicine, and Social Medicine, Rome, Italy, 25–29 March 2018; XPS IARIA Press: Copenhagen, Denmark,
2018; pp. 6–11.
42. Bielińska, A.; Majkowicz, M.; Bielińska-Wa̧ż, D.; Wa̧ż, P. Classification Studies in Various Areas of Science. In
Numerical Methods and Applications; NMA 2018, Lecture Notes in Computer Science; Nikolov, G., Kolkovska,
N., Georgiev, K.; Eds.; Springer: Cham, Switzerland, 2019; Volume 11189, pp. 326–333.
43. Bielińska, A.; Majkowicz, M.; Wa̧ż, P.; Bielińska-Wa̧ż, D. Mathematical Modeling: Interdisciplinary Similarity Studies. In Numerical Methods and Applications; NMA 2018, Lecture Notes in Computer Science; Nikolov, G., Kolkovska, N., Georgiev, K., Eds.; Springer: Cham, Switzerland, 2019; Volume 11189, pp. 334–341.
44. Bielińska, A.; Majkowicz, M.; Bielińska-Wa̧ż, D.; Wa̧ż, P. A New Method in Bioinformatics—Interdisciplinary
Similarity Studies. AIP Conf. Proc. 2019, 2116, 450013.
45. Bielińska, A.; Majkowicz, M.; Wa̧ż, P.; Bielińska-Wa̧ż, D. A New Computational Method: Interdisciplinary
Classification Analysis. AIP Conf. Proc. 2019, 2116, 450014.
46. Jaracz, K.; Kalfoss, M.; Ba̧czyk, G. Quality of life in Polish respondents: Psychometric properties of the Polish
WHOQOL—Bref. Scand. J. Caring Sci. 2006, 20, 251–260. [CrossRef] [PubMed]
47. Tsutsumi, A.; Izutsu, T.; Kato, S.; Islam, A.; Yamada, H.S.; Kato, H.; Wakai, S. Reliability and validity of the
Bangla version of WHOQOL-BREF in an adult population in Dhaka, Bangladesh. Psychiatry Clin. Neurosci.
2006, 60, 493–498. [CrossRef] [PubMed]
48. Benitez-Borrego, S.; Guardia-Olmos, J.; Urzua-Morales, A. Factorial structural analysis of the Spanish version
of WHOQOL-BREF: An exploratory structural equation model study. Qual. Life Res. 2014, 23, 2205–2212.
[CrossRef] [PubMed]
49. Siljander, E.; Luoma, M.L.; Merilainen-Porras, S. Validity and reliability of Finnish version of WHOQOL-Bref
on adult population in Finland. IJHD 2015, 2, 52–68. [CrossRef]
50. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing:
Vienna, Austria, 2018.
51. Kim, J.E.; Moen, P. Is Retirement Good or Bad for Subjective Well-Being? Curr. Dir. Psychol. Sci. 2001, 10,
83–86. [CrossRef]
52. Bożewicz, M. Kobiety i Mȩżczyźni w Domu. CBOS, Komunikat z Badań 2018, 127. Available online: https:
//www.cbos.pl/SPISKOM.POL/2018/K_127_18.PDF (accessed on 1 December 2019). (In Polish)
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Castration-Resistant Prostate Cancer Outcome
Prediction Using Phased Long Short-Term
Memory with Irregularly Sampled Serial Data
Jihwan Park 1,2 , Mi Jung Rho 2 , Hyong Woo Moon 3 and Ji Youl Lee 3, *
1 Department of Biomedicine & Health Sciences, College of Medicine, The Catholic University of Korea,
Seoul 06591, Korea; [email protected]
2 Catholic Cancer Research Institute, College of Medicine, The Catholic University of Korea,
Seoul 06591, Korea; [email protected]
3 Department of Urology, Seoul St. Mary’s Hospital, College of Medicine, the Catholic University of Korea,
Seoul 06591, Korea; [email protected]
* Correspondence: [email protected]
Keywords: prostate cancer; castration-resistant prostate cancer; deep learning; phased long
short-term memory
1. Introduction
Prostate cancer (PCa) is one of the most common male cancers. There are more than 3.3 million
men suffering from PCa in the USA [1], and its incidence is increasing in Korea [2]. Recently, a national
project in Korea has been developing artificial intelligence medical software for several diseases
including PCa [3].
Androgen deprivation therapy (ADT) is commonly used to treat PCa patients. However, this therapy fails in patients who are classified as having castration-resistant prostate cancer (CRPC). According to the National Cancer Institute (NCI) of the National Institutes of Health (NIH), CRPC is a “prostate cancer that keeps growing even when the amount of testosterone in the body is reduced to very low levels. Many early-stage prostate cancers need normal levels of testosterone to grow, but castration-resistant prostate cancers do not.” It is important to administer appropriate treatments at the right time: when a patient receives appropriate treatment, the time and pain associated with treatments such as chemotherapy can be reduced. However, making a treatment decision from cumbersome medical data with many irregularly sampled features is a complicated problem. To overcome this situation, a CRPC prediction model can support physicians’ decision-making and help to identify appropriate treatments for PCa patients.
At present, PCa outcomes are usually predicted using nomograms [4–7]. Recently, researchers have begun to explore the use of deep learning methods, which show some promise [8–11], but the majority of these techniques rely on image segmentation and classification [12,13]. Data from longitudinal electronic health records (EHRs) that are irregularly sampled, such as PCa data, do not fit well with existing deep learning prediction models. Therefore, in this study, we adopt a phased long short-term memory (phased-LSTM) method with the aim of developing a deep learning model that can predict the outcome of PCa patients from asynchronously sampled data for various variables [14]. In particular, we develop a model that can identify CRPC patients.
2.1. Materials
Data from 1592 patients who had undergone prostatectomy were used to predict the use of CRPC
medication. The data consisted of clinical stage information, laboratory data, and treatment events.
In total, 8 patients received CRPC medication and 1584 patients did not. We used the receipt of CRPC medication as the binary classification condition. These medications included docetaxel, cabazitaxel, abiraterone, enzalutamide, and radium-223 (see Table 1).
Category    Non-Castration-Resistant Prostate Cancer (CRPC) Medication    CRPC Medication    p-Value (chi-squared test)
Number of cases (n = 1592 cases) 1584 8
Age at diagnosis (%) 0.253
<40 1 (0.1) 0 (0.0)
40–44 3 (0.2) 0 (0.0)
45–49 16 (1.0) 0 (0.0)
50–54 54 (3.4) 0 (0.0)
55–59 165 (10.4) 2 (25.0)
60–64 303 (19.1) 1 (12.5)
65–69 461 (29.1) 0 (0.0)
70–74 429 (27.1) 2 (25.0)
75–80 146 (9.2) 3 (37.5)
80–84 6 (0.4) 0 (0.0)
Mean 63.64 66.5
T-stage (%) 0.008
T1 1 (0.1) 0 (0.0)
T1a 7 (0.4) 0 (0.0)
T1c 3 (0.2) 0 (0.0)
T2 100 (6.3) 0 (0.0)
T2a 141 (8.9) 0 (0.0)
T2b 56 (3.5) 0 (0.0)
T2c 746 (47.1) 0 (0.0)
T3 4 (0.3) 0 (0.0)
T3a 312 (19.7) 3 (37.5)
T3b 195 (12.3) 4 (50.0)
T3c 1 (0.1) 0 (0.0)
T4 14 (0.9) 1 (12.5)
Tx 4 (0.3) 0 (0.0)
N-stage (%) 0.305
N0 379 (23.9) 3 (37.5)
N1 68 (4.3) 1 (12.5)
Nx 1137 (71.8) 4 (50.0)
M-stage (%) 0.422
M0 304 (19.2) 3 (37.5)
M1 3 (0.2) 0 (0.0)
Mx 1277 (80.6) 5 (62.5)
CRPC patients are treated with specific treatments including chemotherapy, such as docetaxel [15],
cabazitaxel [16], abiraterone [17], enzalutamide [18], and radium-223 [19]. These CRPC treatments are
delivered in many different combinations alongside ADT medications such as leuprorelin, bicalutamide,
goserelin, and degarelix.
We preprocessed the serial data into three-dimensional (3D) vectors to use as the input for the deep learning model. Seven features were used: two laboratory values, prostate-specific antigen (PSA) and PSA doubling time (PSA DT); the use of radiation therapy (RT); and four ADT medications, leuprorelin, bicalutamide, goserelin, and degarelix. The timesteps were the days on which the features were observed. However, not all of the features are observed at every interval, and each observation has irregularly sampled follow-up data (see Figure 1). Because PSA is the laboratory value examined most consistently throughout the patients' follow-up periods, PSA is the most important factor; PSA DT characterizes the PSA kinetics.
Figure 1. Irregularly collected event observations (top) and the 3D-format input for the serial data model (bottom).
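The following is a minimal sketch of how such a 3D array could be assembled from a long-format event table; the column names, the zero-fill convention for unobserved days, and the example records are assumptions made for illustration, not details of the authors' preprocessing pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format event table: one row per observed value per patient-day.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "day":        [0, 35, 97, 0, 60],                 # irregular observation days
    "feature":    ["PSA", "PSA", "leuprorelin", "PSA", "RT"],
    "value":      [4.1, 0.2, 1.0, 6.3, 1.0],
})

n_days = 120
features = ["PSA", "PSA_DT", "RT", "leuprorelin", "bicalutamide", "goserelin", "degarelix"]
patients = sorted(events["patient_id"].unique())

# (cases, timesteps, features); days without an observation stay at zero here.
X = np.zeros((len(patients), n_days + 1, len(features)))
for i, pid in enumerate(patients):
    for _, row in events[events["patient_id"] == pid].iterrows():
        if row["day"] <= n_days and row["feature"] in features:
            X[i, int(row["day"]), features.index(row["feature"])] = row["value"]

print(X.shape)   # (2, 121, 7): 121 timesteps for the 120-day set described later
```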
2.2. Ethics
The study procedures were carried out in accordance with the Declaration of Helsinki and were approved by the Institutional Review Board of the Catholic University (IRB number: KC18SNDI0512). Identifying information was removed from the electronic health record (EHR) data, which was then formatted. Unless the locally managed mapping information was exposed, specific patients could not be identified.
2.3. Methods
We used a 3D fixed-time format to input the serial data into the model. For serial data modeling, a recurrent neural network (RNN) manages the data using a circular network structure. Simple RNNs have a vanishing gradient problem: they cannot generate proper gradient values through repeated activation functions. The gradient value is the quantity that directs the RNN’s learning; the activation function calculates the gradient value at each iteration and, after many circular network executions, produces a value close to 0. In contrast, long short-term memory (LSTM) uses a gating method to adjust the signal, which eliminates the vanishing gradient problem (see Figure 2). The input vector adjusts the value based on the tanh activation function. The forget vector carries part of the previous input value over to the current input. The output vector adjusts the output values using the previous input and the current adjusted input. After all of these gating operations, the final output is used as the next input for the tanh and sigmoid layers.
Typically, the clinical event observations were not collected regularly. This irregular sampling
meant that the intervals between consecutive events were not evenly spaced. LSTM cannot learn
effectively from such data.
Phased-LSTM selectively updates the status to account for irregular sampling times through an
additional time gate [14]. At each time step, the time gate switches on and off to update the state and
the hidden output state. These updates only happen for the open phase. Otherwise, the previous
values will be used. The time gate addresses the LSTM problems caused by irregularly sampled data
(see Figure 3).
Figure 2. Standard long short-term memory (LSTM) using three gates (forget, input, and output) to
eliminate the vanishing gradient problem.
Figure 3. Phased LSTM using rectified linear units as the time-gate kt that propagates the gradient well.
There are two patterns of serial medical data: in the first, more than two features are sampled
simultaneously and, in the second, some of the features are missing while others are observed. To manage
both patterns, missing-value imputation must be applied. The phased LSTM's time gate addresses these
patterns based on three phases (see Figure 4).
The time gate kt of the phased LSTM has three phases: phase 1, an "openness rise" from 0 to 1;
phase 2, an "openness drop" from 1 to 0; and phase 3, a "closed" phase with only a small leak at rate α [14].
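As a rough sketch of this three-phase behavior, the time-gate openness k_t from [14] can be written as follows; the period tau, phase shift s, open ratio r_on, and leak rate alpha are illustrative parameter names, not values used in this study.

```python
def phased_lstm_time_gate(t, tau, s, r_on, alpha):
    """Openness k_t of the phased-LSTM time gate at time t (after Neil et al. [14]).

    Phase 1: openness rises from 0 to 1; phase 2: openness drops from 1 to 0;
    phase 3: the gate is closed except for a small leak at rate alpha.
    """
    phi = ((t - s) % tau) / tau          # position within the current cycle, in [0, 1)
    if phi < 0.5 * r_on:                 # phase 1: openness rise
        return 2.0 * phi / r_on
    elif phi < r_on:                     # phase 2: openness drop
        return 2.0 - 2.0 * phi / r_on
    return alpha * phi                   # phase 3: closed, with leak
```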
3. Results
The majority of the age distribution was between 55 and 80 years. The mean age was 63.64 years in
the non-CRPC medication group and 66.5 years in the CRPC medication group. Most of the T-stage data
were between T2 and T3. Most of the N-stage and M-stage data were stage 0 or X, indicating the absence
of metastasis. Although the data were highly imbalanced, the two groups showed distributions of
similar shape.
Figure 5 shows the prior data shape, oversampling result, and scaling result for 120 and 360 days
of serial data. We used SMOTE to generate CRPC-treated data similar to the prior CRPC cases. Lastly,
we obtained balanced data (1575:1576, CRPC: non-CRPC treated) compared to the prior data that was
highly imbalanced (8:1584, CRPC: non-CRPC treated).
Figure 5. Shape of the input data for prostate specific antigens: (A) prior data, (B) oversampling result
from prior data, and (C) scaling result from oversampling data.
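A minimal sketch of this oversampling step, assuming the imbalanced-learn implementation of SMOTE [21]; flattening the serial data before resampling and min–max scaling afterwards are our assumptions, since SMOTE operates on 2D feature matrices.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler

def oversample_and_scale(x, y):
    """Balance the CRPC-treated minority class with SMOTE and scale the features.

    x: (cases, timesteps, features) serial data; y: binary labels (1 = CRPC medication).
    """
    n, t, f = x.shape
    x_flat = x.reshape(n, t * f)                       # SMOTE expects a 2D matrix
    x_res, y_res = SMOTE(random_state=0).fit_resample(x_flat, y)
    x_res = MinMaxScaler().fit_transform(x_res)        # scale to [0, 1]
    return x_res.reshape(-1, t, f), y_res
```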
We used two sets of serial data: 120 days and 360 days. Each set had one additional timestep;
for instance, in the 120-day case, there were 121 (120 + 1) timesteps including the start time point.
The 120-day set contained 3152 cases, 121 timesteps, and seven features. The 360-day set contained
3152 cases, 361 timesteps, and seven features. The model had 70,016 trainable parameters.
Both sets were trained for four epochs (blue lines), i.e., four training iterations. To prevent
overfitting, we manually chose four epochs after testing up to 10 epochs (see Table 2). Although the
training time was short, the 120-day set reached a precision–recall score of 91.6% and the 360-day
set reached 96.9%. The validation loss (red lines) of the 120-day set converged slowly, while that of
the 360-day set converged quickly (see Figure 6). Ultimately, the Receiver Operating Characteristic/Area
Under the Curve (ROC-AUC) accuracy of the 120-day model was 88.6% and that of the 360-day model
was 94.8% (see Figure 7).
Because the loss of the model continued to decrease and the accuracy continued to increase after
four epochs, we performed 10-fold cross-validation to verify that training for 10 epochs did not
overfit (see Table 3). The average accuracy of the model evaluations was 97.81% for the 120-day data
and 97.68% for the 360-day data.
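For orientation, a hedged sketch of such a training and cross-validation loop in TensorFlow/Keras [27], using a plain LSTM layer as a stand-in because phased LSTM is not part of core Keras; the layer size, batch size, and other settings are illustrative, and only the four training epochs and the 10 folds follow the description above.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def build_model(timesteps, n_features=7):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, n_features)),
        tf.keras.layers.LSTM(64),                          # stand-in for the phased LSTM
        tf.keras.layers.Dense(1, activation="sigmoid"),    # binary CRPC-medication output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def cross_validate(x, y, n_splits=10, epochs=4):
    """10-fold cross-validation returning the per-fold ROC-AUC scores."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits, shuffle=True).split(x, y):
        model = build_model(x.shape[1])
        model.fit(x[train_idx], y[train_idx], epochs=epochs, batch_size=32, verbose=0)
        y_prob = model.predict(x[test_idx], verbose=0).ravel()
        scores.append(roc_auc_score(y[test_idx], y_prob))
    return np.array(scores)
```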
Figure 7. Accuracy Result (ROCAUC): (A) 120-day model’s accuracy is 88.6%, (B) 360-day model’s
accuracy is 94.8%.
4. Discussion
In this study, we built a model to predict CRPC-treated patients. From this, we were able to draw
the following conclusions.
Serial medical data is an important factor in predicting the likely outcome of patients with PCa.
For example, predicting the likelihood of CRPC helps medical professionals to make appropriate
decisions at the right time. Some previous studies have predicted time or variables related to CRPC.
For example, Elishmereni et al. used a mathematical algorithm to predict the time until biochemical
failure of ADT using data from 83 hormone sensitive prostate cancer patients [30]. Although this study
only used data from 83 patients, it had a prediction accuracy of 90%. In this study, we used data from
1592 patients to predict CRPC, which is a large number of patients for building a prediction model.
Another study by Humphreys et al. tried to find a relationship between age at diagnosis and CRPC.
They found that those aged between 55 and 75 years have the shortest survival time [31]. We did not
use age as an input variable since there was no significant difference (chi-squared test p-value of 0.253)
between the groups with and without CRPC treatment (see Table 1). In another study, Angulo et al.
used DNA hypermethylation profiles to predict CRPC. They found 61 genes related to CRPC [32].
However, DNA sequencing is not commonly used for PCa patients and it incurs extra expenses for the
patient. We used typical follow-up data from PCa patients. A model based on commonly collected
data is more applicable in real outpatient scenarios.
We built a CRPC prediction model, which can support physicians’ decision making and allows
them to review more patients in a short period of time. When medical staff do not have enough time to
track the status of each patient in need of significant interventions, human errors are more likely to
occur and opportunities for treatment may be missed until the condition becomes serious. If prediction
support is available, such as the likelihood of a PCa patient developing CRPC, patients in need of
treatment can be identified quickly. Therefore, patients will receive the right treatment at the right time.
We used a state-of-the-art deep learning method to handle serial medical data. It is difficult to
predict a patient's outcome from serial data that are collected irregularly, including medications,
treatments, and laboratory test results such as blood tests. Typical deep learning methods
such as RNN and LSTM can be used to analyze serial data. However, they must be improved in order
to handle irregularly-sampled serial datasets. To overcome this, we adopted recently developed deep
learning technology. The phased-LSTM [14] deep learning method builds a fast and accurate prediction
model using serial medical data. The predictions show high precision recall scores, which means that
they are highly accurate.
The results of this study can be extended to other areas such as emergency room or intensive care
units, which receive a significant amount of serial data from patients. In addition, follow-up data
from many patient care organizations can be used with this method to make accurate predictions that
support clinicians’ decision-making.
Although we produced meaningful results, this study still has some limitations. We used data
from 1592 patients from a single hospital, because we were not permitted to use data from other
hospitals. Nevertheless, 10 years of follow-up data from 1592 patients is sufficient to build a prediction
model. Because there were many missing values for other features, such as family history, genotype,
and lifestyle, we built the model without those features.
In addition, the PCa treatment data were imbalanced, with only a small number of CRPC-treated
patients. Extremely imbalanced data are not sufficient to verify a model; for instance, when the validation
data contain only a few CRPC patients out of hundreds of PCa patients, it does not make sense to verify
the model's accuracy. Chawla et al. proposed SMOTE and presented results on imbalanced data sets,
one of which originally had 10,923 examples in the majority class and 260 examples in the minority
class [21]. We used the SMOTE method to build oversampled data and obtained a well-balanced dataset
with a shape similar to that of the original data.
In future studies, we will obtain more CRPC-treated cases in order to improve the model. However,
our study is an example of how irregularly sampled serial medical data can be used to build a decision
support prediction model.
Ultimately, this study shows how the state-of-the-art phased-LSTM deep learning method can be
used to make meaningful prediction models based on serial medical data. This can help clinicians who
want to know their patients’ likely outcomes from irregularly sampled follow up data.
Author Contributions: J.P. wrote the article and helped to develop the CRPC prediction model. M.J.R. supported
in writing the article and in developing the CRPC prediction model. H.W.M. gave medical advice for the
development of the CRPC prediction model. J.Y.L. supervised the research and tested the CRPC prediction model.
All four authors substantially contributed their expertise. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the Institute for Information & Communications Technology Promotion (IITP)
through a grant funded by the Korea government (MSIT) (2018-2-00861, Intelligent SW Technology Development
for Medical Data Analysis). The Smart Healthcare Research Grant through the Daewoong Foundation
(DF-201907-0000001) also supported this research.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Siegel, R.; Miller, K.; Jemal, A. Cancer statistics, 2018. CA Cancer J. Clin. 2018, 68, 7–30. [CrossRef]
2. Jung, K.-W.; Won, Y.-J.; Kong, H.-J.; Lee, E.S. Cancer statistics in korea: Incidence, mortality, survival, and
prevalence in 2016. Cancer Res. Treat. 2019, 51, 417–430. [CrossRef] [PubMed]
3. Park, J.; Rho, M.J.; Park, Y.H.; Jung, C.K.; Chong, Y.; Kim, C.-S.; Go, H.; Jeon, S.S.; Kang, M.; Lee, H.J. Promise
clip project: A retrospective, multicenter study for prostate cancer that integrates clinical, imaging and
pathology data. Appl. Sci. 2019, 9, 2982. [CrossRef]
4. Koh, H.; Kattan, M.W.; Scardino, P.T.; Suyama, K.; Maru, N.; Slawin, K.; Wheeler, T.M.; Ohori, M. A nomogram
to predict seminal vesicle invasion by the extent and location of cancer in systematic biopsy results. J. Urol.
2003, 170, 1203–1208. [CrossRef] [PubMed]
5. Omlin, A.; Pezaro, C.; Mukherji, D.; Cassidy, A.M.; Sandhu, S.; Bianchini, D.; Olmos, D.; Ferraldeschi, R.;
Maier, G.; Thompson, E. Improved survival in a cohort of trial participants with metastatic castration-resistant
prostate cancer demonstrates the need for updated prognostic nomograms. Eur. Urol. 2013, 64, 300–306.
[CrossRef] [PubMed]
6. Hirasawa, Y.; Nakashima, J.; Sugihara, T.; Takizawa, I.; Gondo, T.; Nakagami, Y.; Horiguchi, Y.; Ohno, Y.;
Namiki, K.; Ohori, M. Development of a nomogram for predicting severe neutropenia associated with
docetaxel-based chemotherapy in patients with castration-resistant prostate cancer. Clin. Genitourin. Cancer
2017, 15, 176–181. [CrossRef]
7. Mun, S.; Park, J.; Dritschilo, A.; Collins, S.; Suy, S.; Choi, I.; Rho, M. The prostate clinical outlook (pco)
classifier application for predicting biochemical recurrences in patients treated by stereotactic body radiation
therapy (sbrt). Appl. Sci. 2018, 8, 1620. [CrossRef]
8. Fakoor, R.; Ladhak, F.; Nazi, A.; Huber, M. Using deep learning to enhance cancer diagnosis and classification.
In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013.
9. Liu, S.; Zheng, H.; Feng, Y.; Li, W. Prostate cancer diagnosis using deep learning with 3d multiparametric
MRI. arXiv 2017, arXiv:1703.04078.
10. Litjens, G.; Sánchez, C.I.; Timofeeva, N.; Hermsen, M.; Nagtegaal, I.; Kovacs, I.; Hulsbergen-Van De Kaa, C.;
Bult, P.; Van Ginneken, B.; Van Der Laak, J. Deep learning as a tool for increased accuracy and efficiency of
histopathological diagnosis. Sci. Rep. 2016, 6, 26286. [CrossRef]
11. Wang, X.; Yang, W.; Weinreb, J.; Han, J.; Li, Q.; Kong, X.; Yan, Y.; Ke, Z.; Luo, B.; Liu, T. Searching for prostate
cancer by fully automated magnetic resonance imaging classification: Deep learning versus non-deep
learning. Sci. Rep. 2017, 7, 15415. [CrossRef]
12. He, B.; Xiao, D.; Hu, Q.; Jia, F. Automatic magnetic resonance image prostate segmentation based on
adaptive feature learning probability boosting tree initialization and cnn-asm refinement. IEEE Access 2017,
6, 2005–2015. [CrossRef]
13. Bhattacharjee, S.; Park, H.-G.; Kim, C.-H.; Prakash, D.; Madusanka, N.; So, J.-H.; Cho, N.-H.; Choi, H.-K.
Quantitative analysis of benign and malignant tumors in histopathology: Predicting prostate cancer grading
using svm. Appl. Sci. 2019, 9, 2969. [CrossRef]
14. Neil, D.; Pfeiffer, M.; Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based
sequences. Advances in Neural Information Processing Systems. arXiv 2016, arXiv:1610.09513.
15. Bianchini, D.; Lorente, D.; Rodriguez-Vida, A.; Omlin, A.; Pezaro, C.; Ferraldeschi, R.; Zivi, A.; Attard, G.;
Chowdhury, S.; De Bono, J.S. Antitumour activity of enzalutamide (mdv3100) in patients with metastatic
castration-resistant prostate cancer (crpc) pre-treated with docetaxel and abiraterone. Eur. J. Cancer 2014, 50,
78–84. [CrossRef]
16. Miyoshi, Y.; Sakamoto, S.; Kawahara, T.; Uemura, K.; Yokomizo, Y.; Uemura, H. Correlation between
automated bone scan index change after cabazitaxel and survival among men with castration-resistant
prostate cancer. Urol. Int. 2019, 103, 279–284. [CrossRef]
17. Attard, G.; Reid, A.H.; A’Hern, R.; Parker, C.; Oommen, N.B.; Folkerd, E.; Messiou, C.; Molife, L.R.; Maier, G.;
Thompson, E. Selective inhibition of cyp17 with abiraterone acetate is highly active in the treatment of
castration-resistant prostate cancer. J. Clin. Oncol. 2009, 27, 3742. [CrossRef]
18. Penson, D.F.; Armstrong, A.J.; Concepcion, R.; Agarwal, N.; Olsson, C.; Karsh, L.; Dunshee, C.; Wang, F.;
Wu, K.; Krivoshik, A. Enzalutamide versus bicalutamide in castration-resistant prostate cancer: The strive
trial. J. Clin. Oncol. 2016, 34, 2098–2106. [CrossRef]
19. Parker, C.; Heinrich, D.; OSullivan, J.; Fossa, S.; Chodacki, A.; Demkow, T.; Cross, A.; Bolstad, B.;
Garcia-Vargas, J.; Sartor, O. Overall survival benefit of radium-223 chloride (alpharadin) in the treatment
of patients with symptomatic bone metastases in castration-resistant prostate cancer (crpc): A phase iii
randomized trial (alsympca). Eur. J. Cancer 2011, 47, 3. [CrossRef]
20. McHugh, M.L. The chi-square test of independence. Biochem. Med. 2013, 23, 143–149.
[CrossRef]
21. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. Smote: Synthetic minority over-sampling technique.
J. Artif. Intell. Res. 2002, 16, 321–357. [CrossRef]
22. Wang, J.; Xu, M.; Wang, H.; Zhang, J. Classification of imbalanced data by using the smote algorithm and
locally linear embedding. In Proceedings of the 2006 8th international Conference on Signal Processing,
Guilin, China, 16–20 November 2006.
23. Jeatrakul, P.; Wong, K.W.; Fung, C.C. Classification of imbalanced data by combining the complementary
neural network and smote algorithm. In Proceedings of the International Conference on Neural Information
Processing, Sydney, Australia, 22–25 November 2010; pp. 152–159.
24. Ditzler, G.; Polikar, R.; Chawla, N. An incremental learning algorithm for non-stationary environments
and class imbalance. In Proceedings of the 2010 20th International Conference on Pattern Recognition,
Istanbul, Turkey, 23–26 August 2010; pp. 2997–3000.
25. Torgo, L.; Ribeiro, R.P.; Pfahringer, B.; Branco, P. Smote for regression. In Proceedings of the Portuguese
Conference on Artificial Intelligence, Azores, Portugal, 9–12 September 2013; pp. 378–389.
26. Boyce, M.S.; Vernier, P.R.; Nielsen, S.E.; Schmiegelow, F.K. Evaluating resource selection functions. Ecol. Model.
2002, 157, 281–300. [CrossRef]
27. Jia, Y.; Martín, A.; Ashish, A.; Paul, B.; Eugene, B.; Zhifeng, C.; Craig, C.; Greg, S.C.; Andy, D.; Jeffrey, D.;
et al. Tensorflow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online:
http://tensorflow.org/ (accessed on 1 October 2019).
28. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
30. Elishmereni, M.; Kheifetz, Y.; Shukrun, I.; Bevan, G.H.; Nandy, D.; McKenzie, K.M.; Kohli, M.; Agur, Z.
Predicting time to castration resistance in hormone sensitive prostate cancer by a personalization algorithm
based on a mechanistic model integrating patient data. Prostate 2016, 76, 48–57. [CrossRef] [PubMed]
31. Humphreys, M.R.; Fernandes, K.A.; Sridhar, S.S. Impact of age at diagnosis on outcomes in men with
castrate-resistant prostate cancer (crpc). J. Cancer 2013, 4, 304. [CrossRef]
32. Angulo, J.C.; Andrés, G.; Ashour, N.; Sánchez-Chapado, M.; López, J.I.; Ropero, S. Development of castration
resistant prostate cancer can be predicted by a DNA hypermethylation profile. J. Urol. 2016, 195, 619–626.
[CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
A Comparison of Deep Learning Methods for ICD
Coding of Clinical Records
Elias Moons 1, *, Aditya Khanna 1,2 , Abbas Akkasi 1 and Marie-Francine Moens 1
1 Department of Computer Science, KU Leuven, 3000 Leuven, Belgium; [email protected] (A.K.);
[email protected] (A.A.); [email protected] (M.-F.M.)
2 Department of Electrical Engineering, IIT Bombay, Maharashtra 400076, India
* Correspondence: [email protected]
Abstract: In this survey, we discuss the task of automatically classifying medical documents into the
taxonomy of the International Classification of Diseases (ICD), by the use of deep neural networks.
The literature in this domain covers different techniques. We will assess and compare the performance
of those techniques in various settings and investigate which combination yields the best results.
Furthermore, we introduce an hierarchical component that exploits the knowledge of the ICD
taxonomy. All methods and their combinations are evaluated on two publicly available datasets
that represent ICD-9 and ICD-10 coding, respectively. The evaluation leads to a discussion of the
advantages and disadvantages of the models.
1. Introduction
The International Classification of Diseases (ICD), which is endorsed by the World Health
Organization, is the diagnostic classification standard for clinical and research purposes in the medical
field. ICD defines the universe of diseases, disorders, injuries, and other related health conditions,
listed in a comprehensive, hierarchical fashion. ICD coding allows for easy storage, retrieval, and
analysis of health information for evidence-based decision-making; sharing and comparing health
information between hospitals, regions, settings, and countries; and data comparisons in the same
location across different time periods (https://www.who.int/classifications/icd/en/). ICD has
been revised periodically to incorporate changes in the medical field. Today, there have been 11
revisions of the ICD taxonomy, where ICD-9 and ICD-10 are the most studied when it comes to
their automated assignment to medical documents. In this paper, we compare state-of-the-art neural
network approaches to classification of medical reports written in natural language (in this case
English) according to ICD categories.
ICD coding of medical reports has been a research topic for many years [1]. Hospitals need to
label their patient visits with ICD codes to be in accordance with the law and to gain subsidies from
the government or refunds from insurance companies. When the documents are in free text format,
this process is still done manually. Automating (a part of) this process would greatly reduce the
administrative work.
In this paper, we compare the performance of several deep learning based approaches for ICD-9
and ICD-10 coding. The codes of ICD-9 consist of, at most, five numbers. The first three numbers
represent a high level disease category, a fourth number narrows this down to specific diseases, and a
fifth number differentiates between specific disease variants. This leads to a hierarchical taxonomy with
four layers underneath a root node. The first layer (L1) consists of groups of 3-numbered categories;
the next three layers (L2 through L4) correspond to the first 3, 4, or 5 numbers of the ICD code, as
displayed in the upper part of Figure 1. In the lower part of this figure, a concrete example of the
coding is shown.
Figure 1. ICD-9 code structure with second through fourth layer representations and diabetes as
an example.
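To make the layer structure concrete, the following small sketch (our own illustration, not part of the surveyed methods) maps a full numeric ICD-9 code to its L2–L4 representations.

```python
def icd9_levels(code: str):
    """Map a full diagnostic ICD-9 code to its L2, L3, and L4 representations.

    L2 = first 3 digits, L3 = first 4 digits, L4 = all 5 digits (when present).
    Simplified sketch for purely numeric codes; V- and E-prefixed codes are ignored.
    """
    digits = code.replace(".", "")
    levels = []
    for length in (3, 4, 5):
        if len(digits) >= length:
            part = digits[:length]
            levels.append(part if length == 3 else part[:3] + "." + part[3:])
    return levels

print(icd9_levels("250.01"))  # ['250', '250.0', '250.01'] (diabetes family, cf. Figure 1)
```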
In this paper, we survey state-of-the-art deep learning approaches for ICD-9 coding. We especially
focus on the representation learning that the methods accomplish.
Experiments with ICD-9 are carried out on the MIMIC-III dataset [2]. This dataset consists of over
50,000 discharge summaries of patient visits in US hospitals. These summaries are in free textual format
and labeled with corresponding ICD-9 codes; an example snippet is shown in Figure 2. Most discharge
summaries are labeled with multiple categories, leading to a multiclass and multilabel setting for
category prediction.
Figure 2. Example snippet of discharge summary from the MIMIC-III dataset with corresponding
target International Classification of Diseases (ICD) codes.
Codes from the ICD-10 version are very similar to those of ICD-9. The main difference is that they
consist of up to seven characters of which at least the first three are always present, the latter four are
optional. The first character is an uppercase alphabetic letter, all other characters are numeric. The first
three characters indicate the category of the diagnoses, and the following three characters indicate the
etiology, anatomic site, severity, or other clinical details. A seventh character indicates an extension.
An example of the ICD-10 structure is shown in Figure 3; it visualizes the same diagnosis as in Figure 1
but for ICD-10 instead of ICD-9.
Experiments with ICD-10 are conducted on the CodiEsp dataset, which is publicly available.
This dataset consists of 1000 discharge summaries of patient visits in Spain. The documents are in free
text format, automatically translated from Spanish to English, and they are manually labeled
with ICD-10 codes by healthcare professionals.
Figure 3. ICD-10 code structure with second through fourth layer representations and diabetes as
an example.
The deep learning methods that we discuss in this paper encompass different neural network
architectures including convolutional and recurrent neural networks. It is studied how they can be
extended with suitable attention mechanisms and loss functions and how the hierarchical structure
of the ICD taxonomy can be exploited. ICD-10 coding is especially challenging because, in the benchmark
dataset that we use for our experiments, the ICD coding model has to deal with very little manually
labeled training data.
In our work we want to answer the following research questions. What are the current
state-of-the-art neural network approaches for classifying discharge summaries? How do they compare
to each other in terms of performance? What combination of techniques gives the best results on
a public dataset? We hypothesize the following claims. (1) A combination of self-attention and
convolutional layers yields the best classification results. (2) In a setting with fewer training samples per
category, attention on description vectors of the target categories improves the results. (3) Using the
hierarchical taxonomy explicitly in the model improves classification on a small dataset. The most
important contribution of our work is an extensive evaluation and comparison of state-of-the-art deep
learning models for ICD-9 and ICD-10 coding which currently does not exist in the literature.
The remainder of this paper is organized as follows. In Section 2, related work relevant for the
conducted research will be discussed. Section 3 will elaborate on the datasets used in the experiments
and how this data is preprocessed. The compared deep learning methods are described in Section 4.
These methods are evaluated on the datasets in different settings and all findings are reported in
Section 5. The most important findings will be discussed in Section 6. Finally, we conclude with some
recommendations for future research.
2. Related Work
The most prominent and more recent advancements in categorizing medical reports with standard
codes will be described in this section.
evaluated on both the MIMIC-II and MIMIC-III datasets but achieves inferior results to the neural
models presented further in this paper. Zeng et al. [21] transfer MeSH domain knowledge to improve
automatic ICD-9 coding but improvements compared to baselines are limited. Baumel et al. [22]
have introduced the Hierarchical Attention bidirectional Gated Recurrent Unit model (HA-GRU).
By identifying relevant sentences for each label, documents are tagged with corresponding ICD codes.
Results are reported both on the MIMIC-II and MIMIC-III datasets. Mullenbach et al. [23] present the
Convolutional Attention for Multilabel classification (CAML) model that combines the strengths of
convolutional networks and attention mechanisms. They propose adding regularization on the long
descriptions of the target ICD codes, especially to improve classification results on less represented
categories in the dataset. This approach is further extended with the idea of multiple convolutional
channels in [24] with max pooling across all channels. The authors also shift the attention from the
last prediction layer, as in [23], to the attention layer. Mullenbach et al. [23,24] achieve state-of-the-art
results for ICD-9 coding on the MIMIC-III dataset. As an addition to these models, in this paper a
hierarchical variant of each of them is constructed and evaluated.
Recently, language models have become popular in natural language processing. The use of
Bidirectional Encoder Representations from Transformers (BERT) models, which use a transformer
architecture with multi-head attention, and especially BioBERT, has improved the overall recall values at
the expense of precision compared to CNN and LSTM models when applied in the ICD-10 coding task
at CLEF eHealth in 2019 [25], a finding which we have confirmed in our experiments. Therefore, we do
not report on experiments with this architecture in this survey.
Finally, Campbell et al. [26] survey the literature on the benefits, limitations, implementation,
and impact of computer-assisted clinical coding on clinical coding professionals. They conclude that
human coders could be greatly helped by current technologies and are likely to become clinical coding
editors in an effort to raise the quality of the overall clinical coding process. Shickel et al. [11] review
deep learning models for EHR systems by examining architectures, technical aspects, and clinical
applications. Their paper discusses shortcomings of the current techniques and future research
directions among which the authors cite ICD coding of free clinical text as one of the future challenges.
of confusing classes that are “nearby” in the taxonomy is less costly or severe than predicting a
class that is “far away” from the correct class. This is realized by scaling the penalties for margin
violation. A similar idea is modeled in a deep learning model for audio event detection [31]. These
authors propose the hierarchy-aware loss function modeled as a triplet or quadruplet loss function that
favors confusing classes that are close in the taxonomy, over ones that are far away from the correct
class. In [32], an hierarchical SVM is shown to outperform a flat SVM. Results are reported
on the MIMIC-II dataset. In a deep neural network setting, recent publications on hierarchical text
classification outside the medical field make use of label distribution learning [18], an hierarchical
softmax activation function [33], and hierarchical multilabel classification networks [34].
Recent research shows the value of hierarchical dependencies using hierarchical attention
mechanisms [22] and hierarchical penalties [34], which are also integrated in the training of the
models surveyed in this paper.
If the target output space of categories follows a hierarchy of labels, as is also the case
in ICD coding, the trained models can use this hierarchy efficiently for category assignment or
prediction [32,35,36]. During categorization, the models apply a top-down or a bottom-up approach at
the classification stage. In a top-down approach, parent categories are assigned first and only children
of assigned parents are considered as category candidates. In a bottom-up approach, only leaf nodes
in the hierarchy are assigned, which entails that their parent nodes are assigned as well.
In the context of category occurrences in hierarchical target spaces, a power-law distribution is
described in [37]. Later, the authors of [38] addressed this phenomenon quantitatively, deriving
a relationship in terms of space complexity for such distributions. They proved that
hierarchical classifiers have lower space complexity than their flat variants if the hierarchical target
space satisfies certain conditions based on, e.g., the maximum branching factor and the depth of the
hierarchy. The hierarchical variants discussed in this survey have a different shape than those discussed
in these works (layer-based instead of node-based) and do not satisfy the necessary conditions for
these relationships to apply.
3. Materials
• Dis-50 consists of a selection of the discharge summaries from the MIMIC-III dataset (11,369 out
of 52,726) for the classification of Top-50 ICD-9 codes. We use the publicly available split [23] for
training (8066), testing (1729), and development (1574) of the models.
• Dis describes the full label setting where all Diagnostic (6918) and Procedural (2011) ICD-9 codes
are used. This leads to a total of 8929 unique codes on the 52,726 discharge summaries. We again
use the publicly available split for training (47,723), testing (3372), and development (1631) of
the models.
• Full extends the Dis dataset with other notes regarding the patient (radiology notes, nursing notes,
etc.) in addition to the discharge summaries. This dataset contains almost thrice the number of
tokens for training. We use the same train, test, and development split as used in the Dis dataset.
[Figure: bar chart of category frequencies; y-axis: frequency in training set.]
There are in total 1767 different categories spread out over only 500 training documents.
Every document is labeled with on average 11.3 different categories and each category is on average
represented by 3.2 training examples. Even the top 50 most frequently occurring categories have only
between 15 and 112 corresponding positive training documents. Therefore, tests for this dataset are
conducted on these 50 categories. Table 1 gives an overview of statistics for all discussed training
datasets. The specifics for the corresponding development and test sets are similar. Displayed statistics
for the Dis and the Full dataset are the same since the only difference lies in larger text fragments,
resulting in 72,891 unique tokens for the Full dataset compared to 51,917 for Dis. There are no
differences concerning the labels.
3.3. Preprocessing
The preprocessing follows the standard procedure described in [23], i.e., tokens that contain no
alphabetic characters are removed and all tokens are put to lowercase. Furthermore, tokens that appear
in fewer than three training documents are replaced with the “UNK” token. All documents are then
truncated to a maximum length of 2500 tokens.
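A minimal sketch of this preprocessing pipeline; the thresholds follow the description above, but the implementation itself is ours and not the code of [23].

```python
from collections import Counter

MAX_LEN = 2500       # truncation length in tokens
MIN_DOC_FREQ = 3     # minimum number of training documents a token must appear in

def tokenize(text):
    """Lowercase, split on whitespace, and drop tokens without alphabetic characters."""
    return [t.lower() for t in text.split() if any(c.isalpha() for c in t)]

def preprocess(train_docs):
    """Tokenize, replace rare tokens with 'UNK', and truncate documents to MAX_LEN."""
    tokenized = [tokenize(d) for d in train_docs]
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vocab = {tok for tok, df in doc_freq.items() if df >= MIN_DOC_FREQ}
    return [[t if t in vocab else "UNK" for t in doc][:MAX_LEN] for doc in tokenized]
```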
4. Methods
In this section, all tested models will be discussed in detail. First, a simple convolutional and a
recurrent baseline commonly used in text classification are described. Then, two recent state-of-the-art
models in the field of ICD coding are explained in detail. These models are implemented by the authors
following the original papers and are called DR-CAML [23] and MVC-(R)LDA [24], respectively.
We discuss in detail the attention mechanisms and loss functions of these models. Afterwards, as a
way of handling the hierarchical dependencies of the ICD-codes, we propose various ways of their
integration in all models. This is based on advancements in hierarchical classification as inspired
by [34].
For each document i, all discussed models take a sequence of word vectors x_i representing the
document as input and produce a set of ICD codes y_i as output.
4.1. Baselines
The performance of all models will be evaluated and compared against two simple common
baselines used for handling sequential input data (text). These models are, respectively, based on
convolutional and recurrent neural principles.
4.1.1. Convolutional
The baseline convolutional neural network model, or CNN, consists of a 1D temporal
convolutional neural layer. This convolutional layer consists of different kernels, which are filters with
a specific pattern that are tested against all sequences of the input data with the same length. This is
followed by a (max-)pooling layer, to reduce the data size by only remembering the maximum value
over a certain range. More formally, for an input x and a given 1D kernel f on element s of the input
sequence, the convolutional and pooling operation can be defined as follows.
F_1(s) = (x * f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s-i}    (1)

F_2(s) = \max_{i=0}^{l-1} F_1(s - i)    (2)
The amount of filters k and l in the convolutional and pooling layer, respectively, as well as their
sizes are optimizable parameters of this model. For both layers a stride length, i.e., the amount by
which the filter shifts in the sequence, can be defined leading to a trade-off between output size and
observability of detailed features.
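A hedged Keras sketch of such a baseline; the embedding dimension, number of filters, kernel size k, and pooling size l below are placeholders rather than the tuned values reported in Table 2.

```python
import tensorflow as tf

def cnn_baseline(vocab_size, n_labels, max_len=2500, emb_dim=100,
                 n_filters=500, kernel_size=4, pool_size=8):
    """1D convolution over token embeddings (Equation (1)), max pooling over
    windows of size l (Equation (2)), and per-label sigmoid outputs."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, emb_dim),
        tf.keras.layers.Conv1D(n_filters, kernel_size, activation="tanh"),
        tf.keras.layers.MaxPooling1D(pool_size=pool_size),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_labels, activation="sigmoid"),  # multilabel ICD prediction
    ])
```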
4.1.2. Recurrent
As the recurrent neural network baseline, two common approaches are considered.
BiGRU
The GRU, or Gated Recurrent Unit, is a gating mechanism in recurrent neural networks. It is
the mechanism of recurrent neural networks allowing the model to “learn to forget” less important
fragments of the data and “learn to remember” the more important fragments with respect to the
learning task. More formally, consider an input vector xt , update gate vector zt , reset gate vector rt ,
and output vector ht at time t. The respective values can be calculated as follows.
This leads to weight matrices Wz , Wr , and Wh to train as well as biases bz , br , and bh , σ stands for
the sigmoid activation function. BiGRU is the bidirectional variant of such a model that processes the
input data front-to-back and back-to-front in parallel.
BiLSTM
An LSTM, or Long Short-Term Memory neural network model, is very similar to a GRU but
replaces the update gate with a forget gate and an additional output gate. This way it usually has more
computational power than a regular GRU, but at the expense of more trainable parameters and more
chance of overfitting when the amount of training data is limited [40–42]. Formally, consider again an
input vector xt and a hidden state vector ht at time t. Activation vectors for the update gate, forget gate,
and output gate are represented by z_t, f_t, and o_t, respectively. In the standard LSTM formulation
(with cell state c_t and candidate \tilde{c}_t), these states relate to each other as follows:

z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + z_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

This again leads to weight matrices W_z, W_f, W_o, and W_c to train, as well as biases b_z, b_f, b_o, and b_c,
with σ being the sigmoid activation function. BiLSTM is the bidirectional variant of a regular LSTM,
analogous to BiGRU, which is the bidirectional variant of GRU.
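For completeness, a comparable hedged sketch of the bidirectional recurrent baseline; the sizes are again illustrative, not the settings of Table 2.

```python
import tensorflow as tf

def bigru_baseline(vocab_size, n_labels, max_len=2500, emb_dim=100, hidden=128):
    """Bidirectional GRU baseline for multilabel ICD prediction."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, emb_dim, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(hidden)),  # swap GRU for LSTM to obtain BiLSTM
        tf.keras.layers.Dense(n_labels, activation="sigmoid"),
    ])
```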
4.2.1. DR-CAML
DR-CAML is a CNN-based model adapted for ICD coding [23]. Each ICD code defined by the WHO is
accompanied by a label description expressed in natural language, which DR-CAML uses to guide the
learning of appropriate parameter values. For this purpose, the model employs a per-label attention
mechanism enabling it to learn distinct document representations for each label.
It has been shown that for labels for which there are very few training instances available, this approach
is advantageous. The idea is that the description of a target code is itself a very good training example
for the corresponding code. Similarity between the representation of a given test sample and the
representation of the description of a target code gives extra confidence in assigning this label.
In general, after the convolutional layer, DR-CAML employs a per-label attention mechanism
to attend to the relevant parts of text for each predicted label. An additional advantage is that the
per-label attention mechanism provides the model with the ability of explaining why it decided to
assign each code by showing the spans of text relevant for the ICD code.
DR-CAML consists of two modules: one for the representation of the input text, and the other
for the embedding of the label’s description as is visualized in Figure 5. The CAML module has a
CNN at the base layer which takes a sequence of the embeddings of the text tokens as input and
consequently represents the document as the matrix H. Then, the per-label attention mechanism
applies. Attention in this context means learning which parts of some context (the label description
vectors) are relevant for a given input vector.
After calculating the attention vector α using a softmax activation function, it is applied as a
product with H. With hl , the vector parameter for label l, the vector representation for each label is
computed as
v_l = \sum_{n=1}^{N} \alpha_{l,n} h_n    (13)
Given the vector representation of a document, the probability ŷ_l for label l can be obtained
as shown in Figure 5.
The CNN modules on the left-hand side minimize the binary cross-entropy loss. The second
module is a max-pooling CNN model that produces a max-pooled vector z_l from the description
of code l. Assuming n_y is the number of true labels in the training data, the final loss is computed
by adding a regularization term to the base loss function. The loss function is explained in more detail
in Section 4.2.3.
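To make the per-label attention of Equation (13) concrete, a small NumPy sketch of the CAML-style attention and prediction step; the shapes and parameter names are ours, not the original implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def per_label_attention(H, U, beta, b):
    """CAML-style per-label attention and prediction.

    H:    (N, d) convolutional document representation (N positions, d channels)
    U:    (L, d) per-label attention parameters (one vector per label)
    beta: (L, d) per-label output weights; b: (L,) per-label biases
    Returns the per-label probabilities yhat with shape (L,).
    """
    yhat = np.zeros(len(U))
    for l in range(len(U)):
        alpha = softmax(H @ U[l])      # attention over the N positions for label l
        v_l = alpha @ H                # Equation (13): weighted sum of the rows of H
        yhat[l] = 1.0 / (1.0 + np.exp(-(beta[l] @ v_l + b[l])))
    return yhat
```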
4.2.2. MVC-(R)LDA
MVC-LDA and MVC-RLDA can be seen as extensions of DR-CAML. Similar to that model,
they are based on a CNN architecture with a label attention mechanism that considers ICD coding as a
multi-task binary classification problem. The added functionality lies in the use of parallel CNNs with
different kernel sizes to capture information of different granularity. MVC-LDA, the top module in
Figure 6, is a multi-view CNN model stacked on an embedding layer. MVC-RLDA reintroduces the
per-label attention mechanism introduced in the previous subsection.
In general, the multi-view CNNs are constructed with four CNNs that have the same number of
filters but with different kernel sizes. This convolutional layer is followed by a max-pooling function
across all channels to select the most relevant span of text for each filter. A separate attention layer
for each label comes next, helping the model to attend to relevant parts of a document for each label.
A linear layer with weight vector Vj is implemented for the jth label and CVj is the attention for the
input C and label j. This attention vector CVj is the output of a dense layer with softmax activation
function leading to a relative weighting of the input elements C.
P_j = C^T (C V_j).    (14)
At the end, a dense layer is used for each label. The length of an input document is also encoded
into the output layers with embedding function
T_j(l) = \mathrm{sigmoid}(K_j l + d_j),    (15)
to decrease the problem of under-coding to some extent. This is done as in [24], which showed a
statistically significant Pearson’s correlation between the input length and the number of ground truth
codes. Therefore, the model can derive an underlying bias that, on average, shorter input documents
represent a lower amount of categories.
Parameters a, K_j, and d_j in the length embedding function represent the input length and the layer's
weight and bias, respectively, for a given label j. The prediction y_j for class j is then
computed as
y_j = \mathrm{sigmoid}(U_j^T P_j + b_j + T_j(a)).    (16)
Similar to DR-CAML, this model minimizes the binary cross-entropy loss function. Adding the label
description embedding to MVC-LDA (the lower part of Figure 6) leads to MVC-RLDA, whose loss
function includes an extra weighted term as a regularizer. It guides the attention weights to avoid
overfitting. In addition, this regularization forces the attention for classes with similar descriptions to
be closer to each other. The loss function is again explained in more detail in Section 4.2.3.
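A hedged Keras sketch of the multi-view convolution with max pooling across the parallel views; the per-label attention and length embedding described above would be applied on top and are omitted here. The kernel sizes and filter count are placeholders, not the tuned values of [24].

```python
import tensorflow as tf

def multi_view_encoder(vocab_size, max_len=2500, emb_dim=100,
                       n_filters=100, kernel_sizes=(3, 5, 7, 9)):
    """Parallel Conv1D views with different kernel sizes, combined by an
    element-wise maximum across the views (one value per position and filter)."""
    tokens = tf.keras.layers.Input(shape=(max_len,))
    emb = tf.keras.layers.Embedding(vocab_size, emb_dim)(tokens)
    views = [tf.keras.layers.Conv1D(n_filters, k, activation="relu", padding="same")(emb)
             for k in kernel_sizes]
    pooled = tf.keras.layers.Maximum()(views)    # max pooling across the channels/views
    return tf.keras.Model(tokens, pooled)        # output shape (batch, max_len, n_filters)
```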
which leaves a parameter λ to optimize the loss function (parameter λ is optimized over the
training set).
5. Results
5.1. MIMIC-III
Results are displayed for five different models. First, results for the two baseline models, CNN
and BiGRU, are shown. Because in most of the experiments, the BiGRU models performed at least on
par with their BiLSTM variants, we only report the results of BiGRU as a recurrent neural network
baseline. The reason for this good performance of GRU models compared to LSTM models most
likely resides in the amount of available training data for various target categories. Then, we report
on three more advanced models as discussed in the Method section: DR-CAML, MVC-LDA, and
MVC-RLDA. Different hyperparameter values are considered and tested on the development set of
MIMIC-III; the setting giving the highest average performance on the development set is reported in
Table 2.
Table 2. Optimal hyperparameter values obtained on the development set of the MIMIC-III dataset.
For all these models using their optimal hyperparameter settings, the average performance is
reported in terms of Micro F1, Macro F1, Micro AUC (ROC), and Precision@X. For models that are only
evaluated on the top 50 most frequent categories in the training data, results are displayed in Table 3.
This experiment is then repeated over all categories, which leads to the results in Table 4. Last, Table 5
gives the results of training the models on all labels for the Full dataset.
Table 3. Results of flat models on top 50 most frequent categories of the MIMIC-III Dis-50 dataset.
Table 4. Results of flat models on all labels of the MIMIC-III Dis dataset.
Table 5. Results of flat models on all labels of the MIMIC-III Full dataset.
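For reference, the reported metrics can be computed as in the following sketch; Precision@X is implemented by hand since it is not part of scikit-learn, and the whole block is our own illustration rather than the evaluation code used for the tables.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def precision_at_x(y_true, y_score, x=5):
    """Mean fraction of correct labels among the x highest-scored labels per document."""
    top = np.argsort(-y_score, axis=1)[:, :x]
    hits = np.take_along_axis(y_true, top, axis=1)
    return hits.mean()

def evaluate(y_true, y_score, threshold=0.5):
    """y_true: (docs, labels) binary matrix; y_score: (docs, labels) predicted probabilities."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "micro_auc": roc_auc_score(y_true, y_score, average="micro"),
        "precision_at_5": precision_at_x(y_true, y_score, x=5),
    }
```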
This experiment is repeated for the hierarchical variants of all described models. This time,
only results on the top 50 most frequent target categories are reported in Table 6. As hierarchical
models introduce a large number of additional intermediate categories, the target space is too large to
train these hierarchical variants in a full category setting.
To assess the importance of the different components of the highest performing model on
MIMIC-III Dis, an ablation study is conducted. The multi-view and the hierarchical component
are added and the regularization on long descriptions of the target ICD-codes is removed while all
other components stay the same. The difference in performance is measured and visualized in Figure 8.
[Figure 8 data: performance loss in % Micro F1 when removing the regularization on long descriptions (0.46), adding the multi-view component (0.11), and adding the hierarchy (1.96).]
Figure 8. Ablation study on highest performing model (hierarchical MVC-RLDA) for MIMIC-III dataset.
5.2. CodiEsp
Similar experiments are carried out on the CodiEsp dataset, while only using the top 50 most
frequent codes. The same hyperparameter settings are used as in Table 2. Results are visualized in
Tables 7 and 8. Results for the full target space are not reported, as 90% of the target categories would
only have 5 or fewer positive training examples. Furthermore, this would lead to a target space of
1767 different categories with only 500 training examples in total, which makes training of a decent
model unfeasible.
Table 7. Results of flat models on top 50 codes obtained on the CodiEsp dataset.
Table 8. Results of hierarchical models for top 50 codes obtained on the CodiEsp dataset.
To assess the importance of the different components of the highest performing model on CodiEsp,
an ablation study is conducted. The multi-view, the hierarchical component, and the regularization on
long descriptions of the target ICD-codes are each removed while all other components stay the same.
The difference in performance is measured and visualized in Figure 9.
[Figure 9 data: performance loss in % Micro F1 when removing the regularization on long descriptions (0.92), removing the multi-view component (3.57), and removing the hierarchy (2.4).]
Figure 9. Ablation study on highest performing model (hierarchical MVC-RLDA) for CodiEsp dataset.
6. Discussion
A comparison between the results displayed in Tables 3 and 4 shines a light on the value of the
multiview component. The micro F1 scores of the five models are in similar relationship to each other
in both tables, except for the two multiview models. They outperform CAML in the full label setting
of the MIMIC-III Dis dataset, whereas they show very similar behavior to CAML in a top-50 category
setting, where for each of the categories a decent amount of training samples is available. When the
target space increases and more categories have fewer training examples, the added granularity of
having multiple kernel sizes in the MVC-(R)LDA model pays off. Table 5 shows results for models
trained on the Full dataset. The best performing model (MVC-LDA, 58.12%) is outperformed by the
best performing model for all labels on Dis (MVC-LDA, 59.75%). The addition of the information in
other medical documents than just discharge summaries thus seems to complicate instead of facilitate
the classification process.
Furthermore, comparing Tables 3 and 6, where the influence of the hierarchical parameter can
be assessed in a top-50 category setting, reveals a shift in the opposite direction. While in general,
the modeling of the hierarchical relationships hurts the classification process for all categories, it hinders
the multiview models the most. This time, DR-CAML is clearly the best performing model. Adding
multiview and simultaneously modeling the hierarchical relationships between the target categories
tend to make the model overfit on the training data.
Looking at Tables 7 and 8, it is clear that the lack of a sufficient amount of training data in CodiEsp
(about 100 times less than for the Dis dataset) for most categories led to lower performance of all
models on this dataset. For the flat variants of the models, a regular CNN even outperforms the more
complex models. As the amount of training data is low, the added complexity of the latter models
hinders them from generalizing well to unseen data. Comparing the results in both tables also leads to the
conclusion that in contrast to the results on MIMIC-III, on average the hierarchical component increases
the classification performance based on Micro F1 on CodiEsp. Where the information embedded in
the ICD taxonomy is redundant and even counteracting the performance for the larger MIMIC-III
dataset, it is leveraged when there is a lack of information in the training data itself, which is the case
for CodiEsp.
Last, Figures 8 and 9 display the relative importance of the long description regularization,
the multi-view, and the hierarchy for the top performing model on both the Dis and CodiEsp
datasets. For the Dis dataset, not using the hierarchy is by far the most important component.
The regularization on long descriptions still adds 0.46% and the multi-view almost does not influence
the results. For CodiEsp, it shows that the multi-view component has the biggest influence, followed
by the hierarchy, whose importance on this smaller dataset is already shown previously.
7. Conclusions
In this paper, we have surveyed the current methods used for classification of clinical reports
based on ICD codes using neural networks. We have combined the techniques already present in the
literature and assessed the relative importance of all present components. Combining a convolutional
framework with self-attention, as well as regularizing the loss function with attention on the long
descriptions of target ICD codes proved to be valuable. Furthermore, a hierarchical objective was
integrated in all presented models. Its added value lies especially in a setting with low amounts of
available training data. Last, extending the dataset with the information present in other medical
documents introduced too much noise into the data, hindering the performance of the tested models.
Concerning future research directions, it would be valuable to test the techniques on an ICD-10 or
ICD-11 dataset of larger size. This would give better insights into the performance these models
could achieve in current hospital settings. On a similar note, tackling the problem of lack of data
by finding a way to combine the available training data from different datasets (e.g., MIMIC-III
and CodiEsp) and different ontologies (e.g., ICD-9, ICD-10, and MeSH) could further improve
the classification performance of all models. Last, it would be interesting to investigate the use
of hierarchical descriptions as an addition to the loss function, giving another use for the information
inherently present in the ICD taxonomy.
Author Contributions: Conceptualization, E.M. and M.-F.M.; Methodology, E.M. and M.-F.M.; Software, E.M.,
A.A. and A.K.; Validation, E.M., A.A. and A.K.; Formal Analysis, E.M. and M.-F.M.; Investigation, E.M.; Resources,
A.K.; Data Curation, A.K.; Writing-Original Draft Preparation, E.M. and M.-F.M.; Writing-Review and Editing, E.M.,
A.A., A.K. and M.-F.M.; Visualization, E.M.; Supervision, M.-F.M.; Project Administration, A.A. and M.-F.M.;
Funding Acquisition, M.-F.M. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the VLAIO SBO project, grant number 150056. M.-F.M. is co-funded by the
ERC Advanced Grant CALCULUS H2020-ERC-2017-ADG 788506.
Conflicts of Interest: Authors declare no conflict of interest.
References
1. Larkey, L.; Croft, W.B. Automatic Assignment of ICD9 Codes To Discharge Summaries; Technical Report;
University of Massachusetts: Amherst, MA, USA, 1995.
2. Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Lehman, L.w.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.;
Anthony Celi, L.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 1–9.
[CrossRef] [PubMed]
3. Larkey, L.S.; Croft, W.B. Combining classifiers in text categorization. In Proceedings of the 19th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich,
Switzerland, 18–22 August 1996; (Special Issue of the SIGIRForum); pp. 289–297.
4. Goldstein, I.; Arzumtsyan, A.; Özlem, U. Three approaches to automatic assignment of ICD-9-CM codes to
radiology reports. In Proceedings of the AMIA Annual Symposium; American Medical Informatics Association:
Washington, DC, USA, 2007; pp. 279–283.
5. Farkas, R.; Szarvas, G. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinform.
2008, 9 (Suppl. 3), S10. [CrossRef] [PubMed]
6. Marafino, B.J.; Davies, J.M.; Bardach, N.S.; Dean, M.L.; Dudley, R.A. N-gram support vector machines for
scalable procedure and diagnosis classification, with applications to clinical free text data from the intensive
care unit. J. Am. Med. Inform. Assoc. 2014, 21, 871–875. [CrossRef] [PubMed]
7. Stanfill, M.; Williams, M.; Fenton, S.; Jenders, R.; Hersh, W. A systematic literature review of automated
clinical coding and classification systems. J. Am. Med. Inform. Assoc. JAMIA 2010, 17, 646–651. [CrossRef]
[PubMed]
8. Kavuluru, R.; Rios, A.; Lu, Y. An empirical evaluation of supervised learning approaches in assigning
diagnosis codes to electronic medical records. Artif. Intell. Med. 2015, 65, 155–166. [CrossRef]
9. Scheurwegs, E.; Cule, B.; Luyckx, K.; Luyten, L.; Daelemans, W. Selecting relevant features from the
electronic health record for clinical code prediction. J. Biomed. Inform. 2017, 74, 92–103. [CrossRef]
10. Leo, M.; Furnari, A.; Medioni, G.G.; Trivedi, M.M.; Farinella, G.M. Deep Learning for Assistive Computer
Vision. In Proceedings of the Computer Vision—ECCV 2018 Workshops—Part VI, Munich, Germany, 8–14
September 2018; pp. 3–14.
11. Shickel, B.; Tighe, P.; Bihorac, A.; Rashidi, P. Deep EHR: A survey of recent advances in deep learning
techniques for electronic health record (EHR) analysis. IEEE J. Biomed. Health Inform. 2018, 22, 1589–1604.
[CrossRef]
12. Kelly, L.; Suominen, H.; Goeuriot, L.; Neves, M.; Kanoulas, E.; Li, D.; Azzopardi, L.; Spijker, R.; Zuccon, G.;
Scells, H.; et al. Overview of the CLEF eHealth Evaluation Lab 2019. In International Conference of the
Cross-Language Evaluation Forum for European Languages; Springer: Cham, Switzerland, 2019; pp. 322–339.
13. Shi, H.; Xie, P.; Hu, Z.; Zhang, M.; Xing, E.P. Towards automated ICD coding using deep learning. arXiv
2017, arXiv:1711.04075.
14. Duarte, F.; Martins, B.; Pinto, C.S.; Silva, M.J. Deep neural models for ICD-10 coding of death certificates and
autopsy reports in free-text. J. Biomed. Inform. 2018, 80, 64–77. [CrossRef]
15. Xie, P.; Xing, E. A neural architecture for automated ICD coding. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers); ACL: Melbourne, Australia, 2018.
16. Huang, J.; Osorio, C.; Sy, L.W. An empirical evaluation of deep learning for ICD-9 code assignment using
MIMIC-III clinical notes. Comput. Methods Prog. Biomed. 2019, 177, 141–153. [CrossRef]
17. Li, M.; Fei, Z.; Zeng, M.; Wu, F.; Li, Y.; Pan, Y.; Wang, J. Automated ICD-9 coding via a deep learning
approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019, 16, 1193–1202. [CrossRef] [PubMed]
18. Xu, K.; Lam, M.; Pang, J.; Gao, X.; Band, C.; Mathur, P.; Papay, F.; Khanna, A.K.; Cywinski, J.B.;
Maheshwari, K.; et al. Multimodal machine learning for automated ICD coding. In Proceedings of the
4th Machine Learning for Healthcare Conference; Doshi-Velez, F., Fackler, J., Jung, K., Kale, D., Ranganath, R.,
Wallace, B., Wiens, J., Eds.; PMLR: Ann Arbor, MI, USA, 2019; Volume 106, pp. 197–215.
19. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [CrossRef]
20. Wang, G.; Li, C.; Wang, W.; Zhang, Y.; Shen, D.; Zhang, X.; Henao, R.; Carin, L. Joint embedding of words
and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers); ACL: Melbourne, Australia, 2018.
21. Zeng, M.; Li, M.; Fei, Z.; Yu, Y.; Pan, Y.; Wang, J. Automatic ICD-9 coding via deep transfer learning.
Neurocomputing 2018, 324, 43–50. [CrossRef]
22. Baumel, T.; Nassour-Kassis, J.; Elhadad, M.; Elhadad, N. Multi-Label Classification of Patient Notes: A Case
Study on ICD Code Assignment. In Proceedings of the Workshops at the Thirty-Second AAAI Conference
on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
23. Mullenbach, J.; Wiegreffe, S.; Duke, J.; Sun, J.; Eisenstein, J. Explainable prediction of medical codes
from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); ACL: New Orleans, LA,
USA, 2018.
24. Sadoughi, N.; Finley, G.P.; Fone, J.; Murali, V.; Korenevski, M.; Baryshnikov, S.; Axtmann, N.; Miller, M.;
Suendermann-Oeft, D. Medical code prediction with multi-view convolution and description-regularized
label-dependent attention. arXiv 2018, arXiv:1811.01468.
25. Amin, S.; Neumann, G.; Dunfield, K.; Vechkaeva, A.; Chapman, K.A.; Wixted, M.K. MLT-DFKI at CLEF
eHealth 2019: Multi-label classification of ICD-10 codes with BERT. In Proceedings of the Working Notes of
CLEF 2019—Conference and Labs of the Evaluation, Forum, Lugano, Switzerland, 9–12 September 2019.
26. Campbell, S.; Giadresco, K. Computer-assisted clinical coding: A narrative review of the literature on its
benefits, limitations, implementation and impact on clinical coding professionals. Health Inf. Manag. J. 2019,
49, 183335831985130. [CrossRef] [PubMed]
27. Deschacht, K.; Moens, M. Efficient hierarchical entity classifier using conditional random fields.
In Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text
and Knowledge@COLING/ACL 2006, Sydney, Australia, 22 July 2006; pp. 33–40.
28. Babbar, R.; Partalas, I.; Gaussier, É.; Amini, M. On flat versus hierarchical classification in large-scale
taxonomies. In Proceedings of the Advances in Neural Information Processing Systems 26: Proccedings of
the 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–8
December 2013; pp. 1824–1832.
29. Gopal, S.; Yang, Y. Recursive Regularization for Large-Scale Classification with Hierarchical and Graphical
Dependencies; Association for Computing Machinery: New York, NY, USA, 2013.
30. Cai, L.; Hofmann, T. Hierarchical document categorization with support vector machines. In Proceedings of
the Thirteenth ACM International Conference on Information and Knowledge Management, Washington,
DC, USA, 8–13 November 2004.
31. Jati, A.; Kumar, N.; Chen, R.; Georgiou, P. Hierarchy-aware loss function on a tree structured label space for
audio event detection. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6–10.
32. Perotte, A.; Pivovarov, R.; Natarajan, K.; Weiskopf, N.; Wood, F.; Elhadad, N. Diagnosis code assignment:
Models and evaluation metrics. J. Am. Med. Inform. Assoc. 2014, 21, 231–237. [CrossRef]
33. Mohammed, A.A.; Umaashankar, V. Effectiveness of hierarchical softmax in large scale classification tasks.
In Proceedings of the 2018 International Conference on Advances in Computing, Communications and
Informatics (ICACCI), Bangalore, India, 19–22 September 2018; pp. 1090–1094.
34. Wehrmann, J.; Cerri, R.; Barros, R. Hierarchical multi-label classification networks. In Proceedings
of the Thirty-Fifth International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018;
pp. 5075–5084.
35. Silla, C.N.; Freitas, A.A. A survey of hierarchical classification across different application domains. Data Min.
Knowl. Discov. 2011, 22, 31–72. [CrossRef]
36. Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.J.; Gerber, M.S.; Barnes, L.E. HDLTex: Hierarchical
Deep Learning for Text Classification. In Proceedings of the 2017 16th IEEE International Conference on
Machine Learning and Applications, Cancun, Mexico, 18–21 December 2017.
37. Yang, Y.; Zhang, J.; Kisiel, B. A scalability analysis of classifiers in text categorization. In Proceedings of the
26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval,
Toronto, ON, Canada, 28 July–1 August 2003; pp. 96–103.
38. Babbar, R.; Metzig, C.; Partalas, I.; Gaussier, E.; Amini, M.R. On power law distributions in large-scale
taxonomies. ACM Sigkdd Explor. Newsl. 2014, 16, 47–56. [CrossRef]
129
Appl. Sci. 2020, 10, 5262
39. Miranda-Escalada, A.; Gonzalez-Agirre, A.; Krallinger, M. CodiEsp Corpus: Spanish Clinical Cases Coded in
ICD10 (CIE10)—eHealth CLEF2020. Available online: https://zenodo.org/record/3758054#.XxXGgy17E6h
(accessed on 29 July 2020).
40. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on
sequence modeling. arXiv 2014, arXiv:1412.3555.
41. Yin, W.; Kann, K.; Yu, M.; Schütze, H. Comparative study of CNN and RNN for natural language processing.
arXiv 2017, arXiv:1702.01923.
42. Kaiser, Ł.; Sutskever, I. Neural GPUs learn algorithms. arXiv 2015, arXiv:1511.08228.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Applying Machine Learning for Healthcare: A Case
Study on Cervical Pain Assessment with
Motion Capture
Juan de la Torre 1, *, Javier Marin 1 , Sergio Ilarri 2,3 and Jose J. Marin 1,4
1 IDERGO-Research and Development in Ergonomics, Biomechanical Laboratory, I3A-University Institute of
Research of Engineering of Aragon, University of Zaragoza, 50018 Zaragoza, Spain; [email protected] (J.M.);
[email protected] (J.J.M.)
2 Computer Science for Complex System Modelling (COSMOS), I3A-University Institute of Research of
Engineering of Aragon, University of Zaragoza, 50018 Zaragoza, Spain; [email protected]
3 Department of Computer Science and Systems Engineering, University of Zaragoza, 50018 Zaragoza, Spain
4 Department of Design and Manufacturing Engineering, University of Zaragoza, 50018 Zaragoza, Spain
* Correspondence: [email protected]
Abstract: Given the exponential availability of data in health centers and the massive sensorization that
is expected, there is an increasing need to manage and analyze these data in an effective way. For this
purpose, data mining (DM) and machine learning (ML) techniques would be helpful. However, due to
the specific characteristics of the field of healthcare, a suitable DM and ML methodology adapted
to these particularities is required. The applied methodology must structure the different stages
needed for data-driven healthcare, from the acquisition of raw data to decision-making by clinicians,
considering the specific requirements of this field. In this paper, we focus on a case study of cervical
assessment, where the goal is to predict the potential presence of cervical pain in patients affected with
whiplash diseases, which is important for example in insurance-related investigations. By analyzing
in detail this case study in a real scenario, we show how taking care of those particularities enables the
generation of reliable predictive models in the field of healthcare. Using a database of 302 samples,
we have generated several predictive models, including logistic regression, support vector machines,
k-nearest neighbors, gradient boosting, decision trees, random forest, and neural network algorithms.
The results show that it is possible to reliably predict the presence of cervical pain (accuracy, precision,
and recall above 90%). We expect that the procedure proposed to apply ML techniques in the field of
healthcare will help technologists, researchers, and clinicians to create more objective systems that
provide support to objectify the diagnosis, improve test treatment efficacy, and save resources.
Keywords: data mining; data anonymization; health; cervical injury; neck pain; inertial sensors
1. Introduction
In the field of healthcare, the exponential increase in the data that health centers must produce and
manage is significant. The need has arisen to develop procedures that make this process easier and that
take advantage of all the data generated [1], detecting unknown and valuable information in health
data [2]. Thus, the volume of data generated is such that its processing and analysis by traditional
methods is too complex and overwhelming [3]. To tackle this challenge, data mining (DM) can play a
key role, as it allows the discovery of patterns and trends in large amounts of complex data and the
extraction of hidden information to help in making decisions that can improve the quality of the care
processes [4–7]. It is closely linked with the scientific discipline in the field of artificial intelligence
called machine learning (ML), which “employs a variety of statistical, probabilistic and optimization
techniques that allow computers to learn from past examples and to detect hard-to-discern patterns
from large, noisy or complex data sets” [8].
Consequently, ML is generating growing interest in the field of healthcare (e.g., see [9,10]
for relevant special issues related to this topic), mainly derived from its possible applications, such as
assessing the effectiveness of treatments, detecting fraud and abuse in health insurance, managing
healthcare, making lower-cost medical solutions available to the patients, detecting symptoms and
diseases [11], discovering treatment patterns from electronic medical records [12], detecting groups
of incidents [13], and identifying medical treatment methods [2,3]. Likewise, ML also presents
health benefits: (1) a potential reduction in the time and effort required for diagnosis and treatment,
(2) the ability to examine multiple areas simultaneously, (3) a decreased potential for human error,
and (4) data that are accessible anytime and anywhere [14]. Besides, DM and ML are key in the path
towards personalized medicine, where the goal is to customize treatments to the specifics of each
individual [15–18].
However, to take full advantage of the benefits offered by ML in the field of healthcare,
several considerations are necessary. Among others, the following aspects can be highlighted:
• The data have to be structured and organized in order to properly process and transform them
into suitable variables, which is essential in the development of any pattern recognition software
and a highly problem-dependent task [19,20].
• Moreover, the secure treatment and management of data acquires special relevance in the field
of healthcare, where the privacy of the patient must be ensured. The management of sensitive
data contrasts with other fields of application of ML (fraud detection, stock prediction, etc.),
where anonymization treatments may sometimes not be necessary or critical. Therefore, a specific
treatment involving the anonymization and categorization of the data must be performed in order
to ensure the privacy of the patients [21]. Due to privacy policies, on certain occasions if a suitable
anonymization of the data is not performed and/or the required authorizations to access some
data are not obtained, the needed health studies cannot be carried out. Therefore, data availability
should also be considered as a key factor [21].
• Another difference with other ML applications is the existence of different costs of failures; in the
health area, the cost of a false negative (e.g., failing to detect that a patient has a specific disease) is
usually much higher than the cost of a false positive (e.g., if a person is initially considered to
have a disease that he/she does not really have, additional tests will be performed to rule this out,
which may be costly but usually less harmful than failing to diagnose an existing disease).
• It should also be considered that ML hardly ever follows a linear sequence ending at the first
attempt; instead, it is rather an iterative feedback process where the stages interact with each
other. Furthermore, in the field of healthcare, where the flow of data is continuous and constant,
it is reasonable to assume that the model can be designed to be a “learning” model that must be
continuously updated to improve its predictions over time. Therefore, the different stages needed
to generate a reliable predictive model should be properly structured, from the acquisition of raw
data to decision-making, which is essential to achieve the effectiveness of the model [22].
All this motivates the need to define the particularities of the application of ML techniques in the
field of healthcare, where different stages in the ML workflow must be correctly defined and structured.
The proper application of ML techniques would be beneficial for the clinicians, researchers, developers,
and designers involved in the field of health, where the management of information acquires a
transcendental role. It would favor the design of new products and services for improving healthcare
access [23], the creation of truly accessible technological solutions [24], and the enhancement of the relationship between
health systems and people by providing adequate services at the right time [23].
Based on the above, the aims of this study are the following: (1) to show and develop the
particularities of applying ML techniques in the field of healthcare, detailing all the stages that comprise
this process, from the acquisition of raw data to the decision-making derived from the predictive model
generated; and (2) to demonstrate and show its practical application in a real use case. Specifically,
the ML process is applied in a cervical pain assessment study with patients affected by whiplash
pathologies derived from traffic accidents or other causes. This case study shows the proposed
methodology in action to solve a specific relevant problem. Moreover, the applied procedure can be
used as an ML application guide for other similar studies in the field of healthcare. We believe that the
combination of the use case study and the machine learning methodology is a relevant contribution of
this paper. We do not remain in the theoretical/methodological part only or limit our work to apply
different machine learning algorithms and compare the results, as many other works do; instead,
we describe the whole machine learning process, highlighting the aspects that are more relevant for our
use case but at the same time providing a general framework that could be used in other health-related
projects. In this way, we think that the paper could be relevant both as a specific case study and also as
a reference and guideline for other similar projects.
The structure of the rest of this paper is as follows. In Section 2, we present the use case scenario
studied and the clinical methods used for data collection. In Section 3, we describe the proposed
procedure to develop predictive models for healthcare, illustrating each step with our work on the
case study. In Section 4, we present an overall discussion of the proposal and the lessons learnt.
Finally, in Section 5, we summarize our conclusions, the limitations of the study, and some ideas for
future work.
The medical inspection, assessment by scales/clinical tests, and development of the clinical profile
of the patients were conducted by clinicians. All the participants received information about the
experiment and signed a consent agreement prior to the testing. The study received a favorable verdict
from the Bioethics Committee of Aragón in Spain (CEICA) on 25 July 2017.
Figure 1. Move Human (MH)-sensor motion capture system, cervical assessment. (a) Back view. (b) Lateral view.
Figure 3. Application of predictive models for clinical decision-making. Icons made by monkik,
smashicons and mynamepong.
The complete ML process has been considered, with the particularities of its application in the
healthcare area. It is based on seven stages that range from the definition of the target to the clinical
use of the system, as shown in Figure 4. Each stage is explained in the following subsections. Besides,
this paper is accompanied by electronic Supplementary Material to facilitate the understanding of the
different stages of application in the case study considered; specifically, we provide sample data files
obtained at the end of different stages of the process (anonymized datasets) and an example report that
a health professional could obtain as the output of the process.
Figure 4. Project management procedure proposed for the application of machine learning in healthcare.
In order to adapt and particularize the usual process of applying ML techniques in the healthcare
field and develop the project management procedure, some outstanding studies such as [20,30], as well
as the most widespread DM processes, such as knowledge discovery in databases (KDD) [31,32];
sample, explore, modify, model, and assess (SEMMA) [14,32]; and the cross-industry standard process
for DM (CRISP-DM) [32–35], have been considered as a reference. Likewise, the project management
procedure scheme proposed (shown in Figure 4) has been inspired by different outlines of clinical
applications proposed by different authors [4,19,36–38] and adapted and extended according to our own
experience and the work performed with clinicians in our case study and other related collaborations,
such as the project “Mobile units for functional assessment of the musculoskeletal system” (CEICA
reference of the project: OTRI-2019/0108) in collaboration with the hospital MAZ (Mutua de Accidentes
de Zaragoza, Zaragoza, Spain), whose goal was to predict the degree of collaboration of patients in
insurance litigation.
From the related proposals mentioned above, the CRISP-DM process has been our main inspiration
to develop the project management procedure proposed in this paper. This is to be expected because
CRISP-DM sets a general common framework that can be adapted to different scenarios. Thus,
there are similarities between the six stages in CRISP-DM and our seven-stage proposal. For example,
the CRISP-DM stage 6 “deployment” is closely related to our last stage, which is “operating profit
and clinical use”. As another example, stage 6 of CRISP-DM establishes that the creation of the model
is not the end of the project and, similarly, in healthcare the knowledge provided to the clinician
through the application of the predictive model is not the end of the process, since the system is
continuously acquiring new data to improve the clinical performance. As the main difference between
both procedures, we put more emphasis on data management aspects, since this is a key point in
healthcare, and consider the whole process from the perspective of its application in a healthcare
scenario. While only one stage for data management is considered in the CRISP-DM process (stage
3, “data preparation”), data management is the focus of three stages in our proposal (stage 2 “data
acquisition and understanding”, stage 3 “data transformation”, and stage 4 “data selection”).
The works mentioned in this section have inspired our proposal, which extends existing models
by including a thorough analysis of all the data management challenges, as well as an illustration of
each step through a real practical case study. Although the particularities of applying ML techniques
in healthcare are exemplified in a specific case study, the procedure presented is flexible enough to
adapt to any healthcare case.
has been reported by the collaborating subjects, who were real patients undergoing an assessment
process in a hospital.
Predicting the presence of cervical pain is of interest especially in the forensic field, as the incidence
and prognosis of whiplash injury from motor vehicles is relevant to insurance litigations for pain and
suffering [29]. The aim is to determine the presence or absence of pain with enough confidence to be
able to aid clinicians to detect possible magnifications of the injury by the affected individuals and thus
establish an unbiased compensation for the cervical pain [40]. It can help to identify and objectify pain
in patients with a high degree of anxiety and in hypochondriac patients.
Without a loss of generality, the following target metrics have been determined for the purpose of
this study, whose required threshold values have been established according to the criteria of clinical
experts (for this particular case study, they considered that achieving these quality criteria would be
enough for the system to be used as a decision-support system in production):
• Accuracy: greater than 85%.
• Precision: greater than 85%.
• Recall: greater than 90%.
The sample size is 302; although this is not a very large dataset, it contains many variables
with relevant information to characterize the presence of cervical pain during insurance litigation,
which allows predicting the target variable, so the project was considered viable.
• Patient characterization data. They are generally static data (although there may be small
fluctuations in certain data values over large time intervals, such as in the case of the weight of
the patient). They can be grouped in the following categories:
• Measures: objective data that do not require assessment by a clinician. These data are not affected
by the reproducibility factor. Examples are test or clinical scales, test results, or data collected
by medical instrumentation. Data recorded by sensor devices provide objective data on some
measurable dimensions of the patient and can be of different types: motion capture (MoCap)
sensors, surface electromyography (EMG), stabilometric platforms, dynamometers, etc.
• Assessed data: information that depends on the assessment of the clinician, such as diagnoses
and treatments.
• Reported data: subjective information provided by the patient regarding his/her condition
(perceived symptoms).
• Clinical profile: data about symptoms and clinical signs of the patient that can lead to a diagnosis
by the clinician. We refer to symptoms when they are of subjective nature, reported by the patient,
and to signs if they are objective and obtained by the clinician about the pathology. In addition,
the signs can be qualitative (binary) or (discrete or continuous) quantitative (e.g., the temperature
of a thermometer, image tests, other measurements, etc.).
• Treatment data: data about the treatment that the clinician has prescribed, such as the type of
treatment, number of rehabilitation sessions, drugs received, surgery, etc.
• Clinical scales, tests, or surveys: data resulting from scales or validated and protocolized tests
whose objective is to obtain objective information about the patient (e.g., the timed up and go test,
the Unterberger test, psychological tests, etc.).
• Medical history: data concerning the patient’s clinical history, ordered chronologically
(e.g., first hospital visit, imaging test for diagnosis, treatment administration after diagnosis, etc.).
• Qualitative variables: also called categorical variables, they are variables that are not numerical.
They describe data that fit into categories (e.g., educational level, the level of development of a
disease, the level of invasiveness of a treatment, etc.).
• Quantitative variables: also called measurement or numerical variables, they represent quantities
of different nature. They can be divided into discrete variables that can only take a finite
number of values (e.g., the number of rehabilitation sessions, score on a clinical scale, age, etc.),
and continuous variables, which can take values in an infinite/continuous range of possible values
(e.g., the temperature of a thermometer, weight, Body Mass Index (BMI), etc.).
• Textual data: data that are directly collected in text format, such as handwritten annotations in
medical histories.
In this stage, data profiling [41] and cleaning [42] must be applied to detect potential problems
and, if possible, fix them. Categorizing the data, using one of the proposed classifications independently
(or another one that could be useful for a specific use case) or several of them at the same time,
facilitates the understanding and assimilation of the data available and of their scope.
This step must be carried out to obtain a preliminary database, containing the data collected, that will be
called the initial database. For this task, having the support of both a clinical expert and a technologist
is recommended. In some cases, when the amount of data to handle is large or coming from different
data sources, a data warehouse can be created to integrate all the information and allow the easy and
efficient analysis of the data stored [43,44].
During data collection for classification tasks, it is also important to collect enough
instances/samples to represent in an appropriate way the different classes that must be predicted.
In case there is imbalance regarding the number of samples in the different target classes, this should
be identified as part of the data profiling, and some strategies could be applied to deal with this
issue [45,46] (e.g., to try to prevent the majority class from dominating the predictions in a harmful way).
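As a minimal sketch of this profiling step only (the file name `encoded_database.csv` and the binary target column `pain` are hypothetical and not taken from the study's materials), the class balance can be inspected and, if needed, compensated through class weighting rather than by discarding data:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical encoded dataset with a binary target column "pain" (1 = pain, 0 = no pain)
df = pd.read_csv("encoded_database.csv")

# Profile the class balance before training any model
counts = df["pain"].value_counts()
print(counts / counts.sum())

# One possible mitigation: class weights, so that the majority class does not
# dominate weight-aware classifiers (e.g., SVM or logistic regression)
classes = np.unique(df["pain"])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=df["pain"])
print(dict(zip(classes, weights)))
```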
Application to Our Case Study
In this case study, the initial database was prepared according to the final goal, which was
predicting the presence/absence of cervical pain. The categorization of the data was jointly agreed
by physicians and technologists, considering the application of this research in the legal field and the
degree of objectivity of the data. According to the four classifications presented, the data from our case
study could be classified as shown in Table 1, following the criteria described above:
Table 1. Classification of the data collected in the case study according to the four classification approaches.

1. Clinical data and patient data:
- Patient characterization data: identification (name, age, gender, educational level, etc.); temporary (accident date, visits to the hospital, date of the range of motion (ROM) test, etc.); anthropometric (weight, height, body mass index, foot length, etc.); daily life activities (physical activity intensity, workplace, etc.).
- Clinical data: feeling of instability, surgery, all the data from the MoCap sensors, etc.

2. Data according to the degree of objectivity:
- Measures: all the data from the MoCap sensors or the Whiplash scale (WDQ).
- Assessed data: contracture, limitation of mobility, Jackson contraction, etc.
- Reported data: pain periodicity, feeling of instability, etc.

3. Data according to the clinical considerations:
- Clinical profile: symptoms (pain periodicity, feeling of instability, limitation of mobility, etc.) and signs (contracture, limitation of mobility, spinal column alterations, etc.).
- Treatment: n/a.
- Clinical scales or tests: WDQ.
- Medical history: accident date, visits to the hospital, etc.

4. Data according to their data types:
- Qualitative variables: educational level, limitation of mobility, contracture, etc.
- Quantitative variables: discrete (WDQ, age, etc.) and continuous (all the data from the MoCap sensors, weight, height, etc.).
- Textual data: n/a.
The classifications shown in Table 1 illustrate different useful perspectives of the data collected in
the case study. Specifically, and considering the final goal of objectively predicting the presence of
cervical pain, the classification that best suited the point of view of the health professionals participating
in our case study was the following (a combination of the first two classification approaches described):
• Patient characterization data: the data measured and reported, such as the identification
information (e.g., ID, age, gender, etc.), anthropometrics (e.g., height, weight, BMI, etc.), and data
relative to the activities of daily life (e.g., weekly physical activity and its intensity, workplace, etc.).
• Assessed and reported data: such as the characterization of a cervical accident (e.g., the time since the
accident, type of impact of the traffic accident, position of the head, type of effect, etc.), qualitative
signs (e.g., column alterations and contractures), clinical scales (e.g., whiplash scale—WDQ),
and symptoms (e.g., instability, limitation of mobility, etc.).
• Measured data: such as data from a cervical assessment test with MoCap or from each movement
studied, for example the angles reached and the speeds of the movements (e.g., maximum values,
minimum values, average values, etc.).
For the purposes of medical legal/forensic assessments, a classification according to the degree of
objectivity of the clinical data (assessed and reported) is of interest. Thanks to the understanding of an
expert technologist in insurance litigation, it has been possible to identify objective information in the
case study, such as the presence of a contracture, column alterations, the WDQ, the characterization of
the traffic accident (in case such an event had occurred), etc.
The initial database is presented as Supplementary Material File S1, which collects the dataset
that has been considered for the development of the model.
• Texts that become “positive or negative opinions”, concepts, dates, values, etc., through the
application of text mining techniques [53,54].
• Images of medical tests that are converted into binary variables related to a pathology or other
binary representations.
• Curves or other graphical representations from which information can be extracted as statistical
variables, such as the mean, standard deviation, skewness, quartiles, etc.
• Different imputation techniques [18,55–57] that can be applied in order to fill empty values,
either by means of interpolation or by using other procedures. Alternatively, some records may
need to be discarded (e.g., if several key data values are missing).
• The potential addition or creation of variables based on the knowledge of clinical experts and
technologists, either as a combination of existing variables or based on experience acquired in
similar studies [30].
• Data anonymization, applied in order to preserve the privacy of patients [58]. The trade-off
between privacy and information loss should be considered. A detailed analysis of data
anonymization techniques for health data is out of the scope of this work but, for illustration
purposes, some examples of transformations that can be applied to guarantee the privacy of
sensitive information are:
• Transformation of continuous variables (e.g., age, size, weight, income, etc.) into ordinal
variables by defining suitable ranges. Normalization techniques (e.g., min-max normalization) could
also be applied to map the quantitative variables into the range from 0 to 1, or standardization
(z-score scaling) to rescale them to zero mean and unit variance.
• Transformation of qualitative variables (e.g., diagnosis, treatment, education level, etc.),
that could be classified, according to a scale, into ordinal variables (e.g., severity of
diagnosis, treatment risks, range of studies ("high school", "BS", "MS", "PhD"), etc.).
In this way, these variables cannot be associated with a specific patient, thus preserving the
patient’s privacy.
• Transformation of qualitative variables (e.g., nationality, physical description, address, etc.),
that could not be classified according to a scale into groups (e.g., by continents or country
grouping, by groups according to a general description, by postal code, etc.) so that these
variables cannot be associated with a specific patient.
• The previous anonymization techniques are examples of the generalization of attribute values.
Other possible techniques and privacy-preservation methodologies include adding noise [59],
k-anonymity [60], differential privacy [61], etc. (a small sketch of a k-anonymity check is given after this list).
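As an illustration only (the study did not report performing this check; the file name and the quasi-identifier column names below are hypothetical), the k-anonymity level reached after this kind of generalization could be verified as follows, i.e., by measuring the size of the smallest group of records sharing the same quasi-identifier values:

```python
import pandas as pd

def k_anonymity_level(df, quasi_identifiers):
    """Smallest number of records sharing one combination of quasi-identifier
    values; the dataset is k-anonymous for any k up to this value."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical example: age, weight and height already generalized into tiers (0/1/2)
df = pd.read_csv("encoded_database.csv")
k = k_anonymity_level(df, ["age_tier", "weight_tier", "height_tier"])
print(f"The transformed dataset is {k}-anonymous for these quasi-identifiers")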
Another important aspect to mention here is the need for the normalization of quantitative
attributes; if we have quantitative variables with very different scales, the variables that can take larger
values could end up dominating others when learning a predictive model. This is unsuitable because
it mistakenly attributes more importance to those variables just because their usual values are higher
than the values of other variables.
After applying the specified transformations, the initial database is transformed to an encoded
(modified) database, which includes all the available information in the format of multiple variables.
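When quantitative variables with different scales are kept, a common remedy is min-max scaling. The following is a minimal sketch under assumed, hypothetical file and column names (the case study itself discretized all quantitative variables instead, so this step was not needed there):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("encoded_database.csv")                       # hypothetical file name
quantitative = ["max_angle_flexion", "mean_speed_rotation"]    # hypothetical columns

# Scale each quantitative variable to [0, 1] so that no variable dominates the
# model just because its raw values are larger; in a full pipeline the scaler
# should be fit on the training split only, to avoid information leakage
scaler = MinMaxScaler()
df[quantitative] = scaler.fit_transform(df[quantitative])
```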
Application to Our Case Study
In the case of the cervical pain assessment study, this process consisted of transforming the data
from the MoCap sensors into variables (i.e., mean, deviation, maximum, minimum, etc., of the range
of the movement) and transforming the clinical and characterization data of the patients into variables
considering sensitive variables and their anonymization. Some ranges of variables were associated
with a numeric coding to improve the anonymization process and the generation of the predictive
model. The process was carried out jointly by a clinical expert and a technologist. As in our case study
we have transformed all the quantitative variables into discrete variables, no additional normalization
was needed. The following are examples of the transformations performed, with the numeric coding used
indicated in brackets (a short code sketch illustrating this kind of binning follows the list):
• Age transformation in a three-tier classification: under 30 (0), between 30 and 55 (1), and over 55 (2).
• Transformation of the level of weekly exercise into a classification of three levels as a function of
the time required and the intensity of the exercise: slight (0), moderate (1), and intense (2).
• Transformation of the level of studies in a classification of three levels with their assigned numerical
coding: basic/high school (0), medium/bachelor (1), and superior/university studies (2).
• Weight transformation in a three-tier classification: less than 60 Kg (0), between 60 and 85 Kg (1),
and greater than 85 Kg (2).
• Height transformation in a three-level classification: less than 158 cm (0), between 158 and 185 cm
(1), and greater than 185 cm (2).
• Transformation of the body mass index (BMI) into a three-tier classification: under 24 (0), between 24
and 30 (1), and over 30 (2).
• Grouping of the cervical pain periodicity to create a variable of three levels: sporadic (0),
discontinuous (1), and frequent (2).
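A minimal sketch of this kind of binning, using the thresholds listed above, is given below; the file and column names are hypothetical, and the handling of values exactly on a boundary is an assumption, since the listed ranges leave it unspecified:

```python
import pandas as pd

df = pd.read_csv("initial_database.csv")   # hypothetical file name

# Age: under 30 (0), between 30 and 55 (1), over 55 (2)
df["age_tier"] = pd.cut(df["age"], bins=[0, 30, 55, 200], labels=[0, 1, 2], right=False)

# Weight: less than 60 kg (0), between 60 and 85 kg (1), greater than 85 kg (2)
df["weight_tier"] = pd.cut(df["weight_kg"], bins=[0, 60, 85, 400], labels=[0, 1, 2], right=False)

# BMI: under 24 (0), between 24 and 30 (1), over 30 (2)
df["bmi_tier"] = pd.cut(df["bmi"], bins=[0, 24, 30, 100], labels=[0, 1, 2], right=False)

# The remaining tiers (height, exercise level, studies, pain periodicity) follow the
# same pattern; the original quantitative columns can then be dropped to complete
# the anonymization of these attributes
df = df.drop(columns=["age", "weight_kg", "bmi"])
```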
The completeness and quality of the data recorded is also a key aspect in any ML pipeline,
and particularly in the health area. In our case study, some variables were initially incomplete, due to
the lack of collected data from certain patients (five patients). These incomplete data were related to
specific features (e.g., the workplace of the patient, his/her age, the dominant side of his/her body,
the date of the traffic accident, etc.), and were later collected by a telephone call.
The encoded database is presented as Supplementary Material File S2, which collects the variables
obtained after the transformations carried out on the initial database and highlights the
anonymized variables (see Variable_View).
1. The removal of variables related to personal and ethical data not anonymized previously and
with no predictive power. The variables name, surname, telephone number, address, and email
were removed in this filtering.
2. Manual filtering performed by physicians and technologists, corresponding to non-objective
or inappropriate variables in the medical legal/forensic field (e.g., the sensation of instability,
mobility limitation, pain periodicity, etc.), as well as variables with missing data that could not be
completed (e.g., the position of the head and type of effect in case of a traffic accident, intensity of
work, etc.). In this filtering, 74 variables were removed according to the criteria indicated.
3. Finally, a filtering was applied based on the gain ratio of the variables. We used the IBM SPSS
modeler software [62] (v. 18), discarding 123 variables with low predictive power (we selected
those with a gain ratio higher than 0.95 out of 1). Variables such as the average angle, standard
deviation, complementary angle, weight, height, etc., were selected. The selection of these
28 variables is consistent with the target variable (the presence or absence of pain) and its
associated requirements, since it is desirable for clinicians that these variables are
objective, represent the main cervical movements, and correspond to clinical data objectively
acquired by the clinicians (an analogous filtering step is sketched in code after this list).
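The filtering itself was performed with the gain-ratio criterion of IBM SPSS Modeler. As a hedged, analogous sketch only (mutual information is a related but not identical criterion, and the file and column names are hypothetical), the ranking-and-selection step could be expressed as:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("encoded_database.csv")   # hypothetical file name
y = df["pain"]
X = df.drop(columns=["pain"])

# Rank variables by mutual information with the target (all features are discrete here)
mi = pd.Series(mutual_info_classif(X, y, discrete_features=True, random_state=0),
               index=X.columns).sort_values(ascending=False)

# Keep, for example, the 28 best-ranked variables
selected = mi.head(28).index.tolist()
print(selected)
```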
Table 2. Final variables considered in the case study after feature selection.
- Patient characterization data: Gender; Age; Educational level.
- Clinical data (assessed and reported data): Contracture; Spinal column alterations; Traffic accident; WDQ.
- Clinical data (data measured with sensors):
  - Mean speed [°/s] in: Flex.-Ext., Rotation, Lateralization.
  - Max speed [°/s] in: Flexion, Extension, Right rotation, Left rotation, Right lateral, Left lateral.
  - Max angle [°] in: Flexion, Extension, Right rotation, Left rotation, Right lateral, Left lateral.
  - Total range [°] in: Flex.-Ext., Rotation, Lateralization.
  - Total length [°] in: Flex.-Ext., Rotation, Lateralization.
Table 2 shows the 28 final variables of the filtered database; the detailed information of each variable,
with the different associated values for the different data instances, is included as Supplementary
Material File S3.
validation). Furthermore, a validation set could be used (besides the “test set” and the “training set”),
which is a set of examples used to tune the parameters of a classifier [67]. This is useful because the
performance of the selected algorithms can be improved through parameter tuning, which consists of
varying values of parameters of the algorithms in order to find the most suitable configuration for each
of them. Once the suitable parameter configuration for each algorithm is selected, the performance of
the different algorithms can be compared.
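As a hedged sketch of this split-and-tune workflow (the study itself used Weka; the file and column names below are hypothetical, and the SVM hyperparameter grid is shown only as an example):

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

df = pd.read_csv("filtered_database.csv")   # hypothetical file with the selected variables
y = df["pain"]
X = df.drop(columns=["pain"])

# Hold out a test set; the remaining data are used for training and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

# Parameter tuning: each candidate configuration is scored by cross-validation
# on the training data, which here plays the role of the validation set
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    cv=10, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print("held-out test accuracy:", grid.score(X_test, y_test))
```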
It must be stressed that the most appropriate prediction method depends on the data. Besides,
overfitting (a phenomenon that occurs when the model adjusts too strongly to the training data
and, as a consequence, has difficulties drawing suitable conclusions about unobserved
data) should be avoided, as this would lead to a model that will only be able to make predictions
for the data with which it has been trained [52]. There are several techniques to deal with this,
such as regularization (smoothing the models), data augmentation (increasing the training data),
early stopping, etc. As an example, early stopping implies that the training process of the model must
be stopped before overfitting [30,68,69], that is, before the model adjusts too much to the training data
(i.e., before the performance of the model gets worse on the validation data).
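A minimal sketch of early stopping with a multilayer perceptron follows, using scikit-learn as an illustrative stand-in for the tools used in the study; the file and column names and the chosen parameters are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("filtered_database.csv")             # hypothetical file name
X, y = df.drop(columns=["pain"]), df["pain"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Early stopping: 10% of the training data is held out internally as a validation
# split, and training halts once the validation score stops improving
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=10, random_state=0)
mlp.fit(X_train, y_train)
```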
Over-parametrization is a recurrent problem in the application of ML techniques in healthcare,
where there are cases where the ratio between variables and data is very high and overfitting effects
can occur. The opposite can also happen when the volume of data is very high, but the number of
variables is excessively reduced and/or has little predictive power. Therefore, if needed, depending on
the results obtained we could come back to Stage 4 “Data Selection” to reduce the number of variables.
As a guideline, to avoid over-parametrization the 1 to 10 ratio between variables (attributes or features)
and data (number of instances) should not be exceeded [20,52]. If this ratio were exceeded,
a more exhaustive screening of variables would have to be carried out or a larger
sample would have to be obtained.
will lead to not solving the problem or not achieving the proposed objective because the model will
only be able to classify the training data correctly [70].
To select the most suitable algorithm for the project, it is necessary to follow an organization
strategy, storing each previous version and modification of the project [52]. In this way, the results are
contrasted (training data, validation, errors, etc.). After the selection of the most effective algorithm,
we will be able to generate the definitive predictive model that incorporates the already-existing filtered
database. If the predictive model were evaluated positively, a continuous learning process could begin
by receiving periodic information from the assessment tests performed on future patients, as shown
in Figure 3.
Application to Our Case Study
In view of the target of our clinical study to classify the presence of cervical pain, only supervised
learning algorithms must be selected, dismissing unsupervised learning algorithms. The algorithms
selected in our study were logistic regression [71], decision trees [72], random forests [73], SVM [73],
neural networks (MLP neural networks) [74], k-Nearest Neighbors (KNN) [73], and Gradient Boosting
Algorithm (GBA) [75]. All these are popular supervised machine learning approaches. The main
parameters selected to perform the parameter tuning of those algorithms are shown in Table 3. In our
experimental evaluation, we combined the parameters of each algorithm presented in Table 3 using the
software tool Weka [76] (v. 3.8); specifically, for the parameter tuning we used the Weka Experimenter
user interface.
In our case study, the predictive models generated by applying the previously selected algorithms
are shown in Table 3. As mentioned previously, the DM software used in this study was Weka (v. 3.8).
Considering the current availability of data for our case study, the performance of each parameter
configuration of each algorithm was tested using a validation set which is the same as the test set;
using the same set for validation and testing is not the ideal situation, but we decided not to reserve a
part of the available dataset for validation because of the moderate size of our sample. We considered
the accuracy metric (i.e., the percentage of instances correctly classified) for determining the most
suitable algorithm configuration. Figure 5 shows the performance of the algorithms obtained during
parameter tuning using a k-fold cross validation (k = 10).
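The comparison was run with the Weka Experimenter; an analogous sketch with scikit-learn counterparts of the seven algorithms could look as follows (default hyperparameters are shown only for illustration, and the file and column names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("filtered_database.csv")   # hypothetical file name
X, y = df.drop(columns=["pain"]), df["pain"]

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "MLP neural network": MLPClassifier(max_iter=1000, random_state=0),
    "kNN": KNeighborsClassifier(),
    "GBA": GradientBoostingClassifier(random_state=0),
}

# 10-fold cross-validation accuracy for each algorithm, mirroring the k = 10 setting
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```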
Consequently, the parameter configuration selected for each algorithm is shown in the first row of
Table 4. The effectiveness of the models generated by k-fold cross validation (k = 10) was evaluated.
The following metrics were computed in order to determine the most suitable algorithms (see Table 4):
the accuracy (the percentage of instances correctly classified), the precision (the percentage of patients
with pain correctly classified over all the patients labelled by the algorithms as patients with pain),
the recall/sensitivity (the percentage of patients with pain correctly classified over all the patients with
real pain), the specificity (the percentage of healthy patients correctly classified over all the patients
who are really healthy), and the F1-score (the harmonic average of the precision and recall).
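For reference, these metrics can be computed directly from the confusion-matrix counts; a minimal sketch with small hypothetical label vectors is:

```python
from sklearn.metrics import confusion_matrix

# y_true, y_pred: true and predicted pain labels (1 = pain, 0 = no pain); small
# hypothetical vectors are used here in place of cross-validated predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # instances correctly classified
precision   = tp / (tp + fp)                    # correct among those predicted as "pain"
recall      = tp / (tp + fn)                    # correct among real "pain" (sensitivity)
specificity = tn / (tn + fp)                    # correct among the really healthy
f1_score    = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1_score)
```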
Using the software Weka, which provides the attribute weights of each model in the results display
section, we have noticed that the variables with the greatest predictive power in all the predictive models
evaluated are the following: the maximum speed in all the movements, the existence of a traffic accident,
and the presence of a contracture. In our case study, there is no indication of over-parametrization;
as described in Section 3.4, we have a sample of 302 instances and 28 selected variables, which complies
with the 1 to 10 ratio between the variables and data.
Table 4. Metrics considered for algorithm selection and the results obtained for each algorithm.
For effective deployment, not only the accuracy of the models but also the resources must be
considered. For example, the scalability requirements are key when there are high volumes of data,
to avoid very long training times, lack of memory, etc. This is important in terms of productivity,
portability, cost reduction, the minimization of staff involvement, etc. In cases where large volumes of
data have been collected for many years (cancer studies, studies of discharge from the hospital and
sick leaves, etc.), scalability acquires a transcendental role [30].
Application to Our Case Study
Based on the results collected in Table 4, a quite good performance of all the algorithms can be
observed for most metrics. Considering the large number of variables available initially, the rigor with
which all the data were obtained by the multidisciplinary team, and the adequate selection of variables
carried out in previous phases (choosing the objective variables and with a greater predictive power),
a suitable prediction performance was initially expected.
Considering the minimal target values of the statistical measures established in Stage 1 “Clinical
target definition”, several algorithms fulfil the requirements in terms of the target metrics:
• Accuracy (>85%): logistic regression, SVM, random forest, MLP neural network, and GBA.
• Precision (>85%): logistic regression, SVM, decision tree, random forest, MLP neural network,
and GBA.
• Recall (>90%): SVM, random forest, MLP neural network, kNN, and GBA.
According to the results, the four algorithms that fulfil the requirements are SVM, random forest,
the MLP neural network, and GBA. From all these algorithms, SVM achieves the best results in all the
metrics considered, and therefore it could be the model selected for production.
Table 5. Key aspects of applying machine learning in healthcare: category, key aspect, description, application in the case study, and sample references.

Category: Clinical target
- Proper selection. Description: clinical target definition according to the aims and clinical needs; this facilitates the subsequent selection of data. Case study: presence of cervical pain; only collaborating subjects and objective variables were selected. Sample references: [34,40].
- Definition of statistical measures. Description: minimum metrics to be fulfilled by the model according to the clinical target; metrics are checked in stage 6 (interpretation of results) after the predictive model generation. Case study: performance required: accuracy greater than 85%, precision greater than 85%, recall greater than 90%. Sample references: [39].

Category: Data
- Identification and understanding. Description: diversity of the origins of data in healthcare (regarding typology, consistency, and veracity) and need to correctly understand the data; a health expert is required in this stage. Case study: prior to the field work, relevant clinical information was identified and the tests to perform were determined. Sample references: [34].
- Clear and concise structure. Description: categorization of data using appropriate variables/features, applying classifications motivated by medical needs; this is essential to carry out an adequate analysis of the information. Case study: data classification motivated by clinical staff and forensic experts: patient characterization data, assessed and reported clinical data, measured clinical data. Sample references: [30].
- Transformations in healthcare. Description: feature engineering; reducing the raw data to be handled and adapting them to the required format in order to increase the predictive power. Case study: variables such as the age, level of studies, weight, height, etc., were transformed into discrete variables; ranges of variables were associated with a numeric code. Sample references: [5,19,47,52].
- Anonymization. Description: preservation of sensitive patient data by transforming values of data variables into scales or groups, thus avoiding patient identification; this is a key aspect in healthcare data management. Case study: no quantitative variables that could be associated with patients remained after the anonymization process (through transformation into discrete variables and the removal of identifying attributes). Sample references: [58].
- Selection. Description: after data transformation, the selection of variables applying a filter according to the target: ethical and legal issues, manual selection, automated attribute selection. Case study: the volume of data in the case study was reduced from 230 variables to 28 after applying the three successive aforementioned filters. Sample references: [19,21].
- Normalization. Description: normalization of quantitative attributes, avoiding situations where variables that can take larger values could end up dominating others. Case study: no quantitative variables remained after anonymization. Sample references: [55–57].
- Completeness. Description: completeness and quality of the data recorded as a key aspect in the health area. Case study: incomplete data from 5 patients related to specific features were collected through a telephone call. Sample references: [81].
- Over-parametrisation. Description: need not to exceed the 1 to 10 ratio between variables and data to avoid overfitting; the dimensionality is an issue in studies with a high number of variables compared to the total number of samples. Case study: this was a real issue in our case study because of the volume of data provided by sensors; a sample of 302 instances and 28 variables was finally selected, which complies with the 1 to 10 ratio. Sample references: [20].
- Scalability. Description: support for handling large amounts of data (efficient and effective collection, storage, management, and exploitation); depending on the project duration, and especially if it is intended to have an adaptive character (projects with data collected for many years), scalability is a key issue to consider. Case study: the current project is still in an initial stage, with no large-scale deployment; no scalability problems have been detected. Sample references: [30].

Category: Predictive model
- High recall and relatively high precision. Description: minimization of the number of false negatives (increasing recall); this is a key goal in healthcare, since false negatives and false positives do not have similar costs in this area; the precision should also be suitable, as a high number of false positives would lead to false alarms, the performance of needless procedures, and increasing costs and discomfort for the patients. Case study: the selected algorithms (SVM, random forest, MLP neural network, and GBA) have recall >90%. Sample references: [39].

Category: Project work procedure
- Multidisciplinary work as a key point. Description: composing teams involving technical people and diverse health professionals; this is required, but not always possible; insufficient collaboration could be diminished by applying the stages assigned to each of the professionals in a concise and structured way. Case study: there was interaction between professionals in almost all the stages; nevertheless, more interventions could be encouraged because clinical experts were not present in stage 5. Sample references: [34,52].
- Continuous data collection. Description: improvement of the model performance thanks to a continuous learning process. Case study: there is an intention to improve the current system by incorporating data of new collaborating patients. Sample references: [30].
- Sensors/devices in healthcare. Description: complementary objective tests to help physicians. Case study: MoCap; we expect the applicability of the proposal in the forensic field as an objective system of application to aid in judicial processes. Sample references: [19,82].
Data structuring. A clear and concise structuring of the data is essential to carry out an adequate
analysis of the information as well as to make this information really useful for the purposes of the
predictive model. In our case study, and prior to the field work, the relevant clinical information
was identified and the tests to perform were determined (cervical ROM in three different planes with
a MoCap system of inertial sensors), so that its processing and subsequent structuring were easier.
It is essential to accurately structure the available information in a suitable way (using appropriate
variables/features) when working with large volumes of data, and to classify the different variables
in different groups (using appropriate categories) to facilitate access (for all the parties involved) in
a more effective and useful way. Data management is so important that the preparation of the data
covers three of the seven stages of the methodology (stages 2, 3, and 4).
Selection of variables. Likewise, the selection of the most adequate information to predict a
certain characteristic, as well as its transformation in terms of variables, is essential. In relation to this
adaptation, sensitive patient data must be anonymized for their use in the generation of a predictive
model [19,21]. Although converting continuous predictors to discrete variables (specifically binary
variables) is not always recommended [20], the data transformation stage in the field of health is
conditioned by the transformations required for privacy reasons. The reduction in the volume of data
in the case study was from 230 variables to 28 due to the large number of variables provided by the
inertial sensors and the clinical data. The variable reduction applied (based on the gain ratio of the
variables) after the corresponding selection of the data for ethical and legal issues and the screening
made by an expert was necessary to fulfil the 1 to 10 ratio between the variables and data. This key
point allowed us to ensure that the predictive models work properly, maximizing the predictive power
and avoiding overfitting.
Selection of the predictive model. Regarding the selection of the predictive model, after performing
a parameter tuning of the seven selected algorithms and comparing the most suitable configuration of
each of them (see Figure 5), it was concluded that the models that meet the established requirements
regarding accuracy, precision, and recall for the case study were SVM, random forest, MLP neural
networks, and GBA. Our results highlight the low number of false negatives achieved (high recall),
a fundamental aspect in healthcare studies [39]. These results are in agreement with other investigations
of a similar nature, using the same software (Weka), in terms of the accuracy, precision, recall, and F1
score with the SVM and random forest algorithms [36].
Variability of the measures acquired. It has been detected that it is possible to assess the measurement
capacity of our medical equipment in terms of the variability of the measures that it obtains. A series
variable identifies whether the data are relative to the first measure of each subject or to the second
(which was performed consecutively). The results obtained by the predictive models showed that there
were no differences between the two series of cervical ROM (this variable had the lowest predictive
power among all the variables introduced in the model). This result indicates that the measure has
behaved stably in collaborating subjects. This result is interesting in the forensic field due to the
following reason. If repeating the cervical ROM test in a patient results in significant differences,
they would not be derived from the variability of the test, but by the type of pathology that prevents
the patient from repeating the test normally. Alternatively, the patient may try to simulate or magnify
the lesion by not showing consistent results between the first and second series. This aspect would be
of relevance to judicial experts [29,83].
Data collection in production. Once the system is applied in its context and has been developed,
continuous data collection must be planned in production in order to improve the prediction accuracy,
resulting in a continuous learning system (Figure 3). In the case study, to increase the sample in the
exploitation stage of the model it is possible to include those patients who perform cervical assessment
tests in a care or rehabilitation setting; thus, their sincerity can be assumed regarding the degree of
cervical pain as well as full collaboration in the performance of the tests. However, for the collection
of training data it may be necessary to exclude those patients who are immersed in a judicial and
indemnifying process and who report cervical pain because their degree of collaboration or sincerity
is unknown. In the future, the system can be used to predict the cervical pain of non-collaborating
patients (e.g., patients in a judicial process or patients with a high degree of anxiety or hypochondriacs)
from the predictive model previously generated with collaborating patients, serving as objective
evidence in judicial proceedings with insurance companies [40].
Multidisciplinarity. For a correct interpretation of the results, multidisciplinary work is a key
point, since the contribution of each of the branches of knowledge is necessary in this type of project
to optimize the possibilities offered by the model. In this way, it will be possible to assess the
statistical quality of the results and their medical utility. For example, in our case study questions were
raised regarding the way the data and the results should be represented (solved with the databases
defined and the design of a report for physicians, presented as Supplementary Material), the possible
interpretation and use of the system by the clinician (problems could have arisen if we had not
worked in close collaboration with the clinicians; however, the target variable and how to present
the results were clarified from the beginning of the project), and the overlap with other possible
decision-support systems (if other systems could also provide in the future a prediction indicating that
a patient suffers cervical pain, both results would be presented to the clinician and he/she would take
the final decision). Through collaboration with the medical experts, these issues have been solved.
However, multidisciplinary work is not always possible, since professionals participate in different
stages of the entire process according to their degree of knowledge, experience, and training, so in some
cases there may be no direct or sufficient interaction between them [34]. This lack of interaction among
professionals could be diminished by following the stages assigned to each of the professionals in a
concise and structured way, thus avoiding problems that may lead to project failure [52]. In the case
study, the rigor followed by the different professionals involved in the different stages has resulted
in adequate results. Nevertheless, although there has been interaction between them, it could have
been more collaborative, since the clinical experts were not present during the generation of the
predictive model and the interpretation of the results; their specific medical knowledge could have
further improved the quality of the study.
Exploitation of sensor data. The use of sensors and devices combined with ML could be implemented
as a complementary objective test to help physicians. This type of test could constitute an aid to
decision-making in diagnosis or treatment, or when there is doubt about the veracity of the information
reported by the patient [19,82]. Although we wanted to show the clinical utility of this type of
technology, the lack of studies on the application of ML techniques with motion capture sensors in
healthcare, and specifically their applicability in the forensic field as an objective system of application
in judicial processes, has further motivated the choice of the case study. Therefore, while our case
study focuses on the medical legal/forensic field, the procedure proposed to use ML techniques could
be applied in any study of the health field (cancer detection, studies of discharge from the hospital and
sick leave, etc.).
Use of resources. The frequency and regularity of data acquisition should be estimated in
advance to limit the duration of the project [34], as should the necessary
storage size, which is a factor with high variability between projects. If the project is intended to have
an adaptive character that can be applied or expanded for subsequent research, the scalability and
magnitude should be considered. If the scalability of the project is not considered, and the intention is
to continue acquiring data and adapting the model to the continuous growth of information, there may
come a time when the project is no longer viable because it is unable to assimilate the corresponding
increase in data size. This situation is common in epidemiological projects that collect data over long
periods of time, where scalability is crucial for the future of the project [30]. On the other
hand, it is important to consider that, in certain projects such as the case study presented in this paper,
the ratio between available variables and data can be high, so keeping at least ten observations per
variable (a 1-to-10 ratio) is essential to avoid overfitting effects [20]; a minimal check of this rule of
thumb is sketched below.
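As an illustration of this guideline (not part of the case study), a minimal Python check might look as follows; the function name and the figures used are hypothetical.

```python
def check_sample_to_variable_ratio(n_samples, n_variables, min_ratio=10):
    """Warn when there are fewer than min_ratio observations per variable,
    a common rule of thumb to limit the risk of overfitting."""
    ratio = n_samples / n_variables
    if ratio < min_ratio:
        print(f"Warning: only {ratio:.1f} observations per variable "
              f"(rule of thumb: at least {min_ratio}).")
    return ratio

# Hypothetical figures: 150 records described by 40 variables
check_sample_to_variable_ratio(150, 40)
```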
Despite the great usefulness of ML in the field of healthcare, some limitations have been detected
in this field. First, a major limitation is how to achieve a suitable flow of data from health centers and
hospitals, as well as the accessibility (in relation to privacy policies and authorizations) [21], gathering,
and integration of the data [3]. If a global information collaboration policy were established between
hospitals [6,69,88], the problem of access to information could be solved, and it would be possible
to share more data and feed the predictive models applied to the field of healthcare more efficiently.
The explainability of predictions [89,90] is an important issue in the healthcare domain, as it could
increase the trust in the ML systems (for both clinicians and patients) and even lead to the acquisition
of new knowledge; however, more research on how to achieve explainability while considering the
potential trade-off with accuracy must be performed, especially in the health domain.
Regarding the limitations of the conclusions obtained with this case study, once the model is in
the exploitation stage it would be advisable to carry out an external validation to verify the viability of
the model in terms of geography, temporality, etc. [30,77]. Regarding the data sample used in our case
study, its size has been large enough to obtain good results, but it would be relevant to see the impact
of increasing it, as new data about patients becomes available, to enable a continuous learning process
that could lead to better results over time. In addition, a study is currently being conducted to check the
accuracy of the proposed models with a sample of non-collaborating patients.
Concerning the target variable (presence of cervical pain), which is a binary variable, it could
be defined in a more granular way considering not only the presence of pain but also its intensity
(as a continuous variable or as a discrete variable with several pain degrees). The problem with pain
intensity is that pain scales are highly dependent on the subjectivity of the patient, and this issue could
be further exacerbated with non-collaborating subjects. However, as a future goal of our research,
it could be useful to tackle this issue and introduce some statistical techniques, such as the numerical
measurement z-score, to normalize the subjective values from pain intensity scales (e.g., the Visual
Analogue Scale) provided by the patients. The z-score could help to reduce the bias of patients,
allowing us to include pain intensity as a target variable in our proposal.
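A minimal sketch of this idea, assuming that the reference mean and standard deviation come from collaborating patients, is given below; the function name and the VAS values are hypothetical and do not come from the case study.

```python
import numpy as np

def vas_z_scores(vas_scores, reference_scores):
    """Express Visual Analogue Scale (VAS) ratings as z-scores relative to a
    reference group (e.g., collaborating patients)."""
    reference = np.asarray(reference_scores, dtype=float)
    mu, sigma = reference.mean(), reference.std(ddof=1)
    return (np.asarray(vas_scores, dtype=float) - mu) / sigma

# Hypothetical VAS ratings (0-10) from a reference group and two new patients
reference = [2.0, 3.5, 4.0, 5.5, 6.0, 3.0, 4.5]
print(vas_z_scores([8.0, 4.0], reference))
```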
Finally, as future work, it could also be interesting to extend the range of experiments performed
and analyze the potential interest of other ML methods; for example, we could apply
different classifiers over the different categories of data proposed in the paper and combine them
into an ensemble.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Kayyali, B.; Knott, D.; Van Kuiken, S. The Big-Data Revolution in US Health Care: Accelerating Value and
Innovation; Mc Kinsey Co.: New York, NY, USA, 2013; Volume 2, pp. 1–13.
2. Tomar, D.; Agarwal, S. A survey on Data Mining approaches for Healthcare. Int. J. Bio-Sci. Bio-Technol. 2013,
5, 241–266. [CrossRef]
3. Koh, H.C.; Tan, G. Data mining applications in healthcare. J. Healthc. Inf. Manag. 2011, 19, 65.
4. Maity, N.G.; Das, S. Machine learning for improved diagnosis and prognosis in healthcare. In Proceedings of
the 2017 IEEE Aerospace Conference, Big Sky, MT, USA, 4–11 March 2017; pp. 1–9.
5. Yoo, I.; Alafaireet, P.; Marinov, M.; Pena-Hernandez, K.; Gopidi, R.; Chang, J.-F.; Hua, L. Data Mining in
Healthcare and Biomedicine: A Survey of the Literature. J. Med. Syst. 2012, 36, 2431–2448. [CrossRef]
6. Sen, I.; Khandelwal, K. Data Mining in Healthcare. 2018. Available online: https://www.researchgate.net/
publication/322754945_DATA_MINING_IN_HEALTHCARE (accessed on 26 August 2020).
7. Clavel, D.; Mahulea, C.; Albareda, J.; Silva, M. A Decision Support System for Elective Surgery Scheduling
under Uncertain Durations. Appl. Sci. 2020, 10, 1937. [CrossRef]
8. Cruz, J.A.; Wishart, D.S. Applications of Machine Learning in Cancer Prediction and Prognosis. Cancer Inform.
2006, 2, 59–77. [CrossRef]
9. Wang, F.; Stiglic, G.; Obradovic, Z.; Davidson, I. Guest editorial: Special issue on data mining for medicine
and healthcare. Data Min. Knowl. Discov. 2015, 29, 867–870. [CrossRef]
10. Rosales, R.E.; Rao, R.B. Guest Editorial: Special Issue on impacting patient care by mining medical data.
Data Min. Knowl. Discov. 2010, 20, 325–327. [CrossRef]
11. Alotaibi, S.; Mehmood, R.; Katib, I.; Rana, O.; Albeshri, A. Sehaa: A Big Data Analytics Tool for Healthcare
Symptoms and Diseases Detection Using Twitter, Apache Spark, and Machine Learning. Appl. Sci. 2020,
10, 1398. [CrossRef]
12. Huang, Z.; Dong, W.; Bath, P.; Ji, L.; Duan, H. On mining latent treatment patterns from electronic medical
records. Data Min. Knowl. Discov. 2015, 29, 914–949. [CrossRef]
13. Bentham, J.; Hand, D.J. Data mining from a patient safety database: The lessons learned. Data Min. Knowl.
Discov. 2012, 24, 195–217. [CrossRef]
14. Obenshain, M.K. Application of Data Mining Techniques to Healthcare Data. Infect. Control. Hosp. Epidemiol.
2004, 25, 690–695. [CrossRef] [PubMed]
15. Zhang, P.; Wang, F.; Hu, J.; Sorrentino, R. Towards Personalized Medicine: Leveraging Patient Similarity and
Drug Similarity Analytics. AMIA Summits Transl. Sci. Proc. 2014, 2014, 132. [PubMed]
16. Hamet, P.; Tremblay, J. Artificial intelligence in medicine. Metab. Clin. Exp. 2017, 69, S36–S40. [CrossRef]
[PubMed]
17. Joyner, M.J.; Paneth, N. Seven Questions for Personalized Medicine. JAMA 2015, 314, 999–1000. [CrossRef]
18. Weiss, J.C.; Natarajan, S.; Peissig, P.L.; McCarty, C.A.; Page, D. Machine Learning for Personalized Medicine:
Predicting Primary Myocardial Infarction from Electronic Health Records. AI Mag. 2012, 33, 33–45. [CrossRef]
19. Mannini, A.; Sabatini, A.M. Machine Learning Methods for Classifying Human Physical Activity from
On-Body Accelerometers. Sensors 2010, 10, 1154–1175. [CrossRef]
20. Moons, K.; Kengne, A.P.; Woodward, M.; Royston, P.; Vergouwe, Y.; Altman, U.G.; Grobbee, D.E. Risk
prediction models: I. Development, internal validation, and assessing the incremental value of a new
(bio)marker. Heart 2012, 98, 683–690. [CrossRef]
21. Wilkowska, W.; Ziefle, M. Privacy and data security in E-health: Requirements from the user’s perspective.
Heal. Inform. J. 2012, 18, 191–201. [CrossRef]
22. Dolley, S. Big Data Solution to Harnessing Unstructured Data in Healthcare. IBM Report. 2015.
Available online: https://assets.sourcemedia.com/31/a6/cb1b019c4d6cb338fab539eea360/ims14428usen.pdf.
(accessed on 26 August 2020).
23. Andersen, R.M.; Davidson, P.L.; Baumeister, S.E. Improving access to care. In Changing the US Health Care
System: Key Issues in Health Services Policy and Management; John Wiley & Sons: Hoboken, NJ, USA, 2013;
pp. 33–70.
24. Marin, J.; Blanco, T.; Marin, J.J. Research Lines to Improve Access to Health Instrumentation Design.
Procedia Comput. Sci. 2017, 113, 641–646. [CrossRef]
25. Cassidy, J.D.; Carroll, L.; Côté, P.; Lemstra, M.; Berglund, A.; Nygren, Å. Effect of Eliminating Compensation
for Pain and Suffering on the Outcome of Insurance Claims for Whiplash Injury. N. Engl. J. Med. 2000, 342,
1179–1186. [CrossRef]
26. Moreno, A.J.; Utrilla, G.; Marin, J.; Marin, J.J.; Sanchez-Valverde, M.B.; Royo, A.C. Cervical Spine Assessment
with Motion Capture and Passive Mobilization. J. Chiropr. Med. 2018, 17, 167–181. [CrossRef] [PubMed]
27. Utrilla, G.; Marín, J.J.; Sanchez-Valverde, B.; Gomez, V.; JAuria, J.M.; Marin, J.; Royo, C. Cervical Mobility
Testing in Flexion-Extension and Protraction-Retraction to Evaluate Whiplash Syndrome Through Motion Capture;
Universidad de Zaragoza: Zaragoza, Spain, 2017.
28. Marín, J.J.; Pina, M.J.B.; Gil, C.B. Evaluación de Riesgos de Manipulación Repetitiva a Alta Frecuencia Basada
en Análisis de Esfuerzos Dinámicos en las Articulaciones sobre Modelos Humanos Digitales. Cienc. Trab.
2013, 15, 86–93. [CrossRef]
29. Marín, J.; Boné, M.; Ros, R.; Martínez, M. Move-Human Sensors: Sistema portátil de captura de movimiento
humano basado en sensores inerciales, para el análisis de lesiones musculoesqueléticas y utilizable en
entornos reales. In Proceedings of the Sixth International Conference on Occupational Risk Prevention,
Galicia, Spain, 14–16 May 2008.
30. Steyerberg, E.W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating;
Springer: Berlin, Germany, 2009.
31. Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. Advances in Knowledge Discovery and Data
Mining; AAAI Press: Menlo Park, CA, USA, 1996.
32. Azevedo, A.I.R.L.; Santos, M.F. KDD, SEMMA and CRISP-DM: A Parallel Overview.
ISCAP—Informática—Comunicações em Eventos Científicos. 2008. Available online:
https://recipp.ipp.pt/handle/10400.22/136 (accessed on 26 August 2020).
33. Chapman, P.; Clinton, J.; Kerber, R.; Khabaza, T.; Reinartz, T.; Shearer, C.; Wirth, R. CRISP-DM 1.0 Step-by-Step
Data Mining Guide; SPSS Inc.: Chicago, IL, USA, 2000.
34. McGregor, C.; Catley, C.; James, A. A process mining driven framework for clinical guideline improvement
in critical care. In Proceedings of the Learning from Medical Data Streams Workshop, Bled, Slovenia,
6 July 2011.
35. Catley, C.; Smith, K.; McGregor, C.; Tracy, M. Extending CRISP-DM to incorporate temporal data mining of
multidimensional medical data streams: A neonatal intensive care unit case study. In Proceedings of the
2009 22nd IEEE International Symposium on Computer-Based Medical Systems, Albuquerque, NM, USA,
3–4 August 2009; pp. 1–5.
36. Araujo, F.H.; Santana, A.M.; Neto, P.S. Using machine learning to support healthcare professionals in making
preauthorisation decisions. Int. J. Med. Inform. 2016, 94, 1–7. [CrossRef]
37. Bose, I.; Mahapatra, R.K. Business data mining—A machine learning perspective. Inf. Manag. 2001, 39,
211–225. [CrossRef]
38. Bhatla, N.; Jyoti, K. An analysis of heart disease prediction using different data mining techniques. Int. J. Eng.
2012, 1, 1–4.
39. Raschka, S.; Mirjalili, V. Python Machine Learning; Packt Publishing Ltd.: Birmingham, UK, 2017.
40. Schuller, E.; Eisenmenger, W.; Beier, G. Whiplash Injury in Low Speed Car Accidents: Assessment of
Biomechanical Cervical Spine Loading and Injury Prevention in a Forensic Sample. J. Musculoskelet. Pain
2000, 8, 55–67. [CrossRef]
41. Naumann, F. Data profiling revisited. ACM SIGMOD Rec. 2014, 42, 40–49. [CrossRef]
42. Rahm, E.D.H. Data Cleaning: Problems and Current Approaches. Bull. Tech. Comm. Data Eng. 2000, 23, 3–13.
43. Jannot, A.-S.; Zapletal, E.; Avillach, P.; Mamzer, M.-F.; Burgun, A.; Degoulet, P. The Georges Pompidou
University Hospital Clinical Data Warehouse: A 8-years follow-up experience. Int. J. Med. Inform. 2017, 102,
21–28. [CrossRef]
44. Evans, R.S.; Lloyd, J.F.; Pierce, L.A. Clinical Use of an Enterprise Data Warehouse. AMIA Annu. Symp. Proc.
2012, 2012, 189–198.
45. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing
machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [CrossRef]
46. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. Balancing Strategies and Class Overlapping. Lect. Notes
Comput. Sci. 2005, 3646, 24–35. [CrossRef]
47. Bhardwaj, R.; Nambiar, A.R.; Dutta, D. A Study of Machine Learning in Healthcare. 2017 IEEE 41st Annu.
Comput. Softw. Appl. Conf. 2017, 2, 236–241. [CrossRef]
48. Dörre, J.; Gerstl, P.; Seiffert, R. Text mining: Finding nuggets in mountains of textual data. In Proceedings
of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD),
San Diego, CA, USA, 23–27 August 1999; pp. 398–401.
49. Aggarwal, C.C.; Zhai, C. Mining Text Data; Springer Science & Business Media: Berlin, Germany, 2012.
50. Janasik, N.; Honkela, T.; Bruun, H. Text mining in qualitative research: Application of an unsupervised
learning method. Organ. Res. Methods 2009, 12, 436–460. [CrossRef]
51. Blei, D.M. Probabilistic topic models. Commun. ACM 2012, 55, 77–84. [CrossRef]
52. Henri, L. Data Scientist y Lenguaje R Guía de Autoformación Para el uso de Big Data; Francisco, J., Piqueres, J.,
Eds.; Colecciones Epsilon: Cornellá de Llobregat, Spain, 2017.
53. Stavrianou, A.; Andritsos, P.; Nicoloyannis, N. Overview and semantic issues of text mining. ACM SIGMOD
Rec. 2007, 36, 23–34. [CrossRef]
54. Carlos, T.; Sergio, I.; Carlos, S. Text Mining of Medical Documents in Spanish: Semantic Annotation and
Detection of Recommendations. In Proceedings of the 16th International Conference on Web Information
Systems and Technologies (WEBIST 2020), Budapest, Hungary, 3–5 November 2020.
55. Donders, A.R.T.; Van Der Heijden, G.J.; Stijnen, T.; Moons, K.G. Review: A gentle introduction to imputation
of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091. [CrossRef]
56. Farhangfar, A.; Kurgan, L.; Dy, J. Impact of imputation of missing values on classification error for discrete
data. Pattern Recognit. 2008, 41, 3692–3705. [CrossRef]
57. Bennett, D.A. How can I deal with missing data in my study? Aust. N. Z. J. Public Health 2001, 25, 464–469.
[CrossRef]
58. Arbuckle, L.; El Emam, K. Anonymizing Health Data; O’Reilly Media, Inc.: Newton, MA, USA, 2013.
59. Kargupta, H.; Datta, S.; Wang, Q.; Sivakumar, K. On the privacy preserving properties of random data
perturbation techniques. In Proceedings of the Third IEEE International Conference on Data Mining,
Melbourne, FL, USA, 19–22 November 2003; pp. 99–106.
60. El Emam, K.; Dankar, F.; Issa, R.; Jonker, E.; Amyot, D.; Cogo, E.; Corriveau, J.-P.; Walker, M.; Chowdhury, S.;
Vaillancourt, R.; et al. A globally optimal k-anonymity method for the de-identification of health data. J. Am.
Med. Inform. Assoc. 2009, 16, 670–682. [CrossRef]
61. Dankar, F.K.; El Emam, K. The application of differential privacy to health data. Jt. EDBT/ICDT Workshops
EDBT-ICDT 2012, 2012, 158–166. [CrossRef]
62. IBM. SPSS Software. Available online: https://www.routledge.com/IBM-SPSS-Statistics-26-Step-by-Step-A-
Simple-Guide-and-Reference/George-Mallery/p/book/9780367174354 (accessed on 16 February 2020).
63. Wu, X.; Kumar, V.; Quinlan, J.R.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.J.; Ng, A.; Liu, B.; Yu, P.; et al.
Top 10 algorithms in data mining. Knowl. Inf. Syst. 2007, 14, 1–37. [CrossRef]
64. Gupta, P. Cross Validation in Machine Learning. 2020. Available online: https://towardsdatascience.com/
cross-validation-in-machine-learning-72924a69872f (accessed on 26 August 2020).
65. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79.
[CrossRef]
66. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross-Validation. Encycl. Database Syst. 2009, 5, 532–538.
67. Xu, Y.; Goodacre, R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation,
Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning.
J. Anal. Test. 2018, 2, 249–262. [CrossRef] [PubMed]
68. McCaffrey, J. Neural Network Train-Validate-Test Stopping. 2020. Available online: https://
visualstudiomagazine.com/articles/2015/05/01/train-validate-test-stopping.aspx (accessed on 26 August 2020).
69. Ferber, R.; Osis, S.T.; Hicks, J.L.; Delp, S.L. Gait biomechanics in the era of data science. J. Biomech. 2016, 49,
3759–3761. [CrossRef] [PubMed]
70. Reed, R.; MarksII, R.J. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks; Mit Press:
Cambridge, MA, USA, 1999.
71. Christodoulou, E.; Ma, J.; Collins, G.S.; Steyerberg, E.W.; Verbakel, J.Y.; Van Calster, B.; Evangelia, C.; Jie, M.
A systematic review shows no performance benefit of machine learning over logistic regression for clinical
prediction models. J. Clin. Epidemiol. 2019, 110, 12–22. [CrossRef]
72. Bae, J.M. The clinical decision analysis using decision tree. Epidemiol. Health 2014, 36. [CrossRef] [PubMed]
73. Noi, P.T.; Kappas, M. Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine
Classifiers for Land Cover Classification Using Sentinel-2 Imagery. Sensors 2018, 18, 18. [CrossRef]
74. Penny, W.; Frost, D.; Penny, W. Neural Networks in Clinical Medicine. Med. Decis. Mak. 1996, 16, 386–398.
[CrossRef]
75. Zhang, Z.; Zhao, Y.; Canes, A.; Steinberg, D.; Lyashevska, O.; AME Big-Data Clinical Trial Collaborative
Group. Predictive analytics with gradient boosting in clinical medicine. Ann. Transl. Med. 2019, 7. [CrossRef]
76. Witten, I.; Frank, E.; Hall, M.; Pal, C. Appendix B: The WEKA workbench. In Data Mining: Practical Machine
Learning Tools and Techniques, 4th ed.; Morgan Kaufmann: Burlington, MA, USA, 2016.
77. Moons, K.; Kengne, A.P.; Grobbee, D.E.; Royston, P.; Vergouwe, Y.; Altman, U.G.; Woodward, M. Risk
prediction models: II. External validation, model updating, and impact assessment. Heart 2012, 98, 691–698.
[CrossRef] [PubMed]
78. Murphy, C.K. Identifying Diagnostic Errors with Induced Decision Trees. Med. Decis. Mak. 2001, 21, 368–375.
[CrossRef] [PubMed]
79. Zhao, L. The gut microbiota and obesity: From correlation to causality. Nat. Rev. Genet. 2013, 11, 639–647.
[CrossRef] [PubMed]
80. Dab, W.; Ségala, C.; Dor, F.; Festy, B.; Lameloise, P.; Le Moullec, Y.; Le Tertre, A.; Médina, S.; Quenel, P.;
Wallaert, B.; et al. Air pollution and health: Correlation or causality? The case of the relationship between
exposure to particles and cardiopulmonary mortality. J. Air Waste Manag. Assoc. 2001, 51, 220–235. [CrossRef]
[PubMed]
81. Liu, C.; Talaei-Khoei, A.; Zowghi, D.; Daniel, J. Data completeness in healthcare: A literature survey. Pac. Asia
J. Assoc. Inf. Syst. 2017, 9, 5. [CrossRef]
82. Mannini, A.; Trojaniello, D.; Cereatti, A.; Sabatini, A.M. A Machine Learning Framework for Gait Classification
Using Inertial Sensors: Application to Elderly, Post-Stroke and Huntington’s Disease Patients. Sensors 2016,
16, 134. [CrossRef]
83. Ramírez, P.C.; Ordi, H.G.; Fernández, P.S.; Morales, M.I.C. Detección de exageración de síntomas en esguince
cervical: Pacientes clínicos versus sujetos análogos. Trauma 2014, 25, 4–10.
84. De La Torre, J.; Marin, J.; Marin, J.J.; Auria, J.M.; Sanchez-Valverde, M.B. Balance study in asymptomatic
subjects: Determination of significant variables and reference patterns to improve clinical application.
J. Biomech. 2017, 65, 161–168. [CrossRef]
85. Austin, P.C.; Tu, J.V.; Ho, J.E.; Levy, D.; Lee, D.S. Using methods from the data-mining and machine-learning
literature for disease classification and prediction: A case study examining classification of heart failure
subtypes. J. Clin. Epidemiol. 2013, 66, 398–407. [CrossRef]
86. Choi, E.; Bahadori, M.T.; Schuetz, A.; Stewart, W.F.; Sun, J. Doctor AI: Predicting Clinical Events via Recurrent
Neural Networks. JMLR Work. Conf. Proc. 2016, 56, 301–318.
87. Uyar, A.; Bener, A.; Ciray, H.N. Predictive modeling of implantation outcome in an in vitro fertilization
setting: An application of machine learning methods. Med. Decis. Mak. 2015, 35, 714–725. [CrossRef]
[PubMed]
88. Abdelaziz, A.; Elhoseny, M.; Salama, A.S.; Riad, A. A machine learning model for improving healthcare
services on cloud computing environment. Measurement 2018, 119, 117–128. [CrossRef]
89. Hall, P.; Gill, N. An Introduction to Machine Learning Interpretability; O’Reilly Media, Inc.: Sebastopol, CA,
USA, 2018.
90. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A Survey of Methods for
Explaining Black Box Models. ACM Comput. Surv. 2018, 51, 1–42. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Taylor Bird Swarm Algorithm Based on Deep Belief
Network for Heart Disease Diagnosis
Afnan M. Alhassan 1,2, * and Wan Mohd Nazmee Wan Zainon 1
1 School of Computer Science, Universiti Sains Malaysia, George Town 11800, Malaysia; [email protected]
2 College of Computing and Information Technology, Shaqra University, Shaqra 11961, Saudi Arabia
* Correspondence: [email protected]
Keywords: deep belief network; heart disease diagnosis; sparse FCM; bird swarm algorithm
1. Introduction
Contemporary medicine depends on a large amount of information accumulated in
medical datasets. The extraction of such constructive knowledge can help when making scientific
decisions to diagnose disease. Medical data can enhance the management of hospital information and
endorse the growth of telemedicine. Medical data primarily focuses on patient care first, and research
resources second. The main rationalization to collect medical data is to promote patient health
conditions [1]. The accessibility of numerous medical data causes redundancy, which requires
effectual and significant techniques for processing data to extract beneficial knowledge. However,
the diagnostics of various diseases indicate significant issues in data analysis [2]. Quantifiable diagnosis
is performed according to a doctor's guidance rather than patterns in the medical dataset; thus, there is the
possibility of incorrect diagnosis [3]. Cloud-based services can assist with managing medical data,
including compliance management, policy integration, access controls, and identity management [4].
Nowadays, heart disease is a leading cause of death. We are moving towards a new
industrial revolution; thus, lifestyle changes should take place to prevent risk factors of heart disease,
such as obesity, diabetes, hypertension, and smoking [5]. The treatment of disease is a complex mission
in the medical field. The discovery of heart disease, with its different risk factors, is considered a multi-layered
issue [6]. Thus, patient medical data are collected to simplify the diagnosis process. Offering a valuable
service (at low cost) is a major challenge in the healthcare industry. In [7], valuable quality service refers
to precise diagnosis (and effective treatment) of patients. Poor clinical decisions can cause disasters,
which may affect the health of patients. Automated approaches, such as the machine-learning
approach [8,9] and data mining [10] approach, assist with attaining clinical tests, or diagnoses,
at reduced risks [11,12]. The classification and pattern recognition by machine learning algorithms are
widely included in prognostic and diagnosis monitoring. The machine learning approach supports
decision-making, which increases the safety of the patients and avoids medical errors, so that it can be
used in clinical decision support systems (CDSS) [13,14].
Several methods are devised for automatic heart disease detection to evaluate the efficiency of the
decision tree and Naive Bayes [15]. Moreover, optimization with the genetic algorithm is employed
for minimizing the number of attributes without forfeiting accuracy and efficiency to diagnose
heart disease [16]. Data mining methods for heart disease diagnosis include the bagging algorithm,
neural network, support vector machine, and automatically defined groups [17]. In [18], the study
acquired 493 samples from a cerebrovascular disease prevention program, and utilized three
classification techniques (the Bayesian classifier, decision tree, and backpropagation neural network) for
constructing classification models. In [19], a method is devised for diagnosing coronary artery disease.
The method utilized 303 samples by adapting the feature creation technique. In [20], a methodology
is devised for automatically detecting the efficiency of features to reveal heart rate signals. In [21],
a hybrid algorithm is devised with the K-Nearest Neighbour (KNN) and genetic algorithms for
effectual classification. The method utilized a genetic search to rank the attributes.
Then, the classification algorithm was devised on evaluated attributes for heart disease diagnosis.
The extraction of valuable information from huge data is a time-consuming task [22]. The size of
the medical dataset is increasing in a rapid manner and advanced techniques of data mining help
physicians make effective decisions. However, the issues of heart disease data involve feature selection,
in which the imbalance of samples and the lack of magnitude of features are just some of the issues [23].
Although there are methods for heart disease detection with real-world medical data, these methods are
devised to improve accuracy and time for computation in disease detection [24]. In [25], a hybrid model
with the cuckoo search (CS)—and a rough set—is adapted for diagnosing heart disease. The drawback
is that a rough set produces an unnecessarily large number of rules. To solve these challenges in heart
disease diagnosis, a novel method, named the Taylor-based bird swarm algorithm–deep belief network
(Taylor-BSA–DBN), is proposed for medical data classification.
The purpose of the research is to present a heart disease diagnosis strategy, for which the proposed
Taylor-BSA–DBN is employed. The major contribution of the research is the detection of heart disease
using selected features. Here, the feature selection is performed using sparse FCM for selecting
imperative features. In addition, DBN is employed for detecting heart disease data using the features.
Here, the DBN is trained by the proposed Taylor-BSA, in such a way that the model parameters are
learned optimally. The proposed Taylor-BSA is developed through the inheritance of the high global
convergence property of BSA in the Taylor series. Hence, the proposed Taylor-BSA–DBN renders
effective accuracy, sensitivity, and specificity while facilitating heart disease diagnosis.
The major portion of the paper focuses on:
Other sections of the paper are arranged as follows: Section 2 elaborates the descriptions of
conventional heart disease detection strategies utilized in the literature, as well as challenges faced,
which are considered as the inspiration for developing the proposed technique. The proposed method
for heart disease diagnosis using modified DBN is portrayed in Section 3. The outcomes of the proposed
strategy with other methods are depicted in Section 4; Section 5 presents the conclusion.
2. Motivations
This section illustrates eight strategies employed for heart disease diagnosis, along with
their challenges.
Literature Survey
Reddy, G.T. et al. [22] devised an adaptive genetic algorithm with fuzzy logic (AGAFL) model
for predicting heart disease, which assists clinicians in treating heart disease at earlier phases.
The model comprises rough sets with a fuzzy rule-based classification module and heart disease feature
selection module. The obtained rules from fuzzy classifiers are optimized by adapting an adaptive
genetic algorithm. Initially, the significant features that affect heart disease are chosen using the rough
set theory. Then, the second step predicts heart disease with the AGAFL classifier. The method is effective
in handling noisy data and works effectively with large numbers of attributes. Nourmohammadi-Khiarak et al. [23]
devised a method for feature selection that reduces the number of features.
Here, the imperialist competitive algorithm was devised to choose important features from
heart disease. This algorithm offers an optimal response in selecting features. Moreover, the k-nearest
neighbor algorithm was utilized for classification. The method showed that the accuracy of
feature selection was enhanced. However, the method failed to handle incomplete or missing data.
Magesh, G. and Swarnalatha, P. [26] devised a model using Cleveland heart samples for heart
disease diagnosis. The method employed cluster-based Decision Tree learning (CDTL) for diagnosing
heart disease. Here, the original set was partitioned using target label distribution. From elevated
distribution samples, the possible class was derived. For each class set, the features were detected
using entropy for diagnosing heart disease. Thiyagaraj, M. and Suseendran, G. [27] developed
Particle Swarm Optimization and Rough Sets with Transductive Support Vector Machines (PSO and
RS with TSVM) for heart disease diagnosis. This method improved data integrity to minimize
data redundancy. The normalization of data was carried out using the Z-score. Then,
PSO was employed to select the optimal subset of attributes, reduce computational overhead,
and enhance prediction performance. The Radial Basis Function-Transductive Support Vector Machines
(RBF-TSVM) classifier was employed for heart disease prediction. Abdel-Basset, M. et al. [28] devised
a model using Internet of Things (IoT) for determining and monitoring heart patients. The goal of
the healthcare model was to obtain improved precision for diagnosis. The neutrosophic multi-criteria
decision-making (NMCDM) technique was employed for aiding patients (i.e., for observing patients
suffering from heart failure). Moreover, the model provided an accurate solution that decreases the rate
of death and the cost of treatment. Nilashi, M. et al. [24] devised a predictive technique for heart disease
diagnosis with machine learning models. Here, the method adapted unsupervised and supervised
learning for diagnosing heart disease. In addition, the method employed Self-Organizing Map,
Fuzzy Support Vector Machine (FSVM), and Principal Component Analysis (PCA) for missing
value assertion. Moreover, incremental PCA and FSVM are devised for incremental learning of
data to minimize the time taken for computation in disease prediction. Shah, S.M.S. et al. [29]
devised an automatic diagnostic technique for diagnosing heart disease. The method evaluated
the pertinent feature subset by employing the benefits of feature selection and extraction models.
For accomplishing the feature selection, two algorithms were employed: the accuracy-based feature selection algorithm
(AFSA) and the mean Fisher-based feature selection algorithm (MFFSA) for heart disease diagnosis.
However, the method failed to employ PCA for dimension reduction. Acharjya, D.P. [25] devised
a hybrid method for diagnosing heart disease. The method combined the cuckoo search (CS) and
rough set to infer decision rules. Moreover, the CS was employed for discovering essential features.
In addition, three major features were evaluated with rough set rules. The method improved feasibility,
but failed to induce an intuitionistic fuzzy rough set and CS for diagnosing heart disease.
Figure 1. Schematic view of the proposed Taylor-based bird swarm algorithm (Taylor-BSA)–deep belief
network (DBN) for heart disease diagnosis.
Consider the input medical data given as A, with various attributes, expressed as
$A = \{A_{G,H}\};\ (1 \le G \le B);\ (1 \le H \le C)$ (1)
where $A_{G,H}$ denotes the $H$th attribute of the $G$th data record, $B$ specifies the total number of records, and $C$ specifies
the total number of attributes in each record. The dimension of the database is represented as $[B \times C]$.
3.1. Pre-Processing
The importance of pre-processing is to facilitate smoother processing of the input data. Additionally,
the pre-processing is carried out for eliminating the noise and artefacts contained in the data.
In this method, the pre-processing is carried out using a log transformation, in which data are
replaced with their log values, wherein the base of the log is set by the analyst (e.g., 2 or 10).
The process is used to compress large data values. In addition, the log transformation is an extensively
adopted method to handle skewed data and assist data normalization. The log transformation is
formulated as
$D = \log_{10}(A)$ (2)
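For illustration only, a minimal numpy sketch of the transformation in Equation (2) is given below; the attribute values are hypothetical, and the data are assumed to be strictly positive so that the logarithm is defined.

```python
import numpy as np

def log_transform(data, base=10):
    """Log transformation of Equation (2); the base is chosen by the analyst."""
    return np.log(np.asarray(data, dtype=float)) / np.log(base)

# Hypothetical attribute values for two records
A = np.array([[120.0, 80.0, 240.0],
              [140.0, 95.0, 310.0]])
D = log_transform(A)  # base-10 log, as in Equation (2)
```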
Step 4. Foraging behavior of birds: each individual bird searches for food based on its
own experience and on the behavior of the swarm, as given below. The standard equation of the
foraging behavior of birds [32] is given by
$F^{h+1}_{i,j} = F^{h}_{i,j} - F^{h}_{i,j}\,\mathrm{Rand}(0,1)\,[Z + T] + \mathrm{Rand}(0,1)\,[P_{i,j} Z + Y_{j} T]$ (3)
where $F^{h+1}_{i,j}$ and $F^{h}_{i,j}$ denote the location of the $i$th bird in the $j$th dimension at iterations $(h+1)$ and $h$, $P_{i,j}$ refers to the
previous best position of the $i$th bird, $\mathrm{Rand}(0,1)$ denotes independent uniformly distributed random numbers,
$Y_{j}$ indicates the best previous location shared by the bird swarm, $Z$ denotes the cognitive
acceleration coefficient, and $T$ denotes the social acceleration coefficient. Here, $Z$ and $T$ are positive
numbers.
According to the Taylor series [31], the update equation is expressed as
$F^{h+1}_{i,j} = 0.5\,F^{h}_{i,j} + 1.3591\,F^{h-1}_{i,j} - 1.359\,F^{h-2}_{i,j} + 0.6795\,F^{h-3}_{i,j} - 0.2259\,F^{h-4}_{i,j} + 0.0555\,F^{h-5}_{i,j} - 0.0104\,F^{h-6}_{i,j} + 1.38\times10^{-3}\,F^{h-7}_{i,j} - 9.92\times10^{-5}\,F^{h-8}_{i,j}$ (4)
$F^{h}_{i,j} = \dfrac{1}{0.5}\left[F^{h+1}_{i,j} - 1.3591\,F^{h-1}_{i,j} + 1.359\,F^{h-2}_{i,j} - 0.6795\,F^{h-3}_{i,j} + 0.2259\,F^{h-4}_{i,j} - 0.0555\,F^{h-5}_{i,j} + 0.0104\,F^{h-6}_{i,j} - 1.38\times10^{-3}\,F^{h-7}_{i,j} + 9.92\times10^{-5}\,F^{h-8}_{i,j}\right]$ (5)
where $V$ represents the number of birds, $w_1$ and $w_2$ are positive constants lying in the range $[0, 2]$,
$RQ(U)_i$ denotes the optimal fitness value of the $i$th bird, and $RQ$ corresponds to the sum of the best
fitness values of the swarm. $\psi$ is a constant that keeps the optimization away from a zero-division error.
$T$ signifies a positive integer.
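A small, illustrative Python transcription of the foraging update of Equation (3) and the Taylor-series update of Equation (4) is sketched below; the coefficient values Z = T = 1.5 and the use of two independent uniform draws are our assumptions, not values given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coefficients of the Taylor-series update in Equation (4), applied to the
# positions at iterations h, h-1, ..., h-8 (most recent first).
TAYLOR_COEFFS = (0.5, 1.3591, -1.359, 0.6795, -0.2259,
                 0.0555, -0.0104, 1.38e-3, -9.92e-5)

def foraging_update(F, P, Y, Z=1.5, T=1.5):
    """Foraging step of Equation (3): F is the bird's current position,
    P its previous best position, and Y the swarm's best position."""
    r1, r2 = rng.random(F.shape), rng.random(F.shape)
    return F - F * r1 * (Z + T) + r2 * (P * Z + Y * T)

def taylor_update(history):
    """Taylor-BSA update of Equation (4): a fixed-coefficient combination of
    the bird's last nine positions, most recent first."""
    return sum(c * pos for c, pos in zip(TAYLOR_COEFFS, history))

# Hypothetical three-dimensional example
F, P, Y = rng.random(3), rng.random(3), rng.random(3)
new_position = foraging_update(F, P, Y)
```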
Step 6. Flight behavior: birds fly to another site in case of threatening events or for foraging.
When the birds reach a new site, they search for food. Some birds in the group act as producers and
others as scroungers. The behavior is modeled as
$F^{h+1}_{i,j} = F^{h}_{i,j} + \mathrm{Rand}(0,1) \times F^{h}_{i,j}$ (13)
$F^{h+1}_{i,j} = F^{h}_{i,j} + \left(F^{h}_{\gamma,j} - F^{h}_{i,j}\right) \times Fl \times \mathrm{Rand}(0,1)$ (14)
where $\mathrm{Rand}(0,1)$ in Equation (13) refers to a Gaussian-distributed random number with zero mean and
unit standard deviation, and $F^{h}_{\gamma,j}$ is the position of a randomly selected producer.
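A corresponding sketch of the producer and scrounger moves of Equations (13) and (14) is shown below; the value of the coefficient Fl and the choice of random-number generators are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def producer_update(F):
    """Producer move of Equation (13): explore around the current position
    using a zero-mean, unit-variance Gaussian perturbation."""
    return F + rng.standard_normal(F.shape) * F

def scrounger_update(F, F_producer, Fl=0.5):
    """Scrounger move of Equation (14): follow a randomly selected producer
    with coefficient Fl (0.5 is an assumed value)."""
    return F + (F_producer - F) * Fl * rng.random(F.shape)
```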
Step 7. Determination of the best solution: the best solution is evaluated based on the error function. If the
newly computed solution is better than the previous one, then it is updated by the new solution.
Step 8. Terminate: the optimal solutions are derived in an iterative manner until the maximum number
of iterations is reached. The pseudo-code of the proposed Taylor-BSA algorithm is illustrated in Algorithm 1.
Algorithm 1. Pseudocode for the proposed Taylor-BSA algorithm
Input: bird swarm population F_{i,j}; (1 ≤ i ≤ b)
Output: best solution
Procedure:
Begin
  Population initiation: F_{i,j}; (1 ≤ i ≤ b)
  Read the parameters: b (population size); h_max (maximal iteration); prob (probability of foraging for food); Fl (frequency of flight behavior of the birds)
  Determine the fitness of the solutions
  While h < h_max
    If (h mod Fl) ≠ 0
      For k = 1 : b
        If Rand(0, 1) < prob
          Foraging behavior using Equation (3)
        Else
          Vigilance behavior using Equation (12)
        End if
      End for
    Else
      Split the swarm into scroungers and producers
      For k = 1 : b
        If bird k is a producer
          Update using Equation (13)
        Else
          Update using Equation (14)
        End if
      End for
    End if
    Check the feasibility of the solutions
    Update the best solution
    h = h + 1
  End while
  Optimal solution is obtained
End
and hidden units that are associated with weights. The basic structural design of the DBN is illustrated
in Figure 2.
I. Training of RBM Layers
A training sample N is given as the input to the first layer of RBM.
It computes the probability distribution of the data and encodes it into the weight parameters.
The steps involved in the training process of RBM are illustrated below.
1. The input training sample is read and the weight vector is produced randomly.
2. The probability function of each hidden neuron in the first RBM is calculated.
3. The positive gradient is computed using a visible vector and the probability of the
hidden layer.
4. The probability of each visible neuron is obtained by reconstructing the visible layer from
the hidden layer.
5. The probability of reconstruction of hidden neurons is obtained by resampling the
hidden states.
6. The negative gradient is computed.
7. Weights are updated by subtracting the negative gradient from the positive gradient.
8. Weights are updated for the next iteration, using the steepest or gradient descent algorithm.
9. Energy is calculated for a joint configuration of the neurons in the visible and the
hidden layers.
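The steps above amount to one round of contrastive divergence; a compact, illustrative sketch is given below, assuming sigmoid units, omitting bias terms and the energy computation of step 9, and using hypothetical layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.01):
    """One contrastive-divergence weight update for an RBM with weight matrix
    W (visible x hidden), following steps 2-8 above."""
    h0 = sigmoid(v0 @ W)                     # step 2: hidden probabilities
    pos_grad = np.outer(v0, h0)              # step 3: positive gradient
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h0_sample @ W.T)            # step 4: reconstruct the visible layer
    h1 = sigmoid(v1 @ W)                     # step 5: resampled hidden probabilities
    neg_grad = np.outer(v1, h1)              # step 6: negative gradient
    return W + lr * (pos_grad - neg_grad)    # steps 7-8: gradient update

# Hypothetical example: 6 visible units, 4 hidden units, one binary training vector
W = rng.normal(scale=0.01, size=(6, 4))
v = rng.integers(0, 2, size=6).astype(float)
W = cd1_step(v, W)
```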
II. Training of MLP
The training procedure of the MLP is based on a backpropagation approach,
feeding the training data, which are the hidden outputs of the second RBM layer, through
the network. Analyzing the data, the network is adjusted iteratively until the optimal weights
are chosen. Moreover, Taylor-BSA is employed to compute the optimal weights, which are
determined using the error function. The training procedure is summarized below.
2. Read the input sample from the result of the preceding layer.
3. Obtain the average error, based on the difference between the obtained output and the
desired output.
4. Calculate the weight updates in the hidden and the visible layers.
5. Obtain the new weights from the hidden and the visible layers by applying
gradient descent.
6. Identify the new weights using the updated equation of Taylor-BSA.
7. Estimate the error function using gradient descent and Taylor-BSA.
8. Choose the minimum error and repeat the steps.
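As an illustration of how candidate weight vectors produced by Taylor-BSA could be scored with such an error function, consider the sketch below; the mean squared error, the forward function, and the toy data are assumptions made for the example, not details given in the paper.

```python
import numpy as np

def mse_error(weights, forward_fn, X, y):
    """Error function used as the fitness of a candidate weight vector:
    mean squared error between network outputs and desired outputs."""
    predictions = np.array([forward_fn(weights, x) for x in X])
    return float(np.mean((predictions - np.asarray(y)) ** 2))

def select_best_weights(candidates, forward_fn, X, y):
    """Keep the candidate weight vector with the minimum error (step 8)."""
    return min(candidates, key=lambda w: mse_error(w, forward_fn, X, y))

# Hypothetical usage with a toy one-layer "network"
forward = lambda w, x: 1.0 / (1.0 + np.exp(-np.dot(w, x)))
X = np.array([[0.2, 0.7], [0.9, 0.1]])
y = np.array([1.0, 0.0])
best = select_best_weights([np.zeros(2), np.array([2.0, -2.0])], forward, X, y)
```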
Parameter Value
Number of input layers 2
Number of hidden layers 2
Number of output layers 1
Cluster size 5 to 9
Number of selected features in Cleveland dataset 123
Number of selected features in Hungarian dataset 139
Number of selected features in Switzerland dataset 139
Learning rate 0.1
4.3.1. Accuracy
The accuracy is described as the degree of closeness of an estimated value with respect to its
original value in optimal medical data classification, and it is represented as,
$\mathrm{Accuracy} = \dfrac{T_p + T_n}{T_p + T_n + F_p + F_n}$ (15)
where $T_p$, $F_p$, $T_n$, and $F_n$ represent the true positives, false positives, true negatives, and false negatives, respectively.
4.3.2. Sensitivity
This measure is described as the ratio of positives that are correctly identified by the classifier,
and it is represented as,
$\mathrm{Sensitivity} = \dfrac{T_p}{T_p + F_n}$ (16)
4.3.3. Specificity
This measure is defined as the ratio of negatives that are correctly identified by the classifier, and is
formulated as
$\mathrm{Specificity} = \dfrac{T_n}{T_n + F_p}$ (17)
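For reference, Equations (15)-(17) can be computed directly from the confusion-matrix counts, as in the short sketch below; the counts used are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity as defined in Equations (15)-(17)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical confusion-matrix counts
print(classification_metrics(tp=86, tn=80, fp=14, fn=12))
```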
Table 2. Analysis of methods with cluster size = 5 using the Cleveland database. Abbreviations: SVM,
Support Vector Machine; NB, Naive Bayes; DBN, deep belief network.
Training Percentage | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Accuracy
50 0.7590 0.7603 0.7874 0.8625
60 0.7143 0.7682 0.7851 0.8632
70 0.7460 0.7627 0.8122 0.8531
80 0.7236 0.7619 0.7869 0.8644
90 0.7538 0.7647 0.7742 0.8710
Sensitivity
50 0.7535 0.7613 0.7908 0.8693
60 0.7120 0.7611 0.7886 0.8699
70 0.7473 0.7558 0.8172 0.8602
80 0.7167 0.7656 0.7903 0.8710
90 0.7576 0.7667 0.7714 0.8788
Specificity
50 0.7566 0.7667 0.7838 0.8551
60 0.7165 0.7750 0.7815 0.8559
70 0.7447 0.7692 0.8068 0.8452
80 0.7302 0.7581 0.7833 0.8571
90 0.7500 0.7576 0.7813 0.8621
Table 3. Analysis of methods with cluster size = 5 using the Hungarian database.
Training Percentage | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Accuracy
50 0.7810 0.8043 0.8428 0.9200
60 0.7906 0.7976 0.8595 0.8907
70 0.7674 0.8143 0.8182 0.8551
80 0.7576 0.8095 0.8710 0.8710
90 0.6957 0.7500 0.7647 0.9130
Sensitivity
50 0.8160 0.8456 0.8776 0.9388
60 0.8300 0.8347 0.8908 0.9160
70 0.8052 0.8539 0.8571 0.8876
80 0.8065 0.8400 0.8795 0.9000
90 0.7500 0.8000 0.8125 0.9333
Specificity
50 0.7294 0.7326 0.7805 0.8846
60 0.7143 0.7500 0.8030 0.8438
70 0.7115 0.7451 0.7500 0.7959
80 0.6757 0.7647 0.8182 0.8182
90 0.6111 0.6667 0.6842 0.8750
Table 4. Analysis of methods with cluster size = 5 using the Switzerland database.
Training Percentage | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Accuracy
50 0.7619 0.7710 0.7895 0.8644
60 0.7009 0.7374 0.7800 0.8557
70 0.7073 0.7568 0.7895 0.8904
80 0.7551 0.7647 0.7778 0.8400
90 0.6774 0.7037 0.7143 0.8462
Sensitivity
50 0.7656 0.7742 0.7818 0.8710
60 0.6981 0.7292 0.7843 0.8627
70 0.7073 0.7500 0.7948 0.8974
80 0.7500 0.7692 0.7857 0.8462
90 0.6875 0.6923 0.7143 0.8571
Specificity
50 0.7581 0.7667 0.7966 0.8571
60 0.7037 0.7451 0.7755 0.8478
70 0.7073 0.7632 0.7838 0.8823
80 0.7600 0.7600 0.7692 0.8333
90 0.6667 0.7143 0.7143 0.8333
Table 5. Analysis of methods with cluster size = 9 using the Cleveland database.
Training Percentage | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Accuracy
50 0.7590 0.7603 0.7993 0.8690
60 0.7166 0.7430 0.7851 0.8632
70 0.7354 0.7363 0.8123 0.8857
80 0.7419 0.7607 0.7619 0.8475
90 0.7460 0.7910 0.8710 0.9016
Sensitivity
50 0.7535 0.7613 0.8039 0.8758
60 0.7107 0.7440 0.7886 0.8699
70 0.7303 0.7368 0.8172 0.8925
80 0.7419 0.7544 0.7656 0.8548
90 0.7419 0.8000 0.8788 0.9091
Specificity
50 0.7566 0.7667 0.7945 0.8613
60 0.7222 0.7419 0.7815 0.8559
70 0.7340 0.7419 0.8068 0.8780
80 0.7419 0.7581 0.7667 0.8393
90 0.7500 0.7813 0.8621 0.8929
Table 6. Analysis of methods with cluster size = 9 using the Hungarian database.
Training Percentage | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Accuracy
50 0.7500 0.7957 0.8515 0.9200
60 0.7513 0.7870 0.8075 0.8907
70 0.7777 0.8273 0.8500 0.9118
80 0.7755 0.8298 0.8974 0.9341
90 0.7674 0.8000 0.8298 0.8696
Sensitivity
50 0.7907 0.8389 0.8844 0.9388
60 0.8017 0.8218 0.8487 0.9160
70 0.8242 0.8652 0.8732 0.9326
80 0.8226 0.8667 0.9130 0.9500
90 0.8077 0.8438 0.8667 0.9000
Specificity
50 0.6897 0.7209 0.7927 0.8846
60 0.6667 0.7353 0.7353 0.8438
70 0.6981 0.7600 0.8163 0.8723
80 0.6944 0.7647 0.8750 0.9032
90 0.7059 0.7222 0.7647 0.8125
Table 7. Analysis of methods with cluster size = 9 using the Switzerland database.
Training Percentage | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Accuracy
50 0.7460 0.7479 0.8017 0.8644
60 0.7170 0.7624 0.7684 0.8947
70 0.7368 0.7500 0.7662 0.8904
80 0.6786 0.7551 0.8200 0.8400
90 0.7333 0.7600 0.7679 0.7778
Sensitivity
50 0.7414 0.7500 0.8065 0.8710
60 0.7170 0.7609 0.7647 0.9020
70 0.7297 0.7561 0.7692 0.8974
80 0.6786 0.7500 0.8300 0.8462
90 0.7300 0.7500 0.7667 0.7857
Specificity
50 0.7419 0.7541 0.7966 0.8571
60 0.7170 0.7600 0.7755 0.8864
70 0.7436 0.7436 0.7632 0.8824
80 0.6786 0.7600 0.8200 0.8333
90 0.7143 0.7300 0.7556 0.7692
showed maximal accuracy of 0.913, maximal sensitivity of 0.933, and maximal specificity of 0.875.
Considering the Switzerland database, the proposed Taylor-BSA–DBN showed maximal accuracy
of 0.846, which is 19.98%, 16.78%, and 15.60% better than the existing methods, such as SVM,
NB, and DBN, respectively. Similarly, the proposed system has a maximum sensitivity of 0.857.
The percentage of improvement of the proposed system sensitivity, with the existing methods,
such as SVM, NB, and DBN is 19.72%, 19.25%, and 16.69%, respectively. Likewise, the proposed
Taylor-BSA–DBN showed maximal specificity of 0.833.
k-Fold | SVM | NB | DBN | Proposed Taylor-BSA–DBN
Cleveland
Accuracy
5 0.7021 0.7088 0.7126 0.7239
6 0.7122 0.7189 0.7245 0.7365
7 0.7345 0.7543 0.7634 0.7843
8 0.7528 0.7843 0.7965 0.8132
Sensitivity
5 0.7567 0.7678 0.7898 0.7956
6 0.7834 0.8045 0.8156 0.8232
7 0.8032 0.8189 0.8256 0.8321
8 0.8145 0.8229 0.8365 0.8448
Specificity
5 0.7586 0.7656 0.7699 0.7865
6 0.7854 0.7745 0.7965 0.8043
7 0.7940 0.8088 0.8124 0.8227
8 0.8021 0.8178 0.8249 0.8339
Hungarian
Accuracy
5 0.7960 0.8758 0.8791 0.8822
6 0.8071 0.8413 0.8838 0.8854
7 0.7985 0.8030 0.8324 0.8917
8 0.7982 0.8626 0.8948 0.9021
Sensitivity
5 0.7959 0.8027 0.8197 0.8231
6 0.7857 0.7891 0.7925 0.8367
7 0.7891 0.7925 0.7959 0.8393
8 0.7857 0.7925 0.8061 0.8458
Specificity
5 0.7494 0.7513 0.7645 0.7656
6 0.7146 0.7343 0.7645 0.7760
7 0.7433 0.7535 0.7645 0.7719
8 0.7530 0.7645 0.7697 0.7873
Switzerland
Accuracy
5 0.7528 0.7789 0.7799 0.7896
6 0.7658 0.7828 0.7925 0.8012
7 0.7712 0.7958 0.8028 0.8156
8 0.7828 0.8128 0.8159 0.8259
Sensitivity
5 0.7428 0.7689 0.7847 0.7956
6 0.7625 0.7758 0.7858 0.8028
7 0.7748 0.7896 0.7952 0.8125
8 0.7828 0.7986 0.8078 0.8225
Specificity
5 0.7658 0.7589 0.7750 0.7896
6 0.7758 0.7832 0.7962 0.8020
7 0.7841 0.7911 0.8025 0.8196
8 0.7958 0.8002 0.8178 0.8219
Using cluster size = 9, and considering the Cleveland database, the proposed Taylor-BSA–DBN
showed maximal accuracy of 0.934, which is 16.92%, 11.13%, and 3.96% better than the
existing methods, such as SVM, NB, and DBN, respectively. In the existing methods, the DBN
offers maximum sensitivity of 0.913, but the proposed method is 3.89% better than the existing DBN.
The proposed method has a maximum specificity of 0.903. The percentage of improvement of
the proposed method with the existing methods, such as SVM, NB, and DBN, is 23.15%, 15.28%,
and 3.10%, respectively. Considering the Hungarian database, the proposed Taylor-BSA–DBN
showed maximal accuracy of 0.902, maximal sensitivity of 0.909, and maximal specificity of 0.893.
Considering the Switzerland database, the proposed Taylor-BSA–DBN showed maximal accuracy
of 0.840, which is 19.17%, 10.12%, and 2.38% better than the existing methods, such as SVM,
NB, and DBN, respectively. Similarly, the proposed system has a maximum sensitivity of 0.846.
The percentage of improvement of the proposed system sensitivity with the existing methods,
such as SVM, NB, and DBN is 19.74%, 11.35%, and 1.89%, respectively. Likewise, the proposed
Taylor-BSA–DBN showed maximal specificity of 0.833.
Table 11 shows the computational time of the proposed system and the existing methods,
such as SVM, NB, and DBN, in which the proposed Taylor-BSA–DBN has a minimum computation
time of 6.31 sec.
Table 12 shows the statistical analysis of the proposed work and the existing methods based on
mean and variance.
Method | Accuracy | Mean | Variance | Sensitivity | Mean | Variance | Specificity | Mean | Variance
Cluster size = 5
Cleveland
SVM 0.754 0.752 0.002 0.758 0.754 0.004 0.750 0.748 0.002
NB 0.765 0.761 0.004 0.767 0.765 0.002 0.758 0.754 0.004
DBN 0.774 0.771 0.003 0.771 0.768 0.003 0.781 0.779 0.002
Proposed Method 0.871 0.869 0.002 0.879 0.878 0.001 0.862 0.860 0.002
Hungarian
SVM 0.696 0.693 0.003 0.750 0.748 0.002 0.611 0.608 0.003
NB 0.750 0.746 0.004 0.800 0.799 0.001 0.667 0.665 0.002
DBN 0.765 0.763 0.002 0.813 0.810 0.003 0.684 0.682 0.002
Proposed Method 0.913 0.911 0.002 0.933 0.932 0.001 0.875 0.873 0.002
Switzerland
SVM 0.677 0.675 0.002 0.688 0.684 0.004 0.667 0.665 0.002
NB 0.704 0.702 0.003 0.692 0.690 0.002 0.714 0.711 0.003
DBN 0.714 0.711 0.003 0.714 0.713 0.001 0.714 0.712 0.002
Proposed Method 0.846 0.844 0.002 0.857 0.855 0.002 0.833 0.831 0.002
Cluster size = 9
Cleveland
SVM 0.776 0.773 0.003 0.823 0.822 0.001 0.694 0.691 0.003
NB 0.830 0.826 0.004 0.867 0.865 0.002 0.765 0.761 0.004
DBN 0.897 0.895 0.002 0.913 0.911 0.002 0.875 0.873 0.002
Proposed Method 0.934 0.932 0.002 0.950 0.948 0.002 0.903 0.901 0.002
Hungarian
SVM 0.746 0.743 0.003 0.742 0.740 0.002 0.750 0.748 0.002
NB 0.791 0.790 0.001 0.800 0.797 0.003 0.781 0.780 0.001
DBN 0.871 0.868 0.003 0.879 0.878 0.001 0.862 0.860 0.002
Proposed Method 0.902 0.900 0.002 0.909 0.907 0.002 0.893 0.891 0.002
Switzerland
SVM 0.679 0.677 0.002 0.679 0.675 0.004 0.679 0.677 0.002
NB 0.755 0.752 0.003 0.750 0.748 0.002 0.760 0.758 0.002
DBN 0.820 0.818 0.002 0.830 0.827 0.003 0.820 0.819 0.001
Proposed Method 0.840 0.838 0.002 0.846 0.844 0.002 0.833 0.832 0.001
5. Conclusions
Contemporary medicine depends on a huge amount of information contained in medical databases.
The availability of large medical data leads to the requirement for effective data analysis tools for
extracting constructive knowledge. This paper proposes a novel, fully automated DBN for heart disease
diagnosis using medical data. The proposed Taylor-BSA is employed to train DBN. The proposed
Taylor-BSA is designed by combining the Taylor series and BSA algorithm, which can be utilized for
finding the optimal weights for establishing effective medical data classification. Here, the sparse-FCM
is employed for selecting significant features. The incorporation of sparse FCM for the feature selection
process provides more benefits for interpreting the models, as this sparse technique provides important
features for detection, and can be utilized for handling high dimensional data. The obtained selected
features are fed to DBN, which is trained by the proposed Taylor-BSA. The proposed Taylor-BSA is
designed by integrating the Taylor series and BSA in order to generate optimal weights for classification.
The proposed Taylor-BSA–DBN outperformed other methods with maximal accuracy of 93.4%, maximal
sensitivity of 95%, and maximal specificity of 90.3%, respectively. The proposed method does not
classify the type of heart disease. In the future, other medical data classification datasets will be
employed for evaluating the efficiency of the proposed method. In addition, the proposed system will be
further improved to classify heart diseases, such as congenital heart disease, coronary artery disease,
and arrhythmia.
Author Contributions: Conceptualization, A.M.A.; methodology, A.M.A.; software, A.M.A.; validation, A.M.A.;
resources, A.M.A.; data curation, A.M.A.; writing—original draft preparation, A.M.A. and W.M.N.W.Z.;
writing—review and editing, A.M.A. and W.M.N.W.Z.; visualization, A.M.A. and W.M.N.W.Z.; supervision,
W.M.N.W.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Abdel-Basset, M.; Gamal, A.; Manogaran, G.; Long, H.V. A novel group decision making model based on
neutrosophic sets for heart disease diagnosis. Multimed. Tools Appl. 2019, 79, 9977–10002. [CrossRef]
2. Acharjya, D.P. A Hybrid Scheme for Heart Disease Diagnosis Using Rough Set and Cuckoo Search Technique.
J. Med. Syst. 2020, 44, 27.
3. Ahn, G.J.; Hu, H.; Lee, J.; Meng, Y. Representing and Reasoning about Web Access Control Policies.
In Proceedings of the IEEE 34th Annual Computer Software and Applications Conference, Seoul, Korea,
19–23 July 2010; pp. 137–146.
4. Alizadehsani, R.; Habibi, J.; Hosseini, M.J.; Mashayekhi, H.; Boghrati, R.; Ghandeharioun, A.; Bahadorian, B.;
Sani, Z.A. A data mining approach for diagnosis of coronary artery disease. Comput. Methods Programs Biomed.
2013, 11, 52–61. [CrossRef] [PubMed]
5. Alzahani, S.M.; Althopity, A.; Alghamdi, A.; Alshehri, B.; Aljuaid, S. An overview of data mining techniques
applied for heart disease diagnosis and prediction. Lect. Notes Inf. Theory 2014, 2, 310–315. [CrossRef]
6. Babič, F.; Olejár, J.; Vantová, Z.; Paralič, J. Predictive and descriptive analysis for heart disease diagnosis.
In Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS),
Prague, Czech Republic, 3–6 September 2017; pp. 155–163.
7. Chang, X.; Wang, Q.; Liu, Y.; Wang, Y. Sparse Regularization in Fuzzy c-Means for High-Dimensional
Data Clustering. IEEE Trans. Cybern. 2017, 47, 2616–2627. [CrossRef]
8. Fatima, M.; Pasha, M. Survey of Machine Learning Algorithms for Disease Diagnostic. J. Intell. Learn.
Syst. Appl. 2017, 9, 1–16. [CrossRef]
9. Ghumbre, S.; Patil, C.; Ghatol, A. Heart disease diagnosis using support vector machine. In Proceedings of
the International Conference on Computer Science and Information Technology (ICCSIT), Mumbai, India,
10–12 June 2011.
10. Ghumbre, S.U.; Ghatol, A.A. Heart Disease Diagnosis Using Machine Learning Algorithm. In Proceedings of
the International Conference on Information Systems Design and Intelligent Applications, Visakhapatnam,
India, 5–7 January 2012; Volume 132, pp. 217–225.
11. Giri, D.; Acharya, U.R. Automated diagnosis of Coronary Artery Disease affected patients using LDA, PCA,
ICA and Discrete Wavelet Transform. Knowl. Based Syst. 2013, 37, 274–282. [CrossRef]
12. Heart Disease Data Set. Available online: http://archive.ics.uci.edu/ml/datasets/Heart+Disease (accessed on
22 April 2020).
13. Jabbar, M.A.; Deekshatulu, B.L.; Chandra, P. Classification of Heart Disease Using K-Nearest Neighbor
and Genetic Algorithm. Procedia Technol. 2013, 10, 85–94. [CrossRef]
14. Jabbar, M.A.; Deekshatulu, B.L.; Chandra, P. Heart disease classification using nearest neighbor classifier
with feature subset selection. An. Ser. Inform. 2013, 11, 47–54.
15. Kukar, M.; Kononenko, I.; Groselj, C.; Kralj, K.; Fettich, J. Analysing and Improving the Diagnosis of Ischaemic
Heart Disease with Machine Learning. Artif. Intell. Med. 1999, 16, 25–50. [CrossRef]
16. Magesh, G.; Swarnalatha, P. Optimal feature selection through a cluster-based DT learning (CDTL) in heart
disease prediction. Evol. Intell. 2020, 1–11. [CrossRef]
17. Mangai, S.A.; Sankar, B.R.; Alagarsamy, K. Taylor Series Prediction of Time Series Data with Error Propagated
by Artificial Neural Network. Int. J. Comput. Appl. 2014, 89, 41–47.
18. Mannepalli, K.; Sastry, P.N.; Suman, M. A novel Adaptive Fractional Deep Belief Networks for speaker
emotion recognition. Alex. Eng. J. 2017, 56, 485–497. [CrossRef]
19. Medhekar, D.S.; Bote, M.P.; Deshmukh, S.D. Heart disease prediction system using naive Bayes. Int. J.
Enhanc. Res. Sci. Technol. Eng. 2013, 2, 1–5.
20. Meng, X.; Gao, X.Z.; Lu, L.; Liu, Y.; Zhang, H. A new bio-inspired optimisation algorithm:
Bird Swarm Algorithm. J. Exp. Theor. Artif. Intell. 2016, 28, 673–687. [CrossRef]
177
Appl. Sci. 2020, 10, 6626
21. Mohan, S.; Thirumalai, C.; Srivastava, G. Effective Heart Disease Prediction Using Hybrid Machine
Learning Techniques. IEEE Access 2019, 7, 81542–81554. [CrossRef]
22. Nilashi, M.; Ahmadi, H.; Manaf, A.A.; Rashid, T.A.; Samad, S.; Shahmoradi, L.; Aljojo, N.; Akbari, E.
Coronary Heart Disease Diagnosis Through Self-Organizing Map and Fuzzy Support Vector Machine with
Incremental Updates. Int. J. Fuzzy Syst. 2020, 23, 1376–1388. [CrossRef]
23. Nourmohammadi-Khiarak, J.; Feizi-Derakhshi, M.R.; Behrouzi, K.; Mazaheri, S.; Zamani-Harghalani, Y.;
Tayebi, R.M. New hybrid method for heart disease diagnosis utilizing optimization algorithm in
feature selection. Health Technol. 2019, 10, 667–678. [CrossRef]
24. Oyyathevan, S.; Askarunisa, A. An expert system for heart disease prediction using data mining technique:
Neural network. Int. J. Eng. Res. Sports Sci. 2014, 1, 1–6.
25. Palaniappan, S.; Awang, R. Intelligent heart disease prediction system using data mining techniques.
In Proceedings of the International Conference on Computer Systems and Applications, Doha, Qatar,
31 March–4 April 2008.
26. Palaniappan, S.; Awang, R. Intelligent Heart Disease Prediction System Using Data Mining Techniques. Int. J.
Comput. Sci. Netw. Secur. 2008, 8, 343–350.
27. Patil, S.B.; Kumaraswamy, Y.S. Extraction of significant patterns from heart disease warehouses for heart
attack prediction. Int. J. Comput. Sci. Netw. Secur. 2009, 9, 228–235.
28. Pattekari, S.A.; Parveen, A. Prediction system for Heart Disease using Naive Bayes. Int. J. Adv. Comput.
Math. Sci. 2012, 3, 290–294.
29. Ranganatha, S.; Raj, H.P.; Anusha, C.; Vinay, S.K. Medical data mining and analysis for heart disease dataset
using classification techniques. In Proceedings of the National Conference on Challenges in Research &
Technology in the Coming Decades (CRT), Ujire, India, 27–28 September 2013.
30. Reddy, G.T.; Reddy, M.P.K.; Lakshmanna, K.; Rajput, D.S.; Kaluri, R.; Srivastava, G. Hybrid genetic algorithm
and a fuzzy logic classifier for heart disease diagnosis. Evol. Intell. 2019, 13, 185–196. [CrossRef]
31. Safdar, S.; Zafar, S.; Zafar, N.; Khan, N. Machine learning based decision support systems (DSS) for heart
disease diagnosis: A review. Artif. Intell. Rev. 2018, 50, 597–623. [CrossRef]
32. Shah, S.M.S.; Shah, F.A.; Hussain, S.A.; Batool, S. Support Vector Machines-based Heart Disease Diagnosis
using Feature Subset, Wrapping Selection and Extraction Methods. Comput. Electr. Eng. 2020, 84, 106628.
[CrossRef]
33. Shouman, M.; Turner, T.; Stocker, R. Using data mining techniques in heart disease diagnosis and treatment.
In Proceedings of the IEEE Japan-Egypt Conference on Electronics, Communications and Computers,
Alexandria, Egypt, 6–9 March 2012; pp. 173–177.
34. Subbalakshmi, G. Decision support in heart disease prediction system using naive bayes. Indian J. Comput.
Sci. Eng. 2011, 2, 170–174.
35. Thiyagaraj, M.; Suseendran, G. Enhanced Prediction of Heart Disease Using Particle Swarm
Optimization and Rough Sets with Transductive Support Vector Machines Classifier. In Data Management,
Analytics and Innovation; Springer: Singapore, 2020; Volume 2, pp. 141–152.
36. Yeh, D.Y.; Cheng, C.H.; Chen, Y.W. A predictive model for cerebrovascular disease using data mining.
Expert Syst. Appl. 2011, 38, 8970–8977. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
A Decision Support System for Elective Surgery
Scheduling under Uncertain Durations
Daniel Clavel 1 , Cristian Mahulea 1, *, Jorge Albareda 2 and Manuel Silva 1
1 Department of Computer Science and Systems Engineering, University of Zaragoza, 50018 Zaragoza, Spain;
[email protected] (D.C.); [email protected] (M.S.)
2 “Lozano Blesa” Hospital of Zaragoza, 50009 Zaragoza, Spain; [email protected]
* Correspondence: [email protected]; Tel.: +34-976-762517
Featured Application: A software tool for helping doctors in the automatic scheduling of elective
patients is proposed. The tool has been developed considering the organizational structure of
Spanish hospitals. However, it may be easily extended to any hospital (public or private) working
under an operation room block booking system.
Abstract: The operation room (OR) is one of the most expensive material resources in hospitals.
Additionally, the demand for surgical services is increasing due to the aging population, while the number of surgical interventions performed has stagnated for budgetary reasons. In this context,
the importance of improving the efficiency of the surgical service is accentuated. The main objective
of this work is to propose and to evaluate a Decision Support System (DSS) for helping medical staff in
the automatic scheduling of elective patients, improving the efficiency of medical teams’ work. First,
the scheduling criteria are fixed and then the scheduling problem of elective patients is approached
by a mathematical programming model. A heuristic algorithm is proposed and included in the DSS.
Moreover, other different features are implemented in a software tool with a friendly user interface,
called CIPLAN. Considering realistic data, a simulation comparison of the scheduling obtained using
the approach presented in this paper and other similar approaches in the bibliography is shown and
analyzed. On the other hand, a case study considering real data provided by the Orthopedic Surgical
Department (OSD) of the “Lozano Blesa” hospital in Zaragoza (HCU) is proposed. The simulation
results show that the approach presented here obtains occupation rates and confidence levels of not exceeding the available time similar to those of the approaches in the bibliography. However, from the point of view of respecting the order of the patients in the waiting list, the approach in this paper produces much more ordered schedules. In the case of the Orthopedic Surgical Department of the “Lozano Blesa” hospital in Zaragoza, the occupation rate may be increased by 2.83%, which represents a saving of about 110,000 euros per year. Moreover, the medical doctors who use this tool consider CIPLAN an intuitive, rapid, and efficient software solution that makes the scheduling task easier.
Keywords: decision support system; heuristic approach; surgery scheduling; software tool;
case study
1. Introduction
The Operation Room (OR) is one of the most expensive material resources in hospitals (in Spain it accounts for about 40% of the total hospitalization cost [1]), and approximately 60% of patients need it at some point during their hospital stay [2]. On the other hand, the demand for surgical services is increasing due to the aging population. In Spain, the average life expectancy is about 81.8 years, one of the highest in Europe [3].
Moreover, as a result of the economic crisis in 2008, the number of surgical interventions performed
annually in Spanish public hospitals has stagnated at 3.5 million. This situation has also contributed
to a continuous increment in the number of patients in the surgical waiting lists, reaching 600,000
in 2017 [4]. In some cases, these long lists lead to waiting times greater than six months, which is becoming a serious problem for the healthcare system. Since it is difficult to increase the surgical resources, improving their efficiency becomes all the more important. New management techniques, scheduling
methods, and specific information systems could help to improve the efficiency of the surgical services
and, consequently, could reduce the surgical waiting lists.
The main objective of this work is to provide a DSS for managing surgical services in hospitals.
In particular, a software tool (called CIPLAN from the Spanish words PLANificación de CIrugías) is
presented that may be used by medical doctors to automatically perform a scheduling of the patients.
It is assumed that the uncertain parameters (surgery duration and cleaning time) follow a normal
distribution and, consequently, the expected total duration of a surgical block also follows a normal
distribution. In this way, the probability of exceeding the working time can be introduced as a capacity
chance constraint. Even though the lognormal distribution may approximate surgery times better [5], it leads to an optimization problem with nonlinear constraints that is difficult to solve. Since the normal distribution generally yields an acceptable error, this work adopts it.
The following three criteria have been considered for the scheduling problem.
C1. Maximize the occupation of the ORs. Since the ORs with their associated cost constitute the
bottleneck of the system (they can be open only during a given period of time every working day and
their cost is high), the first scheduling criterion is to maximize the occupation of the ORs. In this way,
the underutilization of the ORs is prevented and the use of resources is maximized.
C2. Ensure with a minimum confidence level that the total available time in a time block is not going to
be exceeded. In contrast to C1, over-scheduling of the ORs is not desired, since the last scheduled
surgeries can be canceled if the medical personnel believes that the time of the time block will be
exceeded. Among other problems, such cancellations reduce the real occupation rate in the ORs.
C3. Respect as much as possible the order of the patients in the waiting list. Due to ethical and common
sense reasons, patients with the same urgency level should spend a similar time in the waiting list.
For that, once the patients in a waiting list are ordered depending on the urgency level and the
admission date, the third scheduling criterion is to respect as much as possible the order of the patients
in the waiting list. For example, in the Orthopedic Department of the HCU, each team has two time blocks per week assigned for the elective patients, so the difference between scheduling a patient in the first time block and in the sixth one is 21 days.
Mathematical programming is an optimization technique [6] commonly used in managing
hospitals [7], not only for surgery scheduling problems [8], but also for other problems such as bed
management [9], nurse scheduling [10], or management policy decisions [11]. This paper first proposes a Mixed Integer Quadratic Constrained Programming (MIQCP) problem for scheduling the patients based on the three criteria presented before. Since the computational complexity of obtaining the optimal solution is high, a heuristic method called the Specific Heuristic Algorithm (SHA) is proposed [12]. The software tool, in addition to helping doctors in the scheduling task, has also been
developed for managing medical doctors, medical teams, OR timetable, and patients.
A preliminary version of this work has been reported in [12] where a first version of the SHA has
been proposed. In this paper, the SHA heuristic is improved by adding a post-arrangement of the scheduled time blocks. This improvement provides solutions that better respect the order of the patients in the waiting list. Additionally, the DSS is explained (not only the part related to surgery scheduling
but also related to management tasks), and the solution is compared using real data from the HCU
with two other methods in literature.
2. Related Work
In the literature, many researchers have studied the problem of planning and scheduling elective patients (see [8,13] and the references therein for an overview; for an overview of the problem variants, see [14]).
Almost all the works consider criterion C1 presented before. Criterion C2 (of using chance constraints
for the scheduling) has also been considered in some contributions presented in recent years. For example,
in [15], integer stochastic formulations are proposed for two sub-problems in which the original
problem is decomposed. The approach to solve them is based on neighborhood search techniques and
Monte Carlo simulation. In [16], a two-stage robust optimization model is proposed that is solved
based on a column-and-constraint generation method. A similar method is used by the authors in [17]
to deal with uncertain surgery duration and emergency demand, while, in [18], the authors use particle
swarm optimization for the scheduling of surgeries.
The approaches presented in [19,20] are the most related to the one presented in this paper.
The authors in [19] present an optimization framework for batch scheduling within a block booking
system that maximizes the expected utilization of ORs subject to a set of probabilistic capacity
constraints. They propose an algorithm that iteratively solves a series of mixed-integer programs that
are based on a normal approximation of cumulative surgery duration. In [20], constructive and local
search heuristics for the maximization of OR utilization and the minimization of the overtime risk are
proposed. In their model, to address the randomness of surgery processing times, a planned time
slack is reserved in each scheduling block, which is a function of total mean and variance of surgeries
assigned to the corresponding block. When determining an appropriate size of the planned slacks,
the authors assume that the sum of surgery duration follows a normal distribution.
Both approaches in [19,20] require setting in advance the patients that are going to be scheduled
in the next blocks. Therefore, in these approaches, all considered patients must be scheduled in one
of the available blocks. For this, both approaches start with an initial scheduling obtained through
the scheduling rule: first-fit probabilistic. Following this rule, sequentially each surgery is assigned to
the first available block for which the probabilistic capacity constraint is satisfied after the assignment.
Once the initial scheduling is obtained, the expected occupation rate of the first blocks is improved by
rescheduling the surgeries. In this way, the last blocks are totally or partially released.
Unlike [19,20], the approach presented in this paper does not require knowing in advance the set of patients that should be scheduled in the next surgical blocks, since any patient on the waiting list may or may not be scheduled. Another big difference with respect to [19,20] is the use of two balanced terms in the cost function that favor scheduling the patients in an orderly manner while, at the same time, the expected occupation rate of the OR is maximized. An exhaustive numerical comparison using
realistic data is performed and analyzed showing that the approach presented in this manuscript
provides better solutions to the considered problem than the ones in [19,20]. In particular, even if the results show similar occupation rates and similar confidence levels, the approach presented in this paper respects the preference order of the patients in the waiting list more closely. Therefore, criterion C3 is
better considered. In the case of the hospital in which the approach is implemented, the number of
available time blocks per week for each medical team is equal to two, while, on average, three patients
are scheduled in a time block, making it very important to keep, if possible, the order of patients in
the waiting lists. This is one of the main differences of the approach presented in this paper with the
other approaches.
DSSs are used for managing tasks in different fields of application, for example in stormwater
management [21], dental manufacturing production line [22], offshore wind farms [23] and healthcare
systems [24,25]. Moreover, a DSS for OR scheduling was proposed in [26] based on a Mixed Integer
Linear Programming (MILP) problem [27,28] that takes into account the preference order of the
patients. However, this approach does not consider the variability of the uncertain parameters (surgery duration, cleaning time, etc.) and consequently cannot manage the risk of exceeding the total time. The DSS proposed in this work uses a heuristic algorithm that allows imposing a minimum confidence level of not exceeding the total time in a surgical block. In this way, doctors not only can
manage the risk but also can know in advance the probability that each block exceeds the available
time. The DSS proposed in this paper can also perform the management of medical teams and OR
time table. Moreover, it is possible to perform iterative scheduling which is useful if, for example,
after obtaining a first scheduling, some patients cannot attend on the day of their surgeries. Furthermore,
after a surgery is performed, the DSS automatically updates and customizes the average and standard
deviation of the surgery duration. Finally, in order to provide a safe environment to work and better
security of the medical data, different access levels are provided to every user.
The second main contribution of this paper is the presentation of a DSS that can help medical
doctors not only with the scheduling of surgeries, but also with management tasks involving patients, medical teams, hospital resources, etc. Finally, the scheduling approach presented is compared with
real historical data from the studied hospital, this being a third contribution of this manuscript.
Because all these variables are assumed to follow a normal distribution, the total duration Tbi of the time block bi also follows a normal distribution, Tbi = N(μTbi , σTbi ).
The goal of the scheduling problem is the assignment of the patients from the waiting list W to
the set of time blocks B considering the scheduling criteria C1, C2 and C3. Let us define the following
further notations:
• μw be a row vector containing the average duration of surgeries of the patients in the waiting
list W . For example, μw ( j) represents the average duration of the surgery corresponding to the
patient w j .
• Po be a row vector containing the preference order of the patients. Assuming that patients are
ordered according to their priority, then Po( j) = j.
• Cl is the minimum confidence level of not exceeding the total time.
then:
Criterion C1 is formalized by objective O1 given by Equation (2). This equation maximizes the
expected surgery time in each block (μw · Si ) giving more importance to the first ones (|B| + 1 − i ):
    max  Σ_{i=1..|B|}  [ μw · Si · (|B| + 1 − i) ] .                          (2)
Criterion C2 is imposed by the chance probability constraint given by Equation (3). It ensures
with a minimum confidence level Cl that the total duration Tbi in each time block bi ∈ B does not
exceed the available time l(bi):

    P[ Tbi ≤ l(bi) ] ≥ Cl,    ∀ i = 1, 2, . . . , |B|.                        (3)
Finally, criterion C3 is formalized by the objective O2 given by Equation (4). Here, by minimizing
the sum of the preference order of the patients scheduled in each time block (Po · Si ), preference to the
first patients in the waiting list is given. Moreover, multiplying by the term (|B| + 1 − i ), first patients
should be scheduled in the first time blocks:
    min  Σ_{i=1..|B|}  [ Po · Si · (|B| + 1 − i) ] ,                          (4)
Combining objectives O1 and O2 through the balancing parameter β and imposing the capacity chance constraint, the following MIQCP problem is obtained:

    min  Σ_{i=1..|B|}  [ −μw · Si · (|B| − i + 1) + β · Po · Si · (|B| − i + 1) ]

    subject to:                                                               (5)

        P[ Tbi ≤ l(bi) ] ≥ Cl,            ∀ i = 1, 2, . . . , |B|
        Σ_{i=1..|B|}  Si[j] ≤ 1,          ∀ j = 1, 2, . . . , |W|
        Si ∈ {0, 1}^(1×|W|),              ∀ i = 1, 2, . . . , |B|
The MIQCP model has two sets of constraints. The first one is related to the chance probability
constraint given in Equation (3), while the second one imposes that each patient in the waiting list is
scheduled at most once. By developing the chance constraint (3), quadratic constraints are obtained [29].
Moreover, the decision variables Si are binary, so the computational complexity of the MIQCP model
is very high. In order to obtain a schedule in a reasonable time, heuristic approaches are needed.
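To make the structure of model (5) concrete, the following Python sketch evaluates it by brute force on a toy instance. The durations, standard deviations, the value of Cl, and the assumption that surgery durations are independent (so that block variances add) are all illustrative, and the chance constraint is checked through its deterministic normal-distribution equivalent derived in the next section. This is not the algorithm used in the DSS; it only illustrates why exhaustive search does not scale beyond a handful of patients and blocks.

    # Brute-force illustration of model (5) on a toy instance (illustrative data only).
    from itertools import product
    from math import sqrt
    from statistics import NormalDist

    mu    = [90, 110, 90, 125, 90, 135]   # average surgery durations (min)
    sigma = [15, 20, 15, 25, 15, 30]      # assumed standard deviations (min)
    n_blocks, block_len, Cl, beta = 2, 420, 0.85, 2.6
    z = NormalDist().inv_cdf(Cl)          # VCl in the notation of the paper

    def feasible(assign):
        """assign[j] = block index of patient j, or None if not scheduled."""
        for i in range(n_blocks):
            idx = [j for j, b in enumerate(assign) if b == i]
            m = sum(mu[j] for j in idx)
            s = sqrt(sum(sigma[j] ** 2 for j in idx))
            if m + z * s > block_len:     # deterministic equivalent of constraint (3)
                return False
        return True

    def cost(assign):
        c = 0.0
        for i in range(n_blocks):
            idx = [j for j, b in enumerate(assign) if b == i]
            weight = n_blocks - i         # the paper's (|B| - i + 1) with 0-based block index
            c += weight * sum(-mu[j] + beta * (j + 1) for j in idx)   # Po(j) = j + 1
        return c

    best = min((a for a in product([None, *range(n_blocks)], repeat=len(mu)) if feasible(a)),
               key=cost)
    print("best assignment (patient -> block):", best)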
4. Proposed Solution
where Dt is the delay with respect to the starting time and Ct is the cleaning time duration after
each surgery.
As observed before, Tbi is following a normal distribution with mean
For {Sk , Sk , Sj } to be a possible scheduling type, the chance constraint (3) must be satisfied. In order to check this constraint, based on a property of the normal distribution [34], the inequality
    P[ Tbi ≤ l(bi) ] ≥ Cl

is equivalent to

    ( l(bi) − μTbi ) / σTbi  ≥  VCl ,
where VCl is the value of the standard normal variable (x ∼ N(0, 1)) whose cumulative distribution function equals Cl, i.e., P[x ≤ VCl ] = Cl; this value is tabulated.
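As a small numerical illustration of this equivalence, the fragment below checks one time block by comparing (l(bi) − μTbi)/σTbi with VCl; the block mean, standard deviation, block length, and Cl are made-up values.

    # Checking the chance constraint of one block via its deterministic equivalent.
    from statistics import NormalDist

    mu_block, sigma_block, l_bi, Cl = 355.0, 38.0, 420.0, 0.85   # illustrative values
    V_Cl = NormalDist().inv_cdf(Cl)                              # tabulated value VCl
    satisfied = (l_bi - mu_block) / sigma_block >= V_Cl          # same as P[T <= l] >= Cl
    confidence = NormalDist(mu_block, sigma_block).cdf(l_bi)     # actual confidence level
    print(f"constraint satisfied: {satisfied}, confidence level: {confidence:.1%}")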
Part (ii): Scheduling of the time blocks, consisting of the following four steps executed sequentially
for each time block bi to schedule.
Step 3: For each possible scheduling type of time block bi , obtain the sets of real scheduling. Given a
possible scheduling type, a real scheduling is composed of a set of patients whose surgery types match that scheduling type.
For example, considering the scheduling type {Sk , Sk , Sj }, a real scheduling is composed of two patients on which a surgery sk ∈ Sk should be performed and one patient on which a surgery sj ∈ Sj should be performed. The first real scheduling corresponding to {Sk , Sk , Sj } is chosen by taking the patients with the lowest preference-order numbers (corresponding to the highest preference). In addition, all other combinations of patients matching the chosen scheduling type that contain no patient with a preference order greater than the maximum preference order in this first real scheduling will also be considered.
Let us assume, for example, the list of patients given in Table 1 with W = {w1 , w2 , w3 , w4 , w5 , w6 }
such that patients w1 , w3 , and w5 have assigned the surgery s1 , the patient w2 has assigned the
surgery s2 , the patient w4 the surgery s3 , and the patient w6 the surgery s4 . Assuming the set of
surgeries S = {S1 , S2 , S3 } with S1 = {s1 , s2 }, S2 = {s3 } and S3 = {s4 } and two possible scheduling
types {S1 , S1 , S2 } and {S1 , S2 , S3 }, the first real scheduling considered for {S1 , S1 , S2 } is {w1 , w2 , w4 }.
The maximum order of patients in this real scheduling is 4 (corresponding to w4 ), hence the following
real scheduling will be considered as well: {w1 , w3 , w4 } and {w2 , w3 , w4 }. For the scheduling type
{S1 , S2 , S3 }, the first real scheduling is {w1 , w4 , w6 }. Since the maximum preference order of patients is
6, the following real scheduling will be considered as well: {w2 , w4 , w6 }, {w3 , w4 , w6 } and {w4 , w5 , w6 }.
All of these real schedulings are shown in Table 2.
Table 1. Surgery data and waiting list of patients used for explaining part 2 of the SHA.

    Surgeries Data
    S      Type    Average Duration (min)
    S1     s1      90
    S1     s2      110
    S2     s3      125
    S3     s4      135

    Waiting List
    Patient         w1    w2    w3    w4    w5    w6
    Surgery type    s1    s2    s1    s3    s1    s4
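The enumeration described in Step 3 can be sketched as follows. The snippet reproduces the Table 1 data and, for a given scheduling type, generates the first real scheduling together with the additional combinations bounded by its maximum preference order; the function and variable names are ours, not the paper's.

    # Sketch of Step 3: enumerate the real schedulings of one scheduling type (data from Table 1).
    from itertools import product

    waiting_list = {1: "s1", 2: "s2", 3: "s1", 4: "s3", 5: "s1", 6: "s4"}  # patient -> surgery
    groups = {"S1": {"s1", "s2"}, "S2": {"s3"}, "S3": {"s4"}}              # surgery groups

    def real_schedulings(sched_type):
        candidates = [sorted(p for p, s in waiting_list.items() if s in groups[g])
                      for g in sched_type]
        combos = {tuple(sorted(c)) for c in product(*candidates) if len(set(c)) == len(c)}
        combos = sorted(combos)
        cap = max(combos[0])          # max preference order of the first real scheduling
        return [c for c in combos if max(c) <= cap]

    print(real_schedulings(["S1", "S1", "S2"]))   # [(1, 2, 4), (1, 3, 4), (2, 3, 4)]
    print(real_schedulings(["S1", "S2", "S3"]))   # [(1, 4, 6), (2, 4, 6), (3, 4, 6), (4, 5, 6)]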
Step 4: Evaluate the real scheduling and select the best one for each scheduling type. First, the real
scheduling not fulfilling the chance constraint (3) is removed and the check is similar to the one in
Step 2. For the remaining real scheduling, the expected occupation rate r is computed by summing
the average duration of all surgeries in the real scheduling divided by the time block duration l (bi )
(multiplied by 100 if percentage). Furthermore, the average preference order Ap is computed by
summing the preference order of all patients in the real scheduling divided by the number of patients
in the real scheduling. The real scheduling with the minimum value of the following fitness function
is chosen:
H = ( Ap − Min Ap ) · β + ( Maxr − r ), (6)
where Min Ap is the minimum average preference order of all real scheduling; Maxr is the maximum
occupation rate r of all real scheduling; and β is a balancing parameter between the two terms (one for
preference order and one for occupation rate).
For example, Table 2 gives the computations of the occupation rate, the average preference order, and the fitness function (fifth column) for the set of real schedulings considered before. In this example, a value of β = 2.6 is used and the time block duration is assumed to be l(bi ) = 420 min, corresponding to
7 h. In this step, the real scheduling w1 w2 w4 is chosen for the possible scheduling type S1 S1 S2 and
w2 w4 w6 is chosen for S1 S2 S3 .
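A minimal sketch of this selection, reproducing the numbers of the example (durations from Table 1, β = 2.6, l(bi) = 420 min), is given below; only the evaluation of the fitness function (6) is shown, the chance-constraint filter of Step 4 is omitted, and the last line anticipates the Step 5 selection described next.

    # Sketch of Steps 4-5: rank candidate real schedulings with the fitness function (6).
    durations = {1: 90, 2: 110, 3: 90, 4: 125, 5: 90, 6: 135}   # from Table 1 (min)
    beta, block_len = 2.6, 420.0

    def best_scheduling(schedulings):
        """Return the scheduling minimizing H = (Ap - MinAp) * beta + (Maxr - r)."""
        r  = {s: 100.0 * sum(durations[p] for p in s) / block_len for s in schedulings}
        ap = {s: sum(s) / len(s) for s in schedulings}
        min_ap, max_r = min(ap.values()), max(r.values())
        return min(schedulings, key=lambda s: (ap[s] - min_ap) * beta + (max_r - r[s]))

    best_1 = best_scheduling([(1, 2, 4), (1, 3, 4), (2, 3, 4)])             # Step 4: w1 w2 w4
    best_2 = best_scheduling([(1, 4, 6), (2, 4, 6), (3, 4, 6), (4, 5, 6)])  # Step 4: w2 w4 w6
    print(best_scheduling([best_1, best_2]))                                # Step 5: (2, 4, 6)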
Step 5: Evaluate the best real scheduling chosen in Step 4 (one for each scheduling type) and select the best
one for the current time block bi . From the set of selected real scheduling in the previous step, the one
with minimum value of the fitness function (6) is chosen. Notice that the value of the fitness function
of a real scheduling in this step is, in general, different from the one computed in Step 4. In particular, Min Ap and Maxr have different values since they are computed based only on the best real schedulings selected in Step 4.
The sixth column of Table 2 shows the computation of the fitness function for the two real schedulings selected in the previous step. Hence, in this step, Min Ap = 2.33 corresponds to w1 w2 w4 while Maxr = 88.09 corresponds to w2 w4 w6 . According to them, patients w2 w4 w6 are selected for the time block bi .
Step 6: Remove the scheduled patients. The patients scheduled in Step 5 are removed from the
waiting list, and the procedure is iterated for time block bi+1 from Step 3.
By using the case study in the HCU hospital, it has been observed that values of t = 3 and of
β = 2.6 are appropriate. However, for different medical departments, a previous study is necessary in
order to better choose these design parameters.
Part (iii): Sorting the time blocks with the same available time. First, the time blocks are grouped
depending on their available time l (bi ). Then, the time blocks of each group are sorted in ascending
order according to the average preference order of the scheduled patients (indicator Ap). At the end,
for each date, a time block is assigned sequentially starting with the closest one.
the DSS. Moreover, the head of the department should define the ORs time-table in the DSS. The OR,
the date, the starting time, and the ending time of each time block assigned to each team is the
information that must be included in the DSS.
Updating the waiting list. Generally, newly arrived patients are included at the end of the waiting
list. However, depending on medical criteria, the surgeon could decide to advance the patient into a
higher position in the waiting list. The DSS automatically orders the waiting list of patients. There are
two parameters influencing directly the position of the patient in the waiting list.
1. The first parameter is related to the waiting time for surgery. A score S1 of 10 is given to the
patient with the highest number of waiting days, while the newest patient has a score of 0.
A proportional score between 10 and 0 is given to other patients. In the calculation of total score
(denoted ST ), the score S1 has a weight of p.
2. The second parameter has to do with the surgery priority. Although non-urgent surgeries are
scheduled with the DSS, three levels of priorities 1, 2, and 3 are considered by doctors depending
on the urgency. A corresponding score (S2 ) of 0, 5, or 10 is assigned, respectively.
    ST = p · S1 + S2                                                          (7)
Finally, the patients are ordered according to their total score. The patient with the highest total
score will be the first, while the patient with the lowest score will be the last one.
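As an illustration of this ordering rule, the following sketch computes the total score ST = p · S1 + S2 of Equation (7) for a few made-up patients; the weight p is not specified here and is an assumption of the example.

    # Sketch of the waiting-list ordering (Equation (7)); p and the patient data are illustrative.
    p = 1.5
    patients = [("w1", 120, 3), ("w2", 45, 1), ("w3", 200, 2), ("w4", 10, 3)]  # (name, waiting days, priority)
    priority_score = {1: 0, 2: 5, 3: 10}                 # S2 for priorities 1, 2 and 3
    min_d = min(d for _, d, _ in patients)
    max_d = max(d for _, d, _ in patients)

    def total_score(patient):
        _, days, priority = patient
        s1 = 10.0 * (days - min_d) / (max_d - min_d)     # newest patient -> 0, longest wait -> 10
        return p * s1 + priority_score[priority]         # ST

    ordered = sorted(patients, key=total_score, reverse=True)
    print([name for name, _, _ in ordered])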
Iterative planning. The scheduler of each medical team performs the scheduling for the next |B|
time blocks (this is done by using the SHA). The secretary calls the patients scheduled and asks them
for their availability on the scheduled date. The secretary gives back this information to the team
scheduler who should schedule the empty gaps again. This process is repeated until the |B| time
blocks are completely scheduled with all patients confirmed.
During the iterative scheduling, if a patient confirms the attendance, she/he is fixed in the
corresponding time block. However, in the case that a patient cannot be contacted or she/he cannot be
hospitalized, then the patient is not considered for the scheduling of the next |B| time blocks.
Thus, in the next iteration, the patients previously confirmed and the patients who cannot be
hospitalized are not considered in the waiting list. The SHA schedules new patients in the gaps of the
uncompleted time blocks.
Updating and customizing the average duration. The average duration and standard deviation
of each type of surgery should be computed using historical data from the hospital department.
Considering two years, a sufficiently high number of surgeries of each type is obtained, and the
average duration is representative. However, depending on the skills and experience of the surgeons,
significant differences could exist between their average durations. Moreover, for the same medical doctor, the average duration decreases continuously as experience is gained, so these input values should be dynamically updated.
When a surgery is finished, the name of the surgeon and the time spent are introduced in the DSS.
The DSS registers this information in a database and updates the average duration.
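The paper does not detail how this update is computed; one common possibility, sketched below, is an incremental (Welford-style) update of the per-surgeon, per-surgery-type mean and standard deviation every time a duration is registered. The class and function names are hypothetical.

    # Hypothetical sketch of updating per-surgeon duration statistics after each surgery.
    from dataclasses import dataclass
    from math import sqrt

    @dataclass
    class DurationStats:
        n: int = 0
        mean: float = 0.0
        m2: float = 0.0                    # running sum of squared deviations

        def add(self, duration_min: float) -> None:
            self.n += 1
            delta = duration_min - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (duration_min - self.mean)

        @property
        def std(self) -> float:
            return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    stats = {}                             # (surgeon, surgery type) -> DurationStats
    def register_surgery(surgeon, surgery, duration_min):
        stats.setdefault((surgeon, surgery), DurationStats()).add(duration_min)

    register_surgery("Dr. A", "knee arthroplasty", 118)
    register_surgery("Dr. A", "knee arthroplasty", 102)
    print(stats[("Dr. A", "knee arthroplasty")].mean)    # 110.0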
Overview of the DSS. All the previously explained features are integrated into the DSS whose
flowchart is given in Figure 1.
Figure 1. Flowchart of the DSS. The steps from the moment when a patient arrives until she/he
is surgically operated are shown. The discontinuous colored boxes reference the five main panels
included in the developed software tool: (1) Medical teams; (2) Patients; (3) Surgeries; (4) OR timetable;
and (5) Scheduling.
When a new patient arrives at the service, she/he is introduced into the waiting list of a medical
doctor. Each surgeon is responsible for including their patients in the system. The DSS recognizes the
surgeon (using a personal password) and she/he must enter some personal information of the patient,
the pathology and the priority of the surgery. Additionally, the DSS saves the actual date (to compute
the waiting time in the list) and the surgeon to whom the patient belongs. When a team scheduler
wants to perform a scheduling, she/he is recognized by the DSS, which automatically composes the ordered waiting list of her/his patients.
The DSS has a database, which is updated every time that a surgery is performed. The pathology
and its surgery durations are saved in the database. Thus, considering this information, the tool assigns
average theoretical duration and standard deviation to each surgery in the waiting list. In this way,
the vectors μw and σ w are generated. On the other hand, the DSS saves the OR time-table introduced
previously by the head of the department—that is, the time blocks booked for each team as well as their
duration defined as l (bi ). The DSS performs an operation scheduling in an iterative way. The input
data that the team’s manager has to introduce in the DSS to schedule the next time blocks are: (i) the
minimum confidence level Cl and (ii) the number of time blocks to schedule |B|.
When a patient has been scheduled, his/her state changes from “pending” to “scheduled”. Once
a specific surgery is performed, the surgeon indicates this situation in the DSS and the corresponding
patient is removed from the waiting list. Additionally, the surgeon introduces the operating time in the
tool and these new input data are used to update the average duration. If finally a scheduled surgery
is not performed, the DSS changes the state of this surgery from “scheduled” to “pending”.
Based on the previously described DSS, a Software tool for scheduling elective patients in Spanish
Hospitals called CIPLAN has been developed. A first version of this software was described in [12].
CIPLAN has been developed in the Java language and uses an SQL database. Moreover, in order to provide
a safe work environment and a better security of the medical data, different access levels are provided
to the users. The preferences and concerns of medical doctors in the OSD have been considered not
only at the surgery scheduling level but also at the user interface level. Thus, a friendly interface
divided into six main panels was proposed. These panels are based on the DSS and the first five are
shown with color boxes in the flowchart of Figure 1: (1) Medical teams, (2) Patients, (3) Surgeries, (4) OR
timetable, (5) Scheduling, and (6) User. Figure 2 shows the interface of the scheduling panel.
Figure 2. Scheduling panel. The main interface of the software tool is shown. In particular, in the scheduling panel, once the number of blocks to schedule, the minimum confidence level, and the surgical team are introduced, a summary of the scheduling performed is shown at the bottom.
5. Simulations
In this section, some scheduling results derived from simulations are described and analyzed.
The approach presented here is compared with two approaches in literature.
Figure 3. Occurrence of surgeries. The occurrence of some types of surgery that can be performed in
the OSD of the LBH is shown.
For each set of simulations, 50 replications are performed. Each replication is a schedule of
156 time blocks (78 weeks × 2 time blocks per week) all of them having the same time duration of
l (bi ) = 390 min (6.5 h). The average values of the probability of not exceeding the total time (confidence
level) and the occupation rate are computed. Moreover, the total overtime and the total number of
treated patients per year are considered.
In order to be able to compare two different schedules from the point of view of the order of the
patients, indicator Ω is defined. This indicator measures the disorder of the patients in the obtained
scheduling, so the smaller it is, the more ordered the patients are. To compute this value, for each
time block, an interval [ f i , li ] is defined. If the preference order of the surgeries scheduled in the time
block bi belongs to the interval [ f i , li ], Ω is not increased. On the contrary, each patient w j with a
preference order j outside the interval increases the value of Ω. The formal calculation of Ω is given in
Algorithm 1, where Np is the total number of patients scheduled and Pd = Np / |B| is the average number of patients scheduled per time block.
Algorithm 1 computes first the lower ( f i ) and upper (li ) preference orders of the patients that
should be ideally scheduled in the time block bi . However, we allow one block before and one block
after without any penalization and, for this reason, f i = Pd · (i − 2) and li = Pd · (i + 1). Then,
for all patients scheduled in bi , if their preference order in the waiting list does not belong to [ fi , li ], the Ω parameter is increased by a value depending on the deviation (step 4).
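Since Algorithm 1 itself is not reproduced in this reprint, the following sketch follows the textual description above: intervals [fi, li] with one block of tolerance on each side and a penalty growing with the deviation. The exact step-4 increment is not given in the text, so the distance (in preference positions) from the allowed interval is used as an assumption.

    # Sketch of the disorder indicator Omega; the step-4 penalty is an assumption (distance to [fi, li]).
    def omega(schedule, n_blocks):
        """schedule: list over time blocks, each a list of preference orders scheduled there."""
        n_patients = sum(len(block) for block in schedule)
        pd = n_patients / n_blocks                 # average number of patients per block
        total = 0.0
        for i, block in enumerate(schedule, start=1):
            f_i, l_i = pd * (i - 2), pd * (i + 1)  # one block earlier / later allowed
            for j in block:
                if j < f_i:
                    total += f_i - j
                elif j > l_i:
                    total += j - l_i
        return total

    perfect  = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    shuffled = [[1, 9, 3], [4, 5, 6], [7, 8, 2]]
    print(omega(perfect, 3), omega(shuffled, 3))   # 0.0 for the perfectly ordered schedule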
Table 4. Comparison of the one-and-a-half-year scheduling using different chance-constrained
approaches with a minimum confidence level of 70%.
The three approaches analyzed in Table 4 improve the occupation rates obtained by using the
first-fit probabilistic rule. However, the improvement in the occupation rate of the time block implies
a decrement in the confidence level. For example, the Batch Scheduling approach [19] achieves the
highest occupation rate (79.06%) and the highest number of treated patients (447.78), and, consequently,
the lowest confidence level (75%) and the highest total overtime (1183 min) are obtained. Taking into account the pairs of values of occupation rate and confidence level, the Batch Scheduling approach obtains the best solution, with: (1) the highest occupation rate and (2) a confidence level within the allowed range.
The SHA approach presented in this paper obtains a slightly worse occupation rate (78.28%) than the Batch Scheduling (79.06%). However, considering the order of the patients given by the indicator Ω, it can be seen that the SHA approach obtains by far the best scheduling (Ω = 395). Medical doctors of the OSD consider that the scheduling obtained using the SHA approach is the best one since it better maintains the order of the waiting list. Since the patients are more ordered with the SHA scheduling, it becomes possible to provide an estimated waiting period for the patients in the list, which is a possible extension.
Table 5. OR timetable: weekly assignment of the medical teams to the ORs of the OSD.

            OR 1       OR 2       OR 3
    Mon     team 1     team 2     team 3
    Tue     team 4     team 5     team 1
    Wed     team 2     team 3     team 4
    Thu     team 5     team 1     team 2
    Fri     team 3     team 4     team 5
According to Table 5, each medical team has two time blocks per week for performing surgeries
on the elective patients.
Methodology of the case study. The main objective of the developed software tool is to improve
the efficiency and quality of the surgical service. Thus, in the case study, the scheduling obtained
manually is compared with the scheduling obtained by using CIPLAN. The methodology is as follows:
1. Analysis of manual scheduling. Considering historical data of one medical team of the OSD,
36 consecutive OR time blocks of elective patients are analyzed. For each time block, the following
information is obtained:
The effective time in a surgery block is the time of using the OR for surgeries; therefore, the
occupation rate in a time block is computed as the effective time divided by the total time.
2. Create the waiting list. The manual scheduling and consequently the surgery types in the
waiting list are known. However, the preference order of each surgery in the waiting list is unknown.
In order to be able to obtain an ordered waiting list from which the scheduling using CIPLAN is
performed, it is assumed that the position of the patients (surgeries) in the list corresponds to the order
in which they were operated by using the manual scheduling. That is, the first patient in the waiting
list is the first patient operated in the first time block. Thus, considering the surgeries performed in the
36 time blocks, an ordered waiting list containing 111 patients is created.
3. Scheduling using CIPLAN. The data necessary to perform the scheduling using CIPLAN are
included in the tool and then the scheduling is performed:
• All elective surgery types are added to CIPLAN including their average duration and standard
deviation (computed considering historical data of the team during last two years).
• A new medical team (team 1) is added to the tool. Moreover, a new doctor is added to the team.
• All patients in the waiting list constructed using the 36 time blocks are assigned to the doctor.
• For the next three months, two time blocks are booked weekly from 8:30 a.m. to 3:00 p.m., for a total of 24 time blocks.
• Using CIPLAN, the 24 time blocks previously booked are scheduled considering a minimum
confidence level of not exceeding the total time equal to 69%. This value is fixed by doctors.
4. Analysis of the scheduling performed using CIPLAN. For each time block scheduled,
the following parameters are obtained:
The scheduling obtained by using CIPLAN has not been applied in practice. Thus, to compute
the real occupation rate, the time spent when the surgeries were performed (manual scheduling) has
been considered. Moreover, in order to obtain the real ending time, 20 min of cleaning time between
two consecutive surgeries and 10 min of delay with respect to the starting time have been considered.
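As a small illustration of this reconstruction (10 min of initial delay, 20 min of cleaning between consecutive surgeries, 6.5 h blocks starting at 8:30), consider the sketch below; the surgery durations and the date are made up.

    # Sketch: real occupation rate and ending time of one time block (illustrative durations).
    from datetime import datetime, timedelta

    durations_min = [118, 95, 102]                 # real durations of the performed surgeries
    delay, cleaning, block_len = 10, 20, 390       # minutes (6.5 h block)

    effective = sum(durations_min)
    idle = delay + cleaning * (len(durations_min) - 1)
    occupation_rate = 100.0 * effective / block_len
    ending = datetime(2020, 3, 2, 8, 30) + timedelta(minutes=effective + idle)
    print(f"occupation rate: {occupation_rate:.1f}%, ending time: {ending:%H:%M}")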
Figure 4. Comparison of manual and CIPLAN scheduling. A set of 24 time blocks has been scheduled
manually and using CIPLAN. The following values are shown: (a) expected occupation rate; (b) real
occupation rate; (c) confidence level of not exceeding the available time; (d) ending time.
In Figure 4b, the real occupation rate obtained in each time block is shown. Again, the average
value obtained using CIPLAN (77.86%) is greater than the average value using the manual method
(75.03%). Using CIPLAN, the improvement in the occupation rate with respect to the manual method is 2.83%.
In Figure 4b, it is also possible to see that using CIPLAN there are some time blocks (1, 2, 23,
and 24) with occupation rates that are too high (about 90%) and could lead to exceeding the available time.
This does not happen in manual scheduling. Analyzing the results, it has been observed that it
happens because the scheduler (in the manual method) knows the surgery skill of the surgeon and can
estimate better the expected duration of the surgery depending on the surgeon. However, the CIPLAN
scheduling algorithm initially uses the average duration of the surgeries. These average durations are
computed considering historical data, independently from the surgeon. As the tool is used, the time
spent in each surgery as well as the surgeon who performs it will be registered in CIPLAN. Thus,
the surgery duration will be more and more customized depending on the surgeon and better results
will be obtained.
In Figure 4c, the confidence level of not exceeding the available time is shown. The corresponding
values are computed by considering the average duration and the standard deviation of each surgery.
It can be seen that, using CIPLAN, there are no time blocks with a confidence level lower than 69%.
However, using the manual method, there are several time blocks with a really low confidence level.
For example, in time block 4, the confidence level of not exceeding the available time is only 23.3%.
Again, these low values are obtained since the scheduler knows the surgery skills of the surgeon.
Particularly, in time block 4, three knee arthroplasties have been scheduled. Considering the average
duration and the standard deviation based on historical data, the expected occupation rate is 94.8%
and the confidence level of not exceeding the available time is 23.3%. However, the scheduler
knows that the surgeon who has to perform the knee arthroplasties needs 25 min less than the average
duration of the team. Consequently, the real occupation rate obtained is 73.3% and the time block ends
at 13:52.
Figure 4d shows the ending time of each block. It can be observed that, using CIPLAN, there
are four time blocks (1, 2, 13, and 24) ending around 3:00 p.m. and the time block 23 exceeding the
available time by 30 min (ending at 15:30). This situation does not happen in the manual scheduling.
According to the results obtained in the case study, the quality of the service decreases, because
some blocks exceed the available time. However, this situation will be solved when the average duration and standard deviation of the surgeries are customized depending on the surgeon. This customization will make the real occupation rate more similar to the expected one and, consequently, not only will the number of blocks exceeding the available time decrease, but also the total exceeding time.
In Figure 5, the time executions of the manual (a) and automatic (b) scheduling are shown by two bar graphs. Each row is composed of boxes that represent different actions in one surgical time block.
Colored boxes represent surgeries while white boxes are idle times. The idle time could be: (i) delay
regarding the starting time if the box is at the beginning or (ii) cleaning time if the box is between two
surgeries. On the other hand, to check how much the order of patients in the waiting list is respected,
the preference order of the patients inside the colored boxes has been included. Since the waiting list
of patients has been constructed considering real surgeries, in Figure 5a, the patients are perfectly
ordered. However, considering the automatic method (Figure 5b), although the SHA tries to make the
scheduling respecting as much as possible, the order of the patients in the waiting list results in being
a little disorderly. In order to see it visually, three colors have been used with a different meaning:
• The patients delayed or advanced at most a time block respect to the manual scheduling (perfectly
ordered) are given in green boxes.
• The patients delayed or advanced two or three time blocks with respect to the manual scheduling
are shown in orange or red boxes, respectively.
Figure 5. Comparison of the time execution of the surgical blocks. The time execution of the 24
time blocks scheduled (a) manually and (b) using CIPLAN are compared. Numbers inside the boxes
represent the preference order of the patients. In case (b) color green/orange/red means that the
patient has been advanced or delayed at most 1/2/3 time blocks with respect to the manual scheduling.
Remember that the waiting list has been constructed considering the real surgeries performed
in 36 surgical blocks, obtaining a waiting list composed of 111 patients. Considering the manual
scheduling, patients with preference order from 1 to 73 have been scheduled in the first 24 time blocks.
Using CIPLAN, patients with preference order from 1 to 76 (except 65 and 71) have been scheduled in
24 time blocks. The number of patients advanced or delayed two time blocks (orange) is 14 while the
number of patients advanced or delayed three time blocks (red) is 9. However, only three patients (15,
18, and 57) have been delayed three time blocks and seven patients (22, 25, 47, 54, 58, 60, and 64) have
been delayed two time blocks. This means:
• 4.05% of the patients are delayed three time blocks with respect to the preference order;
• 9.45% of the patients are delayed two time blocks with respect to the preference order.
7. Discussion
7.1. The Use and Experiences with the Scheduling Software Tool
Medical doctors who have used CIPLAN consider it an intuitive, rapid, and efficient software
solution that makes the scheduling task easier. With the manual method, the confidence level of not
exceeding the available time is unknown before building the schedule. With CIPLAN, they find it an important advantage that the confidence level is an input parameter and that the tool maximizes the occupation rate while respecting it. Moreover, they report that the time spent on scheduling is drastically reduced, with only a check of the obtained solution being necessary.
From an economic point of view, using CIPLAN, the improvement in the occupation rate with respect to the manual method is 2.83%. Considering only the OSD, this 2.83% implies an increment ΔOc of 95.65
effective hours per year in the use of the ORs:
    ΔOc = 0.0283 · 6.5 [hours/block] · 2 [blocks/(team·week)] · 52 [weeks/year] · 5 [teams]
        = 95.65 [effective hours/year]                                        (8)
Assuming an occupation rate of 77.86% and time blocks of 6.5 h, these 95.65 effective hours are
equivalent to an increment Δ Tb of 18.89 blocks per year:
    ΔTb = 95.65 [effective hours/year] / ( 0.7786 · 6.5 [effective hours/block] )
        = 18.89 [blocks/year]                                                 (9)
Considering that the cost of opening an OR for 6.5 h is 5850 euros, the use of CIPLAN in the surgical scheduling could represent a saving of about 110,000 euros per year. The savings could be even greater if the use of CIPLAN were extended to other departments.
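The savings estimate can be reproduced with the short calculation below, using only the figures given above (2.83% improvement, 6.5 h blocks, 2 blocks per team and week, 52 weeks, 5 teams, 77.86% occupation, 5850 euros per opened block).

    # Reproducing the savings estimate of Equations (8) and (9).
    improvement, block_hours = 0.0283, 6.5
    blocks_per_team_week, weeks, teams = 2, 52, 5
    occupation, cost_per_block = 0.7786, 5850            # euros for opening an OR for 6.5 h

    extra_hours  = improvement * block_hours * blocks_per_team_week * weeks * teams
    extra_blocks = extra_hours / (occupation * block_hours)
    savings      = extra_blocks * cost_per_block
    print(f"{extra_hours:.2f} effective h/year, {extra_blocks:.2f} blocks/year, ~{savings:,.0f} euros/year")
    # about 95.65 effective hours, 18.9 blocks and 110,000 euros per year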
8. Conclusions
In Spanish hospitals, doctors are responsible for the scheduling of elective patients. Currently, this task is performed manually and has three main problems: (i) doctors need to spend their time on an administrative task (scheduling); (ii) under- or over-utilization of the ORs is usually obtained; and (iii) objectivity can be questioned due to the fact that the scheduling is done by a human. This paper presents a Decision Support System (DSS) that uses a Specific Heuristic Algorithm (SHA) in order to help hospital managers in the scheduling of elective patients.
Considering the durations of the surgeries and of the cleaning times as random variables with normal probability density functions, the SHA, together with some statistical concepts, can be used to schedule the elective patients from the waiting list to the next time blocks in such a way that: (i) the occupation rate of the time blocks is maximized; (ii) a minimum confidence level of not exceeding the available time is ensured; and (iii) the order of the patients in the waiting list is respected as much as possible. In addition,
the DSS also includes features enabling the management of the medical teams and the OR time-table,
updating the waiting list of patients, iterative planning, and automatic improvement of the input data.
In collaboration with the Orthopedic Surgery Department of the “Lozano Blesa” Hospital in
Zaragoza, a software tool (CIPLAN) based on the proposed DSS has been developed. The preferences
and concerns of doctors have been considered not only at the surgery scheduling level, but also at the
user interface level. Thus, a friendly interface divided into six main panels is proposed in CIPLAN.
It has been developed in the Java language and uses an SQL database. Moreover, in order to provide a safe
environment to work and better security of the medical data, different access levels are provided to
every user.
Finally, in order to check the impact of the software tool in the efficiency and the quality of the
surgical service, a case study considering real data from the Orthopedic Department of the HCU has
been discussed. The results show that, using CIPLAN, an improvement in the occupation rate of the
ORs can be reached.
However, it has been observed that there are significant differences in the surgery duration
depending on the experience and on the surgery skill of the surgeon. Thus, if the SHA works with
general average durations, over- and under-occupation rates are obtained more frequently than if it works with customized data. It would be convenient to customize the surgical duration data for each specific surgeon. In this way, the expected occupation rate and the expected ending time will be more similar to the real ones, improving the quality of the service.
As future work, a new panel in CIPLAN can be developed that, once a time block has been
finished, allows doctors to compare in a visual way the expected scheduling and the real execution.
On the other hand, a new feature in CIPLAN can be included that enables them to perform virtual
scheduling in which all patients in the waiting list are scheduled. In this way, an approximate date could be given to patients from the moment they are included in the system. Moreover, other extensions can be developed, among others an app connected to CIPLAN allowing patients to know the approximate date of their surgeries.
9. Patents
The developed software tool has been registered by the Office of Transfer of Research Results
(OTRI) at the University of Zaragoza.
Author Contributions: All authors contributed to the design of the mathematical model and the conceptual
design of the decision support system. J.A. was responsible for collecting the necessary surgical data to analyze
the mathematical models and perform the case study. D.C., C.M. and M.S. were involved in the development of
the proposed heuristic method as well as the implementation of the DSS. All authors have read and agreed to the
published version of the manuscript.
Funding: This work has been partially supported by CICYT-FEDER (Spain-EU) under Grant DPI2014-57252-R
and by the Aragonese Government (Spain).
Acknowledgments: We would like to acknowledge with much appreciation Lidia Castan’s help in the collection
of the necessary surgical data.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
OR Operation Room
DSS Decision Support System
SHA Specific Heuristic Algorithm
RHS Receding Horizon Strategy
HCU “Lozano Blesa” Hospital in Zaragoza
MIQCP Mixed Integer Quadratic Constrained Programming
Ap Average Preference Order
OSD Orthopedic Surgery Department
CIPLAN Developed Software Tool
MILP Mixed Integer Linear Programming.
References
1. Antares Consulting. Libro Blanco de la Actividad y Gestion del Bloque Quirurgico en
España. Available online: https://www.antares-consulting.com/uploads/TPublicaciones/
356f8ea46ff1e222fbcdcdafb4415c0363c9c9aa.pdf (accessed on 11 March 2020).
2. OECD. Health Data 2005—Statistics and Indicators for 30 Countries; OECD: Paris, France, 2005.
3. C.I.A. The World Factbook. 2017. Available online: https://www.cia.gov/~library/publications/the-world-
factbook/ (accessed on 11 March 2020).
4. Ministerio de Sanidad, Consumo y Bienestar Social. Crisis Económica y Salud en España; Informes, Estudios e
Investigación; Ministerio de Sanidad, Consumo y Bienestar Social: Madrid, Spain, 2018.
5. Spangler, W.E.; Strum, D.P.; Vargas, L.G.; May, J.H. Estimating Procedure Times for Surgeries by Determining
Location Parameters for the Lognormal Model. Health Care Manag. Sci. 2004, 7, 97–104. [CrossRef] [PubMed]
6. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, MA, USA, 2004.
7. Capan, M.; Khojandi, A.; Denton, B.T.; Williams, K.D.; Ayer, T.; Chhatwal, J.; Kurt, M.; Lobo, J.M.;
Roberts, M.S.; Zaric, G.; et al. From Data to Improved Decisions: Operations Research in Healthcare
Delivery. Med. Decis. Mak. 2017, 37, 849–859. doi:10.1177/0272989X17705636. [CrossRef] [PubMed]
8. Cardoen, B.; Demeulemeester, E.; Belien, J. Operating room planning and scheduling: A literature review.
Eur. J. Oper. Res. 2010, 201, 921–932. [CrossRef]
9. Schmidt, R.; Geisler, S.; Spreckelsen, C. Decision support for hospital bed management using adaptable
individual length of stay estimations and shared resources. BMC Med. Inform. Decis. Mak. 2013, 13, 3.
[CrossRef] [PubMed]
10. Miller, H.E.; Pierskalla, W.P.; Rath, G.J. Nurse scheduling using mathematical programming. Oper. Res. 1976,
24, 857–870. [CrossRef]
11. Epstein, D.M.; Chalabi, Z.; Claxton, K.; Sculpher, M. Efficiency, Equity, and Budgetary Policies: Informing
Decisions Using Mathematical Programming. Med. Decis. Mak. 2007, 27, 128–137. [CrossRef] [PubMed]
12. Clavel, D.; Botez, D.; Mahulea, C.; Albareda, J. Software Tool for Operating Room Scheduling in a Spanish
Hospital Department. In Proceedings of the 22nd International Conference on System Theory, Control and
Computing, Sinaia, Romania, 10–12 October 2018.
13. Zhu, S.; Wenjuan, F.; Shanlin, Y.; Jun, P.; Panos, M.P. Operating room planning and surgical case scheduling:
A review of literature. J. Comb. Optim. 2019, 37, 757–805. [CrossRef]
14. Samudra, M.; Van-Riet, C.; Demeulemeester, E.; Cardoen, B.; Vansteenkiste, N.; Rademakers, F.E. Scheduling
operating rooms: Achievements, challenges and pitfalls. J. Sched. 2016, 19, 493–525. [CrossRef]
15. Landa, P.; Aringhieri, R.; Soriano, P.; Tanfani, E.; Testi, A. A hybrid optimization algorithm for surgeries
scheduling. Oper. Res. Health Care 2016, 8, 103–114. [CrossRef]
16. Neyshabouri, S.; Berg, B.P. Two-stage robust optimization approach to elective surgery and downstream
capacity planning. Eur. J. Oper. Res. 2017, 260, 21–40. [CrossRef]
17. Wang, Y.; Tang, J.; Fung, R.Y.K. A column-generation-based heuristic algorithm for solving operating theater
planning problem under stochastic demand and surgery cancellation risk. Int. J. Prod. Econ. 2014, 158, 28–36.
[CrossRef]
18. Wang, Y.; Tang, J.; Pan, Z.; Yan, C. Particle swarm optimization-based planning and scheduling for a
laminar-flow operating room with downstream resources. Soft Comput. 2015, 19, 2913–2926. [CrossRef]
19. Shylo, O.V.; Prokopyev, O.A.; Schaefer, A.J. Stochastic operating room scheduling for high-volume specialties
under block booking. INFORMS J. Comput. 2012, 25, 682–692. [CrossRef]
20. Hans, E.; Wullink, G.; Van Houdenhoven, M.; Kazemier, G. Robust surgery loading. Eur. J. Oper. Res. 2008,
185, 1038–1050. [CrossRef]
21. Kazak, J.; Chruściński, J.; Szewrański, S. The Development of a Novel Decision Support System for the
Location of Green Infrastructure for Stormwater Management. Sustainability 2018, 12, 4388. [CrossRef]
22. Cheng, Y.J.; Chen, M.H.; Cheng, F.C.; Cheng, Y.C.; Lin, Y.S.; Yang, C.J. Developing a Decision Support System
(DSS) for a Dental Manufacturing Production Line Based on Data Mining. In Proceedings of the 2018 IEEE
International Conference on Applied System Invention (ICASI), Chiba, Japan, 13–17 April 2018; Volume 7.
23. Seyr, H.; Muskulus, M. Decision Support Models for Operations and Maintenance for Offshore Wind Farms:
A Review. Appl. Sci. 2019, 9, 278. [CrossRef]
198
Appl. Sci. 2020, 10, 1937
24. Bernardi, S.; Mahulea, C.; Albareda, J. Toward a decision support system for the clinical pathways assessment.
Discret. Event Dyn. Syst. Theory Appl. 2019, 29, 91–125. [CrossRef]
25. Mahulea, C.; Mahulea, L.; Garcia-Soriano, J.M.; Colom, J.M. Modular Petri Net Modeling of Healthcare
Systems. Flex. Serv. Manuf. J. 2018, 30, 329–357. [CrossRef]
26. Dios, M.; Molina-Pariente, J.M.; Fernandez-Viagas, V.; Andrade-Pineda, J.L.; Framinan, J.M. A Decision
Support System for Operating Room scheduling. Comput. Ind. Eng. 2015, 88, 430–443. [CrossRef]
27. Karlof, J. Integer Programming: Theory and Practice; Operations Research Series; CRC Press: Boca Raton, FL,
USA, 2005.
28. Apt, K. Principles of Constraint Programming; Cambridge University Press: Cambridge, MA, USA, 2003.
29. Clavel, D.; Mahulea, C.; Albareda, J.; Silva, M. Robust Scheduling of Elective Patients under Block Booking by
Chance Constrained Approaches; Technical Report RR-18-01; Universidad de Zaragoza: Zaragoza, Spain, 2018.
Available online: http://webdiis.unizar.es/~cmahulea/papers/rr_2018.pdf (accessed on 11 March 2020).
30. Camacho, E.; Bordons, C. Model Predictive Control; Advanced Textbooks in Control and Signal Processing;
Springer: London, UK, 2004.
31. Johnson, D.S. Near-Optimal Bin Packing Algorithms. Ph.D. Thesis, Massachusetts Institute of Technology,
Cambridge, MA, USA, 1973.
32. Klement, N.; Grangeon, N.; Gourgand, M. Medical Imaging: Exams Planning and Resource Assignment:
Hybridization of a Metaheuristic and a List Algorithm. In Proceedings of the 10th International Joint
Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), Porto, Portugal,
21–23 February 2017; pp. 260–267.
33. Arabnejad, H.; Barbosa, J.G. List Scheduling Algorithm for Heterogeneous Systems by an Optimistic Cost
Table. IEEE Trans. Parallel Distrib. Syst. 2014, 25, 682–694. doi:10.1109/TPDS.2013.57. [CrossRef]
34. Siegel, A.F. (Ed.) Practical Business Statistics, 7th ed.; Academic Press: Cambridge, MA, USA, 2016;
pp. 549–552.
c 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
199
applied
sciences
Article
A Prostate MRI Segmentation Tool Based on Active
Contour Models Using a Gradient Vector Flow
Joaquín Rodríguez 1 , Gilberto Ochoa-Ruiz 2 and Christian Mata 3, *
1 Vibot ERL Laboratory, Université de Bourgogne Franche-Comté, IUT Le Creusot, 12 rue de la Fonderie,
71200 Le Creusot, France; [email protected]
2 Tecnologico de Monterrey, School of Engineering and Sciences, Ave. Eugenio Garza Sada 2501,
Monterrey, N.L., México 64849, Mexico; [email protected]
3 Centre for Technological Risk Studies, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
* Correspondence: [email protected]
Abstract: Medical support systems used to assist in the diagnosis of prostate lesions generally rely on prostate segmentation, which is one of the major focuses of interest in the recent literature. The main problem encountered in the diagnosis of a prostate study is the localization of the Regions of Interest (ROIs) containing tumor tissue. In this paper, a new GUI tool based on semi-automatic prostate segmentation is presented. The main rationale behind this tool, and the focus of this article, is to facilitate the time-consuming segmentation process used for annotating images in clinical practice, enabling radiologists to use novel and easy-to-use semi-automatic segmentation techniques instead of manual segmentation. A detailed specification of the proposed segmentation algorithm, an Active Contour Model (ACM) aided with a Gradient Vector Flow (GVF) component, is given. The purpose is to support the manual segmentation of the main ROIs of the prostate gland zones. Finally, an experimental case of use and a discussion of the results are presented.
1. Introduction
Prostate cancer (PCa) remains one of the most commonly diagnosed solid organ tumor types among men. Early diagnosis and an active follow-up can improve prognosis and prevent life-threatening conditions. Once the decision of treatment has been taken, having the most complete information for treatment and follow-up is crucial. In 2020, approximately 1,806,590 new cancer cases and 606,520 cancer deaths are projected to occur in the United States. This classification has been reported by the ICD-O, except for childhood and adolescent cancers, which were classified according to the ICCC [1]. On the other hand, PCa is the third predicted cause of death in the EU, with 78,800 deaths in 2020 [2]. This entails an estimated rate of 10.0/100,000 men.
Medical support systems used to assist in the diagnosis of prostate lesions generally focus on
prostate segmentation [3–6]. They rely on computerized techniques for prostate cancer detection
applied to ultrasound, magnetic resonance and computed tomodensitometric images [7]. For example,
Vos et al. [8] used 3D-T2 imaging to define a specific Region of Interest (ROI), which was subsequently
used on diffusion- and perfusion-weighted images to extract relevant features. MRI has been
established as the best imaging modality for the detection, localization and staging of PCa on account of
its high resolution and excellent spontaneous contrast of soft tissues and the possibilities of multi-planar
and multi-parametric scanning [9]. Among the techniques used to detect PCa, MRI also allows
non-invasive analysis of the anatomy and the metabolism in the entire prostate gland.
Furthermore, many recent works confirm that the combination of several MRI techniques
facilitates the evaluation of prostate diagnoses according to the PI-RADS classification, both for
radiologists and medical experts [10]. For instance, a thorough evaluation of various prostate cancer
studies was presented in [11]; this study highlights the correct localization of ROIs containing tumor tissue as the most prevalent problem encountered in the diagnosis of prostate cancer and other related tasks. Traditionally, clinical experts use different software tools to validate the diagnoses and make annotations in several files. This approach is far from efficient when managing abundant medical data. For the reasons stated above, we propose a new software tool to provide support for the time-consuming segmentation process used for annotating images in clinical practice. Additionally, it enables radiologists to make use of novel semi-automatic segmentation techniques instead of the traditional manual segmentation approaches.
The contributions of this paper are two-fold: we introduce a novel prostate cancer segmentation graphical user interface on which several segmentation algorithms can be easily implemented, and, in order to demonstrate the strengths and the potential applications of the proposed tool, we included several segmentation algorithms with the software.
These extensions of the original algorithm, in tandem with the novel graphical user interface, greatly simplify the segmentation process, as they enable a boosted semi-automatic interaction with the clinicians, who can eventually validate the correctness of the ROIs generated by the different methods. Finally, a quantitative study is provided to validate the correctness of the generated segmentations against the ground truth and between the algorithms. It is worth mentioning that the developed program serves as an initial version of a tool that, we hope, will create many opportunities to contribute newer algorithms or improvements to existing ones in the field of prostate segmentation.
The rest of this paper is organized as follows. In Section 2, we introduce the dataset used and explain the implementation and features of the proposed Graphical User Interface; we then describe the Basic ACS and the modifications we introduced to make it work on prostate images, and finally present details about the Gradient Vector Flow (GVF) model and its implementation in our tool. In Section 3, we include a single experimental case study to demonstrate the advantages of the proposed approach. In Section 4, we discuss the results from the clinical perspective, and in Section 5 we present the conclusions as well as future work.
(France). The institutional committee on human research approved the study, with a waiver for the
requirement for written consent, because MRI and MRSI were included in the workup procedure for
all patients referred for brachytherapy or radiotherapy.
The prostate is composed of a peripheral zone (PZ), a central zone (CZ), a transitional zone (TZ) and an anterior fibromuscular tissue (AFT), as depicted in Figure 1. Most cancer lesions occur in the peripheral zone of the gland, fewer occur in the TZ and almost none arise in the CZ. A detailed description of the influence of the risk factors according to the prostate zone is given in [13]. The T2WI modality was chosen because it provides the best depiction of the prostate's zonal anatomy.
Figure 1. Anatomy of the prostate in (a) transversal and (b) sagittal planes [11].
The PZ represents up to 70% of a normal prostate gland and around 75% of prostate cancers
originate in this zone. The CZ represents about 25% of a normal healthy prostate gland in young
adults. Even though the frequency of cancers originating here is much lower, they tend to be of the
more aggressive type [14]. An example of a real study case provided by the panel visualization of the
prostate tool [11] is depicted in Figure 2. All the overlays are annotated with manual drawings made
by radiologists. Each region is represented by different colors (CZ in white and PZ in blue) focusing
on a tumor area represented in red (Tum). This example shows the importance of making annotations
in regions of interest. In this case, a manual segmentation is validated by experts.
Figure 2. An example of the anatomy (T2WI) panel visualization of the tool developed in [11] with the
ROIs annotated by radiologists.
Therefore, one of the main motivations of this work is to develop a new tool focused on improving
the results of this initial manual segmentation using Active Contour Models (ACM), facilitating the
task of the experts.
In order to attain the above-mentioned goals, we have implemented a novel Graphical User Interface that permits importing, analyzing and annotating prostate MR images. The program requires an initial annotation from the user, which does not need to be perfect, and the outcome is an annotation more accurate than the initial one. The modular design of the software allows for integrating segmentation methods as plug-ins or extensions that can be used for testing and clinical purposes. Additionally, this design enables easy performance comparisons between already integrated methods and new methods. In what follows, we will discuss the design of this GUI and subsequently present a case study using three implementations of an ACS model for segmenting the different prostate sections in MR images. It is important to note that this tool and the developed algorithms have not yet been clinically tested, and the results presented here represent a proof of concept. However, for the design of the GUI, the expertise and suggestions of clinical experts have been taken into account.
Figure 3. GUI interface of the developed active contour segmentation tool using a prostate MR image.
the right class for the object that will produce the segmentation, according to the method chosen by the user. The selection of the segmentation method can be done in Section 2 of Figure 3. By implementing the methods in this manner, the actual implementation of the segmentation task is completely transparent to the other classes that make use of this module. To select a certain segmentation method, only a label that defines the method must be provided.
This module also contains two other files with useful functions:
1. A specific MatLab module, which implements some functions that are available in MatLab but not in Python; their names are kept the same in this program.
2. A class that finds the ellipse parameters that best fit a set of points. This code is used for the shape prior of the snake algorithm (Basic ACS), which will be explained in Section 2.3.
• Image widget. This module is in charge of showing the image, as well as the initial and the resulting contours from the segmentation. The same module is used for both the left and the right images. It also captures the clicks from the user to generate either the circular or the manual initial contour; both types of initialization will be explained later. The image widget is represented in Sections 3 and 4 of Figure 3.
• Model 3D widget. It shows a 3D representation of the output segmentation, based on the Python module Matplotlib. This process is relatively simple, as it only replicates the given contour vertically. If the spatial resolution is given in the input DICOM file, the scale of the axes will be given in millimeters. It can be accessed through the button Show 3D Model located in Section 5 of Figure 3.
• Snake init widget. This widget is in charge of handling all the functionalities of the Snake initialization group, placed at the top of the central control panel of the application. Each button will request the main window to ask the user to enter the initial contour in the left image. Once the user has finished, the gathered contours are returned to this module. Afterwards, the corresponding masks for those contours can be generated for the snake algorithm (the mask consists of an image with values of 1 in the interior of the contour, and zeros outside of it); a minimal sketch of this mask generation is given after Figure 4. The interface for this module is represented in Section 6 of Figure 3.
• User data widget. It is implemented as an efficient and easy-to-use way to handle all the user information as an object and to display it. This class stores all the information related to the patient or the study, retrieved from the DICOM file. It also contains the corresponding widgets that are shown below the left image area, where the gathered information is displayed (see Section 7 of Figure 3).
Figure 4. Main software components of the proposed tool. The Segmentation-Hub module can be
extended with new modules seamlessly.
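To make the mask generation mentioned in the Snake init widget description concrete, a minimal sketch is given below. It is an illustration only (the function name and the use of Matplotlib's Path class are our assumptions, not the tool's actual code): it rasterizes the clicked contour points into the 1-inside/0-outside mask expected by the snake algorithms.

```python
import numpy as np
from matplotlib.path import Path  # assumption: Matplotlib is available (it is already used for the 3D view)

def contour_to_mask(contour_points, image_shape):
    """Rasterize a closed contour (sequence of (x, y) points) into a binary mask:
    1 inside the contour, 0 outside."""
    h, w = image_shape
    yy, xx = np.mgrid[0:h, 0:w]                           # pixel grid in image coordinates
    pixel_centers = np.column_stack([xx.ravel(), yy.ravel()])
    inside = Path(contour_points).contains_points(pixel_centers)
    return inside.reshape(h, w).astype(np.uint8)
```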
Finally, regarding the user interface shown in Figure 3, this program has four well-defined sections:
• Left panel. In this area (highlighted in blue), the loaded images are shown, and the user can draw the initial contours of the snake for the different areas of the prostate. Each prostate zone is surrounded with a different color to make it easily distinguishable (see Figure 3).
• Right panel. This area (highlighted in orange) shows the loaded images and the results of the evolution of the selected snake algorithm. Each prostate zone has a pre-defined color, and they match the colors used for the initial contours (see Section 4 of Figure 3).
• Bottom area. The patient information retrieved from the DICOM file is shown (highlighted in blue). Moreover, information related to the spatial resolution and thickness of the gathered slice is included (Section 7 of Figure 3).
• Central area. This is the control panel area (highlighted in green). All the functionalities and possible configurations of the program are located in this area. It is represented in Sections 1, 2, 5 and 6 of Figure 3.
To the best of our knowledge, tools such as the one proposed in this work, which integrates a Graphical User Interface and semi-automated segmentation algorithms for image annotation, do not exist (at least not for prostate MR images). Therefore, a preliminary effort to develop a GUI that includes several active contour model implementations is very important and has driven the development of this work. On the other hand, from the implementation point of view, this program represents an excellent tool to deploy new snake-based algorithms: given that the initial contour and the original image are available in the Segmentation-Hub module, new algorithms can be created using Python and added as a standalone file to the Image Segmentation folder. This procedure will add another option to the GUI, and the new algorithm's performance and accuracy can be compared with the already existing techniques. This code-base is now available to the community as a GitHub repository [15].
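As an illustration of how such a standalone file might look, the sketch below outlines a hypothetical plug-in; the class and method names are assumptions for illustration, since the actual Segmentation-Hub interface is the one defined in the repository [15].

```python
# Hypothetical plug-in sketch: names and signatures are illustrative, not the repository's actual API.
import numpy as np

class MySegmentationMethod:
    """A new snake-based method dropped into the Image Segmentation folder."""

    label = "My ACS variant"  # label through which the GUI would select the method

    def segment(self, image: np.ndarray, initial_mask: np.ndarray) -> np.ndarray:
        # Receive the original image and the initial contour mask and return the
        # refined binary mask. Here the initial mask is returned unchanged as a placeholder.
        return initial_mask.copy()
```

A new file following this pattern would then appear as an additional option in the GUI and could be compared against the existing techniques.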
In order to demonstrate how different segmentation methods can be integrated into the proposed tool, we have implemented various versions of an Active Contour Model within the Image Segmentation module of our tool. These methods have been included not only for demonstration purposes, but also to serve as a baseline for future developments in prostate MRI segmentation research (they can be improved upon, or new methods can be developed based on them). The proposed Active Contour Model, to be discussed in the following section, also represents a novel contribution, as it has not previously been reported for prostate segmentation purposes.
In this previous work [12], the goal was to segment the shape of the left ventricle of the heart; but since the method is aimed at MR images, and its priors are similar to the assumptions that can be made for the prostate zones, we considered it a good starting point for carrying out our experiments and for implementing the GVF extension to demonstrate its viability in this image modality. The basic assumption of the developed algorithm is that the contour splits the image into two areas, the target object and the background. The evolution of the curve is driven by the minimization of an energy functional that, in this case, is composed of four components. Each of them defines a prior that the final contour must comply with [12]:
1. Model matching. It is assumed that the pixel values inside and outside the object follow different statistical models, specified as M1 and M2, respectively.
2. Edge information. The contour must be attracted to certain clues that describe the limits of the object, i.e., the strongest edge values of the image should attract the curve.
3. Shape prior. Besides certain possible variations between patients, each prostatic zone has a general shape. This shape information can be thought of as a function CH(θ), with θ = [θ1, θ2, ..., θn]^T. The parameters θi define a model of the actual shape and size of the requested zone, and depending on how distant the contour is from this estimated shape, this prior will contribute more or less to the snake evolution.
4. Contour smoothness. The contour is not supposed to be jagged or too noisy, and this term controls that.
Then the functional that defines the energy described above, depending on the contour C, can be expressed as

$$J(C) = \lambda_1 J_1(C) + \lambda_2 J_2(C) + \lambda_3 J_3(C) + \lambda_4 J_4(C) \qquad (1)$$

where the λi are the weights of each prior included in the energy function. This function will be used to make the contour evolve. It follows that it is possible to transform the information of the contour into a level set function, i.e., the information about the contour can be expressed as a function in which the pixels that belong to the object have a certain value (or level), the background pixels have another level, and the frontier, which is the contour C, is represented by yet another level. In this paper, the contour is expressed through the level set function as in Equation (2); as can be seen, the contour lies at the level where the function φ is equal to zero, and Ω represents the image domain:

$$C = \{(x, y) \in \Omega : \phi(x, y) = 0\} \qquad (2)$$
This function φ evolves with each iteration (i.e., over time), depending on the direction in which the forces push the contour. In the following sections, the particularities of each term will be described.
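As a concrete illustration of this representation (our sketch, not the tool's code), a binary mask of the object can be turned into a level set function φ by combining two Euclidean distance transforms; the contour C of Equation (2) is then the set of pixels where φ changes sign.

```python
import numpy as np
from scipy import ndimage

def mask_to_level_set(mask):
    """Signed-distance level set: positive inside the object, negative outside,
    so that the contour C of Equation (2) is the zero level set of phi."""
    mask = np.asarray(mask, dtype=bool)
    inside = ndimage.distance_transform_edt(mask)      # distance to the background
    outside = ndimage.distance_transform_edt(~mask)    # distance to the object
    return inside - outside
```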
c1. The boundary of this region is C0. Then, the pixel values of the image u0 are approximately c1 inside C0 and c2 outside of it. Given this, it is possible to build the following fitting term:

$$F(C) = F_1(C) + F_2(C) = \int_{\mathrm{inside}(C)} |u_0(x,y) - c_1|^2\,dx\,dy + \int_{\mathrm{outside}(C)} |u_0(x,y) - c_2|^2\,dx\,dy$$

where C is any variable curve, and c1 and c2 are two constants that depend on the contour C. Based on the previous explanation, if C = C0, then F(C) is approximately zero, because each distance between the image value and the constant ci is approximately zero in each region. If the curve C is outside C0, then its interior contains parts of the object and parts of the background, while the exterior of C contains only elements of the background; because of this, F1(C) > 0 and F2(C) ≈ 0. In the opposite case, i.e., when the contour C is smaller than C0, the outside contains parts of the object and of the background, whilst the inside only contains parts of the object; it follows that F1(C) ≈ 0 and F2(C) > 0. Finally, this fitting function can be used to find the boundaries of the object by minimizing its value.
In order to achieve this boundary, the evolution of the curve C is done by moving pixels from the exterior to the interior of the contour, and vice versa. The first estimation of the energy function for this prior is given in Equation (6):
$$J_0(C) = p(u \mid C, M_1, M_2) \qquad (6)$$

where p is the probability density function of the image intensities u, given the contour C and the models M1 and M2. It is assumed that the two elements, object and background, are statistically independent, with pdfs p1 and p2, respectively. Moreover, the pixel intensities are considered to be independent from each other. With these assumptions, and taking the negative logarithm of the pdfs, Equation (6) becomes Equation (7).
Assuming a continuous function, the contour embedded as the zero level set of the function φ, and using the Heaviside function H in order to create a single integral over the whole image domain, Equation (7) can be generalized as shown in Equation (8):

$$J_1(\phi) = \int_{\Omega} \Big( -\ln[p_1(u(x,y))]\, H(\phi(x,y)) - \ln[p_2(u(x,y))]\,\big(1 - H(\phi(x,y))\big) \Big)\, dx\, dy \qquad (8)$$
It is assumed that both elements in the image follow a Gaussian distribution, with means m1 and m2, and variances σ1² and σ2², which are unknown and are estimated for each contour C.
where Gσ is a Gaussian kernel of parameter σ, ∇ is the gradient operator and (·) ∗ (·) is the convolution operator. Finally, the energy functional corresponding to this prior, depending on the level set function φ, is given by Equation (10).
where δ(φ(x, y)) returns non-zero values only for the pixels that lie on the contour C.
The parameters θ = [a, b, c, d, e, f] are computed in such a way that the cloud of points around the contour C fits this ellipse equation. This means that the parameters θ must be computed for each contour C, and they change the force direction in order to minimize this distance function. Finally, introducing the level set function φ, in which the contour C is embedded, the shape prior is computed as in Equation (12) by integrating over the whole space Ω.
$$J_4(C) = \oint_C ds \qquad (13)$$

It is worth mentioning that if only this term were included in the functional, the contour would become a circle and then start shrinking until it eventually disappeared. Nonetheless, when all the terms are put together, this situation is avoided, and the effect is to smooth the curve. Again, adding the level set function in which the contour is embedded, the final expression for J4 is given in Equation (14).
The next step is to combine all the functionals together and expand the minimization problem, in such a way that the function φ(x, y), which contains the contour C as its zero level, evolves towards the edge of the object. The entire procedure is well explained in [12], and it consists of three tasks:
1. Define the statistical model constants, in such a way that they comply with the prior that the pixel distribution follows a Gaussian distribution.
2. Estimate the ellipse parameters to obtain the distance functional contribution.
3. Define the contour evolution equation. This can be done by applying the Euler-Lagrange (EL) equation to the functional of the four priors. Then, considering that φ is a function of time, the EL equation is equated to the derivative of φ with respect to time. The resulting equation is given in Equation (15):

$$\frac{\partial \phi}{\partial t} = \left[ \lambda_1 (M_2 - M_1) - \nabla P \cdot \frac{\nabla \phi}{\|\nabla \phi\|} - P\kappa \right] \delta(\phi(x,y)) \qquad (15)$$
where M1 and M2 are the stochastic models of the object and the background, respectively, κ = div(∇φ/||∇φ||) is the curvature of the contour, φ(x, y) is our level set function, and P(x, y) = λ2 Υ(x, y) + λ3 D²(x, y) + λ4 is the potential function.
The different values of λ1 , λ2 , λ3 and λ4 will define the contribution of the corresponding energy
term, from J1 to J4 , and the one with the biggest value will enforce its properties over the others.
In practice, the parameter setting is done empirically, and there is no rule that defines how to choose the weights; it depends on the quality and the properties of the input image. One solution is to let the weights evolve at each iteration to obtain the best results. The general idea of this evolution is that, at the beginning, the algorithm should minimize the energy functional trying to cover the whole object, pushing the boundaries towards the edges (effect produced by λ1 and λ2), and when the algorithm is close to the end, the curve must enforce the shape of the contour and its smoothness (λ3 and λ4).
For the particular case of prostate segmentation, several objects with different shapes have to be considered. Therefore, the shape prior cannot be only an ellipse, as in the case of the left ventricle of the heart; thus, for this implementation, the value of λ3 has been set to zero. Additionally, through several experiments, we concluded that the contribution produced by the smoothness prior is not noticeable; in fact, it produces only small variations in the contour, and the result is that the algorithm takes more time to converge. This is the reason why the value of λ4 has been set to zero too. Finally, the suggested evolution of the weights is shown in Figure 5.
with the constraint that it should be closed, i.e., C (0) = C (1). Geometric Active Contours, on the other
hand, define the curve to be deformed as embedded in a level-set function, generally as its level zero.
This last case is the one implemented and described in Section 2.3. GACs have advantages over PACs,
since they are computationally more stable and they are able to change their topology as the function is deformed. Nonetheless, the implemented method is an edge-based method, since it uses the gradient as the pushing force, and local intensity information to determine whether a pixel belongs to the object or not. Due to this, several problems can appear, such as:
1. If the edges of the object are not well defined, a leaking problem occurs, leading the snake outside of the object's boundaries.
2. If the initial contour is not close to the limits of the object, the gradient will not be able to guide the snake to the edges of the object. If the image is just black and white with one object on it, an initial contour far away from the object will not evolve. On the other hand, if there are other objects around, the snake might be attracted to them instead of the desired object.
3. Since the snake uses the local pixel information to determine the evolution of the curve, it is sensitive to image noise.
Additionally, using gradients only brings another effect, related to concavities, which is illustrated in Figure 6.
Figure 6. Example of Gradient vector field of a U shape object. (Left) Original image. (Center) Gradient
of the object. (Right) Zoom of the vector field in the central concavity of the object.
In this case, a simple object with a narrow concavity and its corresponding gradient vector field are shown. From it, it is easy to see that:
• In all the regions where the intensities are constant, the gradient is zero; there are gradient vectors only at the borders of the object. This is the main reason why (in an edge-based snake) an initial contour placed far away from the object might not converge to the real object boundaries.
• Narrow concavities start to have horizontal gradient vectors (as can be seen in the right plot of Figure 6). As a consequence, the snake will attach to the borders, but it will not move towards the interior of such a concavity, because the gradient does not have any downward component.
These last two problems can be solved in the previously explained algorithm with an external force that not only adds vectors where the gradient is zero, but also adds a downward component in long concavities, in such a way that the active contour is pushed into the interior of this type of area. This external force is called Gradient Vector Flow (GVF) [17]. Adding this component to the energy described in Section 2.3 not only makes the system more robust against initialization effects, but also improves the outcome by attaching the resulting snake very close to the edges of the object.
The GVF is a dense vector field v(x, y) = [u(x, y), v(x, y)] derived from the original image, and it is the result of minimizing the energy term defined in Equation (17) [17]:

$$E(\mathbf{v}) = \int\!\!\int \mu \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) + |\nabla \Upsilon|^2 \, |\mathbf{v} - \nabla \Upsilon|^2 \, dx\, dy \qquad (17)$$

where ux and vx are the partial derivatives of u and v with respect to x, uy and vy are the partial derivatives of u and v with respect to y, and Υ(x, y) is the edge map of the original image. The parameter μ is a constant that serves as a regularization term: it adjusts the balance between the first and the second term of the integral, and it should be set according to the noise level; the higher the noise, the higher the value of μ. Interpreting the energy term presented in (17), it is possible to see that if the gradient of the edge map is small, the energy term depends only on the partial derivatives of u and v. On the other hand, if this gradient is large, the integral is dominated by the second term, and the value of the GVF that minimizes the energy is v = ∇Υ, i.e., the gradient vector flow is equal to the gradient of the edge map in those regions. It can be demonstrated that the vector field v = (u, v) that minimizes (17) is the solution of the decoupled pair of differential Equations (18) and (19):

$$\mu \nabla^2 u - (u - \Upsilon_x)(\Upsilon_x^2 + \Upsilon_y^2) = 0 \qquad (18)$$
$$\mu \nabla^2 v - (v - \Upsilon_y)(\Upsilon_x^2 + \Upsilon_y^2) = 0 \qquad (19)$$
where ∇² is the Laplacian operator. As in Equation (17), it is possible to see that in the regions where the gradient of the edge map is small, the solution of the differential equations for the GVF is the solution of Laplace's equation. Furthermore, this solution is interpolated from the boundaries of the object, and that is the reason why the resulting field, even far away from the object, points towards its edges. The gradient vector flow of the same image as in Figure 6 is shown in Figure 7. Now, the solution of Equations (18) and (19) provides vectors in the areas where the gradient is zero, and the vectors in the concavity contain downward components too.
Figure 7. Example of normalized Gradient Vector Flow. (Left) GVF vector field for the same image
used before. (Right) Zoom in the concavity.
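For illustration, the standard iterative scheme of [17] for solving Equations (18) and (19) can be sketched as follows with NumPy/SciPy; this is a simplified illustration, not the exact class used in the tool, and the step size and number of iterations are assumptions.

```python
import numpy as np
from scipy import ndimage

def gradient_vector_flow(edge_map, mu=0.2, iterations=80, dt=0.5):
    """Iterative GVF computation: diffuse the edge-map gradient where it is weak,
    keep it where it is strong (sketch of the scheme in [17])."""
    fy, fx = np.gradient(edge_map.astype(float))   # gradient of the edge map
    u, v = fx.copy(), fy.copy()                    # initialize the field with that gradient
    b = fx ** 2 + fy ** 2                          # squared gradient magnitude
    for _ in range(iterations):
        u = u + dt * (mu * ndimage.laplace(u) - b * (u - fx))
        v = v + dt * (mu * ndimage.laplace(v) - b * (v - fy))
    return u, v
```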
Once the GVF has been computed, it is fixed for the entire image, and what remains is the optimization equation that will guide the deformation of the active contour to the edges of the object based on this new force. Several researchers have been working on adding the gradient vector flow to the traditional snake equations. For the implementation made in Section 2.3, only GAC algorithms can be adapted to be added as an external force to the energy functional defined in Equation (15),
as has been done in [18,19]. A simplified version of the GVF optimization equation from [19] is used, defined as:

$$\gamma \frac{\partial \phi}{\partial t} = \left( \alpha \kappa - \mathbf{v} \cdot \frac{\nabla \phi}{|\nabla \phi|} \right) |\nabla \phi| \qquad (20)$$

The parameter γ regularizes the step size, φ is the level-set function, κ = div(∇φ/|∇φ|) is the curvature, v is the Gradient Vector Flow vector field, and the parameter α regulates the balance between the two forces. The right-hand factor of (20) is added directly to Equation (15), giving Equation (21):

$$\frac{\partial \phi}{\partial t} = \left[ \lambda_1 (M_2 - M_1) - \nabla P \cdot \frac{\nabla \phi}{\|\nabla \phi\|} - P\kappa + \beta \left( \alpha \kappa - \mathbf{v} \cdot \frac{\nabla \phi}{|\nabla \phi|} \right) |\nabla \phi| \right] \delta(\phi(x,y)) \qquad (21)$$
This new segmentation algorithm has been included in the list of options to segment the different prostate zones. After having implemented the Basic ACS in Python, the extension with a GVF component is straightforward: all the components of this new force are already available in the code except for the vector field v. Since this vector field is fixed for a given image, a new class has been created that solves the optimization problem for the GVF; this new piece of code is executed before the main loop of the segmentation algorithm. Since there is no rule for setting α and γ, we arbitrarily decided to set them to 1. In order to adjust the contribution of this new force, the β factor has been introduced; for the following experiments, this factor has also been set to 1. In the coming sections, the graphical user interface resulting from the integration of all the modules will be shown, and the described algorithms will then be tested to show their performance.
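As an illustration of how the GVF force of Equation (20) acts on the level set (a sketch under the same α = γ = 1 convention, not the authors' exact implementation), one explicit update step could be written as:

```python
import numpy as np

def gvf_snake_step(phi, u, v, alpha=1.0, gamma=1.0, dt=0.5, eps=1e-8):
    """One explicit update of the geometric GVF snake of Equation (20)."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + eps
    # curvature: kappa = div(grad(phi) / |grad(phi)|)
    kappa = np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)
    # GVF field projected onto the contour normal
    advection = (u * gx + v * gy) / norm
    return phi + (dt / gamma) * (alpha * kappa - advection) * norm
```

In the complete algorithm, this term is weighted by β and added to the evolution of Equation (15), as written in Equation (21).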
1. Select the type of contour to enter, circular or manual. In the case of a circular contour, also enter the radius.
2. Select an area: TZ, PZ, CZ or Tumour.
3. Create the initial contour for the snake in the left image. If it is a circle, the mouse cursor must be placed in the center of the area of interest and the user must click with the left button of the mouse. If the manual contour has been chosen, the mouse must be placed at the point where the user wishes to start the contour and a left click made; the cursor is then moved to the next point, left-clicked, and so on. More than 2 points are required.
4. In manual mode, once the contour is finished, press the “Ok” button. If not, go to step 5.
5. A message box will appear informing whether the contour has been correctly added or not.
The user can repeat the same procedure for all the zones, or just some of them. In order to
differentiate the various selected areas, each initial contour will be drawn with a different color.
There are no constraints about the type of contours to use. All the contours can be manual, all of them
can be circles or they can be a combination of them.
The following area of the control panel is used for triggering the segmentation. This section is depicted in Figure 9. After an image has been loaded and at least one initial contour has been entered, the segmentation process will start after pressing the Run button. The segmentation method will be triggered as many times as initial contours have been entered, and the runs will be executed sequentially. During this process, the user can cancel the segmentation at any moment by pressing the Cancel button. Once the algorithm has finished, the results will be shown in the right image. Additionally, the color code of the contours is the same for both the left and right images. For instance, the resulting contour drawn in blue in the right image has as its initial contour the blue snake from the left image.
If some results have been produced, a 3D representation of them can be visualized by pressing the button Show 3D model. This representation consists only of the resulting contours, replicated several times along the Z axis. It is an extra option for users to visualize the main structure of the segmentation in a 3D model. In fact, this option could be improved using better visualization techniques, but that is out of the scope of this work for the moment. The number of repetitions has been set to 50. The axes are represented in real scale, in mm, if the DICOM file contains the information about slice thickness and spatial resolution. If they are not provided, a pop-up message will appear reporting the situation, the X and Y axes will be presented in pixels, and the Z axis will have a size of 10 units. An example of this 3D representation of a result is shown in Figure 10.
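A minimal sketch of this kind of visualization (illustrative only, not the widget's actual code) simply replays the contour at successive heights with Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed only for older Matplotlib versions)

def show_3d_model(contour_xy, slice_thickness=1.0, repetitions=50):
    """Replicate a closed 2D contour along the Z axis, as the Show 3D Model button does."""
    contour_xy = np.asarray(contour_xy)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    for k in range(repetitions):
        ax.plot(contour_xy[:, 0], contour_xy[:, 1], zs=k * slice_thickness)
    ax.set_xlabel("x (mm)")
    ax.set_ylabel("y (mm)")
    ax.set_zlabel("z (mm)")
    plt.show()
```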
Regarding the file management tasks provided by this tool, users can open a DICOM formatted file or a folder that contains several DICOM formatted files. When the Open file button is pressed, a pop-up window will appear, and it will only allow the user to select a file, not folders. If a file is not in DICOM format, the program will reject it and take no further action. If the user wants to open a folder, they should click the Open folder button, navigate to the desired folder and enter it. Without touching any file, the user should then click the Choose button, and the system will automatically search for files and open only the DICOM formatted ones. It is important to emphasize that the search system is not recursive. That means that, if there are folders inside the selected one, and those folders contain DICOM files, the program will not open them. It will only open the files that are in the current directory.
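The non-recursive folder scan described above can be sketched as follows (an illustration assuming the pydicom package; the function name is ours):

```python
import os
import pydicom
from pydicom.errors import InvalidDicomError

def open_dicom_folder(folder):
    """Open every DICOM file located directly in `folder`; sub-folders are ignored."""
    datasets = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):        # skip sub-directories: the search is not recursive
            continue
        try:
            datasets.append(pydicom.dcmread(path))
        except InvalidDicomError:           # non-DICOM files are rejected
            continue
    return datasets
```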
As can be seen in the graphical user interface, there are three types of algorithms: Basic ACS, Simplified ACS and GVF ACS. The first and the second algorithms are based on [12], but the second one does not follow the full mathematical approach described in this previous work; the Basic ACS, on the other hand, implements exactly what the paper states, as explained in Section 2.3. Both algorithms have been included so that there is an example in the code of how to add a new algorithm to this code-base. Additionally, the Simplified ACS performs faster than the Basic ACS but with less accuracy. Nonetheless, in order to produce improvements over the original paper, the Basic ACS was required.
Finally, considering that this application is focused on the DICOM format and can be used by radiologists and/or medical experts, it is important to protect the private information of the patient. For this reason, an anonymization module is integrated into this tool. If the user wants to anonymize the opened file or files, they should click the Anonymize button. If there is a file or files opened, a pop-up window will be shown in which the user must specify the destination folder of the anonymized files. The name of the output files will be the same as the source files, but with the string _anonymized appended to it. For instance, if the original file is called Image00001, the output file will be called Image00001_anonymized. The fields that are anonymized are Patient Name, Patient ID and Patient Date of Birth. Those fields are not cleared, but replaced with default values: Patient Name is changed to NN, Patient ID is transformed to A00000000000, and the date of birth is changed to 01/01/1970.
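A sketch of this anonymization step is shown below (assuming pydicom; the attribute keywords follow the DICOM standard, and the function name is ours):

```python
import os
import pydicom

def anonymize_file(input_path, output_folder):
    """Replace the identifying fields with default values and save a *_anonymized copy."""
    ds = pydicom.dcmread(input_path)
    ds.PatientName = "NN"
    ds.PatientID = "A00000000000"
    ds.PatientBirthDate = "19700101"   # 01/01/1970 in the DICOM DA format
    stem = os.path.splitext(os.path.basename(input_path))[0]
    ds.save_as(os.path.join(output_folder, stem + "_anonymized"))
```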
We believe that the proposed tool can be beneficial for the research community, as it integrates the various image management features thoroughly discussed in the previous sections, as well as the capability to segment prostate cancer lesions and other areas, as mentioned in the introduction. Even though the application includes the option of segmenting only four zones (CZ, TZ, PZ and Tumor), nothing prevents users from trying to segment other areas of interest (additional buttons can be added to the GUI in order to select more areas, or the already existing buttons can be used to segment those areas). These features, as well as the capability to integrate new segmentation models seamlessly, make it very appealing for research and teaching purposes in clinical settings. In the former case, it can help in speeding up the creation of datasets for machine learning applications. However, in our experience, doctors are reluctant to use fully automated annotation tools and prefer interactive semi-automated tools that enable them to verify and contrast the results with other specialists. In order to test the performance of the proposed tool, we performed one experiment to assess the quality of the segmentation algorithms implemented in this work, which is detailed in the following section.
Figure 11. Initialization used for the testing of all the algorithms.
The results for all the algorithms and the ground truth for the selected image are shown in Figure 12.
From this experiment, it is possible to confirm what has been said in previous sections. The Basic ACS algorithm performs relatively well when there is some gradient information between different areas. In this image, the cyan and the blue contours (corresponding to the PZ and CZ, respectively) are practically attached to the borders, and the contour corresponding to the TZ (yellow contour) is almost perfect. Nonetheless, the tumour area invades the TZ area, since the intensity contrast between them is not enough to define a good edge.
For the case of the Simplified ACS, the results are not so satisfactory. This algorithm uses simplifications of the formulas described in [12]. It has been tested for segmentation of the heart's left ventricle, where it performs well in both quality and speed, but for prostate segmentation this is not the case. Except for the TZ, all the results invade part of the other zones, and only when the edge is strong enough does the snake appear perfectly attached to it.
Figure 12. Segmentation results and ground truth for all the algorithms implemented.
Finally, as expected, the Gradient Vector Flow implementation overcomes all these problems, and the curves are strongly attached to the borders of each zone. There is only a slight overlap between the tumour and the TZ.
Another point to observe is the smoothness of the results from the Basic ACS and the GVF ACS. Neither of them includes a shape prior or the smoothing factor. This allows the snake to evolve into any shape, which is desired in this case, since there is no single prior that works for all the prostate zones. Additionally, for the same reason, a generic post-processing step that increases the smoothness of the curves is not trivial to implement. This problem is visible mainly in the result given by the Basic ACS, where the contours do not present smooth borders. Nonetheless, even without these factors, the GVF ACS generates results that are not so jagged. Once again, the addition of the Gradient Vector Flow component surpasses the results given by the Basic ACS.
It is worth mentioning that the initialization (in all cases) needs to be inside the object to produce valid results. Since many objects are present in an MR image, a contour larger than the object will produce poor segmentation results. In order to reduce this effect, the image to be segmented is cropped to a square of 150 × 150 pixels around the center of the initialization contour before starting the segmentation process. As a consequence, not only is the influence of other objects over the snake reduced, but the execution time is also reduced, since fewer operations are required at each iteration of the algorithm.
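The cropping step described above can be sketched as follows (illustrative NumPy code, not the tool's exact implementation); the window is clipped at the image borders, and the returned offset allows mapping the resulting contour back to the full image:

```python
import numpy as np

def crop_around_initialization(image, center_xy, size=150):
    """Crop a size x size window around the center of the initial contour."""
    image = np.asarray(image)
    half = size // 2
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, image.shape[1]), min(cy + half, image.shape[0])
    return image[y0:y1, x0:x1], (x0, y0)
```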
In order to quantitatively measure the efficiency of the three model implementations, a contour evaluation of this test was performed. For this evaluation, the Hausdorff distance (HD) and Dice index (DI) metrics for the various segmentation results were compared with a ground truth provided by a specialist. The Hausdorff distance between two finite sets of points A and B is defined as

$$H(A, B) = \max\big( h(A, B),\, h(B, A) \big), \qquad h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|$$

where ‖·‖ is the Euclidean distance and A and B are the resulting contours of the segmentation. In computer vision and image processing applications, the largest segmentation errors quantified using the Hausdorff distance can be a good measure of the usefulness of the segmentation for the
intended task. Intuitively, HD is the longest distance one has to travel from a point in one of the two sets
to its closest point in the other set. In image segmentation, this metric is computed between boundaries
of the estimated and the ground-truth segmentation, which consist of curves in 2D. We computed the
Hausdorff distance between the generated segmentation and the ground truth. The methods with the
lowest distance are considered to have yielded the best segmentation results.
Another popular metric for image segmentation tasks is the Dice coefficient, which is essentially a measure of the overlap between two samples and is equivalent to the F1 score. This measure ranges from 0 to 1, where a Dice coefficient of 1 denotes perfect and complete overlap. This metric is calculated as

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \qquad (24)$$
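Both metrics can be computed, for example, with NumPy and SciPy as in the following sketch (our illustration; the paper does not state which implementation was used, and the Hausdorff result should be multiplied by the pixel spacing to obtain millimeters):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(contour_a, contour_b):
    """Symmetric Hausdorff distance between two (N, 2) arrays of boundary points."""
    return max(directed_hausdorff(contour_a, contour_b)[0],
               directed_hausdorff(contour_b, contour_a)[0])

def dice_index(mask_a, mask_b):
    """Dice overlap between two boolean masks (Equation (24))."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())
```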
In Table 1, we summarize the results obtained from the performance analysis of the three segmentation methods, using the two above-mentioned metrics. This procedure has been carried out for each of the prostate gland's zones, corresponding to the annotation examples depicted in Figure 12. The ground truth is used as the reference image to calculate all the measurements. First, according to the initialization image, a comparison of the results between Simplified ACS, Basic ACS and GVF ACS is performed. It can easily be inferred from the table that the Basic ACS yields more accurate segmentation results in all the prostate zones than the Simplified ACS. Moreover, it can be seen that the GVF ACS implementation improves the results in all the cases.
Table 1. Analyses of Hausdorff distance (in mm) and Dice index for the CZ, PZ, TZ and tumour area (TUM).
The geometry of the CZ is considered the most difficult part of the prostate MR image, because it is sometimes difficult to demarcate the edges of this area, given that it is surrounded by the other zones; for this reason, it is important to do a good initialization. The results for the PZ are good in all the cases, especially when GVF is used. Moreover, the results on the TZ area are good for all the algorithms considered in this study. Finally, the results for the tumor area indicate that the best segmentation is obtained using the GVF ACS method.
Finally, in order to assess the performance of the various studied segmentation methods, a time analysis has been performed. As in Table 1, the analysis has been divided by zones and by algorithm. The execution time of the segmentation algorithm has then been assessed for each zone, using the same initialization as in Figure 11. For this test, a computer with an i7-8650U microprocessor @ 1.90 GHz, 16 GB of RAM and running Linux 4.15.0-64 has been used. The execution times are reported in Table 2; all the measurements are given in milliseconds. Both the Simplified ACS and the Basic ACS run for a maximum of 200 iterations, and the GVF ACS runs for a maximum of 80 iterations. Both Tables 1 and 2 confirm what has been said about the Simplified ACS algorithm: it is the least accurate, but it runs faster than the other ones. On the other hand, even though the GVF ACS runs fewer iterations than the Basic ACS, their running times are similar. This is because the GVF ACS algorithm executes the same code, plus the additional computations needed to obtain the Gradient Vector Flow contribution. Nonetheless, for this work, this execution time could have been reduced if the equations related to the shape prior and the smoothness had been removed; only the weights for those priors have been set to zero. This code has not been removed so that other developers can try other parameter values and find other combinations of weights to make this program work with either other image modalities or other organs to be studied.
Table 2. Execution time (in ms) for each segmentation algorithm, and for each prostate zone.

                        CZ         PZ         TZ        TUM
Simplified ACS      605.79     231.69     317.19     616.93
Basic ACS          1054.97     574.30     819.18    1108.16
GVF ACS             641.66     644.95     725.72    1058.67
These results show that a good segmentation can be obtained in a reasonable time, which is important for both clinical and machine learning applications. As expected, a single result is not representative of the quality of these algorithms; an analysis of a larger population must be carried out in order to obtain a real quantitative measure of the error of our implementations. The purpose of the work introduced here is to present the tool, the algorithms implemented and how to use them. As future work, we plan to perform a more comprehensive study encompassing other methods known to give good results, in order to contrast the results, as well as other regions of interest within the prostate. This will enable us to assess the validity of the proposed segmentation approach in other settings and to get feedback from a larger pool of specialists. Other avenues of research pertain to the validation of the proposed tool and of the segmentation algorithm on a larger pool of prostate MRI images, as well as with a wider number of clinicians. We also plan to integrate other segmentation methods into the proposed GUI tool to carry out a more comprehensive quantitative comparison and to investigate its diagnostic differentiation from other prostate diseases such as BPH and chronic or acute prostatitis.
5. Conclusions
In this paper we described the implementation of a snake-based segmentation algorithm, based on Active Contour Models (ACM), which has been integrated into a GUI-based application. We focused our research on a method originally conceived for segmenting the shape of the heart's left ventricle. A modification of this specific prior has yielded very good results in the context of prostate cancer segmentation: the snake model tries to evolve and to adjust to the boundaries of the proposed zones. The problem with this implementation is that the performance of this previous model highly depends on the quality of the input image, as it uses the pixel intensities and the edges of the object in order to determine the correct curve evolution. As can be expected, if the boundaries are not well defined, or if the intensities of different sections have similar values, then either the model will not be able to assess the difference, preventing the algorithm from evolving, or it will cause the snake to overlap with other elements in the image. This scenario can be observed in Figure 12, where the red contour overlaps with the area that corresponds to the TZ.
Furthermore, in order to obtain that segmentation, several shapes for the initial contours had to be tested to obtain the best results, which might be very different from one run of the algorithm to another. This means that the implemented snake-based algorithms are very sensitive to the initialization conditions. This type of problem has fostered researchers in image analysis and computer vision to find novel ways of constraining the shape of the force that pushes the objects' boundaries in the right direction. By doing so, the segmentation algorithm becomes robust to initialization conditions and less sensitive to the quality of the image.
Regarding the performance of the two implementations, the Simplified ACS and the Basic ACS, their behaviour is not the same. Since the Basic ACS includes all the weights and all the functions described in the original paper, the boundaries of the final segmentation tend to be closer to the real boundaries of the object to segment than in the case of the Simplified ACS. Nonetheless, both of them fail in the task of finding the boundaries when the image intensities of two contiguous areas are not noticeably different, as in the case of the tumor and the central zone in Figure 12. This problem has been partially overcome by using the Gradient Vector Flow external force.
Further improvements to this algorithm will thus be directed towards implementing other types of Gradient Vector Flow energy optimization algorithms, or exploring how to include shape priors for each zone of the prostate separately. These modifications would help to create a segmentation that does not stray too far from where the real object is. Additionally, no pre-processing or post-processing functions have been included. Adding pre-processing operations might help to remove the noise that drives the snake in the wrong directions, and post-processing of the resulting contour would help to make it smoother and closer to the real boundaries of the object.
Moreover, since DICOM files can contain different image modalities, and each modality has its particularities, a profiles section could be added to the program. Each profile would contain a set of parameters (prior weights, GVF weight, regularization constants, and number of iterations) specific to each algorithm. These profiles would be chosen so that they produce the best segmentation results for the selected image modality, increasing the scope of application of the program.
Author Contributions: Conceptualization, C.M. and J.R.; methodology, all authors; software, J.R.; validation, C.M. and G.O.-R.; resources, C.M. and G.O.-R.; data curation, C.M.; writing—original draft preparation, J.R. and C.M.; writing—review and editing, all authors; visualization, all authors; supervision, C.M. and G.O.-R. All authors have read and agreed to the published version of the manuscript.
Funding: This work was partially funded by the Spanish R+D+I grant n. TIN2012-37171-C02-01, by UdG grant
MPCUdG2016/022. The Regional Council of Burgundy under the PARI 1 scheme also sponsored this work.
C. Mata held a Mediterranean Office for Youth mobility grant.
Conflicts of Interest: The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Siegel, R.L.; Miller, K.D.; Jemal, A. Cancer Statistics, 2020. CA Cancer J. Clin. 2020, 70, 7–30. [CrossRef] [PubMed]
2. Carioli, G.; Bertuccio, P.; Boffetta, P.; Levi, F.; La Vecchia, C.; Negri, E.; Malvezzi, M. European cancer mortality
predictions for the year 2020 with a focus on prostate cancer. Ann. Oncol. 2020, 31, 650–658. [CrossRef]
[PubMed]
3. Shah, V.; Turkbey, B.; Mani, H.; Pang, Y.; Pohida, T.; Merino, M.J.; Pinto, P.A.; Choyke, P.L.; Bernardo, M.
Decision support system for localizing prostate cancer based on multiparametric magnetic resonance
imaging. Med. Phys. 2012, 39, 4093–4103. [CrossRef] [PubMed]
4. Dickinson, L.; Ahmed, H.U.; Allen, C.; Barentsz, J.O.; Carey, B.; Futterer, J.J.; Heijmink, S.W.; Hoskin, P.J.; Kirkham, A.; Padhani, A.R.; et al. Magnetic resonance imaging for the detection, localisation, and characterisation of prostate cancer: Recommendations from a European consensus meeting. Eur. Urol. 2011, 59, 477–494. [CrossRef] [PubMed]
5. Fradet, V.; Kurhanewicz, J.; Cowan, J.E.; Karl, A.; Coakley, F.V.; Shinohara, K.; Carroll, P.R. Prostate cancer
managed with active surveillance: Role of anatomic MR imaging and MR spectroscopic imaging.
Radiology 2010, 256, 176–183. [CrossRef] [PubMed]
6. Ghose, S.; Oliver, A.; Mitra, J.; Martí, R.; Lladó, X.; Freixenet, J.; Meriaudeau, F.; Sidibé, D.; Vilanova, J.C.;
Comet, J. A supervised learning framework of statistical shape and probability priors for automatic prostate
segmentation in ultrasound images. Med. Image Anal. 2013, 17, 587–600. [CrossRef] [PubMed]
7. Ghose, S.; Oliver, O.; Martí, J.; Lladó, X.; Vilanova, J.; Freixenet, J. A survey of prostate segmentation
methodologies in ultrasound, magnetic resonance and computed tomography images. Comput. Methods
Programs Biomed. 2012, 108, 262–287. [CrossRef] [PubMed]
8. Vos, P.C.; Hambrock, T.; Barenstz, J.O.; Huisman, H.J. Computer-assisted analysis of peripheral zone
prostate lesions using T2-weighted and dynamic contrast enhanced T1-weighted MRI. Phys. Med. Biol. 2010,
55, 1719–1734. [CrossRef] [PubMed]
9. Chen, M.; Dang, H.; Wang, J.; Zhou, C.; Li, S.; Wang, W. Prostate cancer detection: Comparison of t2-weighted
imaging, diffusion-weighted imaging, proton magnetic resonance spectroscopic imaging, and the three
techniques combined. Acta Radiol. 2008, 49, 602–610. [CrossRef] [PubMed]
10. Brancato, V.; Di Costanzo, G.; Basso, L.; Tramontano, L.; Puglia, M.; Ragozzino, A.; Cavaliere, C. Assessment
of DCE Utility for PCa Diagnosis Using PI-RADS v2.1: Effects on Diagnostic Accuracy and Reproducibility.
Diagnostics 2020, 10, 164. [CrossRef] [PubMed]
11. Mata, C.; Walker, P.; Oliver, A.; Brunotte, F.; Martí, J.; Lalande, A. Prostateanalyzer: Web-based medical
application for the management of prostate cancer using multiparametric mr images. Inform. Health Soc. Care
2015, 87, 1–21. [CrossRef] [PubMed]
12. Pluempitiwiriyawej, C.; Moura, J.M.; Wu, Y.J.L.; Ho, C. STACS: New active contour scheme for cardiac MR
image segmentation. IEEE Trans. Med. Imaging 2005, 24, 593–603. [CrossRef] [PubMed]
13. De Marzo, A.; Platz, E.; Sutcliffe, S.; Xu, J.; Grönberg, H.; Drake, C.; Nakay, Y. Inflammation in prostate
carcinogenesis. Nat. Rev. Cancer 2007, 7, 256–269. [CrossRef] [PubMed]
14. Verma, S.; Rajesh, A. A clinically relevant approach to imaging prostate cancer: Review. Am. J. Roentgenol.
2011, 196, S1–S10. [CrossRef] [PubMed]
15. Rodríguez, J.; Ochoa-Ruíz, G.; Mata, C. Prostate-MRI-segmentation-App. GitHub Repos. 2020. Available
online: https://github.com/joako1991/Prostate-MRI-segmentation-App (accessed on 2 September 2020).
16. Chan, T.F.; Vese, L.A. Active contours without edges. IEEE Trans. Image Process. 2001, 10, 266–277.
[CrossRef] [PubMed]
17. Xu, C.; Prince, J.L. Snakes, shapes, and gradient vector flow. IEEE Trans. Image Process. 1998, 7, 359–369.
[CrossRef] [PubMed]
18. Paragios, N.; Mellina-Gottardo, O.; Ramesh, V. Gradient vector flow fast geometric active contours.
IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 402–407. [CrossRef] [PubMed]
19. Xu, C.; Yezzi, A.; Prince, J.L. On the relationship between parametric and geometric active contours.
In Proceedings of the Conference Record of the Thirty-Fourth Asilomar Conference on Signals, Systems
and Computers (Cat. No.00CH37154), Pacific Grove, CA, USA, 29 October–1 November 2000; Volume 1,
pp. 483–489. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
applied
sciences
Article
Medical Assistant Mobile Application for Diabetes
Control by Simulating a Compartmental Model
Martín Hernández-Ordoñez 1, *, Marco Aurelio Nuño-Maganda 2 , Carlos Adrián Calles-Arriaga 2 ,
Abelardo Rodríguez-León 3, *, Guillermo Efren Ovando-Chacon 1, *, Rolando Salazar-Hernández 4 ,
Omar Montaño-Rivas 5 and José Margarito Canseco-Cortinas 2
1 Department of Metal-Mechanics, México National Technological/I.T. Campus, Veracruz 91860, Mexico
2 Intelligent Systems Department, Polytechnic University of Victoria, Tamaulipas 87138, Mexico;
[email protected] (M.A.N.-M.); [email protected] (C.A.C.-A.); [email protected] (J.M.C.-C.)
3 Department of Systems and Computing, México National Technological/I.T. Campus,
Veracruz 91860, Mexico
4 Facultad de Comercio, Administración y Ciencias Sociales, Universidad Autónoma de Tamaulipas,
Tamaulipas 87000, Mexico; [email protected]
5 Academy of Information Technologies, Polytechnic University of San Luis Potosí, San Luis 78369, Mexico;
[email protected]
* Correspondence: [email protected] (M.H.-O.); [email protected] (A.R.-L.);
[email protected] (G.E.O.-C.)
Abstract: This paper presents an educational mobile assistant application for type 1 diabetes
patients. The proposed application is based on four mathematical models that describe the
glucose-insulin-glucagon dynamics using a compartmental model, with additional equations
to reproduce aerobic exercise, gastric glucose absorption by the gut, and subcutaneous insulin
absorption. The medical assistant was implemented in Java and deployed and validated on several
smartphones with Android OS. Multiple daily doses can be simulated to perform intensive insulin
therapy. As a result, the proposed application shows the influence of exercise periods, food intakes,
and insulin treatments on glucose concentrations. Four parameter variations are studied, and their corresponding glucose concentration plots are obtained, which show agreement with state-of-the-art simulators. The developed application is focused on type 1 diabetes, but it can be extended to consider type 2 diabetes by modifying the current mathematical models.
1. Introduction
In recent years, obesity and overweight have become a severe problem for human health. In 2014, more than 1.9 billion adults aged 18 and older were overweight globally, of whom over 600 million were obese; in other words, 39% of the world's adults were overweight and 13% were obese [1].
Body mass index (BMI) provides the most useful population-level measure of overweight and obesity
as it is the same for both sexes and all ages of adults. Patients with higher BMI are at greater risk
for having diabetes. In 2014, more than 422 million people had diabetes in the world, with a global
prevalence of 8.5% among adults over 18 years old. Diabetes is a leading cause of blindness, kidney
failure, heart attacks, stroke and lower limb amputation. In 2012, an estimated 1.5 million deaths
were directly caused by diabetes, and another 2.2 million deaths were attributable to high blood
glucose [2]. Diabetes rose from 21st place in 1990 to 14th in 2010 among the leading causes of death worldwide [3], and more recently reached 6th place in 2016 [4].
In Mexico, about 72.5% of adults from 18 years and over, 36.3% of teenagers, and 33.2% of children
were overweight or obese in 2016. Diabetes statistics are similarly concerning: Mexico ranks sixth among the countries with the most people with diabetes, behind only China, India, the US, Brazil, and Russia [5]. Recent statistics show that 9.17% of adults aged 18 and over (about 6.4 million people) have diabetes [6]. Type 2 diabetes
has become the leading cause of death in Mexico. The disease claims nearly 80,000 lives each year, and
forecasters say the health problem is expected to get worse in the decades to come [7].
On the other hand, smartphones have been accessible to more people due to the high worldwide
market growth rate. In 2016, around 1.5 billion smartphones were sold worldwide [8]; by the end
of 2016, more than 3.2 billion individuals worldwide used the Internet regularly, accounting for
nearly 45% of the world’s population; about 2.5 billion people access the web via mobile phone [9],
which represents novel opportunities for designing applications running on smartphones, accessible
to large population groups. Specifically, healthcare and disease management have benefited with
applications such as fitness tracking, drug dose calculators, information apps, tutorials, surveys, diaries,
and other tools created to help empower users to put more emphasis on their health care. The benefits of using smartphones with Internet connectivity include the possibility of sharing data with family, friends, classmates, colleagues, and collaborators, but especially with professionals who can guide the user and give feedback regarding the treatment to be followed.
This paper is structured as follows. In Section 2, a concise review of the state of the art is given.
A description of the models used in the developed application is provided in Section 3. In Section 4,
the results and discussion are described. Finally, Section 5 presents the conclusions, establishes final remarks, and outlines future work.
modify glucose concentration. It uses three mathematical models to consider all the mentioned effects;
however, it does not describe periods of exercise or other modeled effects.
In [16], a system is proposed based on the combination of Compartmental Models (CMs) and
artificial Neural Networks (NNs). Data from a Type 1 diabetes patient, which contains information
about measured blood glucose levels, insulin intake, and food intake description, along with the
corresponding time, were used as input to the model. The proposed strategy was not adequately
validated. Furthermore, no performance criteria were mentioned to assess the results objectively.
The literature review indicates that early studies of compartmental modelling of glucose/insulin interactions date back to the 1980s [17–19]. More recently, different research groups have reported important work on implementing compartmental models to develop glucose level simulators; for example, [20,21] present a glucose forecasting algorithm based on a compartmental composite model of glucose-insulin dynamics. In [22], a dynamic model for a type 1 diabetic patient that focuses on blood glucose regulation is developed. A first model for
insulin adsorption in infusion sets using a minimal lumped two-compartment modelling approach is
reported in [23]. A review of the key role of the UVA/Padova simulator is presented in [24], which has
tested the use of new molecules for better glucose control.
carry a risk of incorrect dose recommendations ranging from those that might lead to suboptimal
disease control to those with potentially life-threatening consequences. In [30], an analysis of the
benefits of smartphone apps for type 2 diabetes mellitus (T2DM) self-management was presented.
It can be concluded that more research with valid study designs and longer follow-up is needed to
evaluate the impact of health apps for diabetes care and self-management, specifically for T2DM.
In [31], both heuristic evaluation and keystroke-level modeling (KLM) are used to measure the efficiency of each analyzed mobile diabetes management application for patients with type 1 diabetes mellitus.
With recent advances in connectivity technologies, there are some efforts to build tools based on
cloud computing [32] or social network support [33,34]. One disadvantage of these tools is that they require Internet connectivity at all times, which makes them unusable in places with connectivity
problems. Some applications are targeted to a particular segment of the population, such as pregnant
women; one of these applications is reported in [35,36], where a tool to follow patients with gestational
diabetes is described. In [37], an application for insulin dose calculation is reported. The authors
establish that this application’s innovation consists of the incorporation of exercise and alcohol intake
to the calculator in addition to carbohydrates intake. In [38], an application created explicitly for
controlling an insulin pump, taking real-time input data from a continuous glucose monitor, is reported. A closed-loop control system used to automatically manage insulin infusion to maintain glucose levels within the desired range is proposed. To summarize, there is a strong interest from the scientific
community in developing applications around the problems associated with diabetes. This article
adds to this trend with an emphasis on the exercise model.
3. Models
The idea of creating an artificial model of the glucose-insulin system for several purposes has been widely explored. In [39], a minimal glucose-insulin system is proposed; the model starts from the Bergman model [17] and introduces modifications to improve it, adding a description of the glucose level in the subcutaneous layer and a meal disturbance term. The developed application uses four mathematical models related to these, according to the different dynamics they describe:
1. Glucose-Insulin Model: proposed by Sorensen [40]. This model comprises 19 nonlinear differential equations, which include the significant metabolic effects related to glucose regulation. The model was obtained through a compartmental technique, where the pancreas compartment is omitted to represent a type 1 diabetic patient. Recently, this glucose-insulin model was extended
to consider exercise periods [41].
2. Subcutaneous Insulin Model: obtained by Berger and Rodbard [42]. This model describes the subcutaneous insulin uptake pattern after an insulin injection. The model considers interactions
between components after subcutaneous injections such as insulin absorption and elimination in
plasma insulin, active insulin, glucose utilization, plasma glucose, and glucose input. The model
is based on a set of differential equations. Variations in plasma insulin concentration (I) can be
estimated by Equation (1).
\frac{dI}{dt} = \frac{I_{abs}}{V_i} - k_e I \qquad (1)
where Iabs represents the rate of insulin absorption, Vi is the volume of insulin distribution, and ke
is the first-order rate constant of insulin elimination. The absorption of insulin can be calculated
as follows:
I_{abs} = \frac{s \, t^{s} \, T_{50}^{s} \, D}{t \, (T_{50}^{s} + t^{s})^{2}} \qquad (2)
where s is a parameter associated with the insulin absorption pattern, t represents the time elapsed since the injection, and T50 is the time at which 50% of the dose (D) has been absorbed. A numerical sketch covering these and the remaining model equations is given after this list.
3. To estimate steady-state conditions, the insulin profile (Iss) is computed using the superposition principle, assuming three days. Glucose utilization is then given by
G_{out}(G) = \frac{V_{max} \, G}{K_m + G} \qquad (5)
where G represents the glucose concentration, Vmax is the maximal glucose utilization for the reference insulin concentration, and Km is the glucose concentration for the half-maximal response. Vmax depends on insulin; mathematically, this is

V_{max}(I_{ss}) = \frac{(c \, I_{ss} + d)(K_m + G_x)}{G_x} \qquad (6)
where Iss is plasma insulin concentration equivalent to steady-state insulin, and Gx is the reference
glucose value. Glucose utilization for any combination of plasma glucose and insulin levels is
given by
G_{out}(G, I_{ss}) = \frac{G \, (c \, I_{ss} + d)(K_m + G_x)}{G_x \, (K_m + G)} \qquad (7)
where c is insulin sensitivity, and d represents insulin-independent glucose utilization. The insulin
absorption dynamics for regular, NPH, lente, ultralente formulations were characterized in [42]
just by changing some specific parameters in the general model. Recently, in [44], the Lispro
insulin formulation was also characterized by the model in [42].
5. The exercise model was taken from [41]. The liver's glycogen reservoir (GLY) can be calculated as shown in Equation (8).
\frac{dGLY(t)}{dt} = -\phi\left(PVO_2^{max}(t), GLY(t)\right) + \Psi\left(G_{abs}(t), GLY(t)\right) \qquad (8)
\phi\left(PVO_2^{max}, GLY\right) = \begin{cases} \zeta\left(PVO_2^{max}\right) & \text{if } 0 < GLY \le GLY_{max} \\ 0 & \text{if } GLY = 0 \end{cases} \qquad (9)
\Psi\left(G_{abs}, GLY\right) = \begin{cases} G_{abs} & \text{if } GLY < GLY_{max} \\ 0 & \text{if } GLY \ge GLY_{max} \end{cases} \qquad (10)
\zeta\left(PVO_2^{max}\right) = 0.006 \left(PVO_2^{max}\right)^2 + 1.226 \, PVO_2^{max} - 10.1958 \ \text{mg/min} \qquad (11)
where ϕ is the net glycogenolysis rate, Ψ represents the absorptive period, ζ is the glucose uptake
due to exercise intensity, and PVO2 max is the percentage of the maximum oxygen consumption rate.
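The following is a minimal numerical sketch, in Java, of how Equations (1), (2), (5), (7), and (8)–(11) could be stepped in time with a simple forward Euler scheme. All parameter values, the exercise schedule, and the assumption of zero gut absorption are illustrative placeholders, not the values used in the AEDMA application.

```java
// Minimal numerical sketch of Equations (1)-(2), (5)-(7), and (8)-(11), stepped with
// forward Euler. All parameter values are illustrative assumptions only.
public final class PatientModelSketch {

    // Equation (2): subcutaneous insulin absorption rate.
    static double insulinAbsorption(double t, double s, double t50, double dose) {
        if (t <= 0) return 0.0;
        double ts = Math.pow(t, s);
        double t50s = Math.pow(t50, s);
        return s * ts * t50s * dose / (t * Math.pow(t50s + ts, 2));
    }

    // Equation (7): glucose utilization for given glucose and steady-state insulin.
    static double glucoseUtilization(double g, double iss,
                                     double c, double d, double km, double gx) {
        return g * (c * iss + d) * (km + gx) / (gx * (km + g));
    }

    // Equation (11): glucose uptake rate due to exercise intensity (mg/min).
    static double zeta(double pvo2max) {
        return 0.006 * pvo2max * pvo2max + 1.226 * pvo2max - 10.1958;
    }

    public static void main(String[] args) {
        // Illustrative parameters (assumed, not taken from the article).
        double s = 2.0, t50 = 300.0, dose = 2.0; // absorption pattern, min, insulin units
        double vi = 12.0, ke = 0.05;             // distribution volume (L), elimination (1/min)
        double c = 0.02, d = 0.5;                // insulin sensitivity, insulin-independent use
        double km = 10.0, gx = 5.3;              // half-maximal and reference glucose
        double glyMax = 100_000.0;               // glycogen capacity, mg (assumed)

        double insulin = 0.0;                    // plasma insulin I(t)
        double glucose = 6.0;                    // illustrative glucose level for Gout
        double gly = glyMax;                     // liver glycogen reservoir
        double dt = 1.0;                         // time step, minutes

        for (int minute = 1; minute <= 600; minute++) {
            // Equation (1): dI/dt = Iabs/Vi - ke*I.
            double iabs = insulinAbsorption(minute, s, t50, dose);
            insulin += dt * (iabs / vi - ke * insulin);

            // Equations (8)-(10): glycogen balance during 30 min of moderate exercise (assumed).
            boolean exercising = minute >= 120 && minute < 150;
            double phi = (exercising && gly > 0) ? zeta(40.0) : 0.0; // Equation (9)
            double gabs = 0.0;                                       // gut absorption (assumed 0)
            double psi = (gly < glyMax) ? gabs : 0.0;                // Equation (10)
            gly = Math.max(0.0, gly + dt * (-phi + psi));            // Equation (8)
        }

        double gout = glucoseUtilization(glucose, insulin, c, d, km, gx); // Equation (7)
        System.out.printf("I = %.4f, GLY = %.1f mg, Gout = %.4f%n", insulin, gly, gout);
    }
}
```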
In Figure 1, the block diagram of the diabetic patient model utilized in this work is shown. There are four subsystems, each corresponding to a previously described model, and their interactions are shown as solid arrows. The dotted lines delimit the mathematical models proposed in the literature, which describe different but comparable physiological phenomena because they operate on the same scale.
Figure 1. The diabetic patient model utilized for the implementation of simulations reported in
this work.
1. Meal parameters: these parameters are related to the carbohydrate intake (in grams) and the day's carbohydrate intake schedule.
2. Insulin dose parameters: these parameters are related to the type (fast or slow actions) and the
dose of insulin (in units).
3. Exercise parameters: these parameters are related to the quantity and intensity of exercise.
4. Simulation configuration: these parameters are related to the simulation’s duration and the type
of plot to be generated.
The architecture contains the following blocks:
1. Simulation variables container: this block stores data required to be accessed and shared by all model blocks. It contains three profiles: meal dosing, insulin dosing, and exercise quantification.
2. Meal model block: this block takes as input the meal profile and generates as output the glucose obtained according to the carbohydrate intake in grams.
3. Insulin model block: this block takes as input the insulin dosing profile and generates as output the quantity and duration of insulin in the blood flow.
4. Exercise model block: this block takes as input the exercise dosing profile and generates as output the blood redistribution volume according to the quantity of exercise.
5. Simulation loop block: this block updates each model block with the required data from the others and controls the simulation's start and end.
6. Plot generation module: this block generates data from the simulation to be used for generating a plot.
7. Plot visualization module: this block takes as input the vector generated by the plot generation module and displays the plot.
Finally, the architecture generates as output the plot with the information corresponding to the user's input.
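As a rough illustration of how such a block architecture could be wired together, the following Java sketch defines a common block interface, a shared variables container, and a fixed-step simulation loop. The interface and class names are hypothetical and are not taken from the application's actual source code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the block architecture described above.
// Interface and class names are illustrative, not the application's actual code.
interface ModelBlock {
    // Advance the block one simulation step, reading and writing shared variables.
    void step(SimulationVariables vars, double minuteOfDay);
}

class SimulationVariables {
    double glucose;        // current blood glucose concentration
    double plasmaInsulin;  // current plasma insulin concentration
    double gutGlucose;     // glucose appearing from the gut
    final List<double[]> plotData = new ArrayList<>(); // (time, glucose) samples
}

class SimulationLoop {
    private final List<ModelBlock> blocks;

    SimulationLoop(List<ModelBlock> blocks) {
        this.blocks = blocks;
    }

    // Runs the fixed-step simulation and collects data for the plot module.
    SimulationVariables run(int days) {
        SimulationVariables vars = new SimulationVariables();
        for (int minute = 0; minute < days * 24 * 60; minute++) {
            for (ModelBlock block : blocks) {
                block.step(vars, minute % (24 * 60));
            }
            vars.plotData.add(new double[] { minute, vars.glucose });
        }
        return vars;
    }
}
```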
Figure 3. Screens of the proposed AEDMA app. (a) Main screen of the proposed application. (b) Components of the meal model input tab. (c) Components of the insulin model tab.
In Figure 3b, the tab for the meal model parameters is shown. This tab contains interface controls where users can define each meal's time and its estimated carbohydrate content (in grams). The estimated carbohydrate content can also be calculated automatically from a default daily calorie estimate, or the user can input the total calories and the application estimates the carbohydrate content in grams from the given calories.
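One plausible way to derive carbohydrate grams from a daily calorie figure is sketched below. The assumed 50% carbohydrate share of total calories and the 4 kcal/g energy content of carbohydrate are illustrative assumptions; the application's actual estimation rule may differ.

```java
// Illustrative only: estimate carbohydrate grams from total daily calories.
// Assumes 50% of calories come from carbohydrates and 4 kcal per gram of carbohydrate;
// the application's actual estimation rule may differ.
public final class CarbEstimateSketch {
    static double carbohydrateGrams(double dailyCalories, double carbShare) {
        return dailyCalories * carbShare / 4.0; // 4 kcal per gram of carbohydrate
    }

    public static void main(String[] args) {
        double grams = carbohydrateGrams(2000.0, 0.50);
        System.out.printf("Estimated carbohydrate intake: %.1f g/day%n", grams);
    }
}
```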
In Figure 3c, the tab for the insulin model is shown. This tab contains interface controls that define the type of insulin and the units to be dosed at each meal, for both the rapid and prolonged effects.
In Figure 4a, the tab for the exercise model is shown. This tab contains interface controls where
the user defines the intensity, duration, start time, and exercise routine.
Figure 4. The last two configuration screens. (a) Components of the exercise model tab. (b) Components of the simulation tab.
In Figure 4b, the tab for the simulation is shown. In this tab, the user defines the duration of the simulation in weeks and days. It is also possible to set all variables to their default values, start the simulation, and show the insulin, carbohydrate, or glucose concentration plots.
In Table 2, the application's default values are listed; the user can tune these values to gain a better understanding of the model behavior.
Table 2. Default values used in the application.
Meal Model
Breakfast time 7:00 AM
Lunch time 1:00 PM
Dinner time 7:00 PM
Breakfast CHO (g) 97.91
Lunch CHO (g) 138.97
Dinner CHO (g) 59.32
Table 2. Cont.
Insulin Model
Rapid effect insulin type Lispro
Prolonged effect insulin Type NPH
Rapid effect insulin units at breakfast 2
Rapid effect insulin units at lunch 2
Rapid effect insulin units at dinner 2
Prolonged effect insulin units at breakfast 3
Prolonged effect insulin units at dinner 3
Exercise Model
Intensity Light
Duration 10 min
Start time 9:00 AM
Routine Monday checked
Table 3. Carbohydrate intake and insulin dosing used in the four experiments.

Exp   Carbohydrate Grams Intake        Rapid Effect Insulin         Prolonged Effect Insulin
      B-CHO     L-CHO     D-CHO        Type     B-u   L-u   D-u     Type    B-u   D-u
1     97.91     138.97    59.32        Lispro   2     2     2       NPH     3     3
2     49.85     69.48     29.66        Lispro   2     2     2       NPH     3     3
3     97.91     138.97    59.32        Lispro   4     4     4       NPH     6     6
4     49.85     69.48     29.66        Lispro   4     4     4       NPH     6     6
Abbreviations: carbohydrate grams at breakfast (B-CHO), carbohydrate grams at lunch (L-CHO), carbohydrate grams at dinner (D-CHO), insulin units at breakfast (B-u), insulin units at lunch (L-u), insulin units at dinner (D-u). In the first experiment,
the default values for carbohydrate and insulin units are used. In the second experiment, carbohydrate
grams are reduced to half, but the insulin units are left at their default values. In the third experiment,
default values for carbohydrate grams are used, but the insulin dose is doubled, and in the last
experiment, carbohydrate grams are reduced to half, and the insulin dose is doubled.
Figure 5 shows a glucose concentration plot for a 5-day simulation for each experiment reported
in Table 3. In Figure 5a, the maximum glucose peak is close to 195 mg/dL on every day of the reported simulation. In contrast, in Figure 5b, the maximum glucose peak barely surpasses 175 mg/dL owing to the reduction in carbohydrate intake. In Figure 5c, the same carbohydrate intake is used, but both
prolonged and rapid insulin doses are doubled. It can be seen that the glucose levels are lower than in the plot of experiment 1; in Figure 5d, the carbohydrate intake is additionally reduced to half to validate both the meal and insulin models. In both experiments (3 and 4), the corresponding plots show that there is even a risk of hypoglycemia due to the high insulin dose.
Figure 5. Comparison of blood glucose plots generated by AEDMA application for several values
of carbohydrate grams and insulin units. (a) Experiment 1. (b) Experiment 2. (c) Experiment 3.
(d) Experiment 4.
In Figure 6, a validation of the exercise model through the corresponding simulation plots is shown. To obtain the reported plots, the exercise intensities used are light, moderate, and heavy, and the exercise durations are 10, 30, and 90 min. The default values for grams of carbohydrate intake and insulin doses are used. In Figure 6a, the simulation using 10 min of light exercise is shown. The glucose peak after exercise (at 9:00 AM) is close to 185 mg/dL; with 30 min of light exercise it is under 180 mg/dL (Figure 6b), and with 90 min of light exercise the glucose peak is close to 175 mg/dL (Figure 6c). Similar behavior is observed when using the same durations but with moderate exercise (Figure 6d–f). The glucose peak after exercise (at 9:00 AM) is close to 175 mg/dL, which is lower than the corresponding peak obtained with light exercise (180 mg/dL). Finally, this behavior persists with heavy exercise (Figure 6g–i). Comparing exercise intensities, the plots show that the glucose peaks after 10 min of exercise are close to
185 mg/dL for light exercise, over 175 mg/dL for moderate exercise, and below 170 mg/dL for heavy
exercise. Similar results are obtained for the 30 and 90 min exercise durations. In general, many applications exist; however, long-term prediction of glucose still remains a challenge [20]. In this work, an app has been developed that is capable of predicting the behavior of glucose levels over several days (see Figure 5) or during a day with different exercise regimes (see Figure 6).
Figure 6. Comparison of blood glucose plots generated by AEDMA application for several exercise
intensities using experiment 1 settings. (a) Light exercise—10 min. (b) Light exercise—30 min. (c) Light
exercise—90 min. (d) Moderate exercise—10 min. (e) Moderate exercise—30 min. (f) Moderate
exercise—90 min. (g) Heavy exercise—10 min. (h) Heavy exercise—30 min. (i) Heavy exercise—90 min.
Until now, the complete software of the presented application has not been compared with other applications, since our priority is to focus on the diet and physical activity of the Mexican population with diabetes; many existing applications instead follow European or United States standards. However, each submodel has been compared and validated separately by its respective authors. It is important to point out that the current version of the application presented in this work is an extension of a
previous compartmental model that takes exercise periods into account. This model was developed by the main author, who validated it and compared its results with experimental data [41]. So far, we have performed a prototype test on 26 young adults and an execution-time comparison across different devices.
After testing the AEDMA for at least 1 h on their own devices, the participants were asked to answer a questionnaire containing 10 questions related to the key aspects of the AEDMA. Each question has 5 options: NO, Rather NO, Do not know, Rather YES, and YES. The questions and the percentage of responses for each option are summarized in Table 5. From the obtained responses, we can summarize the following remarks:
Table 6. Time to complete computation of glucose concentration plots (in seconds) using several
simulation time durations.
5. Conclusions
In this paper, an application for diabetes education is proposed. The proposed application was designed to provide a software tool to help support therapy in patients with diabetes, seeking to improve their quality of life. The simulation is currently fixed, but interactive simulation functionality could be added to provide a richer user experience. The software is based on four mathematical models originally designed as a desktop computer program. An exercise option in the interface helps patients to schedule their routines and to estimate in advance their impact on blood glucose. This app could also be developed for Apple's iOS products. In addition, this app could potentially be implemented as a
smartwatch app. The most important conclusions are:
1. The App includes glucose regulation associated with metabolism and the application of an
insulin injection.
2. The App is capable of taking into account the effects of food intake and physical activity.
3. The developed application can simulate the behavior of glucose levels for long periods.
4. The App is capable of considering three types of physical activity: light, moderate and heavy.
5. The developed application is focused on type 1 diabetes, but it can be extended to consider type 2 diabetes.
6. As a future improvement, the application should be modified to split the simulation processing
into smaller operations running on multiple threads. This could be beneficial in devices with
more than one core.
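A rough sketch of this idea is shown below: independent simulation scenarios (or independent plot computations) could be submitted to a thread pool so that devices with several cores compute them in parallel. The scenario names and the runSimulation() placeholder are hypothetical, not the application's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: running independent simulation scenarios on multiple threads.
// The scenario list and the runSimulation() placeholder are hypothetical.
public final class ParallelSimulationSketch {

    // Placeholder for a complete single-scenario simulation.
    static double[] runSimulation(String scenario) {
        return new double[] { /* glucose samples would go here */ };
    }

    public static void main(String[] args) throws Exception {
        List<String> scenarios = List.of("light-10min", "moderate-30min", "heavy-90min");
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        List<Callable<double[]>> tasks = new ArrayList<>();
        for (String scenario : scenarios) {
            tasks.add(() -> runSimulation(scenario));
        }

        List<Future<double[]>> results = pool.invokeAll(tasks);
        for (Future<double[]> f : results) {
            double[] samples = f.get(); // blocks until the scenario finishes
            System.out.println("Scenario produced " + samples.length + " samples");
        }
        pool.shutdown();
    }
}
```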
Author Contributions: Conceptualization, M.H.-O., M.A.N.-M.; Formal Analysis, C.A.C.-A., A.R.-L.; Investigation,
M.A.N.-M., G.E.O.-C., J.M.C.-C.; Methodology, M.H.-O., C.A.C.-A., G.E.O.-C., R.S.-H., O.M.-R.; Project
Administration, M.H.-O., R.S.-H.; Software, M.H.-O., G.E.O.-C., J.M.C.-C., O.M.-R.; Supervision, M.H.-O.,
M.A.N.-M., A.R.-L., O.M.-R.; Validation, A.R.-L., R.S.-H., J.M.C.-C.; Visualization, A.R.-L., J.M.C.-C., O.M.-R.;
Writing Original Draft, M.H.-O., M.A.N.-M., C.A.C.-A., G.E.O.-C.; Writing Review and Editing, M.A.N.-M.,
C.A.C.-A., R.S.-H. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: The present project was partially funded by the National Council of Science and Technology of
México (CONACYT) through a scholarship (no. 337087) granted to José Margarito Canseco-Cortinas. The authors
are grateful for the support provided by TNM-ITV, CONACYT and PRODEP.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. World Health Organization. Obesity and Overweight Fact Sheet; WHO: Geneva, Switzerland, 2016.
2. World Health Organization. Diabetes Fact Sheet; WHO: Geneva, Switzerland, 2016.
3. Institute for Health Metrics and Evaluation. The Global Burden of Disease: Generating Evidence, Guiding Policy;
IHME: Washington, DC, USA, 2013.
4. World Health Organization. The Top 10 Causes of Death; WHO: Geneva, Switzerland, 2017.
5. International Diabetes Federation. IDF Diabetes Atlas, 7th ed.; International Diabetes Federation: Brussels,
Belgium, 2015.
6. Levy, T.S.; Nasu, L.C.; Dommarco, J.R.; Ávila, M.H. Informe Final de Resultados, Encuesta Nacional de Salud y
Nutrición de Medio Camino 2016; Instituto Nacional de Salud Pública INSP: Mexico City, Mexico, 2016.
7. Institute for Health Metrics and Evaluation (IHME). Country Profile Mexico. Mexico, 2017. Available online:
http://www.healthdata.org/ (accessed on 7 October 2018).
8. Statista Inc. Global Smartphone Sales to End Users 2007–2016. Available online: https://www.statista.com/
statistics/263437/global-smartphone-sales-to-end-users-since-2007/ (accessed on 12 October 2018).
9. eMarketer. Worldwide Internet and Mobile Users: eMarketer’s Updated Estimates and Forecast for
2015–2020. 2016. Available online: https://www.emarketer.com/Report/Worldwide-Internet-Mobile-Users-
eMarketersUpdated-Estimates-Forecast-20152020/2001897#moreReport (accessed on 25 September 2018).
10. Lehmann, E. Preliminary experience with the Internet release of AIDA—An interactive educational diabetes
simulator. Comput. Methods Programs Biomed. 1998, 56, 109–132. [CrossRef]
11. Hernandez-Ordonez, M.; Montano, O.; Campos-Delgado, D.U.; Palacios, E. Development of an Educational
Simulator and Graphical User Interface for Diabetic Patients. In Proceedings of the 2007 4th International
Conference on Electrical and Electronics Engineering, Mexico City, Mexico, 5–7 September 2007; pp. 82–85.
[CrossRef]
12. Lehmann, E.D.; Tarín, C.; Bondia, J.; Teufel, E.; Deutsch, T. Development of AIDA v4.3b Diabetes Simulator:
Technical Upgrade to Support Incorporation of Lispro, Aspart, and Glargine Insulin Analogues. J. Electr.
Comput. Eng. 2011, 2011, 427196. [CrossRef]
13. Thestrup, J.; Gergely, T.; Beck, P. Exploring new care models in diabetes management and therapy with
a wireless mobile eHealth platform. In Wireless Mobile Communication and Healthcare: Second International
ICST Conference, MobiHealth 2011, Kos Island, Greece, 5–7 October 2011; Revised Selected Papers; Nikita, K.S.,
Lin, J.C., Fotiadis, D.I., Arredondo Waldmeyer, M.T., Eds.; Springer: Berlin/Heidelberg, Germany, 2012;
pp. 203–210.
14. Erzen, F.C.; Birol, G.; Cinar, A. Glucosim: A simulator for education on the dynamics of diabetes mellitus.
In Proceedings of the 23rd Annual International Conference of the IEEE Engineering in Medicine and Biology
Society, Istanbul, Turkey, 25–28 October 2001; Volume 4, pp. 3163–3166. [CrossRef]
15. Agar, B.U.; Eren, M.; Cinar, A. Glucosim: Educational software for virtual experiments with patients with type
1 diabetes. In Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference,
Shanghai, China, 17–18 January 2006; pp. 845–848. [CrossRef]
16. Mougiakakou, S.G.; Prountzou, K.; Nikita, K.S. A Real Time Simulation Model of Glucose-Insulin Metabolism
for Type 1 Diabetes Patients. In Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th
Annual Conference, Shanghai, China, 17–18 January 2005; pp. 298–301. [CrossRef]
17. Bergman, R.N.; Phillips, L.S.; Cobelli, C. Physiologic evaluation of factors controlling glucose tolerance in
man: Measurement of insulin sensitivity and beta-cell glucose sensitivity from the response to intravenous
glucose. J. Clin. Investig. 1981, 68, 1456–1467. [CrossRef] [PubMed]
18. Cobelli, C.; Caumo, A.; Omenetto, M.; Sacca, L. Minimal model SG overestimation and SI underestimation:
Improved accuracy by a Bayesian two-compartment model. Am. J. Physiol. 1999, 277, E481–E488.
19. Hovorka, R.; Shojaee-Moradie, F.; Carroll, P.V.; Chassin, L.J.; Gowrie, I.J.; Jackson, N.C.; Tudor, R.S.;
Umpleby, A.M.; Jones, R.H. Partitioning glucose distribution/trans-port, disposal, and endogenous production
during IVGTT. Am. J. Physiol. 2002, 282, E992–E1007.
20. Liu, C.; Vehí, J.; Avari, P.; Reddy, M.; Oliver, N.; Georgiou, P.; Herrero, P. Long-Term Glucose Forecasting
Using a Physiological Model and Deconvolution of the Continuous Glucose Monitoring Signal. Sensors 2019,
19, 4338. [CrossRef]
21. Liu, C.; Avari, P.; Leal, Y.; Wos, M.; Sivasithamparam, K.; Georgiou, P.; Reddy, M.; Fernández-Real, J.M.;
Martin, C.; Fernández-Balsells, M.; et al. A Modular Safety System for an Insulin Dose Recommender:
A Feasibility Study. J. Diabetes Sci. Technol. 2019, 14, 87–96. [CrossRef]
22. Sangeetha, S.; Sreepradha, C.; Sobana, S.; Bidisha, P.; Panda, R.C. Modeling and Control of the
Glucose-Insulin-Glucagon System in Type I Diabetes Mellitus. ChemBioEng Rev. 2020, 3, 89–100. [CrossRef]
23. Knopp, J.L.; Hardy, A.R.; Vergeer, S.; Chase, J.G. Modelling insulin adsorption in intravenous infusion sets in
the intensive care unit. IFAC J. Syst. Control 2019, 8, 100042. [CrossRef]
24. Visentin, R.; Schiavon, M.; Basu, R.; Basu, A.; Man, C.D.; Cobelli, C. Physiological models for artificial pancreas
development. In The Artificial Pancreas: Current Situation and Future Directions, 1st ed.; Sánchez-Peña, R.S.,
Cherñavvsky, D.R., Eds.; Academic Press: Cambridge, MA, USA, 2019; Volume 1, pp. 123–152.
25. Maas, A.H.; van der Molen, P.; van de Vijver, R.; Chen, W.; van Pul, C.; Cottaar, E.J.; van Riel, N.A.;
Hilbers, P.A.; Haak, H.R. Concept Development of the Eindhoven Diabetes Education Simulator Project.
Games Health J. 2016, 5, 120–127. [CrossRef]
26. Ristau, R.A.; Yang, J.; White, J.R. Evaluation and Evolution of Diabetes Mobile Applications: Key Factors for
Health Care Professionals Seeking to Guide Patients. Diabetes Spectr. 2013, 26, 211–215. Available online:
http://spectrum.diabetesjournals.org/content/26/4/211.full.pdf (accessed on 17 September 2018). [CrossRef]
27. El-Gayar, O.; Timsina, P.; Nawar, N.; Eid, W. Mobile Applications for Diabetes Self-Management: Status and
Potential. J. Diabetes Sci. Technol. 2013, 7, 247–262. [CrossRef] [PubMed]
28. Chomutare, T.; Fernandez-Luque, L.; Arsand, E.; Hartvigsen, G. Features of mobile diabetes applications:
Review of the literature and analysis of current applications compared against evidence-based guidelines.
J. Med Internet Res. 2011, 13, e65. [CrossRef] [PubMed]
29. Huckvale, K.; Adomaviciute, S.; Prieto, J.T.; Leow, M.K.S.; Car, J. Smartphone apps for calculating insulin
dose: A systematic assessment. BMC Med. 2015, 13, 106. [CrossRef] [PubMed]
30. Cui, M.; Wu, X.; Mao, J.; Wang, X.; Nie, M. T2DM Self-Management via Smartphone Applications: A
Systematic Review and Meta-Analysis. PLoS ONE 2016, 11, e0166718. [CrossRef] [PubMed]
31. Garcia, E.; Martin, C.; Garcia, A.; Harrison, R.; Flood, D. Systematic Analysis of Mobile Diabetes Management
Applications on Different Platforms. In Proceedings of the 7th Conference on Workgroup Human-Computer
Interaction and Usability Engineering of the Austrian Computer Society: Information Quality in e-Health,
Graz, Austria, 25–26 November 2011; Springer: Berlin/Heidelberg, Germany, 2011. USAB’11. pp. 379–396.
32. Baskaran, V.; Prescod, F.; Dong, L. A Smartphone-Based Cloud Computing Tool for Managing Type 1
Diabetes in Ontarians. Can. J. Diabetes 2015, 39, 200–203. [CrossRef]
33. Chomutare, T.; Tatara, N.; Arsand, E.; Hartvigsen, G. Designing a diabetes mobile application with social
network support. In Studies in Health Technology and Informatics; IOS Press: Clifton, NJ, USA, 2013.
34. Nguyen, H.D.; Jiang, X.; Poo, D.C.C. Designing a Social Mobile Platform for Diabetes Self-management:
A Theory-Driven Perspective. In Social Computing and Social Media: 7th International Conference, SCSM 2015,
Held as Part of HCI International 2015, Los Angeles, CA, USA, 2–7 August 2015, Proceedings; Meiselwitz, G., Ed.;
Springer International Publishing: Cham, Switzerland, 2015; pp. 67–77.
35. Garnweidner-Holme, L.M.; Borgen, I.; Garitano, I.; Noll, J.; Lukasse, M. Designing and Developing a Mobile
Smartphone Application for Women with Gestational Diabetes Mellitus Followed-Up at Diabetes Outpatient
Clinics in Norway. Healthcare 2015, 3, 310–323. [CrossRef]
36. Borgen, I.; Garnweidner-Holme, L.M.; Jacobsen, A.F.; Bjerkan, K.; Fayyad, S.; Joranger, P.; Lilleengen, A.M.;
Mosdøl, A.; Noll, J.; Småstuen, M.C.; et al. Smartphone application for women with gestational diabetes
mellitus: A study protocol for a multicentre randomised controlled trial. BMJ Open 2017, 7, e013117. Available
online: http://bmjopen.bmj.com/content/7/3/e013117.full.pdf (accessed on 12 December 2018). [CrossRef]
37. Lloyd, B.; Groat, D.; Cook, C.; Kaufman, D.; Grando, A. IDECIDE: A mobile application for insulin dosing
using an evidence based equation to account for patient preferences. In Studies in Health Technology and
Informatics; IOS Press: Clifton, NJ, USA, 2015; Volume 216, pp. 93–97. [CrossRef]
38. Keith-Hynes, P.; Mize, B.; Robert, A.; Place, J. The Diabetes Assistant: A Smartphone-Based System for
Real-Time Control of Blood Glucose. Electronics 2014, 3, 609–623. [CrossRef]
39. González, A.A.; Voos, H.; Darouach, M. Glucose-Insulin System Based on Minimal Model: A Realistic
Approach. In Proceedings of the 2015 17th UKSIM-AMSS International Conference on Modelling and
Simulation, Cambridge, UK, 25–27 March 2015; IEEE Computer Society: Washington, DC, USA, 2015.
UKSIM ’15. pp. 55–60. [CrossRef]
40. Sorensen, J.T. A Physiologic Model of Glucose Metabolism in Man and Its Use to Design and Assess Improved Insulin Therapies for Diabetes. Ph.D. Thesis, Massachusetts Institute of Technology, Department of Chemical Engineering, Cambridge, MA, USA, 1985.
41. Hernández-Ordoñez, M.; Campos-Delgado, D. An extension to the compartmental model of type 1 diabetic
patients to reproduce exercise periods with glycogen depletion and replenishment. J. Biomech. 2008, 41,
744–752. [CrossRef]
42. Berger, M.; Rodbard, D. Computer simulation of plasma insulin and glucose dynamics after subcutaneous
insulin injection. Diabetes Care 1989, 12, 725–736. [CrossRef] [PubMed]
43. Lehmann, E.; Deutsch, T. A physiological model of glucose-insulin interaction in type 1 diabetes mellitus.
J. Biomed. Eng. 1992, 14, 235–242. [CrossRef]
44. Wilinska, M.E.; Chassin, L.J.; Schaller, H.C.; Schaupp, L.; Pieber, T.R.; Hovorka, R. Insulin kinetics in type-1
diabetes: Continuous and bolus delivery of rapid acting insulin. IEEE Trans. Biomed. Eng. 2005, 52, 3–12.
[CrossRef] [PubMed]
45. Jahoda, P. MPAndroidChart. 2016. Available online: https://github.com/PhilJay/MPAndroidChart (accessed
on 15 November 2018).
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
MDPI
St. Alban-Anlage 66
4052 Basel
Switzerland
Tel. +41 61 683 77 34
Fax +41 61 302 89 18
www.mdpi.com