Abstract
Background: Efficient triage in emergency departments (EDs) is critical for timely and appropriate care. Traditional triage systems primarily rely on structured data, but the increasing availability of unstructured data, such as clinical notes, presents an opportunity to enhance predictive models for assessing emergency severity and to explore associations between patient characteristics and severity outcomes. This study aimed to evaluate the effectiveness of combining structured and unstructured data to predict emergency severity more accurately.

Methods: Data from the 2021 National Hospital Ambulatory Medical Care Survey (NHAMCS) for adult ED patients were used. Emergency severity was categorized into urgent (scores 1–3) and non-urgent (scores 4–5) based on the Emergency Severity Index. Unstructured data, including chief complaints and reasons for visit, were processed using a Bidirectional Encoder Representations from Transformers (BERT) model. Structured data included patient demographics and clinical information. Four machine learning models (Logistic Regression, Random Forest, Gradient Boosting, and Extreme Gradient Boosting) were applied to three data configurations: structured data only, unstructured data only, and combined data. A mean probability model was also created by averaging the predicted probabilities from the structured and unstructured models.

Results: The study included 8,716 adult patients, of whom 74.6% were classified as urgent. Association analysis revealed significant predictors of emergency severity, including older age (OR = 2.13 for patients 65+), higher heart rate (OR = 1.56 for heart rates > 90 bpm), and specific chronic conditions such as chronic kidney disease (OR = 2.28) and coronary artery disease (OR = 2.55). Gradient Boosting with combined data demonstrated the highest performance, achieving an area under the curve (AUC) of 0.789, an accuracy of 0.726, and a precision of 0.892. The mean probability model also showed improvements over structured-only models.

Conclusions: Combining structured and unstructured data improved the prediction of emergency severity in ED patients, highlighting the potential for enhanced triage systems. Integrating text data into predictive models can
provide more accurate and nuanced severity assessments, improving resource allocation and patient outcomes. Further research should focus on real-time application and validation in diverse clinical settings.

*Correspondence: Xingyu Zhang, [email protected]; Wenbin Zhang, [email protected]. Full list of author information is available at the end of the article.

Zhang et al. BMC Medical Informatics and Decision Making (2024) 24:372
Keywords Emergency department, Predictive modeling, Association study, Unstructured data, Natural language
processing, Clinical decision support
indicates non-urgent cases that can safely wait for care. These scores are typically assigned by a triage nurse during the initial evaluation, using a combination of objective measures (such as vital signs) and clinical judgment [19, 20]. ESI scores directly inform decisions about the urgency of treatment. For example, patients with a score of 1 need immediate intervention to prevent death, while those with a score of 2 are high-risk and must be seen quickly to avoid deterioration. Patients with a score of 3, though stable, still require timely care but can wait longer than those with more urgent scores. Meanwhile, scores of 4 and 5 represent minor conditions that can safely wait for extended periods [21, 22]. While the ESI is an ordinal scale, the differences in urgency between consecutive scores are not evenly spaced. The difference in urgency between a score of 1 and 2 is much greater than between scores 3 and 4. For this study, we grouped ESI scores into two categories: urgent (scores 1–3) and non-urgent (scores 4–5). This binary categorization reflects common clinical practice, where the main concern is whether a patient requires urgent intervention. Although this approach reduces some granularity, it aligns with the critical decision-making process in EDs, prioritizing the need for urgent care [23].

Structured predictors
The structured data extracted from the dataset included a variety of variables related to patient demographics, visit characteristics, and clinical information. Specifically, the structured data encompassed patient demographics such as age, sex, and race/ethnicity. Visit characteristics included arrival time, mode of arrival, day of the week, and whether the patient arrived by ambulance. Clinical information comprised vital signs (temperature, heart rate, diastolic blood pressure, systolic blood pressure, pulse oximetry, respiratory rate), pain level, and medical history (conditions such as Alzheimer's disease/dementia, asthma, cancer, cerebrovascular disease, chronic kidney disease, chronic obstructive pulmonary disease, congestive heart failure, coronary artery disease, depression, diabetes mellitus types I and II, end-stage renal disease, pulmonary embolism, HIV infection/AIDS, hyperlipidemia, hypertension, obesity, obstructive sleep apnea, osteoporosis, and substance abuse or dependence). Additional factors considered were the type of residence (private residence, nursing home, homeless, or other), insurance type, whether the visit was a follow-up or within the last 72 h, and the nature of any injury or trauma, overdose/poisoning, or adverse effect of medical/surgical treatment. Missing values in the structured data were handled using median imputation, and the data were standardized using StandardScaler [24].

Unstructured data and BERT model
Unstructured data consisted of the chief complaints and reasons for the injury presented at the ED visits. To ensure the quality and consistency of the input data, a structured text cleaning process was applied. This involved converting all text to lowercase for uniformity, removing punctuation and numbers, and filtering out common stopwords (e.g., "and," "the") that do not contribute meaningfully to clinical interpretation. These steps ensured that the text data retained only relevant clinical information.

These cleaned text fields were tokenized using the BERT tokenizer from the HuggingFace library [25], preparing the text data for input into a BERT-based model. The BERT model represents a significant advancement in natural language processing by enabling deep bidirectional understanding of text [26, 27]. Unlike traditional models that read text either left-to-right or right-to-left, BERT processes text in both directions simultaneously, allowing it to understand the context of a word based on all surrounding words. This bidirectional approach enables BERT to capture the nuanced meanings of words and phrases in their specific contexts. BERT's architecture is based on transformers, a type of deep learning model that relies on self-attention mechanisms to weigh the importance of different words in a sentence. This allows BERT to excel at tasks that require understanding the relationships between words and the overall meaning of sentences. Pre-trained on a vast corpus of text data, including books and Wikipedia articles, BERT can be fine-tuned on specific tasks such as classification, question answering, and named entity recognition.

To prepare the unstructured text data for analysis, we used the BERT tokenizer. This process converts the clinical text into a structured format that BERT can interpret, ensuring that important contextual information is preserved. The tokenizer breaks down sentences into smaller units, allowing BERT to understand the relationships between words in a given sentence. Following tokenization, the text was passed through the BERT model to generate numerical embeddings, dense vectors that represent the semantic meaning of the text. These embeddings capture the context and meaning of the text, allowing the model to utilize the full depth of clinical narratives. The embeddings were then combined with the structured data, integrating both textual and numerical information to enhance the predictive capability of the model.

Predictive model development
For this study, four different machine learning models were applied: Logistic Regression (LR), Random Forest
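As an illustrative sketch of the text-processing pipeline described above (not the authors' released code): the cleaning steps are spelled out in the text, but the stopword list beyond "and"/"the", the `bert-base-uncased` checkpoint, and mean pooling over token embeddings are all assumptions made here for concreteness.

```python
import re
import string

# Hypothetical stopword list: the paper names only "and" and "the" as examples.
STOPWORDS = {"and", "the", "a", "an", "of", "in", "on", "for", "with", "to"}

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and numbers, and drop stopwords,
    mirroring the cleaning steps described in the text."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def embed(texts):
    """Return one dense vector per note via BERT.
    The bert-base-uncased checkpoint and mean pooling are assumptions;
    the paper specifies only 'the BERT tokenizer from the HuggingFace
    library' [25] followed by the BERT model."""
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()
    enc = tokenizer([clean_text(t) for t in texts],
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state   # (batch, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1)    # zero out padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled (batch, 768)
```

The resulting 768-dimensional vectors can then be concatenated column-wise with the standardized structured features to form the combined input.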
(RFM) [28], Gradient Boosting (GB) [29], and Extreme Gradient Boosting (XGB) [30]. We implemented four machine learning models using Python's scikit-learn and xgboost libraries to evaluate the predictive performance of structured data in classifying emergency severity. For each model, key parameters were configured, while all other parameters were set to their default values. The LogisticRegression function from sklearn.linear_model was set with a maximum iteration limit of 1000 (max_iter = 1000) to ensure convergence. The RandomForestClassifier function from sklearn.ensemble was employed with 500 estimators (n_estimators = 500), balancing accuracy and computational efficiency. The GradientBoostingClassifier, also from sklearn.ensemble, was applied with default settings, allowing the model to iteratively adjust for errors made by prior trees, thereby focusing subsequent trees on misclassified instances to improve predictive precision. For XGBoost, we utilized XGBClassifier from the xgboost library, configuring it with logloss as the evaluation metric to prioritize probability calibration and classification accuracy. Each of these models was trained and evaluated using four distinct approaches: structured data, unstructured data, combined data, and a mean probability model.

The first approach used only structured data, including patient demographics, clinical information, and visit characteristics from the NHAMCS-ED dataset. Logistic Regression, Random Forest, Gradient Boosting, and XGBoost models were trained on this data. The second approach focused solely on unstructured data, processed using the BERT model to generate feature vectors. These vectors were used as input for the same machine learning models. The third strategy combined both structured and unstructured data, merging quantitative information with BERT-extracted features to provide a comprehensive input for the models. The final method employed a mean probability model, which averaged the predicted probabilities from the structured and unstructured models. This technique combined the strengths of both data types without retraining. All approaches were evaluated using fivefold cross-validation.

Evaluation metrics
The evaluation of all models involved calculating ROC AUC, accuracy, F1 score, precision, recall, sensitivity, and specificity. The models' predictive probabilities and true labels were recorded, and ROC curves were plotted to visualize the performance of each model. The cutoff points for classification were determined by finding the thresholds closest to the top-left corner of the ROC curve [31, 32]. The ROC AUC quantifies the model's ability to differentiate between these two categories, with higher values (closer to 1.0) indicating better discrimination across various threshold values. Accuracy reflects the proportion of correct classifications (both urgent and non-urgent) out of the total predictions; however, its utility may be limited when class distribution is imbalanced. Precision measures the proportion of true positives (correctly classified urgent cases) among all instances predicted as urgent, making it particularly useful when minimizing false positives is important. Sensitivity (recall), on the other hand, evaluates the model's ability to correctly identify all urgent cases, which is crucial in emergency department settings where missing urgent cases could have serious consequences. Specificity assesses the model's ability to correctly classify non-urgent cases, thereby avoiding over-triage, where non-urgent patients are incorrectly labeled as urgent. Finally, the F1 score, which is the harmonic mean of precision and recall, offers a balanced evaluation of the model's handling of both false positives and false negatives, especially valuable in scenarios with uneven class distributions. ROC curves were plotted for each model to compare their performance. Additionally, visualizations such as forest plots of odds ratios and word clouds of unstructured variables were generated to illustrate the significance and frequency of different variables in the dataset.

Results
Among the 8,716 patients included in the study, 25.4% were categorized as non-urgent or semi-urgent, while 74.6% were classified as urgent, emergent, or immediate. Table 1 and Supplement Table 1 present the baseline characteristics of U.S. patients presenting to the ED, stratified by emergency severity score. Significant differences were observed between the two groups in terms of gender, with a higher proportion of females in the urgent category (55.1%) compared to the non-urgent group (52.0%, p = 0.0096). Age also varied significantly, with older patients more likely to be in the urgent category (p < 0.0001). Specifically, 27.5% of patients aged 65 and above were in the urgent group, compared to 15.7% in the non-urgent group. Race/ethnicity did not show significant differences between groups (p = 0.0603). However, differences were noted in residence type (p < 0.0001), with a higher percentage of urgent patients residing in nursing homes (2.8% vs. 1.0%) and a greater proportion of non-urgent patients living in private residences (95.8% vs. 94.2%). Insurance type also showed significant differences (p < 0.0001), with a higher percentage of urgent patients covered by Medicare (29.7% vs. 19.0%) and a higher percentage of non-urgent patients being uninsured (10.4% vs. 8.0%). Arrival by ambulance was significantly more common in the urgent group (25.4% vs. 8.2%, p < 0.0001). Follow-up visits were slightly more frequent
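The model configuration and mean probability approach described in the Methods might be written out as follows. This is a sketch, not the study's code: the stated parameters (max_iter = 1000, n_estimators = 500, logloss) come from the text, while the variable names and the use of `cross_val_predict` to obtain out-of-fold probabilities under fivefold cross-validation are assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def make_models():
    """The four classifiers with the parameters stated in the text;
    all other parameters remain at their library defaults."""
    from xgboost import XGBClassifier  # imported lazily; xgboost library [30]
    return {
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=500),
        "GB": GradientBoostingClassifier(),
        "XGB": XGBClassifier(eval_metric="logloss"),
    }

def mean_probability(model, X_structured, X_text, y):
    """Mean probability approach: average the out-of-fold predicted
    probabilities of a structured-only and a text-only model, with no
    retraining on the combined features (fivefold CV as in the text)."""
    p_struct = cross_val_predict(model, X_structured, y, cv=5,
                                 method="predict_proba")[:, 1]
    p_text = cross_val_predict(model, X_text, y, cv=5,
                               method="predict_proba")[:, 1]
    return (p_struct + p_text) / 2
```

The averaged probabilities can then be thresholded at the cutoff closest to the top-left corner of the ROC curve, as described under Evaluation metrics.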
[Table 1 Baseline characteristics of U.S. patients presenting to the ED, stratified by Emergency Severity Score, NHAMCS 2021. Columns: Non-urgent or Semi-urgent; Urgent, Emergent or Immediate; p value.]
in the non-urgent group (8.9% vs. 7.4%, p = 0.0243). Pain levels, temperature, heart rate, diastolic blood pressure, systolic blood pressure, pulse oximetry, and respiratory rate all showed significant differences between the two groups. In terms of medical history, conditions such as cancer, cerebrovascular disease, chronic kidney disease, chronic obstructive pulmonary disease, congestive heart failure, coronary artery disease, diabetes mellitus type II, end-stage renal disease, pulmonary embolism, hyperlipidemia, hypertension, obesity, obstructive sleep apnea, osteoporosis, and substance abuse were more common in the urgent group.

Figures 1a and 1b display forest plots of odds ratios with 95% confidence intervals for the various structured variables used in the study. These figures illustrate the significant predictors of emergency severity, highlighting the relative importance of different factors. In Fig. 1a, demographic and visit characteristics are detailed. Female patients had higher odds of being classified as urgent (OR = 1.15, 95% CI: 1.05–1.26). Age was a significant predictor, with patients aged 40–65 having higher odds of urgency (OR = 1.32, 95% CI: 1.21–1.44) compared to those aged 18–39. Patients aged 65 and above had even higher odds (OR = 2.13, 95% CI: 1.88–2.41). Arrival by ambulance markedly increased the odds of being urgent (OR = 3.65, 95% CI: 3.05–4.36). Medicare coverage was associated with higher odds of urgency (OR = 1.79, 95% CI: 1.58–2.02), while being uninsured was associated with lower odds (OR = 0.75, 95% CI: 0.61–0.91). Patients from nursing homes had higher odds of being classified as urgent (OR = 2.80, 95% CI: 1.73–4.54). Figure 1b focuses on clinical information and medical history. Heart rate was a significant predictor, with patients having heart rates over 90 bpm showing higher odds of being urgent (OR = 1.56, 95% CI: 1.42–1.72). Blood pressure was also significant; diastolic blood pressure less than 60 mm Hg was associated with higher odds of urgency (OR = 1.53, 95% CI: 1.20–1.95), and DBP greater than 80 mm Hg showed increased odds (OR = 1.29, 95% CI: 1.17–1.42). Systolic blood pressure greater than 120 mm Hg was associated with higher urgency (OR = 1.10, 95% CI: 1.00–1.22). Several medical conditions significantly increased the odds of being classified as urgent, including cancer (OR = 2.54, 95% CI: 1.91–3.37), chronic kidney disease (OR = 2.28, 95% CI: 1.71–3.03), chronic obstructive pulmonary disease (OR = 1.78, 95% CI: 1.46–2.17), congestive heart failure (OR = 2.45, 95% CI: 1.88–3.18), coronary artery disease (OR = 2.55, 95% CI: 2.03–3.20), end-stage renal disease (OR = 3.37, 95% CI: 1.88–6.05), diabetes mellitus type II (OR = 1.90, 95% CI: 1.57–2.30), hyperlipidemia (OR = 1.63, 95% CI: 1.40–1.90), hypertension (OR = 1.73, 95% CI: 1.55–1.93), and obesity (OR = 1.25, 95% CI: 1.08–1.44).

Figure 2 presents the frequency and word cloud of the words in the unstructured variables, providing a visual representation of the most common terms found in the chief complaints and reasons for the injury presented at the ED visits. Table 2 and Fig. 3 summarize the performance metrics for the different models. The results demonstrate that integrating structured and unstructured data leads to improved model performance across all classifiers. Logistic Regression showed significant improvements when combining both data types, achieving an AUC of 0.784, an accuracy of 0.717, and a high precision of 0.894. Random Forest and Gradient Boosting models similarly benefited from the combination, with Random Forest achieving an AUC of 0.766 and Gradient Boosting reaching 0.789. In particular, Gradient Boosting demonstrated strong predictive capabilities with a precision of 0.892 and an F1 score of 0.797. Extreme Gradient Boosting, although slightly weaker with structured data alone, showed notable gains when unstructured data was included, with a combined AUC of 0.779 and a precision of 0.886.

Discussion
Our study demonstrated that combining structured and unstructured data significantly improved the prediction of emergency severity in an ED setting. By integrating clinical narratives with traditional patient demographics, vital signs, and medical history, we were able to capture a more comprehensive representation of the patient's condition. The results showed that models incorporating both data types outperformed those relying solely on structured or unstructured data. This finding highlights the potential of leveraging advanced NLP techniques, such as BERT, in conjunction with structured clinical data to enhance decision-making in emergency care. While the BERT model effectively captured the contextual nuances in clinical notes, the combined approach proved most robust, supporting the idea that integrating diverse data sources can yield more accurate and actionable predictions in complex medical environments like the ED.

Association analysis and clinical implications
While the association analysis identified several statistically significant predictors of emergency severity, it is
Fig. 2 Frequency and word cloud of the words in the unstructured variables
Table 2 Performance metrics for the different models

Input                                  Cutoff  AUC    Accuracy  Precision  Sensitivity  Specificity  F1 score
Logistic Regression
  Structured Data                      0.744   0.704  0.635     0.847      0.623        0.668        0.718
  Unstructured Data                    0.783   0.760  0.696     0.882      0.685        0.730        0.771
  Combined (Structured + Unstructured) 0.779   0.784  0.717     0.894      0.705        0.753        0.788
  Mean (Structured + Unstructured)     0.735   0.787  0.732     0.888      0.733        0.728        0.803
Random Forest
  Structured Data                      0.726   0.701  0.652     0.845      0.654        0.647        0.737
  Unstructured Data                    0.715   0.749  0.696     0.868      0.698        0.687        0.774
  Combined (Structured + Unstructured) 0.720   0.766  0.712     0.875      0.716        0.698        0.787
  Mean (Structured + Unstructured)     0.718   0.779  0.723     0.882      0.726        0.716        0.797
Gradient Boosting
  Structured Data                      0.737   0.711  0.651     0.845      0.652        0.649        0.736
  Unstructured Data                    0.747   0.763  0.714     0.876      0.719        0.700        0.790
  Combined (Structured + Unstructured) 0.759   0.789  0.726     0.892      0.719        0.745        0.797
  Mean (Structured + Unstructured)     0.740   0.789  0.729     0.889      0.727        0.734        0.800
Extreme Gradient Boosting
  Structured Data                      0.803   0.678  0.623     0.836      0.615        0.645        0.709
  Unstructured Data                    0.897   0.734  0.682     0.866      0.679        0.691        0.761
  Combined (Structured + Unstructured) 0.893   0.779  0.714     0.886      0.708        0.733        0.787
  Mean (Structured + Unstructured)     0.782   0.759  0.699     0.870      0.702        0.691        0.777

Cutoff = classification threshold selected as the point on the ROC curve closest to the top-left corner
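As a consistency check on Table 2 (not part of the original analysis), the F1 column can be reproduced from the precision and sensitivity columns, since F1 is defined as their harmonic mean; the reported values agree to within rounding.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

# (precision, sensitivity, reported F1) for the combined-input rows of Table 2.
combined_rows = {
    "Logistic Regression":       (0.894, 0.705, 0.788),
    "Random Forest":             (0.875, 0.716, 0.787),
    "Gradient Boosting":         (0.892, 0.719, 0.797),
    "Extreme Gradient Boosting": (0.886, 0.708, 0.787),
}
for name, (p, r, reported) in combined_rows.items():
    # Each recomputed F1 matches the table to within rounding error.
    assert abs(f1(p, r) - reported) < 0.002, name
```

That the identity holds only when the fifth numeric column is treated as sensitivity (recall) also supports that column assignment.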
crucial to differentiate between statistical significance and clinical relevance. Chronic conditions such as coronary artery disease, chronic kidney disease, and chronic obstructive pulmonary disease were significant predictors in our model. However, the practical relevance of these findings for real-time decision-making in ED settings should be critically examined. For instance, while the presence of chronic conditions may inform long-term risk stratification, their immediate impact on triage decisions may be limited unless the condition is actively contributing to the acute presentation. Thus, although these conditions were associated with increased urgency, further research is needed to explore their practical role in ED triage processes.

Fig. 3 Receiver operating characteristic (ROC) curves for the four models evaluated in the study. The approaches compared include the structured data model, which uses only structured data such as patient demographics, visit characteristics, vital signs, and medical history; the unstructured data model, a BERT-based natural language processing (NLP) model that uses only unstructured data, including chief complaints and reasons for injury; the combined input model, a machine learning classification model that integrates both structured data and BERT-extracted features from the unstructured data; and the mean probability model, which averages the predicted probabilities from the structured data model and the unstructured data model

In addition, older age and higher heart rate emerged as significant predictors, aligning with clinical expectations that elderly patients and those with abnormal vital signs require urgent attention. However, it is important to interpret these findings with caution, particularly in the context of retrospective analysis. While our model can identify factors associated with higher acuity, it does not substitute for clinical judgment, which remains critical in real-time decision-making. The inclusion of unstructured data, particularly chief complaints, offers a way to incorporate nuanced patient information that is often missing in structured data, thus improving the predictive accuracy of the models.

A notable finding was the association between insurance status and emergency severity. Patients covered by Medicare had higher odds of being classified as urgent, while uninsured patients were less likely to be classified as urgent. This result raises important questions regarding access to care and its influence on triage outcomes. One potential explanation is that uninsured patients may delay seeking care due to financial concerns, leading to underrepresentation in our dataset or potentially presenting with less acute conditions. Alternatively, these findings may reflect broader disparities in healthcare access and utilization, where insurance status influences not only access to primary care but also ED triage
decisions [33–35]. The association between Medicare coverage and higher urgency might reflect the higher baseline health risks of the elderly population, who are more likely to suffer from multiple comorbidities. Further exploration of how insurance status interacts with other social determinants of health, such as socioeconomic status and healthcare access, is warranted. Future studies should aim to validate these findings and examine whether controlling for other factors, such as pre-existing health conditions, changes the relationship between insurance status and triage classification. Moreover, this finding highlights the need for ED policies that address potential biases in triage based on insurance status and other social determinants of health.

Model performance and clinical application
Our results demonstrated that integrating structured and unstructured data improves the performance of predictive models, particularly in complex cases where traditional triage systems may fall short. The Gradient Boosting and Extreme Gradient Boosting models achieved the highest performance, with AUCs of 0.789 and 0.779, respectively, when both data types were combined. The strong performance of these models underscores the value of using machine learning techniques that can account for non-linear interactions and complex relationships between variables, which are often present in clinical data.

Our findings can be compared to those of Brouns et al. (2019) [36] and Veldhuis et al. (2022) [37]. Brouns et al. evaluated the Manchester Triage System in older emergency department patients, reporting an AUC of 0.74 for predicting hospital admissions, a result similar to the AUCs achieved by our combined data models. However, their study noted that MTS had a lower AUC of 0.71 for predicting in-hospital mortality, highlighting the limitations of relying solely on structured triage systems in medically complex populations. Our results demonstrate that combining structured and unstructured data can address some of these limitations by improving predictive accuracy, particularly in more complex cases. Similarly, Veldhuis et al. compared clinical judgment to early warning scores and found that clinical judgment outperformed risk stratification models, with AUCs between 0.70 and 0.89, especially for ICU admissions and severe adverse events [37]. While our models performed similarly, this emphasizes the need to integrate machine learning with clinical judgment. Our models, combining structured and unstructured data, outperformed single-source models, aligning with Veldhuis et al.'s suggestion that clinical tools combined with automated systems yield the best results.

In comparison with traditional triage systems, such as the Manchester Triage System [36], our models show promise in enhancing predictive accuracy by leveraging a broader range of patient data, particularly unstructured clinical narratives. However, it is essential to emphasize that clinical judgment remains a critical component of ED decision-making. Predictive models, while valuable, should complement, not replace, the expertise of healthcare providers, who are best equipped to make nuanced decisions in real-time clinical settings.

Limitations and future directions
There are several limitations to this study. First, the study is retrospective and relies on the accuracy and completeness of the NHAMCS-ED dataset. Any missing or inaccurately recorded data could impact the model's performance. Although the proportion of missing data was relatively low (< 10%), the method of imputation (median) might not capture the true underlying values in all cases, and different imputation techniques could lead to slightly different results. Second, the study focuses on data from a single year (2021), which may limit the generalizability of the findings to other years or different hospital settings. Emergency presentations can vary over time due to factors such as seasonal changes, pandemics, or other public health events. Future studies should validate these findings with data from multiple years and diverse clinical environments to ensure the robustness and applicability of the models across varying contexts. Third, while BERT proved highly effective in processing unstructured clinical text, it is computationally intensive compared to simpler models such as TF-IDF or logistic regression. The complexity and resource demands of BERT may limit its use in real-time ED settings, particularly in resource-constrained environments. For real-time applications, it may be beneficial to explore lighter models like DistilBERT [38] or other simplified NLP approaches that balance computational efficiency with performance. Additionally, another important limitation involves the potential bias inherent in machine learning models [39]. Bias can emerge from the data used to train the model, particularly if the dataset reflects existing disparities in healthcare access, treatment, or outcomes. For instance, the underrepresentation of uninsured patients in the dataset may skew the model's ability to predict outcomes for this group, potentially reinforcing inequities in healthcare delivery. Furthermore, models trained on past data may perpetuate historical biases in clinical decision-making, such as differences in treatment recommendations based on race, gender, or insurance status. Addressing this issue will require careful evaluation of the model's performance across diverse patient populations and the implementation of fairness-enhancing
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s12911-024-02793-9.

Supplementary Material 1.

References
1. Morley C, Unwin M, Peterson GM, Stankovich J, Kinsman L. Emergency department crowding: a systematic review of causes, consequences and solutions. PLoS ONE. 2018;13(8):e0203316.
2. Mostafa R, El-Atawi K. Strategies to measure and improve emergency department performance: a review. Cureus. 2024;16(1):e52879.
3. Ahsan KB, Alam M, Morel DG, Karim M. Emergency department resource optimisation for improved performance: a review. J Indust Eng Int. 2019;15(Suppl 1):253–66.
4. Yancey CC, O'Rourke MC. Emergency department triage. 2020.
5. Christ M, Grossmann F, Winter D, Bingisser R, Platz E. Modern triage in the emergency department. Dtsch Arztebl Int. 2010;107(50):892–8.
6. Wuerz RC, Milne LW, Eitel DR, Travers D, Gilboy N. Reliability and validity of a new five-level triage instrument. Acad Emerg Med. 2000;7(3):236–42.
7. Chiu CC, Wu CM, Chien TN, Kao LJ, Li C, Chu CM. Integrating structured and unstructured EHR data for predicting mortality by machine learning and latent Dirichlet allocation method. Int J Environ Res Public Health. 2023;20(5):4340.
8. Zhang X, Bellolio MF, Medrano-Gracia P, Werys K, Yang S, Mahajan P. Use of natural language processing to improve predictive models for imaging utilization in children presenting to the emergency department. BMC Med Inform Decis Mak. 2019;19(1):287.
9. Zhang X, Kim J, Patzer RE, Pitts SR, Patzer A, Schrager JD. Prediction of emergency department hospital admission based on natural language processing and neural networks. Methods Inf Med. 2017;56(05):377–89.
10. Rasmy L, Xiang Y, Xie Z, Tao C, Zhi D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021;4(1):86.
26. Deepa MD. Bidirectional encoder representations from transformers (BERT) language model for sentiment analysis task. Turkish J Comput Math Educ. 2021;12(7):1708–21.
27. Alaparthi S, Mishra M. Bidirectional encoder representations from transformers (BERT): a sentiment analysis odyssey. arXiv preprint arXiv:2007.01127. 2020.
28. Parmar A, Katariya R, Patel V. A review on random forest: an ensemble classifier. In: International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. Springer; 2019. p. 758–63.
29. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. 2013;7:21.
30. Chen T. Xgboost: extreme gradient boosting. R package version 0.4-2. 2015;1(4).
31. Zhou J, Gandomi AH, Chen F, Holzinger A. Evaluating the quality of machine learning explanations: a survey on methods and metrics. Electronics. 2021;10(5):593.
32. Naidu G, Zuva T, Sibanda EM. A review of evaluation metrics in machine learning algorithms. In: Computer Science On-line Conference. Springer; 2023. p. 15–25.
33. Zhang X, Carabello M, Hill T, Bell SA, Stephenson R, Mahajan P. Trends of racial/ethnic differences in emergency department care outcomes
11. Tang R, Yao H, Zhu Z, Sun X, Hu G, Li Y, Xie G: Embedding Electronic among adults in the United States from 2005 to 2016. Front Med.
Health Records to Learn BERT-based Models for Diagnostic Decision Sup- 2020;7:300.
port. In: 2021 IEEE 9th International Conference on Healthcare Informatics 34. Myran D, Hsu A, Kunkel E, Rhodes E, Imsirovic H, Tanuseputro P. Socioeco-
(ICHI): 9–12 Aug. 2021 2021; 2021: 311–319. nomic and geographic disparities in emergency department visits due
12. Lu H, Ehwerhemuepha L, Rakovski C. A comparative study on deep to alcohol in Ontario: a retrospective population-level study from 2003 to
learning models for text classification of unstructured medical notes with 2017. Can J Psychiatry. 2022;67(7):534–43.
various levels of class imbalance. BMC Med Res Methodol. 2022;22(1):181. 35. Pierce A, Marquita Norman M, Rendon J, Rucker D, Velez L, Powers R:
13. Turchin A, Masharsky S, Zitnik M. Comparison of BERT implementations Health Disparities in the Emergency Department. Emerg Med Rep
for natural language processing of narrative medical documents. Inform 2021;42(20).
Med Unlocked. 2023;36:101139. 36. Brouns SH, Mignot-Evers L, Derkx F, Lambooij SL, Dieleman JP, Haak
14. Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Rep- HR. Performance of the Manchester triage system in older emergency
resentation to Predict the Future of Patients from the Electronic Health department patients: a retrospective cohort study. BMC Emerg Med.
Records. Sci Rep. 2016;6(1):26094. 2019;19:1–11.
15. Suresh H, Hunt N, Johnson AEW, Celi LA, Szolovits P, Ghassemi M: Clinical 37. Veldhuis LI, Ridderikhof ML, Bergsma L, Van Etten-Jamaludin F, Nanay-
Intervention Prediction and Understanding using Deep Networks. ArXiv akkara PW, Hollmann M. Performance of early warning and risk stratifica-
2017, abs/1705.08498. tion scores versus clinical judgement in the acute setting: a systematic
16. Su D, Li Q, Zhang T, Veliz P, Chen Y, He K, Mahajan P, Zhang X. Prediction of review. Emerg Med J. 2022;39(12):918–23.
acute appendicitis among patients with undifferentiated abdominal pain 38. Adoma AF, Henry N-M, Chen W: Comparative analyses of bert, roberta,
at emergency department. BMC Med Res Methodol. 2022;22(1):18. distilbert, and xlnet for text-based emotion recognition. In: 2020 17th
17. Stewart J, Lu J, Goudie A, Arendts G, Meka SA, Freeman S, Walker K, Spri- International Computer Conference on Wavelet Active Media Technology
vulis P, Sanfilippo F, Bennamoun M, et al. Applications of natural language and Information Processing (ICCWAMTIP): 2020: IEEE; 2020: 117–121.
processing at emergency department triage: A narrative review. PLoS 39. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on
ONE. 2023;18(12):e0279953. bias and fairness in machine learning. ACM computing surveys (CSUR).
18. Cairns C, Kang K: National hospital ambulatory medical care survey: 2019 2021;54(6):1–35.
emergency department summary tables. 2022. 40. Zhang X, Bellolio MF, Medrano-Gracia P, Werys K, Yang S, Mahajan P. Use
19. Eitel DR, Travers DA, Rosenau AM, Gilboy N, Wuerz RC. The emergency of natural language processing to improve predictive models for imaging
severity index triage algorithm version 2 is reliable and valid. Acad Emerg utilization in children presenting to the emergency department. BMC
Med. 2003;10(10):1070–80. Med Inform Decis Mak. 2019;19:1–13.
20. Green NA, Durani Y, Brecher D, DePiero A, Loiselle J, Attia M. Emergency 41. Chan SL, Lee JW, Ong MEH, Siddiqui FJ, Graves N, Ho AFW, Liu N. Imple-
Severity Index version 4: a valid and reliable tool in pediatric emergency mentation of prediction models in the emergency department from an
department triage. Pediatr Emerg Care. 2012;28(8):753–7. implementation science perspective—determinants, outcomes, and
21. Tanabe P, Gimbel R, Yarnold PR, Adams JG: The Emergency Severity Index real-world impact: a scoping review. Ann Emerg Med. 2023;82(1):22–36.
(version 3) 5-level triage system scores predict ED resource consumption.
J Emerg Nurs 2004;30(1):22–29.
22. Hinson JS, Martinez DA, Schmitz PS, Toerper M, Radu D, Scheulen J, Publisher’s Note
Stewart de Ramirez SA, Levin S: Accuracy of emergency department Springer Nature remains neutral with regard to jurisdictional claims in pub-
triage using the Emergency Severity Index and independent predictors lished maps and institutional affiliations.
of under-triage and over-triage in Brazil: a retrospective cohort analysis.
International journal of emergency medicine 2018, 11:1-10.
23. Alnasser S, Alharbi M, AAlibrahim A, Aal Ibrahim A, Kentab O, Alassaf
W, Aljahany M. Analysis of Emergency Department Use by Non-Urgent
Patients and Their Visit Characteristics at an Academic Center. Int J Gen
Med. 2023;16:221–32.
24. Zollanvari A: Supervised Learning in Practice: the First Application Using
Scikit-Learn. In: Machine Learning with Python: Theory and Implementa-
tion. edn.: Springer; 2023: 111–131.
25. Jain SM: Hugging face. In: Introduction to transformers for NLP: With the
hugging face library and models to solve problems. edn.: Springer; 2022:
51–67.