
Yao et al. BMC Medical Informatics and Decision Making 2019, 19(Suppl 3):71
https://doi.org/10.1186/s12911-019-0781-4

RESEARCH Open Access

Clinical text classification with rule-based features and knowledge-guided convolutional neural networks

Liang Yao1, Chengsheng Mao1 and Yuan Luo2*
From The Sixth IEEE International Conference on Healthcare Informatics (ICHI 2018)
New York, NY, USA. 4–7 June 2018

Abstract
Background: Clinical text classification is a fundamental problem in medical natural language processing. Existing studies have conventionally focused on rules or knowledge sources-based feature engineering, but only a limited number of studies have exploited the effective representation learning capability of deep learning methods.
Methods: In this study, we propose a new approach which combines rule-based features and knowledge-guided deep learning models for effective disease classification. Critical steps of our method include recognizing trigger phrases, predicting classes with very few examples using trigger phrases and training a convolutional neural network (CNN) with word embeddings and Unified Medical Language System (UMLS) entity embeddings.
Results: We evaluated our method on the 2008 Informatics for Integrating Biology and the Bedside (i2b2) obesity challenge. The results demonstrate that our method outperforms the state-of-the-art methods.
Conclusion: We showed that the CNN model is powerful for learning effective hidden features, and that CUI embeddings are helpful for building clinical text representations. This shows that integrating domain knowledge into CNN models is promising.
Keywords: Clinical text classification, Obesity challenge, Convolutional neural networks, Word embeddings, Entity
embeddings

*Correspondence: [email protected]
2 Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago 60611, IL, USA
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Introduction
Clinical records are an important type of electronic health record (EHR) data and often contain detailed and valuable patient information and clinical experiences of doctors. As a basic task of natural language processing, text classification plays a critical role in clinical records retrieval and organization; it can also support clinical decision making and cohort identification [1, 2].
Existing clinical text classification studies often use different forms of knowledge sources or rules for feature engineering [3–7]. But most of the studies could not learn effective features automatically, while deep learning methods have recently shown powerful feature learning capability in the general domain [8].
In this study, we propose a new method which combines rule-based feature engineering and knowledge-guided deep learning techniques for disease classification. We first identify trigger phrases using rules, then use these trigger phrases to predict classes with very few examples, and finally train a convolutional neural network (CNN) on the trigger phrases with word embeddings and Unified Medical Language System (UMLS) [9] Concept Unique Identifiers (CUIs) with entity embeddings. We evaluated our method on the 2008 Informatics for Integrating Biology and the Bedside (i2b2) obesity challenge [10], a multilabel classification task focused on obesity and its 15 most common comorbidities (diseases). The
experimental results show that our method outperforms state-of-the-art methods for the challenge.

Related Work
Clinical text classification
A systematic literature review of clinical coding and classification systems has been conducted by Stanfill et al. [11]. Some challenge tasks in biomedical text mining also focus on clinical text classification, e.g., Informatics for Integrating Biology and the Bedside (i2b2) hosted text classification tasks on determining smoking status [10], and predicting obesity and its co-morbidities [12]. In this work, we focus on the obesity challenge [12]. Among the top ten systems of the obesity challenge, most are rule-based systems, and the top four systems are purely rule-based.
Many approaches for clinical text classification rely on biomedical knowledge sources [3]. A common approach is to first map narrative text to concepts from knowledge sources like the Unified Medical Language System (UMLS), then train classifiers on document representations that include UMLS Concept Unique Identifiers (CUIs) as features [6]. More knowledge-intensive approaches enrich the feature set with related concepts [4] or apply semantic kernels that project documents that contain related concepts closer together in a feature space [7]. Similarly, Yao et al. [13] proposed to improve distributed document representations with medical concept descriptions for traditional Chinese medicine clinical records classification.
On the other hand, some clinical text classification studies use various types of information instead of knowledge sources. For instance, effective classifiers have been designed based on regular expression discovery [14] and semi-supervised learning [15, 16]. Active learning [17], which leverages unlabeled corpora to improve the classification of clinical text, has been applied in the clinical domain.
Although these methods used rules, knowledge sources or different types of information in many ways, they seldom use effective feature learning methods, while deep learning methods have recently been widely used for text classification and have shown powerful feature learning capabilities.

Deep learning for clinical data mining
Recently, deep learning methods have been successfully applied to clinical data mining. Two representative deep models are convolutional neural networks (CNN) [18, 19] and recurrent neural networks (RNN) [20, 21]. They achieve state-of-the-art performance on a number of clinical data mining tasks. Beaulieu-Jones et al. [22] designed a neural network approach to construct phenotypes for classifying patient disease status. The model performed better than decision trees, random forests and Support Vector Machines (SVM). They also showed that the model successfully learns the structure of high-dimensional EHR data for phenotype stratification. Gehrmann et al. [23] compared CNN to traditional rule-based entity extraction systems using cTAKES and to Logistic Regression (LR) with n-gram features. They tested ten different phenotyping tasks on discharge summaries. CNN outperformed other phenotyping algorithms on the prediction of the ten phenotypes, and they concluded that deep learning-based NLP methods improved the patient phenotyping performance compared to other methods. Luo et al. applied CNN, RNN, and Graph Convolutional Networks (GCN) to classify the semantic relations between medical concepts in discharge summaries from the i2b2-VA challenge dataset [24] and showed that CNN, RNN and GCN with only word embedding features can obtain similar or better performances compared to state-of-the-art systems by challenge participants with heavy feature engineering [25–27]. Wu et al. [28] applied CNN using pre-trained embeddings on clinical text for named entity recognition. They showed that their models outperformed the conditional random fields (CRF) baseline. Geraci et al. [29] applied deep learning models to identify youth depression in unstructured text notes. They obtained a sensitivity of 93.5% and a specificity of 68%. Jagannatha et al. [30, 31] experimented with RNN, long short-term memory (LSTM), gated recurrent units (GRU), bidirectional LSTM, and combinations of LSTM with CRF, to extract clinical concepts from texts. They demonstrated that all RNN variants outperformed the CRF baseline. Lipton et al. [32] evaluated LSTM in phenotype prediction using multivariate time series clinical measurements. They showed that their model outperformed multi-layer perceptron (MLP) and LR. They also concluded that combining MLP and LSTM leads to the best performance. Che et al. [33] also applied deep neural networks to model time series in ICU data. They introduced a Laplacian regularization process on the sigmoid layer based on medical knowledge bases and other structured knowledge. In addition, they designed an incremental training procedure to iteratively add neurons to the hidden layer. They then used causal inference to analyze and interpret hidden layer representations. They showed that their method improved the performance of phenotype identification; the model also converges faster and is easier to interpret.
Although deep learning techniques have been well studied in clinical data mining, most of these works do not focus on long clinical text classification (e.g., an entire clinical note) or utilize knowledge sources, while we propose a novel knowledge-guided deep learning method for clinical text classification.

Obesity challenge
The objective of the i2b2 2008 obesity challenge [12] is to assess text classification methods for determining
patient disease status with respect to obesity and 15 of its comorbidities: Diabetes mellitus (DM), Hypercholesterolemia, Hypertriglyceridemia, Hypertension, Atherosclerotic cardiovascular disease (CAD), Heart failure (CHF), Peripheral vascular disease (PVD), Venous insufficiency, Osteoarthritis (OA), Obstructive sleep apnea (OSA), Asthma, Gastroesophageal reflux disease (GERD), Gallstones, Depression, and Gout. Our goal is to label each document as either Present (Y), Absent (N), Questionable (Q) or Unmentioned (U) for each disease. Macro F1 score is the primary metric for evaluating and ranking classification methods.
The challenge consists of two tasks, namely the textual task and the intuitive task. The textual task is to identify explicit evidence of the diseases, while the intuitive task focuses on the prediction of the disease status when the evidence is not explicitly mentioned. Thus, the Unmentioned (U) class label was excluded from the intuitive task. The classes are distributed very unevenly: there are only a few N and Q examples in the textual task data set and few Q examples in the intuitive task data set, as shown in Table 1. Some classes even have no training examples. For instance, there is no training example with Q or N label for Depression in the textual task, and there is no training example with Q label for Gallstones in the intuitive task. The details of the datasets can be found in [12].

Table 1 The class distribution in the obesity challenge datasets
Label   Training Set            Test Set
        Textual    Intuitive    Textual    Intuitive
Y       3208       3267         2192       2285
N       87         7362         65         5100
Q       39         26           17         14
U       8296       0            5770       0

Method
Our method contains three steps: (1) identifying trigger phrases; (2) predicting classes with very few examples using trigger phrases; (3) learning a knowledge-guided CNN for more populated classes. Our implementation is available at https://github.com/yao8839836/obesity. We use Solt’s system [5] to recognize trigger phrases and predict classes with very few examples. Solt’s system is a very powerful rule-based system. It ranked first in the intuitive task, second in the textual task and first overall in the obesity challenge. Solt’s system can identify very informative trigger phrases with different contexts (positive, negative or uncertain). We use the Perl implementation of Solt’s system provided by the authors: https://github.com/yao8839836/obesity/tree/master/perl_classifier.

Trigger phrases identification
We recognize trigger phrases following Solt’s system [5]. We first conduct the same preprocessing, such as abbreviation resolution and family history removal. We then use the disease names (class names), their directly associated terms and negative/uncertain words to recognize trigger phrases. The trigger phrases are disease names (e.g., Gallstones) and their alternative names (e.g., Cholelithiasis) with/without negative or uncertain words.

Predicting classes with very few examples using trigger phrases
As the classes in the obesity challenge are very unbalanced, and some classes have no training examples at all, we could not make predictions for these classes using machine learning methods and instead resort to the rules defined in Solt’s system [5]. We exclude classes with very few examples in the training set of each disease. Specifically, we remove examples with Q label in the intuitive task and remove examples with Q or N label in the textual task. Then for examples in the test set, we use trigger phrases to predict their labels. As in Solt’s system [5], we assume positive trigger phrases (disease names and alternatives without uncertain or negative words) take priority over negative trigger phrases, and negative trigger phrases take priority over uncertain trigger phrases. Therefore, if a clinical record contains uncertain trigger phrases and does not contain positive or negative trigger phrases, we label it as Q. Similarly, if a clinical record contains negative trigger phrases and does not contain positive trigger phrases, we label it as N.

Knowledge-guided convolutional neural networks
After excluding classes with very few examples, only two classes remain in the training set of each disease (Y and N for the intuitive task, Y and U for the textual task). We learn a CNN on positive trigger phrases and UMLS CUIs in training records, then classify test examples using the trained CNN model. CNN is a powerful deep learning model for text classification, and it performs better than recurrent neural networks in our preliminary experiment. The test phase of our method is given in Fig. 1. If a record in the test set is labeled Q or N by Solt’s system, we trust Solt’s system. Otherwise, we use the CNN to predict the label of the record.
For each disease, we feed its positive trigger phrases with word2vec [34] word embeddings to the CNN. We employed the 200 dimensional pre-trained word embeddings learned from MIMIC-III [35] clinical notes. We experimented with 100, 200, 300, 400, 500 and
Fig. 1 The test phase of our method
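
The test phase in Fig. 1, together with the priority rule stated in the text (positive over negative over uncertain trigger phrases), can be sketched in Python as follows. This is only an illustrative sketch and not the authors' Perl or TensorFlow code: the function names `rule_label` and `predict_label`, the three boolean flags, and the `cnn_predict` callback are hypothetical stand-ins for the output of the trigger phrase recognizer and the trained CNN.

```python
from typing import Callable, Optional


def rule_label(has_positive: bool, has_negative: bool,
               has_uncertain: bool) -> Optional[str]:
    """Return 'N' or 'Q' when the rules decide the label, otherwise None.

    Priority follows the text: positive > negative > uncertain.
    """
    if has_positive:
        return None   # positive evidence present: defer to the classifier
    if has_negative:
        return "N"    # negative trigger phrases and no positive ones
    if has_uncertain:
        return "Q"    # only uncertain trigger phrases
    return None       # no trigger phrases at all: defer to the classifier


def predict_label(has_positive: bool, has_negative: bool,
                  has_uncertain: bool,
                  cnn_predict: Callable[[], str]) -> str:
    """Trust the rule-based label for Q and N, otherwise ask the CNN."""
    decided = rule_label(has_positive, has_negative, has_uncertain)
    if decided is not None:
        return decided
    # Hypothetical callback: returns Y/U (textual) or Y/N (intuitive).
    return cnn_predict()
```

For example, a record with only uncertain trigger phrases is labeled Q without consulting the classifier, while a record with a positive trigger phrase is always passed to `cnn_predict`.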

600 dimensional word embeddings, and found that using 200 dimensional word embeddings achieves the best performance.
We also utilize a medical knowledge base to enrich the CNN model input. We link the full clinical text to CUIs in UMLS [9] via MetaMap [36]. Each clinical record is represented as a bag of CUIs after entity linking. We feed 13 types of CUIs which are closely connected to diseases as the input entities of the CNN: Body Part, Organ, or Organ Component (T023), Finding (T033), Laboratory or Test Result (T034), Disease or Syndrome (T047), Mental or Behavioral Dysfunction (T048), Cell or Molecular Dysfunction (T049), Laboratory Procedure (T059), Diagnostic Procedure (T060), Therapeutic or Preventive Procedure (T061), Pharmacologic Substance (T121), Biomedical or Dental Material (T122), Biologically Active Substance (T123) and Sign or Symptom (T184). We list these CUI types with their type unique identifiers (TUIs) in Table 2. We found that using this subset of CUIs achieves better performance than using all CUIs. We employ the pre-trained CUI embeddings made by [37] as the input entity representations of the CNN.
Our CNN architecture is given in Fig. 2. The input layer looks up word embeddings of positive trigger phrases and entity embeddings of selected CUIs in each clinical record. w_0, w_1, w_2, ..., w_n are words in positive trigger phrases and e_0, e_1, e_2, ..., e_n are CUIs in a record. A one dimensional convolution layer is built on the word embeddings and entity embeddings. We use max pooling to select the most prominent feature with the highest value in the convolutional feature map, then concatenate the max pooling results of word embeddings and entity embeddings. The concatenated hidden representations are fed into a fully-connected layer, then a dropout and a ReLU activation layer. Lastly, a fully-connected layer is fed to a softmax layer, whose output is the multinomial distribution over labels.
We implement our knowledge-guided CNN model using TensorFlow [38], a popular deep learning framework. We set the following parameters for our CNN model: convolution kernel size: 5; number of convolution filters: 256; dimension of the hidden layer in the fully connected layer: 128; dropout keep probability: 0.8; number of learning epochs: 30; batch size: 64; learning rate: 0.001. We also experimented with other settings of the parameters but did not find much difference. We use softmax cross entropy loss and the Adam optimizer [39].

Results
Tables 3 and 4 show Macro F1 scores and Micro F1 scores of our method and Solt’s system. We report results of both Solt’s paper [5] and the Perl implementation because we base our method on the Perl implementation and we found some differences between the paper’s results and the Perl implementation’s results. This is likely due to further feature engineering that was not reflected when Solt et al. submitted classification output to the challenge. For completeness of the results, we show the performances from both Solt’s paper and code. We also report the results of our method when using only word embeddings as CNN input.
From the two tables, we can note that the Perl implementation performs slightly better than the paper; the authors might not have submitted their best results to the obesity challenge. We can also see that the CNN model with word embeddings only performs better than the Perl implementation in the intuitive task, which means using a deep learning model can learn effective features for better classification. The input trigger phrases for the CNN are the same as the trigger phrases for Y/U (textual task) or Y/N (intuitive task) labeling in the Perl code. The results in the textual task are not improved when using word embeddings
only, because the textual task needs explicit evidence to label the records, and the positive trigger phrases contain enough information; therefore, CNN with word embeddings only may not be particularly helpful. Nevertheless, after adding CUI embeddings as additional input, more scores for different diseases are improved, and the overall F1 scores are higher than Solt’s system in the two tasks. This is likely due to the fact that the disambiguated CUIs are closely connected to diseases and their embeddings have more semantic information, which is beneficial for disease classification. To the best of our knowledge, we have achieved the highest overall F1 scores in the intuitive task so far.

Table 2 The types of CUIs we used
TUI    Semantic type description
T023   Body Part, Organ, or Organ Component
T033   Finding
T034   Laboratory or Test Result
T047   Disease or Syndrome
T048   Mental or Behavioral Dysfunction
T049   Cell or Molecular Dysfunction
T059   Laboratory Procedure
T060   Diagnostic Procedure
T061   Therapeutic or Preventive Procedure
T121   Pharmacologic Substance
T122   Biomedical or Dental Material
T123   Biologically Active Substance
T184   Sign or Symptom

Note that the F1 scores of Solt’s paper and Perl implementation remain the same, while our model produces slightly different F1 scores in different runs. We ran our model 10 times and observed that the overall Macro F1 scores and Micro F1 scores are significantly higher than Solt’s paper and implementation (p value <0.05 based on

Fig. 2 Our knowledge-guided convolutional neural network architecture
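
To make the data flow of Fig. 2 concrete, the following is a minimal pure-Python sketch of a forward pass: a 1-D convolution and max pooling over the word embeddings, the same over the CUI embeddings, concatenation, a hidden fully-connected layer with ReLU, and a softmax output. It is an illustration under stated assumptions, not the authors' TensorFlow model: the helper names, the random weights, the use of separate filter banks for words and CUIs, zero-padding of short inputs, and the toy sizes (kernel 2, 3 filters, hidden size 4 instead of the reported 5, 256 and 128) are all assumptions. Dropout is omitted because it acts as the identity at test time.

```python
import math
import random


def conv1d_maxpool(seq, filters, biases):
    """Valid 1-D convolution over a list of embedding vectors, then max over time."""
    k = len(filters[0])           # kernel size (5 in the paper)
    dim = len(filters[0][0])      # embedding dimension
    # Zero-pad short inputs so that at least one window exists (assumption).
    seq = list(seq) + [[0.0] * dim] * max(0, k - len(seq))
    pooled = []
    for filt, b in zip(filters, biases):
        responses = [
            b + sum(filt[i][j] * seq[t + i][j]
                    for i in range(k) for j in range(dim))
            for t in range(len(seq) - k + 1)
        ]
        pooled.append(max(responses))   # most prominent feature per filter
    return pooled


def dense(x, weights, biases):
    """Fully-connected layer: one output per weight row."""
    return [b + sum(w * v for w, v in zip(row, x))
            for row, b in zip(weights, biases)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word_vecs, cui_vecs, p):
    """Words and CUIs are convolved separately, pooled, then concatenated."""
    h = (conv1d_maxpool(word_vecs, p["word_filters"], p["word_bias"])
         + conv1d_maxpool(cui_vecs, p["cui_filters"], p["cui_bias"]))
    hidden = [max(0.0, v) for v in dense(h, p["w1"], p["b1"])]  # ReLU
    return softmax(dense(hidden, p["w2"], p["b2"]))


def random_params(rng, dim=4, kernel=2, n_filters=3, hidden=4, n_classes=2):
    """Toy random weights; a real model would learn these with Adam."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]
    return {
        "word_filters": [mat(kernel, dim) for _ in range(n_filters)],
        "word_bias": [0.0] * n_filters,
        "cui_filters": [mat(kernel, dim) for _ in range(n_filters)],
        "cui_bias": [0.0] * n_filters,
        "w1": mat(hidden, 2 * n_filters), "b1": [0.0] * hidden,
        "w2": mat(n_classes, hidden), "b2": [0.0] * n_classes,
    }


rng = random.Random(0)
params = random_params(rng)
words = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]  # 6 words
cuis = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(1)]   # 1 CUI
probs = forward(words, cuis, params)   # distribution over the two labels
```

Here `probs` has one entry per remaining class (Y/U in the textual task, Y/N in the intuitive task) and sums to one; in a trained model the weights would be learned with the softmax cross entropy loss rather than drawn at random.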



Table 3 Macro F1 scores and Micro F1 scores of Solt’s system [5] (paper) and our method with word and entity embeddings
                      Solt’s paper [5]                        Our method with word & entity embeddings
Disease               Textual             Intuitive           Textual             Intuitive
                      Macro F1  Micro F1  Macro F1  Micro F1  Macro F1  Micro F1  Macro F1  Micro F1
Asthma                0.9434    0.9921    0.9784    0.9894    0.9434    0.9921    0.9784    0.9894
CAD                   0.8561    0.9256    0.6122    0.9192    0.8551    0.9235    0.6233    0.9345
CHF                   0.7939    0.9355    0.6236    0.9315    0.7939    0.9355    0.6236    0.9315
Depression            0.9716    0.9842    0.9346    0.9539    0.9716    0.9842    0.9602    0.9727
DM                    0.9032    0.9761    0.9682    0.9729    0.9056    0.9801    0.9731    0.9770
Gallstones            0.8141    0.9822    0.9729    0.9857    0.8141    0.9822    0.9689    0.9837
GERD                  0.4880    0.9881    0.5768    0.9131    0.4880    0.9881    0.5768    0.9131
Gout                  0.9733    0.9881    0.9771    0.9900    0.9733    0.9881    0.9771    0.9900
Hypercholesterolemia  0.7922    0.9721    0.9053    0.9072    0.7922    0.9721    0.9113    0.9118
Hypertension          0.8378    0.9621    0.8851    0.9283    0.8378    0.9621    0.9240    0.9484
Hypertriglyceridemia  0.9732    0.9980    0.7981    0.9712    0.9434    0.9961    0.7092    0.9630
OA                    0.9594    0.9761    0.6286    0.9589    0.9626    0.9781    0.6307    0.9610
Obesity               0.4879    0.9675    0.9724    0.9732    0.4885    0.9696    0.9747    0.9754
OSA                   0.8781    0.9920    0.8805    0.9939    0.8781    0.9920    0.8805    0.9939
PVD                   0.9682    0.9862    0.6348    0.9763    0.9682    0.9862    0.6314    0.9742
Venous insufficiency  0.8403    0.9822    0.8083    0.9625    0.8816    0.9882    0.8083    0.9625

Overall               0.8000    0.9756    0.6745    0.9590    0.8016    0.9763    0.6768    0.9624

Scores in bold are higher than the corresponding scores of the paper and the Perl implementation
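
The gap between Macro F1 and Micro F1 in these tables comes from class imbalance, and a small worked example may help. The sketch below uses the common definitions (Macro F1 as the unweighted mean of per-class F1, Micro F1 from pooled counts); this is an assumption, since the official challenge scorer may average differently, and the function name `f1_scores` and the toy label counts are made up for illustration.

```python
from collections import Counter


def f1_scores(gold, pred, labels):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[g] += 1   # gold class gains a false negative

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    per_class = [f1(tp[c], fp[c], fn[c]) for c in labels]
    macro = sum(per_class) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data: 8 Y, 90 U and 2 Q records; both Q records are predicted as U.
gold = ["Y"] * 8 + ["U"] * 90 + ["Q"] * 2
pred = ["Y"] * 8 + ["U"] * 92
macro, micro = f1_scores(gold, pred, labels=["Y", "U", "Q"])
```

With these toy counts, Micro F1 is 0.98 while Macro F1 is about 0.66, because the two missed Q records give the Q class an F1 of zero and each class counts equally in the macro average.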

Table 4 Macro F1 scores and Micro F1 scores of Solt’s system [5] (code) and our method with word embeddings only
                      Solt’s code                             Our method with word embeddings only
Disease               Textual             Intuitive           Textual             Intuitive
                      Macro F1  Micro F1  Macro F1  Micro F1  Macro F1  Micro F1  Macro F1  Micro F1
Asthma                0.9434    0.9921    0.9784    0.9894    0.9434    0.9921    0.9784    0.9894
CAD                   0.8551    0.9235    0.6122    0.9192    0.8551    0.9235    0.6122    0.9192
CHF                   0.7939    0.9355    0.6236    0.9315    0.7939    0.9355    0.6236    0.9315
Depression            0.9716    0.9842    0.9346    0.9539    0.9716    0.9842    0.9602    0.9767
DM                    0.9056    0.9801    0.9731    0.9770    0.9056    0.9801    0.9731    0.9770
Gallstones            0.8141    0.9822    0.9729    0.9857    0.8141    0.9822    0.9729    0.9857
GERD                  0.4880    0.9881    0.5768    0.9131    0.4880    0.9881    0.5768    0.9131
Gout                  0.9733    0.9881    0.9771    0.9900    0.9733    0.9881    0.9771    0.9900
Hypercholesterolemia  0.7922    0.9721    0.9101    0.9118    0.7922    0.9721    0.9042    0.9049
Hypertension          0.8378    0.9621    0.8861    0.9283    0.8378    0.9621    0.9240    0.9484
Hypertriglyceridemia  0.9732    0.9980    0.7092    0.9630    0.9732    0.9980    0.7092    0.9630
OA                    0.9626    0.9781    0.6307    0.9610    0.9626    0.9781    0.6307    0.9610
Obesity               0.4885    0.9696    0.9747    0.9754    0.4885    0.9696    0.9747    0.9754
OSA                   0.8781    0.9920    0.8805    0.9939    0.8781    0.9920    0.8805    0.9939
PVD                   0.9682    0.9862    0.6314    0.9742    0.9682    0.9862    0.6314    0.9742
Venous insufficiency  0.8403    0.9822    0.8083    0.9625    0.8403    0.9822    0.8083    0.9625

Overall               0.8014    0.9760    0.6745    0.9592    0.8014    0.9760    0.6760    0.9612

Scores in bold are higher than the corresponding scores of the paper and the Perl implementation
Student’s t-test). We checked the cases our method failed to predict correctly and found that most error cases are caused by using Solt’s positive trigger phrases. For many error cases, our method predicted N or U when no positive trigger phrases are identified, but the real labels are Y. For some other cases, our method predicted Y when positive trigger phrases are identified, but the real labels are N or U. For some diseases, our proposed method and Solt’s system achieved a very high Micro F1 but a low Macro F1. This is due to the fact that there are only a few Q or N records for these diseases (i.e., imbalanced class ratio), and we could not identify effective negative/uncertain trigger phrases using Solt’s rules. The regular expressions in Solt’s system can be further enriched so that we can identify trigger phrases more accurately.
We also compared our method with two commonly used classifiers: Logistic Regression and linear kernel Support Vector Machine (SVM). We use the LogisticRegression and LinearSVC classes in scikit-learn as our implementations. For fair comparison, we use the same training set as the knowledge-guided CNN. We represent a record as a binary vector in which each dimension indicates whether a unique word is in its positive trigger phrases. For test examples, we also use Solt’s system to predict Q and N. If a test example is not labeled Q or N by Solt’s system, we use Logistic Regression or SVM to predict the label. Table 5 shows the results; we can observe that the results are similar to our method with word embeddings only, which means positive trigger phrases themselves are informative enough, while word embeddings could not help to improve the performances. Nevertheless, we ran our model 10 times and observed that the overall Macro F1 scores and Micro F1 scores are significantly higher than SVM and Logistic Regression (p value <0.05 based on Student’s t-test), which verifies the effectiveness of CUI embeddings again.

Discussion
We note that the knowledge features part does not improve much. In fact, we think MetaMap will indeed introduce some noisy and unrelated CUIs, as previous studies also showed. To remedy this, following Weng et al. [40], we only kept CUIs from selected semantic types that are considered most relevant to clinical tasks. We found that filtering CUIs based on semantic types did lead to moderate performance improvement over using all CUIs. In another related computational phenotyping study [41], we found that a manually curated CUI set resulted in significant performance improvement. We believe that improving entity recognition and integrating word/entity sense disambiguation will improve the performance, and plan to explore such directions in future work.

Conclusion
In this work, we propose a novel clinical text classification method which combines rule-based feature engineering
Table 5 Macro F1 scores and Micro F1 scores of Logistic Regression and SVM
                      Logistic Regression                     SVM
Disease               Textual             Intuitive           Textual             Intuitive
                      Macro F1  Micro F1  Macro F1  Micro F1  Macro F1  Micro F1  Macro F1  Micro F1
Asthma                0.9434    0.9921    0.9784    0.9894    0.9434    0.9921    0.9784    0.9894
CAD                   0.8551    0.9235    0.6204    0.9301    0.8551    0.9235    0.6122    0.9192
CHF                   0.7939    0.9355    0.6236    0.9315    0.7939    0.9355    0.6236    0.9315
Depression            0.9716    0.9842    0.9573    0.9706    0.9716    0.9842    0.9573    0.9706
DM                    0.9056    0.9801    0.9731    0.9770    0.9056    0.9801    0.9731    0.9770
Gallstones            0.8141    0.9822    0.9729    0.9857    0.8141    0.9822    0.9729    0.9857
GERD                  0.4880    0.9881    0.5768    0.9131    0.4880    0.9881    0.5768    0.9131
Gout                  0.9733    0.9881    0.9771    0.9900    0.9733    0.9881    0.9771    0.9900
Hypercholesterolemia  0.7922    0.9721    0.9043    0.9049    0.7922    0.9721    0.9134    0.9142
Hypertension          0.8378    0.9621    0.9271    0.9507    0.8378    0.9621    0.9271    0.9507
Hypertriglyceridemia  0.9732    0.9980    0.7092    0.9630    0.9732    0.9980    0.7092    0.9630
OA                    0.9626    0.9781    0.6307    0.9610    0.9626    0.9781    0.6307    0.9610
Obesity               0.4885    0.9696    0.9747    0.9754    0.4885    0.9696    0.9747    0.9754
OSA                   0.8781    0.9920    0.8805    0.9939    0.8781    0.9920    0.8805    0.9939
PVD                   0.9682    0.9862    0.6314    0.9742    0.9682    0.9862    0.6314    0.9742
Venous insufficiency  0.8403    0.9822    0.8083    0.9625    0.8403    0.9822    0.8083    0.9625

Overall               0.8014    0.9760    0.6764    0.9619    0.8014    0.9760    0.6764    0.9618

Classes with very few examples are labeled by Solt’s system
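
The binary record representation used for the Logistic Regression and SVM baselines in Table 5 can be illustrated with a short sketch. It assumes whitespace tokenization and lowercasing, which the text does not specify, and the function names and example phrases are hypothetical.

```python
def build_vocabulary(records):
    """Map each unique word in the trigger phrases to a vector index."""
    words = sorted({w for phrases in records
                    for phrase in phrases
                    for w in phrase.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_binary_vector(phrases, vocab):
    """1 if the word occurs in the record's positive trigger phrases, else 0."""
    vec = [0] * len(vocab)
    for phrase in phrases:
        for w in phrase.lower().split():
            if w in vocab:          # words unseen in training are ignored
                vec[vocab[w]] = 1
    return vec


# Each record is the list of positive trigger phrases found in it.
train_records = [["Gallstones", "cholelithiasis"], ["sleep apnea"], []]
vocab = build_vocabulary(train_records)
X_train = [to_binary_vector(r, vocab) for r in train_records]
```

Rows of `X_train` built this way are plain lists of 0/1 values, so they can be passed to the `fit` method of scikit-learn's `LogisticRegression` or `LinearSVC` together with the Y/U or Y/N labels.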
and knowledge-guided deep learning. Specifically, we use rules to identify trigger phrases which contain disease names, their alternative names and negative or uncertain words, then use these trigger phrases to predict classes with very limited examples, and finally train a knowledge-guided CNN model with word embeddings and UMLS CUI entity embeddings. The evaluation results on the obesity challenge demonstrate that our method outperforms state-of-the-art methods for the challenge. We showed that the CNN model is powerful for learning effective hidden features, and that CUI embeddings are helpful for building clinical text representations. This shows that integrating domain knowledge into CNN models is promising.
In our future work, we plan to design more principled methods and evaluate our methods on more clinical text datasets.

Acknowledgment
We would like to thank the i2b2 National Center for Biomedical Computing, funded by U54LM008748, for providing the clinical records originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner. We thank Dr. Uzuner for helpful discussions. We would also like to thank the NVIDIA GPU Grant program for providing the GPU used in our computation. This work was supported in part by NIH Grant 1R21LM012618-01.

Funding
Publication charges for this article have been funded by NIH Grants 1R21LM012618-01.

Availability of data and materials
We released the implementation at https://github.com/yao8839836/obesity.

About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 3, 2019: Selected articles from the first International Workshop on Health Natural Language Processing (HealthNLP 2018). The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-19-supplement-3.

Authors’ contributions
LY and YL designed the study and wrote the manuscript. CM contributed to the experiment and analysis. All authors contributed to the discussion and reviewed the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication

2. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform. 2009;42(5):760–72.
3. Wilcox AB, Hripcsak G. The role of domain knowledge in automating medical text report classification. J Am Med Inform Assoc. 2003;10(4):330–8.
4. Suominen H, Ginter F, Pyysalo S, Airola A, Pahikkala T, Salanter S, Salakoski T. Machine learning to automate the assignment of diagnosis codes to free-text radiology reports: a method description. In: Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications; 2008.
5. Solt I, Tikk D, Gál V, Kardkovács ZT. Semantic classification of diseases in discharge summaries using a context-aware rule-based classifier. J Am Med Inform Assoc. 2009;16(4):580–4.
6. Garla V, Brandt C. Knowledge-based biomedical word sense disambiguation: an evaluation and application to clinical document classification. J Am Med Inform Assoc. 2013;20(5):882–6.
7. Garla V, Brandt C. Ontology-guided feature engineering for clinical text classification. J Biomed Inform. 2012;45(5):992–8.
8. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge: MIT Press; 2016.
9. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl_1):267–70.
10. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):14–24.
11. Stanfill MH, Williams M, Fenton SH, Jenders RA, Hersh WR. A systematic literature review of automated clinical coding and classification systems. J Am Med Inform Assoc. 2010;17(6):646–51.
12. Uzuner Ö. Recognizing obesity and comorbidities in sparse data. J Am Med Inform Assoc. 2009;16(4):561–70.
13. Yao L, Zhang Y, Wei B, Li Z, Huang X. Traditional chinese medicine clinical records classification using knowledge-powered document embedding. In: Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference On. Piscataway: IEEE; 2016. p. 1926–8.
14. Bui DDA, Zeng-Treitler Q. Learning regular expressions for clinical text classification. J Am Med Inform Assoc. 2014;21(5):850–7.
15. Wang Z, Shawe-Taylor J, Shah A. Semi-supervised feature learning from clinical text. In: Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference On. Piscataway: IEEE; 2010. p. 462–6.
16. Garla V, Taylor C, Brandt C. Semi-supervised clinical text classification with laplacian svms: an application to cancer case management. J Biomed Inform. 2013;46(5):869–75.
17. Figueroa RL, Zeng-Treitler Q, Ngo LH, Goryachev S, Wiechmann EP. Active learning for clinical text classification: is it better than random sampling? J Am Med Inform Assoc. 2012;19(5):809–16.
18. Kim Y. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg: Association for Computational Linguistics; 2014. p. 1746–51.
19. Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Not applicable. Long Papers). Stroudsburg: Association for Computational Linguistics;
Competing interests 2014. p. 655–65.
The authors declare that they have no competing interests. 20. Tai KS, Socher R, Manning CD. Improved semantic representations from
tree-structured long short-term memory networks. In: Proceedings of the
Publisher’s Note 53rd Annual Meeting of the Association for Computational Linguistics
Springer Nature remains neutral with regard to jurisdictional claims in and the 7th International Joint Conference on Natural Language
published maps and institutional affiliations. Processing (Volume 1: Long Papers); 2015. p. 1556–66.
21. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention
Author details networks for document classification. In: Proceedings of the 2016
1 Northwestern University, Chicago 60611, IL, USA. 2 Department of Preventive
Conference of the North American Chapter of the Association for
Medicine, Feinberg School of Medicine, Northwestern University, Chicago Computational Linguistics: Human Language Technologies. Stroudsburg:
60611, IL, USA. Association for Computational Linguistics; 2016. p. 1480–9.
22. Beaulieu-Jones BK, Greene CS, et al. Semi-supervised learning of the
Published: 4 April 2019 electronic health record for phenotype stratification. J Biomed Inform.
2016;64:168–78.
References 23. Gehrmann S, Dernoncourt F, Li Y, Carlson ET, Wu JT, et al. Comparing
1. Huang C-C, Lu Z. Community challenges in biomedical text mining over deep learning and concept extraction based methods for patient
10 years: success, failure and the future. Brief Bioinforma. 2015;17(1): phenotyping from clinical narratives. PLOS ONE. 2018;13(2):e0192360.
132–44. https://doi.org/10.1371/journal.pone.0192360.

24. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/va challenge on
concepts, assertions, and relations in clinical text. J Am Med Inform Assoc.
2011;18(5):552–6.
25. Luo Y. Recurrent neural networks for classifying relations in clinical notes.
J Biomed Inform. 2017;72:85–95.
26. Luo Y, Cheng Y, Uzuner Ö, Szolovits P, Starren J. Segment convolutional
neural networks (seg-cnns) for classifying relations in clinical notes. J Am
Med Inform Assoc. 2017;25(1):93–8.
27. Li Y, Jin R, Luo Y. Classifying relations in clinical narratives using segment
graph convolutional and recurrent neural networks (Seg-GCRNs). J Am
Med Inform Assoc. 2018;26(3):262–8.
28. Wu Y, Jiang M, Lei J, Xu H. Named entity recognition in chinese clinical
text using deep neural network. Stud Health Technol Inform. 2015;216:
624.
29. Geraci J, Wilansky P, de Luca V, Roy A, Kennedy JL, Strauss J. Applying
deep neural networks to unstructured text notes in electronic medical
records for phenotyping youth depression. Evid-Based Ment Health.
2017;20(3):83–7.
30. Jagannatha AN, Yu H. Structured prediction models for rnn based
sequence labeling in clinical text. In: Proceedings of the Conference on
Empirical Methods in Natural Language Processing. Conference on
Empirical Methods in Natural Language Processing, vol 2016.
Stroudsburg: Association for Computational Linguistics; 2016. p. 856.
31. Jagannatha AN, Yu H. Bidirectional rnn for medical event detection in
electronic health records. In: Proceedings of the Conference. Association
for Computational Linguistics. North American Chapter. Meeting. vol
2016. Stroudsburg: Association for Computational Linguistics; 2016.
p. 473.
32. Lipton ZC, Kale DC, Elkan C, Wetzel R. Learning to Diagnose with LSTM
Recurrent Neural Networks. In: International Conference on Learning
Representations (ICLR); 2016.
33. Che Z, Kale D, Li W, Bahadori MT, Liu Y. Deep computational
phenotyping. In: Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. New York; 2015.
p. 507–16.
34. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed
representations of words and phrases and their compositionality. In: NIPS.
Cambridge: MIT Press; 2013. p. 3111–9.
35. Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody
B, Szolovits P, Celi LA, Mark RG. Mimic-iii, a freely accessible critical care
database. Sci Data. 2016;3:160035.
36. Aronson AR, Lang F-M. An overview of metamap: historical perspective
and recent advances. J Am Med Inform Assoc. 2010;17(3):229–36.
37. De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic
similarity with a neural language model. In: Proceedings of the 23rd ACM
International Conference on Conference on Information and Knowledge
Management. ACM; 2014. p. 1819–22.
38. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, et al.
Tensorflow: A system for large-scale machine learning. In: 12th USENIX
Symposium on Operating Systems Design and Implementation (OSDI
16); 2016. p. 265–83.
39. Kingma DP, Ba J. Adam: A method for stochastic optimization. In: International
Conference on Learning Representations (ICLR); 2015.
40. Weng W-H, Wagholikar KB, McCray AT, Szolovits P, Chueh HC. Medical
subdomain classification of clinical notes using a machine learning-based
natural language processing approach. BMC Med Inform Decis Mak.
2017;17(1):155.
41. Zeng Z, Li X, Espino S, Roy A, Kitsch K, Clare S, Khan S, Luo Y.
Contralateral breast cancer event detection using nature language
processing. In: AMIA Annual Symposium Proceedings, vol 2017. Bethesda:
American Medical Informatics Association; 2017. p. 1885.
