Precursor-Induced Conditional Random Fields
https://doi.org/10.1186/s12911-019-0865-1
Abstract
Background: This paper presents a conditional random fields (CRF) method that enables the capture of specific
high-order label transition factors to improve clinical named entity recognition performance. Consecutive clinical
entities in a sentence are usually separated from each other, and the textual descriptions in clinical narrative
documents frequently indicate causal or posterior relationships that can be used to facilitate clinical named
entity recognition. However, the CRF generally used for named entity recognition is a first-order
model that limits label transition dependencies to adjoining labels under the Markov assumption.
Methods: Based on the first-order structure, our proposed model utilizes non-entity tokens between separated
entities as an information transmission medium by applying a label induction method. The model is referred to as
precursor-induced CRF because its non-entity state memorizes precursor entity information, and the model’s structure
allows the precursor entity information to propagate forward through the label sequence.
Results: We compared the proposed model with both first- and second-order CRFs in terms of their F1-scores, using
two clinical named entity recognition corpora (the i2b2 2012 challenge and the Seoul National University Hospital
electronic health record). The proposed model demonstrated better entity recognition performance than both the
first- and second-order CRFs and was also more efficient than the higher-order model.
Conclusion: The proposed precursor-induced CRF, which uses non-entity labels to carry label transition
information, improves the entity recognition F1 score by exploiting long-distance transition factors without
exponentially increasing the computational time. In contrast, a conventional second-order CRF model that uses
longer-distance transition factors showed even worse results than the first-order model and required the longest
computation time. Thus, the proposed model could offer a considerable performance improvement over current
clinical named entity recognition methods based on CRF models.
Keywords: Clinical named entity recognition, Conditional random fields, High-order dependency, Clinical natural
language processing, Induction method
* Correspondence: [email protected]
1 Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul 03080, South Korea
2 Department of Biomedical Engineering, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul 03080, South Korea
Full list of author information is available at the end of the article
an induction method that allows information to propagate from one state to the next between two entities through the non-entity sequence within a single instance.

Although this paper concentrates on the CRF model itself rather than on medical NER in general, we briefly introduce recent studies in medical NER. Deep-learning-based methods for clinical concept identification are actively studied, especially methods based on recurrent neural network structures [16, 23–28]. In the long short-term memory and CRF architecture, the CRF is still used for labeling a sequence because the CRF model can jointly use neighboring tags in its output decision [15]. To automate medical NER, one study [29] proposed incorporating active learning. Once named entities are extracted, the identified terms can be used to derive information beyond the textual data, such as temporal information extraction [3, 30], drug–disease relationship recognition from large-scale medical literature [31], and identification of risk factors related to a particular disease [32]. To support researchers requiring NER modules, off-the-shelf medical NER programs such as CLAMP [33] and MetaMap Lite [34] have recently been published.

The remainder of this paper is organized as follows. The Methods section details the proposed CRF model and the model evaluation method. The Results section presents the evaluation results, and the Discussion section considers several observations related to the use of the proposed model in clinical NER. The Conclusion section summarizes the study's main findings.

Methods
Conditional random fields
In the conventional CRF model applied to NER, a textual instance (i.e., a sentence) can be represented as a pair (x, y), where x is an observed feature sequence including one or more words (tokens) and y is the feature sequence's corresponding label sequence. Because the text is a linear sequence of tokens, the CRF for NER takes the form of a linear chain. The length of x is the number of tokens, and the sequence y has the same length as x. The label is hidden, and the hidden state value set consists of the target entity labels and a single non-entity label for non-entity tokens. The CRF model then represents the conditional distribution P(y|x) as an equation of feature functions as follows:

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\Big\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \Big\}, \quad (1)$$

where $f_k$ is a kth arbitrary feature function having the corresponding weight $\theta_k$, K is the number of feature functions, t is the time step, T is the number of tokens in an instance x, and Z(x) is a partition function summing the numerator over all possible y sequences [35]. The learning objective is to find the weight set that maximizes the conditional distribution. The function $f_k$ is a binary indicator function that has a value of 1 only if the function matches the target condition, and is otherwise 0. Dependencies between random variables are presented in the form of the feature functions $f_k$ in the CRF; the feature functions are either transition factors or observation factors. The transition factors in the CRF model take the form $f_k^{ij}(y, y', x) = 1_{\{y = i\}} 1_{\{y' = j\}}$, where i and j are label symbols having a transition relationship according to this function. The observation factors take the form of Eq. (2), where i and o are symbols having an explicit relationship according to this function:

$$f_k^{io}(y, y', x) = 1_{\{y = i\}} \cdot 1_{\{x = o\}}. \quad (2)$$

Based on this definition of the feature function, the CRF model explicitly represents not only observation information but also label transition information for sequence labeling. For instance, presume the label symbol set {A, B, O}; assign A or B to NEs, assign the label symbol O to non-entity tokens, and presume a label sequence of length 4, [A, B, O, B], where the first occurrence of entity B follows entity A and a single non-entity token exists between the two entity Bs. The first-order CRF models only the label transitions between adjoining state labels, that is, the label transition data {(A, B), (B, O), (O, B)}, in which the transition between labels A and B is explicitly expressed. Presume another label sequence [A, O, …, O, B], where entity A precedes entity B by some distance and an arbitrary number of consecutive non-entity tokens lie between the two NEs. The first-order CRF model learns only the label transitions {(A, O), (O, O), (O, B)} from the data, in which the dependency (A, B) is not explicitly captured by the model, and the fact that entity A precedes entity B is not learned during training. Because the CRF model treats single observation tokens as single time steps in a sequence, the gap between two separate entities widens with the number of intermediary non-entities, as shown in Fig. 2.

In Fig. 2, each circle denotes a random variable for labels, and each edge denotes a dependency between connected random variables. In this structure, labels have dependencies only between neighbors. Thus, a dependency between the label symbols 'Symptom' and 'Drug' that could help predict the word 'ASA' appears to be ignored. In the case of 'ASA,' we suspected that the preceding label information could provide additional information for predicting a particular label for the word, if that information could be delivered forward.
Fig. 2 Example of entities separated by non-entity words in the CRF model (S: symptom; D: drug; O: non-entity)
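To make the two factor families concrete, the following minimal Java sketch (ours, not the authors' MALLET-based implementation; class and method names are illustrative) shows how the indicator feature functions of Eqs. (1) and (2) fire:

```java
// Minimal sketch of the CRF indicator feature functions in Eqs. (1)-(2).
// Transition factor: f_k^{ij}(y, y') = 1{y = i} * 1{y' = j}
// Observation factor: f_k^{io}(y, x) = 1{y = i} * 1{x = o}
public class CrfIndicatorFeatures {

    static double transitionFactor(String i, String j, String y, String yPrev) {
        return (y.equals(i) && yPrev.equals(j)) ? 1.0 : 0.0;
    }

    static double observationFactor(String i, String o, String y, String x) {
        return (y.equals(i) && x.equals(o)) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // For the example sequence [A, B, O, B], a first-order model only sees
        // adjacent pairs such as (A, B), (B, O), (O, B), never a long-range (A, ..., B).
        System.out.println(transitionFactor("B", "A", "B", "A")); // 1.0: transition (A -> B)
        System.out.println(transitionFactor("B", "A", "B", "O")); // 0.0: previous label is O
        System.out.println(observationFactor("D", "ASA", "D", "ASA")); // 1.0: label D with token ASA
    }
}
```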
Precursor-induced conditional random fields
To improve the CRF model for NER applications, this study introduces a precursor-induced CRF (pi-CRF) model to capture specific long-distance transition dependencies between two NEs separated by multiple non-entities. The pi-CRF model:

- Uses non-entity labels to propagate transition information between separated NEs;
- Retains the first-order model structure, keeping the model's computational complexity lower than that of a second- or higher-order CRF;
- Focuses on label subsequences with the [entity, outside+, entity] pattern, as shown in Fig. 3 (a), where the outside+ notation denotes one or more successive non-entity label symbols;
- Adds a memory element to the hidden state variables representing states labeled as non-entities, such that the initial outside label in a non-entity subsequence propagates its explicit first-order dependency on its adjacent entity to the next outside label, which in turn propagates the information to the next outside label, as shown in Fig. 3 (b);
- Uses an induction process to transmit the information from the first entity through a sequence of multiple outside labels to the second entity state, even though the model uses only first-order dependencies (Fig. 3 (b)); and
- Modifies the observation feature functions of the CRF in order to share observation symbols among outside label symbols (Eq. 4).

Label induction
In the pi-CRF, a state with an outside label binds with an additional memory element and behaves as an information transmission medium, delivering information about the presence or absence of the preceding entity forward. This requires expanding the hidden state value set (label symbols). The entity label symbols are collected from the training data, and the expanded state value set is eventually derived by concatenating the entity label symbols with the outside label symbol. A concatenated outside label symbol thus indicates that the outside label follows a specific entity label. As a naming convention, we use label[O]+ to indicate that a sequence of O (outside) labels follows the concatenated label. In the example, the symbol A[O]+ is an outside label symbol indicating that an entity A precedes it, and O[O]+ is a fragmented outside label symbol indicating that no entity has occurred before this non-entity state. The CRF models distinguish the features for observation symbols from those for label symbols; thus, label symbols of any type do not conflict with the token symbols, and any label naming convention can be used.

The form of the pi-CRF is derived from Eq. (1), and the conditional probability distribution of the CRF model extension takes the form of feature functions as follows:

$$p(y, a \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\Big\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t, a_t, a_{t-1}) \Big\}, \quad (3)$$

where the variable a stores the induced label information, and the value of $a_t$ is activated by the values of $a_{t-1}$ and $y_t$. The conjoined variables a and y are eventually used to derive a newly induced label sequence: once $a_t$ is activated, $a_t$ transmutes the value of $y_t$ (see Code 1). Based on this model, the dependency of label transition is engaged only within adjacent tokens (i.e., $y_t$ and $y_{t-1}$), because the model is designed to keep the first-order structure. Thus, the information flows forward with the induced outside label by the first-order transition, and this structure makes the conveyed information flow forward regardless of the distance.

This induction process subsequently expands the original label symbol set inside the model, producing multiple newly induced outside label symbols instead of the single outside label symbol. For example, the process modifies an original label sequence [A, O, ⋯, O, B] to [A, A[O]+, ⋯, A[O]+, B] according to Code 1 (a sketch of this step is given below, after Fig. 4). This transformation helps the model learn long-distance transitions between successive NEs even in the first-order form: from the modified example sequence, the model can learn the label transition data {(A[O]+, B)}, where the entity B depends on the non-entity taking entity A as its precursor. This process also generates a trellis structure (Fig. 4 (c)) that is slightly more complex than the trellis generated by the conventional first-order CRF model (Fig. 4 (a)), but simpler than the trellis generated by a conventional second-order CRF model (Fig. 4 (b)). CRF models generally have as many hidden state options (represented by the nodes in Fig. 4) as there are variable values at each time step, and each combination of hidden states denotes a path forward. If N is the number of hidden states in the original first-order CRF model, the pi-CRF model introduces N additional new states; however, this increase in computational complexity is relatively moderate compared to the increase induced by second- or higher-order CRF models. In addition, if the IOB2 tagging scheme [36] is applied to the pi-CRF model, the increase in the number of newly induced hidden states is halved.
Fig. 4 Trellis graphs generated by different CRFs; each circle indicates a random hidden state variable at a time step, and lines indicate the transition paths among the labels. The small circles in (c) are the memory elements added to the hidden states for the non-entity label
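The paper's Code 1 is not reproduced in this excerpt, so the following Java sketch is our reading of the induction step described above: each outside label is rewritten to carry its most recent preceding entity label as a precursor (or O[O]+ when no entity has occurred yet).

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of the label induction step (our reading of "Code 1"):
// [A, O, O, B] -> [A, A[O]+, A[O]+, B]; leading O's with no precursor -> O[O]+.
public class LabelInduction {

    static List<String> induce(List<String> labels) {
        List<String> induced = new ArrayList<>();
        String precursor = "O"; // no entity seen yet
        for (String y : labels) {
            if (y.equals("O")) {
                induced.add(precursor + "[O]+"); // outside label remembers its precursor
            } else {
                induced.add(y);   // entity labels are kept unchanged
                precursor = y;    // remember the latest entity
            }
        }
        return induced;
    }

    public static void main(String[] args) {
        System.out.println(induce(List.of("O", "A", "O", "O", "B")));
        // -> [O[O]+, A, A[O]+, A[O]+, B]; the transition (A[O]+, B) is now first-order
    }
}
```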
One of the main factors determining a CRF model's complexity is its graphical structure, which can be presented in the form of a tuple. The structure of the first-order CRF can be presented as $(y_{t-1}, y_t, x_t)$. Because the relationship between the ys corresponds to transitions, the number of transition pairs $(y_{t-1}, y_t)$ can be $N^2$; this means that at least $N^2$ calculations are required for each time step of a sequence at both training and testing time. In the same way, the graphical structure of the second-order CRF can be presented as $(y_{t-2}, y_{t-1}, y_t, x_t)$, and the transition triple $(y_{t-2}, y_{t-1}, y_t)$ requires at least $N^4$ (= $N^2 \times N^2$) calculations for each time step in training and testing the second-order model. According to the formulation of the pi-CRF (Eq. 3), the variable a does not act as a hidden variable but interacts with the variable y in order to expand the possible values of y. This design allows the pi-CRF to operate in the first-order structure and keeps the model's complexity feasible.

Observation symbol sharing
It is worth addressing one further attribute of the pi-CRF: the model uses modified observation feature functions. The observation feature function $f_k^{io}$ (Eq. 2) directly implies that a certain label i has a 'one-to-one' relationship with a certain observation symbol o. If a label symbol does not have a relationship with a particular observation symbol, that relationship is not trained. The label induction process produces multiple outside label symbols (i.e., 'label[O]+' symbols) instead of one single outside symbol (i.e., the 'O' symbol for the outside label). This induction process would prevent any single outside label symbol from having relationships with all the observation symbols related to non-entities; each outside label symbol ends up related to only a portion of the observation symbols. For the same training data, it is generally known that machine learning models with more hidden states are more likely to experience data sparseness problems because of their increased feature dimensions [37]. Likewise, during development we observed that the first-order CRF performs worse if the conventional model is trained with the induced label pattern.

To prevent this performance decrease, the multiple outside symbols are allowed to share observation symbols with each other in the pi-CRF model, according to the following observation feature function:

$$f_k^{io}(y, y', x) = 1_{\{x = o\}} \cdot \big( 1_{\{i \notin \text{outside} \,\wedge\, y = i\}} + 1_{\{i \in \text{outside} \,\wedge\, y \in \text{outside}\}} \big). \quad (4)$$

The second and third indicator terms on the right-hand side determine whether the y value is an outside label symbol. If i (the label symbol corresponding to the function $f_k$) is not an outside symbol, the equation tests whether the y value is equal to i. Conversely, if i is an outside symbol, the third indicator term has value 1 as long as the value of y is an outside symbol. Whereas the feature functions in the conventional CRF constrain a 'one-to-one' relationship between a label symbol and an observation symbol, the third indicator term allows a 'many-to-one' relationship between the whole set of outside label symbols and one observation symbol.

In the pi-CRF, the model uses Eq. (4) for its observation feature function instead of the Eq. (2) used in the conventional CRF. By way of illustration, presume a token, "doctor," occurred with three outside label symbols (O[O]+, A[O]+, and B[O]+) in the training set. According to the definition of the observation feature function constraining a one-to-one relationship, a first-order CRF has three distinct feature functions $f_a^{io}$(x = doctor, y = O[O]+), $f_b^{io}$(x = doctor, y = A[O]+), and $f_c^{io}$(x = doctor, y = B[O]+). Although the original CRF treats the three feature functions independently, the pi-CRF has one single feature function for the observation symbol and the outside label symbols, for instance, $f_k^{io}$(x = doctor, y = outside symbol). A minimal sketch of this sharing rule is given after the next subsection.

Model implementation
Both the original and the pi-CRF models were implemented in Java. The basic CRF structure and algorithms were implemented in MALLET [38]. The pi-CRF model was trained using the original linear-chain CRF algorithms without modification, because the graphical architecture of the pi-CRF model is fixed as a template for each time step in the same manner as in the original CRF model. To train the pi-CRF model, the L-BFGS optimization method [12] and l2-regularization [39] were used to exploit the conventional CRF model's most advantageous features [35]. Furthermore, the Viterbi algorithm was used for inference on unlabeled sequences. The executable files are available online.¹

¹ The executable jar files are available at https://github.com/jinsamdol
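As promised above, here is a minimal Java sketch of the observation symbol sharing in Eq. (4) (our own reading; the isOutside() helper and all names are our assumptions, not the paper's code):

```java
// A sketch of the shared observation feature of Eq. (4): non-outside labels keep
// the one-to-one test of Eq. (2), while every induced outside symbol shares one
// feature per observation symbol.
public class SharedObservationFeature {

    static boolean isOutside(String label) {
        return label.equals("O") || label.endsWith("[O]+");
    }

    // Feature indexed by (label symbol i, observation symbol o), applied to (y, x).
    static double feature(String i, String o, String y, String x) {
        if (!x.equals(o)) return 0.0;
        if (!isOutside(i)) return y.equals(i) ? 1.0 : 0.0; // one-to-one, as in Eq. (2)
        return isOutside(y) ? 1.0 : 0.0;                   // many-to-one sharing
    }

    public static void main(String[] args) {
        // "doctor" fires the same shared feature for any induced outside label:
        System.out.println(feature("O[O]+", "doctor", "A[O]+", "doctor")); // 1.0
        System.out.println(feature("O[O]+", "doctor", "B[O]+", "doctor")); // 1.0
        System.out.println(feature("O[O]+", "doctor", "B", "doctor"));     // 0.0
    }
}
```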
Parameter tuning
To train both models properly, the model parameters were regularized during the development phase. In both the original and the pi-CRF models, l2-regularization [39] was used to avoid overfitting, and the regularization takes the form in Eq. 5:

$$- \sum_{k=1}^{K} \frac{\theta_k^2}{2\sigma^2}, \quad (5)$$

where K is the number of feature functions, $\theta_k$ is the weight of the kth feature function $f_k$, and $\sigma$ is the hyper-parameter that adjusts the amount of penalty. The regularization term is added to the log-likelihood form of the CRF models and penalizes large weights.

During the model development process, the training data were split 8:2 into training and development sets, and the parameter $\sigma$ was chosen to provide the best F1-score on the development set. The parameter tuning was performed independently on each data set, and the third feature set was used during the tuning process.

Results
Dataset description
All the experiments were performed on NER sets in the clinical and general domains: English clinical texts (i2b2 2012 NLP shared task data [3]), rheumatism patients' discharge summaries obtained from Seoul National University Hospital (SNUH) [40], and the CoNLL-2003 NER shared task corpus [41]. The documents in the SNUH set were written in English and Korean. The discharge summaries were annotated using the IOB2 tagging scheme [36]. Although the original annotation in the i2b2 2012 data contains more semantic classes, this evaluation was conducted using the problem, test, and treatment entities. For the SNUH corpus, the entities of symptom, disease, clinical lab test, medication, and procedure/operation were used. Because we are interested in identifying events related to a patient's clinical course, we used the clinical semantic classes listed above in our evaluation. For the CoNLL-2003 data, the entities of location, person, organization, and miscellaneous were annotated from general-domain news articles. Tables 1 and 2 show the data and annotation statistics for each data set. The training and testing sets in the i2b2 2012 and the CoNLL-2003 NER sets were divided following the official distribution set by the data source administrators.

As we assumed that a significant portion of the NEs are separated within sentences, we measured the word distance between the entities in the data sets. The distance dependency was measured within each instance. Table 3 shows examples of the distances between entities in the i2b2 corpus, and Fig. 5 shows the distributions of distances between entities in the entire data set for each corpus. The median distance between entities was 3, and the mean values were within the range of 3 to 5, indicating that the NEs in the data sets tended to be separated by 3 to 5 non-entity tokens. The data also indicate that the number of entities within the first-order range is less than the number of entities within the second- or higher-order ranges. In addition, the ratios of the number of entities having a transition dependency to the total number of entities were 0.85, 0.73, and 0.78 for the i2b2 2012, SNUH, and CoNLL-2003 data sets, respectively. These values indicate that in most cases, entities tend to be interrelated within an instance, rather than being present as single entities.

Table 1 Data specification

Corpus      Domain    Set     Article   Sentence   Token     Entity
i2b2 2012   Clinical  Train   190       7,258      94,836    11,239
                      Test    120       5,547      78,564    9,623
SNUH        Clinical  Train   196       11,669     116,402   18,383
                      Test    193       11,042     107,666   17,125
CoNLL 2003  General   Train   946       14,987     203,621   23,499
                      Test    231       3,684      46,435    5,629

Table 2 Annotation statistics

a) i2b2 2012
Set     Problem   Test     Treatment
Train   4,962     2,558    3,719
Test    4,270     2,140    3,213

b) SNUH
Set     Symptom   Test     Disease   Medication   Procedure
Train   3,923     4,559    5,084     3,642        1,175
Test    3,737     3,917    4,828     3,496        1,147

c) CoNLL 2003
Set     Location   Person   Organization   Miscellaneous
Train   7,140      6,600    6,321          3,438
Test    1,656      1,617    1,662          694

Table 3 Example sentences of the entity distances (single: entity not having a precursor)

Type         Example sentence with entity annotation
single       The patient is a 28-year-old woman who is [HIV positive]problem for 2 years .
distance 0   With [intravenous hydration]treatment [the BUN]test and …
distance 1   … because of [pancytopenia]problem and [vomiting]problem on [DDI]treatment
distance 8   She was brought in for [an esophagogastroduodenoscopy]test on 9/26 but she basically was not sufficiently [sedated]treatment and readmitted at this time for [a GI work-up]test .
Fig. 5 Histograms of distances between named entities in each corpus. The number 'n' on the x-axis means that n non-entities exist between the two entities
Feature settings
Three types of feature settings were investigated in this evaluation, as summarized in Table 4. Setting #1 is the simplest available, and setting #2 is the configuration in which character-wise prefixes and suffixes can be exploited. Although these two settings use only simple features, they reduce the potential bias that the features could exert on the performance comparison. Setting #3 implemented features used in previous evaluations of NER methods for each data set [17, 40, 42]; some particular features that are easy to implement were selected for use here. "Token" and "n-gram" are typical features used in NER. The morphologic information used included character-wise affixes (e.g., the first two characters of a token) and capitalization patterns (e.g., all capitalized, or capitalization at the word beginning) [17]. Matching indicates whether a token matches a controlled vocabulary, e.g., whether the previous token is an obvious modifier of the current token, or whether a token matches a list consisting of the first entity tokens in the training data (frequency > 10), as done by Li et al. [43].

Performance evaluation
We used the three NER datasets to compare the proposed model structure with the first- and second-order linear-chain CRFs, as well as the semi-Markov CRF [19] and the high-order CRF [18], which are variants of the CRF leveraging higher-order label transition dependencies.

First, we compared the pi-CRF with the first-order models. Table 5 shows the F1 scores of the first-order CRF, the first-order CRF trained with the induced labels, and the pi-CRF for each test set. The F1 score is the harmonic mean of the precision and recall scores. We first tested the models on all instances in each data set, and then tested the models on only those instances having two or more entities. The table shows that the proposed model structure offers a demonstrable improvement over the first-order models. The pi-CRF showed higher F1 scores for all feature settings on both the i2b2 2012 and the SNUH data sets.

In addition, the first-order CRF with induced labels shows the worst performance of all the models. Even though the induced label patterns can easily be obtained in the first-order model, we can see that using label induction without the 'observation symbol sharing' in the conventional model actually harms its performance.

We also evaluated higher-order CRF models: the conventional second-order CRF, the semi-Markov CRF [19], and the high-order CRF [18, 20] as implemented by A. Allam and M. Krauthammer [44]. The semi-Markov CRF and the high-order CRF are CRF variants using higher-order transition dependencies. The two CRF variants were trained with stochastic gradient descent for 50 epochs. The results are reported in Table 6. As shown in the table, the pi-CRF performs somewhat better than the other models in several settings, and it shows performance similar to the variants with the complex feature set.

In addition, we observe that the performance of the higher-order models, including the pi-CRF, decreased on the general-domain set (CoNLL 2003) with the simple feature settings.
Table 4 Summary of the feature settings. (w denotes the window size; if the value is absent, only the feature of the current token is used. n denotes the n of the n-gram. 'len' denotes the length of the affixes. The matching features denote the result of controlled vocabulary matching)

Set         Token   Norm-token   n-gram   Character affix   Capitalization   POS/Chunk   Matching
#1-context  w=3     w=3          –        –                 –                –           –
#2-morph    w=3     w=3          –        len=2~3, w=3      –                –           –
#3-i2b2     w=5     w=5          n=2      len=2~7, w=5      w=1              w=3         –
#3-snuh     w=5     w=3          n=2      len=2~3, w=5      –                –           modifier/control
#3-conll    w=5     –            n=1      len=3~4, w=5      w=5              –           –
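As an illustration of setting #2, here is a sketch of the window and affix features (our own naming conventions; the exact MALLET feature templates are not shown in the paper):

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of the setting #2 ("morph") features in Table 4, as we read it:
// tokens in a window of size 3 (current token +/- 1) plus 2- and 3-character
// prefixes and suffixes. Feature-string formats are our own convention.
public class MorphFeatures {

    static List<String> extract(List<String> tokens, int t) {
        List<String> feats = new ArrayList<>();
        for (int d = -1; d <= 1; d++) {                      // w = 3 token window
            int i = t + d;
            if (i >= 0 && i < tokens.size()) feats.add("tok[" + d + "]=" + tokens.get(i));
        }
        String w = tokens.get(t);
        for (int len = 2; len <= 3 && len <= w.length(); len++) {  // len = 2~3 affixes
            feats.add("prefix" + len + "=" + w.substring(0, len));
            feats.add("suffix" + len + "=" + w.substring(w.length() - len));
        }
        return feats;
    }

    public static void main(String[] args) {
        System.out.println(extract(List.of("denies", "chest", "pain"), 1));
        // [tok[-1]=denies, tok[0]=chest, tok[1]=pain, prefix2=ch, suffix2=st, prefix3=che, suffix3=est]
    }
}
```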
Table 5 F1 scores of the first-order models and the pi-CRF for each corpus. The first value ('whole') is the F1 score on the whole test set; the second value ('distanced') is the F1 score evaluated only on instances having a transition dependency between NEs. (In the original layout, bold marked the best performance and shading marked the pi-CRF rows)

                                           i2b2 2012         SNUH              CoNLL 2003
Feature  Model                             whole   distanced  whole   distanced  whole   distanced
Set 1    1st-order CRF                     67.22   68.24      74.75   73.20      60.68   62.19
         1st-order CRF w/ induced labels   66.60   67.69      74.09   72.85      23.38   15.24
         pi-CRF                            67.29   68.43      75.50   74.43      45.54   43.41
Set 2    1st-order CRF                     71.61   72.85      75.81   75.04      68.43   72.93
         1st-order CRF w/ induced labels   70.73   71.98      75.24   74.36      44.90   41.89
         pi-CRF                            71.99   73.35      76.04   75.29      69.61   72.31
Set 3    1st-order CRF                     72.55   73.97      76.18   75.06      82.57   83.13
         1st-order CRF w/ induced labels   71.25   72.75      75.37   74.18      80.81   81.55
         pi-CRF                            72.58   74.04      76.24   75.33      82.08   82.76
When we compare these results with the corresponding tests in Table 5, the pi-CRF performs worse than the conventional models on the CoNLL data; however, this performance decrease of the higher-order models under naïve feature settings might be expected.

Table 7 compares the proposed model's training and inference times, using feature setting #3, with those of the conventional models. The table shows the numbers of parameters and states, the elapsed training time, the training time per iteration, and the elapsed inference time. These values indicate that the pi-CRF design was slightly more complicated than the first-order CRF, but less complicated than the second-order CRF, while still exploiting the transition information between NEs separated by long and arbitrary distances.

Result analysis
We also examined the models' behavior on the test data sets. Table 8 shows the numbers of predicted entities and correct predictions on each held-out data set, using feature setting #1. For the clinical data sets, the models that used long-distance transition dependency (i.e., the second-order CRF and the pi-CRF) tended to predict more entities than the first-order model.
Table 6 F1 scores of the higher-order CRF models and the pi-CRF for each corpus. The first value ('whole') is the F1 score on the whole test set; the second value ('distanced') is the F1 score evaluated only on instances having a transition dependency between NEs. (In the original layout, bold marked the best performance and shading marked the pi-CRF rows)

                               i2b2 2012         SNUH              CoNLL 2003
Feature  Model                 whole   distanced  whole   distanced  whole   distanced
Set 1    2nd-order CRF         69.46   70.88      73.43   72.21      58.34   54.52
         semi-Markov CRF       67.87   68.91      73.44   71.61      37.31   34.13
         high-order CRF        68.38   69.52      73.50   71.69      36.97   33.87
         pi-CRF                67.29   68.43      75.50   74.43      45.54   43.41
Set 2    2nd-order CRF         70.99   72.31      74.31   73.27      73.21   72.26
         semi-Markov CRF       72.19   73.54      76.01   74.87      63.19   63.32
         high-order CRF        71.50   72.74      76.11   74.97      63.56   63.76
         pi-CRF                72.30   73.61      76.20   75.47      69.61   72.31
Set 3    2nd-order CRF         71.75   73.01      75.17   74.05      83.13   83.96
         semi-Markov CRF       69.30   70.73      76.70   75.79      82.47   83.29
         high-order CRF        69.26   70.64      76.73   75.91      82.18   82.80
         pi-CRF                72.58   74.04      76.28   75.45      82.08   82.76
Table 7 Efficiency test results. The numbers of parameters and states indicate the model's size; the elapsed training/inference times indicate the model's speed

Data   Model          Parameters  States  Training time (s)  Time per iteration (s)  Inference time (s)
i2b2   1st-order CRF  442,705     8       1,550              12.5                    1.7
       2nd-order CRF  581,604     64      6,819              55.4                    5.7
       pi-CRF         442,768     11      3,751              17.0                    2.1
SNUH   1st-order CRF  396,245     12      2,946              19.5                    1.9
       2nd-order CRF  495,772     144     27,388             139.7                   9.3
       pi-CRF         396,400     17      6,231              23.6                    2.1
CoNLL  1st-order CRF  313,672     10      4,031              19.1                    0.6
       2nd-order CRF  431,044     100     24,828             173.6                   2.6
       pi-CRF         313,776     14      13,512             29.4                    0.7
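A back-of-the-envelope check of the state counts in Table 7 (i2b2) against the N²/N⁴ argument above:

```java
// Per-time-step transition work grows with the square of the unrolled state
// count; using the i2b2 state counts from Table 7 (8, 11, 64):
public class TransitionWork {
    public static void main(String[] args) {
        int firstOrder = 8, piCrf = 11, secondOrder = 64; // Table 7, i2b2 rows
        System.out.println("1st-order: " + firstOrder * firstOrder);   // 64 pairs/step
        System.out.println("pi-CRF   : " + piCrf * piCrf);             // 121 pairs/step
        System.out.println("2nd-order: " + secondOrder * secondOrder); // 4096 = N^4
    }
}
```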
The pi-CRF model also correctly predicted more entities than the first-order CRF model, improving recall by + 0.7 and + 1.13 points on the i2b2 and SNUH sets, respectively. The final F1-score of the pi-CRF improved over the first-order model, and we attribute this improvement in F1 to the improvement in recall. However, the models that used long-distance transition dependency (the second-order CRF and the pi-CRF) showed the opposite behavior on the general data set, predicting noticeably fewer entities than the first-order model, although most of the higher-order models' predictions were correct. As a result, the precision of the pi-CRF improved by + 16.4 points on the CoNLL set, even though its recall was relatively low.

The models' prediction performance was additionally analyzed along the distances from the preceding entities. To analyze the models according to the distance between entities, we inevitably relied on recall; because evaluating the models with recall alone has its limitations, this result is presented as an auxiliary indicator. The initial recall scores were calculated only for the entities not having precursors, and then the recall scores were updated sequentially by adding entities along the distances from 0 up to the maximum distance in each data set. Figure 6 shows the analysis result.

The models' curves moved similarly along the distance between entities: according to this figure, the recall scores of the CRFs decrease as the distance increases. The CRF models seem to miss the following entity when two entities are consecutive. We could not observe a significant performance improvement of the pi-CRF compared to the other models. However, the pi-CRF shows better results when compared with the first-order CRF trained with induced labels, which uses a graphical structure similar to that of the pi-CRF; in particular, the performance of that first-order model decreased remarkably with distance. Using the induced labels is easy in the conventional model, but it does not guarantee a performance improvement without the observation symbol sharing. The models' recall scores rise sharply at the points where the distance is 1 in the i2b2 2012 and CoNLL sets.
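The cumulative recall curve of Fig. 6 can be sketched as follows (our reading of the procedure described above; the counts are dummy values, with the key −1 standing for entities without a precursor):

```java
import java.util.Map;
import java.util.TreeMap;

// A sketch of the cumulative recall computation behind Fig. 6 (our reading):
// start with entities having no precursor, then add entities gap by gap,
// updating the running recall after each addition.
public class CumulativeRecall {

    // counts.get(gap) = {gold entities at this gap, correctly predicted at this gap}
    static void curve(TreeMap<Integer, int[]> counts) {
        long gold = 0, correct = 0;
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) { // -1 first (no precursor)
            gold += e.getValue()[0];
            correct += e.getValue()[1];
            System.out.printf("up to gap %d: recall = %.3f%n",
                              e.getKey(), (double) correct / gold);
        }
    }

    public static void main(String[] args) {
        TreeMap<Integer, int[]> counts = new TreeMap<>();
        counts.put(-1, new int[]{100, 80}); // dummy counts, illustration only
        counts.put(0, new int[]{10, 5});
        counts.put(1, new int[]{50, 38});
        curve(counts);
    }
}
```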
Table 8 The numbers of gold, predicted ('expected'), and correct entities for each model on each held-out set

                                  Whole instances              Distanced instances
Data              Model           gold    expected  correct    gold    expected  correct
i2b2 (clinical)   1st-order CRF   9,623   7,361     5,708      8,552   6,188     4,927
                  2nd-order CRF           7,785     6,046              6,547     5,245
                  pi-CRF                  7,542     5,775              6,397     5,012
SNUH (clinical)   1st-order CRF   17,125  15,326    12,128     12,520  10,813    8,540
                  2nd-order CRF           15,702    12,053             11,088    8,524
                  pi-CRF                  15,516    12,322             11,012    8,758
CoNLL (general)   1st-order CRF   5,629   3,785     2,856      4,331   2,693     2,184
                  2nd-order CRF           2,778     2,529              1,986     1,799
                  pi-CRF                  1,855     1,704              1,280     1,218
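As a worked check, the counts in Table 8 reproduce the whole-instance F1 scores in Table 5; for the first-order CRF on i2b2 with feature set #1:

```java
// Worked check: Table 8's i2b2 counts for the first-order CRF reproduce the
// 67.22 F1 reported in Table 5 (feature set #1, whole instances).
public class F1FromCounts {
    public static void main(String[] args) {
        double gold = 9623, predicted = 7361, correct = 5708; // Table 8, i2b2 row
        double precision = correct / predicted;               // ~0.7755
        double recall = correct / gold;                       // ~0.5932
        double f1 = 2 * precision * recall / (precision + recall);
        System.out.printf("P=%.4f R=%.4f F1=%.4f%n", precision, recall, f1); // F1=0.6722
    }
}
```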
Fig. 6 Recall along the distances between named entities in each corpus. The y-axis denotes the recall score; a numeric label n on the x-axis denotes the set of entities having n outside labels between the entity and its precursor. (feature set: #3)
There are only a small number of entities having a gap (order) value of 0 in both of those data collections: the numbers of entities with a gap value of zero are 50, 30, and 707 in the i2b2, CoNLL, and SNUH data, respectively.

Discussion
In this study, we investigated the performance of the pi-CRF model, a newly proposed variant of the CRF model designed particularly for extracting clinical NEs: the proposed model utilizes long-distance dependency relationships between NEs separated by multiple non-entities in the CRF. The model fragments the non-entity state into fine-grained non-entity states and treats them as an information transmission medium based on the first-order linear-chain CRF structure. The evaluation results showed that the proposed pi-CRF model is more effective at clinical NER. Although the pi-CRF model was slower than the first-order CRF, it was significantly faster than the second-order CRF model even while expressing higher-order transition dependencies between NEs.

Higher-order transitions are expressed as fixed-size label transitions in the conventional CRF model. Because NEs tend to be separated by arbitrary distances, the conventional higher-order CRF model using a fixed-size state transition dependency has limited ability to express the desired information. One study of a semi-Markov CRF [19] proposed that consecutive units with the same label can be presented as a group, although that model could not convey the information from the separated NEs. Based on this idea, we developed an induction method to present consecutive non-entity labels grouped by their precursor information. In addition, the mathematical formula (Eq. 3) used to express the proposed CRF model was derived from a CRF model that used virtual evidence [45], which incorporates prior knowledge of prototypes to make the model prefer consecutive labels for a subsequence that matches a predefined pattern.

In contrast, our model uses the formula to extend the hidden variables by joining the two variables y and a. The two hidden variables are conjoined in Eq. 3: the variables are multiplied and merged into a new hidden variable, instead of using two hidden variables in the mathematical form. Because the variable a has values only if the value of the corresponding y is the non-entity state, the multiplication implies that the newly derived hidden variable y' has multiplied non-entity hidden states, and the total number of hidden states is expanded compared to the conventional CRF.

According to the evaluation results, the design of the pi-CRF model improves the CRF model's expressive power. The transition information is implemented as feature functions, and thus the transition information ultimately affects the model as one of many features. Leveraging the high-order label transition information, the pi-CRF shows better performance than the other higher-order CRF models in many evaluation settings. An advantageous attribute of the proposed model could be that it preserves a relatively compact model complexity compared to other higher-order models.

Avoiding the data sparseness problem was another significant concern in the model design. We expected the data sparseness problem to occur because the induction algorithm divides a single non-entity state into multiple states, and thus the frequency of observation features
related to the outside label symbols was divided. In the model development phase, we observed that the model's performance was inferior without the feature sharing implemented by Eq. (4). For the clinical NER tasks, the results showed that the pi-CRF design increased the F1 score.

Availability of data and materials
The executable Java file is available at the GitHub repository https://github.com/jinsamdol/precursor-induced_CRF. However, all data were extracted from the medical records of patients who had been admitted to SNUH, so the clinical data cannot be shared with other research groups without permission.
13. McDonald R, Pereira F. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005;6(Suppl 1):S6. https://doi.org/10.1186/1471-2105-6-S1-S6.
14. Bethard S, Savova G, Chen W-T, Derczynski L, Pustejovsky J, Verhagen M. SemEval-2016 Task 12: Clinical TempEval. In: Proc 10th Int Conf Semant Eval (SemEval 2016); 2016. p. 1052–62. https://doi.org/10.18653/v1/S16-1165.
15. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. In: Proceedings of NAACL-HLT 2016; 2016. p. 260–70.
16. Liu Z, Yang M, Wang X, Chen Q, Tang B, Wang Z, et al. Entity recognition from clinical texts via recurrent neural network. BMC Med Inform Decis Mak. 2017;17(Suppl 2):53–60.
17. Ratinov L, Roth D. Design challenges and misconceptions in named entity recognition. In: Proceedings of the thirteenth conference on computational natural language learning; 2009. p. 147–55.
18. Ye N, Lee WS, Chieu HL, Wu D. Conditional random fields with high-order features for sequence labeling. In: Advances in neural information processing systems; 2009. p. 2196–204.
19. Sarawagi S, Cohen WW. Semi-Markov conditional random fields for information extraction. In: Advances in neural information processing systems; 2005. p. 1185–92.
20. Cuong NV, Ye N, Lee WS, Chieu HL. Conditional random field with high-order dependencies for sequence labeling and segmentation. J Mach Learn Res. 2014;15:981–1009.
21. Fersini E, Messina E, Felici G, Roth D. Soft-constrained inference for named entity recognition. Inf Process Manag. 2014;50:807–19. https://doi.org/10.1016/j.ipm.2014.04.005.
22. Li X, Wang Y-Y, Acero A. Extracting structured information from user queries with semi-supervised conditional random fields. In: Proc 32nd Int ACM SIGIR Conf Res Dev Inf Retr (SIGIR '09); 2009. p. 572. https://doi.org/10.1145/1571941.1572039.
23. Li L, Jin L, Jiang Z, Song D, Huang D. Biomedical named entity recognition based on extended recurrent neural networks. In: Proc 2015 IEEE Int Conf Bioinforma Biomed (BIBM 2015); 2015. p. 649–52.
24. Chalapathy R, Borzeshi EZ, Piccardi M. Bidirectional LSTM-CRF for clinical concept extraction. In: Proceedings of the clinical natural language processing workshop; 2016. p. 7–12. http://arxiv.org/abs/1611.08373.
25. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Informatics Assoc. 2017;24:596–606.
26. Jauregi Unanue I, Zare Borzeshi E, Piccardi M, et al. J Biomed Inform. 2017;76:102–9. https://doi.org/10.1016/j.jbi.2017.11.007.
27. Jagannatha A, Yu H. Bidirectional recurrent neural networks for medical event detection in electronic health records. In: NAACL-HLT; 2016. p. 473–82. http://arxiv.org/abs/1606.07953.
28. Sahu SK, Anand A. Recurrent neural network models for disease name recognition using domain invariant features. In: Proceedings of the 54th annual meeting of the Association for Computational Linguistics; 2016. p. 2216–25. http://arxiv.org/abs/1606.09371.
29. Kholghi M, Sitbon L, Zuccon G, Nguyen A. Active learning: a step towards automating medical concept extraction. J Am Med Informatics Assoc. 2016;23:289–96.
30. Hao T, Pan X, Gu Z, Qu Y, Weng H. A pattern learning-based method for temporal expression extraction and normalization from multi-lingual heterogeneous clinical texts. BMC Med Inform Decis Mak. 2018;18(Suppl 1):22.
31. Wang P, Hao T, Yan J, Jin L. Large-scale extraction of drug–disease pairs from the medical literature. J Assoc Inf Sci Technol. 2017;68:2649–61.
32. Stubbs A, Kotfila C, Xu H, Uzuner Ö. Identifying risk factors for heart disease over time: overview of 2014 i2b2/UTHealth shared task track 2. J Biomed Inform. 2015;58:S67–77.
33. Soysal E, Wang J, Jiang M, Wu Y, Pakhomov S, Liu H, et al. CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Informatics Assoc. 2018;25:331–6.
34. Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Informatics Assoc. 2017;24:841–4.
35. Sutton C, McCallum A. An introduction to conditional random fields. Found Trends Mach Learn. 2011;4:267–373.
36. Tjong Kim Sang EF. Representing text chunks; 1995. p. 173–9.
37. Freitag D, McCallum A. Information extraction with HMM structures learned by stochastic optimization. In: AAAI; 2000.
38. McCallum AK. MALLET: a machine learning for language toolkit. 2002. http://mallet.cs.umass.edu. Accessed 27 Mar 2013.
39. Ng AY. Feature selection, L1 vs. L2 regularization, and rotational invariance. In: ICML 2004; 2004.
40. Lee W, Kim K, Lee EY, Choi J. Conditional random fields for clinical named entity recognition: a comparative study using Korean clinical texts. Comput Biol Med. 2018;101:7–14.
41. Tjong Kim Sang EF, De Meulder F. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003; 2003. p. 142–7.
42. Xu Y, Wang Y, Liu T, Tsujii J, Chang EI-C. An end-to-end system to identify temporal relation in discharge summaries: 2012 i2b2 challenge. J Am Med Inform Assoc. 2013;20:849–58. https://doi.org/10.1136/amiajnl-2012-001607.
43. Li L, Zhou R, Huang D. Two-phase biomedical named entity recognition using CRFs. Comput Biol Chem. 2009;33:334–8.
44. Allam A, Krauthammer M. PySeqLab: an open source Python package for sequence labeling and segmentation. https://pyseqlab.readthedocs.io.
45. Li X. On the use of virtual evidence in conditional random fields; 2009. p. 1289–97.