Example-Based Named Entity Recognition
Morteza Ziyadi∗ Yuting Sun∗ Abhishek Goswami Jade Huang Weizhu Chen
Microsoft Dynamics 365 AI
{moziyadi,yusun,agoswami,jade.huang,wzchen}@microsoft.com
approach to adapt to a new domain. But they use all of the examples, which number in the thousands, in the target domain to adapt for their evaluation. This is an unrealistic setting, as in many real-world problems, new domains can have as few as 10 or 20 examples available per entity type. We empirically observe that their approach performs poorly in the scenario of using only a few examples as support.

In this paper, we propose a novel solution to address previous limitations in the train-free and few-shot setting, in which the trained model is directly applied to identify new entities in a new domain without further fine-tuning. The proposed approach is inspired by recent advances in extractive question-answering models (Rajpurkar et al., 2018) and few-shot learning in language modeling (Raffel et al., 2019; Brown et al., 2020). First, we formulate train-free few-shot NER learning as an entity-agnostic span extraction problem with the capability to distinguish sentences with entities from sentences without entities. Our proposed approach is designed to model the correlation between support examples and a query. This way, it can leverage large open-domain NER datasets to train an entity-agnostic model that further captures this correlation, and the trained model can then be used to perform recognition on any new and custom entities. Second, our model applies a novel sentence-level attention to choose the most related examples as support examples to identify new entities. Third, we systematically compare various self-attention-based token-level similarity strategies for entity span detection.

We conduct extensive empirical studies to show that the proposed approach achieves significant and consistent improvements over previous approaches in the train-free few-shot setting. For instance, we train a model on the OntoNotes 5.0 dataset and evaluate it on multiple out-of-domain datasets (ATIS, MIT Movie, and MIT Restaurant Review), showing that the proposed model can achieve a >30% gain in F1-score using only 10 examples per entity type in the target domain as support, in comparison to Wiseman and Stratos, 2019. In addition, we investigate the domain-agnostic properties of different approaches: how much knowledge can be transferred from one domain to another from a similar or different distribution. For instance, in an experiment testing knowledge transfer from a similar domain on the SNIPS dataset, we achieve a 48% gain in F1-score (from 30% with Wiseman and Stratos, 2019 to 78% using our approach) on the SNIPS GetWeather domain using only 10 support examples per entity. For knowledge transfer to a faraway domain, we train a model on OntoNotes 5.0 and run train-free few-shot evaluation on different mixed datasets, achieving significant gains in F1-score. Finally, we perform an ablation study to compare the performance of different training and scoring algorithms for train-free few-shot NER.

2 Train-Free Example-Based NER

In general, the goal of example-based NER is to perform entity recognition for any entity, even one previously unseen during training, after utilizing a few examples of that entity as support. For example, given this example of the entity xbox game, "I purchased a game called NBA 2k 19", where NBA 2k 19 is the entity, the xbox game entity Minecraft is expected to be recognized in the following query: "I cannot play Minecraft with error code 0x111". This simple example demonstrates a single-example, single-entity-type scenario. In the real-world scenarios considered in this paper, there are multiple entity types and a few examples per entity type.

Figure 1 shows an example of example-based entity recognition using a few examples per entity type. In this example, we have two support examples each for the "Game" and "Device" entity types and three support examples for the "Error Code" entity type. The goal is to identify the entities in the query "I cannot play Minecraft with error code 0x111". For each support example for an entity type, we perform span prediction on the query for that entity type and utilize the start/end span scores from each of the predictions to inform the final prediction per entity type. Then, we aggregate the results from different entity types to obtain the final identification of entities in the query. It should be noted that for the aggregation we also consider the span score, which will be explained in detail later.

There are several challenges that our approach faces. One is how to use multiple examples with different entity types to run train-free scoring. One might consider heuristic voting algorithms as an initial approach, but we found that they did not lead to good performance. Another is that we need to fine-tune the language model to obtain a better representation that can be utilized for better train-free inference. And finally, we have to deal with the gap between the training approach and the inference technique.
Figure 1: An example of the example-based NER approach. The prediction per entity type is based on the start/end scores of each prediction. The final prediction is an aggregation of the predictions based on the final span scores. For a non-entity prediction, the predicted span falls on the [CLS] token.
Figure 2: The framework of example-based NER
$sim^{start}_i = q_i \cdot s_{start}$ (5)

$sim^{end}_i = q_i \cdot s_{end}$ (6)

The result of the operations so far is the similarity of each of the query tokens with the start/end of an entity type of a single support example. Ideally, we have multiple support examples (e.g., K) per entity type. To measure the probability of a token in the query being the start/end of the entity type using multiple support examples, we use the following formulas:

$P^{start}_i = \sum_{j=1}^{K} atten(q_{rep}, s^j_{rep}) \, (q_i \cdot s^j_{start})$ (7)

$P^{end}_i = \sum_{j=1}^{K} atten(q_{rep}, s^j_{rep}) \, (q_i \cdot s^j_{end})$ (8)

where q_i is the embedding of token i in the query and s^j_start, s^j_end are the embeddings of the start and end of the entity in support example j. The sentence representations q_rep and s^j_rep are obtained by summing token embeddings:

$q_{rep} = VectorSum_i(q_i)$ (9)

$s^j_{rep} = VectorSum_i(s^j_i)$ (10)

The vector sum of all token embeddings of a sentence represents that sentence, either for a query or a support example. Another important factor in the above equations is the atten function. We use the following soft attention mechanism to measure sentence-level similarity:

$atten(q_{rep}, s^j_{rep}) = Softmax(T \cdot \cos(q_{rep}, s^j_{rep}))$ (11)

where T is a hyper-parameter (the temperature of the Softmax function) and cos is the cosine similarity measure. The atten function measures the sentence-level similarity between the query and the support example. We combine this sentence-level similarity with the token-level similarity to produce the probability of a token being the start/end of an entity type. We utilize these probabilities to fine-tune the language model as described in the following section.
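For illustration, the following is a minimal PyTorch sketch of Eqs. 5-11, combining the token-level similarities with the sentence-level soft attention over K support examples. Tensor names, shapes, and the helper structure are assumptions made for exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def span_start_end_probs(q_tokens, s_start, s_end, s_tokens_list, T=1.0):
    """Sketch of Eqs. 5-11: combine token-level similarity with
    sentence-level soft attention over K support examples.

    q_tokens:      (Lq, H)  query token embeddings q_i
    s_start/s_end: (K, H)   start/end entity embeddings per support example
    s_tokens_list: list of (Lj, H) token embeddings, one per support example
    """
    # Eq. 9: query sentence representation as a vector sum of its tokens
    q_rep = q_tokens.sum(dim=0)                                   # (H,)
    # Eq. 10: one sentence representation per support example
    s_rep = torch.stack([s.sum(dim=0) for s in s_tokens_list])    # (K, H)

    # Eq. 11: soft attention from temperature-scaled cosine similarity
    att = F.softmax(
        T * F.cosine_similarity(q_rep.unsqueeze(0), s_rep, dim=-1), dim=0
    )                                                             # (K,)

    # Eqs. 5-6: token-level similarity of every query token to each
    # support example's entity start/end embedding
    sim_start = q_tokens @ s_start.t()                            # (Lq, K)
    sim_end = q_tokens @ s_end.t()                                # (Lq, K)

    # Eqs. 7-8: attention-weighted sum over the K support examples
    p_start = sim_start @ att                                     # (Lq,)
    p_end = sim_end @ att                                         # (Lq,)
    return p_start, p_end
```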
2.2 Training

Our approach focuses on fine-tuning the BERT language model in such a way that the contextually encoded text can be used directly for entity recognition. We initialize the language model with pre-trained BERT and then pass minibatches of data to minimize the loss function. Our loss function is the average of the span-start prediction loss and the span-end prediction loss using cross-entropy, as follows:

$L_{start} = -\sum_{t=1}^{k} \big( y^{start}_t \log P^{start}_t + (1 - y^{start}_t) \log(1 - P^{start}_t) \big)$ (12)

$L_{end} = -\sum_{t=1}^{k} \big( y^{end}_t \log P^{end}_t + (1 - y^{end}_t) \log(1 - P^{end}_t) \big)$ (13)

$Loss = (L_{start} + L_{end}) / 2$ (14)

in which k is the length of the query; y_t^start, y_t^end indicate whether token t in the query is the start or end of the entity span, respectively; and P_t^start, P_t^end correspond to the probabilities that we calculated during span detection.

While preparing the data, we add <e>, </e> tokens to the support examples around the entities. We convert a sentence with multiple entities into multiple examples such that each of the support examples includes exactly one entity. Another important consideration is to construct negative examples as well as positive examples. For instance, for the query "This is Los Angeles", where "Los Angeles" is a city entity, the support example "<e> New York </e> is a big city." is considered a positive example since it contains the same entity type, city. But the support example "Tomorrow, I have an appointment at <e> 2 pm </e>." is considered a negative example for the query, as it contains a different entity type (in this case, date). In other words, we treat negative examples as no-entity in Eq. 13. For each training data point, we construct pairs of query and positive/negative support examples to train the model. It should be noted that the ratio of positive to negative examples is an important factor. Furthermore, we use multiple examples per entity type when calculating the aforementioned probabilities in Eq. 7 and Eq. 8.

2.3 Entity Type Recognition: Scoring

The second part of our proposed algorithm concerns assigning an entity type to the potential detected spans. One of the approaches used in the literature (Tan et al., 2020) is to use a softmax layer on top of a multi-layer perceptron (MLP) classifier. This type of structure brings limitations to train-free few-shot learning when trying to recognize unseen entity types. Similar to our training schema, for scoring we instead measure the probability of each of the query tokens being the start and/or end of an entity type using the examples for that entity type. Let us say that we have M entity types, with m_E support examples for entity type E, and a query q. For each entity type E in E_1, ..., E_M, we predict a span with a corresponding score, similarly as in training. For each of the tokens in the query, we calculate the scores of being the start (P_i^start) and end (P_i^end) of a span as:

$P^{start}_i = \sum_{topK} (q_i \cdot s^j_{start})$ (15)

$P^{end}_i = \sum_{topK} (q_i \cdot s^j_{end})$ (16)

It should be noted that these scores are very similar to the probability measures that we had during training but differ when it comes to attention (i.e., Eq. 7, 8). During training, we used a soft attention schema with a fixed number of examples per entity type (i.e., K). At inference time, we use hard attention, since we have a varying number of support examples per entity type. For the hard attention, we first calculate the token-level similarity measure $(q_i \cdot s^j_{start/end})$ for each of the entities in the support examples, then pick the top K ones with the highest measure, and then sum their similarity scores with an equal attention of 1 on these examples to calculate the start/end probability. In other words, we sum the token-level similarity of the top K support examples with the highest $(q_i \cdot s^j_{start/end})$ for the probability calculation. Then, for each potential span, we calculate the total span score as the summation of the start and end scores, i.e., score(span) = P^start + P^end, where P^start and P^end are calculated using the above equations (Eq. 15, 16). After obtaining all the potential spans, we select the one with the highest score as the final span for that specific entity type, treating span prediction similarly to question-answering (QA) models.
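To make the hard-attention scoring of Eqs. 15-16 and the QA-style span selection concrete, here is a minimal PyTorch sketch for a single entity type. The function name, the maximum span length, and the tensor layout are illustrative assumptions rather than the paper's implementation.

```python
import torch

def score_spans_for_entity(q_tokens, s_start, s_end, top_k=5, max_span_len=10):
    """Hard-attention scoring (Eqs. 15-16) for one entity type.

    q_tokens:      (Lq, H) query token embeddings
    s_start/s_end: (M, H)  start/end entity embeddings from the M support
                           examples of this entity type
    Returns a list of (start, end, score) candidates, best first.
    """
    k = min(top_k, s_start.size(0))
    # Token-level similarities to every support example: (Lq, M)
    sim_start = q_tokens @ s_start.t()
    sim_end = q_tokens @ s_end.t()

    # Eqs. 15-16: sum the top-K similarities per query token
    # (equal attention of 1 on the K most similar support examples).
    p_start = sim_start.topk(k, dim=1).values.sum(dim=1)   # (Lq,)
    p_end = sim_end.topk(k, dim=1).values.sum(dim=1)       # (Lq,)

    # Enumerate candidate spans and rank them by P_start + P_end,
    # as in QA-style span prediction.
    spans = []
    for i in range(len(p_start)):
        for j in range(i, min(i + max_span_len, len(p_end))):
            spans.append((i, j, (p_start[i] + p_end[j]).item()))
    return sorted(spans, key=lambda s: s[2], reverse=True)
```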
So far, we have finalized the spans for each of the entity types. Algorithm I shows the top-span prediction per entity type. We should highlight that, similar to the QA framework, if the predicted span's start and end occur on the [CLS] token, we treat it as no span for that entity type in the query.

Algorithm I: Top span prediction per entity type
start_indexes, end_indexes = range(len(query(q)))
for start_id in start_indexes do:
|  for end_id in end_indexes do:
|  |  if start_id, end_id in the query & start_id < end_id:
|  |  |  calculate P_start, P_end using Eq. 15, 16
|  |  |  add (start_id, end_id, P_start, P_end) to the output
sort the spans based on (P_start + P_end) to select the most probable span

To merge the spans from different entity type predictions, we use their scores to sort them, remove the overlaps, and obtain the final predictions. Algorithm II shows the overall scoring functionality.

Algorithm II: Entity Type Recognition - Hard Attention Algorithm
Suppose we have M entity types with m_E support examples per entity type E and a query Q.
for each entity type E in E_1, ..., E_M do:
|  get the span prediction for E using Algorithm I
aggregate all the predictions per entity type and sort them based on span score.
remove overlaps: select the top-score span, then search for the next-best span without any overlap with the first one, and continue this for all the predicted spans.

Aside from the hard attention used in our scoring method, we also experimented with other scoring methods, which are explained in the ablation study.

3 Experimental Setup

3.1 Dataset Preparation

We test our proposed approach on different publicly available benchmark datasets, with statistics in Tables 1 and 2. The datasets are OntoNotes 5.0 [1], Conll2003 [2], ATIS [3], MIT Movie and Restaurant Review [4], and SNIPS [5]. Additionally, we use a proprietary dataset (Table 3) as a test set in some of the experiments.

We first fine-tune the language model on source datasets and then run train-free few-shot evaluation on a target domain with unseen entity types. During the fine-tuning, we use all the training data from the source datasets. When predicting an unseen entity in the target domain, we sample a subset of instances from the target domain's training data as the support set. We run the sampling multiple times and report the metric with mean and standard deviation in the experimental results. Note that none of the examples or entity types of the target dataset are seen in the fine-tuning step.

[1] from: https://github.com/swiseman/neighbor-tagging
[2] from: https://github.com/swiseman/neighbor-tagging
[3] from: https://github.com/yvchen/JointSLU
[4] from: https://groups.csail.mit.edu/sls/downloads/
[5] from: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines

3.2 Training and Evaluation Details

We use the PyTorch framework (Paszke et al., 2019) for all of our experiments. To train our models, we use the AdamW optimizer with a learning rate of 5e-5, an Adam epsilon of 1e-8, and a weight decay of 0.0. The maximum sentence length is set to 384 tokens, and K, the number of examples per entity type for training and hard-attention scoring, is 5. For training, T, the temperature of the attention, is 1. We use the BERT-base-uncased model to initialize our language model. For evaluation, we report the precision, recall, and F1 scores for all entities in the test sets based on exact matching of entity spans.

4 Experimental Results

4.1 Benchmark Study

4.1.1 Knowledge Transfer from a General Domain to Specific Domains

In this section, we run a benchmark study to investigate the performance of our approach. In the first set of experiments, we investigate how we can transfer knowledge from a general domain to specific domains. We train a model on a general source domain and then immediately evaluate the model on the target domains with support examples. Figure 4 shows the results of training a model on the OntoNotes 5.0 dataset as a generic domain and evaluating it on other target domains (i.e., ATIS,
Dataset            OntoNotes 5.0   ATIS   Movie.Review   Restaurant.Review   Conll2003
#Train (Support)   59.9k           4.6k   7.8k           7.6k                12.7k
#Test              7.8k            850    2k             1.5k                3.2k
#Entity Types      18              79     12             8                   4

Table 1: Public dataset statistics for OntoNotes 5.0, ATIS, Movie.Review, Restaurant.Review, and Conll2003
                   SNIPS
Dataset            d1     d2     d3     d4     d5     d6     d7
#Train (Support)   2k     2k     2k     2k     2k     2k     1.8k
#Test              100    100    99     100    98     100    100
#Entity Types      5      14     9      9      7      2      7

Table 2: Public dataset statistics for the SNIPS dataset with its different domains: d1: AddToPlaylist, d2: BookRestaurant, d3: GetWeather, d4: PlayMusic, d5: RateBook, d6: SearchCreativeWork, d7: SearchScreeningEvent
                                  Number of support examples per entity type
Target Dataset     Approach       10        20        50        100       200       500
ATIS               Neigh.Tag.     6.7±0.8   8.8±0.7   11.1±0.7  14.3±0.6  22.1±0.6  33.9±0.6
                   Ours           17.4±1.1  19.8±1.2  22.2±1.1  26.8±2.7  34.5±2.2  40.1±1.0
MIT Movie          Neigh.Tag.     3.1±2     4.5±1.9   4.1±1.1   5.3±0.9   5.4±0.7   8.6±0.8
                   Ours           40.1±1.1  39.5±0.7  40.2±0.7  40.0±0.4  40.0±0.5  39.5±0.7
MIT.Restaurant     Neigh.Tag.     4.2±1.8   3.8±0.8   3.7±0.7   4.6±0.8   5.5±1.1   8.1±0.6
                   Ours           27.6±1.8  29.5±1.0  31.2±0.7  33.7±0.5  34.5±0.4  34.6
Mixed Domain       Neigh.Tag.     4.5±0.8   5.5±0.5   6.7±0.6   8.7±0.4   13.1±0.6  20.3±0.4
                   Ours           16.6±1.1  20.3±0.8  23.5±0.6  27.4±1.2  32.2±1.2  35.9±0.6

Table 4: General domain (OntoNotes 5.0) to specific target domains: ATIS, Movie Review, Restaurant Review, and Mixed Domains (ATIS + MIT.Restaurant.Review as a mixed-distribution dataset); impact of the number of support examples per entity type; comparison of our approach with Neighbor Tagging (Wiseman and Stratos, 2019) in terms of F1-score (mean±standard deviation calculated over 10 random samples)
as a generic domain dataset and evaluate it in a train-free few-shot manner on other domains. Table 5 shows results similar to those obtained when training with OntoNotes 5.0. When we compare Tables 4 and 5, we see that knowledge transfer from OntoNotes 5.0 achieves overall higher performance than from Conll2003. We conjecture this is due to the larger size and larger number of entity types in OntoNotes 5.0, compared to Conll2003.

4.1.5 Knowledge Transfer from a Similar Domain

Another interesting question to answer is how much knowledge we can transfer from one domain to another domain from a similar distribution. In this set of experiments, we simulate a scenario where we combine different training sets coming from similar distributions and then evaluate the model on a similar but unseen target domain. We use the SNIPS dataset, which has seven domains (AddToPlaylist, BookRestaurant, GetWeather, PlayMusic, RateBook, SearchCreativeWork, and SearchScreeningEvent). We train a model on a combined dataset from six domains and evaluate it on the remaining domain. Figure 3 shows the results of these experiments in terms of the total number of support examples. Similar to Tables 4 and 5, we use 10, 20, 50, 100, 200, and 500 examples per entity type, but instead of showing the number of examples per entity type, we show the total number of examples in the figure. For example, the first plot in Figure 3 shows the experiment where we combine the six domains of BookRestaurant, GetWeather, PlayMusic, RateBook, SearchCreativeWork, and SearchScreeningEvent to train a model and then evaluate it on the held-out domain of AddToPlaylist. Based on this figure, we observe that our approach is consistently and significantly better than the baseline when the number of examples per entity is up to 100. In the extreme scenario where the number of examples per entity is 200 or 500, the two methods are comparable.

4.1.6 Performance on Our Proprietary Dataset

Finally, we also evaluate the proposed approach on our proprietary dataset. Figure 4 shows the performance of our approach compared with neighbor tagging for a model trained on the OntoNotes 5.0 dataset and evaluated on our proprietary dataset. Different samples of support examples are created by randomly sampling the entire support set with a different number of examples per entity type (i.e., 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100). Similarly to previous results, our approach achieves consistently better performance.

5 Conclusion

This paper presents a novel technique for train-free few-shot learning for the NER task. This entity-agnostic approach is able to leverage large open-domain NER datasets to learn a generic model and then immediately recognize unseen entities via a few supporting examples, without the need to further fine-tune the model. This brings a dramatic advantage to NER problems requiring quick adaptation and provides a feasible way for business users to easily customize their own business entities. To the best of our knowledge, this is the first work that applies the train-free example-based approach to the NER problem. Compared to the recent SOTA designed for this setting, i.e., the neighbor tagging
                                  Number of support examples per entity type
Target Dataset     Approach       10        20        50        100       200       500
ATIS               Neigh.Tag.     2.4±0.5   3.4±0.6   5.1±0.4   5.7±0.3   6.3±0.3   10.1±0.4
                   Ours           22.9±3.8  16.5±3.3  19.4±1.4  21.9±1.2  26.3±1.1  31.3±0.5
MIT Movie          Neigh.Tag.     0.9±0.3   1.4±0.3   1.7±0.4   2.4±0.2   3.0±0.3   4.8±0.5
                   Ours           29.2±0.6  29.6±0.8  30.4±0.8  30.2±0.6  30.0±0.5  29.6±0.5
MIT.Restaurant     Neigh.Tag.     4.1±1.2   3.6±0.8   4.0±1.1   4.6±0.6   5.6±0.8   7.3±0.5
                   Ours           25.2±1.7  26.1±1.3  26.8±2.3  26.2±0.8  25.7±1.5  25.1±1.1
Mixed Domain       Neigh.Tag.     2.3±0.5   2.9±0.5   4.1±0.6   4.7±0.3   5.4±0.4   7.9±0.4
                   Ours           20.5±1.6  18.6±1.9  20.9±1.2  22.5±0.5  24.7±0.9  27.3±0.5

Table 5: General domain (Conll2003) to specific target domains; impact of the number of support examples per entity type; comparison of our approach with Neighbor Tagging (Wiseman and Stratos, 2019) in terms of F1-score (mean±standard deviation calculated over 10 random samples)
Figure 3: Training on multiple domains and evaluating on a held-out domain: impact of the number of support examples, comparison of our approach with Neighbor Tagging (Wiseman and Stratos, 2019)
References

Oshin Agarwal, Yinfei Yang, Byron C. Wallace, and Ani Nenkova. 2020a. Entity-switched datasets: An approach to auditing the in-domain robustness of named entity recognition models.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Shumin Deng, Ningyu Zhang, Zhanlin Sun, Jiaoyan Chen, and Huajun Chen. 2020. When low resource NLP meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract). In AAAI, pages 13773–13774.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2018. Few-shot classification in named entity recognition task. CoRR, abs/1812.06158.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Chin Lee, Hongliang Dai, Yangqiu Song, and Xin Li. 2020. A Chinese corpus for fine-grained entity typing. arXiv preprint arXiv:2004.08825.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019. A unified MRC framework for named entity recognition.

Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2020. A rigorous study on named entity recognition: Can fine-tuning pretrained model lead to the promised land?

Pierre Lison, Aliaksandr Hubin, Jeremy Barnes, and Samia Touileb. 2020. Named entity recognition without labelled data: A weak supervision approach. arXiv preprint arXiv:2004.14723.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8026–8037. Curran Associates, Inc.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050.

Edwin Simpson, Jonas Pfeiffer, and Iryna Gurevych. 2020. Low resource sequence tagging with weak labels. In AAAI, pages 8862–8869.

Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. CoRR, abs/1703.05175.

Chuanqi Tan, Wei Qiu, Mosha Chen, Rui Wang, and Fei Huang. 2020. Boundary enhanced neural span classification for nested named entity recognition.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5363–5369, Florence, Italy. Association for Computational Linguistics.

Wei Wu, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. 2019. Coreference resolution as query-based span prediction.

Tao Zhang, Congying Xia, Chun-Ta Lu, and Philip Yu. 2020. MZET: Memory augmented zero-shot fine-grained named entity typing.

Shi Zhi, Liyuan Liu, Yu Zhang, Shiyin Wang, Qi Li, Chao Zhang, and Jiawei Han. 2020. Partially-typed NER datasets integration: Connecting practice to theory. arXiv preprint arXiv:2005.00502.
A Appendix

A.1 Related Work

Historically, the NER task was approached as a sequence labeling task with recurrent and convolutional neural networks, and then, with the rise of the Transformer (Vaswani et al., 2017), with pretrained Transformer-based models such as BERT (Devlin et al., 2018). While the boundary is being pushed inch by inch on well-established NER datasets such as CoNLL 2003 (Sang and Meulder, 2003), such large and well-labeled datasets can present an ideal and unrealistic setting. Often, these benchmarks have strong name regularity and high mention coverage, and they contain sufficient training examples for the entity types, thus providing sufficient context diversity (Lin et al., 2020). This is called regular NER. In contrast, open NER does not have such advantages: entity types may not be grammatical, and the training set may not fully cover all test-set mentions.

Robustness studies have been done by swapping out English entities with entities from other countries, such as Ethiopia, Nigeria, and the Philippines, finding drops of up to 10 points F1 and leading to questions of whether current state-of-the-art models trained on standard English corpora are over-optimized, similar to models in the photography domain which have become accustomed to the perfect exposure of white skin (Agarwal et al., 2020a). Other studies by Agarwal et al., 2020b expose that while context representations resulting from trained LSTM-CRF and BERT architectures contribute to system performance, the main factor driving high performance is learning name tokens explicitly, which is a weakness when it comes to open NER, where novel and unseen name tokens run abundant.

To deal with partial entity coverage, one approach is data augmentation. Zhi et al., 2020 compare strategies to combine partially-typed NER datasets with fully-typed ones, demonstrating that models trained with partially-typed annotations can have performance similar to those trained with the same amount of fully-typed annotations. In this theme of lacking datasets, one approach has been to utilize expert knowledge to craft labeling functions in order to create large distantly or weakly supervised datasets. Lison et al., 2020 utilize labeling functions to automatically annotate documents with named-entity labels, followed by a trained hidden Markov model that unifies the labels into a single probabilistic annotation. Another approach is to leverage noisy crowdsourced labels as well as pre-trained models from other domains as part of transfer learning: Simpson et al., 2020 combine pre-trained sequence labelers using Bayesian sequence combination to improve sequence labeling in new domains with few labels or noisy data. To tackle noisy data resulting from distantly supervised methods, Ali et al., 2020 instead use an edge-weighted attentive graph convolution network that attends over corpus-level contextual clues and thus refines noisy mention representations learned over the distantly supervised data, in contrast to simply de-noising the data at the model's input. A new domain can also be a new language: Lee et al., 2020 created a small labeled dataset via crowdsourcing and a large corpus via distant supervision for Chinese NER, leveraging mappings between English and Chinese and finding that pretraining on English data helped improve results on Chinese datasets.

Rather than data augmentation, another approach is few- or zero-shot learning: relying on models pretrained on large datasets combined with a limited amount of support examples in the target domain. More generally for text classification, Deng et al., 2020 disentangle task-agnostic and task-specific feature learning, leveraging large raw corpora via unsupervised learning to first pretrain task-agnostic contextual features, followed by meta-learning for text classification. Snell et al., 2017 also developed the prototypical network for classification scenarios with scarce labeled examples, such that objects of one class are mapped to similar vectors. For the NER task specifically, Fritzler et al., 2018 combined this model architecture with an RNN + CRF to simulate few-shot experiments on the OntoNotes 5.0 dataset.

Zhang et al., 2020 use memory to transfer knowledge of seen coarse-grained entity types to unseen coarse- and fine-grained entity types, additionally incorporating character, word, and context-level information combined with BERT to learn entity representations, as opposed to simply relying on similarity. Their approach relies on a hierarchical structure between the coarse- and fine-grained entity types.

Thinking more about the pretraining portion, given massive text corpora, language models like BERT (Devlin et al., 2018) seem to be able to implicitly store world knowledge in the parameters of the neural network. Guu et al., 2020 use the masked
language model pretraining task from BERT to pretrain prior to fine-tuning on the Open-QA task. Wiseman and Stratos, 2019 also use BERT as a pretrained language model for train-free few-shot learning, using the BIO (beginning-inside-out) format to classify each of the tokens independently while fine-tuning.

Another trend is formulating other NLP tasks as a machine reading comprehension (MRC) problem. For instance, Wu et al., 2019 and Li et al., 2019 have done this for coreference resolution and NER, respectively. Using the MRC framework enables the combination of different datasets from different domains for better fine-tuning. A challenge that reframing NER as an MRC task helps to alleviate is when a token may be assigned several labels. For example, if a token is assigned two overlapping entities, this can be broken out into two independent questions (Li et al., 2019).

In our approach, we utilize the MRC framework to fine-tune a BERT-based language model with a novel sentence-level attention and token-level similarity that is independent of the training data's entity types, achieving a better representation to perform few-shot evaluation in a new domain with superior results.

Meanwhile, this fix can also generalize to similar DSAT as well. This provides an easy way for content editors to improve their system independently and confidently without the involvement of either AI experts or model training.

A.3 Ablation Study

In this section, we take a detailed look at our approach and investigate important factors of our model.

A.3.1 Scoring Strategy

One of the important factors in example-based NER is the scoring strategy, which is used to recognize the target entity type. Above, we explained our main approach as hard attention and provided some benchmark results. Potentially, one could use soft attention in the scoring as well, and we discuss two methods of using soft attention for scoring.

Soft Attention Scoring: This is very similar to hard attention, but instead of using Eq. 15, 16 in Algorithms I and II to calculate the probabilities, we use the following equations:

$P^{start}_i = \sum_{j=1}^{m_E} atten(q_{rep}, s^j_{rep}) \, (q_i \cdot s^j_{start})$ (17)
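A minimal PyTorch sketch of this soft-attention scoring variant (Eq. 17) is shown below; it reuses the sentence-level attention of Eq. 11 over all m_E support examples at inference time instead of the top-K hard attention. Names, shapes, and the analogous end-score are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def soft_attention_scores(q_tokens, s_start, s_end, s_reps, T=1.0):
    """Soft-attention scoring variant (Eq. 17) for one entity type.

    q_tokens:      (Lq, H)  query token embeddings
    s_start/s_end: (mE, H)  start/end entity embeddings per support example
    s_reps:        (mE, H)  sentence representations of the support examples
    """
    # Query sentence representation (Eq. 9) and soft attention (Eq. 11)
    q_rep = q_tokens.sum(dim=0)
    att = F.softmax(
        T * F.cosine_similarity(q_rep.unsqueeze(0), s_reps, dim=-1), dim=0
    )                                                        # (mE,)

    # Attention-weighted sum over all mE support examples (no top-K cut-off).
    p_start = (q_tokens @ s_start.t()) @ att                 # (Lq,), Eq. 17
    p_end = (q_tokens @ s_end.t()) @ att                     # (Lq,), analogous end score
    return p_start, p_end
```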
Heuristic Voting Scoring: In the other scoring methods, we treat multiple examples all at once in an equation with different weights to calculate the score. A heuristic algorithm that we investigated is to treat each of the support examples separately, run span prediction per example per entity, and then use voting among the multiple examples of an entity type to produce the final prediction per entity type. Each support example gets an equal vote. Also, for the span score we use the base token-level similarity to measure the probability, as shown in Eq. 5, 6. Algorithm III shows the voting algorithm.

Algorithm III: Entity Type Recognition - Voting Algorithm
Suppose we have M entity types with m_E support examples per entity type E and a query Q.
for each entity type E in E_1, ..., E_M do:
|  for each support example i of E do:
|  |  get its top n spans [span^i_1, span^i_2, ..., span^i_n] using Algorithm IV
|  get the final prediction per entity as the top voted span of [span^1_1, span^2_1, ..., span^(m_E)_1]

Algorithm IV: Choose top n best spans
start_indexes = get n top indexes from top P(start)
end_indexes = get n top indexes from top P(end)
for start_id in start_indexes do:
|  for end_id in end_indexes do:
|  |  if start_id, end_id in the query & start_id < end_id:
|  |  |  calculate start_prob, end_prob using Eq. 5, 6
|  |  |  add (start_id, end_id, start_prob, end_prob) to the output
sort the spans based on (start_prob + end_prob) to select the top n most probable spans

For example, running Algorithm IV with the second support example e^E_2 of entity type E yields its top n spans:

$PredSpans^Q_{e^E_2} = [span^2_1, span^2_2, ..., span^2_n]$

To finalize the span prediction for an entity type E, we take the vote on the top predicted span from each support example of E, i.e., over [span^1_1, span^2_1, ..., span^(m_E)_1].
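Below is a small Python sketch of this voting strategy (Algorithms III and IV) for one entity type; the span representation, the maximum span length, and the tie-breaking behavior of Counter are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
import torch

def top_n_spans(q_tokens, s_start, s_end, n=3, max_span_len=10):
    """Algorithm IV (sketch): rank (start, end) spans for ONE support example
    using the token-level similarities of Eqs. 5-6."""
    p_start = q_tokens @ s_start          # (Lq,) similarity to the start embedding
    p_end = q_tokens @ s_end              # (Lq,) similarity to the end embedding
    candidates = []
    for i in torch.topk(p_start, min(n, len(p_start))).indices.tolist():
        for j in torch.topk(p_end, min(n, len(p_end))).indices.tolist():
            if i <= j < i + max_span_len:
                candidates.append(((i, j), (p_start[i] + p_end[j]).item()))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [span for span, _ in candidates[:n]]

def vote_entity_span(q_tokens, support_starts, support_ends):
    """Algorithm III (sketch): each support example casts one equal vote for
    its own top-ranked span; the most voted span wins."""
    votes = Counter()
    for s_start, s_end in zip(support_starts, support_ends):
        top = top_n_spans(q_tokens, s_start, s_end, n=1)
        if top:
            votes[top[0]] += 1
    return votes.most_common(1)[0][0] if votes else None
```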
Figure 5: Impact of number of support examples, comparison of different scoring algorithms
splits the words into multiple tokens, and neighbor tagging uses the first token as the word representative and applies word-level prediction). The results were better than our heuristic voting algorithm but not as good as the current token-level scoring structure.
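For reference, here is a minimal sketch, under assumed inputs, of the word-level alternative mentioned above, where each word is represented by its first sub-token (the neighbor-tagging convention); the word_ids mapping follows the usual sub-word tokenizer convention and is an assumption.

```python
import torch

def word_level_embeddings(token_embeddings, word_ids):
    """Represent each word by its FIRST sub-token embedding, the word-level
    convention used by neighbor tagging, as opposed to scoring every sub-token.

    token_embeddings: (L, H) tensor of sub-token embeddings from the encoder
    word_ids: list of length L mapping each sub-token to a word index,
              with None for special tokens such as [CLS] and [SEP]
    """
    first_positions = []
    seen = set()
    for position, word_id in enumerate(word_ids):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            first_positions.append(position)
    # Example: "playing" -> ["play", "##ing"] keeps only the "play" row.
    return token_embeddings[first_positions]  # (num_words, H)
```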