Example-Based Named Entity Recognition
Morteza Ziyadi∗ Yuting Sun∗ Abhishek Goswami Jade Huang Weizhu Chen
Microsoft Dynamics 365 AI
{moziyadi,yusun,agoswami,jade.huang,wzchen}@microsoft.com
approach to adapt to a new domain. But they use all of the examples, which number in the thousands, in the target domain to adapt for their evaluation. This is an unrealistic setting, as in many real-world problems, new domains can have as few as 10 or 20 examples available per entity type. We empirically observe that their approach performs poorly in the scenario of using only a few examples as support.

In this paper, we propose a novel solution to address previous limitations in the train-free and few-shot setting, in which the trained model is directly applied to identify new entities in a new domain without further fine-tuning. The proposed approach is inspired by recent advances in extractive question-answering models (Rajpurkar et al., 2018) and few-shot learning in language modeling (Raffel et al., 2019; Brown et al., 2020). First, we formulate train-free few-shot NER learning as an entity-agnostic span extraction problem with the capability to distinguish sentences with entities from sentences without entities. Our proposed approach is designed to model the correlation between support examples and a query. This way, it can leverage large open-domain NER datasets to train an entity-agnostic model that further captures this correlation, and the trained model can then be used to perform recognition on any new and custom entities. Second, our model applies a novel sentence-level attention to choose the most related examples as support examples to identify new entities. Third, we systematically compare various self-attention-based token-level similarity strategies for entity span detection.

We conduct extensive empirical studies to show that the proposed approach achieves significant and consistent improvements over previous approaches in the train-free few-shot setting. For instance, we train a model on the OntoNotes 5.0 dataset and evaluate it on multiple out-of-domain datasets (ATIS, MIT Movie, and MIT Restaurant Review), showing that the proposed model can achieve a >30% gain in F1-score using only 10 examples per entity type in the target domain as support, in comparison to Wiseman and Stratos, 2019. In addition, we investigate the domain-agnostic properties of different approaches: how much knowledge can be transferred from one domain to another from a similar or different distribution. For instance, in an experiment testing knowledge transfer from a similar domain on the SNIPS dataset, we achieve a 48% gain in F1-score (from 30% with Wiseman and Stratos, 2019 to 78% using our approach) on the SNIPS GetWeather domain using only 10 support examples per entity. For knowledge transfer to a faraway domain, we train a model on OntoNotes 5.0 and run train-free few-shot evaluation on different mixed datasets, achieving significant gains in F1-score. Finally, we perform an ablation study to compare the performance of different training and scoring algorithms for train-free few-shot NER.

2 Train-Free Example-Based NER

In general, the goal of example-based NER is to perform entity recognition for any entity, even one previously unseen during training, after utilizing a few examples of that entity as support. For example, given this example of the entity xbox game, "I purchased a game called NBA 2k 19", where NBA 2k 19 is the entity, the xbox game entity Minecraft is expected to be recognized in the following query: "I cannot play Minecraft with error code 0x111". This simple example demonstrates a single-example, single-entity-type scenario. In the real-world scenarios considered in this paper, there are multiple entity types and a few examples per entity type.

Figure 1 shows an example of example-based entity recognition using a few examples per entity type. In this example, we have two support examples each for the "Game" and "Device" entity types and three support examples for the "Error Code" entity type. The goal is to identify the entities in the query "I cannot play Minecraft with error code 0x111". For each support example for an entity type, we perform span prediction on the query for that entity type and utilize the start/end span scores from each of the predictions to inform the final prediction per entity type. Then, we aggregate the results from different entity types to obtain the final identification of entities in the query. It should be noted that for the aggregation we also consider the span score, which will be explained in detail later.

There are several challenges that our approach faces. One is how to use multiple examples with different entity types to run train-free scoring. One might consider heuristic voting algorithms as an initial approach, but we found that they did not lead to good performance. Another is that we need to fine-tune the language model to obtain a better representation that can be utilized for better train-free inference. And finally, we have to deal with the gap between the training approach and the inference technique.
Figure 1: An example of the example-based NER approach. The prediction per entity type is based on the start/end scores of each prediction. The final prediction is an aggregation of the predictions based on the final span scores. For a non-entity prediction, the predicted span falls on the [CLS] token.
Figure 2: The framework of example-based NER
$sim^{start}_i = q_i \cdot s_{start}$ (5)

$sim^{end}_i = q_i \cdot s_{end}$ (6)

The result of the operations so far is the similarity of each of the query tokens with the start/end of an entity type of a single support example. Ideally, we have multiple support examples (e.g., K) per entity type. To measure the probability of a token in the query being the start/end of the entity type using multiple support examples, we use the following formulas:

$P^{start}_i = \sum_{j=1}^{K} atten(q_{rep}, s^j_{rep}) \, (q_i \cdot s^j_{start})$ (7)

$P^{end}_i = \sum_{j=1}^{K} atten(q_{rep}, s^j_{rep}) \, (q_i \cdot s^j_{end})$ (8)

where q_i is the embedding of token i in the query and s^j_start, s^j_end are the embeddings of the start and end of the entity in support example j. The sentence representations q_rep and s^j_rep are obtained by summing token embeddings:

$q_{rep} = VectorSum_i(q_i)$ (9)

$s^j_{rep} = VectorSum_i(s^j_i)$ (10)

The vector sum of all token embeddings of a sentence represents that sentence, either for a query or a support example. Another important factor in the above equations is the atten function. We use the following soft attention mechanism to measure sentence-level similarity:

$atten(q_{rep}, s^j_{rep}) = Softmax(T \cdot \cos(q_{rep}, s^j_{rep}))$ (11)

where T is a hyper-parameter (the temperature of the Softmax function) and cos is the cosine similarity measure. The atten function measures the sentence-level similarity between the query and the support example. We combine this sentence-level similarity with the token-level similarity to produce the probability of a token being the start/end of an entity type. We utilize these probabilities to fine-tune the language model as described in the following section.
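For illustration, the following is a minimal PyTorch sketch of Eqs. 5-11, combining the token-level similarities with the sentence-level soft attention over K support examples. Tensor names, shapes, and the helper structure are assumptions made for exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def span_start_end_probs(q_tokens, s_start, s_end, s_tokens_list, T=1.0):
    """Sketch of Eqs. 5-11: combine token-level similarity with
    sentence-level soft attention over K support examples.

    q_tokens:      (Lq, H)  query token embeddings q_i
    s_start/s_end: (K, H)   start/end entity embeddings per support example
    s_tokens_list: list of (Lj, H) token embeddings, one per support example
    """
    # Eq. 9: query sentence representation as a vector sum of its tokens
    q_rep = q_tokens.sum(dim=0)                                   # (H,)
    # Eq. 10: one sentence representation per support example
    s_rep = torch.stack([s.sum(dim=0) for s in s_tokens_list])    # (K, H)

    # Eq. 11: soft attention from temperature-scaled cosine similarity
    att = F.softmax(
        T * F.cosine_similarity(q_rep.unsqueeze(0), s_rep, dim=-1), dim=0
    )                                                             # (K,)

    # Eqs. 5-6: token-level similarity of every query token to each
    # support example's entity start/end embedding
    sim_start = q_tokens @ s_start.t()                            # (Lq, K)
    sim_end = q_tokens @ s_end.t()                                # (Lq, K)

    # Eqs. 7-8: attention-weighted sum over the K support examples
    p_start = sim_start @ att                                     # (Lq,)
    p_end = sim_end @ att                                         # (Lq,)
    return p_start, p_end
```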
2.2 Training

Our approach focuses on fine-tuning the BERT language model in such a way that the contextually encoded text can be used directly for entity recognition. We initialize the language model with pre-trained BERT and then pass minibatches of data to minimize the loss function. Our loss function is the average of the span-start prediction loss and the span-end prediction loss using cross-entropy, as follows:

$L_{start} = -\sum_{t=1}^{k} \big( y^{start}_t \log P^{start}_t + (1 - y^{start}_t) \log(1 - P^{start}_t) \big)$ (12)

$L_{end} = -\sum_{t=1}^{k} \big( y^{end}_t \log P^{end}_t + (1 - y^{end}_t) \log(1 - P^{end}_t) \big)$ (13)

$Loss = (L_{start} + L_{end}) / 2$ (14)

in which k is the length of the query; y_t^start, y_t^end indicate whether token t in the query is the start or end of the entity span, respectively; and P_t^start, P_t^end correspond to the probabilities that we calculated during span detection.

While preparing the data, we add <e>, </e> tokens to the support examples around the entities. We convert a sentence with multiple entities into multiple examples such that each of the support examples includes exactly one entity. Another important consideration is to construct negative examples as well as positive examples. For instance, for the query "This is Los Angeles", where "Los Angeles" is a city entity, the support example "<e> New York </e> is a big city." is considered a positive example since it contains the same entity type, city. But the support example "Tomorrow, I have an appointment at <e> 2 pm </e>." is considered a negative example for the query, as it contains a different entity type (in this case, date). In other words, we treat negative examples as no-entity in Eq. 13. For each training data point, we construct pairs of query and positive/negative support examples to train the model. It should be noted that the ratio of positive to negative examples is an important factor. Furthermore, we use multiple examples per entity type when calculating the aforementioned probabilities in Eq. 7 and Eq. 8.

2.3 Entity Type Recognition: Scoring

The second part of our proposed algorithm concerns assigning an entity type to the potential detected spans. One of the approaches used in the literature (Tan et al., 2020) is to use a softmax layer on top of a multi-layer perceptron (MLP) classifier. This type of structure brings limitations to train-free few-shot learning when trying to recognize unseen entity types. Similar to our training schema, for scoring we instead measure the probability of each of the query tokens being the start and/or end of an entity type using the examples for that entity type. Let us say that we have M entity types, with m_E support examples for entity type E, and a query q. For each entity type E in E_1, ..., E_M, we predict a span with a corresponding score, similarly as in training. For each of the tokens in the query, we calculate the scores of being the start (P_i^start) and end (P_i^end) of a span as:

$P^{start}_i = \sum_{topK} (q_i \cdot s^j_{start})$ (15)

$P^{end}_i = \sum_{topK} (q_i \cdot s^j_{end})$ (16)

It should be noted that these scores are very similar to the probability measures that we had during training but differ when it comes to attention (i.e., Eq. 7, 8). During training, we used a soft attention schema with a fixed number of examples per entity type (i.e., K). At inference time, we use hard attention, since we have a varying number of support examples per entity type. For the hard attention, we first calculate the token-level similarity measure $(q_i \cdot s^j_{start/end})$ for each of the entities in the support examples, then pick the top K ones with the highest measure, and then sum their similarity scores with an equal attention of 1 on these examples to calculate the start/end probability. In other words, we sum the token-level similarity of the top K support examples with the highest $(q_i \cdot s^j_{start/end})$ for the probability calculation. Then, for each potential span, we calculate the total span score as the summation of the start and end scores, i.e., score(span) = P^start + P^end, where P^start and P^end are calculated using the above equations (Eq. 15, 16). After obtaining all the potential spans, we select the one with the highest score as the final span for that specific entity type, treating span prediction similarly to question-answering (QA) models.
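To make the hard-attention scoring of Eqs. 15-16 and the QA-style span selection concrete, here is a minimal PyTorch sketch for a single entity type. The function name, the maximum span length, and the tensor layout are illustrative assumptions rather than the paper's implementation.

```python
import torch

def score_spans_for_entity(q_tokens, s_start, s_end, top_k=5, max_span_len=10):
    """Hard-attention scoring (Eqs. 15-16) for one entity type.

    q_tokens:      (Lq, H) query token embeddings
    s_start/s_end: (M, H)  start/end entity embeddings from the M support
                           examples of this entity type
    Returns a list of (start, end, score) candidates, best first.
    """
    k = min(top_k, s_start.size(0))
    # Token-level similarities to every support example: (Lq, M)
    sim_start = q_tokens @ s_start.t()
    sim_end = q_tokens @ s_end.t()

    # Eqs. 15-16: sum the top-K similarities per query token
    # (equal attention of 1 on the K most similar support examples).
    p_start = sim_start.topk(k, dim=1).values.sum(dim=1)   # (Lq,)
    p_end = sim_end.topk(k, dim=1).values.sum(dim=1)       # (Lq,)

    # Enumerate candidate spans and rank them by P_start + P_end,
    # as in QA-style span prediction.
    spans = []
    for i in range(len(p_start)):
        for j in range(i, min(i + max_span_len, len(p_end))):
            spans.append((i, j, (p_start[i] + p_end[j]).item()))
    return sorted(spans, key=lambda s: s[2], reverse=True)
```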
So far, we have finalized the spans for each of the entity types. Algorithm I shows the top-span prediction per entity type. We should highlight that, similar to the QA framework, if the predicted span's start and end occur on the [CLS] token, we treat it as no span for that entity type in the query.

Algorithm I: Top span prediction per entity type
start_indexes, end_indexes = range(len(query(q)))
for start_id in start_indexes do:
|  for end_id in end_indexes do:
|  |  if start_id, end_id in the query & start_id < end_id:
|  |  |  calculate P_start, P_end using Eq. 15, 16
|  |  |  add (start_id, end_id, P_start, P_end) to the output
sort the spans based on (P_start + P_end) to select the most probable span

To merge the spans from different entity type predictions, we use their scores to sort them, remove the overlaps, and obtain the final predictions. Algorithm II shows the overall scoring functionality.

Algorithm II: Entity Type Recognition - Hard Attention Algorithm
Suppose we have M entity types with m_E support examples per entity type E and a query Q.
for each entity type E in E_1, ..., E_M do:
|  get the span prediction for E using Algorithm I
aggregate all the predictions per entity type and sort them based on span score.
remove overlaps: select the top-score span, then search for the next-best span without any overlap with the first one, and continue this for all the predicted spans.

Aside from the hard attention used in our scoring method, we also experimented with other scoring methods, which are explained in the ablation study.

3 Experimental Setup

3.1 Dataset Preparation

We test our proposed approach on different publicly available benchmark datasets, with statistics in Tables 1 and 2. The datasets are OntoNotes 5.0 [1], Conll2003 [2], ATIS [3], MIT Movie and Restaurant Review [4], and SNIPS [5]. Additionally, we use a proprietary dataset (Table 3) as a test set in some of the experiments.

We first fine-tune the language model on source datasets and then run train-free few-shot evaluation on a target domain with unseen entity types. During the fine-tuning, we use all the training data from the source datasets. When predicting an unseen entity in the target domain, we sample a subset of instances from the target domain's training data as the support set. We run the sampling multiple times and report the metric with mean and standard deviation in the experimental results. Note that none of the examples or entity types of the target dataset are seen in the fine-tuning step.

[1] from: https://github.com/swiseman/neighbor-tagging
[2] from: https://github.com/swiseman/neighbor-tagging
[3] from: https://github.com/yvchen/JointSLU
[4] from: https://groups.csail.mit.edu/sls/downloads/
[5] from: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines

3.2 Training and Evaluation Details

We use the PyTorch framework (Paszke et al., 2019) for all of our experiments. To train our models, we use the AdamW optimizer with a learning rate of 5e-5, an Adam epsilon of 1e-8, and a weight decay of 0.0. The maximum sentence length is set to 384 tokens, and K, the number of examples per entity type for training and hard-attention scoring, is 5. For training, T, the temperature of the attention, is 1. We use the BERT-base-uncased model to initialize our language model. For evaluation, we report the precision, recall, and F1 scores for all entities in the test sets based on exact matching of entity spans.

4 Experimental Results

4.1 Benchmark Study

4.1.1 Knowledge Transfer from a General Domain to Specific Domains

In this section, we run a benchmark study to investigate the performance of our approach. In the first set of experiments, we investigate how we can transfer knowledge from a general domain to specific domains. We train a model on a general source domain and then immediately evaluate the model on the target domains with support examples. Figure 4 shows the results of training a model on the OntoNotes 5.0 dataset as a generic domain and evaluating it on other target domains (i.e., ATIS,
Dataset            OntoNotes 5.0   ATIS   Movie.Review   Restaurant.Review   Conll2003
#Train (Support)   59.9k           4.6k   7.8k           7.6k                12.7k
#Test              7.8k            850    2k             1.5k                3.2k
#Entity Types      18              79     12             8                   4

Table 1: Public dataset statistics for OntoNotes 5.0, ATIS, Movie.Review, Restaurant.Review, and Conll2003
                   SNIPS
Dataset            d1     d2     d3     d4     d5     d6     d7
#Train (Support)   2k     2k     2k     2k     2k     2k     1.8k
#Test              100    100    99     100    98     100    100
#Entity Types      5      14     9      9      7      2      7

Table 2: Public dataset statistics for the SNIPS dataset with its different domains: d1: AddToPlaylist, d2: BookRestaurant, d3: GetWeather, d4: PlayMusic, d5: RateBook, d6: SearchCreativeWork, d7: SearchScreeningEvent
                                  Number of support examples per entity type
Target Dataset     Approach       10        20        50        100       200       500
ATIS               Neigh.Tag.     6.7±0.8   8.8±0.7   11.1±0.7  14.3±0.6  22.1±0.6  33.9±0.6
                   Ours           17.4±1.1  19.8±1.2  22.2±1.1  26.8±2.7  34.5±2.2  40.1±1.0
MIT Movie          Neigh.Tag.     3.1±2     4.5±1.9   4.1±1.1   5.3±0.9   5.4±0.7   8.6±0.8
                   Ours           40.1±1.1  39.5±0.7  40.2±0.7  40.0±0.4  40.0±0.5  39.5±0.7
MIT.Restaurant     Neigh.Tag.     4.2±1.8   3.8±0.8   3.7±0.7   4.6±0.8   5.5±1.1   8.1±0.6
                   Ours           27.6±1.8  29.5±1.0  31.2±0.7  33.7±0.5  34.5±0.4  34.6
Mixed Domain       Neigh.Tag.     4.5±0.8   5.5±0.5   6.7±0.6   8.7±0.4   13.1±0.6  20.3±0.4
                   Ours           16.6±1.1  20.3±0.8  23.5±0.6  27.4±1.2  32.2±1.2  35.9±0.6

Table 4: General domain (OntoNotes 5.0) to specific target domains: ATIS, Movie Review, Restaurant Review, and Mixed Domains (ATIS + MIT.Restaurant.Review as a mixed-distribution dataset); impact of the number of support examples per entity type; comparison of our approach with Neighbor Tagging (Wiseman and Stratos, 2019) in terms of F1-score (mean±standard deviation calculated over 10 random samples)
as a generic domain dataset and evaluate it in a train-free few-shot manner on other domains. Table 5 shows results similar to those obtained when training with OntoNotes 5.0. When we compare Tables 4 and 5, we see that knowledge transfer from OntoNotes 5.0 achieves overall higher performance than from Conll2003. We conjecture this is due to the larger size and larger number of entity types in OntoNotes 5.0, compared to Conll2003.

4.1.5 Knowledge Transfer from a Similar Domain

Another interesting question to answer is how much knowledge we can transfer from one domain to another domain from a similar distribution. In this set of experiments, we simulate a scenario where we combine different training sets coming from similar distributions and then evaluate the model on a similar but unseen target domain. We use the SNIPS dataset, which has seven domains (AddToPlaylist, BookRestaurant, GetWeather, PlayMusic, RateBook, SearchCreativeWork, and SearchScreeningEvent). We train a model on a combined dataset from six domains and evaluate it on the remaining domain. Figure 3 shows the results of these experiments in terms of the total number of support examples. Similar to Tables 4 and 5, we use 10, 20, 50, 100, 200, and 500 examples per entity type, but instead of showing the number of examples per entity type, we show the total number of examples in the figure. For example, the first plot in Figure 3 shows the experiment where we combine the six domains of BookRestaurant, GetWeather, PlayMusic, RateBook, SearchCreativeWork, and SearchScreeningEvent to train a model and then evaluate it on the held-out domain of AddToPlaylist. Based on this figure, we observe that our approach is consistently and significantly better than the baseline when the number of examples per entity is up to 100. In the extreme scenario where the number of examples per entity is 200 or 500, the two methods are comparable.

4.1.6 Performance on Our Proprietary Dataset

Finally, we also evaluate the proposed approach on our proprietary dataset. Figure 4 shows the performance of our approach compared with neighbor tagging for a model trained on the OntoNotes 5.0 dataset and evaluated on our proprietary dataset. Different samples of support examples are created by randomly sampling the entire support set with a different number of examples per entity type (i.e., 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100). Similarly to previous results, our approach achieves consistently better performance.

5 Conclusion

This paper presents a novel technique for train-free few-shot learning for the NER task. This entity-agnostic approach is able to leverage large open-domain NER datasets to learn a generic model and then immediately recognize unseen entities via a few supporting examples, without the need to further fine-tune the model. This brings a dramatic advantage to NER problems requiring quick adaptation and provides a feasible way for business users to easily customize their own business entities. To the best of our knowledge, this is the first work that applies the train-free example-based approach to the NER problem. Compared to the recent SOTA designed for this setting, i.e., the neighbor tagging
                                  Number of support examples per entity type
Target Dataset     Approach       10        20        50        100       200       500
ATIS               Neigh.Tag.     2.4±0.5   3.4±0.6   5.1±0.4   5.7±0.3   6.3±0.3   10.1±0.4
                   Ours           22.9±3.8  16.5±3.3  19.4±1.4  21.9±1.2  26.3±1.1  31.3±0.5
MIT Movie          Neigh.Tag.     0.9±0.3   1.4±0.3   1.7±0.4   2.4±0.2   3.0±0.3   4.8±0.5
                   Ours           29.2±0.6  29.6±0.8  30.4±0.8  30.2±0.6  30.0±0.5  29.6±0.5
MIT.Restaurant     Neigh.Tag.     4.1±1.2   3.6±0.8   4.0±1.1   4.6±0.6   5.6±0.8   7.3±0.5
                   Ours           25.2±1.7  26.1±1.3  26.8±2.3  26.2±0.8  25.7±1.5  25.1±1.1
Mixed Domain       Neigh.Tag.     2.3±0.5   2.9±0.5   4.1±0.6   4.7±0.3   5.4±0.4   7.9±0.4
                   Ours           20.5±1.6  18.6±1.9  20.9±1.2  22.5±0.5  24.7±0.9  27.3±0.5

Table 5: General domain (Conll2003) to specific target domains; impact of the number of support examples per entity type; comparison of our approach with Neighbor Tagging (Wiseman and Stratos, 2019) in terms of F1-score (mean±standard deviation calculated over 10 random samples)
Figure 3: Training on multiple domains and evaluating on a held-out domain: impact of the number of support examples, comparison of our approach with Neighbor Tagging (Wiseman and Stratos, 2019)
References

Oshin Agarwal, Yinfei Yang, Byron C. Wallace, and Ani Nenkova. 2020a. Entity-switched datasets: An approach to auditing the in-domain robustness of named entity recognition models.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Shumin Deng, Ningyu Zhang, Zhanlin Sun, Jiaoyan Chen, and Huajun Chen. 2020. When low resource NLP meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract). In AAAI, pages 13773–13774.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2018. Few-shot classification in named entity recognition task. CoRR, abs/1812.06158.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Chin Lee, Hongliang Dai, Yangqiu Song, and Xin Li. 2020. A Chinese corpus for fine-grained entity typing. arXiv preprint arXiv:2004.08825.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019. A unified MRC framework for named entity recognition.

Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2020. A rigorous study on named entity recognition: Can fine-tuning pretrained model lead to the promised land?

Pierre Lison, Aliaksandr Hubin, Jeremy Barnes, and Samia Touileb. 2020. Named entity recognition without labelled data: A weak supervision approach. arXiv preprint arXiv:2004.14723.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8026–8037. Curran Associates, Inc.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050.

Edwin Simpson, Jonas Pfeiffer, and Iryna Gurevych. 2020. Low resource sequence tagging with weak labels. In AAAI, pages 8862–8869.

Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. CoRR, abs/1703.05175.

Chuanqi Tan, Wei Qiu, Mosha Chen, Rui Wang, and Fei Huang. 2020. Boundary enhanced neural span classification for nested named entity recognition.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5363–5369, Florence, Italy. Association for Computational Linguistics.

Wei Wu, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. 2019. Coreference resolution as query-based span prediction.

Tao Zhang, Congying Xia, Chun-Ta Lu, and Philip Yu. 2020. MZET: Memory augmented zero-shot fine-grained named entity typing.

Shi Zhi, Liyuan Liu, Yu Zhang, Shiyin Wang, Qi Li, Chao Zhang, and Jiawei Han. 2020. Partially-typed NER datasets integration: Connecting practice to theory. arXiv preprint arXiv:2005.00502.
A Appendix

A.1 Related Work

Historically, the NER task was approached as a sequence labeling task with recurrent and convolutional neural networks, and then, with the rise of the Transformer (Vaswani et al., 2017), with pretrained Transformer-based models such as BERT (Devlin et al., 2018). While the boundary is being pushed inch by inch on well-established NER datasets such as CoNLL 2003 (Sang and Meulder, 2003), such large and well-labeled datasets can present an ideal and unrealistic setting. Often, these benchmarks have strong name regularity and high mention coverage, and they contain sufficient training examples for the entity types, thus providing sufficient context diversity (Lin et al., 2020). This is called regular NER. In contrast, open NER does not have such advantages: entity types may not be grammatical, and the training set may not fully cover all test-set mentions.

Robustness studies have been done by swapping out English entities with entities from other countries, such as Ethiopia, Nigeria, and the Philippines, finding drops of up to 10 points F1 and leading to questions of whether current state-of-the-art models trained on standard English corpora are over-optimized, similar to models in the photography domain which have become accustomed to the perfect exposure of white skin (Agarwal et al., 2020a). Other studies by Agarwal et al., 2020b expose that while context representations resulting from trained LSTM-CRF and BERT architectures contribute to system performance, the main factor driving high performance is learning name tokens explicitly, which is a weakness when it comes to open NER, where novel and unseen name tokens run abundant.

To deal with partial entity coverage, one approach is data augmentation. Zhi et al., 2020 compare strategies to combine partially-typed NER datasets with fully-typed ones, demonstrating that models trained with partially-typed annotations can have performance similar to those trained with the same amount of fully-typed annotations. In this theme of lacking datasets, one approach has been to utilize expert knowledge to craft labeling functions in order to create large distantly or weakly supervised datasets. Lison et al., 2020 utilize labeling functions to automatically annotate documents with named-entity labels, followed by a trained hidden Markov model that unifies the labels into a single probabilistic annotation. Another approach is to leverage noisy crowdsourced labels as well as pre-trained models from other domains as part of transfer learning: Simpson et al., 2020 combine pre-trained sequence labelers using Bayesian sequence combination to improve sequence labeling in new domains with few labels or noisy data. To tackle noisy data resulting from distantly supervised methods, Ali et al., 2020 instead use an edge-weighted attentive graph convolution network that attends over corpus-level contextual clues and thus refines noisy mention representations learned over the distantly supervised data, in contrast to simply de-noising the data at the model's input. A new domain can also be a new language: Lee et al., 2020 created a small labeled dataset via crowdsourcing and a large corpus via distant supervision for Chinese NER, leveraging mappings between English and Chinese and finding that pretraining on English data helped improve results on Chinese datasets.

Rather than data augmentation, another approach is few- or zero-shot learning: relying on models pretrained on large datasets combined with a limited amount of support examples in the target domain. More generally for text classification, Deng et al., 2020 disentangle task-agnostic and task-specific feature learning, leveraging large raw corpora via unsupervised learning to first pretrain task-agnostic contextual features, followed by meta-learning for text classification. Snell et al., 2017 also developed the prototypical network for classification scenarios with scarce labeled examples, such that objects of one class are mapped to similar vectors. For the NER task specifically, Fritzler et al., 2018 combined this model architecture with an RNN + CRF to simulate few-shot experiments on the OntoNotes 5.0 dataset.

Zhang et al., 2020 use memory to transfer knowledge of seen coarse-grained entity types to unseen coarse- and fine-grained entity types, additionally incorporating character, word, and context-level information combined with BERT to learn entity representations, as opposed to simply relying on similarity. Their approach relies on a hierarchical structure between the coarse- and fine-grained entity types.

Thinking more about the pretraining portion, given massive text corpora, language models like BERT (Devlin et al., 2018) seem to be able to implicitly store world knowledge in the parameters of the neural network. Guu et al., 2020 use the masked
language model pretraining task from BERT to pretrain prior to fine-tuning on the Open-QA task. Wiseman and Stratos, 2019 also use BERT as a pretrained language model for train-free few-shot learning, using the BIO (beginning-inside-out) format to classify each of the tokens independently while fine-tuning.

Another trend is formulating other NLP tasks as a machine reading comprehension (MRC) problem. For instance, Wu et al., 2019 and Li et al., 2019 have done this for coreference resolution and NER, respectively. Using the MRC framework enables the combination of different datasets from different domains for better fine-tuning. A challenge that reframing NER as an MRC task helps to alleviate is when a token may be assigned several labels. For example, if a token is assigned two overlapping entities, this can be broken out into two independent questions (Li et al., 2019).

In our approach, we utilize the MRC framework to fine-tune a BERT-based language model with a novel sentence-level attention and token-level similarity that is independent of the training data's entity types, achieving a better representation to perform few-shot evaluation in a new domain with superior results.

Meanwhile, this fix can also generalize to similar DSAT as well. This provides an easy way for content editors to improve their system independently and confidently without the involvement of either AI experts or model training.

A.3 Ablation Study

In this section, we take a detailed look at our approach and investigate important factors of our model.

A.3.1 Scoring Strategy

One of the important factors in example-based NER is the scoring strategy, which is used to recognize the target entity type. Above, we explained our main approach as hard attention and provided some benchmark results. Potentially, one could use soft attention in the scoring as well, and we discuss two methods of using soft attention for scoring.

Soft Attention Scoring: This is very similar to hard attention, but instead of using Eq. 15, 16 in Algorithms I and II to calculate the probabilities, we use the following equations:

$P^{start}_i = \sum_{j=1}^{m_E} atten(q_{rep}, s^j_{rep}) \, (q_i \cdot s^j_{start})$ (17)
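A minimal PyTorch sketch of this soft-attention scoring variant (Eq. 17) is shown below; it reuses the sentence-level attention of Eq. 11 over all m_E support examples at inference time instead of the top-K hard attention. Names, shapes, and the analogous end-score are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def soft_attention_scores(q_tokens, s_start, s_end, s_reps, T=1.0):
    """Soft-attention scoring variant (Eq. 17) for one entity type.

    q_tokens:      (Lq, H)  query token embeddings
    s_start/s_end: (mE, H)  start/end entity embeddings per support example
    s_reps:        (mE, H)  sentence representations of the support examples
    """
    # Query sentence representation (Eq. 9) and soft attention (Eq. 11)
    q_rep = q_tokens.sum(dim=0)
    att = F.softmax(
        T * F.cosine_similarity(q_rep.unsqueeze(0), s_reps, dim=-1), dim=0
    )                                                        # (mE,)

    # Attention-weighted sum over all mE support examples (no top-K cut-off).
    p_start = (q_tokens @ s_start.t()) @ att                 # (Lq,), Eq. 17
    p_end = (q_tokens @ s_end.t()) @ att                     # (Lq,), analogous end score
    return p_start, p_end
```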
Heuristic Voting Scoring: In the other scoring methods, we treat multiple examples all at once in an equation with different weights to calculate the score. A heuristic algorithm that we investigated is to treat each of the support examples separately, run span prediction per example per entity, and then use voting among the multiple examples of an entity type to produce the final prediction per entity type. Each support example gets an equal vote. Also, for the span score we use the base token-level similarity to measure the probability, as shown in Eq. 5, 6. Algorithm III shows the voting algorithm.

Algorithm III: Entity Type Recognition - Voting Algorithm
Suppose we have M entity types with m_E support examples per entity type E and a query Q.
for each entity type E in E_1, ..., E_M do:
|  for each support example i of E do:
|  |  get its top n spans [span^i_1, span^i_2, ..., span^i_n] using Algorithm IV
|  get the final prediction per entity as the top voted span of [span^1_1, span^2_1, ..., span^(m_E)_1]

Algorithm IV: Choose top n best spans
start_indexes = get n top indexes from top P(start)
end_indexes = get n top indexes from top P(end)
for start_id in start_indexes do:
|  for end_id in end_indexes do:
|  |  if start_id, end_id in the query & start_id < end_id:
|  |  |  calculate start_prob, end_prob using Eq. 5, 6
|  |  |  add (start_id, end_id, start_prob, end_prob) to the output
sort the spans based on (start_prob + end_prob) to select the top n most probable spans

For example, running Algorithm IV with the second support example e^E_2 of entity type E yields its top n spans:

$PredSpans^Q_{e^E_2} = [span^2_1, span^2_2, ..., span^2_n]$

To finalize the span prediction for an entity type E, we take the vote on the top predicted span from each support example of E, i.e., over [span^1_1, span^2_1, ..., span^(m_E)_1].
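Below is a small Python sketch of this voting strategy (Algorithms III and IV) for one entity type; the span representation, the maximum span length, and the tie-breaking behavior of Counter are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
import torch

def top_n_spans(q_tokens, s_start, s_end, n=3, max_span_len=10):
    """Algorithm IV (sketch): rank (start, end) spans for ONE support example
    using the token-level similarities of Eqs. 5-6."""
    p_start = q_tokens @ s_start          # (Lq,) similarity to the start embedding
    p_end = q_tokens @ s_end              # (Lq,) similarity to the end embedding
    candidates = []
    for i in torch.topk(p_start, min(n, len(p_start))).indices.tolist():
        for j in torch.topk(p_end, min(n, len(p_end))).indices.tolist():
            if i <= j < i + max_span_len:
                candidates.append(((i, j), (p_start[i] + p_end[j]).item()))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [span for span, _ in candidates[:n]]

def vote_entity_span(q_tokens, support_starts, support_ends):
    """Algorithm III (sketch): each support example casts one equal vote for
    its own top-ranked span; the most voted span wins."""
    votes = Counter()
    for s_start, s_end in zip(support_starts, support_ends):
        top = top_n_spans(q_tokens, s_start, s_end, n=1)
        if top:
            votes[top[0]] += 1
    return votes.most_common(1)[0][0] if votes else None
```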
Figure 5: Impact of number of support examples, comparison of different scoring algorithms
splits the words into multiple tokens, and neighbor tagging uses the first token as the word representative and applies word-level prediction). The results were better than our heuristic voting algorithm but not as good as the current token-level scoring structure.
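For reference, here is a minimal sketch, under assumed inputs, of the word-level alternative mentioned above, where each word is represented by its first sub-token (the neighbor-tagging convention); the word_ids mapping follows the usual sub-word tokenizer convention and is an assumption.

```python
import torch

def word_level_embeddings(token_embeddings, word_ids):
    """Represent each word by its FIRST sub-token embedding, the word-level
    convention used by neighbor tagging, as opposed to scoring every sub-token.

    token_embeddings: (L, H) tensor of sub-token embeddings from the encoder
    word_ids: list of length L mapping each sub-token to a word index,
              with None for special tokens such as [CLS] and [SEP]
    """
    first_positions = []
    seen = set()
    for position, word_id in enumerate(word_ids):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            first_positions.append(position)
    # Example: "playing" -> ["play", "##ing"] keeps only the "play" row.
    return token_embeddings[first_positions]  # (num_words, H)
```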