
Exploring Pre-trained Language Models for Event Extraction and Generation

Sen Yang†, Dawei Feng†, Linbo Qiao, Zhigang Kan, Dongsheng Li‡
National University of Defense Technology, Changsha, China
{sen yang,linbo.qiao,kanzhigang13}@nudt.edu.cn
[email protected], [email protected]

† These two authors contributed equally. ‡ Corresponding author.

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5284–5294, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics.

Abstract

Traditional approaches to the task of ACE event extraction usually depend on manually annotated data, which is laborious to create and limited in size. Therefore, in addition to the difficulty of event extraction itself, insufficient training data hinders the learning process as well. To promote event extraction, we first propose an event extraction model that overcomes the roles overlap problem by separating the argument predictions in terms of roles. Moreover, to address the problem of insufficient training data, we propose a method that automatically generates labeled data by editing prototypes and screens out generated samples by ranking their quality. Experiments on the ACE2005 dataset demonstrate that our extraction model surpasses most existing extraction methods. Besides, incorporating our generation method yields further significant improvement. It obtains new state-of-the-art results on the event extraction task, including pushing the F1 score of trigger classification to 81.1% and the F1 score of argument classification to 58.9%.

[Figure 1: An event of type Meet is highlighted in the sentence "President Bush is going to be meeting with several Arab leaders", including one trigger ("meeting") and two arguments ("President Bush" and "several Arab leaders").]

1 Introduction

Event extraction is a key and challenging task for many NLP applications. It aims to detect event triggers and arguments. Figure 1 illustrates a sentence containing an event of type Meet triggered by "meeting", with two arguments, "President Bush" and "several Arab leaders", both of which play the role "Entity".

There are two interesting issues in event extraction that require more effort. On the one hand, roles in an event vary greatly in frequency (Figure 2), and they can overlap on some words, even sharing the same argument (the roles overlap problem). For example, in the sentence "The explosion killed the bomber and three shoppers", "killed" triggers an Attack event, while the argument "the bomber" plays the role "Attacker" as well as the role "Victim" at the same time. About 10% of the events in the ACE2005 dataset (Doddington et al., 2004) have the roles overlap problem. However, despite this evidence, little attention has been paid to it. On the contrary, it is often simplified in the evaluation settings of many approaches: in most previous work, if an argument plays multiple roles in an event simultaneously, the model is counted as correct as long as its prediction hits any one of them, which is far from accurate enough for real-world application. Therefore, we design an effective mechanism to solve this problem and adopt more rigorous evaluation criteria in our experiments.

On the other hand, most deep learning based methods for event extraction so far follow the supervised-learning paradigm, which requires a large amount of labeled training data. However, annotating large amounts of data accurately is a very laborious task. To alleviate the deficiency of predefined event data, event generation approaches are often used to produce additional events for training (Yang et al., 2018; Zeng et al., 2018; Chen et al., 2017), and distant supervision (Mintz et al., 2009) is a commonly used technique for labeling external corpora.
But the quality and quantity of events generated with distant supervision are highly dependent on the source data. In fact, external corpora can also be exploited by pre-trained language models to generate sentences. Therefore, we turn to pre-trained language models, attempting to leverage the knowledge they have learned from large-scale corpora for event generation.

[Figure 2: Frequency of roles that appear in events of type Injure in the ACE2005 dataset. The bar chart shows frequencies between 0 and 0.7 for the roles Victim, Place, Agent, Instrument and Time.]

Specifically, this paper proposes a framework based on pre-trained language models, which includes an event extraction model as our baseline and a labeled event generation method. Our proposed event extraction model consists of a trigger extractor and an argument extractor, the latter of which refers to the result of the former for inference. In addition, we improve the performance of the argument extractor by re-weighting the loss function based on the importance of roles.

Pre-trained language models have also been applied to generating labeled data. Inspired by the work of Guu et al. (2018), we take existing samples as prototypes for event generation, which involves two key steps: argument replacement and adjunct token rewriting. By scoring the quality of the generated samples, we can pick out those of high quality. Incorporating them with the existing data can further improve the performance of our event extractor.

2 Related work

Event Extraction: In terms of analysis granularity, there are document-level event extraction (Yang et al., 2018) and sentence-level event extraction (Zeng et al., 2018). We focus on the statistical methods of the latter in this paper. These methods can be further divided into two categories: feature based ones (Liao and Grishman, 2010; Liu et al., 2010; Miwa et al., 2009; Liu et al., 2016; Hong et al., 2011; Li et al., 2013b), which rely on designed features for extraction, and neural based ones, which take advantage of neural networks to learn features automatically (Chen et al., 2015; Nguyen and Grishman, 2015; Feng et al., 2016).

Event Generation: External resources such as Freebase, FrameNet and WordNet are commonly employed to generate events and enrich the training data. Several previous event generation approaches (Chen et al., 2017; Zeng et al., 2018) rest on a strong assumption of distant supervision¹ to label events in unsupervised corpora; but in fact, co-occurring entities may have none of the expected relationships. In addition, Huang et al. (2016) incorporate abstract meaning representation and distributional semantics to extract events, while Liu et al. (2016, 2017) mine additional events from the frames in FrameNet.

¹ If two entities have a relationship in a knowledge base, then all sentences that mention these two entities are assumed to express that relationship.

Pre-trained Language Model: Pre-trained language models are capable of capturing the meaning of words dynamically, taking their context into consideration. McCann et al. (2017) exploit a language model pre-trained on a supervised translation corpus in the target task. ELMo (Embeddings from Language Models) (Peters et al., 2018) obtains context-sensitive embeddings by encoding characters with stacked bidirectional LSTMs (Long Short-Term Memory) and a residual structure (He et al., 2016). Howard and Ruder (2018) obtain comparable results on text classification. GPT (Generative Pre-Training) (Radford et al., 2018) improves the state of the art on 9 of 12 tasks. BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) breaks the records of 11 NLP tasks and has received a lot of attention.

3 Extraction Model

This section describes our approach to extracting events that occur in plain text. We consider event extraction a two-stage task, consisting of trigger extraction and argument extraction, and propose a Pre-trained Language Model based Event Extractor (PLMEE). Figure 3 illustrates the architecture of PLMEE. It consists of a trigger extractor and an argument extractor, both of which rely on the feature representation of BERT.

3.1 Trigger Extractor

The trigger extractor aims to predict whether a token triggers an event. We therefore formulate trigger extraction as a token-level classification task whose labels are event types, and simply add a multi-class classifier on top of BERT to build the trigger extractor.
[Figure 3: Illustration of the PLMEE architecture, including a trigger extractor and an argument extractor, both built on BERT with WordPiece, segment and position embeddings. The processing of an event instance of type Conflict.Attack triggered by the word "killed" is also shown: for each role (e.g. Attacker, Victim, Place), a pair of start/end classifiers predicts argument spans, and the loss is re-weighted by role importance.]

The input of the trigger extractor follows that of BERT, i.e. the sum of three types of embeddings: WordPiece embeddings (Wu et al., 2016), position embeddings and segment embeddings. Since the input contains only one sentence, all its segment ids are set to zero. In addition, the tokens [CLS] and [SEP]² are placed at the start and end of the sentence.

In many cases, the trigger is a phrase. Therefore, we treat consecutive tokens that share the same predicted label as a whole trigger. In general, we adopt cross entropy as the loss function for fine-tuning.

² [CLS], [SEP] and [MASK] are special tokens of BERT.
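To make this concrete, the following is a minimal sketch of such a token-level classifier built with the Hugging Face transformers library. The checkpoint name, the label count (33 ACE event subtypes plus a None label) and the all-zero gold labels are illustrative assumptions, not the authors' released code:

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# 33 ACE2005 event subtypes plus one "None" label (assumed label inventory).
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=34)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

sentence = "The explosion killed the bomber and three shoppers"
inputs = tokenizer(sentence, return_tensors="pt")  # adds [CLS] and [SEP]
labels = torch.zeros_like(inputs["input_ids"])     # dummy per-token event-type ids

outputs = model(**inputs, labels=labels)           # token-level cross-entropy loss
outputs.loss.backward()                            # one fine-tuning step

# At inference time, consecutive tokens sharing a predicted label form one trigger.
predictions = outputs.logits.argmax(dim=-1)
```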
3.2 Argument Extractor

Given the trigger, the argument extractor aims to extract the related arguments and all the roles they play. Compared with trigger extraction, argument extraction is more complicated because of three issues: the dependency of arguments on the trigger, the fact that most arguments are long noun phrases, and the roles overlap problem. We take a series of actions to deal with these obstacles.

In common with the trigger extractor, the argument extractor requires the same three kinds of embeddings. However, it needs to know which tokens comprise the trigger. Therefore, we feed the argument extractor with the segment ids of the trigger tokens set to one.

To overcome the latter two issues in argument extraction, we add multiple sets of binary classifiers on top of BERT. Each set of classifiers serves one role and determines the spans (each span consisting of a start and an end) of all arguments that play it. This approach is similar to the question answering task on SQuAD (Rajpurkar et al., 2016), in which there is only one answer, whereas multiple arguments playing the same role can appear simultaneously in an event. Since the prediction is separated by roles, an argument can play multiple roles, and a token can belong to different arguments. Thus, the roles overlap problem is also solved.

3.3 Argument Span Determination

In PLMEE, a token t is predicted as the start of an argument that plays role r with probability

P_s^r(t) = \mathrm{Softmax}(W_s^r \cdot B(t)),

and as the end with probability

P_e^r(t) = \mathrm{Softmax}(W_e^r \cdot B(t)),

in which the subscript s represents "start" and the subscript e represents "end". W_s^r is the weight of the binary classifier that detects the starts of arguments playing role r, while W_e^r is the weight of the classifier that detects the ends. B is the BERT embedding.
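These per-role span classifiers can be sketched as a small PyTorch module; the BERT backbone, hidden size and number of roles below are placeholders rather than the paper's actual implementation:

```python
import torch.nn as nn

class ArgumentExtractor(nn.Module):
    """Per-role start/end classifiers over BERT token embeddings (sketch)."""

    def __init__(self, bert, hidden_size, num_roles):
        super().__init__()
        self.bert = bert
        # One (start, end) classifier pair per role, so roles never compete:
        # an argument may play several roles, and spans may overlap.
        self.start_heads = nn.ModuleList([nn.Linear(hidden_size, 2) for _ in range(num_roles)])
        self.end_heads = nn.ModuleList([nn.Linear(hidden_size, 2) for _ in range(num_roles)])

    def forward(self, input_ids, segment_ids, attention_mask):
        # segment_ids are 1 on trigger tokens, telling the extractor where the trigger is.
        h = self.bert(input_ids=input_ids, token_type_ids=segment_ids,
                      attention_mask=attention_mask).last_hidden_state
        p_start = [head(h).softmax(dim=-1) for head in self.start_heads]
        p_end = [head(h).softmax(dim=-1) for head in self.end_heads]
        return p_start, p_end  # per role: (batch, seq_len, 2) probabilities
```

A backbone such as transformers' BertModel with hidden_size = 768 fits this signature; the number of roles depends on the ontology.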
For each role r, we can derive two lists B_s^r and B_e^r of 0s and 1s according to P_s^r and P_e^r. They indicate respectively whether a token in the sentence is the start or end of an argument that plays role r³. Algorithm 1 examines each token sequentially to determine the spans of all arguments that play role r.

³ The i-th token is a start if B_s^r[i] = 1, and an end if B_e^r[i] = 1.

Algorithm 1: Argument span determination
In: P_s^r and P_e^r, B_s^r and B_e^r, sentence length l
Out: Span list L of the arguments that play role r
Initialize: a_s ← -1, a_e ← -1
1:  for i ← 0 to l do
2:    if in State 1 and the i-th token is a start then
3:      a_s ← i and change to State 2
4:    end if
5:    if in State 2 then
6:      if the i-th token is a new start then
7:        a_s ← i if P_s^r[i] > P_s^r[a_s]
8:      end if
9:      if the i-th token is an end then
10:       a_e ← i and change to State 3
11:     end if
12:   end if
13:   if in State 3 then
14:     if the i-th token is a new end then
15:       a_e ← i if P_e^r[i] > P_e^r[a_e]
16:     end if
17:     if the i-th token is a new start then
18:       append [a_s, a_e] to L
19:       a_e ← -1, a_s ← i and change to State 2
20:     end if
21:   end if
22: end for

Algorithm 1 implements a finite state machine, which changes from one state to another in response to B_s^r and B_e^r. There are three states in total: 1) neither a start nor an end has been detected; 2) only a start has been detected; 3) both a start and an end have been detected. Specifically, the state changes according to the following rules: State 1 changes to State 2 when the current token is a start; State 2 changes to State 3 when the current token is an end; State 3 changes to State 2 when the current token is a new start. Notably, if there has already been a start and another start arises, we choose the one with the higher probability, and the same for ends.
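A direct Python transcription of Algorithm 1 might look as follows; it follows the three-state machine above and adds a final flush of the last detected span, which the pseudocode leaves implicit:

```python
def determine_spans(p_start, p_end, b_start, b_end):
    """Determine the spans of all arguments playing one role (Algorithm 1).

    p_start, p_end: per-token start/end probabilities P_s^r, P_e^r.
    b_start, b_end: 0/1 indicator lists B_s^r, B_e^r.
    Returns a list of [start, end] token index pairs.
    """
    spans, a_s, a_e, state = [], -1, -1, 1
    for i in range(len(b_start)):
        if state == 1:
            if b_start[i] == 1:           # State 1 -> State 2 on a start
                a_s, state = i, 2
        elif state == 2:
            if b_start[i] == 1 and p_start[i] > p_start[a_s]:
                a_s = i                   # competing start: keep the likelier one
            if b_end[i] == 1:             # State 2 -> State 3 on an end
                a_e, state = i, 3
        else:                             # state == 3
            if b_end[i] == 1 and p_end[i] > p_end[a_e]:
                a_e = i                   # competing end: keep the likelier one
            if b_start[i] == 1:           # State 3 -> State 2 on a new start
                spans.append([a_s, a_e])
                a_s, a_e, state = i, -1, 2
    if state == 3:
        spans.append([a_s, a_e])          # flush the trailing span
    return spans
```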

3.4 Loss Re-weighting

We initially define L_s as the loss function of all binary classifiers that are responsible for detecting the starts of arguments. It is the average cross entropy between the output probabilities and the golden labels y:

L_s = \frac{1}{|R| \times |S|} \sum_{r \in R} CE(P_s^r, y_s^r),

in which CE is cross entropy, R is the set of roles, S is the input sentence, and |S| is the number of tokens in S. Similarly, we define L_e as the loss function of all binary classifiers that detect ends:

L_e = \frac{1}{|R| \times |S|} \sum_{r \in R} CE(P_e^r, y_e^r).

We finally average L_s and L_e to obtain the loss L of the argument extractor.

As Figure 2 shows, there exists a big gap in frequency between roles. This implies that roles have different levels of "importance" in an event. The "importance" here means the ability of a role to indicate events of a specific type. For example, the role "Victim" is more likely to indicate a Die event than the role "Time". Inspired by this, we re-weight L_s and L_e according to the importance of roles, which we measure with the following definitions:

Role Frequency (RF): We define RF as the frequency of role r appearing in events of type v:

RF(r, v) = \frac{N_v^r}{\sum_{k \in R} N_v^k},

where N_v^r is the count of role r appearing in events of type v.

Inverse Event Frequency (IEF): As a measure of the universal importance of a role, we define IEF as the logarithmically scaled inverse fraction of the event types that contain role r:

IEF(r) = \log \frac{|V|}{|\{v \in V : r \in v\}|},

where V is the set of event types.

Finally, we take RF-IEF as the product of RF and IEF: RF-IEF(r, v) = RF(r, v) × IEF(r). With RF-IEF, we can measure the importance of role r in events of type v:

I(r, v) = \frac{\exp(RF\text{-}IEF(r, v))}{\sum_{r' \in R} \exp(RF\text{-}IEF(r', v))}.

We choose three event types and list the two most important roles of each type in Table 1. It shows that although there can be multiple roles in events of a given type, only a few of them are indispensable.

| Event Type | Top 2 Roles | Sum |
| Transport (15) | Artifact, Origin | 0.76 |
| Attack (14) | Attacker, Target | 0.85 |
| Die (12) | Victim, Agent | 0.90 |

Table 1: Top two roles and their summed importance for each event type. The number in brackets after the event type is the count of roles that have appeared in it.

Given the event type v of the input, we re-weight L_s and L_e based on each role's importance in v:

L_s = \sum_{r \in R} \frac{I(r, v)}{|S|} CE(P_s^r, y_s^r),

L_e = \sum_{r \in R} \frac{I(r, v)}{|S|} CE(P_e^r, y_e^r).

The loss L of the argument extractor is still the average of L_s and L_e.
4 Training Data Generation

In addition to PLMEE, we propose a pre-trained language model based method for event generation, illustrated in Figure 4. By editing prototypes, this method can generate a controllable number of labeled samples as extra training corpus. It consists of three stages: pre-processing, event generation and scoring.

To facilitate the generation method, we define adjunct tokens as the tokens in a sentence other than triggers and arguments, including not only words and numbers but also punctuation. Taking the sentence in Figure 1 as an example, "is" and "going" are adjunct tokens. Adjunct tokens clearly affect the fluency and diversity of expression. Therefore, we try to rewrite them to expand the diversity of the generated results, while keeping the trigger and arguments unchanged.

4.1 Pre-processing

With the golden labels, we first collect the arguments in the ACE2005 dataset as well as the roles they play. However, arguments that overlap with others are excluded, because such arguments are often long compound phrases that contain too much unexpected information, and incorporating them in argument replacement could introduce unnecessary errors.

We also adopt BERT as the target model for rewriting adjunct tokens in the following stage, and fine-tune it on the ACE2005 dataset with the masked language model task (Devlin et al., 2018) to bias its predictions towards the dataset distribution. In common with the pre-training procedure of BERT, each time we sample a batch of sentences and mask 15% of the tokens; the goal is still to predict the correct tokens without supervision.

4.2 Event generation

To generate events, we conduct two steps on a prototype. We first replace the arguments in the prototype with similar ones that have played the same role. Next, we rewrite adjunct tokens with the fine-tuned BERT. Through these two steps, we obtain a new sentence with annotations.

Argument Replacement: The first step is to replace the arguments in the event. Both the argument to be replaced and the new one should have played the same role. Since the roles are inherited after replacement, we can still use the original labels for the generated samples.

In order not to change the meaning drastically, we employ similarity as the criterion for selecting new arguments. This is based on two considerations: one is that two arguments that play the same role may diverge significantly in semantics; the other is that the role an argument plays depends largely on its context. Therefore, we should choose arguments that are semantically similar and coherent with the context.

We use the cosine similarity between embeddings to measure the similarity of two arguments. Due to ELMo's ability to handle the OOV problem, we employ it to embed arguments:

E(a) = \frac{1}{|a|} \sum_{t \in a} E(t),

where a is the argument and E is the ELMo embedding. We choose the top 10 percent most similar arguments as candidates, and apply a softmax over their similarities to allocate probabilities.

An argument is replaced with probability 80% and kept constant with probability 20%, to bias the representation towards the actual event (Devlin et al., 2018). Note that the triggers remain unchanged to avoid undesirable deviation of the dependency relations.
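A hedged sketch of this replacement step is shown below. The embed callable stands in for the averaged ELMo embedding, and the candidate pool format is assumed:

```python
import numpy as np

def replace_arguments(args, pool, embed, rng, top_frac=0.1, p_keep=0.2):
    """Sketch of similarity-based argument replacement.

    args: list of (argument_text, role) pairs from the prototype event.
    pool: {role: [candidate argument strings]} collected in pre-processing.
    embed: callable mapping a string to a vector; the paper averages ELMo
           token embeddings, so any token embedder works as a stand-in.
    """
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    out = []
    for arg, role in args:
        cands = [c for c in pool.get(role, []) if c != arg]
        if not cands or rng.random() < p_keep:       # keep ~20% unchanged
            out.append(arg)
            continue
        v = embed(arg)
        sims = np.array([cos(v, embed(c)) for c in cands])
        k = max(1, int(len(cands) * top_frac))       # top 10% most similar
        top = np.argsort(sims)[-k:]
        probs = np.exp(sims[top]) / np.exp(sims[top]).sum()  # softmax weights
        out.append(cands[int(rng.choice(top, p=probs))])
    return out
```

Here rng is a numpy.random.default_rng() instance, so sampling is reproducible under a fixed seed.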
[Figure 4: Flow chart of the generation approach, with three stages: pre-processing (argument collection from the dataset and BERT fine-tuning), event generation (argument replacement, then adjunct token rewriting) and scoring. In the depicted example, the input "President Bush is going to be meeting with several Arab leaders" becomes "Prime minister Blair is reported to the meeting with the leaders", and the scorer assigns it quality 0.5.]

Adjunct Token Rewriting: The result of argument replacement can already be considered generated data, but the unchanged context may increase the risk of overfitting. Therefore, to smooth the generated data and expand their diversity, we rewrite adjunct tokens with the fine-tuned BERT.

The rewriting replaces some adjunct tokens in the prototype with new ones that better match the current context. We treat it as a Cloze task (Taylor, 1953), in which some adjunct tokens are randomly masked and the BERT fine-tuned in the first stage is used to predict the vocabulary ids of suitable tokens based on the context. We use a parameter m to denote the proportion of adjunct tokens that need to be rewritten.

Adjunct token rewriting is a step-by-step process. Each time, we mask 15% of the adjunct tokens (with the token [MASK]). Then the sentence is fed into BERT to produce new adjunct tokens. The adjunct tokens that have not yet been rewritten temporarily remain in the sentence.

To further illustrate the above two steps, we give an instance in Figure 4. In this instance, we set m to 1.0, which means all the adjunct tokens will be rewritten. The final output is "Prime minister Blair is reported to the meeting with the leaders", which shares its labels with the original event in the prototype. Evidently, some adjunct tokens are preserved even though m is 1.0.
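The step-by-step rewriting loop could look like the following sketch, where fill_mask is an assumed helper wrapping the fine-tuned BERT masked language model; how m interacts with the 15% per-step masking is our reading of the description above:

```python
import random

def rewrite_adjuncts(tokens, is_adjunct, fill_mask, m=0.4, step_frac=0.15, mask="[MASK]"):
    """Sketch of step-by-step adjunct token rewriting.

    tokens: WordPiece tokens of the prototype sentence.
    is_adjunct: parallel booleans (True = neither trigger nor argument token).
    fill_mask: assumed helper that feeds the token list to the fine-tuned
               BERT and returns it with every `mask` slot filled in.
    """
    adjuncts = [i for i, adj in enumerate(is_adjunct) if adj]
    todo = set(random.sample(adjuncts, int(len(adjuncts) * m)))  # proportion m to rewrite
    while todo:
        # Mask 15% of the adjunct tokens per step; not-yet-rewritten adjunct
        # tokens temporarily stay in the sentence as context.
        k = min(len(todo), max(1, int(len(adjuncts) * step_frac)))
        batch = set(random.sample(sorted(todo), k))
        tokens = fill_mask([mask if i in batch else t for i, t in enumerate(tokens)])
        todo -= batch
    return tokens
```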
4.3 Scoring

Theoretically, an infinite number of events can be generated with our method. However, not all of them are valuable to the extractor, and some may even degrade its performance. Therefore, we add an extra stage to quantify the quality of each generated sample and pick out the valuable ones. Our key insight is that quality is tightly related to two factors: the perplexity and the distance to the original dataset. The former reflects the rationality of the generation, and the latter reflects the difference between the generated data and the original data.

Perplexity (PPL): Different from the logarithmic masked perplexity (Devlin et al., 2018), we take the average probability of the adjunct tokens that have been rewritten as the perplexity of the generated sentence S':

PPL(S') = \frac{1}{|A(S')|} \sum_{t \in A(S')} P(t),

where A(S') is the set of adjunct tokens in S' that have been rewritten.

Distance (DIS): We measure the distance between S' and the dataset D with cosine similarity:

DIS(S', D) = 1 - \frac{1}{|D|} \sum_{S \in D} \frac{B(S') \cdot B(S)}{|B(S')| \times |B(S)|}.

Unlike arguments, which are embedded with ELMo, we utilize BERT to embed sentences, taking the embedding of the first token [CLS] as the sentence embedding.

Both PPL and DIS are limited to [0, 1]. We consider that generated samples of high quality should have both low PPL and low DIS. Therefore, we define the quality function as

Q(S') = 1 - [\lambda PPL(S') + (1 - \lambda) DIS(S', D)],

where λ ∈ [0, 1] is a balancing parameter. This function is used to select generated samples of high quality in the experiments.
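Putting the two measures together, the selection stage reduces to ranking by Q; the sketch below assumes PPL and DIS have already been computed per sample, and the default keep fraction of one quarter mirrors the paper's n = 1.0 setting over a 4x generated pool:

```python
def quality(ppl, dis, lam=0.5):
    """Quality function Q(S') = 1 - [lam * PPL + (1 - lam) * DIS]."""
    return 1.0 - (lam * ppl + (1.0 - lam) * dis)

def select_samples(samples, lam=0.5, keep_frac=0.25):
    """Keep the highest-quality fraction of generated samples.

    samples: list of (sentence, ppl, dis) triples (assumed format).
    """
    ranked = sorted(samples, key=lambda s: quality(s[1], s[2], lam), reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_frac))]
```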

5 Experiments

In this section, we first evaluate our event extractor PLMEE on the ACE2005 dataset. Then we give a case study of generated samples and conduct automatic evaluations by adding them to the training set. Finally, we illustrate the limitations of the generation method.

As in previous works (Li et al., 2013b; Chen et al., 2015; Hong et al., 2011), we take 40 newswire documents as the test set, 30 other documents as the validation set, and the remaining 529 documents as the training set. However, unlike previous works, we adopt the following criteria to evaluate the correctness of each predicted event mention:

1. A trigger prediction is correct only if its span and type match the golden labels.

2. An argument prediction is correct only if its span and all the roles it plays match the golden labels.

It is worth noting that all the predicted roles for an argument are required to match the golden labels, instead of just one of them. We adopt Precision (P), Recall (R) and F measure (F1) as the evaluation metrics.

5.1 Results of Event Extraction

We take several classic previous works for comparison and divide them into three categories:

Feature based methods: Cross Event (Liao and Grishman, 2010) utilizes document-level information to assist event extraction, while Cross Entity (Hong et al., 2011) uses cross-entity inference. Max Entropy (Li et al., 2013a) extracts triggers and arguments together based on structured prediction.

Neural based methods: DMCNN (Chen et al., 2015) is the first to adopt a dynamic multi-pooling CNN to extract sentence-level features automatically. JRNN (Nguyen et al., 2016) proposes a joint framework based on bidirectional RNNs for event extraction.

External resource based methods: DMCNN-DS (Chen et al., 2017) uses Freebase to label potential events in unsupervised corpora by distant supervision. ANN-FN (Liu et al., 2016) improves extraction with additional events automatically detected from FrameNet, while ANN-AugATT (Liu et al., 2017) exploits argument information via supervised attention mechanisms to improve performance further.

To verify the effectiveness of loss re-weighting, two groups of experiments are conducted for comparison: one in which the loss function is simply averaged over all classifiers' outputs (denoted PLMEE(-)), and one in which the loss is re-weighted based on role importance (denoted PLMEE).

| Model | Trigger Identification (P/R/F1) | Trigger Classification (P/R/F1) | Argument Identification (P/R/F1) | Argument Classification (P/R/F1) |
| Cross Event | N/A | 68.7 / 68.9 / 68.8 | 50.9 / 49.7 / 50.3 | 45.1 / 44.1 / 44.6 |
| Cross Entity | N/A | 72.9 / 64.3 / 68.3 | 53.4 / 52.9 / 53.1 | 51.6 / 45.5 / 48.3 |
| Max Entropy | 76.9 / 65.0 / 70.4 | 73.7 / 62.3 / 67.5 | 69.8 / 47.9 / 56.8 | 64.7 / 44.4 / 52.7 |
| DMCNN | 80.4 / 67.7 / 73.5 | 75.6 / 63.6 / 69.1 | 68.8 / 51.9 / 59.1 | 62.2 / 46.9 / 53.5 |
| JRNN | 68.5 / 75.7 / 71.9 | 66.0 / 73.0 / 69.3 | 61.4 / 64.2 / 62.8 | 54.2 / 56.7 / 55.4 |
| DMCNN-DS | 79.7 / 69.6 / 74.3 | 75.7 / 66.0 / 70.5 | 71.4 / 56.9 / 63.3 | 62.8 / 50.1 / 55.7 |
| ANN-FN | N/A | 79.5 / 60.7 / 68.8 | N/A | N/A |
| ANN-AugATT | N/A | 78.0 / 66.3 / 71.7 | N/A | N/A |
| PLMEE(-) | 84.8 / 83.7 / 84.2 | 81.0 / 80.4 / 80.7 | 71.5 / 59.2 / 64.7 | 61.7 / 53.9 / 57.5 |
| PLMEE | 84.8 / 83.7 / 84.2 | 81.0 / 80.4 / 80.7 | 71.4 / 60.1 / 65.3 | 62.3 / 54.2 / 58.0 |

Table 2: Performance of all methods (%). In the original table, bold denotes the best result; the trigger scores of PLMEE(-) and PLMEE are shared, as both use the same trigger extractor.

Table 2 compares the results of the aforementioned models with PLMEE on the test set. As shown, in both the trigger extraction task and the argument extraction task, PLMEE(-) achieves the best results among all the compared methods. The improvement in trigger extraction is quite significant, a sharp increase of nearly 10% in F1 score, while the improvement in argument extraction is less pronounced, at about 2%. This is probably due to the more rigorous evaluation metric we have adopted, as well as the difficulty of the argument extraction task itself. Moreover, compared with feature based methods, neural based methods achieve better performance, and the same observation holds when comparing external resource based methods with neural based methods. This demonstrates that external resources are useful for improving event extraction.

In addition, the PLMEE model achieves better results than PLMEE(-) on the argument extraction task, with improvements of 0.6% in F1 score for identification and 0.5% for classification, which shows that re-weighting the loss can effectively improve performance.

5.2 Case Study

Prototype: "President Bush is going to be meeting with several Arab leaders"

| m | Generated Event |
| 0.2 | Russian President Putin is going to the meeting with the Arab leaders |
| 0.4 | The president is reported to be meeting with an Arab counterpart |
| 0.6 | Mr. Bush is summoned to a meeting with some Shiite Muslim groups |
| 0.8 | The president is attending to the meeting with the Palestinians |
| 1.0 | Prime minister Blair is reported to the meeting with the leaders |

Table 3: Example samples generated from the prototype with different proportions m of rewritten adjunct tokens. In the original table, italics indicate arguments and bold indicates the trigger.

Table 3 illustrates a prototype and its generations with the parameter m ranging from 0.2 to 1.0. We can observe that the arguments after replacement match the context of the prototype relatively well, which indicates that they resemble the original ones semantically.

On the other hand, rewriting the adjunct tokens can smooth the generated data and expand their diversity. However, since there is no explicit guide, this step can also introduce unpredictable noise, making the generation less fluent than expected.

5.3 Automatic Evaluation of Generation

So far, there are mainly three aspects of the generation method that can significantly impact the performance of the extraction model: the amount of generated samples (represented by n, the ratio of the generation size to the dataset size), the proportion m of rewritten adjunct tokens, and the quality of the generated samples. The former two factors are controllable in the generation process. Specifically, we can reuse a prototype and obtain a variety of argument combinations via similarity based replacement, which brings different contexts for rewriting adjunct tokens. Moreover, the proportion of rewritten adjunct tokens can be adjusted, producing further variation. Although the quality of the generation cannot be controlled arbitrarily, it can be quantified by the score function Q, so that samples of higher quality can be picked out and added to the training set. By varying λ in Q, different selection strategies can be used to screen the generated samples.

We first tuned the former two parameters on the development set through grid search. Specifically, we set m from 0.2 to 1.0 with an interval of 0.2, and set n to 0.5, 1.0 and 2.0, while keeping the other parameters unchanged in the generation process. By analyzing the results, we find that the best performance of PLMEE on both trigger extraction and argument extraction is achieved with m = 0.4 and n = 1.0. This suggests that neither too few generated samples nor too many is the better choice for extraction: too few have limited influence, while too many can bring more noise that disturbs the distribution of the dataset. For the best extraction performance, we use these parameter settings in the following experiments.

We also investigate the effectiveness of the sample selection approach by comparing three groups with different selection strategies. We generate a total of four times the size of the ACE2005 dataset with m = 0.4, and pick out one quarter of the samples (n = 1.0) with λ set to 0, 0.5 and 1.0 respectively. When λ is 0 or 1.0, it is either the perplexity or the distance that determines the quality exclusively. We find that the selection method with λ = 0.5 in the quality function picks out samples that are more advantageous for promoting extraction performance.
Finally, we incorporate the generated data above with the ACE2005 dataset and investigate the effectiveness of our generation method on the test set. In Table 4, PLMEE(+) denotes the PLMEE model trained with the extra generated samples. The results show that with our event generation method, the PLMEE model achieves the state-of-the-art result for event extraction.

| Model | Trigger (%) | Argument (%) |
| PLMEE | 80.7 | 58.0 |
| PLMEE(+) | 81.1 | 58.9 |

Table 4: F1 score of trigger classification and argument classification on the test set.

5.4 Limitation

By comparing the annotations in generated samples with manually labeled samples, we find that one issue of our generation method is that the roles may deviate, because the semantics can change considerably with only a few adjunct tokens rewritten. Take Figure 5 as an example: the roles played by the arguments "Pittsburgh" and "Boston" should be "Destination" and "Origin", rather than the opposite as in the prototype. This is because the token "from" has been replaced with the token "for", while the tokens "drive to" have been replaced with "return from".

[Figure 5: One of the generated samples with wrong annotations. Prototype: "Leave from Niagara Falls and drive to Toronto, on 85 miles" (trigger "leave", event type Movement.Transport, arguments "Niagara Falls" = Origin and "Toronto" = Destination). Generation: "Leave for Pittsburgh and return from Boston in 200 miles" (the trigger "leave" is correct, but the labels "Pittsburgh" = Origin and "Boston" = Destination are wrong).]

6 Conclusion and Discussion

In this paper, we present a framework that promotes event extraction by combining an extraction model and a generation method, both based on pre-trained language models. To solve the roles overlap problem, our extraction approach separates the argument predictions in terms of roles. It then exploits the importance of roles to re-weight the loss function. To perform event generation, we present a novel method that takes existing events as prototypes. This method can produce a controllable number of labeled samples through argument replacement and adjunct token rewriting, and it benefits from a scoring mechanism that quantifies the quality of the generated samples. Experimental results show that the quality of the generated data is competitive, and that incorporating it with the existing corpus makes our proposed event extractor superior to several state-of-the-art approaches.

On the other hand, there are still limitations in our work. Events of the same type often share similarities, and co-occurring roles tend to be tightly related. Such features are ignored in our model, but they deserve more investigation for improving the extraction model. In addition, although our generation method can control the number of generated samples and filter them by quality, it still suffers from the deviation of roles, similar to distant supervision. Therefore, in future work we will incorporate relations between events and relations between arguments into pre-trained language models, and take effective measures to overcome the role deviation problem in generation.

Acknowledgments

The work was sponsored by the National Key Research and Development Program of China under Grant No.2018YFB0204300, and the National Natural Science Foundation of China under Grants No.61872376 and No.61806216.

References

Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. Automatically labeled data generation for large scale event extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 409–419.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167–176.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie M. Strassel, and Ralph M. Weischedel. 2004. The automatic content extraction (ACE) program: Tasks, data, and evaluation. In LREC, volume 2, page 1.

Xiaocheng Feng, Lifu Huang, Duyu Tang, Heng Ji, Bing Qin, and Ting Liu. 2016. A language-independent neural network for event detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 66–71.

Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu. 2011. Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1127–1136.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R. Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 258–268.

Peifeng Li, Qiaoming Zhu, and Guodong Zhou. 2013a. Joint modeling of argument identification and role determination in Chinese event extraction with discourse-level information. In Twenty-Third International Joint Conference on Artificial Intelligence.

Qi Li, Heng Ji, and Liang Huang. 2013b. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 73–82.

Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 789–797.

Bing Liu, Longhua Qian, Hongling Wang, and Guodong Zhou. 2010. Dependency-driven feature-based learning for extracting protein-protein interactions from biomedical text. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 757–765.

Shulin Liu, Yubo Chen, Shizhu He, Kang Liu, and Jun Zhao. 2016. Leveraging FrameNet to improve automatic event detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2134–2143.

Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789–1798.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Jun'ichi Tsujii. 2009. A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 121–130.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Hang Yang, Yubo Chen, Kang Liu, Yang Xiao, and Jun Zhao. 2018. DCFEE: A document-level Chinese financial event extraction system based on automatically labeled training data. In Proceedings of ACL 2018, System Demonstrations, pages 50–55.

Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2018. Scale up event extraction learning via automatic training data generation. In Thirty-Second AAAI Conference on Artificial Intelligence.