Exploring Pre-Trained Language Models For Event Extraction and Generation
Sen Yang†, Dawei Feng†, Linbo Qiao, Zhigang Kan, Dongsheng Li‡
National University of Defense Technology, Changsha, China
{sen yang,linbo.qiao,kanzhigang13}@nudt.edu.cn
[email protected], [email protected]
Figure 2: Frequency of roles that appear in events of type Injure in the ACE2005 dataset (roles shown: Victim, Place, Agent, Instrument, Time).

of events generated with distant supervision are highly dependent on the source data. In fact, external corpora can also be exploited by pre-trained language models to generate sentences. Therefore, we turn to pre-trained language models, attempting to leverage the knowledge they have learned from large-scale corpora for event generation.

Specifically, this paper proposes a framework based on pre-trained language models, which includes an event extraction model as our baseline and a labeled event generation method. Our proposed event extraction model consists of a trigger extractor and an argument extractor, the latter of which refers to the results of the former for inference. In addition, we improve the performance of the argument extractor by re-weighting the loss function based on the importance of roles.

Pre-trained language models have also been applied to generating labeled data. Inspired by the work of Guu et al. (2018), we take existing samples as prototypes for event generation, which involves two key steps: argument replacement and adjunct token rewriting. By scoring the quality of the generated samples, we can pick out those of high quality; incorporating them with the existing data can further improve the performance of our event extractor.

2 Related Work

Event Extraction In terms of analysis granularity, there is document-level event extraction (Yang et al., 2018) and sentence-level event extraction (Zeng et al., 2018). We focus on statistical methods for the latter in this paper. These methods can be further divided into two categories: feature-based ones (Liao and Grishman, 2010; Liu et al., 2010; Miwa et al., 2009; Liu et al., 2016; Hong et al., 2011; Li et al., 2013b), which rely on hand-designed features for extraction, and neural ones, which take advantage of neural networks to learn features automatically (Chen et al., 2015; Nguyen and Grishman, 2015; Feng et al., 2016).

Event Generation External resources such as Freebase, FrameNet and WordNet are commonly employed to generate events and enrich the training data. Several previous event generation approaches (Chen et al., 2017; Zeng et al., 2018) rest on a strong assumption of distant supervision¹ to label events in unsupervised corpora; in fact, however, co-occurring entities may not hold the expected relationship. In addition, Huang et al. (2016) incorporate abstract meaning representation and distributional semantics to extract events, while Liu et al. (2016, 2017) mine additional events from the frames in FrameNet.

¹ If two entities have a relationship in a knowledge base, then all sentences that mention these two entities are assumed to express that relationship.

Pre-trained Language Model Pre-trained language models are capable of capturing the meaning of words dynamically, taking their context into consideration. McCann et al. (2017) exploit a language model pre-trained on a supervised translation corpus in the target task. ELMo (Embeddings from Language Models) (Peters et al., 2018) obtains context-sensitive embeddings by encoding characters with stacked bidirectional LSTMs (Long Short-Term Memory) and residual structures (He et al., 2016). Howard and Ruder (2018) obtain comparable results on text classification. GPT (Generative Pre-Training) (Radford et al., 2018) improves the state of the art on 9 of 12 tasks. BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) breaks the records of 11 NLP tasks and has received a lot of attention.

3 Extraction Model

This section describes our approach to extracting events that occur in plain text. We consider event extraction as a two-stage task, comprising trigger extraction and argument extraction, and propose a Pre-trained Language Model based Event Extractor (PLMEE). Figure 3 illustrates the architecture of PLMEE. It consists of a trigger extractor and an argument extractor, both of which rely on the feature representations of BERT.

3.1 Trigger Extractor

The trigger extractor aims to predict whether a token triggers an event. We therefore formulate trigger extraction as a token-level classification task with labels
Figure 3: Illustration of the PLMEE architecture, including a trigger extractor and an argument extractor. The processing procedure of an event instance triggered by the word "killed" (in the sentence "The explosion killed the bomber and three shoppers", event type Conflict.Attack) is also shown.
being event types, and simply add a multi-classifier on top of BERT to build the trigger extractor.

The input of the trigger extractor follows that of BERT, i.e., the sum of three types of embeddings: WordPiece embeddings (Wu et al., 2016), position embeddings and segment embeddings. Since the input contains only one sentence, all of its segment ids are set to zero. In addition, the tokens [CLS] and [SEP]² are placed at the start and end of the sentence.

² [CLS], [SEP] and [MASK] are special tokens of BERT.

In many cases, the trigger is a phrase. Therefore, we treat consecutive tokens that share the same predicted label as a whole trigger. As usual, we adopt cross entropy as the loss function for fine-tuning.
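To make the formulation concrete, the following is a minimal sketch of such a trigger extractor built with the HuggingFace transformers library; the checkpoint name and the size of the label inventory are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

NUM_EVENT_TYPES = 34  # assumed: 33 ACE2005 event subtypes plus a "None" label

class TriggerExtractor(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size,
                                    NUM_EVENT_TYPES)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Single-sentence input: all segment ids (token_type_ids) are zero.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids).last_hidden_state
        return self.classifier(hidden)  # (batch, seq_len, NUM_EVENT_TYPES)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("The explosion killed the bomber and three shoppers",
                return_tensors="pt")
model = TriggerExtractor()
logits = model(enc.input_ids, enc.attention_mask, enc.token_type_ids)
labels = logits.argmax(-1)  # consecutive equal non-"None" labels form one trigger
# Fine-tuning would minimize nn.CrossEntropyLoss over these token-level logits.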
3.2 Argument Extractor

Given the trigger, the argument extractor aims to extract the related arguments and all the roles they play. Compared with trigger extraction, argument extraction is more complicated because of three issues: the dependency of the arguments on the trigger, the fact that most arguments are long noun phrases, and the roles-overlap problem. We take a series of actions to deal with these obstacles.

In common with the trigger extractor, the argument extractor requires the same three kinds of embeddings. However, it also needs to know which tokens comprise the trigger. Therefore, we feed the argument extractor with the segment ids of the trigger tokens set to one.

To overcome the latter two issues in argument extraction, we add multiple sets of binary classifiers on top of BERT. Each set of classifiers serves one role, determining the spans (each span comprising a start and an end) of all arguments that play that role. This approach is similar to the question answering task on SQuAD (Rajpurkar et al., 2016), except that a SQuAD question has only one answer, whereas multiple arguments playing the same role can appear simultaneously in an event. Since prediction is separated by role, an argument can play multiple roles, and a token can belong to different arguments; thus the roles-overlap problem is also solved.

3.3 Argument Span Determination

In PLMEE, a token t is predicted as the start of an argument that plays role r with probability

$$P_s^r(t) = \mathrm{Softmax}(W_s^r \cdot B(t)),$$

and as the end with probability

$$P_e^r(t) = \mathrm{Softmax}(W_e^r \cdot B(t)),$$

in which the subscripts "s" and "e" stand for "start" and "end" respectively. W_s^r is the weight of the binary classifier that detects the starts of arguments playing role r, W_e^r is the weight of another binary classifier that detects their ends, and B is the BERT embedding.

For each role r, we obtain two 0-1 lists B_s^r and B_e^r according to P_s^r and P_e^r. They indicate respectively whether a token in the sentence is the start or end of an argument that plays role r.
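A sketch of how the role-wise start/end classifiers could sit on top of BERT follows; the role inventory size is an assumed value, and passing the trigger indicator as token_type_ids mirrors the segment-id trick described in Section 3.2.

import torch
import torch.nn as nn
from transformers import BertModel

NUM_ROLES = 35  # assumed size of the ACE2005 role inventory

class ArgumentExtractor(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        h = self.bert.config.hidden_size
        # One (start, end) pair of binary classifiers for every role r.
        self.start_heads = nn.Linear(h, 2 * NUM_ROLES)
        self.end_heads = nn.Linear(h, 2 * NUM_ROLES)

    def forward(self, input_ids, attention_mask, trigger_mask):
        # Trigger tokens get segment id 1, so the extractor knows the trigger.
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=trigger_mask).last_hidden_state
        bsz, seq, _ = out.shape
        # P_s^r(t) and P_e^r(t): per-role softmax over {not-start, start}
        # and {not-end, end} respectively.
        p_start = self.start_heads(out).view(bsz, seq, NUM_ROLES, 2).softmax(-1)
        p_end = self.end_heads(out).view(bsz, seq, NUM_ROLES, 2).softmax(-1)
        return p_start, p_end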
Algorithm 1 is used to detect each token sequentially and determine the spans of all arguments that play role r.

Algorithm 1 Argument span determination
In: P_s^r and P_e^r, B_s^r and B_e^r, sentence length l.
Out: Span list L of the arguments that play role r.
Initialize: a_s ← -1, a_e ← -1
 1: for i ← 0 to l do
 2:   if in State 1 and the i-th token is a start then
 3:     a_s ← i and change to State 2
 4:   end if
 5:   if in State 2 then
 6:     if the i-th token is a new start then
 7:       a_s ← i if P_s^r[i] > P_s^r[a_s]
 8:     end if
 9:     if the i-th token is an end then
10:       a_e ← i and change to State 3
11:     end if
12:   end if
13:   if in State 3 then
14:     if the i-th token is a new end then
15:       a_e ← i if P_e^r[i] > P_e^r[a_e]
16:     end if
17:     if the i-th token is a new start then
18:       Append [a_s, a_e] to L
19:       a_e ← -1, a_s ← i and change to State 2
20:     end if
21:   end if
22: end for

Algorithm 1 contains a finite state machine, which changes from one state to another in response to B_s^r and B_e^r. There are three states in total: 1) neither a start nor an end has been detected; 2) only a start has been detected; 3) both a start and an end have been detected. Specifically, the state changes according to the following rules: State 1 changes to State 2 when the current token is a start; State 2 changes to State 3 when the current token is an end; and State 3 changes to State 2 when the current token is a new start. Notably, if a start has already been found and another start arises, we choose the one with the higher probability, and the same holds for ends.
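For reference, a direct Python transcription of Algorithm 1 for one role is given below. One detail is an explicit assumption: a span that is still open when the sentence ends is flushed after the loop, which the pseudocode leaves implicit.

def determine_spans(p_start, p_end, is_start, is_end, l):
    """Transcription of Algorithm 1 for a single role r.

    p_start, p_end: per-token probabilities P_s^r, P_e^r;
    is_start, is_end: 0-1 indicator lists B_s^r, B_e^r;
    l: sentence length.  Returns the span list L.
    """
    spans = []           # L
    a_s, a_e = -1, -1
    state = 1            # State 1: nothing detected yet
    for i in range(l):
        if state == 1 and is_start[i]:
            a_s, state = i, 2                    # lines 2-4
        if state == 2:
            if is_start[i] and p_start[i] > p_start[a_s]:
                a_s = i                          # line 7: keep the likelier start
            if is_end[i]:
                a_e, state = i, 3                # lines 9-11
        if state == 3:
            if is_end[i] and p_end[i] > p_end[a_e]:
                a_e = i                          # line 15: keep the likelier end
            if is_start[i] and i != a_s:         # "new" start only
                spans.append([a_s, a_e])         # lines 17-20: close the span
                a_s, a_e, state = i, -1, 2
    if state == 3:
        spans.append([a_s, a_e])                 # assumed flush of the last span
    return spans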
3.4 Loss Re-weighting

We define L_s as the loss of all the binary classifiers that detect starts, i.e., the averaged cross entropy between the output probabilities and the golden labels y:

$$L_s = \frac{1}{|R| \times |S|} \sum_{r \in R} \mathrm{CE}(P_s^r, y_s^r),$$

in which CE is the cross entropy, R is the set of roles, S is the input sentence, and |S| is the number of tokens in S. Similarly, we define L_e as the loss of all the binary classifiers that detect ends:

$$L_e = \frac{1}{|R| \times |S|} \sum_{r \in R} \mathrm{CE}(P_e^r, y_e^r).$$

We finally average L_s and L_e to obtain the loss L of the argument extractor.

As Figure 2 shows, there is a big gap in frequency between roles. This implies that roles have different levels of "importance" in an event, where "importance" means the ability of a role to indicate events of a specific type. For example, the role "Victim" is more likely to indicate a Die event than the role "Time" is. Inspired by this, we re-weight L_s and L_e according to the importance of roles, and propose to measure the importance with the following definitions:

Role Frequency (RF) We define RF as the frequency with which role r appears in events of type v:

$$\mathrm{RF}(r, v) = \frac{N_v^r}{\sum_{k \in R} N_v^k},$$

where N_v^r is the count of the role r appearing in events of type v.

Inverse Event Frequency (IEF) As a measure of the universal importance of a role, we define IEF as the logarithmically scaled inverse fraction of the event types that contain the role r:

$$\mathrm{IEF}(r) = \log \frac{|V|}{|\{v \in V : r \in v\}|},$$

where V is the set of event types.

Finally, we take RF-IEF as the product of RF and IEF: RF-IEF(r, v) = RF(r, v) × IEF(r). With RF-IEF, we can measure the importance of a role r in events of type v.
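A small sketch of the RF-IEF computation from annotation counts follows. The final normalization of RF-IEF into an importance score that sums to one over the roles of an event type is an assumption, made to be consistent with the "Sum" column of Table 1 rather than taken from the paper.

import math
from collections import defaultdict

def rf_ief_importance(events):
    """events: list of (event_type, [roles]) pairs from the training data.

    Returns importance[v][r]; the normalization so that importances sum
    to 1 within each event type is an assumption consistent with Table 1.
    """
    counts = defaultdict(lambda: defaultdict(int))  # N_v^r
    for v, roles in events:
        for r in roles:
            counts[v][r] += 1
    types_with_role = defaultdict(set)              # {v in V : r in v}
    for v in counts:
        for r in counts[v]:
            types_with_role[r].add(v)
    n_types = len(counts)                           # |V|
    importance = {}
    for v, role_counts in counts.items():
        total = sum(role_counts.values())
        rf_ief = {r: (c / total) * math.log(n_types / len(types_with_role[r]))
                  for r, c in role_counts.items()}
        z = sum(rf_ief.values()) or 1.0             # guard against all-zero IEF
        importance[v] = {r: s / z for r, s in rf_ief.items()}
    return importance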
Event Type      Top 2 Roles       Sum
Transport (15)  Artifact, Origin  0.76
Attack (14)     Attacker, Target  0.85
Die (12)        Victim, Agent     0.90

Table 1: Top two roles and their summed importance for each event type. The number in brackets after the event type is the count of roles that have appeared in it.

In the first stage, we fine-tune BERT on the ACE2005 dataset with the masked language model task (Devlin et al., 2018) to bias its predictions towards the dataset distribution. In common with the pre-training procedure of BERT, each time we sample a batch of sentences and mask 15% of their tokens; the goal is still to predict the correct tokens without supervision.
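A sketch of this fine-tuning stage with the HuggingFace transformers utilities is shown below; the output directory, epoch count and the toy corpus are illustrative assumptions.

# Masked-LM fine-tuning of BERT on the target corpus: 15% of tokens are
# masked per batch and the model is trained to recover them.
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

sentences = ["President Bush is going to be meeting with several Arab leaders"]
dataset = [tokenizer(s, truncation=True) for s in sentences]  # toy corpus

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)  # mask 15% of tokens

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-ace2005-mlm",
                           num_train_epochs=3),  # assumed epoch count
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()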
Figure 4: Illustration of the generation process on the prototype "President Bush is going to be meeting with several Arab leaders": arguments are replaced with similar entities drawn from the argument collection (e.g., "Prime minister Blair", "the leaders"), yielding "Prime minister Blair is going to be meeting with the leaders", which then undergoes adjunct token rewriting with the fine-tuned BERT and is scored (quality: 0.5).
To smooth the generated data and expand their diversity, we rewrite adjunct tokens with the fine-tuned BERT.

The rewriting replaces some adjunct tokens in the prototype with new ones that better match the current context. We treat it as a Cloze task (Taylor, 1953), where some adjunct tokens are randomly masked and the BERT fine-tuned in the first stage is used to predict the vocabulary ids of suitable tokens based on the context. We use a parameter m to denote the proportion of adjunct tokens that need to be rewritten.

Adjunct token rewriting is a step-by-step process. Each time, we mask 15% of the adjunct tokens (with the token [MASK]); then the sentence is fed into BERT to produce new adjunct tokens. The adjunct tokens that have not yet been rewritten temporarily remain in the sentence.

To further illustrate the above two steps, we give an instance in Figure 4. In this instance, we set m to 1.0, which means all the adjunct tokens will be rewritten. The final output is "Prime minister Blair is reported to the meeting with the leaders", which shares its labels with the original event in the prototype. Evidently, some adjunct tokens are preserved even though m is 1.0.
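The iterative rewriting loop might look like the following sketch. The checkpoint name is carried over from the fine-tuning sketch above, and the adjunct positions (tokens belonging to neither the trigger nor any argument) are assumed to be computed beforehand.

# Step-by-step adjunct token rewriting as a Cloze task: in each round,
# 15% of the adjunct tokens are masked and re-predicted by the
# fine-tuned BERT; not-yet-rewritten tokens stay in the sentence.
import random
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-ace2005-mlm")  # assumed name
model = BertForMaskedLM.from_pretrained("bert-ace2005-mlm").eval()

def rewrite_adjuncts(token_ids, adjunct_positions, m=0.4):
    """token_ids: list of WordPiece ids; adjunct_positions: indices of
    adjunct tokens; m: proportion of adjunct tokens to rewrite."""
    todo = random.sample(adjunct_positions,
                         int(m * len(adjunct_positions)))
    while todo:
        # Mask 15% of the adjunct tokens per round (at least one).
        batch = todo[:max(1, int(0.15 * len(adjunct_positions)))]
        for i in batch:
            token_ids[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(torch.tensor([token_ids])).logits[0]
        for i in batch:
            token_ids[i] = int(logits[i].argmax())  # predicted replacement
        todo = todo[len(batch):]
    return token_ids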
We score a generated sample from two aspects: the former indicates the fluency of the generation, and the latter reflects the differences between the generated data and the original data.

Perplexity (PPL) Unlike the logarithmic masked perplexity (Devlin et al., 2018), we take the average probability of the rewritten adjunct tokens as the perplexity of a generated sentence S′:

$$\mathrm{PPL}(S') = \frac{1}{|A(S')|} \sum_{t \in A(S')} P(t),$$

where A(S′) is the set of adjunct tokens in S′ that have been rewritten.

Distance (DIS) We measure the distance between S′ and the dataset D with cosine similarity:

$$\mathrm{DIS}(S', D) = 1 - \frac{1}{|D|} \sum_{S \in D} \frac{B(S') \cdot B(S)}{\|B(S')\| \times \|B(S)\|}.$$

Unlike embedding the arguments with ELMo, here we utilize BERT to embed the whole sentence, taking the embedding of the first token [CLS] as the sentence embedding.

Both PPL and DIS are bounded in [0, 1]. We consider that generated samples of high quality should have both a low PPL and a low DIS. Therefore, we define the quality function as:

$$Q(S') = 1 - \big[\lambda \, \mathrm{PPL}(S') + (1 - \lambda) \, \mathrm{DIS}(S', D)\big].$$
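A sketch of the scorer under these definitions (sentence embeddings via the [CLS] vector, λ as defined above; function and variable names are illustrative):

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").eval()

def cls_embedding(sentence):
    """Sentence embedding: the [CLS] vector of the last BERT layer."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return encoder(**enc).last_hidden_state[0, 0]

def quality(rewritten_probs, generated, dataset, lam=0.5):
    """Q(S') = 1 - [lam * PPL(S') + (1 - lam) * DIS(S', D)].

    rewritten_probs: model probabilities P(t) of the rewritten adjunct
    tokens of S'; generated: the sentence S'; dataset: the sentences D."""
    ppl = sum(rewritten_probs) / len(rewritten_probs)
    g = cls_embedding(generated)
    cos = torch.nn.functional.cosine_similarity
    dis = 1 - sum(float(cos(g, cls_embedding(s), dim=0))
                  for s in dataset) / len(dataset)
    return 1 - (lam * ppl + (1 - lam) * dis)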
Model        | Trigger Ident. (P/R/F)  | Trigger Class. (P/R/F)  | Argument Ident. (P/R/F) | Argument Class. (P/R/F)
Cross Event  | N/A                     | 68.7 / 68.9 / 68.8      | 50.9 / 49.7 / 50.3      | 45.1 / 44.1 / 44.6
Cross Entity | N/A                     | 72.9 / 64.3 / 68.3      | 53.4 / 52.9 / 53.1      | 51.6 / 45.5 / 48.3
Max Entropy  | 76.9 / 65.0 / 70.4      | 73.7 / 62.3 / 67.5      | 69.8 / 47.9 / 56.8      | 64.7 / 44.4 / 52.7
DMCNN        | 80.4 / 67.7 / 73.5      | 75.6 / 63.6 / 69.1      | 68.8 / 51.9 / 59.1      | 62.2 / 46.9 / 53.5
JRNN         | 68.5 / 75.7 / 71.9      | 66.0 / 73.0 / 69.3      | 61.4 / 64.2 / 62.8      | 54.2 / 56.7 / 55.4
DMCNN-DS     | 79.7 / 69.6 / 74.3      | 75.7 / 66.0 / 70.5      | 71.4 / 56.9 / 63.3      | 62.8 / 50.1 / 55.7
ANN-FN       | N/A                     | 79.5 / 60.7 / 68.8      | N/A                     | N/A
ANN-AugATT   | N/A                     | 78.0 / 66.3 / 71.7      | N/A                     | N/A
PLMEE(-)     | 84.8 / 83.7 / 84.2      | 81.0 / 80.4 / 80.7      | 71.5 / 59.2 / 64.7      | 61.7 / 53.9 / 57.5
PLMEE        | 84.8 / 83.7 / 84.2      | 81.0 / 80.4 / 80.7      | 71.4 / 60.1 / 65.3      | 62.3 / 54.2 / 58.0

Table 2: Precision (P), Recall (R) and F1 (F) of trigger identification/classification and argument identification/classification on the test set (%). PLMEE(-) and PLMEE share the same trigger extractor, so their trigger results are identical.
As in previous works (Li et al., 2013b; Chen et al., 2015; Hong et al., 2011), we take the 40 newswire documents as the test set, 30 other documents as the validation set, and the remaining 529 documents as the training set. However, different from previous works, we adopt the following criteria to evaluate the correctness of each predicted event mention:

1. A trigger prediction is correct only if its span and type match the golden labels.

2. An argument prediction is correct only if its span and all the roles it plays match the golden labels.

It is worth noting that all the predicted roles of an argument are required to match the golden labels, not just one of them. We adopt Precision (P), Recall (R) and F measure (F1) as the evaluation metrics.
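Under these criteria, the correctness checks reduce to simple set membership tests; a hypothetical sketch (the tuple encoding of predictions is an assumption):

def trigger_correct(pred, gold_triggers):
    """pred: (start, end, event_type); gold_triggers: set of such triples.
    Both the span and the type must match a golden trigger."""
    return pred in gold_triggers

def argument_correct(pred, gold_arguments):
    """pred: (start, end, frozenset_of_roles); all roles the argument
    plays must match the golden labels, not just one of them."""
    return pred in gold_arguments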
5.1 Results of Event Extraction

We take several classic previous works for comparison, and divide them into three categories:

Feature based methods Document-level information is utilized in Cross Event (Liao and Grishman, 2010) to assist event extraction, while Cross Entity (Hong et al., 2011) uses cross-entity inference. Max Entropy (Li et al., 2013a) extracts triggers and arguments together based on structured prediction.

Neural based methods DMCNN (Chen et al., 2015) first adopts a dynamic multi-pooling CNN to extract sentence-level features automatically. JRNN (Nguyen et al., 2016) proposes a joint framework based on bidirectional RNNs for event extraction.

External resource based methods DMCNN-DS (Chen et al., 2017) uses Freebase to label potential events in an unsupervised corpus by distant supervision. ANN-FN (Liu et al., 2016) improves extraction with additional events automatically detected from FrameNet, while ANN-AugATT (Liu et al., 2017) exploits argument information via supervised attention mechanisms to improve the performance further.

In order to verify the effectiveness of loss re-weighting, two groups of experiments are conducted for comparison: one in which the loss function is simply averaged over all classifiers' outputs (indicated as PLMEE(-)), and one in which the loss is re-weighted based on role importance (indicated as PLMEE).

Table 2 compares the results of the aforementioned models with PLMEE on the test set. As is shown, in both the trigger extraction task and the argument extraction task, PLMEE(-) achieves the best results among all the compared methods. The improvement on trigger extraction is quite significant, with a sharp increase of nearly 10% in F1 score, while the improvement in argument extraction is less obvious, at about 2%. This is probably due to the more rigorous evaluation metric we have adopted, as well as the difficulty of the argument extraction task. Moreover, compared with feature based methods, neural based methods achieve better performance, and the same observation holds when comparing external resource based methods with neural based methods. This demonstrates that external resources are useful for improving event extraction.
Prototype: President Bush is going to be meeting with several Arab leaders

m     Generated Event
0.2   Russian President Putin is going to the meeting with the Arab leaders
0.4   The president is reported to be meeting with an Arab counterpart
0.6   Mr. Bush is summoned to a meeting with some Shiite Muslim groups
0.8   The president is attending to the meeting with the Palestinians
1.0   Prime minister Blair is reported to the meeting with the leaders

Table 3: Example samples generated with different proportions of rewritten adjunct tokens. In the original table, italics indicate arguments and bold indicates the trigger.
In addition, the PLMEE model achieves better results on the argument extraction task than the PLMEE(-) model (an improvement of 0.6% in F1 for identification and 0.5% for classification), which means that re-weighting the loss can effectively improve the performance.

5.2 Case Study

Table 3 illustrates a prototype and its generations with the parameter m ranging from 0.2 to 1.0. We can observe that the arguments after replacement match the context of the prototype relatively well, which indicates that they resemble the original ones semantically.

On the other hand, rewriting the adjunct tokens can smooth the generated data and expand their diversity. However, since there is no explicit guidance, this step can also introduce unpredictable noise, making the generation less fluent than expected.

5.3 Automatic Evaluation of Generation

So far, there are mainly three aspects of the generation method that can have significant impacts on the performance of the extraction model: the amount of generated samples (represented by n, the size of the generated data as a multiple of the dataset size), the proportion m of rewritten adjunct tokens, and the quality of the generated samples. The former two factors are controllable in the generation process. Specifically, we can reuse a prototype and obtain a variety of argument combinations via similarity-based replacement, which brings different contexts for rewriting adjunct tokens. Moreover, the proportion of rewritten adjunct tokens can be adjusted for further variation. Although the quality of the generation cannot be controlled arbitrarily, it can be quantified by the score function Q, so that samples of higher quality can be picked out and added to the training set. By varying λ in Q, different selection strategies can be used to screen the generated samples.

We first tuned the former two parameters on the development set through grid search. Specifically, we set m ranging from 0.2 to 1.0 with an interval of 0.2, and set n to 0.5, 1.0 and 2.0, while keeping the other parameters of the generation process unchanged. By analyzing the results, we find that the best performance of PLMEE on both trigger extraction and argument extraction is achieved with m = 0.4 and n = 1.0. This suggests that neither too few generated samples nor too many is a good choice for extraction: too few have limited influence, while too many can bring more noise that disturbs the distribution of the dataset. We use these parameter settings in the following experiments.

We also investigate the effectiveness of the sample selection approach by comparing three groups with different selection strategies. We generate a total of four times the size of the ACE2005 dataset using our generation method with m = 0.4, and pick out one quarter of the samples (n = 1.0) with λ being 0, 0.5 and 1.0 respectively. When λ is 0 or 1.0, either perplexity or distance alone determines the quality. We find that the selection method with λ = 0.5 in the quality function picks out samples that are more advantageous for promoting extraction performance.

Model      Trigger (%)   Argument (%)
PLMEE      80.7          58.0
PLMEE(+)   81.1          58.9

Table 4: F1 scores of trigger classification and argument classification on the test set.

Finally, we incorporate the above generated data with the ACE2005 dataset and investigate the effectiveness of our generation method on the test set.
In Table 4, PLMEE(+) denotes the PLMEE model trained with the extra generated samples. The results illustrate that, with our event generation method, the PLMEE model achieves the state-of-the-art result in event extraction.

5.4 Limitation

By comparing the annotations of generated samples and manually labeled samples, we find that one issue of our generation method is that the roles may deviate, because the semantics can change considerably when only a few adjunct tokens are rewritten. Take Figure 5 as an example: the roles played by the arguments "Pittsburgh" and "Boston" should be "Destination" and "Origin", rather than the opposite as inherited from the prototype. This is because the token "from" has been replaced with the token "for", while "drive to" has been replaced with "return from".

Figure 5: One of the generated samples with wrong annotations.
Prototype: "Leave from Niagara Falls and drive to Toronto, on 85 miles"; trigger: leave; event type: Movement.Transport; arguments: Niagara Falls (Origin), Toronto (Destination).
Generation: "Leave for Pittsburgh and return from Boston in 200 miles"; trigger: leave (correct); event type: Movement.Transport; arguments: Pittsburgh (labeled Origin, wrong), Boston (labeled Destination, wrong).

On the other hand, there are still limitations in our work. Events of the same type often share similarities, and co-occurring roles tend to hold a tight relation. Such features are ignored in our model, but they deserve more investigation for improving the extraction model. In addition, although our generation method can control the number of generated samples and filter them by quality, it still suffers from the deviation of roles, much like distant supervision. Therefore, in future work, we will incorporate the relations between events and between arguments into pre-trained language models, and take effective measures to overcome the role deviation problem in generation.

Acknowledgments

The work was sponsored by the National Key Research and Development Program of China under Grant No. 2018YFB0204300, and the National Natural Science Foundation of China under Grants No. 61872376 and No. 61806216.

References

Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. Automatically labeled data generation for large scale event extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 409–419.
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu. 2011. Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1127–1136. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R. Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 258–268.

Peifeng Li, Qiaoming Zhu, and Guodong Zhou. 2013a. Joint modeling of argument identification and role determination in Chinese event extraction with discourse-level information. In Twenty-Third International Joint Conference on Artificial Intelligence.

Qi Li, Heng Ji, and Liang Huang. 2013b. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 73–82.

Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 789–797. Association for Computational Linguistics.

Bing Liu, Longhua Qian, Hongling Wang, and Guodong Zhou. 2010. Dependency-driven feature-based learning for extracting protein-protein interactions from biomedical text. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 757–765. Association for Computational Linguistics.

Shulin Liu, Yubo Chen, Shizhu He, Kang Liu, and Jun Zhao. 2016. Leveraging FrameNet to improve automatic event detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2134–2143.

Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789–1798.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Jun'ichi Tsujii. 2009. A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 121–130. Association for Computational Linguistics.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Hang Yang, Yubo Chen, Kang Liu, Yang Xiao, and Jun Zhao. 2018. DCFEE: A document-level Chinese financial event extraction system based on automatically labeled training data. In Proceedings of ACL 2018, System Demonstrations, pages 50–55.

Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2018. Scale up event extraction learning via automatic training data generation. In Thirty-Second AAAI Conference on Artificial Intelligence.