Speech and Language Processing. Daniel Jurafsky & James H. Martin. All rights reserved. Draft of August 23, 2019.

CHAPTER 17
Information Extraction

    I am the very model of a modern Major-General,
    I've information vegetable, animal, and mineral,
    I know the kings of England, and I quote the fights historical
    From Marathon to Waterloo, in order categorical.
        Gilbert and Sullivan, Pirates of Penzance

Imagine that you are an analyst with an investment firm that tracks airline stocks. You're given the task of determining the relationship (if any) between airline announcements of fare increases and the behavior of their stocks the next day. Historical data about stock prices is easy to come by, but what about the airline announcements? You will need to know at least the name of the airline, the nature of the proposed fare hike, the dates of the announcement, and possibly the response of other airlines. Fortunately, these can all be found in news articles like this one:

    Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

This chapter presents techniques for extracting limited kinds of semantic content from text. This process of information extraction (IE) turns the unstructured information embedded in texts into structured data, for example for populating a relational database to enable further processing.

We begin with the first step in most IE tasks, finding the proper names or named entities in a text. The task of named entity recognition (NER) is to find each mention of a named entity in the text and label its type. What constitutes a named entity type is task specific; people, places, and organizations are common, but gene or protein names (Cohen and Demner-Fushman, 2014) or financial asset classes might be relevant for some tasks. Once all the named entities in a text have been extracted, they can be linked together in sets corresponding to real-world entities, inferring, for example, that mentions of United Airlines and United refer to the same company. This is the joint task of coreference resolution and entity linking, which we defer till Chapter 21.

Next, we turn to the task of relation extraction: finding and classifying semantic relations among the text entities. These are often binary relations like child-of, employment, part-whole, and geospatial relations. Relation extraction has close links to populating a relational database.

Finally, we discuss three tasks related to events. Event extraction is finding events in which these entities participate, like, in our sample text, the fare increases by United and American and the reporting events said and cite. Event coreference (Chapter 21) is needed to figure out which event mentions in a text refer to the same event; in our running example the two instances of increase and the phrase the move all refer to the same event.

To figure out when the events in a text happened, we extract temporal expressions like days of the week (Friday and Thursday), relative expressions like two days from now or next year, and times such as 3:30 P.M.
These expressions must be normalized onto specific calendar dates or times of day to situate events in time. In our sample task, this will allow us to link Friday to the time of United's announcement and Thursday to the previous day's fare increase, and to produce a timeline in which United's announcement follows the fare increase and American's announcement follows both of those events.

Finally, many texts describe recurring stereotypical events or situations. The task of template filling is to find such situations in documents and fill in the template slots. These slot-fillers may consist of text segments extracted directly from the text, or concepts like times, amounts, or ontology entities that have been inferred from text elements through additional processing.

Our airline text is an example of this kind of stereotypical situation, since airlines often raise fares and then wait to see if competitors follow along. In this situation, we can identify United as a lead airline that initially raised its fares, $6 as the amount, Thursday as the increase date, and American as an airline that followed along, leading to a filled template like the following.

    FARE-RAISE ATTEMPT:
        LEAD AIRLINE:    UNITED AIRLINES
        AMOUNT:          $6
        EFFECTIVE DATE:  2006-10-26
        FOLLOWER:        AMERICAN AIRLINES

17.1 Named Entity Recognition

The first step in information extraction is to detect the entities in the text. A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization. The term is commonly extended to include things that aren't entities per se, including dates, times, and other kinds of temporal expressions, and even numerical expressions like prices. Here's the sample text introduced earlier with the named entities marked:

    Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

The text contains 13 mentions of named entities, including 5 organizations, 4 locations, 2 times, 1 person, and 1 mention of money. In addition to their use in extracting events and the relationships between participants, named entities are useful for many other language processing tasks. In sentiment analysis we might want to know a consumer's sentiment toward a particular entity. Entities are a useful first stage in question answering, or for linking text to information in structured knowledge sources like Wikipedia. Figure 17.1 shows typical generic named entity types. Many applications will also need to use specific entity types like proteins, genes, commercial products, or works of art.

    Type                 Tag  Sample Categories             Example sentences
    People               PER  people, characters            Turing is a giant of computer science.
    Organization         ORG  companies, sports teams       The IPCC warned about the cyclone.
    Location             LOC  regions, mountains, seas      The Mt. Sanitas loop is in Sunshine Canyon.
    Geo-Political Entity GPE  countries, states, provinces  Palo Alto is raising the fees for parking.
    Facility             FAC  bridges, buildings, airports  Consider the Golden Gate Bridge.
    Vehicles             VEH  planes, trains, automobiles   It was a classic Ford Falcon.
    Figure 17.1: A list of generic named entity types with the kinds of entities they refer to.

Named entity recognition means finding spans of text that constitute proper names and then classifying the type of the entity. Recognition is difficult partly because of the ambiguity of segmentation: we need to decide what's an entity and what isn't, and where the boundaries are. Another difficulty is caused by type ambiguity. The mention JFK can refer to a person, the airport in New York, or any number of schools, bridges, and streets around the United States. Some examples of this kind of cross-type confusion are given in Figures 17.2 and 17.3.

    Name           Possible Categories
    Washington     Person, Location, Political Entity, Organization, Vehicle
    Downing St.    Location, Organization
    IRA            Person, Organization, Monetary Instrument
    Louis Vuitton  Person, Organization, Commercial Product
    Figure 17.2: Common categorical ambiguities associated with various proper names.

    [PER Washington] was born into slavery on the farm of James Burroughs.
    [ORG Washington] went up 2 games to 1 in the four-game series.
    Blair arrived in [LOC Washington] for what may well be his last state visit.
    In June, [GPE Washington] passed a primary seatbelt law.
    The [VEH Washington] had proved to be a leaky ship, every passage I made.
    Figure 17.3: Examples of type ambiguities in the use of the name Washington.

17.1.1 NER as Sequence Labeling

The standard algorithm for named entity recognition treats it as a word-by-word sequence labeling task, in which the assigned tags capture both the boundary and the type. A sequence classifier like an MEMM/CRF or a bi-LSTM is trained to label the tokens in a text with tags that indicate the presence of particular kinds of named entities. Consider the following simplified excerpt from our running example.

    [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.

Figure 17.4 shows the same excerpt represented with IOB tagging. In IOB tagging we introduce a tag for the beginning (B) and inside (I) of each entity type, and one for tokens outside (O) any entity. The number of tags is thus 2n + 1, where n is the number of entity types. IOB tagging can represent exactly the same information as the bracketed notation.

    Words        IOB Label  IO Label
    American     B-ORG      I-ORG
    Airlines     I-ORG      I-ORG
    ,            O          O
    a            O          O
    unit         O          O
    of           O          O
    AMR          B-ORG      I-ORG
    Corp.        I-ORG      I-ORG
    ,            O          O
    immediately  O          O
    matched      O          O
    the          O          O
    move         O          O
    ,            O          O
    spokesman    O          O
    Tim          B-PER      I-PER
    Wagner       I-PER      I-PER
    said         O          O
    .            O          O
    Figure 17.4: Named entity tagging as a sequence model, showing IOB and IO encodings.

We've also shown IO tagging, which loses some information by eliminating the B tag. Without the B tag, IO tagging is unable to distinguish between two entities of the same type that are right next to each other. Since this situation doesn't arise very often (usually there is at least some punctuation or other delimiter), IO tagging may be sufficient, and has the advantage of using only n + 1 tags.
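To make the encoding concrete, here is a minimal sketch (not from the chapter; the function name and the span-based input format are our own) that converts bracketed entity spans into the IOB tags of Fig. 17.4:

```python
def spans_to_iob(tokens, spans):
    """Convert labeled entity spans into per-token IOB tags.

    tokens: list of word strings
    spans:  list of (start, end, type) tuples over token indices, end exclusive,
            e.g. (0, 2, "ORG") for "American Airlines"
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype              # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype              # inside the entity
    return tags

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp.", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
spans = [(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")]
print(list(zip(tokens, spans_to_iob(tokens, spans))))
```

The IO variant of the encoding would simply replace the "B-" prefix with "I-".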
In the following three sections we introduce the three standard families of algorithms for NER tagging: feature-based (MEMM/CRF), neural (bi-LSTM), and rule-based.

17.1.2 A feature-based algorithm for NER

The first approach is to extract features and train an MEMM or CRF sequence model of the type we saw for part-of-speech tagging in Chapter 8. Figure 17.5 lists standard features used in such feature-based systems.

    identity of w_i, identity of neighboring words
    embeddings for w_i, embeddings for neighboring words
    part of speech of w_i, part of speech of neighboring words
    base-phrase syntactic chunk label of w_i and neighboring words
    presence of w_i in a gazetteer
    w_i contains a particular prefix (from all prefixes of length <= 4)
    w_i contains a particular suffix (from all suffixes of length <= 4)
    w_i is all upper case
    word shape of w_i, word shape of neighboring words
    short word shape of w_i, short word shape of neighboring words
    presence of hyphen
    Figure 17.5: Typical features for a feature-based NER system.

We've seen many of these features before in the context of part-of-speech tagging, particularly for tagging unknown words. This is not surprising, as many unknown words are in fact named entities. Word shape features are thus particularly important in the context of NER. Recall that word shape features represent the abstract letter pattern of the word by mapping lower-case letters to 'x', upper-case to 'X', numbers to 'd', and retaining punctuation. Thus for example I.M.F would map to X.X.X and DC10-30 would map to XXdd-dd. A second class of shorter word shape features is also used. In these features consecutive character types are collapsed, so DC10-30 would be mapped to Xd-d, but I.M.F would still map to X.X.X. This feature by itself accounts for a considerable part of the success of feature-based NER systems for English news text. Shape features are also particularly important in recognizing names of proteins and genes in biological texts.

For example, the named entity token L'Occitane would generate the following non-zero valued features:

    prefix(w_i) = L          suffix(w_i) = tane
    prefix(w_i) = L'         suffix(w_i) = ane
    prefix(w_i) = L'O        suffix(w_i) = ne
    prefix(w_i) = L'Oc       suffix(w_i) = e
    word-shape(w_i) = X'Xxxxxxxx
    short-word-shape(w_i) = X'Xx

A gazetteer is a list of place names, often providing millions of entries for locations with detailed geographical and political information. A related resource is name lists; the United States Census Bureau (www.census.gov) also provides extensive lists of first names and surnames derived from its decadal census in the U.S. Similar lists of corporations, commercial products, and all manner of things biological and mineral are also available from a variety of sources. Gazetteer and name features are typically implemented as a binary feature for each name list. Unfortunately, such lists can be difficult to create and maintain, and their usefulness varies considerably. While gazetteers can be quite effective, lists of persons and organizations are not always helpful (Mikheev et al., 1999).

Feature effectiveness depends on the application, genre, media, and language. For example, shape features, critical for English newswire texts, are of little use with automatic speech recognition transcripts, other non-edited or informally edited sources, or for languages like Chinese that don't use orthographic case. The features in Fig. 17.5 should therefore be thought of as only a starting point.
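As a concrete illustration of a few of the features in Fig. 17.5, here is a simplified sketch of per-token feature extraction (our own code, not the chapter's; part-of-speech, chunk, and embedding features are omitted, and the gazetteer is assumed to be just a set of lower-cased names):

```python
def word_shape(token):
    """Map letters and digits to X/x/d, retaining punctuation: DC10-30 -> XXdd-dd."""
    return "".join("X" if c.isupper() else "x" if c.islower() else
                   "d" if c.isdigit() else c for c in token)

def short_word_shape(token):
    """Collapse runs of identical shape characters: XXdd-dd -> Xd-d."""
    collapsed = []
    for c in word_shape(token):
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(collapsed)

def token_features(tokens, i, gazetteer=frozenset()):
    w = tokens[i]
    feats = {
        "word": w.lower(),
        "shape": word_shape(w),                 # e.g. L'Occitane -> X'Xxxxxxxx
        "short_shape": short_word_shape(w),     # e.g. L'Occitane -> X'Xx
        "all_upper": w.isupper(),
        "has_hyphen": "-" in w,
        "in_gazetteer": w.lower() in gazetteer,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
    for k in range(1, 5):                       # prefixes and suffixes of length <= 4
        feats["prefix" + str(k)] = w[:k]
        feats["suffix" + str(k)] = w[-k:]
    return feats
```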
Figure 17.6 illustrates the result of adding part-of-speech tags, syntactic base-phrase chunk tags, and some shape information to our earlier example. Given such a training set, a sequence classifier like an MEMM can be trained to label new sentences. Figure 17.7 illustrates the operation of such a sequence labeler at the point where the token Corp. is next to be labeled. If we assume a context window that includes the two preceding and following words, then the features available to the classifier are those shown in the boxed area.

    Word         POS   Chunk   Short shape  Label
    American     NNP   B-NP    Xx           B-ORG
    Airlines     NNPS  I-NP    Xx           I-ORG
    ,            ,     O       ,            O
    a            DT    B-NP    x            O
    unit         NN    I-NP    x            O
    of           IN    B-PP    x            O
    AMR          NNP   B-NP    X            B-ORG
    Corp.        NNP   I-NP    Xx.          I-ORG
    ,            ,     O       ,            O
    immediately  RB    B-ADVP  x            O
    matched      VBD   B-VP    x            O
    the          DT    B-NP    x            O
    move         NN    I-NP    x            O
    ,            ,     O       ,            O
    spokesman    NN    B-NP    x            O
    Tim          NNP   I-NP    Xx           B-PER
    Wagner       NNP   I-NP    Xx           I-PER
    said         VBD   B-VP    x            O
    .            .     O       .            O
    Figure 17.6: Word-by-word feature encoding for NER.

    Figure 17.7: Named entity recognition as sequence labeling. The features available to the classifier during training and classification are those in the boxed area around the token Corp.

17.1.3 A neural algorithm for NER

The standard neural algorithm for NER is based on the bi-LSTM introduced in Chapter 9. Recall that in that model, word and character embeddings are computed for input word w_i. These are passed through a left-to-right LSTM and a right-to-left LSTM, whose outputs are concatenated (or otherwise combined) to produce a single output layer at position i. In the simplest method, this layer can then be passed directly to a softmax that creates a probability distribution over all NER tags, and the most likely tag is chosen as t_i. For named entity tagging this greedy approach to decoding is insufficient, since it doesn't allow us to impose the strong constraints neighboring tokens place on each other (e.g., the tag I-PER must follow another I-PER or B-PER). Instead a CRF layer is normally used on top of the bi-LSTM output, and the Viterbi decoding algorithm is used to decode. Fig. 17.8 shows a sketch of the algorithm.

    Figure 17.8: Putting it all together: character embeddings and words in a bi-LSTM sequence model. Character and GloVe word embeddings feed left-to-right and right-to-left LSTMs, whose outputs are concatenated and passed to a CRF output layer. After Lample et al. (2016).
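The following is a minimal PyTorch sketch of such a tagger (our own illustration, not the reference implementation of Lample et al.); for simplicity it uses a per-token softmax rather than the CRF layer, and it omits the character embeddings and pretrained GloVe vectors shown in Fig. 17.8:

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bi-LSTM tagger that produces per-token tag scores (logits)."""

    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        states, _ = self.lstm(emb)         # left-to-right and right-to-left
                                           # hidden states, concatenated
        return self.out(states)            # (batch, seq_len, num_tags)

# Training would minimize a cross-entropy loss over these logits; greedy decoding
# is logits.argmax(-1), which a CRF layer plus Viterbi decoding would replace.
```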
17.1.4 Rule-based NER

While machine learned (neural or MEMM/CRF) sequence models are the norm in academic research, commercial approaches to NER are often based on pragmatic combinations of lists and rules, with some smaller amount of supervised machine learning (Chiticariu et al., 2013). For example, IBM System T is a text understanding architecture in which a user specifies complex declarative constraints for tagging tasks in a formal query language that includes regular expressions, dictionaries, semantic constraints, NLP operators, and table structures, all of which the system compiles into an efficient extractor (Chiticariu et al., 2018).

One common approach is to make repeated rule-based passes over a text, allowing the results of one pass to influence the next. The stages typically first involve the use of rules that have extremely high precision but low recall. Subsequent stages employ more error-prone statistical methods that take the output of the first pass into account.

1. First, use high-precision rules to tag unambiguous entity mentions.
2. Then, search for substring matches of the previously detected names.
3. Consult application-specific name lists to identify likely name entity mentions from the given domain.
4. Finally, apply probabilistic sequence labeling techniques that make use of the tags from previous stages as additional features.

The intuition behind this staged approach is twofold. First, some of the entity mentions in a text will be more clearly indicative of a given entity's class than others. Second, once an unambiguous entity mention is introduced into a text, it is likely that subsequent shortened versions will refer to the same entity (and thus the same type of entity).

17.1.5 Evaluation of Named Entity Recognition

The familiar metrics of recall, precision, and F1 measure are used to evaluate NER systems. Remember that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled; precision is the ratio of the number of correctly labeled responses to the total labeled; and F-measure is the harmonic mean of the two. For named entities, the entity rather than the word is the unit of response. Thus in the example in Fig. 17.6, the two entities Tim Wagner and AMR Corp. and the non-entity said would each count as a single response.

The fact that named entity tagging has a segmentation component which is not present in tasks like text categorization or part-of-speech tagging causes some problems with evaluation. For example, a system that labeled American but not American Airlines as an organization would cause two errors, a false positive for O and a false negative for I-ORG. In addition, using entities as the unit of response but words as the unit of training means that there is a mismatch between the training and test conditions.
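A minimal sketch of entity-level scoring (our own code, with a simplified span-set input format) makes the unit-of-response issue concrete: an entity counts as correct only if both its boundaries and its type match the gold annotation.

```python
def entity_prf(gold_spans, pred_spans):
    """Entity-level precision, recall, and F1.

    gold_spans, pred_spans: sets of (start, end, type) tuples for one document.
    """
    tp = len(gold_spans & pred_spans)             # exact boundary + type matches
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Predicting "American" instead of "American Airlines" earns no credit for that entity:
print(entity_prf({(0, 2, "ORG"), (6, 8, "ORG")},
                 {(0, 1, "ORG"), (6, 8, "ORG")}))   # (0.5, 0.5, 0.5)
```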
17.2 Relation Extraction

Next on our list of tasks is to discern the relationships that exist among the detected entities. Let's return to our sample airline text:

    Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

The text tells us, for example, that Tim Wagner is a spokesman for American Airlines, that United is a unit of UAL Corp., and that American is a unit of AMR. These binary relations are instances of more generic relations such as part-of or employs that are fairly frequent in news-style texts. Figure 17.9 lists the 17 relations used in the ACE relation extraction evaluations and Fig. 17.10 shows some sample relations. We might also extract more domain-specific relations such as the notion of an airline route. For example, from this text we can conclude that United has routes to Chicago, Dallas, Denver, and San Francisco.

    Figure 17.9: The 17 relations used in the ACE relation extraction task.

    Relations              Types    Examples
    Physical-Located       PER-GPE  He was in Tennessee
    Part-Whole-Subsidiary  ORG-ORG  XYZ, the parent company of ABC
    Person-Social-Family   PER-PER  Yoko's husband John
    Org-AFF-Founder        PER-ORG  Steve Jobs, co-founder of Apple
    Figure 17.10: Semantic relations with examples and the named entity types they involve.

These relations correspond nicely to the model-theoretic notions we introduced in Chapter 15 to ground the meanings of the logical forms. That is, a relation consists of a set of ordered tuples over elements of a domain. In most standard information-extraction applications, the domain elements correspond to the named entities that occur in the text, to the underlying entities that result from coreference resolution, or to entities selected from a domain ontology. Figure 17.11 shows a model-based view of the set of entities and relations that can be extracted from our running example.

    Domain                                                       D = {a, b, c, d, e, f, g, h, i}
        United, UAL, American Airlines, AMR                      a, b, c, d
        Tim Wagner                                               e
        Chicago, Dallas, Denver, and San Francisco               f, g, h, i
    Classes
        United, UAL, American, and AMR are organizations         Org = {a, b, c, d}
        Tim Wagner is a person                                   Pers = {e}
        Chicago, Dallas, Denver, and San Francisco are places    Loc = {f, g, h, i}
    Relations
        United is a unit of UAL; American is a unit of AMR       PartOf = {(a, b), (c, d)}
        Tim Wagner works for American Airlines                   OrgAff = {(c, e)}
        United serves Chicago, Dallas, Denver, and San Francisco Serves = {(a, f), (a, g), (a, h), (a, i)}
    Figure 17.11: A model-based view of the relations and entities in our sample text.

Notice how this model-theoretic view subsumes the NER task as well; named entity recognition corresponds to the identification of a class of unary relations.

Sets of relations have been defined for many other domains as well. For example, UMLS, the Unified Medical Language System from the US National Library of Medicine, has a network that defines 134 broad subject categories, entity types, and 54 relations between the entities, such as the following:

    Entity                   Relation     Entity
    Injury                   disrupts     Physiological Function
    Bodily Location          location-of  Biologic Function
    Anatomical Structure     part-of      Organism
    Pharmacologic Substance  causes       Pathological Function
    Pharmacologic Substance  treats       Pathologic Function

Given a medical sentence like this one:

(17.1) Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes

we could thus extract the UMLS relation:

    Echocardiography, Doppler   Diagnoses   Acquired stenosis

Wikipedia also offers a large supply of relations, drawn from infoboxes, structured tables associated with certain Wikipedia articles. For example, the Wikipedia infobox for Stanford includes structured facts like state = "California" or president = "Mark Tessier-Lavigne". These facts can be turned into relations like president-of or located-in, or into relations in a metalanguage called RDF (Resource Description Framework). An RDF triple is a tuple of entity-relation-entity, called a subject-predicate-object expression. Here's a sample RDF triple:

    subject            predicate  object
    Golden Gate Park   location   San Francisco

For example, the crowdsourced DBpedia (Bizer et al., 2009) is an ontology derived from Wikipedia containing over 2 billion RDF triples. Another dataset from Wikipedia infoboxes, Freebase (Bollacker et al., 2008), now part of Wikidata (www.wikidata.org), has relations like

    people/person/nationality
    location/location/contains

WordNet or other ontologies offer useful ontological relations that express hierarchical relations between words or concepts.
For example, WordNet has the is-a or hypernym relation between classes:

    Giraffe is-a ruminant is-a ungulate is-a mammal is-a vertebrate ...

WordNet also has an instance-of relation between individuals and classes, so that, for example, San Francisco is in the instance-of relation with city. Extracting these relations is an important step in extending or building ontologies.

There are five main classes of algorithms for relation extraction: hand-written patterns, supervised machine learning, semi-supervised (via bootstrapping and via distant supervision), and unsupervised. We'll introduce each of these in the next sections.

17.2.1 Using Patterns to Extract Relations

The earliest and still common algorithm for relation extraction is lexico-syntactic patterns, first developed by Hearst (1992a). Consider the following sentence:

    Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.

Hearst points out that most human readers will not know what Gelidium is, but that they can readily infer that it is a kind of (a hyponym of) red algae, whatever that is. She suggests that the following lexico-syntactic pattern

    NP_0 such as NP_1 {, NP_2 ..., (and|or) NP_i}, i >= 1        (17.2)

implies the following semantics

    for all NP_i, i >= 1, hyponym(NP_i, NP_0)                    (17.3)

allowing us to infer

    hyponym(Gelidium, red algae)                                 (17.4)

Figure 17.12 shows five patterns Hearst (1992a, 1998) suggested for inferring the hyponym relation; we've shown NP_H as the parent (the hypernym).

    NP {, NP}* {,} (and|or) other NP_H        temples, treasuries, and other important civic buildings
    NP_H such as {NP,}* {(or|and)} NP         red algae such as Gelidium
    such NP_H as {NP,}* {(or|and)} NP         such authors as Herrick, Goldsmith, and Shakespeare
    NP_H {,} including {NP,}* {(or|and)} NP   common-law countries, including Canada and England
    NP_H {,} especially {NP,}* {(or|and)} NP  European countries, especially France, England, and Spain
    Figure 17.12: Hand-built lexico-syntactic patterns for finding hypernyms (Hearst 1992a, Hearst 1998).

Modern versions of the pattern-based approach extend it by adding named entity constraints. For example, if our goal is to answer questions about "Who holds what office in which organization?", we can use patterns like the following:

    PER, POSITION of ORG
        George Marshall, Secretary of State of the United States
    PER (named|appointed|chose|etc.) PER Prep? POSITION
        Truman appointed Marshall Secretary of State
    PER [be]? (named|appointed|etc.) Prep? ORG POSITION
        George Marshall was named US Secretary of State

Hand-built patterns have the advantage of high precision, and they can be tailored to specific domains. On the other hand, they are often low-recall, and it's a lot of work to create them for all possible patterns.
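Here is a sketch of how the second Hearst pattern ("NP_H such as NP ...") can be applied in code (our own illustration; it assumes base NP chunks have already been identified, and the span-based input format is hypothetical):

```python
def hearst_such_as(tokens, np_spans):
    """Extract hyponym(NP_i, NP_0) pairs for "NP_0 such as NP_1, ..., (and|or) NP_i".

    tokens:   the sentence as a list of tokens
    np_spans: sorted (start, end) token spans of base NP chunks, end exclusive
    """
    phrase = lambda span: " ".join(tokens[span[0]:span[1]])
    pairs = []
    for idx, np0 in enumerate(np_spans):
        i = np0[1]
        if tokens[i:i + 1] == [","]:                           # optional comma
            i += 1
        if [t.lower() for t in tokens[i:i + 2]] != ["such", "as"]:
            continue
        i += 2
        for hypo in np_spans[idx + 1:]:                        # NPs in the list that follows
            separators = tokens[i:hypo[0]]
            if any(t.lower() not in {",", "and", "or"} for t in separators):
                break                                          # the list has ended
            pairs.append((phrase(hypo), phrase(np0)))
            i = hypo[1]
    return pairs

tokens = ("Agar is a substance prepared from a mixture of red algae , "
          "such as Gelidium , for laboratory or industrial use .").split()
np_spans = [(0, 1), (2, 4), (6, 8), (9, 11), (14, 15), (17, 18), (19, 21)]
print(hearst_such_as(tokens, np_spans))      # [('Gelidium', 'red algae')]
```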
17.2.2 Relation Extraction via Supervised Learning

Supervised machine learning approaches to relation extraction follow a scheme that should be familiar by now. A fixed set of relations and entities is chosen, a training corpus is hand-annotated with the relations and entities, and the annotated texts are then used to train classifiers to annotate an unseen test set.

The most straightforward approach has three steps, illustrated in Fig. 17.13. Step one is to find pairs of named entities (usually in the same sentence). In step two, a filtering classifier is trained to make a binary decision as to whether a given pair of named entities are related (by any relation). Positive examples are extracted directly from all relations in the annotated corpus, and negative examples are generated from within-sentence entity pairs that are not annotated with a relation. In step 3, a classifier is trained to assign a label to the relations that were found by step 2. The use of the filtering classifier can speed up the final classification and also allows the use of distinct feature sets appropriate for each task. For each of the two classifiers, we can use any of the standard classification techniques (logistic regression, neural network, SVM, etc.).

    function FINDRELATIONS(words) returns relations
        relations <- nil
        entities <- FINDENTITIES(words)
        forall entity pairs (e1, e2) in entities do
            if RELATED?(e1, e2)
                relations <- relations + CLASSIFYRELATION(e1, e2)
    Figure 17.13: Finding and classifying the relations among entities in a text.

For feature-based classifiers like logistic regression or random forests the most important step is to identify useful features. Let's consider features for classifying the relationship between American Airlines (Mention 1, or M1) and Tim Wagner (Mention 2, M2) from this sentence:

(17.5) American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

Useful word features include:
• The headwords of M1 and M2 and their concatenation: Airlines, Wagner, Airlines-Wagner
• Bag-of-words and bigrams in M1 and M2: American, Airlines, Tim, Wagner, American Airlines, Tim Wagner
• Words or bigrams in particular positions: M2: -1 spokesman, M2: +1 said
• Bag of words or bigrams between M1 and M2: a, AMR, of, immediately, matched, move, spokesman, the, unit
• Stemmed versions of the same

Embeddings can be used to represent words in any of these features. Useful named entity features include:
• Named-entity types and their concatenation (M1: ORG, M2: PER, M1M2: ORG-PER)
• Entity level of M1 and M2 (from the set NAME, NOMINAL, PRONOUN): M1: NAME (it or he would be PRONOUN), M2: NAME (the company would be NOMINAL)
• Number of entities between the arguments (in this case 1, for AMR)

The syntactic structure of a sentence can also signal relationships among its entities. Syntax is often featured by using strings representing syntactic paths: the (dependency or constituency) path traversed through the tree in getting from one entity to the other.
• Base syntactic chunk sequence from M1 to M2: NP NP PP VP NP NP
• Constituent path between M1 and M2: NP ^ NP ^ S ^ S v NP
• Dependency-tree path: Airlines <-subj matched <-comp said ->subj Wagner

Figure 17.14 summarizes many of the features we have discussed that could be used for classifying the relationship between American Airlines and Tim Wagner from our example text.

    M1 headword            airlines (as a word token or an embedding)
    M2 headword            Wagner
    Word(s) before M1      NONE
    Word(s) after M2       said
    Bag of words between   {a, unit, of, AMR, Inc., immediately, matched, the, move, spokesman}
    M1 type                ORG
    M2 type                PERS
    Concatenated types     ORG-PERS
    Constituent path       NP ^ NP ^ S ^ S v NP
    Base phrase path       NP -> NP -> PP -> NP -> VP -> NP -> NP
    Typed-dependency path  Airlines <-subj matched <-comp said ->subj Wagner
    Figure 17.14: Sample of features extracted while classifying the relation between American Airlines and Tim Wagner; M1 is the first mention, M2 the second.

Neural models for relation extraction similarly treat the task as supervised classification. One option is to use an architecture similar to the one we saw for named entity tagging: a bi-LSTM model with word embeddings as inputs and a single softmax classification of the sentence output as a 1-of-N relation label. Because relations often hold between entities that are far apart in a sentence (or across sentences), it may be possible to get higher performance from algorithms like convolutional nets (dos Santos et al., 2015) or chain or tree LSTMs (Miwa and Bansal 2016, Peng et al. 2017).
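A sketch of how such a feature-based classifier might be assembled with scikit-learn (our own illustration; the mention spans, entity types, and training data are assumed to come from the annotated corpus, and the syntactic-path features are omitted):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_features(tokens, m1, m2, m1_type, m2_type):
    """Feature dictionary for a candidate mention pair (cf. Fig. 17.14).
    m1 and m2 are (start, end) token spans, with m1 preceding m2."""
    between = tokens[m1[1]:m2[0]]
    feats = {
        "m1_head": tokens[m1[1] - 1].lower(),
        "m2_head": tokens[m2[1] - 1].lower(),
        "m1_type": m1_type,
        "m2_type": m2_type,
        "concat_types": m1_type + "-" + m2_type,
        "word_before_m1": tokens[m1[0] - 1].lower() if m1[0] > 0 else "<s>",
        "word_after_m2": tokens[m2[1]].lower() if m2[1] < len(tokens) else "</s>",
        "num_words_between": len(between),
    }
    for w in between:                        # bag of words between the mentions
        feats["between=" + w.lower()] = 1
    return feats

# The same kind of classifier can serve for both the filtering step and the final
# relation-labeling step, trained on different label sets.
relation_classifier = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# relation_classifier.fit(train_feature_dicts, train_labels)
# relation_classifier.predict([pair_features(tokens, m1, m2, "ORG", "PER")])
```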
In general, if the test set is similar enough to the training set, and if there is enough hand-labeled data, supervised relation extraction systems can get high accuracies. But labeling a large training set is extremely expensive, and supervised models are brittle: they don't generalize well to different text genres. For this reason, much research in relation extraction has focused on the semi-supervised and unsupervised approaches we turn to next.

17.2.3 Semisupervised Relation Extraction via Bootstrapping

Supervised machine learning assumes that we have lots of labeled data. Unfortunately, this is expensive. But suppose we just have a few high-precision seed patterns, like those in Section 17.2.1, or perhaps a few seed tuples. That's enough to bootstrap a classifier! Bootstrapping proceeds by taking the entities in the seed pair, and then finding sentences (on the web, or whatever dataset we are using) that contain both entities. From all such sentences, we extract and generalize the context around the entities to learn new patterns. Fig. 17.15 sketches a basic algorithm.

    function BOOTSTRAP(Relation R) returns new relation tuples
        tuples <- Gather a set of seed tuples that have relation R
        iterate
            sentences <- find sentences that contain entities in tuples
            patterns <- generalize the context between and around entities in sentences
            newpairs <- use patterns to grep for more tuples
            newpairs <- newpairs with high confidence
            tuples <- tuples + newpairs
        return tuples
    Figure 17.15: Bootstrapping from seed entity pairs to learn relations.

Suppose, for example, that we need to create a list of airline/hub pairs, and we know only that Ryanair has a hub at Charleroi. We can use this seed fact to discover new patterns by finding other mentions of this relation in our corpus. We search for the terms Ryanair, Charleroi and hub in some proximity. Perhaps we find the following set of sentences:

(17.6) Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport.
(17.7) All flights in and out of Ryanair's Belgian hub at Charleroi airport were grounded on Friday.
(17.8) A spokesman at Charleroi, a main hub for Ryanair, estimated that 8000 passengers had already been affected.

From these results, we can use the context of words between the entity mentions, the words before mention one, the word after mention two, and the named entity types of the two mentions, and perhaps other features, to extract general patterns such as the following:

    / [ORG], which uses [LOC] as a hub /
    / [ORG]'s hub at [LOC] /
    / [LOC], a main hub for [ORG] /

These new patterns can then be used to search for additional tuples.

Bootstrapping systems also assign confidence values to new tuples to avoid semantic drift. In semantic drift, an erroneous pattern leads to the introduction of erroneous tuples, which, in turn, lead to the creation of problematic patterns, and the meaning of the extracted relations 'drifts'. Consider the following example:

(17.9) Sydney has a ferry hub at Circular Quay.

If accepted as a positive example, this expression could lead to the incorrect introduction of the tuple <Sydney, Circular Quay>. Patterns based on this tuple could propagate further errors into the database.

Confidence values for patterns are based on balancing two factors: the pattern's performance with respect to the current set of tuples, and the pattern's productivity in terms of the number of matches it produces in the document collection. More formally, given a document collection D, a current set of tuples T, and a proposed pattern p, we need to track two factors:
• hits_p: the set of tuples in T that p matches while looking in D
• finds_p: the total set of tuples that p finds in D

The following equation balances these considerations (Riloff and Jones, 1999):

    Conf_RlogF(p) = (hits_p / finds_p) * log(finds_p)        (17.10)

This metric is generally normalized to produce a probability.

We can assess the confidence in a proposed new tuple t by combining the evidence supporting it from all the patterns P' that match that tuple in D (Agichtein and Gravano, 2000). One way to combine such evidence is the noisy-or technique. Assume that a given tuple is supported by a subset of the patterns in P, each with its own confidence assessed as above. In the noisy-or model, we make two basic assumptions. First, that for a proposed tuple to be false, all of its supporting patterns must have been in error, and second, that the sources of their individual failures are all independent. If we loosely treat our confidence measures as probabilities, then the probability of any individual pattern p failing is 1 - Conf(p); the probability of all of the supporting patterns for a tuple being wrong is the product of their individual failure probabilities, leaving us with the following equation for our confidence in a new tuple:

    Conf(t) = 1 - PROD over p in P' of (1 - Conf(p))         (17.11)

Setting conservative confidence thresholds for the acceptance of new patterns and tuples during the bootstrapping process helps prevent the system from drifting away from the targeted relation.

17.2.4 Distant Supervision for Relation Extraction

Although text that has been hand-labeled with relation labels is extremely expensive to produce, there are ways to find indirect sources of training data. The distant supervision method of Mintz et al. (2009) combines the advantages of bootstrapping with supervised learning. Instead of just a handful of seeds, distant supervision uses a large database to acquire a huge number of seed examples, creates lots of noisy pattern features from all these examples, and then combines them in a supervised classifier.

For example, suppose we are trying to learn the place-of-birth relationship between people and their birth cities. In the seed-based approach, we might have only 5 examples to start with. But Wikipedia-based databases like DBpedia or Freebase have tens of thousands of examples of many relations, including over 100,000 examples of place-of-birth (<Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>, etc.). The next step is to run named entity taggers on large amounts of text (Mintz et al. (2009) used 800,000 articles from Wikipedia) and extract all sentences that have two named entities that match a tuple, like the following:

    ...Hubble was born in Marshfield...
    ...Einstein, born (1879), Ulm...
    ...Hubble's birthplace in Marshfield...

Training instances can now be extracted from this data, one training instance for each distinct tuple of relation and entity pair, and so on. We can then apply feature-based or neural classification. For feature-based classification, we use standard supervised relation extraction features like the named entity labels of the two mentions, the words and dependency paths in between the mentions, and neighboring words. Each tuple will have features collected from many training instances; the feature vector for a single training instance thus combines evidence from the many different sentences in which its entity pair is mentioned.

17.2.5 Unsupervised Relation Extraction

The great advantage of unsupervised relation extraction is its ability to handle a huge number of relations without having to specify them in advance. The disadvantage is the need to map these large sets of strings into some canonical form for adding to databases or other knowledge sources. Current methods focus heavily on relations expressed with verbs, and so will miss many relations that are expressed nominally.

17.2.6 Evaluation of Relation Extraction

Supervised relation extraction systems are evaluated by using test sets with human-annotated, gold-standard relations and computing precision, recall, and F-measure. Labeled precision and recall require the system to classify the relation correctly, whereas unlabeled methods simply measure a system's ability to detect entities that are related.

Semi-supervised and unsupervised methods are much more difficult to evaluate, since they extract totally new relations from the web or a large text. Because these methods use very large amounts of text, it is generally not possible to run them solely on a small labeled test set, and as a result it's not possible to pre-annotate a gold set of correct instances of relations.

For these methods it's possible to approximate (only) precision by drawing a random sample of relations from the output and having a human check the accuracy of each of these relations. Usually this approach focuses on the tuples to be extracted from a body of text rather than on the relation mentions; systems need not detect every mention of a relation to be scored correctly. Instead, the evaluation is based on the set of tuples occupying the database when the system is finished. That is, we want to know if the system can discover that Ryanair has a hub at Charleroi; we don't really care how many times it discovers it. The estimated precision is then

    P = (# of correctly extracted relation tuples in the sample) / (total # of extracted relation tuples in the sample)

Another approach that gives us a little bit of information about recall is to compute precision at different levels of recall. Assuming that our system is able to rank the relations it produces (by probability, or confidence), we can separately compute precision for the top 1000 new relations, the top 10,000 new relations, the top 100,000, and so on. In each case we take a random sample of that set. This will show us how the precision curve behaves as we extract more and more tuples. But there is no way to directly evaluate recall.

17.3 Extracting Times

Times and dates are a particularly important kind of named entity that play a role in question answering and in calendar and personal assistant applications. In order to reason about times and dates, after we extract these temporal expressions they must be normalized, that is, converted to a standard format. In this section we consider both the extraction and normalization of temporal expressions.

17.3.1 Temporal Expression Extraction

Temporal expressions are those that refer to absolute points in time, relative times, durations, and sets of these. Absolute temporal expressions are those that can be mapped directly to calendar dates, times of day, or both.
Relative temporal expressions map to particular times through some other reference point (as in a week from last Tuesday). Finally, durations denote spans of time at varying levels of granularity (seconds, minutes, days, weeks, centuries, etc.). Figure 17.18 lists some sample temporal expressions in each of these categories.

    Absolute                 Relative                  Durations
    April 24, 1916           yesterday                 four hours
    The summer of '77        next semester             three weeks
    10:15 AM                 two weeks from yesterday  six days
    The 3rd quarter of 2006  last quarter              the last three quarters
    Figure 17.18: Examples of absolute, relational and durational temporal expressions.

Temporal expressions are grammatical constructions that have temporal lexical triggers as their heads. Lexical triggers might be nouns, proper nouns, adjectives, and adverbs; full temporal expressions consist of their phrasal projections: noun phrases, adjective phrases, and adverbial phrases. Figure 17.19 provides examples.

    Category     Examples
    Noun         morning, noon, night, winter, dusk, dawn
    Proper Noun  January, Monday, Ides, Easter, Rosh Hashana, Ramadan, Tet
    Adjective    recent, past, annual, former
    Adverb       hourly, daily, monthly, yearly
    Figure 17.19: Examples of temporal lexical triggers.

Let's look at the TimeML annotation scheme, in which temporal expressions are annotated with an XML tag, TIMEX3, and various attributes to that tag (Pustejovsky et al. 2005, Ferro et al. 2005). The following example illustrates the basic use of this scheme (we defer discussion of the attributes until Section 17.3.2):

    A fare increase initiated <TIMEX3>last week</TIMEX3> by UAL Corp's United Airlines was matched by competitors over <TIMEX3>the weekend</TIMEX3>, marking the second successful fare increase in <TIMEX3>two weeks</TIMEX3>.

The temporal expression recognition task consists of finding the start and end of all of the text spans that correspond to such temporal expressions. Rule-based approaches to temporal expression recognition use cascades of automata to recognize patterns at increasing levels of complexity. Tokens are first part-of-speech tagged, and then larger and larger chunks are recognized from the results of previous stages, based on patterns containing trigger words (e.g., February) or classes (e.g., MONTH). Figure 17.20 gives a fragment from a rule-based system.

    Figure 17.20: Fragment of the pattern rules from the GUTime temporal tagging system in Tarsqi (Verhagen et al., 2005).
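As a toy illustration of this rule-based style (our own regular expressions, far smaller and cruder than a real pattern set like GUTime's), a recognizer might look like this:

```python
import re

MONTH = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"
DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"

TEMPEX = re.compile(rf"""
      {MONTH}\s+\d{{1,2}},\s+\d{{4}}                  # April 24, 1916
    | \d{{1,2}}:\d{{2}}\s*(?:AM|PM|A\.M\.|P\.M\.)     # 10:15 AM
    | (?:last|next|this)\s+(?:week|weekend|month|quarter|semester|year|{DAY})
    | {DAY} | yesterday | today | tomorrow
""", re.IGNORECASE | re.VERBOSE)

def find_temporal_expressions(text):
    return [(m.start(), m.end(), m.group()) for m in TEMPEX.finditer(text)]

print(find_temporal_expressions(
    "A fare increase initiated last week by United Airlines was matched "
    "by competitors over the weekend."))
```

Even a small pattern set like this one overgenerates; bare weekday names, for example, fire inside song or book titles, which is exactly the false-positive problem noted below.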
Sequence-labeling approaches follow the same IOB scheme used for named-entity tags, marking words that are either inside, outside, or at the beginning of a TIMEX3-delimited temporal expression with the I, O, and B tags as follows:

    A  fare  increase  initiated  last  week  by  UAL  Corp  's
    O  O     O         O          B     I     O   O    O     O

Features are extracted from the token and its context, and a statistical sequence labeler is trained (any sequence model can be used). Figure 17.21 lists standard features used in temporal tagging.

    Feature           Explanation
    Token             The target token to be labeled
    Tokens in window  Bag of tokens in the window around a target
    Shape             Character shape features
    POS               Parts of speech of target and window words
    Chunk tags        Base-phrase chunk tag for target and words in a window
    Lexical triggers  Presence in a list of temporal terms
    Figure 17.21: Typical features used to train IOB-style temporal expression taggers.

Temporal expression recognizers are evaluated with the usual recall, precision, and F-measures. A major difficulty for all of these very lexicalized approaches is avoiding expressions that trigger false positives:

(17.15) 1984 tells the story of Winston Smith...
(17.16) ...U2's classic Sunday Bloody Sunday

17.3.2 Temporal Normalization

Temporal normalization is the process of mapping a temporal expression to either a specific point in time or a duration. Points in time correspond to calendar dates, to times of day, or both. Durations primarily consist of lengths of time but may also include information about start and end points. Normalized times are represented with the VALUE attribute from the ISO 8601 standard for encoding temporal values (ISO8601, 2004). Fig. 17.22 reproduces our earlier example with the value attributes added.

    Figure 17.22: TimeML markup including normalized values for the temporal expressions: the document creation time July 2, 2007 receives the value 2007-07-02, last week the value 2007-W26 (anchored to the document time), the weekend the duration value P1WE, and two weeks the duration value P2W.

The dateline, or document date, for this text was July 2, 2007. The ISO representation for this kind of expression is YYYY-MM-DD, or in this case, 2007-07-02. The encodings for the temporal expressions in our sample text all follow from this date, and are shown in Fig. 17.22 as values for the VALUE attribute.

The first temporal expression in the text proper refers to a particular week of the year. In the ISO standard, weeks are numbered from 01 to 53, with the first week of the year being the one that has the first Thursday of the year. These weeks are represented with the template YYYY-Wnn. The ISO week for our document date is week 27; thus the value for last week is represented as "2007-W26".

The next temporal expression is the weekend. ISO weeks begin on Monday; thus, weekends occur at the end of a week and are fully contained within a single week. Weekends are treated as durations, so the value of the VALUE attribute has to be a length. Durations are represented according to the pattern Pnx, where n is an integer denoting the length and x represents the unit, as in P3Y for three years or P2D for two days. In this example, one weekend is captured as P1WE. In this case, there is also sufficient information to anchor this particular weekend as part of a particular week. Such information is encoded in the ANCHORTIMEID attribute. Finally, the phrase two weeks also denotes a duration captured as P2W. There is a lot more to the various temporal annotation standards, far too much to cover here. Figure 17.23 describes some of the basic ways that other times and durations are represented. Consult ISO8601 (2004), Ferro et al. (2005), and Pustejovsky et al. (2005) for more details.

    Unit                   Pattern              Sample Value
    Fully specified dates  YYYY-MM-DD           1991-09-28
    Weeks                  YYYY-Wnn             2007-W27
    Weekends               PnWE                 P1WE
    24-hour clock times    HH:MM:SS             11:13:45
    Dates and times        YYYY-MM-DDTHH:MM:SS  1991-09-28T11:00:00
    Financial quarters     Qn                   1999-Q3
    Figure 17.23: Sample ISO patterns for representing various times and durations.

Most current approaches to temporal normalization are rule-based (Chang and Manning 2012, Strötgen and Gertz 2013). Patterns that match temporal expressions are associated with semantic analysis procedures. As in the compositional rule-to-rule approach introduced in Chapter 16, the meaning of a constituent is computed from the meaning of its parts using a method specific to the constituent, although here the semantic composition rules involve temporal arithmetic rather than λ-calculus attachments.

Fully qualified date expressions contain a year, month, and day in some conventional form. The units in the expression must be detected and then placed in the correct place in the corresponding ISO pattern. The following pattern normalizes expressions like April 24, 1916:

    FQTE -> Month Date , Year    {Year.val - Month.val - Date.val}

The non-terminals Month, Date, and Year represent constituents that have already been recognized and assigned semantic values, accessed through the *.val notation. The value of this FQTE constituent can, in turn, be accessed as FQTE.val during further processing.

Fully qualified temporal expressions are fairly rare in real texts. Most temporal expressions in news articles are incomplete and are only implicitly anchored, often with respect to the dateline of the article, which we refer to as the document's temporal anchor. The values of temporal expressions such as today, yesterday, or tomorrow can all be computed with respect to this temporal anchor. The semantic procedure for today simply assigns the anchor, and the attachments for tomorrow and yesterday add a day and subtract a day from the anchor, respectively. Of course, given the cyclic nature of our representations for months, weeks, days, and times of day, our temporal arithmetic procedures must use modulo arithmetic appropriate to the time unit being used.

Unfortunately, even simple expressions such as the weekend or Wednesday introduce a fair amount of complexity. In our current example, the weekend clearly refers to the weekend of the week that immediately precedes the document date. But this won't always be the case, as is illustrated in the following example.

(17.17) Random security checks that began yesterday at Sky Harbor will continue at least through the weekend.

In this case, the expression the weekend refers to the weekend of the week that the anchoring date is part of (i.e., the coming weekend). The information that signals this meaning comes from the tense of continue, the verb governing the weekend.

Relative temporal expressions are handled with temporal arithmetic similar to that used for today and yesterday. The document date indicates that our example article is from ISO week 27, so the expression last week normalizes to the current week minus 1. To resolve ambiguous next and last expressions we consider the distance from the anchoring date to the nearest unit. Next Friday can refer either to the immediately next Friday or to the Friday following that, but the closer the document date is to a Friday, the more likely it is that the phrase will skip the nearest one. Such ambiguities are handled by encoding language and domain-specific heuristics into the temporal attachments.
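A sketch of this anchor-based arithmetic using Python's datetime module (our own code; a real normalizer would also need the tense and proximity heuristics just described):

```python
from datetime import date, timedelta

def normalize_relative(expression, anchor):
    """Normalize a few relative expressions against a document anchor date,
    returning ISO 8601 values (YYYY-MM-DD for days, YYYY-Wnn for weeks)."""
    expression = expression.lower()
    if expression == "today":
        return anchor.isoformat()
    if expression == "yesterday":
        return (anchor - timedelta(days=1)).isoformat()
    if expression == "tomorrow":
        return (anchor + timedelta(days=1)).isoformat()
    if expression in ("last week", "this week"):
        shift = 1 if expression == "last week" else 0
        year, week, _ = (anchor - timedelta(weeks=shift)).isocalendar()
        return "%d-W%02d" % (year, week)
    raise ValueError("unhandled expression: " + expression)

anchor = date(2007, 7, 2)                          # the document date (ISO week 27)
print(normalize_relative("last week", anchor))     # 2007-W26
print(normalize_relative("yesterday", anchor))     # 2007-07-01
```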
17.4 Extracting Events and their Times

The task of event extraction is to identify mentions of events in texts. For the purposes of this task, an event mention is any expression denoting an event or state that can be assigned to a particular point, or interval, in time. The following markup of the sample text introduced earlier shows all the events in this text.

    [EVENT Citing] high fuel prices, United Airlines [EVENT said] Friday it has [EVENT increased] fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said]. United, a unit of UAL Corp., [EVENT said] [EVENT the increase] took effect Thursday and [EVENT applies] to most routes where it [EVENT competes] against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

In English, most event mentions correspond to verbs, and most verbs introduce events. However, as we can see from our example, this is not always the case. Events can be introduced by noun phrases, as in the move and the increase, and some verbs fail to introduce events, as in the phrasal verb took effect, which refers to when the event began rather than to the event itself. Similarly, light verbs such as make, take, and have often fail to denote events; for light verbs the event is often expressed by the nominal direct object (took a flight), and these light verbs just provide a syntactic structure for the noun's arguments.

Various versions of the event extraction task exist, depending on the goal. For example, in the TempEval shared tasks (Verhagen et al. 2009) the goal is to extract events and aspects like their aspectual and temporal properties. Events are to be classified as actions, states, reporting events (say, report, tell, explain), perception events, and so on. The aspect, tense, and modality of each event also needs to be extracted. Thus for example the various said events in the sample text would be annotated as (class=REPORTING, tense=PAST, aspect=PERFECTIVE).

Event extraction is generally modeled via supervised learning, detecting events via sequence models with IOB tagging, and assigning event classes and attributes with multi-class classifiers. Common features include surface information like parts of speech, lexical items, and verb tense information; see Fig. 17.24.
    Feature                     Explanation
    Character affixes           Character-level prefixes and suffixes of target word
    Nominalization suffix       Character-level suffixes for nominalizations (e.g., -tion)
    Part of speech              Part of speech of the target word
    Light verb                  Binary feature indicating that the target is governed by a light verb
    Subject syntactic category  Syntactic category of the subject of the sentence
    Morphological stem          Stemmed version of the target word
    Verb root                   Root form of the verb basis for a nominalization
    WordNet hypernyms           Hypernym set for the target
    Figure 17.24: Features commonly used in both rule-based and machine learning approaches to event detection.

17.4.1 Temporal Ordering of Events

With both the events and the temporal expressions in a text having been detected, the next logical task is to use this information to fit the events into a complete timeline. Such a timeline would be useful for applications such as question answering and summarization. This ambitious task is the subject of considerable current research but is beyond the capabilities of current systems.

A somewhat simpler, but still useful, task is to impose a partial ordering on the events and temporal expressions mentioned in a text. Such an ordering can provide many of the same benefits as a true timeline. An example of such a partial ordering is the determination that the fare increase by American Airlines came after the fare increase by United in our sample text. Determining such an ordering can be viewed as a binary relation detection and classification task similar to those described earlier in Section 17.2. The temporal relation between events is classified into one of the standard set of Allen relations shown in Fig. 17.25 (Allen, 1984), using feature-based classifiers as in Section 17.2, trained on the TimeBank corpus with features like words/embeddings, parse paths, tense and aspect.

    Figure 17.25: The 13 temporal relations from Allen (1984): A before B / B after A, A overlaps B / B overlaps' A, A meets B / B meets' A, A equals B, A starts B / B starts' A, A finishes B / B finishes' A, and A during B / B during' A.

The TimeBank corpus consists of text annotated with much of the information we've been discussing throughout this section (Pustejovsky et al., 2003). TimeBank 1.2 consists of 183 news articles selected from a variety of sources, including the Penn TreeBank and PropBank collections.

Each article in the TimeBank corpus has had the temporal expressions and event mentions in them explicitly annotated in the TimeML annotation (Pustejovsky et al., 2003a). In addition to temporal expressions and events, the TimeML annotation provides temporal links between events and temporal expressions that specify the nature of the relation between them. Consider the following sample sentence and its corresponding markup shown in Fig. 17.26, selected from one of the TimeBank documents.

(17.18) Delta Air Lines earnings soared 33% to a record in the fiscal first quarter, bucking the industry trend toward declining profits.

    Figure 17.26: Example from the TimeBank corpus: the TimeML markup of (17.18), with the document creation time 10/26/89 (value 1989-10-26) and the events soared, bucking, and declining annotated.

As annotated, this text includes three events and two temporal expressions. The events are all in the occurrence class and are given unique identifiers for use in further annotations. The temporal expressions include the creation time of the article, which serves as the document time, and a single temporal expression within the text.

In addition to these annotations, TimeBank provides four links that capture the temporal relations between the events and times in the text, using the Allen relations from Fig. 17.25. The following are the within-sentence temporal relations annotated for this example.
17.5 Template Filling

Many texts contain reports of events, and possibly sequences of events, that often correspond to fairly common, stereotypical situations in the world. These abstract situations or stories, related to what have been called scripts (Schank and Abelson, 1977), consist of prototypical sequences of sub-events, participants, and their roles. The strong expectations provided by these scripts can facilitate the proper classification of entities, the assignment of entities into roles and relations, and most critically, the drawing of inferences that fill in things that have been left unsaid.

In their simplest form, such scripts can be represented as templates consisting of fixed sets of slots that take as values slot-fillers belonging to particular classes. The task of template filling is to find documents that invoke particular scripts and then fill the slots in the associated templates with fillers extracted from the text. These slot-fillers may consist of text segments extracted directly from the text, or they may consist of concepts that have been inferred from text elements through some additional processing.

A filled template from our original airline story might look like the following.

    FARE-RAISE ATTEMPT:    LEAD AIRLINE:    UNITED AIRLINES
                           AMOUNT:          $6
                           EFFECTIVE DATE:  2006-10-26
                           FOLLOWER:        AMERICAN AIRLINES

This template has four slots (LEAD AIRLINE, AMOUNT, EFFECTIVE DATE, FOLLOWER). The next section describes a standard sequence-labeling approach to filling slots. Section 17.5.2 then describes an older system based on the use of cascades of finite-state transducers and designed to address a more complex template-filling task that current learning-based systems don't yet address.

17.5.1 Machine Learning Approaches to Template Filling

In the standard paradigm for template filling, we are trying to fill fixed, known templates with known slots, and we assume training documents labeled with examples of each template, with the fillers of each slot marked in the text. The goal is to create one template for each event in the input documents, with the slots filled with text from the document.

The task is generally modeled by training two separate supervised systems. The first system decides whether the template is present in a particular sentence. This task is called template recognition or sometimes, in a perhaps confusing bit of terminology, event recognition. Template recognition can be treated as a text classification task, with features extracted from every sequence of words that was labeled in training documents as filling any slot from the template being detected. The usual set of features can be used: tokens, embeddings, word shapes, part-of-speech tags, syntactic chunk tags, and named entity tags.

The second system has the job of role-filler extraction. A separate classifier is trained to detect each role (LEAD-AIRLINE, AMOUNT, and so on). This can be a binary classifier that is run on every noun phrase in the parsed input sentence, or a sequence model run over sequences of words. Each role classifier is trained on the labeled data in the training set. Again, the usual set of features can be used, but now trained only on an individual noun phrase or the fillers of a single slot.
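As a deliberately small illustration of this role-filler stage, the sketch below trains one binary logistic-regression classifier per slot over candidate phrases and their local context, using scikit-learn. The tiny training tuples, the bag-of-words features, and the 0.5 confidence threshold are all invented for illustration; a real system would use the richer feature set described above and far more data, so the probabilities produced by this toy model should not be taken seriously.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Invented training tuples: (candidate phrase, local context, slot label or "NONE").
    train = [
        ("United Airlines", "said Friday it has increased fares", "LEAD_AIRLINE"),
        ("$6", "increased fares by per round trip", "AMOUNT"),
        ("American Airlines", "immediately matched the move", "FOLLOWER"),
        ("Tim Wagner", "spokesman said", "NONE"),
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(phrase + " || " + ctx for phrase, ctx, _ in train)
    roles = sorted({label for _, _, label in train} - {"NONE"})

    # One binary classifier per role, each trained on all labeled candidates.
    classifiers = {
        role: LogisticRegression(max_iter=1000).fit(
            X, [int(label == role) for _, _, label in train])
        for role in roles
    }

    def fill_slots(candidates):
        """candidates: (phrase, context) pairs extracted from one document."""
        X_new = vectorizer.transform(p + " || " + c for p, c in candidates)
        template = {}
        for role, clf in classifiers.items():
            scores = clf.predict_proba(X_new)[:, 1]
            best = scores.argmax()
            if scores[best] > 0.5:          # arbitrary confidence threshold
                template[role] = candidates[best][0]
        return template

At test time, fill_slots would be run over the candidate noun phrases of each sentence that the first-stage template-recognition classifier accepted.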
Multiple non-identical text segments might be labeled with the same slot label. For example, in our sample text, the strings United or United Airlines might be labeled as the LEAD AIRLINE. These are not incompatible choices, and the coreference resolution techniques introduced in Chapter 21 can provide a path to a solution.

A variety of annotated collections have been used to evaluate this style of approach to template filling, including sets of job announcements, conference calls for papers, restaurant guides, and biological texts. Recent work focuses on extracting templates in cases where there is no training data or even predefined templates, by inducing templates as sets of linked events (Chambers and Jurafsky, 2011).

17.5.2 Earlier Finite-State Template-Filling Systems

The templates above are relatively simple. But consider the task of producing a template that contained all the information in a text like this one (Grishman and Sundheim, 1995):

    Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.

The MUC-5 'joint venture' task (the Message Understanding Conferences were a series of U.S. government-organized information-extraction evaluations) was to produce hierarchically linked templates describing joint ventures. Figure 17.27 shows a structure produced by the FASTUS system (Hobbs et al., 1997). Note how the filler of the ACTIVITY slot of the TIE-UP template is itself a template with slots.

    Tie-up-1:    RELATIONSHIP:   TIE-UP
                 ENTITIES:       Bridgestone Sports Co., a local concern, a Japanese trading house
                 JOINT VENTURE:  Bridgestone Sports Taiwan Co.
                 ACTIVITY:       Activity-1
                 AMOUNT:         NT$20000000

    Activity-1:  COMPANY:        Bridgestone Sports Taiwan Co.
                 PRODUCT:        iron and "metal wood" clubs
                 START DATE:     DURING: January 1990

Figure 17.27  The templates produced by FASTUS given the input text on page 26.

Early systems for dealing with these complex templates were based on cascades of transducers based on hand-written rules, as sketched in Fig. 17.28.

No.  Step               Description
1    Tokens             Tokenize input stream of characters
2    Complex Words      Multiword phrases, numbers, and proper names
3    Basic phrases      Segment sentences into noun and verb groups
4    Complex phrases    Identify complex noun groups and verb groups
5    Semantic Patterns  Identify entities and events, insert into templates
6    Merging            Merge references to the same entity or event
Figure 17.28  Levels of processing in FASTUS (Hobbs et al., 1997). Each level extracts a specific type of information which is then passed on to the next higher level.

The first four stages use hand-written regular expression and grammar rules to do basic tokenization, chunking, and parsing. Stage 5 then recognizes entities and events with an FST-based recognizer and inserts the recognized objects into the appropriate slots in templates. This FST recognizer is based on hand-built regular expressions like the following (NG indicates Noun-Group and VG Verb-Group), which matches the first sentence of the news story above:

    NG(Company/ies) VG(Set-up) NG(Joint-Venture) with NG(Company/ies)
        VG(Produce) NG(Product)

The result of processing these two sentences is the five draft templates (Fig. 17.29) that must then be merged into the single hierarchical structure shown in Fig. 17.27.
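The flavor of such hand-built patterns can be imitated with ordinary regular expressions over chunk-labeled text. The sketch below is a loose simplification, not the actual FASTUS grammar: the chunking of the first sentence is written out by hand, and the tag names (NG, VG, PREP, CONJ) and bracket notation are invented for this example.

    import re

    # Hand-written stand-in for stage-3 output: each group is TAG[text].
    chunked = ("NG[Bridgestone Sports Co.] VG[said] NG[Friday] NG[it] "
               "VG[has set up] NG[a joint venture] PREP[in] NG[Taiwan] PREP[with] "
               "NG[a local concern] CONJ[and] NG[a Japanese trading house] "
               "VG[to produce] NG[golf clubs]")

    # Rough analogue of the pattern
    #   NG(Company) VG(Set-up) NG(Joint-Venture) with NG(Company) VG(Produce) NG(Product)
    pattern = re.compile(
        r"NG\[(?P<company>[^\]]+)\].*?"
        r"VG\[[^\]]*set up[^\]]*\] NG\[(?P<venture>[^\]]+)\].*?"
        r"PREP\[with\] NG\[(?P<partner1>[^\]]+)\] CONJ\[and\] NG\[(?P<partner2>[^\]]+)\] "
        r"VG\[to produce\] NG\[(?P<product>[^\]]+)\]")

    m = pattern.search(chunked)
    if m:
        draft_template = {
            "RELATIONSHIP": "TIE-UP",
            "ENTITIES": [m["company"], m["partner1"], m["partner2"]],
            "ACTIVITY": {"PRODUCT": m["product"]},
        }
        print(draft_template)

A real cascade would apply many such patterns in stage 5 and leave it to stage 6 to merge the resulting draft templates.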
The merging algorithm, after performing coreference resolution, merges two activities that are likely to be describing the same events.

#  Template/Slot     Value
1  RELATIONSHIP:     TIE-UP
   ENTITIES:         Bridgestone Sports Co., a local concern, a Japanese trading house
2  ACTIVITY:         PRODUCTION
   PRODUCT:          "golf clubs"
3  RELATIONSHIP:     TIE-UP
   JOINT VENTURE:    "Bridgestone Sports Taiwan Co."
   AMOUNT:           NT$20000000
4  ACTIVITY:         PRODUCTION
   COMPANY:          "Bridgestone Sports Taiwan Co."
   START DATE:       DURING: January 1990
5  ACTIVITY:         PRODUCTION
   PRODUCT:          "iron and 'metal wood' clubs"
Figure 17.29  The five partial templates produced by stage 5 of FASTUS. These templates are merged in stage 6 to produce the final template shown in Fig. 17.27 on page 26.

17.6 Summary

This chapter has explored techniques for extracting limited forms of semantic content from texts.

• Named entities can be recognized and classified by feature-based or neural sequence labeling techniques.
• Relations among entities can be extracted by pattern-based approaches; supervised learning methods when annotated training data is available; lightly supervised bootstrapping methods when small numbers of seed tuples or seed patterns are available; distant supervision when a database of relations is available; and unsupervised or Open IE methods.
• Reasoning about time can be facilitated by detection and normalization of temporal expressions through a combination of statistical learning and rule-based methods.
• Events can be detected and ordered in time using sequence models and classifiers trained on temporally- and event-labeled data like the TimeBank corpus.
• Template-filling applications can recognize stereotypical situations in texts and assign elements from the text to roles represented as fixed sets of slots.

Bibliographical and Historical Notes

The earliest work on information extraction addressed the template-filling task in the context of the FRUMP system (DeJong, 1982). Later work was stimulated by the U.S. government-sponsored MUC conferences (Sundheim 1991, Sundheim 1992, Sundheim 1993, Sundheim 1995). Early MUC systems like CIRCUS (Lehnert et al., 1991) and SCISOR (Jacobs and Rau, 1990) were quite influential and inspired later systems like FASTUS (Hobbs et al., 1997). Chinchor et al. (1993) describe the MUC evaluation techniques.

Due to the difficulty of porting systems from one domain to another, attention shifted to machine learning approaches. Early supervised learning approaches to IE (Cardie 1993, Cardie 1994, Riloff 1993, Soderland et al. 1995, Huffman 1996) focused on automating the knowledge acquisition process, mainly for finite-state rule-based systems. Their success, and the earlier success of HMM-based speech recognition, led to the use of sequence labeling (HMMs: Bikel et al. 1997; MEMMs: McCallum et al. 2000; CRFs: Lafferty et al. 2001), and a wide exploration of features (Zhou et al., 2005). Neural approaches to NER mainly follow from the pioneering results of Collobert et al. (2011), who applied a CRF on top of a convolutional net. BiLSTMs with word and character-based embeddings as input followed shortly and became a standard neural algorithm for NER (Huang et al. 2015, Ma and Hovy 2016, Lample et al. 2016).

Neural algorithms for relation extraction often explore architectures that can handle entities far apart in the sentence: recursive networks (Socher et al., 2012), convolutional nets (dos Santos et al., 2015), or chain or tree LSTMs (Miwa and Bansal 2016, Peng et al. 2017).
Progress in this area continues to be stimulated by formal evaluations with shared benchmark datasets, including the Automatic Content Extraction (ACE) evaluations of 2000-2007 on named entity recognition, relation extraction, and temporal expressions, the KBP (Knowledge Base Population) evaluations (Ji et al. 2010, Sur-
