Event Information Extraction Using Named Entity Recognition
Catherine Dong and Mariano Sorgente

Abstract: This project aims to extract event information (event name, location, date, and time) from bodies of text. The motivation behind this is to make email management more efficient. We applied named-entity recognition to this problem, modeling each sentence as a modified linear-chain CRF that included tri-nary potentials. Potentials were given by the exponential of the dot product between the extracted local feature vector and the parameter vector, and the Viterbi algorithm was used to determine the optimal tag sequence. Using this approach, we achieved an F1 score of 59%. Times and dates were identified more accurately than event names and locations. Our results suggest that with a much larger training data set, our F1 score could be greatly improved.
INTRODUCTION

Too often, information is lost in the deep black hole of email inboxes. This can be especially problematic when emails contain information about important events but that information is not recorded elsewhere. Our project aims to remedy this problem by creating a program that, given a set of emails or other bodies of text, outputs any event information contained within them. The event information we are looking for includes event name, location, date, and time.

Named-Entity Recognition

In order to perform this information extraction task, we use named-entity recognition: the process of labeling words in sentences with tags that correspond to a property of the word. For example, it can be used in natural language processing to tag words with their parts of speech. Named-entity recognition is often implemented using conditional random
fields (CRFs) to model sentences. CRFs, which are derived from logistic-regression factor graphs, are analogous to Markov models, which are derived from naïve Bayes models, but offer additional benefits for named-entity recognition. One significant advantage of CRFs over hidden Markov models (HMMs) is their conditional nature, which relaxes the independence assumptions that HMMs require.¹
Figure 1: CRFs vs. Other Models
¹ http://www.inference.phy.cam.ac.uk/hmw26/crf/
The Stanford Named-Entity Recognizer (NER), also known as the CRFClassifier, is an extensive project that identifies entity types including location, person, and organization.²
APPROACH

In this project, we apply named-entity recognition by modeling each sentence in an email as a CRF and labeling each word with a tag that indicates whether the word is part of an event description. Specifically, we use the following tags:

- DOW (day of week)
- MONTH
- DAY (numerical day of month)
- YEAR
- HOUR
- MIN (minute)
- AMPM (a.m. or p.m.)
- EVENT (event name)
- LOC (event location)
- OTHER (word is not part of an event description)

Our implementation of this project utilized several parts of the NER assignment.

Data

Our project required a large dataset of labeled emails. We were able to obtain emails from the Enron Email Dataset. These emails were preprocessed into de-capitalized raw text with all punctuation marks separated by spaces.
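The tag set and the preprocessing step can be sketched in Python; the function below is our illustration of the described preprocessing, not the authors' actual code:

```python
# The ten tags used to label each word in a sentence.
TAGS = [
    "DOW",    # day of week
    "MONTH",
    "DAY",    # numerical day of month
    "YEAR",
    "HOUR",
    "MIN",    # minute
    "AMPM",   # a.m. or p.m.
    "EVENT",  # event name
    "LOC",    # event location
    "OTHER",  # word is not part of an event description
]

def preprocess(raw):
    """De-capitalize and separate punctuation marks by spaces,
    as described for the Enron emails (illustrative sketch)."""
    out = []
    for ch in raw:
        out.append(" " + ch + " " if ch in ".,:;!?()\"'" else ch)
    return " ".join("".join(out).lower().split())
```

For example, `preprocess("Hello, World.")` yields the space-separated, lowercased token stream used as CRF input.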
² http://nlp.stanford.edu/software/CRF-NER.shtml
Obtaining labeled data, however, was a challenge. Because we could not find any public datasets that included all the tags we needed (namely, the event-name tag; datasets with date, time, and location labels do exist), we resorted to labeling data manually. We wrote a program to allow us to do so quickly and easily, and we were eventually able to label approximately one thousand emails, corresponding to about nine thousand sentences.

Feature Extraction

When approaching this problem, we first made observations about the general structure of email messages and event information. In English, descriptions of dates tend to appear after the word "on". Similarly, descriptions of times tend to appear after the word "at", and locations tend to appear after words like "at" or "in". Furthermore, all event information tends to appear in close proximity. Thus, the classification of a word seemed to depend on the words near it as well as on the labels of those words. Observe the following sentence and corresponding tag sequence:
There/OTHER will/OTHER be/OTHER a/OTHER meeting/EVENT on/OTHER
Thursday/DOW March/MONTH 4/DAY at/OTHER 2/HOUR p.m./AMPM
in/OTHER Room/LOCATION 100/LOCATION
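Cue-word patterns like these can be encoded as indicator feature functions. A minimal sketch follows; the feature templates are our own illustration, not the authors' exact feature set:

```python
def local_features(words, t, y_prev, y_t):
    """Indicator features for position t and candidate tags (y_prev, y_t).
    Templates are illustrative, not the authors' exact set."""
    prev_word = words[t - 1] if t > 0 else "<s>"
    feats = {
        ("word+tag", words[t], y_t),
        ("prevword+tag", prev_word, y_t),
        ("tagpair", y_prev, y_t),
    }
    # Cue words: dates after "on"; times and locations after "at" / "in".
    if prev_word in ("on", "at", "in"):
        feats.add(("after-" + prev_word, y_t))
    return feats
```

In the sample sentence, the word "thursday" directly follows "on", so the candidate tag DOW fires the ("after-on", "DOW") indicator.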
To take advantage of these traits, it was necessary to use more than unigram observation functions. We suspected that bigram features would also be insufficient, as they do not fully capture the observation that pieces of event information tend to appear near one another while usually being separated by at least one word. Thus, we also tried incorporating trigram features that take into account the previous two words and their corresponding labels.

CRF Model

We tested two different ways of modeling our problem as a CRF. First, we tried a standard linear-chain CRF. As seen in Figure 2, this CRF has binary (transition) potentials, the calculation of which incorporates unary (observation) functions as well.
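The trigram templates described above extend the scope to the previous two words and labels; a sketch under the same caveat that the exact templates are our assumption:

```python
def trigram_features(words, t, y_pp, y_p, y_t):
    """Features over three consecutive labels (y_{t-2}, y_{t-1}, y_t)
    and the previous two words. Illustrative templates only."""
    w_p = words[t - 1] if t >= 1 else "<s>"
    w_pp = words[t - 2] if t >= 2 else "<s>"
    return {
        ("tagtriple", y_pp, y_p, y_t),
        ("context", w_pp, w_p, words[t], y_t),
    }
```

A feature like ("tagtriple", "EVENT", "OTHER", "DOW") lets the model learn that a date often follows an event name with one word in between.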
Figure 2: Linear Chain CRF: Binary Potentials

We also tested a modified chain CRF that has tri-nary potentials (Figure 3) with a scope of three consecutive labels, in order to account for the previous two words and labels. Binary and unary functions can be incorporated into these tri-nary potentials.

Figure 3: Modified Chain CRF: Tri-nary Potentials

Training and Computing Potentials

To obtain the optimal parameter vector, we used stochastic gradient descent on a training data set of approximately eight thousand sentences. The value of each potential G_i, given the words x in the sentence, the current index t, the tag sequence y, the current parameter vector θ, and the feature vector extraction function φ, is calculated using the following equations:

Linear Chain CRF: G_i(y_{t-1}, y_t; x, θ) = exp(θ · φ(t, y_{t-1}, y_t, x))

Modified Chain CRF: G_i(y_{t-2}, y_{t-1}, y_t; x, θ) = exp(θ · φ(t, y_{t-2}, y_{t-1}, y_t, x))
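With indicator features, the dot product θ · φ reduces to summing the weights of the features that fire. A minimal sketch of the linear-chain potential follows; the weight value and feature names are hypothetical:

```python
import math
from collections import defaultdict

# Parameter vector theta, learned by stochastic gradient descent.
theta = defaultdict(float)
theta[("word+tag", "thursday", "DOW")] = 2.0  # hypothetical learned weight

def phi(t, y_prev, y_t, words):
    """Feature vector phi(t, y_{t-1}, y_t, x) as a set of indicators."""
    prev_word = words[t - 1] if t > 0 else "<s>"
    return {("word+tag", words[t], y_t),
            ("prevword+tag", prev_word, y_t),
            ("tagpair", y_prev, y_t)}

def G(t, y_prev, y_t, words):
    """Binary potential G(y_{t-1}, y_t; x, theta) = exp(theta . phi)."""
    return math.exp(sum(theta[f] for f in phi(t, y_prev, y_t, words)))
```

Features with zero weight contribute nothing, so a tag pair with no firing weighted features gets potential exp(0) = 1.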
At each iteration, the parameter vector is updated, which in turn alters the values of the potentials of the CRF. The CRF and its computed potentials are then used to find the optimal tag sequence for each sentence. To do so, we use the Viterbi algorithm, which computes the max forward message at each index and then uses backward reconstruction to retrieve the optimal tag sequence. Finally, after up to 30 iterations of training, we saved the parameters produced by training and used them to label the development data set, which contained approximately one thousand sentences.
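The decoding step can be sketched as follows, with a toy potential function standing in for the trained one:

```python
def viterbi(words, tags, G):
    """Max forward messages, then backward reconstruction of the
    highest-scoring tag sequence under potential function G."""
    n = len(words)
    # delta[t][y]: score of best sequence for words[0..t] ending in tag y.
    delta = [{y: G(0, "<s>", y, words) for y in tags}]
    back = []  # backpointers for reconstruction
    for t in range(1, n):
        row, ptr = {}, {}
        for y in tags:
            best = max(tags, key=lambda yp: delta[-1][yp] * G(t, yp, y, words))
            row[y] = delta[-1][best] * G(t, best, y, words)
            ptr[y] = best
        delta.append(row)
        back.append(ptr)
    # Backward reconstruction from the best final tag.
    y = max(tags, key=lambda yy: delta[-1][yy])
    seq = [y]
    for ptr in reversed(back):
        y = ptr[y]
        seq.append(y)
    return list(reversed(seq))

def toy_G(t, y_prev, y_t, words):
    """Hypothetical potential: favors DOW on 'thursday', OTHER elsewhere."""
    if words[t] == "thursday" and y_t == "DOW":
        return 5.0
    return 2.0 if y_t == "OTHER" else 1.0
```

For example, `viterbi("a meeting on thursday".split(), ["DOW", "OTHER"], toy_G)` tags the last word DOW and the rest OTHER.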
RESULTS

We found that the modified chain CRF was not scalable: with the large datasets required for even minimal accuracy, a few iterations of training took several hours. We therefore focused on testing the linear-chain CRF. We achieved an average F1 score of 59%. A sample confusion matrix is shown in Figure 4. As we labeled more data, we found that the F1 score increased dramatically at first and then its growth slowed, as shown in Figure 5. Furthermore, as we ran more iterations of training, our F1 score grew gradually.

DISCUSSION AND FUTURE WORK

While an F1 score of 59% is probably not good enough to use in production, it is a good start. One of our major challenges was obtaining labeled data, so the more we increase the size of our dataset, the higher the score we can achieve. Furthermore, if we have constantly running computers on which we can run many iterations of training, we have the potential to increase our score even further. A side project we are currently working on is a program that takes the extracted event information and inputs it into a Google Calendar. We have a preliminary version of this program that uses the Google Calendar API to do this.
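Per-tag precision, recall, and F1 can be read off a confusion matrix such as the one in Figure 4; a minimal sketch (the example counts are hypothetical, not taken from our results):

```python
def f1(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall, computed from
    true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with 6 true positives, 2 false positives, and 2 false negatives for a tag, both precision and recall are 0.75, giving F1 = 0.75.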
RESULTS FIGURES
Figure 4: Confusion Matrix
Figure 5: Data Size vs. F1 Score
Figure 6: Iterations of Training vs. F1 Score