Event Information Extraction Using Named Entity Recognition
Catherine Dong and Mariano Sorgente

Abstract: This project aims to extract event information (event name, location, date, and time) from bodies of text. The motivation behind this is to make email management more efficient. We applied named-entity recognition to this problem, modeling each sentence as a modified linear-chain CRF that included tri-nary potentials. Potentials were given by the exponential of the dot product between the extracted local feature vector and the parameter vector, and the Viterbi algorithm was used to determine the optimal tag sequence. Using this approach, we achieved an F1 score of 59%. Times and dates were identified more accurately than event names and locations. Our results suggest that with a much larger training data set, our F1 score could be greatly improved.
INTRODUCTION

Too often, information is lost in the deep black hole of email inboxes. This can be especially problematic when emails contain information about important events but that information is not recorded elsewhere. Our project aims to remedy this problem by creating a program that, given a set of emails or other bodies of text, outputs any event information contained within them. The event information we are looking for includes event name, location, date, and time.

Named-Entity Recognition

In order to perform this information extraction task, we use named-entity recognition: the process of labeling words in sentences with tags that correspond to a property of the word. For example, it can be used in natural language processing to tag words with their parts of speech. Named-entity recognition is often implemented using conditional random
fields (CRFs) to model sentences. CRFs, which are derived from logistic-regression factor graphs, are analogous to Markov models, which are derived from naïve Bayes models, but offer additional benefits for named-entity recognition. One significant advantage of CRFs over hidden Markov models (HMMs) is their conditional nature, which relaxes the independence assumptions that HMMs require.¹
Figure 1: CRFs vs. Other Models
¹ http://www.inference.phy.cam.ac.uk/hmw26/crf/
The Stanford Named-Entity Recognizer (NER), also known as the CRFClassifier, is an extensive project that identifies entity types including location, person, and organization.²
APPROACH

In this project, we apply named-entity recognition by modeling each sentence in an email as a CRF and labeling each word with a tag that indicates whether the word is part of an event description. Specifically, we use the following tags:

- DOW (day of week)
- MONTH
- DAY (numerical day of month)
- YEAR
- HOUR
- MIN (minute)
- AMPM (a.m. or p.m.)
- EVENT (event name)
- LOC (event location)
- OTHER (word is not part of an event description)

Our implementation of this project utilized several parts of the NER assignment.

Data

Our project required a large dataset of labeled emails. We were able to obtain emails from the Enron Email Dataset. These emails were preprocessed into de-capitalized raw text with all punctuation marks separated by spaces.
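The tag set and the preprocessing step can be sketched in Python; the function below is our illustration of the described preprocessing, not the authors' actual code:

```python
# The ten tags used to label each word in a sentence.
TAGS = [
    "DOW",    # day of week
    "MONTH",
    "DAY",    # numerical day of month
    "YEAR",
    "HOUR",
    "MIN",    # minute
    "AMPM",   # a.m. or p.m.
    "EVENT",  # event name
    "LOC",    # event location
    "OTHER",  # word is not part of an event description
]

def preprocess(raw):
    """De-capitalize and separate punctuation marks by spaces,
    as described for the Enron emails (illustrative sketch)."""
    out = []
    for ch in raw:
        out.append(" " + ch + " " if ch in ".,:;!?()\"'" else ch)
    return " ".join("".join(out).lower().split())
```

For example, `preprocess("Hello, World.")` yields the space-separated, lowercased token stream used as CRF input.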
² http://nlp.stanford.edu/software/CRF-NER.shtml
Obtaining labeled data, however, was a challenge. Because we could not find any public datasets that included all the tags we needed (namely, the event-name tag; datasets with date, time, and location labels do exist), we resorted to labeling data manually. We wrote a program to allow us to do so quickly and easily, and we were eventually able to label approximately one thousand emails, corresponding to about nine thousand sentences.

Feature Extraction

When approaching this problem, we first made observations about the general structure of email messages and event information. In English, descriptions of dates tend to appear after the word "on". Similarly, descriptions of times tend to appear after the word "at", and locations tend to appear after words like "at" or "in". Furthermore, all event information tends to appear in close proximity. Thus, the classification of a word seemed to depend on the words near it as well as on the labels of those words. Observe the following sentence and corresponding tag sequence:
There/OTHER will/OTHER be/OTHER a/OTHER meeting/EVENT on/OTHER
Thursday/DOW March/MONTH 4/DAY at/OTHER 2/HOUR p.m./AMPM
in/OTHER Room/LOCATION 100/LOCATION
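Cue-word patterns like these can be encoded as indicator feature functions. A minimal sketch follows; the feature templates are our own illustration, not the authors' exact feature set:

```python
def local_features(words, t, y_prev, y_t):
    """Indicator features for position t and candidate tags (y_prev, y_t).
    Templates are illustrative, not the authors' exact set."""
    prev_word = words[t - 1] if t > 0 else "<s>"
    feats = {
        ("word+tag", words[t], y_t),
        ("prevword+tag", prev_word, y_t),
        ("tagpair", y_prev, y_t),
    }
    # Cue words: dates after "on"; times and locations after "at" / "in".
    if prev_word in ("on", "at", "in"):
        feats.add(("after-" + prev_word, y_t))
    return feats
```

In the sample sentence, the word "thursday" directly follows "on", so the candidate tag DOW fires the ("after-on", "DOW") indicator.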
To take advantage of these traits, it was necessary to use more than unigram observation functions. We suspected that bigram features would also be insufficient, as they do not fully capture the observation that pieces of event information tend to appear near one another while usually being separated by at least one word. Thus, we also tried incorporating trigram features that take into account the previous two words and their corresponding labels.

CRF Model

We tested two different ways of modeling our problem as a CRF. First, we tried a standard linear-chain CRF. As seen in Figure 2, this CRF has binary (transition) potentials, the calculation of which incorporates unary (observation) functions as well.
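The trigram templates described above extend the scope to the previous two words and labels; a sketch under the same caveat that the exact templates are our assumption:

```python
def trigram_features(words, t, y_pp, y_p, y_t):
    """Features over three consecutive labels (y_{t-2}, y_{t-1}, y_t)
    and the previous two words. Illustrative templates only."""
    w_p = words[t - 1] if t >= 1 else "<s>"
    w_pp = words[t - 2] if t >= 2 else "<s>"
    return {
        ("tagtriple", y_pp, y_p, y_t),
        ("context", w_pp, w_p, words[t], y_t),
    }
```

A feature like ("tagtriple", "EVENT", "OTHER", "DOW") lets the model learn that a date often follows an event name with one word in between.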
Figure 2: Linear Chain CRF: Binary Potentials

We also tested a modified chain CRF that has tri-nary potentials (Figure 3) with a scope of three consecutive labels, in order to account for the previous two words and labels. Binary and unary functions can be incorporated into these tri-nary potentials.

Figure 3: Modified Chain CRF: Tri-nary Potentials

Training and Computing Potentials

To obtain the optimal parameter vector, we used stochastic gradient descent on a training data set of approximately eight thousand sentences. The value of each potential G_i, given the words x in the sentence, the current index t, the tag sequence y, the current parameter vector θ, and the feature vector extraction function φ, is calculated using the following equations:

Linear Chain CRF: G_i(y_{t-1}, y_t; x, θ) = exp(θ · φ(t, y_{t-1}, y_t, x))

Modified Chain CRF: G_i(y_{t-2}, y_{t-1}, y_t; x, θ) = exp(θ · φ(t, y_{t-2}, y_{t-1}, y_t, x))
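With indicator features, the dot product θ · φ reduces to summing the weights of the features that fire. A minimal sketch of the linear-chain potential follows; the weight value and feature names are hypothetical:

```python
import math
from collections import defaultdict

# Parameter vector theta, learned by stochastic gradient descent.
theta = defaultdict(float)
theta[("word+tag", "thursday", "DOW")] = 2.0  # hypothetical learned weight

def phi(t, y_prev, y_t, words):
    """Feature vector phi(t, y_{t-1}, y_t, x) as a set of indicators."""
    prev_word = words[t - 1] if t > 0 else "<s>"
    return {("word+tag", words[t], y_t),
            ("prevword+tag", prev_word, y_t),
            ("tagpair", y_prev, y_t)}

def G(t, y_prev, y_t, words):
    """Binary potential G(y_{t-1}, y_t; x, theta) = exp(theta . phi)."""
    return math.exp(sum(theta[f] for f in phi(t, y_prev, y_t, words)))
```

Features with zero weight contribute nothing, so a tag pair with no firing weighted features gets potential exp(0) = 1.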
At each iteration, the parameter vector is updated, which in turn alters the values of the potentials of the CRF. The CRF and its computed potentials are then used to find the optimal tag sequence for each sentence. To do so, we use the Viterbi algorithm, which computes the max forward message at each index and then uses backward reconstruction to retrieve the optimal tag sequence. Finally, after up to 30 iterations of training, we saved the parameters produced by training and used them to label the development data set, which contained approximately one thousand sentences.
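The decoding step can be sketched as follows, with a toy potential function standing in for the trained one:

```python
def viterbi(words, tags, G):
    """Max forward messages, then backward reconstruction of the
    highest-scoring tag sequence under potential function G."""
    n = len(words)
    # delta[t][y]: score of best sequence for words[0..t] ending in tag y.
    delta = [{y: G(0, "<s>", y, words) for y in tags}]
    back = []  # backpointers for reconstruction
    for t in range(1, n):
        row, ptr = {}, {}
        for y in tags:
            best = max(tags, key=lambda yp: delta[-1][yp] * G(t, yp, y, words))
            row[y] = delta[-1][best] * G(t, best, y, words)
            ptr[y] = best
        delta.append(row)
        back.append(ptr)
    # Backward reconstruction from the best final tag.
    y = max(tags, key=lambda yy: delta[-1][yy])
    seq = [y]
    for ptr in reversed(back):
        y = ptr[y]
        seq.append(y)
    return list(reversed(seq))

def toy_G(t, y_prev, y_t, words):
    """Hypothetical potential: favors DOW on 'thursday', OTHER elsewhere."""
    if words[t] == "thursday" and y_t == "DOW":
        return 5.0
    return 2.0 if y_t == "OTHER" else 1.0
```

For example, `viterbi("a meeting on thursday".split(), ["DOW", "OTHER"], toy_G)` tags the last word DOW and the rest OTHER.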
RESULTS

We found that the modified chain CRF was not scalable: with the large datasets required for even minimal accuracy, a few iterations of training took several hours. We therefore focused on testing the linear-chain CRF. We achieved an average F1 score of 59%. A sample confusion matrix is shown in Figure 4. As we labeled more data, we found that the F1 score increased dramatically at first and then its growth slowed, as shown in Figure 5. Furthermore, as we ran more iterations of training, our F1 score grew gradually.

DISCUSSION AND FUTURE WORK

While an F1 score of 59% is probably not good enough to use in production, it is a good start. One of our major challenges was obtaining labeled data, so the more we increase the size of our dataset, the higher the score we can achieve. Furthermore, if we have constantly running computers on which we can run many iterations of training, we have the potential to increase our score even further. A side project we are currently working on is a program that takes the extracted event information and inputs it into a Google Calendar. We have a preliminary version of this program that uses the Google Calendar API to do this.
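Per-tag precision, recall, and F1 can be read off a confusion matrix such as the one in Figure 4; a minimal sketch (the example counts are hypothetical, not taken from our results):

```python
def f1(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall, computed from
    true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with 6 true positives, 2 false positives, and 2 false negatives for a tag, both precision and recall are 0.75, giving F1 = 0.75.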
RESULTS FIGURES
Figure 4: Confusion Matrix
Figure 5: Data Size vs. F1 Score
Figure 6: Iterations of Training vs. F1 Score