Information extraction
What is information extraction?
• It is the task of automatically extracting structured information from
unstructured and/or semi-structured machine-readable documents.
• In most cases this involves processing human-language texts by means of
natural language processing (NLP).
What is Information Extraction
The NLP task of information extraction (IE) turns the unstructured information
embedded in texts into structured data, for example to populate a relational
database and enable further processing.
Three IE sub-tasks:
1. Named Entity Recognition (NER)
2. Relation Extraction
3. Event Extraction
Named entities
• Part-of-speech tagging can tell us that words like Janet, Stanford University,
and Colorado are all proper nouns;
• being a proper noun is a grammatical property of these words.
• But viewed from a semantic perspective, these proper nouns refer to
different kinds of entities:
• Janet is a person, Stanford University is an organization, and Colorado is a
location.
Named Entity
• A named entity is, roughly speaking, anything that can be referred to with a proper
name: a person, a location, an organization.
• The task of named entity recognition (NER) is to find spans of text that constitute
proper names and tag the type of the entity.
• Four entity tags are most common: PER (person), LOC (location), ORG (organization), and
GPE (geo-political entity).
• However, the term named entity is commonly extended to include things that aren’t
entities per se, including dates, times, and other kinds of temporal expressions, and
even numerical expressions like prices.
• Here’s an example of the output of an NER tagger:
Example
The text contains 13 mentions of named entities including 5 organizations, 4 locations, 2 times, 1
person, and 1 mention of money
A list of generic named entity types with the
kinds of entities they refer to
Ambiguities in NER
• In part-of-speech tagging there is no segmentation problem, since each word
gets exactly one tag.
• Named entity recognition, by contrast, must find and label spans of text, and is difficult
partly because of segmentation ambiguity:
• we need to decide what’s an entity and what isn’t, and where the boundaries are.
• Indeed, most words in a text will not be named entities.
• Another difficulty is caused by type ambiguity.
• The mention JFK can refer to a person, the airport in New York, or any number of schools,
bridges, and streets around the United States.
• Some examples of this kind of cross-type confusion are given in Figure
Ambiguities in NER
• The standard approach to sequence labeling for a span-recognition
problem like NER is BIO tagging (Ramshaw and Marcus, 1995).
• This is a method that allows us to treat NER like a word-by-word
sequence labeling task, via tags that capture both the boundary and
the named entity type.
• Consider the following sentence:
BIO Tagging
• The figure below shows the same excerpt represented with BIO tagging, as well
as variants called IO tagging and BIOES tagging.
• In BIO tagging we label any token that begins a span of interest with the label
B, tokens that occur inside a span with the label I, and any tokens
outside of any span of interest with the label O.
• The B and I tags are specific to each entity type (for example B-PER, I-PER),
so n entity types give 2n + 1 tags.
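As a minimal sketch of BIO encoding, the function below turns entity spans into one tag per token. The function name `spans_to_bio`, the (start, end, type) span format with an exclusive end index, and the example sentence are illustrative assumptions, not part of the original slides.

```python
def spans_to_bio(tokens, spans):
    """Return one BIO tag per token.

    spans: list of (start, end, entity_type) over token indices,
    with `end` exclusive (an assumed convention for this sketch).
    """
    tags = ["O"] * len(tokens)  # default: outside any span of interest
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"  # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"  # remaining tokens inside the span
    return tags


# Hypothetical example sentence built from the entities mentioned earlier.
tokens = ["Janet", "visited", "Stanford", "University", "in", "Colorado"]
spans = [(0, 1, "PER"), (2, 4, "ORG"), (5, 6, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Each tag carries both the boundary (B, I, or O) and the entity type, which is what lets a word-by-word labeler recover multi-word spans.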
• A sequence labeler (HMM, CRF, RNN, Transformer, etc.) is trained to label each token
in a text with tags that indicate the presence (or absence) of particular kinds of named entities.
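Once a sequence labeler has produced one tag per token, the tags still have to be grouped back into entity spans. The sketch below shows one way this decoding could be done; the function name `bio_to_spans` and the choice to ignore an I- tag that does not continue an open span are assumptions made for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags back into (entity_type, text) pairs."""
    spans = []
    current = None  # [entity_type, list_of_tokens] for the span being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a new span starts, so close the open one
                spans.append((current[0], " ".join(current[1])))
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)  # continue the open span
        else:
            # "O" tag, or an I- tag that does not continue the open span
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = None
    if current:  # close a span that runs to the end of the sentence
        spans.append((current[0], " ".join(current[1])))
    return spans


# Hypothetical labeler output for an example sentence.
tokens = ["Janet", "visited", "Stanford", "University", "in", "Colorado"]
predicted = ["B-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = bio_to_spans(tokens, predicted)
print(entities)
```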
Relation Extraction: relationships that exist
among the detected entities
Relationship Example
• Spokesman relationship: the text tells us, for example, that Tim
Wagner is a spokesman for American Airlines.
• Unit-of relationship: it also tells us that United is a unit of UAL Corp. and that
American is a unit of AMR.
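Extracted relations like these are naturally stored as (subject, relation, object) rows, which is how IE output can populate a relational database. The sketch below uses Python's built-in `sqlite3` module; the table name `extracted_relation`, its column names, and the relation labels `spokesman_for` and `unit_of` are hypothetical choices for illustration.

```python
import sqlite3

# Relation triples taken from the example text: (subject, relation type, object).
relations = [
    ("Tim Wagner", "spokesman_for", "American Airlines"),
    ("United", "unit_of", "UAL Corp."),
    ("American", "unit_of", "AMR"),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE extracted_relation (subject TEXT, relation_type TEXT, object TEXT)"
)
conn.executemany("INSERT INTO extracted_relation VALUES (?, ?, ?)", relations)

# Example query: which entities are units of some parent company?
units = conn.execute(
    "SELECT subject, object FROM extracted_relation "
    "WHERE relation_type = 'unit_of' ORDER BY subject"
).fetchall()
print(units)
```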
The 17 relations used in the ACE relation
extraction task.
Semantic relations with examples and the
named entity types they involve.
Relation Extraction Algorithms
• There are five main classes of algorithms for relation extraction:
• handwritten patterns,
• supervised machine learning,
• semi-supervised learning via bootstrapping,
• semi-supervised learning via distant supervision,
• and unsupervised methods.
Using Patterns to Extract Relation
• Consider the following sentence:
• Agar is a substance prepared from a mixture of red algae, such as Gelidium, for
laboratory or industrial use.
• Hearst points out that most human readers will not know what Gelidium is, but that they
can readily infer that it is a kind of (a hyponym of) red algae, whatever that is.
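• She suggests that the lexico-syntactic pattern "NP0 such as NP1 {, NP2 ..., (and|or) NPi}",
with i ≥ 1, implies that each NPi is a hyponym of NP0.
• The figure shows five patterns Hearst (1992a, 1998) suggested for
inferring the hyponym relation;
• we’ve shown NPH as the parent (the hypernym).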
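A rough sketch of how the "such as" pattern could be applied with a regular expression follows. It assumes the noun phrases have already been marked with square brackets by some upstream chunker (a simplification made here; the slides do not specify how NPs are found), and the name `extract_hyponyms` is hypothetical.

```python
import re

NP = r"\[[^\]]+\]"  # one bracketed noun phrase, e.g. "[red algae]"

# Pattern: NP0 such as NP1 {, NP2 ..., (and|or) NPi}
SUCH_AS = re.compile(
    rf"(?P<hyper>{NP}),? such as "
    rf"(?P<hypos>{NP}(?:(?:, (?:and |or )?| and | or ){NP})*)"
)


def extract_hyponyms(chunked_text):
    """Return (hyponym, hypernym) pairs licensed by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_text):
        hypernym = match.group("hyper").strip("[]")
        # Every bracketed NP after "such as" is treated as a hyponym of NP0.
        for hyponym in re.findall(r"\[([^\]]+)\]", match.group("hypos")):
            pairs.append((hyponym, hypernym))
    return pairs


# The Agar sentence, with NP brackets added by hand for this sketch.
sentence = (
    "Agar is a substance prepared from a mixture of [red algae], "
    "such as [Gelidium], for laboratory or industrial use."
)
pairs = extract_hyponyms(sentence)
print(pairs)
```

Patterns like this tend to be precise but cover only the constructions they were written for, which is the usual trade-off of handwritten rules.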
• Modern versions of the pattern-based approach extend it by adding
named entity constraints.
• For example if our goal is to answer questions about “Who holds
what office in which organization?”,
• we can use patterns like the following:
Extracting Time
➢ Times and dates are a particularly important kind of named entity; they play a
role in question answering and in calendar and personal assistant applications.
➢ After we extract these temporal expressions, they must be normalized, that is,
converted to a standard format, so that we can reason about them.
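As an illustration of normalization, the sketch below converts one surface form of an absolute date ("Month D, YYYY") into an ISO 8601 string. Choosing ISO 8601 as the standard format, handling only this one surface form, and the name `normalize_date` are assumptions made for the example.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

DATE_RE = re.compile(r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})")


def normalize_date(expression):
    """Convert e.g. 'July 2, 2007' to an ISO 8601 date string, or None if it does not parse."""
    match = DATE_RE.fullmatch(expression.strip())
    if not match or match.group("month") not in MONTHS:
        return None
    try:
        normalized = date(
            int(match.group("year")),
            MONTHS[match.group("month")],
            int(match.group("day")),
        )
    except ValueError:
        return None  # impossible calendar date, e.g. "February 30, 2007"
    return normalized.isoformat()


print(normalize_date("July 2, 2007"))
```

Once two expressions are in the same format, comparing or sorting them becomes a simple string or date comparison.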
Temporal Expression Extraction
❑ Temporal expressions are those that refer to:
▪ absolute points in time,
▪ relative times,
▪ absolute durations,
▪ and sets of these.
➢ Absolute temporal expressions are those that can be mapped directly to
calendar dates, times of day, or both.
➢ Relative temporal expressions map to particular times through some other
reference point (as in a week from last Tuesday).
➢ Durations denote spans of time at varying levels of granularity (seconds,
minutes, days, weeks, centuries, etc.).
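Relative expressions can only be resolved against a reference point. The sketch below resolves "a week from last Tuesday" given an anchor date; reading "last Tuesday" as the most recent Tuesday strictly before the anchor is one possible interpretation, and the anchor date itself is a hypothetical document date.

```python
from datetime import date, timedelta

TUESDAY = 1  # datetime convention: Monday is 0, Sunday is 6


def last_weekday(anchor, weekday):
    """Most recent `weekday` strictly before `anchor` (one reading of 'last Tuesday')."""
    days_back = (anchor.weekday() - weekday) % 7 or 7
    return anchor - timedelta(days=days_back)


anchor = date(2007, 7, 2)  # hypothetical reference point (a Monday)
resolved = last_weekday(anchor, TUESDAY) + timedelta(weeks=1)
print(resolved.isoformat())
```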
Examples of absolute, relative and durational
temporal expressions.
➢ Important observation: temporal expressions are grammatical constructions that have
temporal lexical triggers as their heads.
➢ Lexical triggers might be nouns, proper nouns, adjectives, and adverbs.
➢ Full temporal expressions consist of the phrasal projections of these triggers:
noun phrases, adjective phrases, and adverbial phrases.
Examples of lexical triggers:
The TimeML annotation scheme
❑ The TimeML annotation scheme annotates temporal expressions with an XML
tag, TIMEX3, and various attributes to that tag (Pustejovsky et al. 2005, Ferro
et al. 2005).
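For illustration, a TIMEX3 annotation for a single date could be built with Python's standard `xml.etree.ElementTree` module as sketched below. The attribute set shown (`tid`, `type`, `value`) is a small subset chosen for the example, and the helper name `make_timex3` and the example date are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_timex3(text, tid, timex_type, value):
    """Wrap a temporal expression in a TIMEX3 element and return it as a string."""
    element = ET.Element("TIMEX3", {"tid": tid, "type": timex_type, "value": value})
    element.text = text  # the surface form of the temporal expression
    return ET.tostring(element, encoding="unicode")


timex_xml = make_timex3("July 2, 2007", tid="t1", timex_type="DATE", value="2007-07-02")
print(timex_xml)
```

The `value` attribute holds the normalized form, so the annotation records both what the text said and what it means.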
The temporal expression recognition task
❑ The temporal expression recognition task consists of finding the start and
end of all of the text spans that correspond to such temporal expressions.
➢ Rule-based approaches
➢ Sequence-labeling approaches
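A very small rule-based recognizer can be sketched with regular expressions built around lexical triggers such as month and weekday names. The three patterns below are illustrative assumptions covering only a few constructions; a realistic rule set would be much larger.

```python
import re

MONTH = (
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
)
WEEKDAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"

TEMPORAL_RE = re.compile(
    "|".join([
        rf"\b{MONTH} \d{{1,2}}, \d{{4}}\b",      # absolute date, e.g. "July 2, 2007"
        rf"\b(?:last|next|this) {WEEKDAY}\b",    # relative weekday, e.g. "last Tuesday"
        r"\b(?:yesterday|today|tomorrow)\b",     # single-word relative expressions
    ])
)


def find_temporal_spans(text):
    """Return (start, end, surface_text) for each recognized temporal expression."""
    return [(m.start(), m.end(), m.group()) for m in TEMPORAL_RE.finditer(text)]


# Hypothetical input sentence.
text = "The report was filed on July 2, 2007 and revised last Tuesday."
temporal_spans = find_temporal_spans(text)
for start, end, surface in temporal_spans:
    print(start, end, surface)
```

A sequence-labeling approach would instead assign BIO-style tags to each token and learn the span boundaries from annotated data.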