Unit 3 - Syntax Level Analysis
• Words are traditionally grouped into equivalence classes called parts of speech (POS; Latin pars orationis).
• Closed class categories are composed of a small, fixed set of grammatical function words for a
given language.
• Pronouns, Prepositions, Modals, Determiners, Particles, Conjunctions
• Open class categories have a large number of words, and new ones are easily invented.
• Nouns (Googler, textlish), Verbs (Google), Adjectives (geeky), Adverbs (automagically)
Textlish is the use of text-based acronyms such as LOL ("laugh out loud") and OMG ("oh my God") in spoken conversation. Automagically means automatically, in a way that seems ingenious, inexplicable, or magical.
Choosing a tagset
• To do POS tagging, a standard tagset needs to be chosen
• Could pick a very coarse tagset:
N, V, Adj, Adv
• Could pick a fine-grained tagset:
NN, NNS, VBD, MD, …
• Average POS tagging disagreement amongst expert human judges for the Penn Treebank was 3.5%
• Based on correcting the output of an initial automated tagger, which was deemed to be more accurate than tagging from
scratch.
• Baseline: Picking the most frequent tag for each specific word type gives about 90% accuracy
• 93.7% accuracy with the Penn Treebank tagset if a model for unknown words is used.
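A minimal sketch of this most-frequent-tag baseline in Python (the list-of-(word, tag) corpus format and the NN fallback for unknown words are assumptions for illustration):

```python
# A sketch of the most-frequent-tag baseline, assuming a tagged corpus
# represented as a list of (word, tag) pairs (e.g. read from the Penn Treebank).
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Record, for every word type, the tag it occurs with most often."""
    counts = defaultdict(Counter)                 # word -> Counter of its tags
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, most_frequent, default="NN"):
    """Tag each word with its most frequent tag; unknown words get a crude default."""
    return [(w, most_frequent.get(w, default)) for w in words]

corpus = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("can", "NN"), ("rusted", "VBD")]
model = train_baseline(corpus)
print(tag("the can rusted".split(), model))
# [('the', 'DT'), ('can', 'NN'), ('rusted', 'VBD')]
```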
POS Tagging Approaches
• Rule-Based: Human crafted rules based on lexical and other linguistic knowledge.
• Stochastic-Based: a stochastic tagger uses a simple generalization approach (probabilities learned from a tagged corpus) to pick the most likely tag for a word.
• Transformation-Based: combines aspects of both rule-based and stochastic taggers.
Rule-based POS Tagging Approach
• Assign each word in the input a list of potential POS tags
• Then winnow down this list to a single tag using hand-written rules.
• Steps (example sentence: The can was rusted.):
1. Read the input sentence.
2. Consult the dictionary / lexicon to list each word's possible tags:
• The/DT can/MD/NN was/VBD rusted/VBD
3. Apply the handwritten rules to narrow each list to a single tag.
4. Output each word with its tag.
• Here can is ambiguous between MD and NN; the rules resolve it to NN.
• Two broad families of statistical models are used for tagging:
• Generative models generate the observed data from hidden structure, i.e. they put a probability over the observations given the class: they model P(d, c) in terms of P(d|c).
e.g. Naïve Bayes classifiers, Hidden Markov Models, etc.
• Discriminative models take the data as given and put a probability over the hidden structure given the data: they model P(c|d) directly.
e.g. logistic regression, maximum entropy models, conditional random fields
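The two views are related by Bayes' rule (with P(c) the class prior and P(d) the marginal probability of the data, symbols not used elsewhere in these notes): a generative model recovers the posterior that a discriminative model estimates directly.

\[
P(c \mid d) \;=\; \frac{P(d \mid c)\,P(c)}{P(d)} \;\propto\; P(d \mid c)\,P(c)
\]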
Stochastic-based POS Tagging Approach
• Problem
• We have n words in a corpus and have to find corresponding tags.
• The problem is to assign the tag with maximum probability to the given word.
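For example, a bigram HMM tagger, one standard stochastic tagger, picks the tag sequence t_1 ... t_n for the words w_1 ... w_n that maximizes the product of lexical and tag-transition probabilities (a sketch of one common model, not necessarily the exact one intended here):

\[
\hat{t}_1^{\,n} \;=\; \operatorname*{argmax}_{t_1^{\,n}} P(t_1^{\,n} \mid w_1^{\,n}) \;\approx\; \operatorname*{argmax}_{t_1^{\,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\]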
Transformation-based POS Tagging Approach
• Combines aspects of both rule-based and stochastic taggers.
• Like the rule-based taggers, TBL (transformation-based learning) is based on rules that specify what tags should be assigned to what words.
• But like the stochastic taggers, TBL is a machine learning technique, in which the rules are automatically induced from data.
➢ A corpus is first tagged using the broadest rule, i.e. the one that applies to the most cases.
➢ Then a slightly more specific rule is chosen, which changes some of the original tags.
➢ Next an even narrower rule, which changes a smaller number of tags (some of which might be previously-
changed tags).
Transformation-based POS Tagging Approach
1. Secretariat is expected to race tomorrow
2. People continue to inquire the reason for the race for outer space
• In sentence 2, race is correctly tagged as a noun: People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• In sentence 1, however, race should be tagged as a verb (VB), but an initial tagger that simply picks each word's most frequent tag labels it NN.
• Brill's tagger learned a rule that fixes exactly this mistagging of race:
Change NN to VB when the previous tag is TO
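A minimal sketch of applying this single learned transformation to the output of an initial tagger (the list-of-(word, tag) representation is an assumption for illustration):

```python
# Apply the learned transformation "Change NN to VB when the previous tag is TO"
# to an initially tagged sentence.
def apply_transformation(tagged):
    """tagged: list of (word, tag) pairs produced by the initial tagger."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "NN" and out[i - 1][1] == "TO":
            out[i] = (word, "VB")                 # rewrite the offending tag
    return out

initial = [("to", "TO"), ("race", "NN"), ("tomorrow", "NR")]
print(apply_transformation(initial))
# [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NR')]
```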
Issues in POS Tagging
1. Multiple tags:
▪ Tag indeterminacy arises when a word is ambiguous between multiple tags and it is impossible or very difficult to disambiguate.
▪ In this case, some taggers allow the use of multiple tags. E.g. in the Penn Treebank and in the British National Corpus.
▪ Common tag indeterminacies include adjective versus preterite versus past participle (JJ/VBD/VBN), and adjective versus noun as prenominal modifier
(JJ/NN).
2. Multi-part words:
▪ Another tokenization issue concerns multi-part words. The Treebank tagset assumes that tokenization of words like New York is done at whitespace.
▪ The phrase a New York City firm is tagged in Treebank notation as five separate words: a/DT New/NNP York/NNP City/NNP firm/NN.
▪ Conversely, contracted forms and the possessive clitic are split off from their stems:
would/MD n't/RB
children/NNS 's/POS
Issues in POS Tagging
3. Unknown words
▪ All the tagging algorithms require a dictionary that lists the possible parts of speech of every word.
▪ But the largest dictionary will still not contain every possible word.
▪ Proper names and acronyms are created very often, and even new common nouns and verbs enter the language at a surprising rate.
▪ Therefore, to build a complete tagger we need some method for guessing the tag of an unknown word
▪ The most powerful unknown-word algorithms make use of information about how the word is spelled.
▪ For example,
o words that end in the letter s are likely to be plural nouns (NNS),
o four values of capitalization (e.g. capitalized sentence-initially vs. capitalized elsewhere), and hyphenation.
▪ These spelling features can be combined to compute the likelihood of an unknown word.
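One such formulation, due to Weischedel et al. (1993) and assumed here to be the equation the slide refers to, multiplies the per-feature likelihoods:

\[
P(w_i \mid t_i) \;=\; p(\text{unknown-word} \mid t_i)\cdot p(\text{capital} \mid t_i)\cdot p(\text{endings/hyph} \mid t_i)
\]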
Issues in POS Tagging
4. Part-of-Speech Tagging for Other Languages
▪ Part-of-speech tagging algorithms have been applied to many other languages, such as Chinese (Chinese Treebank 5.0) and German (the NEGRA corpus).
▪ But several augmentations and changes become necessary when dealing with highly inflected or agglutinative languages (e.g. Turkish or Hungarian), which have very large vocabularies.
▪ The large vocabulary size seems to cause a significant degradation in tagging performance when the HMM algorithm is applied directly to
agglutinative languages. However, the performance of these taggers is hugely improved by adding a dictionary which essentially gives a better
model of unknown words
▪ A second, related issue with such languages is the vast amount of information that is coded in the morphology of the word. In English, lots of
information about syntactic function of a word is represented by word order, or neighboring function words. A part-of-speech tagging output for
Turkish or Czech needs to include information about the case and gender of each word in order to be as useful as parts-of-speech without case
or gender are in English
Issues in POS Tagging
5. Combining Taggers
▪ The most common approach to tagger combination is to run multiple taggers in parallel on the same sentence, and then combine
their output, either by voting or by training another classifier to choose which tagger to trust in each context.
▪ In general, this kind of combination is only useful if the taggers have complementary errors, and so research on combination often
begins by checking to see if the errors are indeed different from different taggers.
▪ Another option is to combine taggers in series, using the output of one tagger as input to the next.
Context-free grammar
Constituency
• The idea: Groups of words may behave as a single unit or phrase, called a constituent.
• E.g. Noun Phrase
• Kermit the frog
• they
• December twenty-sixth
• the reason he is running for president
• Sentences have parts, some of which appear to have subparts. These groupings of words that go together we will call constituents.
• Compare with:
• The man from Amherst grew beautiful russet potatoes.
• We say Joe (as in Joe grew beautiful russet potatoes) counts as a noun phrase because it appears in a position where the larger noun phrase the man from Amherst could also have appeared.
Formal definition of context-free grammar
A context-free grammar G is a 4-tuple (N, Σ, R, S), where N is a set of non-terminal symbols, Σ is a set of terminal symbols (disjoint from N), R is a set of rules or productions of the form A → β, with A ∈ N and β a string over (Σ ∪ N), and S ∈ N is a designated start symbol.
Example of CFG
Chomsky Normal Form
• A context-free grammar is in Chomsky Normal Form (CNF) if it is ϵ-free and if, in addition, each production is either of the form A → B C or A → a.
• Chomsky normal form grammars are binary branching, i.e. their parse trees are binary trees.
• Any grammar can be converted into a weakly-equivalent Chomsky normal form grammar.
• For example, the rule
A → B C D
can be converted into CNF as follows:
A → B X
X → C D
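A minimal sketch of this binarization step in Python (the fresh non-terminal names such as A_1 are invented here for illustration; a full CNF conversion would also handle ϵ-productions, unit productions and terminals mixed into long right-hand sides):

```python
# Split a rule with a long right-hand side, e.g. A -> B C D,
# into binary rules using fresh non-terminals.
def binarize(lhs, rhs):
    rules = []
    while len(rhs) > 2:
        fresh = f"{lhs}_{len(rules) + 1}"         # fresh non-terminal, e.g. A_1
        rules.append((lhs, [rhs[0], fresh]))
        lhs, rhs = fresh, rhs[1:]
    rules.append((lhs, list(rhs)))
    return rules

print(binarize("A", ["B", "C", "D"]))
# [('A', ['B', 'A_1']), ('A_1', ['C', 'D'])]
```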
Parsing with CFG
• Types of parsing
• Top down parsing (Goal Oriented)
• Bottom up parsing (Data Oriented)
A top-down parser searches for a parse tree by trying to build from the root node S down to the leaves.
Method:
Initial Trees: Identify all trees that can start with S by checking grammar rules with S on the left hand side.
Grow Trees: Expand trees downward using these rules until they reach the POS categories at the bottom.
Match Input: Reject trees whose leaves do not match the input words.
Top down parser – example 1
Sentence - Book that flight
(The parse proceeds over four iterations, growing the candidate trees downward from S.)
Top down parser – example 2
S → NP VP
NP → Det N | Det N PP | PNoun | N
VP → Verb NP | Verb NP PP
PP → Prep NP
Verb → saw | ate | walked
PNoun → Ramesh | Raj | Anita
Det → a | an | the
N → man | dog | cat | telescope | park
Prep → in | on | by | with
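A minimal backtracking top-down (recursive descent) recognizer for this toy grammar, as an illustrative sketch (the dictionary encoding of the rules and the depth-first backtracking strategy are choices made here, not taken from the slides):

```python
# Backtracking top-down recognizer: expand non-terminals depth-first,
# matching terminals against the input and backtracking on failure.
GRAMMAR = {
    "S":     [["NP", "VP"]],
    "NP":    [["Det", "N"], ["Det", "N", "PP"], ["PNoun"], ["N"]],
    "VP":    [["Verb", "NP"], ["Verb", "NP", "PP"]],
    "PP":    [["Prep", "NP"]],
    "Verb":  [["saw"], ["ate"], ["walked"]],
    "PNoun": [["Ramesh"], ["Raj"], ["Anita"]],
    "Det":   [["a"], ["an"], ["the"]],
    "N":     [["man"], ["dog"], ["cat"], ["telescope"], ["park"]],
    "Prep":  [["in"], ["on"], ["by"], ["with"]],
}

def derives(symbols, words):
    """True if the symbol sequence can derive exactly the given word list."""
    if not symbols:
        return not words                          # both exhausted -> success
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:                      # terminal: must match next word
        return bool(words) and words[0] == first and derives(rest, words[1:])
    # non-terminal: try each expansion, backtracking on failure
    return any(derives(list(rhs) + rest, words) for rhs in GRAMMAR[first])

print(derives(["S"], "Ramesh saw the dog in the park".split()))   # True
print(derives(["S"], "the park saw".split()))                     # False
```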
Bottom up parser (data directed)
• The parser starts with the input words and builds trees upward.
• Method:
• Start with words: Begin with the words of the input
• Apply Rules: Build trees by applying grammar rules one at a time.
• Fit Rules: Look for places in the current parse where the RHS of a rule can fit.
Bottom up parser – example 1
Sentence - Book that flight
(The parse again proceeds over four iterations, this time building trees upward from the words.)
Bottom up parser – example 2
Sentence - “Rahul is eating an apple”
Use the following grammar rules to create the parse tree:
S → NP VP
NP → Det N | N
VP → Aux VP
VP → V NP
Aux → is
V → eating
N → Rahul | apple
Det → an
Top-down/Bottom-up Parsing
• Top-down (recursive descent parser)
• Termination: Success when the leaves of a tree match the input; Failure when there are no more non-terminals to expand in any of the trees
• Pro: goal-driven, starts with "S"
• Con: constructs trees that may not match the input
• Bottom-up (shift-reduce parser)
• Termination: Success when "S" is reached; Failure when no more rewrites are possible
• Pro: constrained by the input string
• Con: constructs constituents that may not lead to the goal "S"
Left-corner parsing
• The current input word must be able to serve as the first word in the derivation of the unexpanded node that the parser is currently processing.
• The first word along the left edge of a derivation is called the left corner of the tree.
• E.g. for a rule
A → B α
B is the left corner of A.
• E.g. consider the input sentence "Does this flight include a meal?" and the accompanying grammar.
• A grammar is left recursive if it contains a non-terminal category that has a derivation that
includes itself anywhere along its leftmost branch.
• Left recursive rules
• A → A β
• e.g.
• NP → NP PP
• VP → VP NP
• S → NP VP
Local ambiguity
• Even if a sentence isn't ambiguous, it can be inefficient to parse due to local ambiguity.
• Local ambiguity occurs when some part of a sentence is ambiguous, i.e. has more than one parse, even if the whole sentence is not ambiguous.
• For example, the sentence Book that flight is unambiguous, but when the parser sees the first word Book, it cannot know if it is a verb or a noun until later.
Earley parsing
• The Earley parser fills a chart with states: dotted grammar rules together with the input positions they span, e.g.
S → • VP, [0,0]
NP → Det • Nominal, [1,2]
VP → V NP •, [0,3]
• The first 0 indicates that the constituent predicted by this state should begin at the start of the input; the second 0 reflects the fact that the dot lies at the beginning as well.
• The second state, created at a later stage in the processing of this sentence, indicates that an NP begins at position 1, that a Det has been successfully parsed, and that a Nominal is expected next.
• The third state, with its dot to the right of both of its constituents, represents the successful discovery of a tree corresponding to a VP that spans the entire input.
• These states can also be represented graphically, in which case the states of the parse are edges, or arcs, and the chart as a whole is a directed acyclic graph.
The three Earley operations
• Scanner
• When does it apply: when a terminal (a part-of-speech category) is to the right of the dot, e.g. VP → • Verb NP, [0,0]
• What chart cell is affected: new states are added to the next cell
• What contents are added: the dot is moved over the terminal, e.g. VP → Verb • NP, [0,1]
• Predictor
• When does it apply: when a non-terminal is to the right of the dot, e.g. S → • VP, [0,0]
• What chart cell is affected: new states are added to the current cell
• What contents are added: one new state for each expansion of the non-terminal in the grammar, e.g. VP → • V, [0,0] and VP → • V NP, [0,0]
• Completer
• When does it apply: when the dot reaches the end of a rule, e.g. NP → Det Nom •, [1,3]
• What chart cell is affected: new states are added to the current cell
• What contents are added: one state for each rule "waiting" for the completed constituent, e.g. VP → V • NP, [0,1] advanced to VP → V NP •, [0,3]
CFG for Fragment of English
Probabilistic Context-Free Grammars (PCFG)
• A PCFG augments each rule of a CFG with a probability.
• For all the possible expansions of a non-terminal, the sum of their probabilities must be 1.
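Written out, for every non-terminal A:

\[
\sum_{\beta} P(A \rightarrow \beta) = 1
\]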
How to use the probabilities
• The probability of a particular parse T is defined as the product of the probabilities of all the rules r used to expand each node n in the parse tree:
P(T, S) = ∏ P(r(n)), the product running over all nodes n in T.
• The resulting probability P(T, S) is both the joint probability of the parse and the sentence, and also the probability of the parse P(T).
• This is because P(T, S) = P(T) P(S|T), and since a parse tree includes all the words of the sentence, P(S|T) is 1. Thus P(T, S) = P(T).
The probability for all different parse trees is computed and the algorithm picks the best tree for a sentence
S out of the set of parse trees for S.
We want the parse tree T which is most likely given the sentence S.
• By definition, the probability P(T|S) = P(T, S)/P(S); since P(S) is constant for a given sentence, maximizing P(T|S) is the same as maximizing P(T, S).
• Furthermore, since we showed above that P(T, S) = P(T), the final equation for choosing the most likely parse simplifies to choosing the parse with the highest probability P(T).
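Putting the steps together, with τ(S) denoting the set of parse trees for sentence S:

\[
\hat{T}(S) \;=\; \operatorname*{argmax}_{T \in \tau(S)} P(T \mid S)
\;=\; \operatorname*{argmax}_{T \in \tau(S)} \frac{P(T, S)}{P(S)}
\;=\; \operatorname*{argmax}_{T \in \tau(S)} P(T, S)
\;=\; \operatorname*{argmax}_{T \in \tau(S)} P(T)
\]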
• We can see that the right tree in Figure 12.2(b) has a higher probability.
• Thus this parse would correctly be chosen by a disambiguation
algorithm which selects the parse with the highest PCFG probability.
Cocke-Younger-Kasami Parser
• Bottom-up parser with top-down filtering
• Start State(s): (A, i, i+1) for each rule A → w_{i+1}
• End State: (S, 0, n), where n is the input size
• Next State Rules
• (B, i, k), (C, k, j) ⇒ (A, i, j) if A → B C
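A minimal CYK recognizer sketch in Python for a grammar in Chomsky Normal Form, using the (A, i, j) states above with 0-based, end-exclusive spans (the lexicon and binary rules below are assumptions chosen to match the worked example that follows):

```python
# CYK recognizer: fill chart[(i, j)] with every non-terminal that derives
# words[i:j], building longer spans from shorter ones.
from itertools import product

def cyk(words, lexical, binary):
    """lexical: word -> set of A   (rules A -> w)
       binary:  (B, C) -> set of A (rules A -> B C)"""
    n = len(words)
    chart = {}                                    # (i, j) -> set of non-terminals
    for i, w in enumerate(words):                 # start states (A, i, i+1)
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):             # (B,i,k) + (C,k,j) -> (A,i,j)
                for B, C in product(chart[(i, k)], chart[(k, j)]):
                    cell |= binary.get((B, C), set())
            chart[(i, j)] = cell
    return "S" in chart[(0, n)]                   # end state (S, 0, n)

lexical = {"the": {"Det"}, "price": {"N"}, "includes": {"V"},
           "a": {"Det"}, "facemask": {"N"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cyk("the price includes a facemask".split(), lexical, binary))  # True
```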
The price includes a facemask
1-the 2-price 3-includes 4-a 5-facemask
First, the single-word (diagonal) cells of the chart are filled with each word's part of speech:
row 0: Det (the) | row 1: N (price) | row 2: V (includes) | row 3: Det (a) | row 4: N (facemask)
Now that the single-word entries are filled, we move on to the two-word entries, e.g. "the price".
We look for a rule with Det N on the right-hand side and find it in rule 2 (NP → Det N), so we make a note of NP in cell (0,1).
Chart so far: row 0: Det, NP | row 1: N | row 2: V | row 3: Det | row 4: N
The other two-word spans "price includes" and "includes a" match no rule and are marked X, while "a facemask" again gives NP (cell (3,4)).
For the three-word spans, "the price includes" and "price includes a" match no rule (X), but "includes a facemask" combines V and NP by the rule VP → V NP, so VP is noted in cell (2,4).
Chart so far: row 0: Det, NP, X | row 1: N, X, X | row 2: V, X, VP | row 3: Det, NP | row 4: N
"the price includes a" all the highlighted options are impossible
The price includes a facemask
1-the 2-price 3-includes 4-a 5-facemask
0 Det NP X X
1 N X X
2 V X VP
3 Det NP
4 N
"the price includes a" all the highlighted options are impossible
The price includes a facemask
1-the 2-price 3-includes 4-a 5-facemask
0 Det NP X X
1 N X X
2 V X VP
3 Det NP
4 N
0 Det NP X X
1 N X X
2 V X VP
3 Det NP
4 N
0 Det NP X X
1 N X X X
2 V X VP
3 Det NP
4 N
0 Det NP X X
1 N X X X
2 V X VP
3 Det NP
4 N
For the full five-word span, the NP in cell (0,1) and the VP in cell (2,4) combine by the rule S → NP VP, so S is noted in cell (0,4):
row 0: Det, NP, X, X, S | row 1: N, X, X, X | row 2: V, X, VP | row 3: Det, NP | row 4: N
For the entire input we get a match (an S covering the whole sentence), and the probabilities are calculated, e.g. N price (0.01), V includes (0.05), N facemask (0.02), and VP includes a facemask: 0.2 * 0.05 * 0.0024 = 0.000024.
A further example sentence: a pilot likes flying planes