Unit 4
Syntax Level Analysis
POS Tagging
• Parts of Speech (PoS) tagging is a core task in NLP.
• It assigns each word a grammatical category such as noun, verb,
adjective, or adverb.
• By making phrase structure, and ultimately meaning, easier to recover, this
technique helps machines analyze human language more accurately.
Example
Implementation of Parts-of-Speech tagging using
NLTK
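The slide announces an NLTK implementation; below is a minimal sketch, assuming the standard NLTK tokenizer and tagger resources can be downloaded on first use (resource names can vary slightly across NLTK versions).

```python
# Minimal POS tagging sketch with NLTK.
import nltk
from nltk import word_tokenize, pos_tag

nltk.download("punkt")                       # sentence/word tokenizer models
nltk.download("averaged_perceptron_tagger")  # default English POS tagger

text = "POS tagging assigns a grammatical category to every word."
tokens = word_tokenize(text)
print(pos_tag(tokens))
# e.g. [('POS', 'NNP'), ('tagging', 'NN'), ('assigns', 'VBZ'), ...]
```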
Workflow of POS Tagging in NLP
• Tokenization: The input text is divided into individual tokens, representing words or subwords.
Tokenization is the foundational step in most NLP tasks and enables further analysis at the word
level.
• Loading a Language Model: Tools like NLTK or spaCy require a pre-trained language model to
perform POS tagging. These models are trained on large datasets and capture the
grammatical rules and structure of the language (a sketch of the full workflow follows this list).
• Text Preprocessing: The text is then cleaned to improve accuracy. Common preprocessing steps
include converting text to lowercase, removing special characters, and eliminating irrelevant
content.
• Linguistic Analysis: This stage involves parsing the sentence to understand the grammatical role
of each token. It lays the groundwork for assigning the appropriate part of speech by interpreting
the sentence’s syntactic structure.
• POS Tagging: Each token is then assigned a specific part-of-speech label. This is based on its role
in the sentence and contextual clues provided by surrounding words.
• Result Evaluation: Finally, the POS-tagged output is reviewed to ensure accuracy. Any
misclassifications or anomalies are identified and corrected as needed.
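As a sketch of this workflow, the snippet below uses spaCy; the model name en_core_web_sm and the toy text are assumptions, not part of the original slide.

```python
# Workflow sketch with spaCy: load a pre-trained model, lightly preprocess,
# then tokenize and tag (assumes: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")        # loading a language model

raw = "  The quick brown Fox JUMPS over the lazy dog!!  "
clean = " ".join(raw.lower().split())     # preprocessing: lowercase, trim spaces

doc = nlp(clean)                          # tokenization + linguistic analysis
for token in doc:
    # coarse-grained (pos_) and fine-grained Penn Treebank (tag_) labels
    print(token.text, token.pos_, token.tag_)
```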
Why Is It Difficult?
• Although it seems easy, identifying part-of-speech tags is much more
complicated than simply mapping each word to a fixed tag, because many
words can take different tags depending on context.
If it is difficult, then what approaches do we have?
Word Classes
• In grammar, a part of speech (POS), also known as a word class or
grammatical category, is a category of words that
have similar grammatical properties.
• The English language has four major word classes: Nouns, Verbs,
Adjectives, and Adverbs.
• Commonly listed English parts of speech are nouns, verbs, adjectives,
adverbs, pronouns, prepositions, conjunctions, interjections,
numerals, articles, and determiners.
• These can be further categorized into open and closed classes.
Closed Class
• Closed classes are those with a relatively fixed number of words; we
rarely add new words to these classes (e.g., prepositions). Closed
class words are generally functional words like of, it, and, or
you, which tend to be very short, occur frequently, and often have
structuring uses in grammar.
• Example of closed class-
• Determiners: a, an, the
• Pronouns: she, he, I, others
• Prepositions: on, under, over, near, by, at, from, to, with
Open Class
• Open classes are mostly content-bearing, i.e., they refer to objects,
actions, and features; they are called open classes because new words are
added to them all the time.
• By contrast, nouns and verbs, adjectives, and adverbs belong to open
classes; new nouns and verbs like iPhone or to fax are continually
being created or borrowed.
• Example of open class-
• Nouns: computer, board, peace, school
• Verbs: say, walk, run, belong
• Adjectives: clean, quick, rapid, enormous
• Adverbs: quickly, softly, enormously, cheerfully
Tag set
The problem is that many words belong to more than one word class.
And to do POS tagging, a standard set needs to be chosen.
We could pick very simple/coarse tag sets such as Noun (NN), Verb
(VB), Adjective (JJ), Adverb (RB), etc.
But to reduce ambiguity, the commonly used set is finer-grained: the
University of Pennsylvania's Penn Treebank tagset, which has a total of 45 tags.
Parts of Speech Tagging
• Tagging is a disambiguation task; words are ambiguous, i.e., they have
more than one possible part of speech, and the goal is to find the
correct tag for the situation.
• For example, a book can be a verb (book that flight) or a noun (hand
me that book).
• The goal of POS tagging is to resolve these ambiguities, choosing the
proper tag for the context.
Looking into the Operational Modalities Adopted in Some of the POS Tagging Tools in Identification of Contextual Part-of-Speech of Words in Texts
[Link]
Rule-Based Tagging
• Rule-based tagging is the oldest tagging approach where we use contextual information to assign
tags to unknown or ambiguous words.
• The rule-based approach uses a dictionary to get possible tags for tagging each word. If the word
has more than one possible tag, then rule-based taggers use hand-written rules to identify the
correct tag.
• Since the rules are usually built manually, such taggers are also called knowledge-driven taggers.
The number of rules is limited, roughly 1,000 for the English language.
• One example of a rule is as follows (a toy sketch appears after this list):
• Sample Rule: If an ambiguous word “X” is preceded by a determiner and followed by a noun, tag
it as an adjective;
• A nice car: nice is an ADJECTIVE here.
• Limitations/Disadvantages of Rule-Based Approach:
• High development cost and high time complexity when applying to a large corpus of text
• Defining a set of rules manually is an extremely cumbersome process and is not scalable at all
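To make the sample rule concrete, here is a toy sketch; the mini-lexicon, tag names, and the single rule are illustrative assumptions, not a real rule set.

```python
# Toy rule-based tagger: dictionary lookup plus one hand-written contextual rule.
LEXICON = {
    "a": {"DET"}, "an": {"DET"}, "the": {"DET"},
    "nice": {"ADJ", "NOUN"},          # ambiguous word in this toy lexicon
    "car": {"NOUN"},
}

def rule_based_tag(tokens):
    tags = []
    for i, word in enumerate(tokens):
        candidates = LEXICON.get(word.lower(), {"NOUN"})   # default guess for unknown words
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        # Sample rule: preceded by a determiner and followed by a noun -> adjective
        prev_is_det = i > 0 and tags[-1] == "DET"
        next_is_noun = (i + 1 < len(tokens)
                        and "NOUN" in LEXICON.get(tokens[i + 1].lower(), set()))
        tags.append("ADJ" if prev_is_det and next_is_noun else sorted(candidates)[0])
    return list(zip(tokens, tags))

print(rule_based_tag(["a", "nice", "car"]))
# [('a', 'DET'), ('nice', 'ADJ'), ('car', 'NOUN')]
```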
Stochastic POS Tagging
• Stochastic POS Tagger uses probabilistic and statistical information
from the corpus of labeled text (where we know the actual tags of
words in the corpus) to assign a POS tag to each word in a sentence.
• This tagger can use techniques like Word frequency
measurements and Tag Sequence Probabilities. It can either use one
of these approaches or a combination of both.
Word Frequency Measurements
• The tag encountered most frequently in the corpus is the one assigned to
ambiguous words (words having two or more possible POS tags).
• Let’s understand this approach using some example sentences :
• Ambiguous Word = “play”
• Sentence 1 : I play cricket every day. POS tag of play = VERB
• Sentence 2 : I want to perform a play. POS tag of play = NOUN
• The word frequency method will now check the most frequently used POS
tag for "play". Let's say this most frequent POS tag happens to be VERB; then we
assign the POS tag of "play" = VERB (a sketch of this approach follows this list).
• The main drawback of this approach is that it can yield invalid sequences of
tags.
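A sketch of the word-frequency idea using NLTK's UnigramTagger, which assigns each word its most frequent tag in a labeled corpus; the treebank sample corpus and split size are assumptions.

```python
# Word-frequency tagging sketch: UnigramTagger picks each word's most frequent
# tag in the training corpus (assumes nltk.download("treebank") has been run).
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

train_sents = treebank.tagged_sents()[:3000]   # labeled corpus
unigram = UnigramTagger(train_sents)

# 'play' receives whichever tag it carried most often in training,
# regardless of the surrounding words.
print(unigram.tag("I play cricket every day .".split()))
```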
Tag Sequence Probabilities
• In this method, the best tag for a given word is determined by the probability that it occurs with “n”
previous tags.
• Simply put, assume we have a new sequence of 4 words, w1 , w2 , w3 , w4, and we need to identify the POS
tag of w4
• If n = 3, we will consider the POS tags of 3 words prior to w4 in the labeled corpus of text
• Let’s say the POS tags for
• w1 = NOUN, w2 = VERB , w3 = DETERMINER
• In short, N, V, D: NVD
• Then in the labeled corpus of text, we will search for this NVD sequence.
• Let’s say we found 100 such NVD sequences. Out of these -
• in 10 sequences the next word is tagged NOUN, and in 90 sequences it is tagged VERB.
• So the POS tag of w4 = VERB
• The main drawback of this technique is that sometimes the predicted sequence is not grammatically
correct. A bigram-tagger sketch of this approach follows.
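A sketch of the tag-sequence idea with NLTK's BigramTagger, which conditions a word's tag on the previous tag (one context tag) and backs off to a unigram tagger for unseen contexts; corpus choice and split size are assumptions.

```python
# Tag-sequence sketch: BigramTagger uses the previous tag as context, backing
# off to word-frequency tagging when a context was never seen in training
# (assumes nltk.download("treebank") has been run).
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger

train_sents = treebank.tagged_sents()[:3000]
unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)

print(bigram.tag("I want to perform a play .".split()))
```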
Transformation-Based Learning Tagger: TBL
• Transformation-based tagging combines the rule-based and
stochastic tagging methodologies.
• Transformation-based tagging is also called Brill tagging; a training sketch follows.
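A hedged sketch of Brill (transformation-based) tagging with NLTK: start from a stochastic baseline and learn correction rules. The brill24 template set, corpus, and rule count are assumptions.

```python
# Transformation-based (Brill) tagging sketch: a unigram baseline is corrected
# by learned transformation rules (assumes nltk.download("treebank")).
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = treebank.tagged_sents()[:3000]
baseline = UnigramTagger(train_sents)                     # stochastic starting point
trainer = BrillTaggerTrainer(baseline, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)   # learn 20 correction rules

print(brill_tagger.tag("Book that flight .".split()))
```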
Probabilistic Approach
•Idea: Pick the most likely tag for the word
•Approach
Generative: model how data is generated from each class
Discriminative: determine the class directly from the data
•Training Data: Available in the form (data, class)
•Two Types of Models
Generative
Discriminative
Generative vs. Discriminative Learning – A
Story
Zed and Zack are twin brothers.
They’re so alike that you can’t tell who’s who by looking at them.
The twins are child prodigies and jointly hold the topper’s position in their class.
Zed’s approach (Generative style):
Zed can learn everything about a given topic.
He goes in-depth and understands every little detail about a subject.
Once he’s grasped it, he never forgets it.
But this is cumbersome, especially if there’s a lot to learn under said topic.
What’s more, he has to prepare for his exams much sooner than his brother.
Zack’s approach (Discriminative style):
On the other hand, Zack studies by creating a mind map.
He gets the general idea of a topic and then learns the differences and patterns
between the subtopics.
This gives him a lot more flexibility in his thinking process.
You could say he learns by learning the differences.
Conclusion:
As we can see, the brothers have very different learning approaches but both seem to work,
as evident by the topper’s position they’ve held for so long.
Generative and Discriminative Machine
Learning Approaches – A Small Story
• Translating the analogy to our discussion:
Generative models work like Zed
Discriminative models work like Zack
[Link]
Discriminative model
• The majority of discriminative models, aka conditional models, are
used for supervised machine learning.
• They do what they ‘literally’ say, separating the data points into
different classes and learning the boundaries using probability
estimates and maximum likelihood
Generative model
• As the name suggests, generative models can be used to generate
new data points.
• These models are usually used in unsupervised machine learning
problems.
Hidden Markov Model POS Tagging: HMM
• HMM is a probabilistic sequence model, i.e., for POS tagging a given
sequence of words, it computes a probability distribution over
possible sequences of POS labels and chooses the best label
sequence.
• This makes the HMM a good and reliable probabilistic approach for
finding POS tags for a sequence of words.
Markov Model (or Markov Chain)
• Assume we have three types of weather conditions: sunny, rainy, and
foggy.
• The problem at hand is to predict the next day’s weather using the
previous day's weather.
• Let qn = the variable denoting the weather on the n-th day.
• We want to find the probability of qn given the weather conditions of the
previous n-1 days. This can be written as:
• P(qn | qn-1, qn-2, ..., q1) = ?
• According to the first-order Markov assumption -
• The weather condition on the n-th day depends only on the weather of the
(n-1)-th day: P(qn | qn-1, ..., q1) ≈ P(qn | qn-1).
• i.e. tomorrow's weather depends only on today's weather (see the sketch below).
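A minimal sketch of a first-order Markov chain for the weather example; the transition probabilities are illustrative assumptions.

```python
# First-order Markov chain: tomorrow's weather depends only on today's weather.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def next_day_distribution(today):
    """P(q_n | q_(n-1)) under the first-order Markov assumption."""
    return TRANSITIONS[today]

print(next_day_distribution("rainy"))
# {'sunny': 0.2, 'rainy': 0.6, 'foggy': 0.2}
```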
Hidden Markov Model
• A Markov chain is useful when we need to compute a probability for a
sequence of observable events.
• In many cases, the events we are interested in are hidden, i.e., we
don’t observe them directly.
• For example, we don’t normally observe part-of-speech tags in a text.
Rather, we see words and must infer the tags from the word
sequence. We call the tags hidden because they are not observed.
• A hidden Markov model (HMM) allows us to talk about both observed
events (like words that we see in the input) and hidden events (like
part-of-speech tags).
Hidden Markov Model (HMM)
• Markov Model: Future depends only on the present, not on the past.
• Hidden Markov Model:
• States are hidden (not directly visible).
• We only see the observations.
• Goal: Predict hidden states using visible observations.
• Hidden Markov Model (HMM) =
Hidden states (not visible)
Observations (visible outcomes)
Probabilities (transition + emission)
Simple Analogy
•Example: Student’s Mood
Hidden State: Happy / Sad (not directly visible)
Observation: Smile, Cry, Study more, Study less
We guess the mood based on what we can observe.
Components of HMM
1. States (Q): Hidden variables (e.g., Noun, Verb / Happy, Sad)
2. Observations (O): What we see (e.g., Words / Smile, Cry)
3. Transition Probability (A): P(qⱼ | qᵢ), e.g. P(next tag | current tag)
4. Emission Probability (B): P(observation | state), e.g. P(word | tag)
5. Initial Probability (π): Probability of starting in a state
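The five components above, written out for the student's-mood analogy as plain Python data structures; all the numbers are illustrative assumptions.

```python
# HMM components for the mood analogy.
states = ["Happy", "Sad"]                   # Q: hidden states
observations = ["Smile", "Cry"]             # O: visible outcomes

pi = {"Happy": 0.6, "Sad": 0.4}             # initial probabilities

A = {                                       # transition probabilities P(next state | current state)
    "Happy": {"Happy": 0.7, "Sad": 0.3},
    "Sad":   {"Happy": 0.4, "Sad": 0.6},
}

B = {                                       # emission probabilities P(observation | state)
    "Happy": {"Smile": 0.9, "Cry": 0.1},
    "Sad":   {"Smile": 0.2, "Cry": 0.8},
}

# Probability of starting Happy and observing a smile on day 1:
print(pi["Happy"] * B["Happy"]["Smile"])    # 0.54
```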
Example Analogy – Student’s Mood
•Hidden state = Student’s Mood (Happy, Sad)
•Observation = Actions (Smile, Cry, Study more, Study less)
•Transition = Probability of mood changing (Happy→Sad, Sad→Happy)
•Emission = Probability of action given mood (Happy→Smile, Sad→Cry)
Example in NLP (POS Tagging)
•Sentence: “John can see Will”
•Hidden States (POS Tags): Noun, Modal, Verb
•Observations: John, can, see, Will
•HMM helps to find the most likely sequence of POS tags for the sentence.
Hidden Markov Model
• A Hidden Markov Model (HMM) is a probabilistic graphical model
used for modeling systems that exhibit sequential or temporal
behavior, where understanding the underlying states and transitions is
essential.
•Hidden part = Temperature (because we don’t
directly observe the actual temperature).
•Observed part = Weather condition (Sun / Rain /
Snow), which we can see.
•Transitions = The tendency of temperature to
change (cold to moderate, moderate to hot, etc.).
•Emissions = The kind of weather we are likely
to see given a temperature state.
POS tagging with Hidden Markov Model
• Let us consider an example proposed by Serrano and find out
how an HMM selects an appropriate tag sequence for a sentence.
[Link]
What are we trying to do?
• We want to assign the correct POS tags (Noun, Modal, Verb) to words
in sentences like “Ted will spot Will”.
• The model uses:
• Transition probability → likelihood that one tag follows another.
• Emission probability → likelihood that a word belongs to a tag.
Emission Probabilities (Word → Tag)
• Training sentences:
• Mary Jane can see Will
• Spot will see Mary
• Will Jane spot Mary?
• Mary will pat Spot
• We count how many times each word appears as Noun (N), Modal (M),
Verb (V).
• Example:
• "Mary" occurs 4 times as a Noun → Emission P(Mary|Noun) = 4/9
• "Will" occurs 1 time as Noun, 3 times as Model →
• P(Will|Noun) = 1/9
• P(Will|Modal) = 3/4
• The resulting emission-probability table gives the likelihood of each word belonging to each tag;
a counting sketch that reproduces these values appears below.
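A sketch that derives these emission probabilities by counting over the four tagged training sentences (tags: N = Noun, M = Modal, V = Verb).

```python
# Counting emission probabilities P(word | tag) from the training sentences.
from collections import Counter, defaultdict

tagged = [
    [("Mary", "N"), ("Jane", "N"), ("can", "M"), ("see", "V"), ("Will", "N")],
    [("Spot", "N"), ("will", "M"), ("see", "V"), ("Mary", "N")],
    [("Will", "M"), ("Jane", "N"), ("spot", "V"), ("Mary", "N")],
    [("Mary", "N"), ("will", "M"), ("pat", "V"), ("Spot", "N")],
]

tag_totals = Counter()
emissions = defaultdict(Counter)
for sent in tagged:
    for word, tag in sent:
        tag_totals[tag] += 1
        emissions[tag][word.lower()] += 1

print(emissions["N"]["mary"] / tag_totals["N"])   # P(Mary|Noun)  = 4/9
print(emissions["N"]["will"] / tag_totals["N"])   # P(Will|Noun)  = 1/9
print(emissions["M"]["will"] / tag_totals["M"])   # P(Will|Modal) = 3/4
```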
Transition Probabilities (Tag → Next Tag)
We add <S> (start) and <E> (end) to sentences.
Then count tag-to-tag transitions.
Example:
•<S> followed by Noun 3 times → P(N|<S>) = 3/4
•Modal followed by Verb 3 times → P(V|M) = 3/4
Each transition count is divided by the total number of occurrences of the preceding tag to obtain
the probability (a counting sketch follows).
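A matching sketch for the transition probabilities: the tag sequences of the same four sentences are wrapped in <S>/<E> markers and adjacent tag pairs are counted.

```python
# Counting transition probabilities P(next tag | previous tag) with <S>/<E> markers.
from collections import Counter, defaultdict

tag_seqs = [
    ["N", "N", "M", "V", "N"],   # Mary Jane can see Will
    ["N", "M", "V", "N"],        # Spot will see Mary
    ["M", "N", "V", "N"],        # Will Jane spot Mary ?
    ["N", "M", "V", "N"],        # Mary will pat Spot
]

counts = defaultdict(Counter)
for seq in tag_seqs:
    seq = ["<S>"] + seq + ["<E>"]
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1

print(counts["<S>"]["N"] / sum(counts["<S>"].values()))   # P(N|<S>) = 3/4
print(counts["M"]["V"] / sum(counts["M"].values()))       # P(V|M)  = 3/4
```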
Evaluating a Tagged Sequence
• Test Sentence = Take a new sentence:
“Will can spot Mary”
• Suppose (wrong tagging):
• Will → Modal (M)
• Can → Verb (V)
• Spot → Noun (N)
• Mary → Noun (N)
• Step 1: Transition probabilities <S> → M → V → N → N → <E>
• Multiply row probabilities from the transition table.
• Step 2: Emission probabilities
• P(Will|M) = ¾
• P(Can|V) = 0 (because "Can" never appeared as a verb in training → zero)
• So whole sequence probability = 0.
Correct Tagging
• Correct sequence is:<S> → N (Will) → M (can) → V (spot) → N (Mary)
→ <E>
• Transition product = P(N|<S>) * P(M|N) * P(V|M) * P(N|V) * P(<E>|N) = 3/4 * 3/9 * 3/4 * 1 * 4/9
• Emission product = P(Will|N) * P(can|M) * P(spot|V) * P(Mary|N) = 1/9 * 1/4 * 1/4 * 4/9
• Final probability (after multiplying all) = 0.00025720164 (non-zero).
The next step is to delete all vertices and edges with probability zero; vertices that do not lead
to the end point are also removed.
Now there are only two paths that lead to the end, let us calculate the
probability associated with each path.
• <S>→N→M→N→N→<E> =3/4*1/9*3/9*1/4*1/4*2/9*1/9*4/9*4/9=0.00000846754
• <S>→N→M→N→V→<E>=3/4*1/9*3/9*1/4*3/4*1/4*1*4/9*4/9=0.00025720164
• Clearly, the probability of the second sequence is much higher and
hence the HMM is going to tag each word in the sentence according
to this sequence.
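A small check of these two numbers: multiplying the transition and emission probabilities along each surviving path with exact fractions reproduces the values above.

```python
# Scoring the two surviving tag paths for "Will can spot Mary".
from fractions import Fraction as F

# Path 1: <S> -> N -> M -> N -> N -> <E>
p1 = F(3, 4) * F(1, 9) * F(3, 9) * F(1, 4) * F(1, 4) * F(2, 9) * F(1, 9) * F(4, 9) * F(4, 9)

# Path 2 (correct tagging): <S> -> N -> M -> V -> N -> <E>
p2 = F(3, 4) * F(1, 9) * F(3, 9) * F(1, 4) * F(3, 4) * F(1, 4) * F(1, 1) * F(4, 9) * F(4, 9)

print(float(p1))   # ~0.00000846754
print(float(p2))   # ~0.00025720164  -> the HMM picks this tag sequence
```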
Named Entity Recognition
• Named Entity Recognition (NER) in NLP focuses on identifying and
categorizing important information known as entities in text.
• These entities can be names of people, places, organizations, dates,
etc.
• It helps transform unstructured text into structured information,
which supports tasks like text summarization, knowledge graph
creation, and question answering.
Working of Named Entity Recognition (NER)
• Analyzing the Text: It processes entire text to locate words or phrases that could represent
entities.
• Finding Sentence Boundaries: It identifies the start and end of sentences using punctuation
and capitalization, which helps preserve the meaning and context of entities.
• Tokenizing and Part-of-Speech Tagging: Text is broken into tokens (words) and each token is
tagged with its grammatical role which provides important clues for identifying entities.
• Entity Detection and Classification: Tokens or groups of tokens that match patterns of known
entities are recognized and classified into predefined categories like Person, Organization,
Location etc.
• Model Training and Refinement: Machine learning models are trained using labeled datasets and
they improve over time by learning patterns and relationships between words.
• Adapting to New Contexts: A well-trained model can generalize to different languages, styles and
unseen types of entities by learning from context.
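A minimal NER sketch using spaCy's pre-trained pipeline; the model name and the example sentence are assumptions.

```python
# Named Entity Recognition sketch with spaCy
# (assumes: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii and worked in Washington for Google.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Barack Obama PERSON / Hawaii GPE / Washington GPE / Google ORG
```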
Semantic analysis
• Semantic Analysis → Meaning extraction from text
• Vector Space Model → Representing words & documents
mathematically
• Applications → Search engines, Chatbots, Machine Translation,
Question Answering
Semantic Relations among Lexemes
• Homonymy – Same form, different meaning (e.g., bank = river bank /
money bank)
• Polysemy – One word with related senses (e.g., mouth = of a river / of
a person)
• Synonymy – Same meaning, different words (big/large)
• Hyponymy – Hierarchical relation (rose is a type of flower)
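These relations can be explored with NLTK's WordNet interface; a sketch follows, assuming nltk.download("wordnet") has been run (the exact synsets returned depend on the WordNet version).

```python
# Exploring lexical relations with WordNet via NLTK.
from nltk.corpus import wordnet as wn

# Synonymy: lemmas that share a synset with 'big'
print(wn.synsets("big")[0].lemma_names())

# Hyponymy/hypernymy: the category one step above the first noun sense of 'rose'
rose = wn.synsets("rose", pos=wn.NOUN)[0]
print(rose.hypernyms())

# Homonymy/polysemy: several distinct senses of 'bank'
for sense in wn.synsets("bank")[:3]:
    print(sense.name(), "-", sense.definition())
```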
Word Sense Disambiguation (WSD)
• Definition: Identifying the correct sense of a word in context
• Example: “He deposited money in the bank” vs “The boat is near the
bank”
• Approaches to WSD:
• Knowledge-based (WordNet, dictionaries)
• Supervised ML (training data with senses)
• Unsupervised (clustering by context words)
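A knowledge-based WSD sketch using the Lesk algorithm from NLTK on the two "bank" sentences above; Lesk is a simple baseline, so the senses it picks may be imperfect.

```python
# Knowledge-based word sense disambiguation with the Lesk algorithm
# (assumes nltk.download("wordnet") and nltk.download("punkt")).
from nltk import word_tokenize
from nltk.wsd import lesk

s1 = word_tokenize("He deposited money in the bank")
s2 = word_tokenize("The boat is near the bank of the river")

sense1 = lesk(s1, "bank")
sense2 = lesk(s2, "bank")
print(sense1, "-", sense1.definition())
print(sense2, "-", sense2.definition())
```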
Vector Space Models (VSM)
• Vector space models represent data as vectors and consider the
relationships between those vectors.
• They are popular in information retrieval systems but are also useful for other
purposes. Generally, this lets us compare the similarity of two
vectors from a geometric perspective (see the sketch below).
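A tiny vector space model sketch in plain Python: two documents as term-count vectors over an assumed vocabulary, compared with cosine similarity.

```python
# Vector space model sketch: documents as term-count vectors, cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Vocabulary (assumed): [cat, dog, runs, sleeps]
doc1 = [2, 0, 1, 1]   # "cat cat runs sleeps"
doc2 = [1, 1, 1, 0]   # "cat dog runs"

print(round(cosine(doc1, doc2), 3))   # geometric similarity of the two documents
```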