NLP Notes
MOD 1
Q. What is NLP? Applications of NLP?
Natural Language Processing (NLP) is a branch of Artificial Intelligence that
enables computers to understand, interpret, and generate human language in a
meaningful way, both in text and speech form.
The main purpose of NLP is to read and understand the human language and
deliver the output accordingly.
Applications :
1. Machine Translation – NLP enables automatic conversion of text or speech
from one language to another while maintaining the meaning and grammar. This
is useful in communication between people who speak different languages.
Example: Google Translate converting English sentences into Hindi, Marathi,
or any other language.
2. Sentiment Analysis – NLP techniques help identify the sentiment or opinion
expressed in text, such as positive, negative, or neutral. This is widely used in
businesses to understand customer satisfaction from reviews, social media
posts, or surveys.
Example: Analyzing Twitter posts to know public opinion about a new movie.
3. Chatbots & Virtual Assistants – NLP powers AI-based chat systems to
understand user queries and respond intelligently in human language. These
assistants can answer questions, perform tasks, and provide customer support
without human intervention.
Example: Amazon Alexa, Apple Siri, and WhatsApp business chatbots.
4. Text Summarization – It automatically produces a shorter and more concise
version of long documents while keeping the important information. This helps
in quickly reading large content.
Example: Summarizing a 10-page research paper into key bullet points for
quick review.
5. Speech Recognition – NLP with speech processing allows machines to
convert spoken words into written text. This is used in voice-controlled devices,
transcription services, and accessibility tools for differently-abled users.
Example: Voice typing in Google Docs or dictating messages on smartphones.
6. Information Retrieval – NLP helps search engines and databases to
understand the meaning of keywords and find the most relevant results. This
improves search accuracy and saves time for the user.
Example: Google search returning relevant articles when typing “best tourist
places in India.”
7. Spam Detection – NLP filters out unwanted or harmful messages by
analyzing text patterns, keywords, and structure. It protects users from phishing
attempts, scams, and irrelevant messages.
Example: Gmail automatically moving fraudulent emails to the spam folder.
Q. What is Ambiguity? Types?
Ambiguity occurs when a sentence or word has more than one possible
meaning, making it unclear which interpretation is correct without additional
context.
Types of Ambiguity in NLP :
1. Lexical Ambiguity
This occurs when a single word has more than one possible meaning. The
correct meaning can only be understood from context.
Example: “He went to the bank.” – Here, bank could mean a financial
institution or the side of a river.
NLP systems use techniques like Word Sense Disambiguation (WSD)
to resolve such cases (a small WSD sketch is given after this list).
2. Syntactic Ambiguity
This happens when the structure or grammar of a sentence allows more
than one interpretation. It is also called structural ambiguity.
Example: “I saw the man with the telescope.” – It’s unclear whether I had
the telescope or the man did.
Parsers and grammar rules are used in NLP to handle this type.
3. Semantic Ambiguity
This occurs when the meaning of a sentence is unclear, even though its
grammar is correct.
Example: “Visiting relatives can be boring.” – This could mean relatives
who visit you are boring, or that visiting them is boring.
Context understanding and meaning representation are required to solve
it.
4. Anaphoric Ambiguity
This happens when a pronoun or a reference word can refer to more than
one noun in the sentence.
Example: “John told Peter that he passed.” – It’s not clear whether John
or Peter passed.
NLP uses coreference resolution to find the correct reference.
5. Pragmatic Ambiguity
This occurs when the intended meaning depends on the situation,
background knowledge, or speaker’s intention.
Example: “Can you open the door?” – Literally asks about ability, but
usually it’s a polite request.
Understanding this requires context and real-world knowledge.
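The lexical ambiguity example above ("bank") can be tried in code with NLTK's built-in Lesk algorithm for Word Sense Disambiguation. A minimal sketch, assuming NLTK is installed and the WordNet data has been downloaded:

# Word Sense Disambiguation for the ambiguous word "bank" using NLTK's
# Lesk algorithm. Assumes nltk.download("wordnet") has been run beforehand.
from nltk.wsd import lesk

context = "He went to the bank to deposit his money".split()
sense = lesk(context, "bank")          # returns the WordNet Synset with the most overlap
if sense is not None:
    print(sense.name(), "-", sense.definition())
# Lesk is a simple overlap-based method, so the sense it picks depends heavily
# on the words present in the context sentence.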
MOD 2
Q. N-gram Numerical & Theory
Definition
An N-gram is a sequence of N words that appear together in a sentence or text.
For example, a 2-gram (bigram) looks at pairs of words, and a 3-gram (trigram)
looks at three words in a row. It is used to understand patterns in language.
Working
The model counts how often word sequences occur in a given text or corpus.
Using these counts, it calculates the probability of the next word given the
previous words (like P(next word | previous word(s))). This helps in predicting
or generating the next likely word. Used in Text prediction (like mobile
keyboard suggestions) , Speech recognition , Machine translation , Spelling
correction.
Example
Corpus: "The dog runs fast"
• Unigrams (N=1): [The], [dog], [runs], [fast]
• Bigrams (N=2): [The dog], [dog runs], [runs fast]
• Trigrams (N=3): [The dog runs], [dog runs fast]
If we know "The dog", the model predicts the next word is "runs" with the
highest probability.
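A minimal sketch of this counting idea in Python (toy corpus only, no smoothing), showing how P(runs | dog) is estimated from bigram counts:

# Toy bigram model: count unigrams and bigrams, then estimate P(word | previous).
from collections import Counter

corpus = "the dog runs fast".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("dog", "runs"))   # 1.0 in this tiny corpus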
Numerical :
Porter Stemmer
• Definition (3–4 lines):
The Porter Stemmer is a rule-based algorithm in NLP used to reduce
words to their root form by stripping common suffixes. It uses linguistic
rules, based on consonant (C) and vowel (V) sequences, to decide how
much of a word can be safely removed.
• Working (3–4 lines):
The algorithm defines words as sequences of consonants (C) and vowels
(V). For example, tree → (C V), trouble → (C V C V). It then applies
multiple steps of rules like removing “ing”, “ed”, “ly”, etc., but only
when a certain C–V pattern condition is satisfied. This ensures
meaningful stems are produced.
• Concept of Consonants & Vowels (3–4 lines):
o Vowels: a, e, i, o, u (and sometimes 'y' depending on position).
o Consonants: All other letters.
o The algorithm checks word structure using patterns of vowels and
consonants (e.g., m = number of VC sequences), and rules are
applied only if m is large enough (to avoid over-stemming).
• Example:
o Caresses → caress (rule: sses → ss)
o Ponies → poni (rule: ies → i)
o Troubling → troubl (rule: remove ing when the stem contains a vowel)
Algorithm of Porter Stemmer
Step 1: Remove common plural and past participle forms
• If the word ends with sses → replace with ss
• If the word ends with ies → replace with i
• If the word ends with ss → keep as ss
• If the word ends with s → remove s
Example:
• caresses → caress
• ponies → poni
• cats → cat
Step 2: Remove suffixes like -ed, -ing
• If the word ends with ed or ing and the stem contains a vowel → remove
ed/ing
• If after removal, the word ends with at → add e (e.g., conflat(ed) → conflate)
• If it ends with a double consonant (like tt, nn, pp, but not ll, ss, or zz) → remove the last consonant
• If the word is short and ends in a CVC pattern → add e (e.g., fil(ing) → file)
Example:
• hopping → hop
• hoped → hope
• tanned → tan
Step 3: Replace suffixes (-ational, -izer, -fulness, etc.)
• ization → ize
• ational → ate
• fulness → ful
• ousness → ous
Example:
• rationalization → rationalize
• hopefulness → hopeful
Step 4: Remove suffixes like -ic, -able, -ant
• icate → ic
• ative → (remove)
• alize → al
Example:
• communicate → communic
• formalize → formal
Step 5: Remove final suffixes
• If word ends in e → remove it if the word is long enough
• If double consonant at end → remove one consonant
Example:
• probate → probat
• rate → rate (unchanged, because short word)
Final Result: The word is reduced to its stem/root.
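These steps can be tried out with NLTK's standard PorterStemmer class; a short sketch (the full Porter implementation may differ slightly from the simplified steps above on some words):

# Running the Porter Stemmer on the example words from the steps above.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["caresses", "ponies", "cats", "hopping", "hoped", "tanned",
         "rationalization", "hopefulness", "relational", "probate"]
for w in words:
    print(w, "->", stemmer.stem(w))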
Q. Difference between Stemming and Lemmatization.
Aspect | Stemming | Lemmatization
1. Definition | Process of chopping off prefixes/suffixes to reduce a word to its stem. | Process of reducing a word to its base/dictionary form (lemma).
2. Approach | Rule-based, works by cutting affixes. | Dictionary + morphological analysis.
3. Output Meaning | Stemmed word may not have a real meaning (e.g., comput). | Always returns a meaningful word (e.g., compute).
4. Grammar Awareness | Ignores Part-of-Speech (POS) and grammar. | Considers POS and grammar for accurate results.
5. Accuracy | Less accurate, more crude. | More accurate, linguistically correct.
6. Speed | Faster because it is simple. | Slower due to dictionary lookups and analysis.
7. Example | Studies → Studi | Studies → Study
8. Use Cases | Useful when speed > accuracy (e.g., search engines). | Useful where correctness matters (e.g., chatbots, NLP tasks).
9. Resource Requirement | Requires minimal resources. | Requires linguistic resources like WordNet.
10. Algorithm | Porter Stemmer, Snowball Stemmer, Lancaster Stemmer. | WordNet Lemmatizer, spaCy Lemmatizer.
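A short comparison sketch with NLTK, reproducing the "Studies" row of the table (assumes the WordNet data has been downloaded with nltk.download("wordnet")):

# Stemming vs lemmatization on the same word using NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

word = "studies"
print("Stemming:     ", PorterStemmer().stem(word))            # studi
print("Lemmatization:", WordNetLemmatizer().lemmatize(word))   # study (noun is the default POS)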
Q. What is Edit Distance Algorithm?
It is a way to measure how different two words (or strings) are by counting the
minimum number of edits needed to transform one word into the other.
The allowed edits are:
1. Insertion → Add a character.
o Example: cat → cart (insert "r").
2. Deletion → Remove a character.
o Example: cart → cat (delete "r").
3. Substitution → Replace one character with another.
o Example: cat → cut (substitute "a" → "u").
Example : "kitten" → "sitting"
• kitten → sitten (substitute "k" → "s")
• sitten → sittin (substitute "e" → "i")
• sittin → sitting (insert "g")
Minimum edits = 3
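A minimal dynamic-programming sketch of this computation (standard Levenshtein distance with insertion, deletion, and substitution):

# Levenshtein edit distance via dynamic programming.
# dp[i][j] = minimum edits to turn the first i chars of a into the first j chars of b.
def edit_distance(a: str, b: str) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                                  # delete all i characters
    for j in range(len(b) + 1):
        dp[0][j] = j                                  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))   # 3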
Working with Autocorrect:
In autocorrect systems (like in your phone or search engines), when you
misspell a word, the system calculates the edit distance between your input and
valid dictionary words.
• The word(s) with the lowest edit distance are suggested as corrections.
• Example: You type “recieve” → Autocorrect checks the dictionary.
o recieve → receive (edit distance = 2: the swapped "i" and "e" count as two substitutions; if a transposition is allowed as a single edit, as in Damerau–Levenshtein, the distance is 1)
o Since the distance is small, it suggests “Did you mean: receive?”
Applications :
1. Spell Checking → Fixing typos by finding closest words.
2. Search Engines → "Did you mean …?" suggestions.
3. Text Prediction → Suggesting likely next words.
4. Natural Language Processing (NLP) → Matching noisy input with
correct forms.
Q. What are Collocations? Significance?
Definition:
Collocations are pairs or groups of words that frequently appear together in
natural language, more often than would be expected by chance. They sound
“natural” to native speakers.
Examples
• "fast food" (common) vs "quick food" (rarely used)
• "make a decision" vs "do a decision"
• "strong tea" vs "powerful tea"
Types of Collocations
1. Adjective + Noun → strong tea, heavy rain
2. Verb + Noun → make a decision, commit a crime
3. Verb + Adverb → whisper softly, argue strongly
4. Noun + Noun → data mining, credit card
5. Adverb + Adjective → deeply concerned, highly recommended
Significance of Collocations in NLP
1. Improves Naturalness in Language Generation
o Collocations make machine-generated text sound more natural and
fluent.
o Example: "Strong tea" sounds natural, but "powerful tea" does not.
2. Enhances Machine Translation
o Correct collocations help in choosing context-appropriate
translations.
o Example: "Heavy rain" in English should not be translated word-
for-word as "strong rain."
3. Better Information Retrieval (IR)
o Search engines can use collocations to improve relevance.
o Example: Searching "machine learning algorithm" gives better
results than searching each word separately.
4. Context Understanding in NLP Models
o Collocations provide semantic clues about meaning.
o Example: "Fast food" has a different meaning than just "fast" +
"food."
5. Speech Recognition & Text Prediction
o Predictive text keyboards rely on collocations to suggest the next
word.
o Example: After typing "Happy," it suggests "birthday" instead of
"elephant."
MOD 3
Q. Discuss HMM (Hidden Markov Model ) with an example.
Hidden Markov Model (HMM)
1. Definition
A Hidden Markov Model (HMM) is a statistical model used to represent
systems that have hidden (unobservable) states but produce observable outputs.
It assumes that the system follows a Markov process, where the next state
depends only on the current state, not on past history.
The "hidden" part means that we cannot directly see the states, but we can
estimate them from the observed data.
Components of Hidden Markov Model (HMM)
An HMM consists of the following main components:
1. States (Hidden States)
• These represent the underlying system conditions that are not directly
observable.
• Example: In speech recognition, the hidden states may represent
phonemes (sounds); in weather prediction, they may represent sunny,
rainy, cloudy.
• At any point in time, the system is in one of these states.
2. Observations
• These are the visible outputs that we can measure or record.
• Each observation is generated from a corresponding hidden state.
• Example: In speech recognition, the sound waves or acoustic signals we
capture are the observations.
• Observations provide indirect evidence about which hidden state the
system is in.
3. Transition Probabilities (A matrix)
• These are the probabilities of moving from one hidden state to another.
• Represented as a matrix A, where each entry a_ij = probability of moving
from state i to state j.
• Example: If yesterday was "Rainy," the probability of today being
"Sunny" might be 30%, and "Rainy" again might be 70%.
• This captures the temporal dependency (sequence nature) of the states.
4. Emission Probabilities (B matrix)
• These represent the probability of an observation being generated from a
particular hidden state.
• Each state produces observations with certain likelihoods.
• Example: If the hidden state is "Rainy," the observation could be "people
carrying umbrellas" with probability 0.8.
• This models the relationship between hidden states and observed
outputs.
5. Initial State Distribution (π vector)
• This defines the probabilities of the system starting in each possible
hidden state.
• Example: At the beginning of the week, the probability of it starting as
"Sunny" may be 0.6, "Rainy" 0.3, and "Cloudy" 0.1.
• Important for initializing the model before observations begin.
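A compact sketch of these five components as Python dictionaries, using a reduced two-state version of the weather example (all probability values are illustrative assumptions; the Cloudy state is dropped so that each row sums to 1):

# Weather HMM written out as plain Python dicts; probabilities are illustrative.
states = ["Sunny", "Rainy"]

pi = {"Sunny": 0.6, "Rainy": 0.4}                        # initial state distribution
A  = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},             # transition probabilities
      "Rainy": {"Sunny": 0.3, "Rainy": 0.7}}
B  = {"Sunny": {"walk": 0.9, "umbrella": 0.1},           # emission probabilities
      "Rainy": {"walk": 0.2, "umbrella": 0.8}}

def joint_probability(hidden_path, obs_seq):
    # P(states, observations) = pi * product of transitions * product of emissions
    p = pi[hidden_path[0]] * B[hidden_path[0]][obs_seq[0]]
    for prev, curr, obs in zip(hidden_path, hidden_path[1:], obs_seq[1:]):
        p *= A[prev][curr] * B[curr][obs]
    return p

print(joint_probability(["Rainy", "Rainy"], ["umbrella", "umbrella"]))   # 0.4*0.8*0.7*0.8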
Q. Explain Context Free Grammar(CFG) in detail.
Definition :
A Context Free Grammar (CFG) is a formal grammar used in computational
linguistics and computer science to describe the syntax of programming
languages and natural languages. It generates strings (sentences) from a
language by applying production rules. In CFG, every production rule replaces
a single non-terminal symbol with a sequence of non-terminals and/or
terminals.
Components of CFG
A CFG is formally represented as a 4-tuple G = (V, Σ, R, S) where:
1. V (Variables / Non-Terminals):
o These are symbols that can be replaced using production rules.
o They act as placeholders for patterns in the language.
o Example: S, NP, VP where S = Sentence, NP = Noun Phrase, VP =
Verb Phrase.
2. Σ (Sigma - Terminals):
o These are the actual alphabet of the language (words or tokens) that
appear in the final sentences.
o They cannot be replaced further.
o Example: dog, eats, food.
3. R (Production Rules):
o A set of transformation rules of the form A → α,
where A is a non-terminal and α is a string consisting of terminals
and/or non-terminals.
o These rules define how sentences can be constructed step by step.
o Example: S → NP VP.
4. S (Start Symbol):
o A special non-terminal from where the derivation begins.
o It represents the entire sentence or structure.
o Example: S.
Working / Example
Let’s define a simple CFG for a basic English-like grammar:
• Variables (V): {S, NP, VP}
• Terminals (Σ): {dog, cat, runs, eats}
• Rules (R):
1. S → NP VP
2. NP → dog | cat
3. VP → runs | eats
• Start symbol (S): S
Derivation Example:
• Start: S
• Apply Rule 1: S → NP VP
• Replace NP: dog VP
• Replace VP: dog runs
Final sentence: dog runs
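The same toy grammar can be written and parsed with NLTK; a short sketch using its standard CFG and ChartParser classes:

# The toy CFG above in NLTK's grammar format, used to parse "dog runs".
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> 'dog' | 'cat'
VP -> 'runs' | 'eats'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("dog runs".split()):
    print(tree)          # (S (NP dog) (VP runs))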