NLP Module 1
MODULE-1
Introduction & Language Modelling
• Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian
Languages, NLP Applications.
• Language Modelling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.
1. INTRODUCTION
There are two major approaches to NLP:
1. Rationalist approach
2. Empiricist approach
Rationalist Approach (Rule-Based):
The system understands language by following rules and grammar written by humans.
Example: A program that checks sentence structure using fixed grammar rules.
Empiricist approach: The system learns language patterns from large amounts of text data instead
of fixed rules.
Example: A translation tool trained on thousands of sentence examples.
In the rationalist view, a child is born with an innate set of language principles; learning of detailed structures takes place through the application of these principles to the sensory inputs available to the child.
Computational linguistics is similar to theoretical linguistics and psycholinguistics, but uses different tools. While theoretical linguistics is more about the structural rules of language, psycholinguistics focuses on how language is used and processed in the mind.
Theoretical linguistics explores the abstract rules and structures that govern language. It investigates universal grammar, syntax, semantics, phonology (analyzing how sounds are organized and used), and morphology (the internal structure of words). Linguists create models to explain how languages are structured and how meaning is encoded. E.g., most languages have constructs like noun and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages (grammar).
Psycholinguistics focuses on the psychological and cognitive processes involved in language use. It examines how individuals acquire, process, and produce language. Researchers study language development in children and how the brain processes language in real time. E.g., studying how children acquire language, such as learning to form questions ("What's that?").
NLP has become one of the leading techniques for processing and retrieving information.
Information retrieval includes a number of information processing applications such as information
extraction, text summarization, question answering, and so forth. It includes multiple modes of
information, including speech, images, and text.
• A word can have a number of possible meanings associated with it. But in a given context, only
one of these meanings participates.
• Finding out the correct meaning of a particular use of word is necessary to find meaning of
larger units.
• E.g.: Kabir and Ayan are married.
Kabir and Suha are married.
The first sentence is most naturally read as Kabir and Ayan each being married (to other people), while the second suggests that they are married to each other.
• Syntactic structure and compositional semantics fail to explain these interpretations.
• This means that semantic analysis requires pragmatic knowledge besides semantic and
syntactic knowledge.
• Pragmatics helps us understand how meaning is influenced by context, social factors, and
speaker intentions.
Anaphoric Reference
• Pragmatic knowledge may be needed for resolving anaphoric references.
Example:
(a) The district administration refused to give the trade union permission for the meeting because they feared violence.
(b) The district administration refused to give the trade union permission for the meeting because they oppose the government.
• In the above sentences, resolving the anaphoric reference 'they' requires pragmatic knowledge: in (a), 'they' refers to the district administration, while in (b), it refers to the trade union.
Pragmatic analysis in NLP (Natural Language Processing) is about understanding the intended meaning of a sentence in context, not just the literal words.
Example: "Can you open the window?"
Literal meaning (syntax/semantics): asking whether the person has the ability to open the window.
Pragmatic meaning: it is actually a request for the person to open the window.
So, pragmatic analysis helps computers (and humans!) figure out what people really mean in real-life
situations.
Language and Grammar
• Challenges in Language Specification: Natural languages constantly evolve, and the
numerous exceptions make language specification challenging for computers.
• Different Grammar Frameworks: Various grammar frameworks have been developed,
including transformational grammar, lexical functional grammar, and dependency grammar,
each focusing on different aspects of language such as derivation or relationships.
• Chomsky’s Contribution: Noam Chomsky’s generative grammar framework, which uses rules
to specify grammatically correct sentences, has been fundamental in the development of formal
grammar hierarchies.
Chomsky argued that phrase structure grammars are insufficient for natural language and proposed transformational grammar in Syntactic Structures (1957). He suggested that each sentence has two levels: a deep structure and a surface structure (as shown in Fig 1), with transformations mapping one to the other.
For example, an active sentence and its passive counterpart have the same meaning, despite having different surface structures (the roles of subject and object are inverted).
Transformational grammar has three components:
1. Phrase structure grammar: Defines the basic syntactic structure of sentences.
2. Transformational rules: Describe how deep structures can be transformed into different surface
structures.
3. Morphophonemic rules: Govern how the structure of a sentence (its syntax) influences the form of its words in terms of sound and pronunciation (phonology).
Phrase structure grammar consists of rules that generate natural language sentences and assign a structural description to them. As an example, consider the following set of rules:
S → NP VP
NP → Det N
VP → Aux V NP
Det → the
N → police | snatcher
V → catch
Aux → will
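A minimal sketch of how such a toy grammar can be encoded and tested, using NLTK's CFG utilities (the rule set is the illustrative one above, not a complete grammar of English):

import nltk

# The illustrative toy grammar above, encoded for NLTK's chart parser.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> Aux V NP
Det -> 'the'
N -> 'police' | 'snatcher'
V -> 'catch'
Aux -> 'will'
""")

parser = nltk.ChartParser(grammar)
# Prints the phrase structure tree assigned to the example sentence.
for tree in parser.parse("the police will catch the snatcher".split()):
    print(tree)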
Transformational rules transform one phrase-marker (underlying) into another phrase-marker (derived). These rules are applied to the terminal string generated by the phrase structure rules, transforming one surface representation into another, e.g., an active sentence into a passive one.
Consider the active sentence: "The police will catch the snatcher."
The application of phrase structure rules will assign the structure shown in Fig 2 (a); the passive transformation then derives "The snatcher will be caught by the police," as shown in Fig 2 (b).
Fig. 2: (a) Phrase structure (b) Passive Transformation
Note: Long-distance dependency refers to syntactic phenomena where a verb and its subject or object can be arbitrarily far apart. Wh-movement is a specific case of this type of dependency.
E.g.
"I wonder who John gave the book to" involves a long-distance dependency between the verb "wonder"
and the object "who". Even though "who" is not directly adjacent to the verb, the syntactic relationship
between them is still clear.
The problem in the specification of appropriate phrase structure rules occurs because these phenomena
cannot be localized at the surface structure level.
Paninian grammar provides a framework for modelling Indian languages, which can be used for the computational processing of Indian languages. The grammar focuses on the extraction of relations from a sentence.
2. LANGUAGE MODELLING
• To capture regularities of a language, the training corpus needs to be sufficiently large.
• Language models are fundamental to many NLP applications, including speech recognition, spelling correction, handwriting recognition, and machine translation, as well as information retrieval, text summarization, and question answering.
• The most popular are n-gram models.
Note: Inflectional – refers to the changes a word undergoes to express different grammatical categories
such as tense, number, gender, case, mood, and aspect without altering the core meaning of the word.
Indian languages have traditionally used oral communication for knowledge propagation. In Hindi, the positions of subject and object can be changed, and compound verbs (a main verb plus an auxiliary) mark aspectual distinctions. For example:
उसने खाना खाया। (Usne khaanaa khaayaa.)
He (subj) food ate: "He ate food."
उसने खाना खा लिया। (Usne khaanaa khaa liyaa.)
He (subj) food eat taken: "He ate food (completed the action)."
वह चला। (Vah chalaa.)
"He moved."
वह चल दिया। (Vah chal diyaa.)
He move given: "He moved (started the action)."
Nouns are followed by post-positions instead of prepositions; these generally remain separate words in Hindi:
रेखा के पिता (Rekha ke pita)
Rekha of father: "Father of Rekha"
उसके पिता (Uske pita)
"Her (his) father"
All nouns are categorized as feminine or masculine, and the verb form must have gender agreement with the subject:
ताला खो गया। (Taalaa kho gayaa.)
lock lose (past, masc.): "The lock was lost."
चाभी खो गयी। (Chaabhii kho gayii.)
key lose (past, fem.): "The key was lost."
Layered Representation in PG
The GB theory represents three syntactic levels: deep structure, surface structure, and logical form (LF), where LF is nearer to semantics. This theory tries to resolve all language issues at syntactic levels only.
• Paninian grammar, in contrast, uses a layered representation with four levels: the surface level, the vibhakti level, the karaka level, and the semantic level.
• The surface and the semantic levels are obvious. The other two levels should not be confused with the levels of GB.
• Vibhakti literally means inflection, but here it refers to word (noun, verb, or other) groups based either on case endings, post-positions, compound verbs, or main and auxiliary verbs, etc.
• Karaka (pronounced Kaaraka) literally means Case, and in GB, we have already discussed case
theory, θ-theory, and sub-categorization, etc. Paninian Grammar has its own way of defining
Karaka relations.
Karaka Theory
• Karaka relations are assigned based on the roles played by various participants in the main
activity.
• Various Karakas, such as Karta (subject), Karma (object), Karana (instrument), Sampradana
(beneficiary), Apadan (separation), and Adhikaran (locus).
Example: maan aangan mein haath se bachchi ko rotii khilaati hai. (Mother feeds bread to the child by hand in the courtyard.)
• 'maan' (mother) is the Karta. The Karta generally takes the 'ne' or zero (unmarked) case marker.
• rotii (bread) is the Karma. ('Karma' is similar to object and is the locus of the result of the activity)
• haath (hand) is the Karana (the noun group through which the goal is achieved). It takes the marker "se" or "dwara" (by).
• 'Sampradan' is the beneficiary of the activity, e.g., bachchi (child). It takes the marker "ko" (to) or "ke liye" (for).
• 'Apaadaan' denotes separation; its marker "se" (from) is attached to the part that serves as the reference point (being stationary).
• aangan (courtyard) is the Adhikaran (is the locus (support in space or time) of Karta or Karma).
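As a rough illustration, the Karaka relations and their typical case markers could be represented as a simple mapping in Python (the marker lists here are simplified assumptions, not an exhaustive linguistic resource):

# Simplified mapping of Karaka relations to roles and typical Hindi markers.
karaka_markers = {
    "karta":     ("subject",     ["ne", ""]),         # "" = zero/unmarked
    "karma":     ("object",      ["ko", ""]),
    "karana":    ("instrument",  ["se", "dwara"]),
    "sampradan": ("beneficiary", ["ko", "ke liye"]),
    "apaadaan":  ("separation",  ["se"]),
    "adhikaran": ("locus",       ["mein", "par"]),
}

for karaka, (role, markers) in karaka_markers.items():
    print(f"{karaka:10} {role:12} markers: {', '.join(m or '(zero)' for m in markers)}")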
N-gram Model
The n-gram model is a statistical method that predicts the probability of a word appearing next in a sequence based on the previous "n" words.
Why n-gram?
The goal of a statistical language model is to estimate the probability (likelihood) of a sentence. This is achieved by decomposing sentence probability into a product of conditional probabilities using the chain rule as follows:
P(w1 w2 ... wn) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × ... × P(wn | w1 ... wn-1)
where the history hi = w1 ... wi-1 for each word wi.
So, in order to calculate sentence probability, we need to calculate the probability of a word, given the
sequence of words preceding it. This is not a simple task.
An n-gram model simplifies the task by approximating the probability of a word given all the previous
words by the conditional probability given previous n-1 words only.
P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)
Thus, an n-gram model calculates P(w | h) by modelling language as a Markov model of order n-1, i.e., by looking at the previous n-1 words only.
A model that limits the history to the previous one word only is termed a bi-gram (n = 2) model.
A model that conditions the probability of a word on the previous two words is called a tri-gram (n = 3) model.
Using the bi-gram and tri-gram estimates, the probability of a sentence can be calculated as:
P(s) ≈ ∏ P(wi | wi-1) (bi-gram)
P(s) ≈ ∏ P(wi | wi-2 wi-1) (tri-gram)
Example: The Arabian knights are fairy tales of the east.
Bi-gram approximation: P(east | the); tri-gram approximation: P(east | of the).
One pseudo-word <s> is introduced to mark the beginning of the sentence in bi-gram estimation.
The probabilities are estimated from a training corpus by relative frequency (maximum likelihood estimation):
• Count a particular n-gram in the training corpus and divide it by the sum of the counts of all n-grams that share the same prefix.
• The sum of the counts of all n-grams that share the first n-1 words is equal to the count of the common prefix wi-n+1, ..., wi-1. Hence:
P(wi | wi-n+1 ... wi-1) = C(wi-n+1 ... wi) / C(wi-n+1 ... wi-1)
Example tri-gram: P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
Bi-gram model: P(wi | wi-1) = C(wi-1 wi) / C(wi-1)
Test sentence: The Arabian knights are the fairy tales of the east.
P(the | <s>) × P(arabian | the) × P(knights | arabian) × P(are | knights) × P(the | are) × P(fairy | the) × P(tales | fairy) × P(of | tales) × P(the | of) × P(east | the)
= 0.67 × 0.5 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
= 0.0067
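A minimal Python sketch of this bi-gram computation. The two-sentence training corpus below is an assumption for illustration only; it is not the corpus that produced the probabilities above. Log probabilities are summed to avoid the underflow issue noted under Limitations:

import math
from collections import Counter

# Assumed toy training corpus; <s> marks the beginning of each sentence.
corpus = [
    "<s> the arabian knights are the fairy tales of the east".split(),
    "<s> the tales of the east are old".split(),
]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def bigram_prob(prev, word):
    # Maximum likelihood estimate: C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    # Summing log probabilities avoids numerical underflow on long sentences.
    logp = sum(math.log(bigram_prob(words[i], words[i + 1]))
               for i in range(len(words) - 1))
    return math.exp(logp)

print(sentence_prob("<s> the arabian knights are the fairy tales of the east".split()))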
Limitations:
• Multiplying the probabilities might cause a numerical underflow, particularly in long sentences.
To avoid this, calculations are made in log space, where a calculation corresponds to adding log
of individual probabilities and taking antilog of the sum.
• The n-gram model faces data sparsity, assigning zero probability to unseen n-grams in the
training data, leading to many zero entries in the bigram matrix. This results from the
assumption that a word's probability depends solely on the preceding word(s), which isn't
always true.
• Fails to capture long-distance dependencies in natural language sentences.
Solution:
• A number of smoothing techniques have been developed to handle the data sparseness problem.
• Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-
grams and assigning them non-zero values.
2.3.2 Add-one Smoothing
• It adds a value of one to each n-gram frequency before normalizing the frequencies into probabilities. For a bi-gram model, the conditional probability thus becomes:
P(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V)
where V is the vocabulary size.
• Yet it is not very effective, since it assigns the same probability to all missing n-grams, even though some of them could be more intuitively plausible than others.
Example:
Consider the following toy corpus of two sentences:
I love programming
I love coding
We want to calculate the probability of the bigram "I love" using Add-one smoothing.
Step 1: Collect counts.
• Unigrams: I: 2, love: 2, programming: 1, coding: 1
• Bigrams: (I love): 2, (love programming): 1, (love coding): 1
• Vocabulary size V: there are 4 unique words: "I", "love", "programming", "coding".
Step 2: Apply the Add-one formula.
P(love | I) = (C(I love) + 1) / (C(I) + V) = (2 + 1) / (2 + 4) = 0.5
Now suppose we want the probability of the bigram "I coding" (which does not appear in the training data):
P(coding | I) = (0 + 1) / (2 + 4) = 1/6 ≈ 0.167
Add-one smoothing thus assigns a small non-zero probability to the unseen bigram instead of zero.
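A minimal sketch of the same Add-one computation over the toy corpus above:

from collections import Counter

# Toy corpus from the worked example above.
corpus = [["I", "love", "programming"], ["I", "love", "coding"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size: 4 unique words

def addone_prob(prev, word):
    # Add-one (Laplace) smoothing: (C(prev word) + 1) / (C(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(addone_prob("I", "love"))    # (2 + 1) / (2 + 4) = 0.5
print(addone_prob("I", "coding"))  # (0 + 1) / (2 + 4) ≈ 0.167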
2.3.3 Good-Turing Smoothing
• Good-Turing smoothing improves probability estimates by adjusting for unseen n-grams based
on the frequency distribution of observed n-grams.
• It adjusts the frequency f of an n-gram using the count of n-grams having a frequency of occurrence f+1. It converts the frequency of an n-gram from f to a smoothed count f* using the following expression:
f* = (f + 1) × n(f+1) / n(f)
where n(f) is the number of n-grams that occur exactly f times in the training corpus.
As an example, consider that the number of n-grams that occur 4 times is 25,108 and the number of n-grams that occur 5 times is 20,542. Then, the smoothed count for 4 will be:
f* = (4 + 1) × 20,542 / 25,108 ≈ 4.09
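A one-line sketch of this adjustment in Python, using the counts from the example:

def good_turing_count(f, n_f, n_f_plus_1):
    # Smoothed count: f* = (f + 1) * n_{f+1} / n_f
    return (f + 1) * n_f_plus_1 / n_f

print(good_turing_count(4, 25108, 20542))  # ≈ 4.09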