
Natural Language Processing

MODULE-1
Introduction & Language Modelling
• Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian
Languages, NLP Applications.
• Language Modelling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.

Textbook 1: Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and
Information Retrieval”, Oxford University Press. Ch. 1, Ch. 2.

1. INTRODUCTION

1.1 What is Natural Language Processing (NLP)


Language is the primary means of human communication and the tool with which we express
the greater part of our ideas and emotions. It shapes thought, has a structure, and carries
meaning. When we express a thought, language represents its content in real time.

NLP is concerned with the development of computational models of aspects of human
language processing. There are two main reasons for this:

1. To develop automated tools for language processing

2. To gain a better understanding of human communication

Building computational models with human language-processing abilities requires
knowledge of how humans acquire, store, and process language.

Historically, there have been two major approaches to NLP:

1. Rationalist approach
2. Empiricist approach

Rationalist approach (rule-based):
The system understands language by following rules and grammar written by humans.
Example: A program that checks sentence structure using fixed grammar rules.

Empiricist approach: The system learns language patterns from large amounts of text data instead
of fixed rules.
Example: A translation tool trained on thousands of sentence examples.

In this view, the learning of detailed language structures takes place through the application of
these general abilities to the sensory inputs available to the child.

1.2 Origins of NLP


NLP, which includes speech processing and is sometimes mistakenly termed natural language
understanding, originated from machine translation research. Natural language processing includes both
understanding (interpretation) and generation (production). We are concerned with text processing only:
the area of computational linguistics and its applications.

Computational linguistics is similar to theoretical linguistics and psycholinguistics, but uses different
tools. While theoretical linguistics is more about the structural rules of language, psycholinguistics
focuses on how language is used and processed in the mind.
Theoretical linguistics explores the abstract rules and structures that govern language. It investigates
universal grammar, syntax, semantics, phonology (how sounds are organized and used), and
morphology (the internal structure of words). Linguists create models to explain how languages are
structured and how meaning is encoded. E.g., most languages have constructs like noun and verb
phrases. Theoretical linguists identify rules that describe and restrict the structure of languages
(grammar).
Psycholinguistics focuses on the psychological and cognitive processes involved in language use. It
examines how individuals acquire, process, and produce language. Researchers study language
development in children and how the brain processes language in real time. E.g., studying how children
acquire language, such as learning to form questions ("What’s that?").

Computational Linguistics Models:


Computational linguistics is concerned with the study of language using computational models of
linguistic phenomena. It deals with the application of linguistic theories and computational techniques
for NLP. In computational linguistics, representing a language is a major problem; most knowledge
representations tackle only a small part of knowledge, and representing the whole body of knowledge is
almost impossible.
Computational models may be broadly classified under knowledge-driven and data-driven categories.
Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of
handcrafted grammar rules. Acquiring and encoding such knowledge is difficult and is the main
bottleneck in the development of such systems.
Data-driven approaches presume the existence of a large amount of data and usually employ some
machine learning technique to learn syntactic patterns. The performance of these systems depends on
the quantity of data available, and they are usually adaptive to noisy data.
The main objective of these models is to achieve a balance between knowledge-driven and
data-driven approaches on the one hand, and between theory and practice on the other.
With the unprecedented amount of information now available on the web, NLP has become one
of the leading techniques for processing and retrieving information.
Information retrieval includes a number of information processing applications such as information
extraction, text summarization, question answering, and so forth. It includes multiple modes of
information, including speech, images, and text.

1.3 Language & Knowledge


Language is the medium of expression in which knowledge is deciphered (converted into an
understandable form). We are here considering the text form of the language and its content as knowledge.
Language, being a medium of expression, is the outer form of the content it expresses. The same
content can be expressed in different languages.
Hence, to process a language means to process its content. As computers are not able to
understand natural language, methods are developed to map its content into a formal language.
The language and speech community considers a language as a set of sounds that, through
combinations, conveys meaning to a listener. However, we are concerned with representing and
processing text only. Language (text) processing has different levels, each involving different types of
knowledge.
1.3.1 Lexical analysis
• Analysis of words.
• Word-level processing requires morphological knowledge, i.e., knowledge about the structure
and formation of words from basic units (morphemes).
• The rules for forming words from morphemes are language specific.

1.3.2 Syntactic analysis


• Considers a sequence of words as a unit, usually a sentence, and finds its structure.
• Decomposes a sentence into its constituents (or words) and identifies how they relate to
each other.
• It captures grammaticality or non-grammaticality of sentences by looking at constraints like
word order, number, and case agreement.
• This level of processing requires syntactic knowledge (How words are combined to form
larger units such as phrases and sentences)
• For example:
o 'I went to the market' is a valid sentence whereas 'went the I market to' is not.
o 'She is going to the market' is valid, but 'She are going to the market' is not.
1.3.3 Semantic analysis
• It is associated with the meaning of the language.
• Semantic analysis is concerned with creating meaningful representation of linguistic inputs.
• E.g., 'Colorless green ideas sleep furiously' is syntactically correct, but semantically anomalous.

• A word can have a number of possible meanings associated with it. But in a given context, only
one of these meanings participates.

[Figure: syntactic vs. semantic representation of a sentence]

• Finding out the correct meaning of a particular use of word is necessary to find meaning of
larger units.
• E.g., 'Kabir and Ayan are married.' vs. 'Kabir and Suha are married.'
(The second is naturally read as 'married to each other', while the first is read as 'each married to someone else'.)
• Syntactic structure and compositional semantics fail to explain these interpretations.
• This means that semantic analysis requires pragmatic knowledge besides semantic and
syntactic knowledge.
• Pragmatics helps us understand how meaning is influenced by context, social factors, and
speaker intentions.

1.3.4 Discourse Analysis


• Attempts to interpret the structure and meaning of even larger units, e.g., at the paragraph and
document level, in terms of words, phrases, clusters, and sentences.
• It requires the resolution of anaphoric references and identification of discourse structure.

Anaphoric Reference
• Pragmatic knowledge may be needed for resolving anaphoric references.
(a) The district administration refused to give the trade union permission for the meeting because they feared violence.
(b) The district administration refused to give the trade union permission for the meeting because they oppose the government.
• In the above sentences, resolving the anaphoric reference 'they' requires pragmatic knowledge:
in (a) 'they' refers to the district administration, while in (b) it refers to the trade union.

Anaphoric reference means using a word (usually a pronoun) to refer back to something already mentioned in a sentence or text.

Example:

Ravi bought a book. He liked it a lot.


Here He refers back to Ravi, and it refers back to book.

1.3.5 Pragmatic analysis


• The highest level of processing, deals with the purposeful use of sentences in situations.
• It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text.

Pragmatic analysis in NLP (Natural Language Processing) is about understanding the intended meaning
of a sentence in context, not just the literal words.

Example:

Sentence: "Can you open the window?"

Literal meaning (syntax/semantics): asking if the person has the ability to open the window.

Pragmatic meaning: it’s actually a request for the person to open the window.

So, pragmatic analysis helps computers (and humans!) figure out what people really mean in real-life
situations.

1.4 The Challenges of NLP


• Natural languages are highly ambiguous and vague, so achieving a precise representation of content
can be difficult.
• The inability to capture all the required knowledge.
• Identifying the semantics of the text is difficult.
• A language keeps evolving: new words are added continually and existing words are
introduced in new contexts (e.g., '9/11' for the terrorist attack on the WTC).
Solution: machines can learn such usage only by considering context; the context of a
word is defined by its co-occurring words.
• The frequency of a word being used in a particular sense also affects its meaning.
• Idioms, metaphors, and ellipses add more complexity to identifying the meaning of written text.
o Example: “The old man finally kicked the bucket.” Here "kicked the bucket" is a well-known
idiom, meaning "to die".
o "Time is a thief." This metaphor suggests that time robs you of valuable moments or
experiences in life.
o "I’m going to the store, and you’re going to the party, right?" "Yes, I am…"
Ellipsis refers to the omission of words or phrases in a sentence (represented by "…").
• The ambiguity of natural languages is another difficulty (requiring explicit as well as implicit sources of
knowledge).
o Word ambiguity. Example: 'Taj' can be a monument, a brand of tea, or a hotel.
- “Can” is ambiguous in its part of speech (resolved by a part-of-speech tagging algorithm).
- “Bank” is ambiguous in its meaning (resolved by a word sense disambiguation algorithm).
o Structural ambiguity: a sentence itself may be ambiguous.
- 'Stolen rifle found by tree.'
- Verb sub-categorization may help to resolve it.
- Probabilistic parsing uses statistical models to predict the most likely syntactic structure.
• A number of grammars have been proposed to describe the structure of sentences.
o It is almost impossible for a grammar to capture the structure of all and only meaningful texts.
1.5 Language and Grammar
• Language Grammar: Grammar defines language and consists of rules that allow parsing and
generation of sentences, serving as a foundation for natural language processing.
• Syntax vs. Semantics: Although syntax and semantics are closely related, a separation is made
in processing due to the complexity of world knowledge influencing both language structure
and meaning.

• Challenges in Language Specification: Natural languages constantly evolve, and the
numerous exceptions make language specification challenging for computers.
• Different Grammar Frameworks: Various grammar frameworks have been developed,
including transformational grammar, lexical functional grammar, and dependency grammar,
each focusing on different aspects of language such as derivation or relationships.
• Chomsky’s Contribution: Noam Chomsky’s generative grammar framework, which uses rules
to specify grammatically correct sentences, has been fundamental in the development of formal
grammar hierarchies.
Chomsky argued that phrase structure grammars are insufficient for natural language and proposed
transformational grammar in Syntactic Structures (1957). He suggested that each sentence has two
levels: a deep structure and a surface structure (as shown in Fig 1), with transformations mapping one
to the other.

Fig 1. Surface and Deep Structures of sentence

• Chomsky argued that an utterance is the surface representation of a 'deeper structure'
representing its meaning.
• The deep structure can be transformed in a number of ways to yield many different surface-
level representations.
• Sentences with different surface-level representations having the same meaning, share a
common deep-level representation.
Pooja plays veena.
Veena is played by Pooja.

Both sentences have the same meaning, despite having different surface structures (roles of subject and
object are inverted).
Transformational grammar has three components:
1. Phrase structure grammar: Defines the basic syntactic structure of sentences.
2. Transformational rules: Describe how deep structures can be transformed into different surface
structures.
3. Morphophonemic rules: Govern how the syntactic structure of a sentence influences
the form of its words in terms of sound and pronunciation (phonology).

Phrase structure grammar consists of rules that generate natural language sentences and assign a
structural description to them. As an example, consider the following set of rules:

S → NP + VP
VP → V + NP
NP → Det + Noun
V → Aux + Verb
Det → the, a, an, ...
Verb → catch, write, eat, ...
Noun → police, snatcher, ...
Aux → will, is, can, ...

Transformation rules transform one phrase-marker (underlying) into another phrase-marker (derived).
These rules are applied to the terminal string generated by the phrase structure rules. They transform one
surface representation into another, e.g., an active sentence into a passive one.
Consider the active sentence: “The police will catch the snatcher.”

E.g., [NP1 - Aux - V - NP2] → [NP2 - Aux + be + en - V - by + NP1]

The application of phrase structure rules will assign the structure shown in Fig 2 (a)

Fig. 2: (a) Phrase structure (b) Passive Transformation

The passive transformation rules will convert the sentence into:

The + snatcher + will + be + en + catch + by + the + police

Morphophonemic rule: Another transformational rule will then reorder 'en + catch' to 'catch + en', and
subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'.
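The whole derivation (passive transformation, 'en'-hopping, and the morphophonemic step) can be sketched as simple list manipulation. This is a toy rendering under the rule format given above; the function name and the IRREGULAR lookup table are assumptions:

```python
# Passive transformation: [NP1 - Aux - V - NP2] -> [NP2 - Aux + be + en - V - by + NP1],
# followed by reordering 'en + V' to 'V + en' and a morphophonemic lookup.
IRREGULAR = {("catch", "en"): "caught"}  # morphophonemic rule: catch + en -> caught

def passivize(np1, aux, verb, np2):
    # Derived phrase-marker with 'be + en' inserted
    tokens = [np2, aux, "be", "en", verb, "by", np1]
    # Reorder 'en + V' to 'V + en'
    i = tokens.index("en")
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    # Morphophonemic rule: fuse 'V + en' into the inflected form
    fused = IRREGULAR.get((tokens[i], "en"), tokens[i] + "ed")
    return tokens[:i] + [fused] + tokens[i + 2:]

print(" ".join(passivize("the police", "will", "catch", "the snatcher")))
# -> "the snatcher will be caught by the police"
```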

Note: Long-distance dependency refers to syntactic phenomena where a verb and its subject or object
can be arbitrarily far apart. Wh-movement is a specific case of this type of dependency.

E.g.,

"I wonder who John gave the book to" involves a long-distance dependency between 'who' and the verb
'gave', whose recipient it is. Even though 'who' is not directly adjacent to the verb, the syntactic
relationship between them is still clear.
The problem in the specification of appropriate phrase structure rules occurs because these phenomena
cannot be localized at the surface structure level.

1.6 Processing Indian Languages


There are a number of differences between Indian languages and English:
• Unlike English, Indic scripts have a non-linear structure.
• Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence
structure.
• Indian languages have a free word order, i.e., words can be moved freely within a sentence
without changing the meaning of the sentence.
• Spelling standardization is more subtle in Hindi than in English.
• Indian languages have a relatively rich set of morphological variants.
• Indian languages make extensive and productive use of complex predicates (CPs).
• Indian languages use post-position (Karakas) case markers instead of prepositions.
• Indian languages use verb complexes consisting of sequences of verbs,
o e.g., गा रहा है (ga raha hai-singing) and खेल रही है (khel rahi hai-playing).
o The auxiliary verbs in this sequence provide information about tense, aspect,
modality, etc

Paninian grammar provides a framework for Indian language models. These can be used for
computation of Indian languages. The grammar focuses on extraction of relations from a
sentence.


1.7 NLP Applications


1.7.1 Machine Translation
This refers to automatic translation of text from one human language to another. In order to carry
out this translation, it is necessary to have an understanding of words and phrases, grammars of the two
languages involved, semantics of the languages, and word knowledge.

1.7.2 Speech Recognition


This is the process of mapping acoustic speech signals to a set of words. The difficulties arise due to
wide variations in the pronunciation of words, homonyms (e.g., dear and deer), and acoustic ambiguities
(e.g., 'in the rest' vs. 'interest').

1.7.3 Speech Synthesis


Speech synthesis refers to automatic production of speech (utterance of natural language sentences).
Such systems can read out your mails on telephone, or even read out a storybook for you.

1.7.4 Information Retrieval


This focuses on identifying documents relevant to a user's query. NLP techniques such as
indexing, word sense disambiguation, query modification (including query expansion), and
knowledge bases improve IR performance. Lexical resources like WordNet, LDOCE, and Roget's
Thesaurus enhance these systems, helping to refine search results and improve accuracy.

1.7.5 Information Extraction


An information extraction system captures and outputs factual information contained within a
document. The query is specified as a pre-defined template, and the system identifies the subset of
information within a document that fits this template.

1.7.6 Question Answering


Given a question and a set of documents, a question answering system attempts to find the precise
answer, or at least the precise portion of text in which the answer appears. A question answering system
requires more NLP than an information retrieval system or an information extraction system. It requires
not only precise analysis of questions and portions of texts but also semantic as well as background
knowledge to answer certain types of questions.

1.7.7 Text Summarization


This deals with the creation of summaries of documents and involves syntactic, semantic, and
discourse level processing of text.
2. LANGUAGE MODELLING
Statistical language modelling:

• Creates a language model by training it from a corpus.

• To capture regularities of a language, the training corpus needs to be sufficiently large.
• Language modelling is fundamental to many NLP applications, including speech recognition,
spelling correction, handwriting recognition, and machine translation, as well as information
retrieval, text summarization, and question answering.
• The most popular models are n-gram models.

2.2.4 Paninian Framework


Paninian grammar (PG) was written by Panini around 500 BC in Sanskrit (the original text being titled
Ashtadhyayi). The framework can be used for other Indian languages and possibly some Asian languages
as well.

Unlike English, which is SVO (Subject-Verb-Object) ordered, Indian (and many other Asian)
languages are SOV (Subject-Object-Verb) ordered and inflectionally rich. The inflections provide
important syntactic and semantic cues for language analysis and understanding. The Paninian
framework takes advantage of these features.

Note: Inflectional – refers to the changes a word undergoes to express different grammatical categories
such as tense, number, gender, case, mood, and aspect without altering the core meaning of the word.

Indian languages have traditionally used oral communication for knowledge propagation. In Hindi, we
can change the position of subject and object. For example:

(a) माँ बच्चे को खाना देती है।
Maan bachche ko khaanaa detii hai
mother child-to food give(s)
'Mother gives food to the child.'

(b) बच्चे को माँ खाना देती है।
Bachche ko maan khaanaa detii hai
child-to mother food give(s)
'Mother gives food to the child.'
The auxiliary verbs follow the main verb. In Hindi, they remain as separate words:

खा रहा है (khaa rahaa hai) - 'is eating'
करता रहा है (kartaa rahaa hai) - 'has been doing'
In Hindi, some main verbs, e.g., give (देना) and take (लेना), also combine with
other main verbs to change the aspect and modality of the verbs.

उसने खाना खाया। (Usne khaanaa khaayaa) - he (subj) food ate - 'He ate food.'
उसने खाना खा लिया। (Usne khaanaa khaa liyaa) - he (subj) food eat taken - 'He ate food (completed the action).'

वह चला। (Vah chalaa) - he moved - 'He moved.'
वह चल दिया। (Vah chal diyaa) - he move given - 'He moved (started the action).'

The nouns are followed by post-positions instead of prepositions. These generally remain separate
words in Hindi:

रेखा के पिता (Rekha ke pitaa) - Rekha of father - 'father of Rekha'
उसके पिता (Uske pitaa) - 'her (his) father'
All nouns are categorized as feminine or masculine, and the verb form must have gender agreement
with the subject:

ताला खो गया (taalaa kho gayaa) - lock lose (past, masc.) - 'The lock was lost.'
चाभी खो गयी (chaabhii kho gayii) - key lose (past, fem.) - 'The key was lost.'
Layered Representation in PG
GB (Government and Binding) theory represents three syntactic levels: deep structure, surface
structure, and logical form (LF), where LF is nearer to semantics. This theory tries to resolve all
language issues at syntactic levels only.

The Paninian grammar framework is said to be syntactico-semantic, that is, one can go from the
surface layer to deep semantics by passing through intermediate layers.

• The surface and the semantic levels are obvious. The other
two levels should not be confused with the levels of GB.
• Vibhakti literally means inflection, but here, it refers to
word (noun, verb, or other) groups based either on case
endings, or post-positions, or compound verbs, or main and
auxiliary verbs, etc
• Karaka (pronounced Kaaraka) literally means Case, and in GB, we have already discussed case
theory, θ-theory, and sub-categorization, etc. Paninian Grammar has its own way of defining
Karaka relations.

Karaka Theory

• Karaka theory is the central theme of PG framework.

• Karaka relations are assigned based on the roles played by various participants in the main
activity.
• Various Karakas, such as Karta (subject), Karma (object), Karana (instrument), Sampradana
(beneficiary), Apadan (separation), and Adhikaran (locus).

Example:

माँ बच्ची को आँगन में हाथ से रोटी खिलाती है।
Maan bachchii ko aangan mein haath se rotii khilaatii hai
mother child-to courtyard-in hand-by bread feed(s)
'The mother feeds bread to the child by hand in the courtyard.'

• 'maan' (mother) is the Karta. The Karta generally has the 'ne' or a zero (Ø) case marker.
• 'rotii' (bread) is the Karma. ('Karma' is similar to the object and is the locus of the result of the activity.)
• 'haath' (hand) is the Karan (the noun group through which the goal is achieved); it has the marker
“dwara” (by) or “se”.
• 'Sampradan' is the beneficiary of the activity, e.g., bachchii (child); it takes the marker “ko” (to)
or “ke liye” (for).
• 'Apaadaan' denotes separation; its marker ("se", from) is attached to the part that serves as a
reference point (being stationary).
• aangan (courtyard) is the Adhikaran (the locus, i.e., support in space or time, of the Karta or Karma).

Issues in Paninian Grammar


The two problems challenging linguists are:
(i) computational implementation of PG, and
(ii) adaptation of PG to Indian and other similar languages.
However, many issues remain unresolved, especially in cases of shared Karaka relations. Another
difficulty arises when the mapping between the Vibhakti (case markers and post-positions) and the
semantic relation (with respect to the verb) is not one-to-one: two different Vibhakti can represent
the same relation, or the same Vibhakti can represent different relations in different contexts.
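To make this many-to-many mapping concrete, here is a toy Vibhakti-to-Karaka lookup in Python. The table is a simplified assumption for illustration, not the full Paninian specification:

```python
# Toy mapping from Hindi post-position markers to candidate Karaka roles,
# illustrating why the Vibhakti -> Karaka mapping is not one-to-one.
VIBHAKTI_TO_KARAKA = {
    "ne":      ["karta"],
    "ko":      ["karma", "sampradan"],   # same marker, two possible relations
    "se":      ["karan", "apadan"],      # instrument or separation
    "mein":    ["adhikaran"],
    "ke liye": ["sampradan"],
}

for marker, roles in VIBHAKTI_TO_KARAKA.items():
    status = "ambiguous" if len(roles) > 1 else "unique"
    print(f"{marker!r}: {roles} ({status})")
```

A real Paninian parser would need verb-specific (karaka chart) and contextual information to pick one role among the candidates, which is exactly the unresolved difficulty noted above.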

2.3 Statistical Language Model


A statistical language model is a probability distribution P(s) over all possible word sequences (or any
other linguistic unit like words, sentences, paragraphs, documents, or spoken utterances).

2.3.1 n-gram Model (https://www.youtube.com/watch?v=Vc2C1NZkH0E )

Applications: suggestions in messages, spelling correction, machine translation, handwriting recognition, etc.

It is a statistical method that predicts the probability of the next word in a sequence based on the
preceding words (the previous n-1 words in an n-gram model).

Why n-gram?

The goal of a statistical language model is to estimate the probability (likelihood) of a sentence. This is
achieved by decomposing sentence probability into a product of conditional probabilities using the
chain rule as follows:

P(s) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1) = Π (i = 1..n) P(wi | hi)

where hi is the history of word wi, defined as w1 w2 ... wi-1.

So, in order to calculate sentence probability, we need to calculate the probability of a word, given the
sequence of words preceding it. This is not a simple task.

An n-gram model simplifies the task by approximating the probability of a word given all the previous
words by the conditional probability given the previous n-1 words only:

P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)

Thus, an n-gram model calculates P(wi | hi) by modelling language as a Markov model of order n-1, i.e., by
looking at the previous n-1 words only.

A model that limits the history to the previous one word only is termed a bi-gram (n = 2) model.

A model that conditions the probability of a word on the previous two words is called a tri-gram (n = 3)
model.

Using the bi-gram and tri-gram estimates, the probability of a sentence can be calculated as:

P(s) ≈ Π (i = 1..n) P(wi | wi-1)        (bi-gram)
P(s) ≈ Π (i = 1..n) P(wi | wi-2 wi-1)   (tri-gram)

Example: "The Arabian knights are fairy tales of the east"
bi-gram approximation: P(east | the); tri-gram approximation: P(east | of the)

One pseudo-word <s> is introduced to mark the beginning of the sentence in bi-gram estimation.

Two pseudo-words <s1> and <s2> are used for tri-gram estimation.
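A small helper can make the windowing and pseudo-word padding concrete. This is an illustrative sketch (a single <s> symbol is reused for padding instead of distinct <s1>, <s2>):

```python
def ngram_contexts(sentence, n):
    """Yield (history, word) pairs for an n-gram model, padding the
    sentence start with n-1 pseudo-words."""
    tokens = ["<s>"] * (n - 1) + sentence.split()
    for i in range(n - 1, len(tokens)):
        yield tuple(tokens[i - n + 1:i]), tokens[i]

sent = "the Arabian knights are fairy tales of the east"
print(list(ngram_contexts(sent, 2))[-1])  # (('the',), 'east')     -> P(east | the)
print(list(ngram_contexts(sent, 3))[-1])  # (('of', 'the'), 'east') -> P(east | of the)
```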


How to estimate these probabilities?
1. Train the n-gram model on a training corpus.
2. Estimate the n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e.,
using relative frequencies: count a particular n-gram in the training corpus and divide it by the sum
of all n-grams that share the same prefix:

P(wi | wi-n+1 ... wi-1) = C(wi-n+1 ... wi) / C(wi-n+1 ... wi-1)

3. The sum of the counts of all n-grams that share the first n-1 words is equal to the count of the
common prefix wi-n+1, ..., wi-1.

Example (tri-gram): to predict the word following "The girl bought", a tri-gram model ranks candidate
words w by P(w | girl bought), as sketched below.
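A minimal sketch of such a prediction, assuming hypothetical trigram counts (the numbers are invented for illustration, not taken from any real corpus):

```python
from collections import Counter

# Hypothetical counts of trigrams starting with "girl bought"
trigram_counts = Counter({
    ("girl", "bought", "a"): 12,
    ("girl", "bought", "the"): 7,
    ("girl", "bought", "flowers"): 3,
})

context = ("girl", "bought")
candidates = {w: c for (w1, w2, w), c in trigram_counts.items()
              if (w1, w2) == context}
total = sum(candidates.values())                 # 22
best = max(candidates, key=candidates.get)
print(best, round(candidates[best] / total, 2))  # a 0.55
```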


Example
Training set:

The Arabian Knights


These are the fairy tales of the east
The stories of the Arabian knights are translated in many languages

Bi-gram model:

P(the | <s>) = 0.67      P(Arabian | the) = 0.4      P(knights | Arabian) = 1.0
P(are | these) = 1.0     P(the | are) = 0.5          P(fairy | the) = 0.2
P(tales | fairy) = 1.0   P(of | tales) = 1.0         P(the | of) = 1.0
P(east | the) = 0.2      P(stories | the) = 0.2      P(of | stories) = 1.0
P(are | knights) = 1.0   P(translated | are) = 0.5   P(in | translated) = 1.0
P(many | in) = 1.0       P(languages | many) = 1.0

Test sentence: The Arabian knights are the fairy tales of the east.
P(s) = P(The | <s>) × P(Arabian | the) × P(knights | Arabian) × P(are | knights)
× P(the | are) × P(fairy | the) × P(tales | fairy) × P(of | tales) × P(the | of)
× P(east | the)
= 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
≈ 0.0054
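The whole worked example can be reproduced with a short script; it also scores the sentence in log space, anticipating the underflow issue discussed below. The tokenization (lowercasing, <s> padding) is an assumption; the exact product is 0.0053, which matches the hand computation above up to rounding of the intermediate factors:

```python
import math
from collections import Counter

corpus = [
    "the arabian knights",
    "these are the fairy tales of the east",
    "the stories of the arabian knights are translated in many languages",
]

bigrams, prefixes = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split()           # <s> marks the sentence start
    for prev, w in zip(tokens, tokens[1:]):
        bigrams[(prev, w)] += 1
        prefixes[prev] += 1                   # denominator: bigrams sharing this prefix

def p(w, prev):
    """Relative-frequency (MLE) bigram estimate."""
    return bigrams[(prev, w)] / prefixes[prev]

test = ["<s>"] + "the arabian knights are the fairy tales of the east".split()
pairs = list(zip(test, test[1:]))
prob = math.prod(p(w, prev) for prev, w in pairs)
log_prob = sum(math.log(p(w, prev)) for prev, w in pairs)  # log space avoids underflow
print(round(prob, 4), round(math.exp(log_prob), 4))        # 0.0053 0.0053
```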
Limitations:

• Multiplying the probabilities might cause a numerical underflow, particularly in long sentences.
To avoid this, calculations are made in log space, where a calculation corresponds to adding log
of individual probabilities and taking antilog of the sum.
• The n-gram model faces data sparsity: it assigns zero probability to n-grams unseen in the
training data, leading to many zero entries in, e.g., the bigram matrix, since no training corpus
can contain every possible word combination.
• Fails to capture long-distance dependencies in natural language sentences.

Solution:

• A number of smoothing techniques have been developed to handle the data sparseness problem.
• Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-
grams and assigning them non-zero values.
2.3.2 Add-one Smoothing

• It adds a value of one to each n-gram frequency before normalizing them into probabilities.
Thus, the conditional probability becomes:

P(wi | wi-n+1 ... wi-1) = (C(wi-n+1 ... wi) + 1) / (C(wi-n+1 ... wi-1) + V)

where V is the vocabulary size.

• Yet, not effective, since it assigns the same probability to all missing n-grams, even though
some of them could be more intuitively appealing than others.

Example:
Consider the following toy corpus:

• "I love programming"

• "I love coding"

We want to calculate the probability of the bigram "I love" using add-one smoothing.

Step 1: Count the occurrences

• Unigrams:
o "I" appears 2 times
o "love" appears 2 times
o "programming" appears 1 time
o "coding" appears 1 time
• Bigrams:
o "I love" appears 2 times
o "love programming" appears 1 time
o "love coding" appears 1 time
• Vocabulary size V: there are 4 unique words: "I", "love", "programming", "coding".

Step 2: Apply add-one smoothing

For the bigram "I love":

P(love | I) = (C(I love) + 1) / (C(I) + V) = (2 + 1) / (2 + 4) = 0.5

Step 3: For an unseen bigram

Let’s say we want to calculate the probability for the bigram "I coding" (which doesn’t appear in the
training data):

P(coding | I) = (0 + 1) / (2 + 4) ≈ 0.17
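The same computation as a small Python function (a sketch assuming the two-sentence toy corpus above; p_add_one is an illustrative name):

```python
from collections import Counter

corpus = ["i love programming", "i love coding"]

bigrams, unigrams = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size: 4

def p_add_one(w, prev):
    """Laplace (add-one) smoothed bigram probability."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add_one("love", "i"))    # (2 + 1) / (2 + 4) = 0.5
print(p_add_one("coding", "i"))  # (0 + 1) / (2 + 4) ≈ 0.17 (unseen bigram)
```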

2.3.3 Good-Turing Smoothing

• Good-Turing smoothing improves probability estimates by adjusting for unseen n-grams based
on the frequency distribution of observed n-grams.
• It adjusts the frequency f of an n-gram using the count of n-grams having a frequency of
occurrence f+1. It converts the frequency of an n-gram from f to f* using the following
expression:

f* = (f + 1) × n(f+1) / n(f)

where n(f) is the number of n-grams that occur exactly f times in the training corpus.

As an example, consider that the number of n-grams that occur 4 times is 25,108 and the number
of n-grams that occur 5 times is 20,542. Then the smoothed count for the n-grams occurring 4 times
will be:

f* = (4 + 1) × 20,542 / 25,108 ≈ 4.09
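As a sketch, the adjusted-count formula translates directly into code (the freq_of_freq mapping is assumed to be precomputed from the training corpus):

```python
def good_turing(f, freq_of_freq):
    """Adjusted count f* = (f + 1) * n(f+1) / n(f), where n(f) is the
    number of n-grams occurring exactly f times in the corpus."""
    return (f + 1) * freq_of_freq[f + 1] / freq_of_freq[f]

# Counts from the example: 25,108 n-grams occur 4 times, 20,542 occur 5 times
n = {4: 25108, 5: 20542}
print(round(good_turing(4, n), 2))  # 5 * 20542 / 25108 ≈ 4.09
```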

2.3.4 Caching Technique


The caching model is an enhancement to the basic n-gram model that addresses the issue of frequency
variation across different segments of text or documents. In traditional n-gram models, the probability
of an n-gram is calculated based solely on its occurrence in the entire corpus, which does not take into
account the local context or recent patterns. The caching model improves on this by incorporating
recently seen n-grams into the probability calculations, as sketched below.
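One common way to realize this is to interpolate the static model with a unigram 'cache' of recently seen words. The interpolation weight lam and the helper names below are assumptions, since the section does not fix a specific formula; real cache models tune the weight on held-out data:

```python
from collections import Counter

def cached_probability(w, prev, static_p, history, lam=0.9):
    """Interpolate a static n-gram estimate with a unigram 'cache'
    built from recently seen words."""
    cache = Counter(history)
    cache_p = cache[w] / len(history) if history else 0.0
    return lam * static_p(w, prev) + (1 - lam) * cache_p

# Usage: a word that just occurred in the document gets a boosted estimate.
history = "the model caches recent words so repeated words score higher".split()
static_p = lambda w, prev: 0.001  # stand-in for a corpus-trained bigram model
print(cached_probability("words", "repeated", static_p, history))  # 0.0209 > 0.001
```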

