
NATURAL LANGUAGE PROCESSING

Module-I

Gunti Spandan
Associate Professor
Department of CSE
GITAM School of Technology (GST)
Bangalore
Email: [email protected]
Contents
Introduction to natural language processing,

ambiguities in language,

Regular expressions, words, morphology, morphological parsing,

word tokenization, lemmatization & stemming,

edit distance,

N-gram language models,

smoothing - Laplace smoothing,

Good-Turing discounting, interpolation, backoff, and perplexity



TEXT BOOKS and REFERENCES
TEXT BOOKS:

1. Daniel Jurafsky, James H Martin, “Speech and Language Processing: An introduction to Natural Language
Processing, Computational Linguistics and Speech Recognition”, 2/e, Prentice Hall, 2008.
2. C. Manning, H. Schutze, “Foundations of Statistical Natural Language Processing”, MIT Press. Cambridge, MA,
1999.
3. Jacob Eisenstein, Introduction to Natural Language Processing, MIT Press, 2019.

REFERENCE BOOK:

1. Jalaj Thanaki, Python Natural Language Processing: Explore NLP with Machine Learning and Deep Learning
Techniques, Packt, 2017.



Introduction



Introduction

 Natural Language Processing (NLP) is the convergence of Computer Science, Computational Linguistics, Machine
Learning and Artificial Intelligence.

 The main objective of NLP: to make machines understand natural human language and generate automated
responses to it.

 Today, most applications on smartphones depend on NLP techniques.

 Applications like Apple’s Siri, Amazon’s Alexa and Google Assistant are used daily by
the majority of smartphone users; these NLP-based applications have become part of
everyday life all over the world.



Introduction
(from Book-1)
Introduction:
• The idea of giving computers the ability to process human language is as old as the idea of computers themselves.
• This is a vibrant interdisciplinary field with many names like:
-- speech and language processing,
-- human language technology,
-- natural language processing,
-- computational linguistics, and
-- speech recognition and synthesis.

Goal of this new field: to perform


• useful tasks involving human language,
• tasks like enabling human-machine communication,
• improving human-human communication, or
• simply doing useful processing of text or speech.
Continued...
Ex: A conversational agent
The HAL 9000 computer in Stanley Kubrick’s film 2001: A Space Odyssey

https://www.youtube.com/watch?v=Wy4EfdnMZ5g
https://www.youtube.com/watch?v=HH37JTBpi2A
https://www.dazeddigital.com/film-tv/article/39557/1/space-odyssey-stanley-kubrick-hal-9000-ai-in-film

• HAL is an artificial agent capable of such advanced language behaviour as speaking and understanding English, and
at a crucial moment in the plot, even reading lips.

• It is now clear that HAL’s creator, Arthur C. Clarke, was a little optimistic in predicting when an artificial agent such
as HAL would be available.

• We call programs like HAL that converse with humans in natural language
conversational agents or dialogue systems.



Why NLP is required

Continued...
• Machine Translation: to automatically translate a document from one language to another.
-- making available to non-English-speaking readers the vast amount of scientific information on the Web in
English.
-- translating for English speakers the hundreds of millions of Web pages written in other languages like Chinese.

• Many other language processing tasks are also related to the Web.

• Another such task is Web-based question answering.


• This is a generalization of simple Web search, where instead of just typing keywords,
a user might ask complete questions, ranging from easy to hard, like the following:

 What does “divergent” mean?


 What year was Abraham Lincoln born?
 How many states were in the United States that year?
 How much Chinese silk was exported to England by the end of the 18th century?
 What do scientists think about the ethics of human cloning?
Continued...
• Some of these, such as

definition questions, or simple factoid questions like dates and locations,


can already be answered by search engines.

• But answering more complicated questions might require extracting information that is

-- embedded in other text on a Web page,

-- doing inference (drawing conclusions based on known facts), or

-- synthesizing and summarizing information from multiple sources or Web pages.



1.1 Knowledge in Speech and Language Processing
 Language processing applications differ from data processing systems in their use of knowledge of language.

Ex: Unix wc program (counts the total number of bytes, words, and lines in a text file)

 When used to count bytes and lines, wc is an ordinary data processing application.
 When it is used to count the words in a file, it requires knowledge about what it means to be a word and thus
becomes a language processing system.

HAL:

• HAL must be able to recognize words from an audio signal and to generate an audio signal from a sequence of
words.
• These tasks of speech recognition and speech synthesis require knowledge about phonetics and phonology:
-- how words are pronounced in terms of sequences of sounds and
-- how each of these sounds is realized acoustically.



Continued...
• Moving beyond individual words, HAL must use structural knowledge to properly string together the words that
constitute its response.
 (1.1) I’m I do, sorry that afraid Dave I’m can’t.
(HAL must know that the above sequence of words will not make sense to Dave)
The knowledge needed to order and group words comes under the heading of syntax.

 (1.2) How much Chinese silk was exported to Western Europe by the end of the 18th century?
• To answer this question, we need to know something about :
lexical semantics, the meaning of all the words (export or silk) as well as
compositional semantics (what exactly constitutes Western Europe as opposed to Eastern or Southern
Europe, what does end mean when combined with the 18th century)

We also need to know something about : the relationship of the words to the syntactic structure.
For example, we need to know that –
by the end of the 18th century is a temporal end-point and not a description of the agent.



Continued...
 (1.3) How much Chinese silk was exported to Western Europe by southern merchants?

• HAL must determine that Dave’s utterance is a request for action


REQUEST: HAL, open the pod bay door.
STATEMENT: HAL, the pod bay door is open.
INFORMATION QUESTION: HAL, is the pod bay door open?

HAL’s actual reply softens the refusal with I’m sorry and I’m afraid, and uses the indirect I can’t
rather than the more direct (and truthful) I won’t.

A blunt reply such as No or No, I won’t open the door would be less polite to Dave.

• This knowledge about the kind of actions that speakers intend by their use of sentences is pragmatic or dialogue
knowledge.



Continued...
 (1.4) How many states were in the United States that year?
Another kind of pragmatic or discourse knowledge is required to answer the question.

• What year is that year?


To interpret words like that year,
a question-answering system needs to examine the earlier questions that were asked;
in this case, the previous question talked about the year that Lincoln was born.
Thus, this task of coreference resolution makes use of knowledge about how words like
that or pronouns like it or she refer to previous parts of the discourse.
To summarize, engaging in complex language behaviour requires various kinds of knowledge of language:
 Phonetics and Phonology —knowledge about linguistic sounds
 Morphology —knowledge of the meaningful components of words
 Syntax —knowledge of the structural relationships between words
 Semantics —knowledge of meaning
 Pragmatics — knowledge of the relationship of meaning to the goals and intentions of the speaker
 Discourse —knowledge about linguistic units larger than a single utterance
AMBIGUITIES IN LANGUAGE

Ambiguity in NLP is the property of being open to more than one interpretation: the same input can be
understood in more than one way.

Example 1: The speaker says, "Call me a taxi, please." Is the speaker asking someone to hail them a taxi,
or asking to be called "a taxi"? The two readings make the sentence ambiguous.

1.2 Ambiguity
• Most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels.

• Some input is ambiguous if multiple, alternative linguistic structures can be built for it.

• Consider the spoken sentence: I made her duck.

 Here are five different meanings this sentence could have (see if you can think of some more),
each of which exemplifies an ambiguity at some level:

(1.5) I cooked waterfowl for her. (waterfowl : duck, goose or other aquatic bird)
(1.6) I cooked waterfowl belonging to her.
(1.7) I created the (plaster?) duck she owns.
(1.8) I caused her to quickly lower her head or body.
(1.9) I waved my magic wand and turned her into undifferentiated waterfowl.



Continued...
• These different meanings are caused by a number of ambiguities.

• First, the words duck and her are morphologically or syntactically ambiguous in their part-of-speech.
• Duck can be a verb or a noun, while her can be a dative pronoun or a possessive pronoun.

• Second, the word make is semantically ambiguous; it can mean create or cook.

• Finally, the verb make is syntactically ambiguous in a different way.


Make can be transitive -- taking a single direct object (1.6), or
it can be ditransitive -- taking two objects (1.9), meaning that the first object (her) was made into the
second object (duck).

• Finally, make can take a direct object and a verb (1.8), meaning that the object (her) was caused to perform the
verbal action (duck).
• Furthermore, in a spoken sentence, there is an even deeper kind of ambiguity;
the first word could have been eye or the second word maid.
2. Regular Expressions, Text Normalization and Edit Distance
User: I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User: I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User: Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY
User: My mother takes care of me.
ELIZA: WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
User: My father.
ELIZA: YOUR FATHER
User: You are like my father in some ways.

Joseph Weizenbaum (creator of ELIZA)



Continued...
• The dialogue above is from ELIZA, an early NLP system –
that could carry on a limited conversation with a user by imitating the responses of a
Rogerian psychotherapist.

• It is a simple program that uses pattern matching to recognize phrases like “I need X” and
translate them into suitable outputs like “What would it mean to you if you got X?”.
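
A one-line illustration of this kind of pattern matching, as a sketch in the spirit of ELIZA (not its actual
rules; the input utterance below is made up):

import re

utterance = "I need a vacation"   # illustrative input
reply = re.sub(r".*I need (.*)", r"WHAT WOULD IT MEAN TO YOU IF YOU GOT \1", utterance)
print(reply)   # WHAT WOULD IT MEAN TO YOU IF YOU GOT a vacation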

• This simple technique succeeds in this domain because


ELIZA doesn’t actually need to know anything to mimic a Rogerian psychotherapist.

• Eliza’s mimicry of human conversation was remarkably successful:


-- many people who interacted with ELIZA came to believe that it really understood them and their problems,
-- many continued to believe in ELIZA’s abilities even after the program’s operation was explained to them,
and
-- even today such chatbots are a fun diversion.



2.1 Regular Expressions (REs)
• REs -- can be used to specify strings we might want to extract from a document.

• Regular expression -- an algebraic notation for characterizing a set of strings.

• Particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search
through.

• A regular expression search function will search through the corpus, returning all texts that match the pattern.

• The corpus can be a single document or a collection.

Ex: the Unix command-line tool grep takes a regular expression and returns
every line of the input document that matches the expression.



Continued...
2.1.1 Basic Regular Expression Patterns
• The simplest kind is a sequence of simple characters.
• To search for woodchuck: we type /woodchuck/.
• The expression /Buttercup/ : matches any string containing the substring Buttercup;
grep with that expression would return the line : I’m called little Buttercup.
• The search string can consist of a single character (like /!/) or a sequence of characters (like /urgl/).
• Regular expressions are case sensitive:
-- lower case /s/ is distinct from upper case /S/ (/s/ matches a lower case s but not an upper case S).
-- the pattern /woodchucks/ will not match the string Woodchucks.



Continued...
• The case-sensitivity problem (when it is a problem) can be solved by using square brackets [ and ]:
the pattern /[wW]oodchuck/ matches woodchuck or Woodchuck.

• The regular expression /[1234567890]/ specifies -- any single digit

• /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ specifies -- any capital letter
• A simpler method is to use a range with the dash: /[0-9]/ matches any single digit,
/[A-Z]/ any capital letter, and /[a-z]/ any lower-case letter.



Continued...
• The square braces can also be used to specify what a single character cannot be, by use of the caret ^.
• If the caret ^ is the first symbol after the open square brace [, the resulting pattern is negated.
• This is only true when the caret is the first symbol after the open square brace.
• The pattern /[^a]/ matches any single character (including special characters) except a.

• For optional elements: /?/ -- means “the preceding character or nothing”


(“zero or one instances of the previous character”)



Continued...
Consider the Sheep Language:
• baa! baaa! baaaa! baaaaa! ...
• This language consists of strings with a b, followed by at least two a’s, followed by an exclamation point.
• Kleene * (pronounced “cleeny star”): The set of operators that allows us to say things like “some number of a’s”
is based on the asterisk or *.

• The Kleene star means “zero or more occurrences of the immediately previous character or regular expression”.
• /a*/ -- means “any string of zero or more as”.
(matches with a or aaaaaa, but also with Off Minor since the string Off Minor has zero a’s)

• The regular expression for matching one or more a is -- /aa*/ (one a followed by zero or more a’s)
• So /[ab]*/ means -- “zero or more a’s or b’s” (not “zero or more right square braces”).
(This will match strings like aaaa or ababab or bbbb.)

• An integer (a string of digits) is thus /[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
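
A minimal check in Python's re module of why the extra leading digit matters (a sketch; the sample strings
are illustrative):

import re

# /[0-9]*/ also matches the empty string, so it "succeeds" even on text with no digits at all.
print(repr(re.match(r'[0-9]*', 'abc').group()))        # '' (an empty match)

# /[0-9][0-9]*/ requires at least one digit, so it only matches real integers.
print(re.search(r'[0-9][0-9]*', 'room 101').group())   # 101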




Continued...
Kleene + : “one or more occurrences of the immediately preceding character or regular expression”.
/[0-9]+/ -- specifies “a sequence of digits”.
/baaa*!/ or /baa+!/ -- two ways to specify the sheep language

• Period (/./): a wildcard expression that matches any single character (except a carriage return)

Wildcard with Kleene *: When you put them together, .* means "any character (zero or more times)".
“any string of characters”
Ex: to find any line in which a particular word, aardvark, appears twice.

• We can specify this with the regular expression /aardvark.*aardvark/.



Continued...
Anchors in REs: (the caret ^ and the dollar sign $)
Anchors are special characters that anchor regular expressions to particular places in a string.

• The caret ^ matches the start of a line. /^The/ matches the word The only at the start of a line.
• Thus, the caret ^ has three uses: to match the start of a line,
to indicate a negation inside of square brackets, and
just to mean a caret.
• The dollar sign $ matches the end of a line.
• So the pattern ' $' (a space followed by the dollar sign) is a useful pattern for matching a space at the end of a line, and
/^The dog\.$/ matches a line that contains only the phrase The dog.
(We have to use the backslash here since we want the . to mean “period” and not the wildcard.)



Continued...
Two other anchors: \b matches a word boundary,
\B matches a non-boundary.

Thus, /\bthe\b/ matches the word the but not the word other.

• “word” (for an RE) is defined as any sequence of digits, underscores, or letters.


(based on the definition of “words” in programming languages)

Ex: /\b99\b/ will match the string 99 in


There are 99 bottles of juice on the wall (because 99 follows a space)
but not 99 in
There are 299 bottles of juice on the wall (since 99 follows a number).

But it will match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
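
A small sketch checking these anchors with Python's re module (the test strings are illustrative):

import re

print(re.search(r'^The dog\.$', 'The dog.') is not None)        # True
print(re.search(r'^The dog\.$', 'I saw The dog.') is not None)  # False
print(re.findall(r'\bthe\b', 'the other one'))                  # ['the']  (does not match "the" inside "other")
print(re.findall(r'\b99\b', 'There are 299 bottles and $99'))   # ['99']   (matches 99 in $99, not in 299)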



Continued...
2.1.2 Disjunction, Grouping and Precedence
Ex: Searching for texts about pets (interested in cats and dogs)
 We might want to search for either the string cat or the string dog.
 We use the disjunction operator (also called the pipe symbol |).
 The pattern /cat|dog/ matches either the string cat or the string dog.

Disjunction operator in the midst of a larger sequence:

Ex: Specifying both guppy and guppies

/guppy|ies/ -- won’t work (it matches only the strings guppy and ies)
(sequences like guppy take precedence over the disjunction operator |)

Parentheses : To make the disjunction operator apply only to a specific pattern


So the pattern is: /gupp(y|ies)/



Continued...
Ex: a line that has column labels of the form Column 1 Column 2 Column 3.

Regular Expression: /(Column [0-9]+ *)*/

to match the word Column, followed by a space, a number and optional spaces,
the whole pattern repeated zero or more times.

2.1.4 More Operators



Continued...



2.2 Words
What counts as a word:

• Look at one particular corpus (plural corpora) -- a computer-readable collection of text or speech.
Ex:1 Brown corpus -- a million-word collection of samples from 500 written English texts
from different genres (newspaper, fiction, non-fiction, academic, etc.),
assembled at Brown University in 1963–64.

How many words are in the following Brown sentence?


He stepped out into the hall, was delighted to encounter a water brother.

• It has 13 words (if we don’t count punctuation marks as words)


15 (if we count punctuation).
 Punctuation is critical for finding boundaries of things (commas, periods, colons) and for identifying some aspects
of meaning (question marks, exclamation marks, quotation marks).

 Some tasks (like part-of-speech tagging or parsing or speech synthesis) treat them as if they were separate words.
Continued...
Ex:2 The Switchboard corpus --

American English telephone conversations between strangers


-- collected in the early 1990s
-- contains 2430 conversations averaging 6 minutes each, totaling 240 hours of speech and about 3 million
words

• Such corpora of spoken language don’t have punctuation but do introduce other complications with regard to
defining words.

Ex: I do uh main- mainly business data processing


This utterance has two kinds of disfluencies.
The broken-off word main- -- is called a fragment.
Words like uh and um -- are called fillers or filled pauses.

(Are they words? --- it depends on the application)


Continued...
• If we are building a speech transcription system -- we strip out the disfluencies.

• Sometimes we keep the disfluencies.


-- Disfluencies like uh or um are actually helpful in speech recognition in predicting the upcoming word
(they may signal that the speaker is restarting the clause or idea)

Capitalized and uncapitalized tokens: Are they the same? (like They and they)

-- lumped together (in speech recognition)


-- useful feature (in part-of-speech or named-entity tagging)

Inflected forms (like cats versus cat):


(Inflected : change the form of (a word) to express a particular grammatical function or attribute,
typically tense, mood, person, number, and gender.
"Arabic verbs are inflected for person, number, and gender“)



Continued...
 Cats and cat : These two words have the same lemma cat but are different wordforms.

 Lemma -- a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
 Wordform -- the full inflected or derived form of the word.

• For morphologically complex languages like Arabic, we often need to deal with lemmatization.
• For many tasks in English, wordforms are sufficient.

How many words are there in English?

To answer this question we need to distinguish two ways of talking about words.

 Types -- the number of distinct words in a corpus


(if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|)
 Tokens -- the total number N of running words.



Continued...
• If we ignore punctuation, the following Brown sentence has 16 tokens and 14 types:

They picnicked by the pool, then lay back on the grass and looked at the stars.

• When we speak about the number of words in the language, we are generally referring to word types.
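
A short Python sketch (assumed, not from the slides) that counts the tokens and types of the Brown sentence
above, ignoring punctuation:

import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"[A-Za-z]+", sentence)   # keep only the alphabetic word tokens
types = set(tokens)                           # distinct word types
print(len(tokens), len(types))                # 16 14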



Continued...
• The larger the corpora we look at, the more word types we find.

Herdan’s Law or Heaps’ Law :

 It is the relationship between the number of types |V| and the number of tokens N.
It is shown in Eq. 2.1, where k and β are positive constants, and 0 < β < 1:

|V| = k N^β          (2.1)

• The value of β depends on the corpus size and the genre
(but at least for the large corpora in Fig. 2.11, β ranges from 0.67 to 0.75).

• Another measure of the number of words in the language:


the number of lemmas instead of wordform types (Dictionaries can help in giving lemma counts)

 The 1989 edition of the Oxford English Dictionary had 615,000 entries.



2.3 Corpora
• Any particular piece of text that we study is produced by:
one or more specific speakers or writers,
in a specific dialect of a specific language,
at a specific time, in a specific place, for a specific function.

• NLP algorithms are most useful when they apply across many languages.
(The world had 7,097 languages at the time the book was written)

• It is important to test algorithms on more than one language, and particularly on languages with different
properties.
(current tendency for NLP algorithms to be developed or tested just on English)

• Even when algorithms are developed beyond English, they tend to be developed for
the official languages of large industrialized nations (Chinese, Spanish, Japanese, German etc.).



Continued...
• Most languages also have multiple varieties, often spoken in different regions or by different social groups.
• Ex: Processing text that uses features of African American Language (AAL)
— we must use NLP tools that function with features of those varieties.
(AAL -- the name for the many variations of language used by millions of people in African American
communities)

• Twitter posts might use features often used by speakers of African American Language.
Exs: iont (I don’t in Mainstream American English (MAE))
talmbout (talking about in MAE)

Code Switching: Usage of multiple languages in a single communicative act by Speakers or writers
Exs: Spanish and (transliterated) Hindi code switching with English
• (2.2) Por primera vez veo a @username actually being hateful! it was beautiful
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
• (2.3) dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Continued...
Genre (another dimension of variation) :

The text that our algorithms must process might come from:
 newswire, fiction or non-fiction books, scientific articles, Wikipedia, or religious texts.
 spoken genres like telephone conversations, business meetings, police body-worn cameras, medical interviews, or
transcripts of television shows or movies.
 work situations like doctors’ notes, legal text, or parliamentary or congressional proceedings.

Text also reflects the demographic characteristics of the writer (or speaker):
 their age, gender, race, socioeconomic class can all influence the linguistic properties of the text we are
processing.

Time matters too:


 Language changes over time, and for some languages we have good corpora of texts from different historical
periods.



Continued...
• When developing computational models for language processing from a corpus,
-- it’s important to consider who produced the language, in what context, for what purpose.
• How can a user of a dataset know all these details?
• The best way is for the corpus creator: to build a datasheet or data statement for each corpus.

A datasheet specifies properties of a dataset like:

• Motivation: Why was the corpus collected, by whom, and who funded it?

• Situation: When and in what situation was the text written/spoken?


(Ex: Was there a task? Was the language originally spoken conversation, edited text,
social media communication, monologue vs. dialogue?)

• Language variety: What language (including dialect/region) was the corpus in?



Continued...
• Speaker demographics: What was, e.g., age or gender of the authors of the text?

• Collection process: How big is the data? If it is a subsample how was it sampled? Was the data collected with
consent? How was the data pre-processed, and what metadata is available?

• Annotation process: What are the annotations, what are the demographics of the annotators, how were they
trained, how was the data annotated?

• Distribution: Are there copyright or other intellectual property restrictions?



2.4 Text Normalization
Before almost any natural language processing of a text, the text has to be normalized.

• At least three tasks are commonly applied as part of any normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

2.4.1 Unix Tools for Crude Tokenization and Normalization:

Tokenization -- splitting the raw text into words

Ex: sh.txt is a text file containing the ‘complete words’ of Shakespeare

Unix tr command -- stands for translate



Continued...
The command: tr -sc 'A-Za-z' '\n' < sh.txt
tokenizes the words by changing every sequence of non-alphabetic characters to a newline
('A-Za-z' means alphabetic, the -c option complements the set to non-alphabetic characters, and
the -s option squeezes each run of repeated characters down to a single character)

Output:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
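
A rough Python equivalent of this crude tr-based tokenization (a sketch; it assumes sh.txt from the slide is
available locally):

import re

with open("sh.txt", encoding="utf-8") as f:
    text = f.read()

# Like tr -sc 'A-Za-z' '\n': replace every run of non-alphabetic characters with a single newline.
tokens = re.sub(r"[^A-Za-z]+", "\n", text).strip().split("\n")
print(tokens[:9])   # first few tokens, matching the tr output above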
Continued...
-c option : Searching for a complement means searching for the inverse of the set specified.
In the following example, any character that is not ‘a’ is matched and translated to ‘z’.

Ex: echo abcdefghijklmnop | tr -c 'a' 'z'


azzzzzzzzzzzzzzzz

-s option : This removes repeated instances of a character.


In the following example, a string with too many spaces is squeezed to remove them.

Ex: echo 'too    many    spaces    here' | tr -s '[:space:]'

too many spaces here



Continued...
2.4.2 Word Tokenization:

 Tokenization -- the task of segmenting running text into words

• Numbers and Punctuations are required for most NLP applications.

• Punctuations: are used as separate tokens


-- commas are a useful piece of information for parsers
-- periods help indicate sentence boundaries

• We keep punctuation that occurs word internally


-- m.p.h., Ph.D., AT&T, etc.
-- Special characters and numbers in prices ($45.55) and dates (01/02/06)
(we don’t want to segment that price into separate tokens of “45” and “55”)
-- URLs (http://www.stanford.edu), Twitter hashtags (#nlproc), or
email addresses ([email protected]).
Continued...
 Number expressions introduce other complications:
-- commas normally appear at word boundaries,
-- they are also used inside numbers in English, every three digits: 555,500.50.

• A tokenizer can also be used to expand clitic contractions that are marked by apostrophes.
Ex: Converting what're to the two tokens what are, and we're to we are.

• Clitic: a part of a word that can’t stand on its own, and can only occur when it is attached to another word.

• Some such contractions occur in other alphabetic languages, including articles and pronouns in French (j'ai,
l'homme).

• Depending on the application, tokenization algorithms may also tokenize multiword expressions
like New York or rock 'n' roll as a single token

 Tokenization is tied up with named entity recognition -- the task of detecting names, dates, and organizations
Continued...
Penn Treebank tokenization standard:
• It separates out clitics (doesn’t becomes does plus n’t),
keeps hyphenated words together, and
separates out all punctuation.

 Tokenization needs to be run before any other language processing.

• The standard method for tokenization:


-- to use deterministic algorithms based on regular expressions compiled into
very efficient finite state automata.



Continued...
Ex: A basic regular expression that can be used to tokenize with the nltk.regexp_tokenize function of the
Python-based Natural Language Toolkit (NLTK) (http://www.nltk.org).
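
The regular expression itself is not reproduced on the slide; the sketch below is in the spirit of the NLTK
example (the verbose pattern and the sample sentence are illustrative, not necessarily the exact ones
intended):

import nltk

text = "That U.S.A. poster-print costs $12.40..."
pattern = r'''(?x)              # verbose regexp: whitespace and comments are ignored
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*              # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                    # ellipsis
    | [][.,;"'?():_`-]          # punctuation kept as separate tokens
'''
print(nltk.regexp_tokenize(text, pattern))
# expected: ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']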



Continued...
2.4.3 Byte-Pair Encoding for Tokenization:
Another option to tokenizing text:
 Instead of defining tokens as words (whether delimited by spaces), or as characters (as in Chinese),
we can use our data to automatically tell us what the tokens should be.

 This is especially useful in dealing with unknown words (an important problem in language processing).

• NLP algorithms often learn some facts about language from one corpus (a training corpus) and then
use these facts to make decisions about a separate test corpus and its language.

• Thus if our training corpus contains, say the words low, new, newer, but not lower,
then if the word lower appears in our test corpus, our system will not know what to do with it.

• To deal with this unknown word problem, modern tokenizers often automatically induce
sets of tokens that include tokens smaller than words, called subwords.



Continued...
• Subwords can be arbitrary substrings, or
they can be meaning-bearing units like the morphemes -est or -er.

(A morpheme is the smallest meaning-bearing unit of a language.


the word unlikeliest has the morphemes un-, likely, and -est.)

• In modern tokenization schemes, most tokens are words, but


some tokens are frequently occurring morphemes or other subwords like -er.

• Unseen words like lower can thus be represented by some sequence of known subword units,
such as low and er, or even as a sequence of individual letters if necessary.

 Most tokenization schemes have two parts:


a token learner, and
a token segmenter.



Continued...
Token learner:
takes a raw training corpus (sometimes roughly pre-separated into words, for example by whitespace) and
induces a vocabulary, a set of tokens.

Token segmenter:
takes a raw test sentence and segments it into the tokens in the vocabulary.

Three algorithms are widely used:

1. byte-pair encoding
2. unigram language modeling
3. WordPiece

There is also a SentencePiece library that includes implementations of the first two of the three.



Continued...
Byte-Pair Encoding (BPE) algorithm:
1. Begins with a vocabulary that is just the set of all individual characters.
2. It then examines the training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’, ‘B’),
adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent ’A’ ’B’ in the corpus with the new
‘AB’.
3. It continues to count and merge, creating new longer and longer character strings, until k merges have been done,
creating k novel tokens; k is thus a parameter of the algorithm.
4. The resulting vocabulary consists of the original set of characters plus k new symbols.

Running the Algorithm:

• It usually runs inside words (not merging across word boundaries),


so the input corpus is first white-space-separated to give a set of strings,
each corresponding to the characters of a word, plus a special end-of-word symbol _ , and its counts.



Continued...
Ex: Input corpus of 18 word tokens with counts for each word (the word low appears 5 times, the word newer 6
times, and so on), which would have a starting vocabulary of 11 letters:

corpus                          vocabulary
5   l o w _                     _, d, e, i, l, n, o, r, s, t, w
2   l o w e s t _
6   n e w e r _
3   w i d e r _
2   n e w _

• The BPE algorithm first counts all pairs of adjacent symbols:


the most frequent is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of 3)
for a total of 9 occurrences.

• We then merge these symbols, treating er as one symbol, and count again:



Continued...

• Now the most frequent pair is er _ (er followed by the end-of-word symbol), which we merge; our system has
learned that there should be a token for word-final er, represented as er_ :



Continued...
Next, n e (total count of 8) gets merged to ne:

If we continue, the next merges are: (ne, w), (l, o), (lo, w), (new, er_), (low, _)



Continued...
• Once we’ve learned our vocabulary, the token parser is used to tokenize a test sentence.

• The token parser just runs on the test data the merges we have learned from the training data, greedily, in the
order we learned them.
(Thus the frequencies in the test data don’t play a role, just the frequencies in the training data).

• So first we segment each test sentence word into characters.

• Then we apply the first rule:
replace every instance of e r in the test corpus with er.
• Then the second rule:
replace every instance of er _ in the test corpus with er_, and so on.

• By the end, if the test corpus contained the word n e w e r , it would be tokenized as a full word.
But a new (unknown) word like l o w e r would be merged into the two tokens low er .



Continued...
• In real algorithms, BPE is run with many thousands of merges on a very large input corpus.
• The result is that most words will be represented as full symbols, and
only the very rare words (and unknown words) will have to be represented by their parts.

Algorithm:
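
The pseudocode figure is not reproduced on the slide; below is a minimal Python sketch of the BPE token
learner described above. The toy corpus follows the walk-through (low 5, newer 6, wider 3 as stated; the
lowest/new counts are assumed so that the totals come to 18 tokens and 11 letters):

from collections import Counter

def count_pairs(vocab):
    # Count adjacent symbol pairs over all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    a, b = pair
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_counts, k):
    # Start from single characters plus the end-of-word marker "_", then perform k merges.
    vocab = {tuple(word) + ("_",): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(k):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
print(learn_bpe(corpus, 4))
# With this toy corpus the first merges come out as ('e','r'), ('er','_'), ('n','e'), ('ne','w'),
# matching the walk-through above.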



Continued...
2.4.4 Word Normalization, Lemmatization and Stemming
Word normalization:

 The task of putting words/tokens in a standard format, choosing a single normal form
for words with multiple forms like USA and US or uh-huh and uhhuh.

• This standardization may be valuable, despite the spelling information that is lost in the normalization process.

• For information retrieval or information extraction about the US,


we might want to see information from documents whether they mention the US or the USA.

Case folding:
This is another kind of normalization.

Mapping everything to lower case means that Woodchuck and woodchuck are represented identically.
(very helpful for generalization in many tasks, such as information retrieval or speech recognition)
Continued...
 For sentiment analysis and other text classification tasks, information extraction, and machine translation --
by contrast, case can be quite helpful and case folding is generally not done.

 This is because maintaining the difference between, for example, US the country and us the pronoun
can outweigh the advantage in generalization that case folding would have provided for other words.

• For many natural language processing situations we also want two morphologically different forms of a word to
behave similarly.

Ex: web search string: woodchucks

o A useful system may also return pages that mention woodchuck with no s.
o This is especially common in morphologically complex languages like Russian,
where for example the word Moscow has different endings in the phrases Moscow, of Moscow, to Moscow,
and so on.



Continued...
Lemmatization:
The task of determining that two words have the same root, despite their surface differences.
      Words                 Lemma
      am, are, and is       be
      dinner and dinners    dinner

 Lemmatizing each of these forms to the same lemma will let us find all mentions of words in Russian like Moscow.
 The lemmatized form of a sentence like He is reading detective stories would thus be He be read detective story.
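
A quick illustration (an assumed sketch, not from the slides) using the WordNet lemmatizer shipped with NLTK;
it needs the WordNet data, installed once with nltk.download('wordnet'):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("is", pos="v"))        # be
print(lemmatizer.lemmatize("are", pos="v"))       # be
print(lemmatizer.lemmatize("dinners", pos="n"))   # dinner
print(lemmatizer.lemmatize("stories", pos="n"))   # story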

How is lemmatization done?

• The most sophisticated methods for lemmatization involve complete morphological parsing of the word.

Morphology:
the study of the way words are built up from smaller meaning-bearing units called morphemes.



Continued...
Two broad classes of morphemes:
stems—the central morpheme of the word, supplying the main meaning
affixes—adding “additional” meanings of various kinds

Ex:   Word     Morpheme(s)
      fox      fox
      cats     cat, -s

 A morphological parser takes a word like cats and parses it into the two morphemes cat and s.

The Porter Stemmer:


• Lemmatization algorithms can be complex.
• We sometimes make use of a simpler but cruder method, which mainly
consists of chopping off word-final affixes.

• This naive version of morphological analysis is called stemming.


Continued...
• One of the most widely used stemming algorithms is the Porter stemmer.
• The Porter stemmer applied to the following paragraph:

This was not the map we found in Billy Bones's chest, but
an accurate copy, complete in all things-names and heights
and soundings-with the single exception of the red crosses
and the written notes.

produces the following stemmed output:

Thi wa not the map we found in Billi Bone s chest but an


accur copi complet in all thing name and height and sound
with the singl except of the red cross and the written note



Continued...
• The algorithm is based on a series of rewrite rules run in series, as a cascade, in which
the output of each pass is fed as input to the next pass; here is a sampling of the rules:

ATIONAL -> ATE    (e.g., relational -> relate)
ING -> ε          if the stem contains a vowel (e.g., motoring -> motor)
SSES -> SS        (e.g., grasses -> grass)

• Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.) can be found on Martin Porter’s
homepage.

• Simple stemmers can be useful in cases where we need to collapse across different variants of the same lemma.
• Nonetheless, they tend to commit errors of both over-generalizing (collapsing words that should stay distinct,
e.g., stemming organization to organ) and under-generalizing (failing to collapse related words,
e.g., European and Europe).
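
A brief sketch (assumed, not from the slides) using the Porter stemmer implementation that ships with NLTK;
the sample words are taken from the paragraph above:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for w in ["accurate", "copy", "things", "crosses", "written"]:
    print(w, "->", stemmer.stem(w))
# accurate -> accur, copy -> copi, things -> thing, crosses -> cross, written -> written
# (compare the stemmed paragraph above)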



Continued...
2.4.5 Sentence Segmentation:
 Another important step in text processing.

• The most useful components for segmenting a text into sentences:


punctuation, like periods, question marks, and exclamation points.

• Question marks and exclamation points: relatively unambiguous markers of sentence boundaries
• Periods: more ambiguous

• The period character “.” is ambiguous between a sentence boundary marker and a marker of abbreviations like Mr.
or Inc.
• The previous sentence that you just read showed an even more complex case of this ambiguity,
in which the final period of Inc. marked both an abbreviation and the sentence boundary marker.

• For this reason, sentence tokenization and word tokenization may be addressed jointly.



Continued...
• In general, sentence tokenization methods work by first deciding
whether a period is part of the word or is a sentence-boundary marker.

• An abbreviation dictionary can help determine whether the period is part of a commonly used abbreviation.

• The dictionaries can be hand-built or machine learned.

• In the Stanford CoreNLP toolkit (Manning et al., 2014), for example sentence splitting is rule-based, a deterministic
consequence of tokenization.

• That is, a sentence ends when a sentence-ending punctuation mark (., !, or ?) is not already grouped with other
characters into a token (such as an abbreviation or number),
optionally followed by additional final quotes or brackets.

Ex: The cost is Rs.30.50.



Continued...
2.5 Minimum Edit Distance:
• Much of natural language processing is concerned with measuring how similar two strings are.

Ex:1: The user typed some erroneous string—let’s say graffe–and we want to know what the user meant.
The user probably intended a word that is similar to graffe.

• Among candidate similar words, the word giraffe, which differs by only one letter from graffe, seems intuitively
to be more similar than, say, grail or graf, which differ in more letters.

Ex:2: from coreference, the task of deciding whether two strings such as the following refer to the same entity:
Stanford President Marc Tessier-Lavigne
Stanford University President Marc Tessier-Lavigne

 The fact that these two strings are very similar (differing by only one word) seems like useful evidence for deciding
that they might be coreferent.



Continued...
• Edit distance gives us a way to quantify both of these intuitions about string similarity.
The minimum edit distance between two strings:
It is the minimum number of editing operations (operations like insertion, deletion, substitution) needed to
transform one string into another.

Ex: The gap between intention and execution is 5.


-- delete an i, substitute e for n, substitute x for t, insert c, substitute u for n.
It’s much easier to see this by looking at the most important visualization for string distances, an alignment
between the two strings, shown in Fig. 2.14.



Continued...
• Given two sequences, an alignment is a correspondence between substrings of the two sequences.
• Thus, we say I aligns with the empty string, N with E, and so on.

• Beneath the aligned strings: d for deletion, s for substitution, i for insertion.

Levenshtein distance between two sequences:


 It assigns a particular cost or weight to each of these operations.
 Each of the three operations has a cost of 1.
(we assume that the substitution of a letter for itself, for example, t for t, has zero cost. )
 The Levenshtein distance between intention and execution is 5.

 Levenshtein also proposed an alternative version of his metric, in which each insertion or deletion has a cost of 1 and substitutions are not allowed.
(This is equivalent to allowing substitution, but giving each substitution a cost of 2 since any substitution can
be represented by one insertion and one deletion).

 Using this version, the Levenshtein distance between intention and execution is 8.
Continued...
2.5.1 The Minimum Edit Distance Algorithm:
• Finding the minimum edit distance can be viewed as a dynamic programming problem, i.e.,
large problem can be solved by properly combining the solutions to various sub-problems.
• The minimum edit distance algorithm was named by Wagner and Fischer.

Defining the minimum edit distance between two strings:


Given two strings: source string X of length n and target string Y of length m

Define D[i, j] as the edit distance between X[1…i] and Y[1…j]


(i.e., the first i characters of X and the first j characters of Y)

The edit distance between X and Y is thus D[n, m].

• We’ll use dynamic programming to compute D[n, m] bottom up, combining solutions to subproblems.



Continued...
• In the base case (transforming source into target):

  source substring    target substring    requires
  length i            empty               i deletes
  empty               length j            j inserts

• Having computed D[i, j] for small i, j we then compute larger D[i, j] based on previously computed smaller values.

• The value of D[i, j] is computed by taking the minimum of the three possible paths through the matrix which arrive
there:

D[i, j] = min( D[i-1, j] + del-cost(source[i]),
               D[i, j-1] + ins-cost(target[j]),
               D[i-1, j-1] + sub-cost(source[i], target[j]) )



Continued...
• Considering Levenshtein distance:

-- insertions and deletions each have a cost of 1


-- substitutions have a cost of 2 (except substitution of identical letters have zero cost)

the computation for D[i, j] becomes:

D[i, j] = min( D[i-1, j] + 1,
               D[i, j-1] + 1,
               D[i-1, j-1] + (2 if source[i] ≠ target[j], else 0) )



Continued...
Algorithm:
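
The algorithm figure is not reproduced on the slide; here is a minimal Python implementation of the
recurrence above, using the Levenshtein costs (insertion/deletion 1, substitution 2):

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):      # base case: i deletes to turn source[1..i] into the empty string
        D[i][0] = D[i - 1][0] + 1
    for j in range(1, m + 1):      # base case: j inserts to turn the empty string into target[1..j]
        D[0][j] = D[0][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or zero-cost copy)
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8 with these costs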



Continued...
Execution:



3. N-gram Language Models
• We require models that assign a probability to each possible next word.
• The same models will also serve to assign a probability to an entire sentence.

• Such a model, for example, could predict that the following sequence has a much higher probability of appearing
in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:
on guys all I of notice sidewalk three a sudden standing the

Why would you want to predict upcoming words, or assign probabilities to sentences?
(1) Speech Recognition: Probabilities are essential to identify words in noisy, ambiguous input
• For a speech recognizer, consider:
I will be back soonish.
I will be bassoon dish.
back soonish is a much more probable sequence than bassoon dish.



Continued...
(2) For writing tools like spelling correction or grammatical error correction:

Their are two midterms. (There was mistyped as Their), or


Everything has improve. (improve should have been improved).

• There are (is more probable than Their are), and


• has improved (is more probable than has improve), allowing us to help users by detecting and correcting these
errors.

 Language models (LMs): Models that assign probabilities to sequences of words.

 N-gram model: The simplest language model. An n-gram is a sequence of n words:

a 2-gram (called a bigram) is a two-word sequence of words like
“please turn”, “turn your”, or ”your homework”, and
a 3-gram (a trigram) is a three-word sequence of words like
“please turn your”, or “turn your homework”.
Continued...
3.1 N-grams:
• P(w|h) -- the probability of a word w given some history h
• Suppose h is -- “its water is so transparent that” and
We want to know the probability that the next word is the:
P(the | its water is so transparent that). (3.1)

• One way to estimate this probability is from relative frequency counts:


take a very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the.

 This would be answering the question “Out of the times we saw the history h, how many times was it followed by
the word w?”, as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)          (3.2)



Continued...
• With a large enough corpus, such as the web, we can compute these counts and estimate the probability from Eq.
3.2.
• This method works fine in many cases, but in most cases, we may not get good estimates.
• This is because language is creative; new sentences are created all the time, and we won’t always be able to count
entire sentences.

• “Walden Pond’s water is so transparent that the”.


(a simple extension of the above sentence may have counts of zero).

• To know the joint probability of “its water is so transparent”,


we can ask “out of all possible sequences of five words, how many of them are its water is so transparent?”

• It is the count of its water is so transparent and divide by


the sum of the counts of all possible five word sequences



Continued...
Represent:
  the probability P(Xi = “the”) as -- P(the)
  a sequence of N words as -- either w1 ... wn or w1:n

• For the joint probability of each word in a sequence having a particular value
P(X=w1, Y=w2, Z=w3,….., W=wn) we will use P(w1, w2,…., wn).
To compute P(w1, w2, ..., wn):
We decompose this probability using the chain rule of probability:

P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n-1) = ∏k=1..n P(wk | w1:k-1)          (3.4)



Continued...
 The chain rule doesn’t really seem to help us!

 We don’t know any way to compute the exact probability of a word given a long sequence of preceding words,
P(wn | w1:n-1).

 We can’t just estimate by counting the number of times every word occurs following every long string,
because language is creative and any particular context might have never occurred before!

n-gram model:

The idea is that instead of computing the probability of a word given its entire history,
we can approximate the history by just the last few words.

• The bigram model, for example, approximates the probability of a word given all the previous words P( wn | w1:n-1)
by using only the conditional probability of the preceding word P( wn | wn-1).



Continued...
• In other words, instead of computing the probability
P(the | Walden Pond’s water is so transparent that) (3.5)
we approximate it with the probability
P(the | that) (3.6)

• When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:

P(wn | w1:n-1) ≈ P(wn | wn-1)          (3.7)

Markov assumption:
 The assumption that the probability of a word depends only on the previous word is called a Markov assumption.

 Markov models are the class of probabilistic models that assume :


“we can predict the probability of some future unit without looking too far into the past”.



Continued...
• We can generalize the bigram (which looks one word into the past)
to the trigram (which looks two words into the past) and thus
to the n-gram (which looks n-1 words into the past).

• Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a
sequence is:

P(wn | w1:n-1) ≈ P(wn | wn-N+1:n-1)          (3.8)

• Given the bigram assumption for the probability of an individual word,
we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:

P(w1:n) ≈ ∏k=1..n P(wk | wk-1)          (3.9)



Continued...
Maximum Likelihood Estimation (MLE):
 It is a method to estimate bigram or n-gram probabilities.
 This is done by getting counts from a corpus and normalizing the counts
so that they lie between 0 and 1.

Ex: To compute a particular bigram probability of a word y given a previous word x,
we compute the count of the bigram C(xy) and normalize by the sum of all the bigrams
that share the same first word x:

P(y | x) = C(xy) / Σw C(xw)          (3.10)

 Since the sum of all bigram counts that start with a given word wn-1 must be equal to
the unigram count for that word wn-1, Eq. 3.10 becomes:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)          (3.11)



Continued...
Ex: A mini-corpus of three sentences
• Augment each sentence with a special symbol <s> at the beginning of the sentence
(to give us the bigram context of the first word).

• Use a special end-symbol </s> at the end of each sentence.

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

 Calculations for some of the bigram probabilities from this corpus:

P(I|<s>) = 2/3 = 0.67     P(Sam|<s>) = 1/3 = 0.33    P(am|I) = 2/3 = 0.67
P(</s>|Sam) = 1/2 = 0.50  P(Sam|am) = 1/2 = 0.50     P(do|I) = 1/3 = 0.33
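
A small sketch (assumed, not from the slides) that computes these MLE bigram probabilities directly from the
three-sentence mini-corpus:

from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(w, prev):
    # MLE estimate: C(prev w) / C(prev), as in Eq. 3.11
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"), p("am", "I"), p("</s>", "Sam"))   # 0.666..., 0.666..., 0.5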



Continued...
For the general case of MLE n-gram parameter estimation:

P(wn | wn-N+1:n-1) = C(wn-N+1:n-1 wn) / C(wn-N+1:n-1)          (3.12)

• Equation 3.12 (like Eq. 3.11) estimates the n-gram probability
by dividing the observed frequency of a particular sequence by the observed frequency of a prefix.
• This ratio is called a relative frequency.
An Example Corpus: Berkeley Restaurant Project
-- a dialogue system from the last century that answered questions about a database of restaurants in Berkeley,
California.
Some text-normalized sample user queries:
can you tell me about any good cantonese restaurants close by
i’m looking for a good place to eat breakfast
mid priced thai food is what i’m looking for
when is caffe venezia open during the day
tell me about chez panisse
can you give me a listing of the kinds of food that are available



Continued...
• Figure 3.1 shows the bigram counts from a piece of a bigram grammar from the Berkeley Restaurant Project.
• Note that the majority of the values are zero.
• In fact, we have chosen the sample words to cohere with each other;
a matrix selected from a random set of seven words would be even more sparse.



Continued...
• Figure 3.2 shows the bigram probabilities after normalization
(dividing each cell in Fig. 3.1 by the appropriate unigram for its row,
taken from the following set of unigram probabilities):



Continued...
Here are a few other useful probabilities:

• P(I|<s>) = 0.25 P(English|want) = 0.0011


• P(food|English) = 0.5 P(</s>|food) = 0.68

Ex: Computing the probability of “I want English food”


Simply multiply the appropriate bigram probabilities together as follows:

P(<s> I want English food </s>)


= P(I|<s>) P(want|I) P(English|want) P(food|English) P(</s>|food)
= 0.25 x 0.33 x 0.0011 x 0.5 x 0.68
= 0.000031

Exercise: Compute the probability of “I want chinese food”.



Continued...
3.2 Evaluating Language Models:
• Extrinsic Evaluation:
It is a way to evaluate the performance of a language model by embedding it in an application and measure
how much the application improves.

For speech recognition:


Compare the performance of two language models by running the speech recognizer twice,
once with each language model, and seeing which gives the more accurate transcription.

• Since running big NLP systems end-to-end is often very expensive, we have the following method:

• Intrinsic Evaluation:
It measures the quality of a model independent of any application.
We need a test set here.



Continued...
• The probabilities of an n-gram model come from the corpus it is trained on, the training set or training corpus.

• We can then measure the quality of an n-gram model by its performance on some unseen data called the test set
or test corpus.

• Suppose given a corpus of text and we want to compare two different n-gram models.
We divide the data into training and test sets, train the parameters of both models on the training set, and
then compare how well the two trained models fit the test set.

• Suppose we are trying to compute the probability of a particular “test” sentence.


• If our test sentence is part of the training corpus, we will mistakenly assign it an artificially high probability when it
occurs in the test set.

• We call this situation training on the test set.


• Training on the test set introduces a bias that makes the probabilities all look too high, and causes huge
inaccuracies in perplexity, the probability-based metric.



Continued...
3.2.1 Perplexity:
• In practice we don’t use raw probability as our metric for evaluating language models but a variant called
perplexity.
• The perplexity (sometimes called PP for short) of a language model on a test set is
the inverse probability of the test set, normalized by the number of words.
• For a test set W = w1 w2 … wN :

  PP(W) = P(w1 w2 … wN)^(-1/N)                                      (Eq. 3.15)

  We can use the chain rule to expand the probability of W:

  PP(W) = [ ∏ i=1..N  1 / P(wi | w1 … wi-1) ]^(1/N)



Continued...
• Thus, if we are computing the perplexity of W with a bigram language model, we get:

  PP(W) = [ ∏ i=1..N  1 / P(wi | wi-1) ]^(1/N)                      (Eq. 3.16)
 Because of the inverse in Eq. 3.15,


the higher the conditional probability of the word sequence, the lower the perplexity.
 Thus, minimizing perplexity is equivalent to maximizing the test set probability
according to the language model.

• For the word sequence in Eq. 3.15 or Eq. 3.16, we use the entire sequence of words in some test set.
• Since this sequence will cross many sentence boundaries,
we need to include the begin- and end-sentence markers <s> and </s> in the probability computation.
• We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>)
in the total count of word tokens N.
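A minimal sketch of this computation for a bigram model, assuming `bigram_prob(w, w_prev)` is some function returning P(w | w_prev) (a hypothetical helper, e.g. one built from counts as in the earlier sketch). It works in log space to avoid underflow, counts </s> in N but not <s>, and returns infinite perplexity if any bigram has zero probability.

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP(W) = exp( -(1/N) * sum_i log P(w_i | w_{i-1}) ), per Eq. 3.16."""
    log_prob_sum, N = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        for w_prev, w in zip(tokens, tokens[1:]):
            p = bigram_prob(w, w_prev)
            if p == 0.0:
                return float("inf")   # zero-probability n-grams make perplexity undefined
            log_prob_sum += math.log(p)
        N += len(tokens) - 1          # include </s> in N, exclude <s>
    return math.exp(-log_prob_sum / N)
```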
Continued...
Another way to think about perplexity: Weighted average branching factor of a language.
• The branching factor of a language is the number of possible next words that can follow any word.
• Consider the task: recognizing the digits in English (zero, one, two,..., nine), given that (both in some training set
and in some test set) each of the 10 digits occurs with equal probability P= 1/10 .

• The perplexity of this mini-language is in fact 10.


• Ex: a test string of digits of length N, and in the training set all the digits occurred with equal probability.

• By Eq. 3.15, the perplexity will be

  PP(W) = P(w1 w2 … wN)^(-1/N) = ( (1/10)^N )^(-1/N) = (1/10)^(-1) = 10


Continued...
Zero probability n-grams:
• For any n-gram that occurred a sufficient number of times, we might have a good estimate of its probability.
• But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing
from it.
• These “zero probability n-grams” should really have some non-zero probability.
• From the WSJ Treebank3 corpus:
Consider the words that follow the bigram denied the, together with their counts:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1
 Suppose our test set has phrases like:
denied the offer
denied the loan

 Our model will incorrectly estimate that the P(offer | denied the) is 0.
Continued...
• These zeros— things that don’t ever occur in the training set but do occur in the test set—
are a problem for two reasons.

• First,

their presence means we are underestimating the probability of all sorts of words that might occur,
which will hurt the performance of any application we want to run on this data.

• Second,

if the probability of any word in the test set is 0, the entire probability of the test set is 0.

 By definition, perplexity is based on the inverse probability of the test set.


 Thus if some words have zero probability, we can’t compute perplexity at all, since we can’t divide by 0!



Continued...
3.3.1 Unknown Words:
Closed Vocabulary System:
 Test set can only contain words from the lexicon, and there will be no unknown words.
(we know all the words that can occur)

 This is a reasonable assumption in some domains, such as speech recognition or machine translation.
(we have a pronunciation dictionary or a phrase table that are fixed in advance, and so
the language model can only use the words in that dictionary or phrase table.)

Open Vocabulary System:


 Here, we have to deal with words we haven’t seen before called unknown words, or out of vocabulary (OOV)
words.
• The percentage of OOV words that appear in the test set is called the OOV rate.
• An open vocabulary system is one in which we model these potential unknown words in the test set
by adding a pseudo-word called <UNK>.



Continued...
Two ways to train the probabilities of the unknown word model <UNK>:
First approach:
Turn the problem back into a closed vocabulary one by choosing a fixed vocabulary in advance:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown word token <UNK>
in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.

Second approach:
• When we don’t have a vocabulary in advance, we create such a vocabulary implicitly,
replacing words in the training data by <UNK> based on their frequency.
Ex: We can replace by <UNK> all words that occur fewer than n times in the training set,
where n is some small number, or equivalently select a vocabulary size V in advance (say 50,000),
choose the top V words by frequency, and replace the rest by <UNK>.

• In either case we then proceed to train the language model as before, treating <UNK> like a regular word.
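A minimal sketch of the second approach, using a frequency threshold (the helper names `build_vocab` and `apply_unk` are made up for illustration):

```python
from collections import Counter

def build_vocab(training_tokens, min_count=2):
    """Keep words occurring at least min_count times; everything else will become <UNK>."""
    counts = Counter(training_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(tokens, vocab):
    """Replace out-of-vocabulary tokens by the pseudo-word <UNK>."""
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "i want english food i want chinese food".split()
vocab = build_vocab(train, min_count=2)              # {'i', 'want', 'food'}
print(apply_unk("i want thai food".split(), vocab))  # ['i', 'want', '<UNK>', 'food']
```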
Continued...
3.4 Smoothing:
 Smoothing algorithms keep a language model from assigning zero probability to unseen events:
they shave off a bit of probability mass from more frequent events and give it to events that never occur
in the training data.

3.4.1 Laplace Smoothing (or Add-one Smoothing):

• The simplest way to do smoothing is to add one to all the bigram counts, before we normalize them into
probabilities.
• All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on.
• This algorithm is called Laplace Smoothing.

Application to unigram probabilities:


• The unsmoothed maximum likelihood estimate of the unigram probability of the word wi
is its count ci normalized by the total number of word tokens N:

  P(wi) = ci / N


Continued...
• Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing).

• Since there are V words in the vocabulary and each one was incremented,
we also need to adjust the denominator to take into account the extra V observations:

  P_Laplace(wi) = (ci + 1) / (N + V)

• Instead of changing both the numerator and denominator, it is convenient to describe


how a smoothing algorithm affects the numerator, by defining an adjusted count c*.

• This adjusted count is easier to compare directly with the MLE counts and can be turned into a probability
like an MLE count by normalizing by N.



Continued...
• To define this count, since we are only changing the numerator, in addition to adding 1
we also need to multiply by a normalization factor N / (N + V):

  c*i = (ci + 1) × N / (N + V)                                      (Eq. 3.24)

Discounting (lowering):

 A related way to view smoothing is as


discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the
zero counts.
 Thus, instead of referring to the discounted counts c*,
we might describe a smoothing algorithm in terms of a relative discount dc,
the ratio of the discounted counts to the original counts:

  dc = c* / c



Continued...
Smoothing Berkeley Restaurant Project bigrams:
Figure 3.5 shows the add-one smoothed counts for the bigrams in Fig. 3.1.



Continued...
• Figure 3.6 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2.
• The normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

  P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

• For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types
in the vocabulary V:

  P*_Laplace(wn | wn-1) = ( C(wn-1 wn) + 1 ) / ( C(wn-1) + V )
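A short sketch contrasting the two estimates above. It assumes `bigram_counts` and `unigram_counts` are Counter objects built as in the earlier sketch, V is the vocabulary size, and the history word w_prev appears in the training data.

```python
def mle_bigram(w, w_prev, bigram_counts, unigram_counts):
    """Unsmoothed: P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

def laplace_bigram(w, w_prev, bigram_counts, unigram_counts, V):
    """Add-one: P*(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)
```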



Continued...
• Thus, each of the unigram counts given in the previous section will need to be augmented by V =1446.

• The result is the smoothed bigram probabilities in Fig. 3.6.



Continued...
• It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has
changed the original counts.
• These adjusted counts can be computed by Eq. 3.24. Figure 3.7 shows the reconstructed counts.



Continued...
Add-one smoothing has made a very big change to the counts.
C(want to) changed from 609 to 238.
P(to | want) decreases from 0.66 in the unsmoothed case to 0.26 in the smoothed case.

 The discount d (the ratio between new and old counts) shows
how strikingly the counts for each prefix word have been reduced;

-- the discount for the bigram want to is 0.39, while


-- the discount for Chinese food is 0.10, a factor of 10!

 The sharp change in counts and probabilities occurs because


too much probability mass is moved to all the zeros.



Continued...
3.4.2 Add-k Smoothing:
• This is an alternative to add-one smoothing.
• It moves a bit less of the probability mass from the seen to the unseen events.

• Instead of adding 1 to each count, we add a fractional count k (0.5? 0.05? 0.01?).
• This algorithm is therefore called add-k smoothing.

• Add-k smoothing requires that we have a method for choosing k.


• Add-k is useful for some tasks (including text classification).
• It still doesn’t work well for language modeling, generating counts with poor variances and
often inappropriate discounts.
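Despite its limitations, add-k is a one-line variation on the add-one estimate above (a sketch, with the same assumed count structures as before; the default k is illustrative):

```python
def add_k_bigram(w, w_prev, bigram_counts, unigram_counts, V, k=0.05):
    """Add-k: P(w | w_prev) = (C(w_prev w) + k) / (C(w_prev) + k*V)."""
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * V)
```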



Continued...
3.4.3 Backoff and Interpolation:

• The discounting discussed earlier can help solve the problem of zero frequency n-grams.

• But there is an additional source of knowledge we can draw on.

• If we are trying to compute P(wn | wn-2 wn-1) but we have no examples of a particular trigram wn-2 wn-1wn ,
we can instead estimate its probability by using the bigram probability P(wn | wn-1).

• Similarly, if we don’t have counts to compute P(wn | wn-1), we can look to the unigram P(wn).

• Sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t
learned much about.



Continued...
• There are two ways to use this n-gram “hierarchy”.
1. Backoff: we use the trigram if the evidence is sufficient, otherwise we use the bigram,
otherwise the unigram.
(we only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram. )
2. Interpolation: we always mix the probability estimates from all the n-gram estimators,
weighing and combining the trigram, bigram, and unigram counts.

• In simple linear interpolation, we combine different order n-grams by linearly interpolating all the models.
• Thus, we estimate the trigram probability P(wn | wn-2 wn-1) by mixing together
the unigram, bigram, and trigram probabilities, each weighted by a λ :

  P_hat(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1),
  where the λs sum to 1: λ1 + λ2 + λ3 = 1
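A sketch of simple linear interpolation, assuming unigram/bigram/trigram probability functions are passed in and the λ weights (which in practice are set on a held-out corpus) sum to 1; the weight values below are illustrative only.

```python
def interpolated_trigram(w, w_prev2, w_prev1, p_uni, p_bi, p_tri,
                         lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | w_prev2 w_prev1) = l1*P(w) + l2*P(w | w_prev1) + l3*P(w | w_prev2 w_prev1)."""
    l1, l2, l3 = lambdas   # must sum to 1 so the result is a valid probability
    return (l1 * p_uni(w)
            + l2 * p_bi(w, w_prev1)
            + l3 * p_tri(w, w_prev2, w_prev1))
```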



Continued...
3. Katz Backoff:
• In order for a backoff model to give a correct probability distribution, we have to discount the higher-order
n-grams to save some probability mass for the lower order n-grams.

• This kind of backoff with discounting is also called Katz backoff.

• In Katz backoff, we rely on a discounted probability P* if we’ve seen this n-gram before
(i.e., if we have non-zero counts).

• Otherwise, we recursively back off to the Katz probability for the shorter-history (N-1)-gram.

• In addition to this explicit discount factor, we’ll need a function α


to distribute this probability mass to the lower order n-grams.



Continued...
The probability for a backoff n-gram PBO is:

  PBO(wn | wn-N+1 … wn-1) =
      P*(wn | wn-N+1 … wn-1)                            if C(wn-N+1 … wn) > 0
      α(wn-N+1 … wn-1) · PBO(wn | wn-N+2 … wn-1)        otherwise
Good-Turing Backoff (Discounting): Katz backoff is often combined with a smoothing method called Good-Turing.
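A simplified recursive sketch of the backoff recursion above. It is not a full Katz implementation: properly computing the discounted probability P* (e.g. via Good-Turing) and the redistribution weights α is omitted here, and both are assumed to be supplied as functions.

```python
def backoff_prob(ngram, counts, discounted_prob, alpha):
    """P_BO(w | history): use the discounted estimate if the n-gram was seen,
    otherwise back off to the shorter-history (n-1)-gram, scaled by alpha(history).
    `ngram` is a tuple of words; `counts` maps n-gram tuples to counts (e.g. a Counter)."""
    if counts[ngram] > 0 or len(ngram) == 1:
        return discounted_prob(ngram)          # base case: seen n-gram, or nothing shorter left
    history = ngram[:-1]                       # the words preceding w
    return alpha(history) * backoff_prob(ngram[1:], counts, discounted_prob, alpha)
```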

