Natural Language Processing
NLP is among the hottest topics in the field of data science.
Companies are putting tons of money into research in this field.
Everyone is trying to understand NLP and its applications to build a career around it.
Every business out there wants to integrate it into their operations somehow.
Are you using NLP these days?
Search Autocorrect and Autocomplete – Language Translator
Social media monitoring
More and more people use social media to post their thoughts about a particular product, policy, or matter.
These posts can contain useful information about an individual's likes and dislikes.
Analyzing this unstructured data can generate valuable insights, and NLP comes to the rescue here too.
Companies use various NLP techniques to analyze social media posts and learn what customers think about their products.
Companies also use social media monitoring to understand the issues and problems that customers face when using their products.
Chatbots
Modern conversational agents can:
• Answer questions
• Book flights
• Find restaurants
For these functions they rely on a much more sophisticated understanding of the user's intent.
Survey Analysis
Surveys are an important way of evaluating a company's performance and of getting customer feedback on various products.
They are useful for understanding flaws and help companies improve their products.
NLP is used to analyze surveys and generate insights from them, such as gauging user sentiment and analyzing product reviews to understand the pros and cons.
Targeted Advertising – Hiring and Recruitment
Targeted advertising is a type of online advertising where ads are shown to the user based on their online activity.
It saves companies a lot of money because relevant ads are shown only to potential customers.
Voice Assistants
Conventional vs. NLP-based search
What is NLP?
Natural language processing is a sub-field of linguistics, computer science, and AI concerned with the interactions between computers and human language.
NLP enables computers to understand complex language structure and retrieve meaningful pieces of information from it.
Modern challenges in NLP involve speech recognition, natural language understanding, and natural language generation.
Why study NLP?
Text is the largest repository of human knowledge –
news articles, web pages, scientific articles, patents, emails, government documents,
tweets, Facebook posts, comments, Quora answers, etc.
What are the top ten languages on the internet in terms of millions of users?
Goals of NLP
Fundamental and Scientific Goal – Deep understanding of broad language.
Engineering Goal – Design, implement, and test systems that process natural languages for practical applications.
Applications of NLP
Text Classification
Language Modelling
Information Extraction
Information Retrieval
Conversational Agents
Text Summarization
Question Answering
Machine Translation
Topic Modelling
Speech Recognition
Origins of NLP
Alan Turing’s Turing Test (1950)
1950s – 1960s : Early Developments
Georgetown – IBM Experiment (1954)
Chomsky’s Transformational Generative Grammar (1957)
1960s – 1970s : Rule-based approaches
1970s – 1980s : Rise of statistical methods
1980s – 1990s : Corpus Linguistics and Machine Learning
2000s – present : Deep Learning and Neural networks.
Challenges of NLP
Why is NLP Hard?
Lexical Ambiguity
Ambiguity is pervasive
Activity
Find at least 5 meanings of this sentence:
I made her duck
Syntactic category: "duck" can be a noun or a verb; "her" can be a possessive or a dative pronoun.
Word meaning: "make" can mean create or cook.
Why is NLP Hard?
Lexical ambiguity is pervasive, and ambiguity is explosive: the number of possible readings multiplies as the ambiguous words and structures in a sentence combine.
Why is language ambiguous?
Natural Language vs. Computer Languages
Ambiguity is the primary difference between them.
The goal in the production and comprehension of natural language is efficient communication. Allowing resolvable ambiguity permits shorter linguistic expressions and avoids the language becoming overly complex. Natural language relies on people's ability to use their knowledge and inference abilities to properly resolve ambiguities.
Programming languages, by contrast, are designed to be unambiguous: they are defined by a grammar that produces a unique parse for each sentence in the language.
Why else is NLP hard?
Non-standard use of English in social media: "See you, I will text you later."
Neologisms: unfriend, retweet, to google / to skype
New senses of words: "That's sick, dude"; giants – multinationals, manufacturers
Segmentation issues: the New York-New Haven railroad
Idioms: dark horse, ball in your court, burn the midnight oil
Tricky entity names: "Where is A Bug's Life playing…", "Let It Be was recorded…"
Empirical Laws
Function Words vs. Content Words
Function words have little lexical meaning but serve as important elements of sentence structure.
Function words are closed-class words: prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, particles, etc. (e.g., a, an, the).
In the frequency list of a typical corpus, most of the top words are function words; the list is dominated by the little words of English that play important grammatical roles.
Empirical Laws
Type vs. Token
Type: a concept; the unique words in a text.
Token: an instance of a concept; the running words in a text.
The type-token distinction separates a concept from the objects which are particular instances of that concept.
Type-Token Ratio (TTR): the ratio of the number of different words (types) to the number of running words (tokens) in a given text or corpus.
The index indicates how often, on average, a new 'word form' appears in the text or corpus.

              Mark Twain's Tom Sawyer   Complete Shakespeare works
Word tokens   71,370                    884,647
Word types    8,018                     29,066
TTR           0.112                     0.032
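As a concrete illustration, here is a minimal sketch of the TTR computation defined above; whitespace splitting and lowercasing are simplifying assumptions.

```python
def type_token_ratio(text):
    tokens = text.lower().split()   # running words (tokens)
    types = set(tokens)             # distinct words (types)
    return len(types) / len(tokens)

sample = "the cat sat on the mat and the dog sat too"
print(type_token_ratio(sample))  # 8 types / 11 tokens ≈ 0.727
```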
Empirical Laws
Observation on various texts
Consider various texts drawn from conversation, academic prose, news, and fiction. Which one will have the highest TTR and which the lowest?
High TTR – a tendency to use new words
Low TTR – the same words used repeatedly
[Figure: word frequency distribution from Tom Sawyer]
Empirical Laws
Zipf's Law
Count the frequency of each word type in a large corpus and list the word types in decreasing order of their frequency.
Zipf's law states that a word's frequency f is inversely proportional to its rank r in this list: f × r ≈ constant.
For example, the 50th most common word should occur with 3 times the frequency of the 150th most common word.
[Figure: empirical evaluation of Zipf's law on Tom Sawyer]
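A quick empirical check of Zipf's law can be scripted as below: if the law holds, the product f × r stays roughly constant down the frequency list. The corpus file name is a placeholder; any large tokenized text can be substituted.

```python
from collections import Counter

def zipf_table(tokens, ranks=(1, 10, 50, 150)):
    counts = Counter(tokens)
    freqs = [f for _, f in counts.most_common()]  # decreasing frequency
    for r in ranks:
        if r <= len(freqs):
            print(f"rank {r:>4}: freq {freqs[r-1]:>6}  f*r = {freqs[r-1] * r}")

# 'tom_sawyer.txt' is a hypothetical file; use any large corpus.
zipf_table(open("tom_sawyer.txt").read().lower().split())
```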
Empirical Laws
Zipf's Other Laws
Frequent words also tend to be shorter (the law of abbreviation) and tend to have more meanings (the number of meanings of a word correlates with its frequency).
Empirical Laws
Heaps' Law
Heaps' law relates vocabulary size to corpus size: the number of types |V| grows with the number of tokens N roughly as |V| = kN^β, where k and β (typically 0 < β < 1) depend on the corpus.
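Heaps' law can be observed by tracking vocabulary growth while scanning a corpus, as in this sketch; the file name is a placeholder.

```python
def vocabulary_growth(tokens, step=1000):
    seen = set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            # |V| should grow roughly like k * N**beta with beta < 1
            print(f"N = {i:>7}  |V| = {len(seen)}")

# 'corpus.txt' is a hypothetical file; use any large tokenized text.
vocabulary_growth(open("corpus.txt").read().lower().split())
```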
Words – What counts as a word?
corpus (plural corpora): a computer-readable collection of text or speech.
For example, the Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.)
How many words are in the following Brown sentence?
Sentence : He stepped out into the hall, was delighted to encounter a water
brother.
This sentence has 13 words if we don’t count punctuation marks as words,
15 if we count punctuation.
Are capitalized tokens like They and uncapitalized tokens like they the same word?
How about inflected forms like cats versus cat?
These two words have the same lemma cat but are different wordforms.
A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the
same word sense.
The wordform is the full inflected or derived form of the word.
Notion of Corpus:
Words – Types and Tokens
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
Tokens are the total number N of running words.
Ignoring punctuation, find the number of tokens and types in the following sentence:
They picnicked by the pool, then lay back on the grass and looked at the stars
16 tokens, 14 types ("the" occurs three times).
Notion of Corpus:
Corpora
Any particular piece of text that we study is produced by
one or more specific speakers or writers,
in a specific dialect of a specific language,
at a specific time,
in a specific place,
for a specific function.
The most important dimension of variation is the language.
NLP algorithms are most useful when they apply across many languages. The world has 7097
languages.
It is important to test algorithms on more than one language, and particularly on languages with
different properties; by contrast there is an unfortunate current tendency for NLP algorithms to
be developed or tested just on English
Code Switching : A phenomenon which uses multiple languages in a single communicative act
Other dimensions of variation are genre, demographic characteristics of the writer, and time.
Text-processing Basics
Tokenization
Tokenization is the process of segmenting a string of characters into
words.
What is sentence segmentation? –
The problem of deciding where the sentences begin and end.
Depending on the application at hand, you might have to perform sentence segmentation as well.
What are the challenges in sentence segmentation?
'!' and '?' are quite unambiguous; the period '.' is quite ambiguous, since it also marks abbreviations and numbers (e.g., 'Mr.' and '4.3').
What are the strategies for building a sentence segmenter?
Hand-written rules, regular expressions, machine learning – a rule-based sketch follows.
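Below is a toy rule-based segmenter using a regular expression, illustrating the hand-written-rules strategy; the abbreviation list is a small illustrative sample, and real segmenters (e.g., NLTK's punkt) are far more robust.

```python
import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

def segment_sentences(text):
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    candidates = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    sentences, buffer = [], ""
    for chunk in candidates:
        buffer = f"{buffer} {chunk}".strip() if buffer else chunk
        # Don't end a sentence right after a known abbreviation.
        if buffer.split()[-1].lower() not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(segment_sentences("Dr. Smith arrived. He was late! Was the talk over?"))
# -> ['Dr. Smith arrived.', 'He was late!', 'Was the talk over?']
```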
Text-processing Basics
Word Tokenization: Tokens, Types, and Issues
Tokenization is the process of segmenting a string of characters into words.
Example: I have a can opener; but I can't open these cans
Word token – an occurrence of a word. The above sentence has 11 word tokens.
Word type – a distinct realization of a word. The above sentence has 10 word types ("I" occurs twice).
Issues in tokenization: possessives and contractions (Finland's; what're, I'm, shouldn't), multi-word expressions (San Francisco), abbreviations (m.p.h.).
Handling hyphenation: end-of-line hyphens, lexical hyphens, sententially determined hyphens.
Language-specific issues: French and German (clitics and compound words), Chinese and Japanese (no spaces between words), Sanskrit (sandhi).
Practice: NLTK toolkit, Stanford CoreNLP, Unix commands
Tokenization using Python's split() function, regular expressions, and NLTK – sketched below.
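The three approaches are sketched side by side below; NLTK's word_tokenize assumes nltk is installed and its 'punkt' tokenizer data has been downloaded.

```python
import re
import nltk

text = "I can't open these cans; Finland's capital is Helsinki."

# 1. Python's split(): whitespace only, punctuation stays attached to words.
print(text.split())

# 2. Regular expressions: words (keeping internal apostrophes) or punctuation.
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))

# 3. NLTK: handles clitics, e.g. "can't" becomes "ca" + "n't".
print(nltk.word_tokenize(text))
```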
Word Normalization, Stemming and
Lemmatization
These techniques are used to prepare text, words, and documents for further processing.
They reduce inflections or variant forms to a base form:
am, are, is → be
car, car's, cars, cars' → car
Lemmatization finds the correct dictionary headword form.
Morphemes are divided into two categories:
Stems – the core meaning-bearing units
Affixes – prefixes (un-, anti-, etc.) and suffixes (-ity, -ation, etc.)
Stemming and lemmatization help us obtain the root forms of inflected words.
Stemming
• Helps us obtain the root forms of inflected words.
• The stem (root) is the part of the word to which you add inflectional or derivational affixes such as -ed, -ize, -s, de-, mis-.
• Stemming is a crude chopping of affixes: stems are created by removing the suffixes or prefixes attached to a word, so stemming a word or sentence may produce results that are not actual words.
• A computer program that stems words is called a stemming program, or stemmer.
• PorterStemmer is a stemming algorithm in NLTK that uses suffix stripping.
• It does not follow linguistics; rather, it applies a set of 5 rules for different cases, in phases, to generate stems.
Exercise: create a function that takes a sentence and returns the stemmed sentence (a sketch follows).
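A minimal sketch of the exercise, using NLTK's PorterStemmer; it assumes nltk is installed and the 'punkt' tokenizer data is available.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem_sentence(sentence):
    """Tokenize a sentence and return it with every token stemmed."""
    tokens = word_tokenize(sentence)
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(stem_sentence("The cats are running happily"))
# -> "the cat are run happili"  (stems need not be real words)
```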
Lemmatization
• Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma.
• For example, runs, running, and ran are all forms of the word run; therefore run is the lemma of all these words.
• As lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
• Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up the lemmas of words.
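A short sketch of WordNetLemmatizer; it requires NLTK's 'wordnet' corpus to be downloaded.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the lemmatizer assumes a noun, so verbs need pos="v".
print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("ran", pos="v"))      # -> run
print(lemmatizer.lemmatize("cats"))              # -> cat (default POS: noun)
```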
Standardization of Data
The common operations performed to standardize the data are:
• Removal of duplicate whitespace and punctuation.
• Accent removal.
• Capital letter removal.
• Removal or substitution of special characters/emojis (e.g., remove hashtags).
• Substitution of contractions (very common in English; e.g., 'I'm' → 'I am').
• Transforming word numerals into numbers (e.g., 'twenty three' → '23').
• Substitution of values for their type (e.g., '$50' → 'MONEY').
• Acronym normalization (e.g., 'US' → 'United States'/'U.S.A') and abbreviation normalization (e.g., 'btw' → 'by the way').
• Normalizing date formats, social security numbers, etc.
• Spell correction – very important if you're dealing with open user input such as tweets, IMs, and emails.
• Removal of gender/time/grade variation with stemming or lemmatization.
• Substitution of rare words with more common synonyms.
• Stop word removal (more a dimensionality-reduction technique than a normalization technique).
A few of these operations are sketched in code below.
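A minimal standardization sketch covering a few of the operations above (lowercasing, whitespace cleanup, hashtag removal, contraction and abbreviation substitution); the tiny lookup tables are illustrative only.

```python
import re

CONTRACTIONS = {"i'm": "i am", "what're": "what are", "can't": "cannot"}
ABBREVIATIONS = {"btw": "by the way"}  # toy example

def standardize(text):
    text = text.lower()                       # capital letter removal
    text = re.sub(r"\s+", " ", text).strip()  # collapse duplicate whitespace
    text = re.sub(r"#\w+", "", text)          # remove hashtags
    tokens = [CONTRACTIONS.get(t, t) for t in text.split()]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)

print(standardize("I'm   happy,  btw   #excited"))
# -> "i am happy, by the way"
```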
Spelling Correction – Edit Distance
Isolated-word error correction: pick the word that is closest to 'behaf'.
How do we define 'closest'?
We need a distance metric.
The simplest metric is edit distance.
Edit Distance
The minimum edit distance between two strings is defined as the minimum number of editing operations needed to transform one string into the other:
Insertion
Deletion
Substitution
Levenshtein distance – every operation, including substitution, has cost 1.
Alternate version – substitution has cost 2 (counted as one deletion plus one insertion).
Defining the minimum edit distance matrix
For a source string of length n and a target string of length m, let D[i][j] be the edit distance between the first i characters of the source and the first j characters of the target, so that:
D[i][0] = i, D[0][j] = j
D[i][j] = min( D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + sub-cost ), where sub-cost is 0 if the characters match.
Edit distance is then calculated with a dynamic-programming algorithm that fills this matrix, followed by tracing.
Tracing Edit Distance: Computing Alignments
Computing the edit distance may not be sufficient for some applications – we often need to align the characters of the two strings to each other.
We do this by keeping a backtrace:
Every time we enter a cell, remember where we came from.
When we reach the end, trace back the path from the upper-right corner to read off the alignment.
Performance
Time – O(nm)
Space – O(nm)
Backtrace – O(n+m)
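A compact sketch of the dynamic-programming algorithm described above, using Levenshtein costs (substitution cost 1); it fills the O(nm) matrix, matching the performance figures given.

```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # i deletions
    for j in range(m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution
    return D[n][m]

print(min_edit_distance("behaf", "behalf"))  # -> 1 (one insertion)
```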
Language models
A language model is a computational model or algorithm designed to understand, generate, and predict human language.
Language models are a fundamental part of natural language processing (NLP) and of machine learning applications that deal with textual data.
The primary goals of a language model include:
Understanding Language
Generating Text
Predicting Sequences
There are different types of language models, and they can be broadly categorized into
Statistical Language Models (SLM)
Grammar-based Language Models
Neural Language Models
Grammar-based Language Models
Grammar-based language models rely on predefined rules and structures
to generate sentences. These rules are often based on formal
grammatical frameworks, such as context-free grammars.
The model uses syntactic rules to define the permissible arrangements of
words in a sentence.
Example: In a grammar-based LM, you might have rules specifying that a
sentence must start with a noun phrase followed by a verb phrase.
Challenge - These models may struggle with handling natural language
variations and may not capture the full complexity of language.
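A toy illustration of the noun-phrase/verb-phrase rule above, written as a context-free grammar with nltk.CFG; the grammar and vocabulary are invented for the example.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)  # prints the unique parse tree licensed by the grammar
```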
Statistical Language Model
SLMs are based on statistical patterns observed in a given dataset. They
estimate the probability of a sequence of words occurring based on the
frequencies of these sequences in the training data.
N-gram Models: SLMs often use n-gram models, where the probability of a
word is conditioned on the previous n-1 words. Commonly used n-grams
include bigrams (n=2) and trigrams (n=3).
Example: in an SLM, the probability of the word "rain" might be higher after some preceding words (such as "the" and "it") than after other combinations.
Challenge – data sparsity: many perfectly valid word sequences never appear in the training data, so their estimated probabilities are zero unless smoothing is applied.
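A minimal bigram SLM sketch: it estimates P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) by maximum likelihood from a toy corpus invented for illustration.

```python
from collections import Counter

corpus = "the rain fell . it rained all day . the rain stopped".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen history: the data-sparsity problem noted above
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "rain"))   # -> 1.0 in this toy corpus
print(bigram_prob("rain", "fell"))  # -> 0.5
```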