Natural Language Processing
3/10/2022
Text is everywhere!
Social media
Research papers, news, etc.
Diversity of Languages (Worldwide)
How many Languages are spoken today?
7,111 [1]
Can we understand the majority of the world’s data? What percentage?
[1] Source: https://www.ethnologue.com/guides/how-many-languages
Diversity of Languages (India)
Can we understand the majority of India’s data?
Total languages: > 1,650 (2011 census)
Hindi: 57.1%
English: 10.6%
Bengali: 8.9%
Marathi: 8.2%
Telugu: 7.8%
...
The goals of NLP
French sentence: Tu bois un Coca-Cola
English translation: You drink a Coca-Cola
We try to understand a foreign language using a few known keywords.
Goals of NLP
Deep understanding of broad language constructs.
Achieve human-like comprehension of texts/languages.
Enable computer systems to understand, draw inferences from, summarize, translate, and generate accurate and natural human text and language.
Some Applications: Language Translation
Language Translation is not easy even for humans
Pepsi’s Chinese blunder
“Come alive with the Pepsi Generation”, when translated into Chinese, meant
“Pepsi brings your relatives back from the dead.”
KFC’s Chinese blunder
KFC’s slogan, “Finger lickin’ good”, when translated into Chinese, meant
“We’ll eat your fingers off.”
Some more examples...
Query Recommendation in Search Engines
Spelling Correction
Information Extraction
Sentiment Analysis
Recent Trends: Fake news detection
Recent Trends: Chatbots
Other Goals
Text Summarization
Opinion dynamics
Spam detection
...
Natural language technology is not yet perfect,
but it is good enough for several useful applications.
Why is NLP hard?
Compounding
Example: a single compound word of 195 characters (428 characters when transliterated into the Roman writing system).
Why is NLP hard?
Ambiguity
Why else is NLP hard?
Shorthand text
Why else is NLP hard?
Non-standard English
Why else is NLP hard?
Segmentation Issues
the New York New Haven Railroad
the [New] [York New] [Haven] [Railroad]
the [New York] [New Haven] [Railroad]
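The second bracketing above can be recovered with a list of known multiword names. The following is a minimal sketch, assuming a hypothetical gazetteer (`KNOWN_NAMES`) and a greedy longest-match strategy; neither comes from the slides, and real systems usually rely on statistical or neural chunkers instead.

```python
# Hypothetical gazetteer of multiword names (an assumption for illustration).
KNOWN_NAMES = {("new", "york"), ("new", "haven")}


def segment(tokens, lexicon, max_len=3):
    """Greedy left-to-right longest-match chunking against a lexicon."""
    chunks = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first; a single token always succeeds.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if n == 1 or candidate in lexicon:
                chunks.append(" ".join(tokens[i:i + n]))
                i += n
                break
    return chunks


tokens = "the New York New Haven Railroad".split()
print(segment(tokens, KNOWN_NAMES))
# ['the', 'New York', 'New Haven', 'Railroad']
```

With an empty lexicon the same function returns one chunk per token, which shows that the segmentation depends entirely on what the system already knows about names.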
Why else is NLP hard?
Idioms: expressions whose meaning differs from the meanings of the individual words in them.
Examples:
Dark horse
Ball in your court
Burn the midnight oil
Neologisms: new words, phrases or expressions, or new meanings of familiar words.
Examples:
Unfriend
Retweet
Google/Skype/Photoshop
Why is NLP hard?
New Senses of a word
That’s sick, dude!
Giants: multinationals, conglomerates, manufacturers
Tricky Entity Names
Where is A Bug’s Life playing ...
Let It Be was recorded ...
Why is NLP hard?
Code mixing/switching
Romanization
What do we do in NLP?
Create annotated corpora
Brown Corpus, Google Books Ngram Corpus, Reuters Newswire Topic Classification, IMDB Movie Review Sentiment Classification, Project Gutenberg, etc.
Create Models/Algorithms
LDA, BERT, CKY, Edit Distance, CRF++, etc.
Create Tools
CoreNLP, NLTK, Gensim, spaCy, etc.
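As an illustration of one of the algorithms listed above, here is a minimal sketch of edit distance (Levenshtein distance with unit costs for insertion, deletion, and substitution). The `suggest` helper and its tiny vocabulary are hypothetical additions that show how the distance could drive a naive spelling corrector; they are not taken from the slides.

```python
def edit_distance(source, target):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    # prev[j] holds the distance between the processed prefix of source and target[:j].
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_ch in enumerate(target, start=1):
            substitution = prev[j - 1] + (0 if s_ch == t_ch else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary):
    """Return the vocabulary entry closest to `word` (hypothetical helper)."""
    return min(vocabulary, key=lambda entry: edit_distance(word, entry))


print(edit_distance("kitten", "sitting"))  # 3
print(suggest("langauge", ["language", "translate", "sentence"]))  # language
```

Only two rows of the dynamic-programming table are kept at a time, which is enough when just the distance, and not the alignment, is needed.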
Stages in NLP (traditional view)
Phonetics and phonology
Morphology
Lexical Analysis
Syntactic Analysis
Semantic Analysis
Pragmatics
Discourse
Source: IITB NLP Course by Pushpak Bhattacharyya
Phonetics
• Processing of speech
• Challenges
• Homophones: bank (finance) vs. bank (river bank)
• Near homophones: maatraa vs. maatra (Hindi)
• Word boundary
• aajaayenge: aa jaayenge (will come) or aaj aayenge (will come today)
• I got [ua]plate: “I got up late” or “I got a plate”
• Phrase boundary
• mtech1 students are especially exhorted to attend as such seminars are integral to one's post-graduate education
• Disfluency: ah, um, ahem, etc.
Morphology
• Word formation rules from root words
• Nouns: Plural (boy-boys); Gender marking (czar-czarina)
• Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa khaaiie)
• Crucial first step in NLP
• Languages rich in morphology: e.g., Dravidian languages, Hungarian, Turkish
• Languages poor in morphology: Chinese, English
• Languages with rich morphology have the advantage of easier processing at the higher stages
• A task of interest to computer science: Finite State Machines for Word Morphology
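The finite-state idea can be sketched as a two-arc automaton: one arc consumes a stem from a lexicon, the next consumes a suffix licensed by the stem's class. The stem list, the suffix table, and the tag names below (`+N+PL`, `+V+PAST`, and so on) are assumptions made for illustration; production analyzers use full finite-state transducers with spelling rules.

```python
# Hypothetical mini-lexicon: stem -> stem class.
STEMS = {"boy": "N", "czar": "N", "stretch": "V"}

# Suffix arcs: (stem class, suffix) -> morphological tags.
# The empty suffix means the automaton accepts right after the stem.
SUFFIX_ARCS = {
    ("N", ""): "+N+SG",
    ("N", "s"): "+N+PL",
    ("V", ""): "+V+BASE",
    ("V", "ed"): "+V+PAST",
}


def analyze(word):
    """Return every stem+tags analysis the toy automaton accepts."""
    analyses = []
    for split in range(1, len(word) + 1):
        stem, suffix = word[:split], word[split:]
        stem_class = STEMS.get(stem)
        if stem_class is None:
            continue  # no stem arc matches this prefix
        tags = SUFFIX_ARCS.get((stem_class, suffix))
        if tags is not None:
            analyses.append(stem + tags)
    return analyses


print(analyze("boys"))       # ['boy+N+PL']
print(analyze("stretched"))  # ['stretch+V+PAST']
print(analyze("boyed"))      # []
```

The empty result for "boyed" shows the automaton rejecting a suffix that is not licensed for a noun stem.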
Lexical Analysis
• Essentially refers to dictionary access and obtaining the properties of the word
e.g. dog
noun (lexical property)
takes ‘s’ in the plural (morphological property)
animate (semantic property)
4-legged (semantic property)
carnivore (semantic property)
• Challenge: Lexical or word sense disambiguation
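Dictionary access can be pictured as a lookup that returns every entry recorded for a word form. The sketch below assumes a hypothetical `LexicalEntry` record and a one-word lexicon; the noun entry mirrors the properties listed above, and the verb entry (to pursue) is added to show why a lookup alone leaves the disambiguation challenge open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """One dictionary entry; the field names are illustrative assumptions."""
    lemma: str
    pos: str                    # lexical property
    morphology: str             # morphological property
    semantic_features: tuple    # semantic properties


# Hypothetical lexicon with two entries for the same word form.
LEXICON = {
    "dog": [
        LexicalEntry("dog", "noun", "takes -s in the plural",
                     ("animate", "4-legged", "carnivore")),
        LexicalEntry("dog", "verb", "regular verb", ("to pursue",)),
    ],
}


def lookup(word):
    """Return all entries for a word form; an empty list if it is unknown."""
    return LEXICON.get(word.lower(), [])


for entry in lookup("Dog"):
    print(entry.pos, entry.semantic_features)
```

Because `lookup` returns two candidates for "dog", a later stage has to pick one using the surrounding context.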
Lexical Disambiguation
First step: part-of-speech disambiguation
Dog as a noun (animal)
Dog as a verb (to pursue)
Sense disambiguation
Dog (as an animal)
Dog (as a very detestable person)
Needs word relationships in a context
The chair emphasized the need for adult education
Very common in day-to-day communication
Satellite channel ad: Watch what you want, when you want (two senses of watch)
e.g., ground-breaking ceremony/research
Syntax Processing Stage
Structure Detection
[S [NP I] [VP [V like] [NP mangoes]]]
Parsing Strategy
Driven by grammar
S-> NP VP
NP-> N | PRON
VP-> V NP | V PP
N-> Mangoes
PRON-> I
V-> like
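The grammar above is small enough to drive a top-down (recursive descent) parser directly. The sketch below is one possible strategy, not necessarily the one the course uses. It lowercases terminals for matching, and because the grammar gives no rule for PP, the VP -> V PP alternative can never succeed here.

```python
# The grammar from the slide, with terminals lowercased for matching.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["PRON"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "N": [["mangoes"]],
    "PRON": [["i"]],
    "V": [["like"]],
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for each way `symbol` matches at `start`."""
    if symbol not in GRAMMAR:
        # A terminal, or a symbol without rules (PP): must equal the next token.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(rhs, tokens, start, [symbol])


def expand(rhs, tokens, start, partial):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield tuple(partial), start
        return
    for subtree, nxt in parse(rhs[0], tokens, start):
        yield from expand(rhs[1:], tokens, nxt, partial + [subtree])


def parse_sentence(sentence):
    """Return all trees rooted in S that cover the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


print(parse_sentence("I like mangoes"))
# [('S', ('NP', ('PRON', 'i')), ('VP', ('V', 'like'), ('NP', ('N', 'mangoes'))))]
```

Backtracking over the alternatives is safe for this grammar because it has no left-recursive rule; a rule such as NP -> NP PP would need a chart-based method instead.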
Challenges in Syntactic Processing: Structural Ambiguity
• Scope
1. The old men and women were taken to safe locations
(old (men and women)) vs. ((old men) and women)
2. No smoking areas will allow hookahs inside
• Prepositional phrase attachment
• I saw the boy with a telescope
(who has the telescope?)
• I saw the mountain with a telescope
(world knowledge: a mountain cannot be an instrument of seeing)
• I saw the boy with the pony-tail
(world knowledge: a pony-tail cannot be an instrument of seeing)
• Ubiquitous: newspaper headline “20 years later, BMC pays father 20 lakhs for causing son’s death”
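The prepositional phrase attachment problem can be made concrete by counting parse trees. The sketch below uses a CKY-style chart over a small grammar in Chomsky normal form; the grammar and lexicon are assumptions written for this example, with the PP allowed to attach either to the verb phrase (VP -> VP PP) or to the noun phrase (NP -> NP PP).

```python
from collections import defaultdict

# Hypothetical binary rules: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP modifies the seeing event
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP modifies the noun
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "boy": ["N"], "mountain": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(sentence, start_symbol="S"):
    """Count distinct parse trees for the sentence with a CKY chart."""
    tokens = sentence.lower().split()
    n = len(tokens)
    # chart[(i, j)][label] = number of trees with root `label` over tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for label in LEXICON.get(token, []):
            chart[(i, i + 1)][label] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start_symbol]


print(count_parses("I saw the boy with a telescope"))       # 2
print(count_parses("I saw the mountain with a telescope"))  # 2
```

Both sentences receive two trees, so syntax alone cannot separate them; ruling out the reading in which the mountain has the telescope requires world knowledge.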
Structural Ambiguity…
Overheard
I did not know my PDA had a phone for 3 months
An actual sentence in the newspaper
The cameraman shot the man with the gun when he was near Tendulkar
Semantic Analysis
• Representation in terms of predicate calculus / semantic nets / frames / conceptual dependencies and scripts
• John gave a book to Mary
Give action: Agent: John, Object: Book, Recipient: Mary
• Challenge: ambiguity in semantic role labeling
(Eng) Visiting aunts can be a nuisance
(Hin) aapko mujhe mithaai khilaanii padegii (ambiguous in Marathi and Bengali too; not in Dravidian languages)
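The frame for "John gave a book to Mary" can be written down as a small record. The sketch below assumes a hypothetical `GiveFrame` class and a single hand-written pattern for sentences of the form "X gave a/an/the Y to Z". It illustrates the target representation only; real semantic role labeling does not work by matching one pattern.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class GiveFrame:
    """Frame for a 'give' action; slot names follow the slide."""
    agent: str
    obj: str
    recipient: str
    action: str = "give"


# Hypothetical pattern: "<Agent> gave a/an/the <Object> to <Recipient>".
GIVE_PATTERN = re.compile(r"^(\w+) gave (?:a|an|the) (\w+) to (\w+)$")


def extract_give_frame(sentence: str) -> Optional[GiveFrame]:
    """Fill the frame slots, or return None if the pattern does not match."""
    match = GIVE_PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    agent, obj, recipient = match.groups()
    return GiveFrame(agent=agent, obj=obj, recipient=recipient)


print(extract_give_frame("John gave a book to Mary"))
# GiveFrame(agent='John', obj='book', recipient='Mary', action='give')
```

A sentence such as "Visiting aunts can be a nuisance" returns None here; deciding whether the aunts are the agent or the object of the visiting is the role-labeling ambiguity that a pattern like this cannot resolve.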
Pragmatics
• Very hard problem
• Model user intention
• Tourist (in a hurry, checking out of the hotel, motioning to the service boy): Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train.
• Boy (running upstairs and coming back panting): Yes sir, they are there.
• World knowledge
• WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07)
Discourse
• Processing of a sequence of sentences
Mother to John:
John, go to school. It is open today. Should you bunk? Father will be very angry.
• Ambiguity of “open”
• Bunk what?
• Why will the father be angry?
Complex chain of reasoning and application of world knowledge
Ambiguity of “father”: father as parent or father as headmaster
Reference Books
Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall. 2009.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press. 1999.
Steven Bird, Ewan Klein and Edward Loper. Natural Language Processing with Python. O’Reilly Media. 2009.
Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press. 2016.