Natural Language Processing
NLP is among the hottest topics in the field of data science.
Companies are putting tons of money into research in this field.
Everyone is trying to understand NLP and its applications to build a career around it.
Every business out there wants to integrate it into their operations somehow.
Are you using NLP these days?
Search Autocorrect and Autocomplete – Language Translator
Social media monitoring
More and more people use social media to post their thoughts about a particular product, policy, or matter.
These posts can contain useful information about an individual's likes and dislikes.
Analyzing this unstructured data can generate valuable insights, and NLP comes to the rescue here too.
Companies use various NLP techniques to analyze social media posts and learn what customers think about their products.
Companies also use social media monitoring to understand the issues and problems that customers face when using their products.
Chatbots
Modern conversational agents can:
• Answer questions
• Book flights
• Find restaurants
For these functions they rely on a much more sophisticated understanding of the user's intent.
Survey Analysis
Surveys are an important way of evaluating a company's performance and of getting customer feedback on various products.
They are useful for understanding flaws and help companies improve their products.
NLP is used to analyze surveys and generate insights from them, such as gauging user sentiment and analyzing product reviews to understand the pros and cons.
Targeted Advertising – Hiring and Recruitment
Targeted advertising is a type of online advertising where ads are shown to the user based on their online activity.
It saves companies a lot of money because relevant ads are shown only to potential customers.
Voice Assistants
Conventional vs. NLP-based search
What is NLP?
Natural language processing is a sub-field of linguistics, computer science, and AI concerned with the interactions between computers and human language.
NLP enables computers to understand complex language structure and retrieve meaningful pieces of information from it.
Modern challenges in NLP involve speech recognition, natural language understanding, and natural language generation.
Why study NLP?
Text is the largest repository of human knowledge –
news articles, web pages, scientific articles, patents, emails, government documents,
tweets, Facebook posts, comments, Quora answers, etc.
What are the top ten languages on the internet in terms of millions of users?
Goals of NLP
Fundamental and Scientific Goal – Deep understanding of broad language.
Engineering Goal – Design, implement, and test systems that process natural languages for practical applications.
Applications of NLP
Text Classification
Language Modelling
Information Extraction
Information Retrieval
Conversational Agents
Text Summarization
Question Answering
Machine Translation
Topic Modelling
Speech Recognition
Origins of NLP
Alan Turing’s Turing Test (1950)
1950s – 1960s : Early Developments
Georgetown – IBM Experiment (1954)
Chomsky’s Transformational Generative Grammar (1957)
1960s – 1970s : Rule-based approaches
1970s – 1980s : Rise of statistical methods
1980s – 1990s : Corpus Linguistics and Machine Learning
2000s – present : Deep Learning and Neural networks.
Challenges of NLP
Why is NLP Hard?
Lexical Ambiguity
Ambiguity is pervasive
Activity
Find at least 5 meanings of this sentence:
I made her duck
Syntactic category: "duck" can be a noun or a verb; "her" can be a possessive or a dative pronoun.
Word meaning: "make" can mean create or cook.
Why is NLP Hard?
Lexical ambiguity is pervasive, and ambiguity is explosive: the number of possible readings multiplies as the ambiguous words and structures in a sentence combine.
Why is language ambiguous?
Natural Language vs. Computer Languages
Ambiguity is the primary difference between them.
The goal in the production and comprehension of natural language is efficient communication. Allowing resolvable ambiguity permits shorter linguistic expressions and avoids the language becoming overly complex. Natural language relies on people's ability to use their knowledge and inference abilities to properly resolve ambiguities.
Programming languages, by contrast, are designed to be unambiguous: they are defined by a grammar that produces a unique parse for each sentence in the language.
Why else is NLP hard?
Non-standard use of English in social media: "See you, I will text you later."
Neologisms: unfriend, retweet, to google / to skype
New senses of words: "That's sick, dude"; giants – multinationals, manufacturers
Segmentation issues: the New York-New Haven railroad
Idioms: dark horse, ball in your court, burn the midnight oil
Tricky entity names: "Where is A Bug's Life playing…", "Let It Be was recorded…"
Empirical Laws
Function Words vs. Content Words
Function words have little lexical meaning but serve as important elements of sentence structure.
Function words are closed-class words: prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, particles, etc. (e.g., a, an, the).
In the frequency list of a typical corpus, most of the top words are function words; the list is dominated by the little words of English that play important grammatical roles.
Empirical Laws
Type vs. Token
Type: a concept; the unique words in a text.
Token: an instance of a concept; the running words in a text.
The type-token distinction separates a concept from the objects which are particular instances of that concept.
Type-Token Ratio (TTR): the ratio of the number of different words (types) to the number of running words (tokens) in a given text or corpus.
The index indicates how often, on average, a new 'word form' appears in the text or corpus.

              Mark Twain's Tom Sawyer   Complete Shakespeare works
Word tokens   71,370                    884,647
Word types    8,018                     29,066
TTR           0.112                     0.032
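As a concrete illustration, here is a minimal sketch of the TTR computation defined above; whitespace splitting and lowercasing are simplifying assumptions.

```python
def type_token_ratio(text):
    tokens = text.lower().split()   # running words (tokens)
    types = set(tokens)             # distinct words (types)
    return len(types) / len(tokens)

sample = "the cat sat on the mat and the dog sat too"
print(type_token_ratio(sample))  # 8 types / 11 tokens ≈ 0.727
```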
Empirical Laws
Observation on various texts
Consider various texts drawn from conversation, academic prose, news, and fiction. Which one will have the highest TTR and which the lowest?
High TTR – a tendency to use new words
Low TTR – the same words used repeatedly
[Figure: word frequency distribution from Tom Sawyer]
Empirical Laws
Zipf's Law
Count the frequency of each word type in a large corpus and list the word types in decreasing order of their frequency.
Zipf's law states that a word's frequency f is inversely proportional to its rank r in this list: f × r ≈ constant.
For example, the 50th most common word should occur with 3 times the frequency of the 150th most common word.
[Figure: empirical evaluation of Zipf's law on Tom Sawyer]
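A quick empirical check of Zipf's law can be scripted as below: if the law holds, the product f × r stays roughly constant down the frequency list. The corpus file name is a placeholder; any large tokenized text can be substituted.

```python
from collections import Counter

def zipf_table(tokens, ranks=(1, 10, 50, 150)):
    counts = Counter(tokens)
    freqs = [f for _, f in counts.most_common()]  # decreasing frequency
    for r in ranks:
        if r <= len(freqs):
            print(f"rank {r:>4}: freq {freqs[r-1]:>6}  f*r = {freqs[r-1] * r}")

# 'tom_sawyer.txt' is a hypothetical file; use any large corpus.
zipf_table(open("tom_sawyer.txt").read().lower().split())
```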
Empirical Laws
Zipf's Other Laws
Frequent words also tend to be shorter (the law of abbreviation) and tend to have more meanings (the number of meanings of a word correlates with its frequency).
Empirical Laws
Heaps' Law
Heaps' law relates vocabulary size to corpus size: the number of types |V| grows with the number of tokens N roughly as |V| = kN^β, where k and β (typically 0 < β < 1) depend on the corpus.
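Heaps' law can be observed by tracking vocabulary growth while scanning a corpus, as in this sketch; the file name is a placeholder.

```python
def vocabulary_growth(tokens, step=1000):
    seen = set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            # |V| should grow roughly like k * N**beta with beta < 1
            print(f"N = {i:>7}  |V| = {len(seen)}")

# 'corpus.txt' is a hypothetical file; use any large tokenized text.
vocabulary_growth(open("corpus.txt").read().lower().split())
```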
Words – What counts as a word?
corpus (plural corpora): a computer-readable collection of text or speech.
For example, the Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.)
How many words are in the following Brown sentence?
Sentence : He stepped out into the hall, was delighted to encounter a water
brother.
This sentence has 13 words if we don’t count punctuation marks as words,
15 if we count punctuation.
Are capitalized tokens like They and uncapitalized tokens like they the same word?
How about inflected forms like cats versus cat?
These two words have the same lemma cat but are different wordforms.
A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the
same word sense.
The wordform is the full inflected or derived form of the word.
Notion of Corpus:
Words – Types and Tokens
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
Tokens are the total number N of running words.
Ignoring punctuation, find the number of tokens and types in the following sentence:
They picnicked by the pool, then lay back on the grass and looked at the stars
16 tokens, 14 types ("the" occurs three times).
Notion of Corpus:
Corpora
Any particular piece of text that we study is produced by
one or more specific speakers or writers,
in a specific dialect of a specific language,
at a specific time,
in a specific place,
for a specific function.
The most important dimension of variation is the language.
NLP algorithms are most useful when they apply across many languages. The world has 7097
languages.
It is important to test algorithms on more than one language, and particularly on languages with
different properties; by contrast there is an unfortunate current tendency for NLP algorithms to
be developed or tested just on English
Code Switching : A phenomenon which uses multiple languages in a single communicative act
Other dimensions of variation are genre, demographic characteristics of the writer, and time.
Text-processing Basics
Tokenization
Tokenization is the process of segmenting a string of characters into
words.
What is sentence segmentation? –
The problem of deciding where the sentences begin and end.
Depending on the application at hand, you might have to perform sentence segmentation as well.
What are the challenges in sentence segmentation?
'!' and '?' are quite unambiguous; the period '.' is quite ambiguous, since it also marks abbreviations and numbers (e.g., 'Mr.' and '4.3').
What are the strategies for building a sentence segmenter?
Hand-written rules, regular expressions, machine learning – a rule-based sketch follows.
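Below is a toy rule-based segmenter using a regular expression, illustrating the hand-written-rules strategy; the abbreviation list is a small illustrative sample, and real segmenters (e.g., NLTK's punkt) are far more robust.

```python
import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

def segment_sentences(text):
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    candidates = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    sentences, buffer = [], ""
    for chunk in candidates:
        buffer = f"{buffer} {chunk}".strip() if buffer else chunk
        # Don't end a sentence right after a known abbreviation.
        if buffer.split()[-1].lower() not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(segment_sentences("Dr. Smith arrived. He was late! Was the talk over?"))
# -> ['Dr. Smith arrived.', 'He was late!', 'Was the talk over?']
```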
Text-processing Basics
Word Tokenization: Tokens, Types, and Issues
Tokenization is the process of segmenting a string of characters into words.
Example: I have a can opener; but I can't open these cans
Word token – an occurrence of a word. The above sentence has 11 word tokens.
Word type – a distinct realization of a word. The above sentence has 10 word types ("I" occurs twice).
Issues in tokenization: possessives and contractions (Finland's; what're, I'm, shouldn't), multi-word expressions (San Francisco), abbreviations (m.p.h.).
Handling hyphenation: end-of-line hyphens, lexical hyphens, sententially determined hyphens.
Language-specific issues: French and German (clitics and compound words), Chinese and Japanese (no spaces between words), Sanskrit (sandhi).
Practice: NLTK toolkit, Stanford CoreNLP, Unix commands
Tokenization using Python's split() function, regular expressions, and NLTK – sketched below.
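The three approaches are sketched side by side below; NLTK's word_tokenize assumes nltk is installed and its 'punkt' tokenizer data has been downloaded.

```python
import re
import nltk

text = "I can't open these cans; Finland's capital is Helsinki."

# 1. Python's split(): whitespace only, punctuation stays attached to words.
print(text.split())

# 2. Regular expressions: words (keeping internal apostrophes) or punctuation.
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))

# 3. NLTK: handles clitics, e.g. "can't" becomes "ca" + "n't".
print(nltk.word_tokenize(text))
```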
Word Normalization, Stemming and
Lemmatization
These techniques are used to prepare text, words, and documents for further processing.
They reduce inflections or variant forms to a base form:
am, are, is → be
car, car's, cars, cars' → car
Lemmatization finds the correct dictionary headword form.
Morphemes are divided into two categories:
Stems – the core meaning-bearing units
Affixes – prefixes (un-, anti-, etc.) and suffixes (-ity, -ation, etc.)
Stemming and lemmatization help us obtain the root forms of inflected words.
Stemming
• Helps us obtain the root forms of inflected words.
• The stem (root) is the part of the word to which you add inflectional or derivational affixes such as -ed, -ize, -s, de-, mis-.
• Stemming is a crude chopping of affixes: stems are created by removing the suffixes or prefixes attached to a word, so stemming a word or sentence may produce results that are not actual words.
• A computer program that stems words is called a stemming program, or stemmer.
• PorterStemmer is a stemming algorithm in NLTK that uses suffix stripping.
• It does not follow linguistics; rather, it applies a set of 5 rules for different cases, in phases, to generate stems.
Exercise: create a function that takes a sentence and returns the stemmed sentence (a sketch follows).
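A minimal sketch of the exercise, using NLTK's PorterStemmer; it assumes nltk is installed and the 'punkt' tokenizer data is available.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem_sentence(sentence):
    """Tokenize a sentence and return it with every token stemmed."""
    tokens = word_tokenize(sentence)
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(stem_sentence("The cats are running happily"))
# -> "the cat are run happili"  (stems need not be real words)
```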
Lemmatization
• Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma.
• For example, runs, running, and ran are all forms of the word run; therefore run is the lemma of all these words.
• As lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
• Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up the lemmas of words.
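A short sketch of WordNetLemmatizer; it requires NLTK's 'wordnet' corpus to be downloaded.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the lemmatizer assumes a noun, so verbs need pos="v".
print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("ran", pos="v"))      # -> run
print(lemmatizer.lemmatize("cats"))              # -> cat (default POS: noun)
```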
Standardization of Data
The common operations performed to standardize the data are:
• Removal of duplicate whitespace and punctuation.
• Accent removal.
• Capital letter removal.
• Removal or substitution of special characters/emojis (e.g., remove hashtags).
• Substitution of contractions (very common in English; e.g., 'I'm' → 'I am').
• Transforming word numerals into numbers (e.g., 'twenty three' → '23').
• Substitution of values for their type (e.g., '$50' → 'MONEY').
• Acronym normalization (e.g., 'US' → 'United States'/'U.S.A') and abbreviation normalization (e.g., 'btw' → 'by the way').
• Normalizing date formats, social security numbers, etc.
• Spell correction – very important if you're dealing with open user input such as tweets, IMs, and emails.
• Removal of gender/time/grade variation with stemming or lemmatization.
• Substitution of rare words with more common synonyms.
• Stop word removal (more a dimensionality-reduction technique than a normalization technique).
A few of these operations are sketched in code below.
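A minimal standardization sketch covering a few of the operations above (lowercasing, whitespace cleanup, hashtag removal, contraction and abbreviation substitution); the tiny lookup tables are illustrative only.

```python
import re

CONTRACTIONS = {"i'm": "i am", "what're": "what are", "can't": "cannot"}
ABBREVIATIONS = {"btw": "by the way"}  # toy example

def standardize(text):
    text = text.lower()                       # capital letter removal
    text = re.sub(r"\s+", " ", text).strip()  # collapse duplicate whitespace
    text = re.sub(r"#\w+", "", text)          # remove hashtags
    tokens = [CONTRACTIONS.get(t, t) for t in text.split()]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)

print(standardize("I'm   happy,  btw   #excited"))
# -> "i am happy, by the way"
```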
Spelling Correction – Edit Distance
Isolated-word error correction: pick the word that is closest to 'behaf'.
How do we define 'closest'?
We need a distance metric.
The simplest metric is edit distance.
Edit Distance
The minimum edit distance between two strings is defined as the minimum number of editing operations needed to transform one string into the other:
Insertion
Deletion
Substitution
Levenshtein distance – every operation, including substitution, has cost 1.
Alternate version – substitution has cost 2 (counted as one deletion plus one insertion).
Defining the minimum edit distance matrix
For a source string of length n and a target string of length m, let D[i][j] be the edit distance between the first i characters of the source and the first j characters of the target, so that:
D[i][0] = i, D[0][j] = j
D[i][j] = min( D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + sub-cost ), where sub-cost is 0 if the characters match.
Edit distance is then calculated with a dynamic-programming algorithm that fills this matrix, followed by tracing.
Tracing Edit Distance: Computing Alignments
Computing the edit distance may not be sufficient for some applications – we often need to align the characters of the two strings to each other.
We do this by keeping a backtrace:
Every time we enter a cell, remember where we came from.
When we reach the end, trace back the path from the upper-right corner to read off the alignment.
Performance
Time – O(nm)
Space – O(nm)
Backtrace – O(n+m)
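A compact sketch of the dynamic-programming algorithm described above, using Levenshtein costs (substitution cost 1); it fills the O(nm) matrix, matching the performance figures given.

```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # i deletions
    for j in range(m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution
    return D[n][m]

print(min_edit_distance("behaf", "behalf"))  # -> 1 (one insertion)
```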
Language models
A language model is a computational model or algorithm designed to understand, generate, and predict human language.
Language models are a fundamental part of natural language processing (NLP) and of machine learning applications that deal with textual data.
The primary goals of a language model include:
Understanding Language
Generating Text
Predicting Sequences
There are different types of language models, and they can be broadly categorized into
Statistical Language Models (SLM)
Grammar-based Language Models
Neural Language Models
Grammar-based Language Models
Grammar-based language models rely on predefined rules and structures
to generate sentences. These rules are often based on formal
grammatical frameworks, such as context-free grammars.
The model uses syntactic rules to define the permissible arrangements of
words in a sentence.
Example: In a grammar-based LM, you might have rules specifying that a
sentence must start with a noun phrase followed by a verb phrase.
Challenge - These models may struggle with handling natural language
variations and may not capture the full complexity of language.
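A toy illustration of the noun-phrase/verb-phrase rule above, written as a context-free grammar with nltk.CFG; the grammar and vocabulary are invented for the example.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)  # prints the unique parse tree licensed by the grammar
```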
Statistical Language Model
SLMs are based on statistical patterns observed in a given dataset. They
estimate the probability of a sequence of words occurring based on the
frequencies of these sequences in the training data.
N-gram Models: SLMs often use n-gram models, where the probability of a
word is conditioned on the previous n-1 words. Commonly used n-grams
include bigrams (n=2) and trigrams (n=3).
Example: in an SLM, the probability of the word "rain" might be higher after some preceding words (such as "the" and "it") than after other combinations.
Challenge – data sparsity: many perfectly valid word sequences never appear in the training data, so their estimated probabilities are zero unless smoothing is applied.
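A minimal bigram SLM sketch: it estimates P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) by maximum likelihood from a toy corpus invented for illustration.

```python
from collections import Counter

corpus = "the rain fell . it rained all day . the rain stopped".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen history: the data-sparsity problem noted above
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "rain"))   # -> 1.0 in this toy corpus
print(bigram_prob("rain", "fell"))  # -> 0.5
```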