Intro to NLP for IT Students
(4351601)
Foundation of
- H. P. Jagad
Lecturer (IT)
Sir BPTI Bhavnagar
http://hpjagad.blogspot.com
Unit-4
Introduction to NLP
Language is a method of communication with the help of which we can speak, read and write.
"Can human beings communicate with computers in their natural language?"
It is a challenge to develop such applications because computers need structured data, but human speech is unstructured & ambiguous in nature.
NLP is the branch of AI that gives machines the ability to interpret, analyze, manipulate & understand human languages.
Human language can be in text or audio format.
It helps developers organize knowledge for performing tasks such as translation, automatic summarization, speech recognition & topic segmentation.
Goal: to enable computers to analyze and process huge amounts of natural language data.
History of NLP
1940s - Researchers began experimenting with Machine Translation during World War II.
1948 - The first recognizable NLP application was introduced at Birkbeck College, London.
1950 - Alan Turing published an article titled "Computing Machinery and Intelligence". The proposed test includes a task that involves the automated interpretation and generation of natural language. (Turing Test)
1950s - Chomsky introduced the idea of Generative Grammar (rule-based descriptions of syntactic structures).
1960s to 1970s - NLP focused on rule-based systems, which used sets of predefined rules & dictionaries to process language.
1970 - SHRDLU, a program that understands & responds to natural language queries. It demonstrated syntax, semantics, and reasoning about the world.
1970s - LUNAR, one of the largest and most successful question-answering systems using AI techniques. It had a separate syntax analyzer and a semantic interpreter.
1980s to 1990s - Hidden Markov Models (HMMs) became a popular tool for speech recognition (converting speech to text).
2000s - IBM's Model-1 & Model-2 used statistical patterns to improve machine translation quality.
2010 onwards - NLP was transformed by deep learning & neural networks. Models like Word2Vec, GloVe and Google's BERT (Bidirectional Encoder Representations from Transformers) were developed for NLP tasks.
GPT-3 (Generative Pre-trained Transformer 3) was developed.
Today, modern NLP powers applications such as speech recognition, machine translation, sentiment analysis and machine text reading. Ex- AMAZON ALEXA
By 2028, the NLP market is expected to grow from $20 billion to $127 billion.
Advantages of NLP
Enhanced User Experience - chatbots & virtual assistants interact with users in a natural way.
Efficient Information Retrieval - analyze large volumes of text data quickly & accurately.
Automation of repetitive tasks - text summarization, data extraction & document classification.
Multi-language capabilities.
Insight Extraction - customer feedback, social media and online reviews used in decision making.
Content generation.
Accessibility - helps disabled persons using text-to-speech or speech-to-text applications.
Fraud detection - identifying phishing emails or fraudulent financial transactions.
Market Research - analyze social media conversations, customer reviews & surveys to understand market trends.
Disadvantages of NLP
Requires vast amounts of high-quality data to train a model, which is time consuming & expensive.
Training can take time. If a model must be developed on a new dataset without using a pre-trained model, it can take weeks to achieve good performance, depending on the amount of data.
May require vast computational resources.
Difficult to understand how models arrive at their decisions.
Biases present in training data may be inherited by NLP models.
Building a multi-language NLP model can be a challenging task.
Unpredictable - not 100% reliable; there is always the possibility of errors in predictions and results.
May require more keystrokes.
An NLP model can work well on a specific task but may not work effectively on unseen tasks.
Components of NLP
1. Natural Language Understanding (NLU)
It helps the machine to understand and analyze human language by extracting metadata from content such as concepts, entities, keywords, emotions, relations etc.
NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
Semantic Understanding - meaning of a word, phrase or sentence.
Contextual Analysis & Understanding - considering surrounding words to interpret the exact meaning of a word.
Named Entity Recognition (NER) - categorize named entities like names of people, organizations, locations, dates etc.
Sentiment Analysis
Relationship Extraction
Question Answering
Topic Modeling - identify the main topics within a collection of documents and categorize them.
Knowledge Graph construction for organizing structured information.
Components of NLP
2. Natural Language Generation (NLG)
Generation of human-like text or speech from structured data, information or other non-linguistic input.
Converts machine-readable language into text and can also convert text into audible speech using text-to-speech technology.
Data-to-text Generation
Text Summarization
Content/Dialog/Narrative (Story) Generation
Data Reporting - reports & insights from visualization tools.
Automated Translation - translation between languages.
Techniques used in NLG:
Rule-based NLG - predefined rules & templates.
Statistical NLG - based on probabilities of word sequences. Ex- Hidden Markov Models (HMMs), Conditional Random Fields (CRFs)
Neural NLG - Generative Pre-trained Transformer (GPT)
NL Understanding vs NL Generation
Purpose: NLU understands & interprets human language; NLG generates human-like text or speech from structured data.
Input: NLU takes natural language; NLG takes structured data or templates.
Output: NLU extracts information or meaning; NLG produces natural language text/speech.
Techniques: NLU - tokenization, part-of-speech tagging, NER (Named Entity Recognition), sentiment analysis; NLG - rule-based approaches, statistical modeling and deep learning.
Challenge: NLU - handling ambiguity & understanding context; NLG - ensuring generated content is contextually relevant.
Applications: NLU - chatbots, voice assistants, sentiment analysis; NLG - report generation, content creation, dialog systems, storytelling.
NLP Terminology
Phonology - study of organizing sounds systematically.
Phoneme - basic unit of phonology; the smallest unit of sound that may cause a change of meaning within a language, but that does not have meaning by itself.
Morphemes - smallest meaningful unit of language. If it is altered, the entire meaning of the word can change. Ex.- the word "Eating" has 2 morphemes, "Eat" & "ing"; Redo = Re + do.
Morphology - study of the construction of words from primitive meaningful units.
Lexemes - the set of inflected forms taken by a single word. Ex.- {run, running, ran}
Syntax - arranging words to make a sentence.
Semantics - meaning of words and how to combine words into meaningful phrases and sentences.
Pragmatics - understanding sentences in different situations and how the interpretation of the sentence is affected.
Discourse - how the immediately preceding sentence can affect the interpretation of the next sentence. Ex.- That is an elephant. It is running.
Context - how everything within language works together to convey a particular meaning.
Hierarchy: Corpus -> Document -> Paragraph -> Sentence -> Word/Token
Phases of NLP
Input Sentence
  -> Lexical Analysis (uses a Lexicon)
  -> Syntax Analysis (uses a Grammar)
  -> Semantic Analysis (uses Semantic Rules)
  -> Discourse Integration (uses Contextual Information)
  -> Pragmatic Analysis
  -> Output
Phases of NLP:
1. Lexical Analysis / Morphological Processing
It scans the source text as a stream of characters and converts it into meaningful lexemes. It breaks down the whole text into paragraphs, sentences, and words, called a token list.
It splits text at every space & removes punctuation marks.
It is the study of trying to understand the meaning of words, their relation with other words, and the context.
It is the starting point of an NLP pipeline.
A token refers to a sequence of characters that can be considered as one unit in the grammar.
Approaches used in Lexical Analysis
➢Part of speech (PoS) tagger - assigns a PoS tag to each word to understand the meaning of the text.
➢Stemming - cuts each word down to its base form.
➢Lemmatization - reduces each word to its meaningful base form. Ex.- Cats->Cat, Running->Run
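The splitting step above can be sketched in plain Python. This is a minimal illustration of "split at spaces & remove punctuation"; real lexical analyzers (such as NLTK's tokenizers) handle many more cases.

```python
import string

def lexical_analysis(text):
    """Split text at spaces and strip punctuation, producing a token list."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)  # drop leading/trailing punctuation
        if token:
            tokens.append(token)
    return tokens

print(lexical_analysis("Hello, world! NLP is fun."))
# ['Hello', 'world', 'NLP', 'is', 'fun']
```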
Phases of NLP:
2. Syntactic Analysis (Parsing)
It is used to check the grammar, logical meaning and correctness of sentences and word arrangements, & shows the relationship among the words.
'Parsing' originates from the Latin word 'pars', which means 'part'. It means to break down a given sentence into its 'grammatical constituents'.
Grammatical rules apply to groups of words, not to individual words.
Ex: "College go a boy"
The grammatical structure is not correct, so it does not convey its logical meaning. It is rejected.
Parsing is the process of analyzing a string of symbols in natural language & confirming the rules of a formal Grammar.
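The accept/reject behaviour on the slide can be sketched with a toy grammar. The mini-lexicon and the single sentence pattern below are hypothetical, chosen only to make "College go a boy" fail; real parsers use full context-free grammars.

```python
# Hypothetical mini-lexicon mapping words to part-of-speech tags.
LEXICON = {"a": "DET", "boy": "NOUN", "college": "NOUN",
           "go": "VERB", "goes": "VERB", "to": "PREP"}

# One allowed sentence pattern: DET NOUN VERB PREP NOUN ("A boy goes to college").
GRAMMAR = [["DET", "NOUN", "VERB", "PREP", "NOUN"]]

def parse(sentence):
    """Accept the sentence only if its tag sequence matches a grammar rule."""
    tags = [LEXICON.get(w.lower()) for w in sentence.split()]
    return tags in GRAMMAR

print(parse("A boy goes to college"))  # True  - accepted
print(parse("College go a boy"))       # False - rejected, as on the slide
```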
Phases of NLP:
3. Semantic Analysis
It is the process of finding the meaning of the text.
It tries to understand & interpret sentences, paragraphs or whole documents by analyzing the grammatical structure & identifying relationships between individual words in a particular context.
It finds only the dictionary meaning, i.e. the literal meaning, of the given text.
It reads every word in the content to capture the actual meaning of the text; it identifies the text elements and assigns them to their logical and grammatical roles.
Every sentence has a predicate that conveys the main logic of that sentence.
Phases of NLP:
4. Discourse Integration
It is the process of understanding the relationships between words, sentences, paragraphs & entire documents. It specifies the relations between sentences or clauses.
It is essential because the meaning of a word depends on the context in which it is used & also on the previous sentences.
Ex- "Bank" -> financial institution / riverbank / place to store data.
It is important for information retrieval, text summarization & information extraction.
Various ways to integrate discourse information in NLP models:
➢Co-reference resolution is the process of identifying words/phrases that refer to the same entity in a text.
Ex- "She" refers to "Aalya".
➢Discourse Markers are words that represent the relationship between sentences or paragraphs. They provide clues about the overall structure of the text & the main points of the document. Ex- "However", "therefore" and "in conclusion".
Applications of Discourse Integration in NLP
➢Sentiment Analysis
Ex- "The product is good but the customer service is terrible."
Here, "but" denotes a change in sentiment.
➢Question answering systems
By integrating discourse information, a question answering system can better understand the context of a question & provide accurate answers.
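Detecting the "but" sentiment shift mentioned above can be sketched with a small list of contrast markers. The marker set is a hand-picked illustration, not a complete inventory of discourse markers.

```python
# Hand-picked contrast markers that often signal a sentiment change.
CONTRAST_MARKERS = {"but", "however", "although", "yet"}

def has_sentiment_shift(sentence):
    """Flag a possible sentiment change when a contrast marker appears."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return bool(words & CONTRAST_MARKERS)

print(has_sentiment_shift("The product is good but the customer service is terrible."))
# True - "but" signals the shift, as in the slide's example
```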
Phases of NLP:
5. Pragmatic Analysis
NLP is used to perform tasks like sentiment analysis, text classification and information extraction. However, the accuracy of the results depends on the quality of the data & the analysis techniques used. Pragmatic analysis can help to improve accuracy.
"What was said" is re-interpreted as what was actually meant. It involves deriving those aspects which require real-world knowledge.
It is the process of extracting information from text, focusing on figuring out the actual meaning of the text.
A lot of a text's meaning has to do with the context in which it was said/written.
Ex.- "Open the door" is interpreted as a request, not an order.
Ex.- "I am so excited to attend the meeting tomorrow."
Helpful in sentiment analysis - identifying positive/negative/neutral sentiments.
It is a challenging process, as context can vary depending upon many factors.
Building NLP systems from scratch is computationally expensive & time consuming, so NLP libraries are used.
NLP Libraries: Ex- Scikit-learn, NLTK, Pattern, TextBlob, SpaCy
NLTK - Natural Language ToolKit
NLTK is a widely used Python library for working with human language data in NLP and text analysis tasks.
It provides a set of tools, resources and libraries for tasks such as tokenization, part-of-speech tagging, parsing, stemming, lemmatization, character count, word count, sentiment analysis etc. It is free & open source.
It supports multiple languages like English, German, Hindi etc.
It provides a set of algorithms for NLP.
Library installation for Jupyter Notebook or Google Colab:
PIP is a package manager for Python packages. (Preferred Installer Program)
!pip install nltk
import nltk
nltk.download('all')
Ctrl+Enter to run the program
Data Preprocessing Using NLTK
Data preprocessing is the process of cleaning unstructured text data so that it can be used to predict, analyze & extract information.
Real-world text data is unstructured & inconsistent, so data preprocessing becomes a necessary step.
The various data preprocessing methods are:
1. Tokenization
2. Frequency Distribution of Words
3. Filtering Stop Words
4. Stemming
5. Lemmatization
6. Parts of Speech (POS) Tagging
7. Named Entity Recognition
8. WordNet
1. Tokenization
The process of breaking down text data into individual tokens (words, sentences, characters) is known as Tokenization.
It is the first step in text analytics & is implemented using the nltk.tokenize module.
Punkt Sentence Tokenizer: - divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations and words that start sentences. It identifies sentence boundaries.
a) Sentence Tokenization
Text data is split into sentences. It is implemented using the sent_tokenize() function from the nltk.tokenize module.
b) Word Tokenization
Text data is split into individual words. It is implemented using the word_tokenize() function from the nltk.tokenize module.
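The behaviour of sent_tokenize() and word_tokenize() can be approximated with plain regular expressions. This is a naive sketch that runs without downloading NLTK data; NLTK's Punkt tokenizer additionally handles abbreviations, collocations and other edge cases.

```python
import re

def sent_tokenize_simple(text):
    """Naive sentence split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize_simple(sentence):
    """Naive word split: runs of word characters and punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLTK is a toolkit. It helps with NLP!"
print(sent_tokenize_simple(text))
# ['NLTK is a toolkit.', 'It helps with NLP!']
print(word_tokenize_simple("It helps with NLP!"))
# ['It', 'helps', 'with', 'NLP', '!']
```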
2. Frequency Distribution of Words
Find out the number of times each word is repeated in a given text.
Generate the frequency distribution of words in a text by using the FreqDist() function from the nltk.probability submodule.
from nltk.probability import FreqDist
fd = FreqDist(tokenized_word)
most_common() is used to print the most frequent words.
Matplotlib is a graph-plotting library in Python which has the pyplot submodule. It has a plot() function.
plot() draws a line from point to point.
Parameter 1 is an array containing the points on the x-axis.
Parameter 2 is an array containing the points on the y-axis.
import matplotlib.pyplot as plt
fd.plot()
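NLTK's FreqDist behaves like Python's built-in collections.Counter (it supports the same most_common() call), so the counting step can be sketched without NLTK data:

```python
from collections import Counter

# Count how many times each token appears, as FreqDist would.
tokens = "the cat sat on the mat the end".split()
fd = Counter(tokens)

print(fd.most_common(1))  # [('the', 3)]
print(fd["cat"])          # 1
```

Passing the resulting counts to matplotlib's plot() (word ranks on the x-axis, counts on the y-axis) reproduces what fd.plot() draws.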
3. Filtering Stop Words
Useless words are referred to as stop words (noise).
It is necessary to filter out words which are repetitive and don't hold any information. For example, words like {that, these, below, is, are, a, an} don't provide any information, so they need to be removed from the text.
NLTK provides a huge list of stop words.
The stopwords module is part of the NLTK library and provides a collection of commonly used stop words for various languages.
The stopwords module needs to be downloaded; it is available in the corpus.
A corpus is a collection of authentic text documents or audio organized into datasets. 'Authentic' means text written or audio spoken by a native speaker of the language.
nltk.download('stopwords')
from nltk.corpus import stopwords
format() method: - concatenates elements within an output through positional formatting. Use {} to mark where a variable will be substituted.
Ex- print('{1} and {0}'.format("Tom", "Jerry")) prints "Jerry and Tom"
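The filtering step can be sketched with a small hand-picked stop list (NLTK's stopwords.words('english') returns a much larger list):

```python
# Hand-picked stop list for illustration only; NLTK's English list is far larger.
STOP_WORDS = {"is", "are", "a", "an", "the", "that", "these", "below"}

def filter_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["NLTK", "is", "a", "toolkit", "for", "NLP"]
print(filter_stop_words(tokens))  # ['NLTK', 'toolkit', 'for', 'NLP']
```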
4. Stemming
Stemming is a text normalization technique which removes prefixes and suffixes from words to reduce them to their root word or stem.
It is used by chatbots and search engines to analyze the meaning behind search queries.
3 stemming algorithms (available in nltk.stem):
1. Porter Stemmer: - one of the oldest and most widely used. It removes common suffixes from words.
2. Snowball Stemmer: - also called the Porter2 stemmer. It is an advanced version of the Porter stemmer in which a few of the stemming issues have been resolved. It supports several languages.
3. Lancaster Stemmer: - an aggressive approach, because it over-stems a lot of terms. It reduces the word to the shortest stem possible.
Stemming is a faster process than lemmatization, as it does not consider the context of the words.
Due to its aggressive nature, there always remains a possibility of invalid outcomes in a set of data.
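Suffix stripping can be sketched in a few lines. The suffix list below is a toy illustration; NLTK's PorterStemmer applies a much more careful multi-step rule set. Note how "runn" is not a valid word, showing the "invalid outcomes" risk mentioned above.

```python
# Toy suffix list; Porter's algorithm has many more rules and conditions.
SUFFIXES = ["ing", "ly", "ed", "es", "s"]

def stem(word):
    """Strip the first matching suffix, leaving at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["running", "cats", "quickly", "jumped"]])
# ['runn', 'cat', 'quick', 'jump']  - 'runn' shows a possible invalid stem
```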
5. Lemmatization
It is also a text normalization technique, used to reduce a word to its root word while giving the complete meaning of the word, so that it makes sense.
It uses vocabulary and morphological analysis to transform a word into a root word called a Lemma.
Stemming reduces a given word to its root form by just discarding the last few characters, while lemmatization considers context and converts the word to its meaningful base form, called a lemma.
from nltk.stem import WordNetLemmatizer
This process requires a lot of data to analyze the structure of the language. It always considers the context first and then converts the word to its meaningful root form.
The WordNet lemmatizer uses WordNet, a lexical database used by all the major search engines, to provide lemmatization features.
By default it lemmatizes words as nouns. To provide a pos (part of speech) input externally, use the characters given in the table:
POS Char  Description
v         verb
n         noun
a         adjective
r         adverb
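The lookup-with-pos behaviour can be sketched with a tiny hand-built lemma dictionary (the entries below are illustrative; NLTK's WordNetLemmatizer consults the full WordNet database):

```python
# Hypothetical (word, pos) -> lemma entries for illustration.
LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("cats", "n"): "cat",
}

def lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); default pos is noun, like WordNetLemmatizer."""
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize("running", pos="v"))  # run
print(lemmatize("better", pos="a"))   # good
print(lemmatize("running"))           # running - no noun entry, so unchanged
```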
6. Parts of Speech (POS) Tagging
POS tagging assigns a grammatical category (like noun, verb, adjective, adverb etc.) to each word in a sentence.
It is the process of identifying the parts of speech of a sentence.
Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
2 main types of POS tagging in NLP:
Rule-based POS tagging: - relies on a predefined set of grammatical rules, a dictionary of words, and their POS tags. It is simple to implement and understand, but less accurate.
Statistical POS tagging: - uses ML algorithms to predict POS tags based on the context of the words in a sentence. It requires a large amount of training data and computational resources, but is more accurate.
The Averaged Perceptron Tagger is a statistical POS tagger that uses a ML algorithm called the Averaged Perceptron. It uses the Penn Treebank POS tagset by default and is trained on a large corpus of text.
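The rule-based approach can be sketched as dictionary lookup plus suffix rules (both the mini-dictionary and the rules are toy assumptions; statistical taggers like NLTK's averaged perceptron learn such decisions from a tagged corpus):

```python
# Hypothetical mini-dictionary of known words and their tags.
DICTIONARY = {"the": "DET", "a": "DET", "is": "VERB", "dog": "NOUN"}

def tag(word):
    """Dictionary lookup first, then suffix rules, defaulting to noun."""
    w = word.lower()
    if w in DICTIONARY:
        return DICTIONARY[w]
    if w.endswith("ing"):   # gerunds / present participles
        return "VERB"
    if w.endswith("ly"):    # most -ly words are adverbs
        return "ADV"
    return "NOUN"           # default guess

sentence = "The dog is running quickly".split()
print([(w, tag(w)) for w in sentence])
# [('The', 'DET'), ('dog', 'NOUN'), ('is', 'VERB'), ('running', 'VERB'), ('quickly', 'ADV')]
```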
The averaged_perceptron_tagger.zip contains the pre-trained English POS tagger in NLTK. Some of its tags are given below:
Abbreviation  Meaning
NNP           proper noun, singular
PRP           personal pronoun (hers, herself, him, himself)
TO            infinitive marker (to)
VB            verb, base form
VBG           verb, gerund (-ing form)
VBP           verb, present tense, not 3rd person singular