Notes UNIT-I Introduction NLP
Lecture-1

What is Natural Language Processing? (NLP)


Natural Language Processing (NLP) is a branch of Artificial Intelligence that enables computers to understand, interpret, and generate meaningful human language. NLP is a set of methods for processing and analyzing text data. In NLP, text is tokenized, meaning it is broken into tokens, which may be words, phrases, or characters; tokenization is typically the first step in an NLP task. The text is cleaned and preprocessed before NLP techniques are applied.
NLP techniques are used in machine translation, healthcare, finance, customer service, sentiment analysis, and extracting valuable information from text data. NLP is also used in text generation, language modeling, and question answering. Many companies use NLP techniques to solve their text-related problems.
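As a minimal illustration of tokenization, here is a short sketch using the spaCy library (the same library used in Lecture-4); it assumes spaCy and its small English model are installed.
# Minimal tokenization sketch with spaCy (assumes spaCy and the
# en_core_web_sm model are installed)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("NLP breaks text into tokens.")
print([token.text for token in doc])
# -> ['NLP', 'breaks', 'text', 'into', 'tokens', '.']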

10 Major Challenges of Natural Language Processing (NLP)


Natural Language Processing (NLP) faces various challenges due to the complexity and diversity
of human language. Let's discuss 10 major challenges in NLP:
1. Language differences
Human language is rich and intricate, and thousands of languages are spoken around the world, each with its own grammar, vocabulary, and cultural nuances. No one understands all of them, and the productivity of human language is high. Natural language is also ambiguous: the same words and phrases can have different meanings in different contexts, which is a major challenge in understanding it.
Natural languages have complex syntactic structures and grammatical rules, covering word order, verb conjugation, tense, aspect, and agreement. Human language carries rich semantic content that allows speakers to convey a wide range of meanings through words and sentences. Language is also pragmatic, meaning it is used in context to achieve communication goals. Finally, human language evolves over time through processes such as lexical change.
2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the features or attributes of the data and the output is the corresponding label or target. For NLP, features might include text data, and labels could be categories, sentiments, or any other relevant annotations.
Good training data helps the model generalize patterns from the training set to make predictions or classifications on new, previously unseen data.
3. Development Time and Resource Requirements
The development time and resource requirements of Natural Language Processing (NLP) projects depend on various factors, including task complexity, the size and quality of the data, the availability of existing tools and libraries, and the expertise of the team involved. Here are some key points:
 Complexity of the task: Tasks such as text classification or sentiment analysis may require less time than more complex tasks such as machine translation or question answering.
 Availability and quality of data: NLP models require high-quality annotated data. Collecting, annotating, and preprocessing large text datasets can be time-consuming and resource-intensive, especially for tasks that require specialized domain knowledge or fine-grained annotations.
 Algorithm selection and model development: It is difficult to choose the machine learning algorithm best suited to a given NLP task.
 Training and evaluation: Training requires powerful computational resources (GPUs or TPUs) and time for iterating on the algorithms. It is also important to evaluate model performance with suitable metrics and validation techniques to confirm the quality of the results.
4. Navigating Phrasing Ambiguities in NLP
Navigating phrasing ambiguities is a crucial aspect of NLP because of the inherent complexity of human language. Phrasing ambiguity arises when a phrase can be interpreted in multiple ways, leading to uncertainty about its meaning. Here are some key points for navigating phrasing ambiguities in NLP:
 Contextual Understanding: Contextual information such as previous sentences, topic focus, or conversational cues can give valuable clues for resolving ambiguities.
 Semantic Analysis: The text is analyzed to find meaning based on word senses, lexical relationships, and semantic roles. Tools such as word sense disambiguation and semantic role labeling can help resolve phrasing ambiguities.
 Syntactic Analysis: The syntactic structure of the sentence is analyzed to find the possible interpretations based on grammatical relationships and syntactic patterns.
 Pragmatic Analysis: Pragmatic factors, such as the speaker's intentions and implicatures, are used to infer the meaning of a phrase. This analysis requires understanding the pragmatic context.
 Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and make predictions about the input phrase.
5. Misspellings and Grammatical Errors
Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these are forms of linguistic noise that can reduce the accuracy of understanding and analysis. Here are some key points for handling misspellings and grammatical errors in NLP:
 Spell Checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.
 Text Normalization: The text is normalized by converting it into a standard format, which may involve tasks such as converting text to lowercase, removing punctuation and special characters, and expanding contractions (see the sketch after this list).
 Tokenization: The text is split into individual tokens with the help of tokenization techniques. This makes it possible to identify and isolate misspelled words and grammatical errors, which makes them easier to correct.
 Language Models: Language models trained on large corpora can predict how likely a word or phrase is to be correct based on its context.
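A minimal text-normalization sketch in Python; the contraction table below is a tiny illustrative sample, not a complete resource.
# Minimal text-normalization sketch: lowercase, expand a few contractions,
# and strip punctuation (the contraction table is illustrative only)
import re

CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is"}

def normalize(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation/special characters
    return text

print(normalize("Don't worry, it's FINE!"))
# -> do not worry it is fine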

6. Mitigating Innate Biases in NLP Algorithms

Mitigating innate biases in NLP algorithms is a crucial step toward ensuring fairness, equity, and inclusivity in natural language processing applications. Here are some key points for mitigating biases in NLP algorithms:
 Data collection and annotation: It is very important to confirm that the training data used to develop NLP algorithms is diverse, representative, and free from biases.
 Bias detection and analysis: Apply bias detection and analysis methods to the training data to find biases based on demographic factors such as race, gender, and age.
 Data preprocessing: Preprocessing the training data to mitigate biases is essential, for example by debiasing word embeddings, balancing class distributions, and augmenting underrepresented samples.
 Fair representation learning: NLP models are trained to learn fair representations that are invariant to protected attributes such as race or gender.
 Model auditing and evaluation: NLP models are evaluated for fairness and bias with the help of metrics and audits. Models are evaluated on diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases.
7. Words with Multiple Meanings
Words with multiple meanings pose a lexical challenge in NLP because of the ambiguity they introduce. Such words, known as polysemous or homonymous, have different meanings depending on the context in which they are used. Here are some key points for tackling the lexical challenge posed by words with multiple meanings:
 Semantic analysis: Implement semantic analysis techniques to find the underlying meaning of a word in various contexts. Semantic representations such as word embeddings or semantic networks can capture the semantic similarity and relatedness between different word senses.
 Domain-specific knowledge: Domain knowledge is very important in NLP tasks because it provides valuable context and constraints for determining the correct sense of a word.
 Multi-word Expressions (MWEs): The meaning of the entire sentence or phrase is analyzed to disambiguate a word with multiple meanings.
 Knowledge Graphs and Ontologies: Apply knowledge graphs and ontologies to find the semantic relationships between different word senses.
8. Addressing Multilingualism
It is very important to address language diversity and multilingualism in Natural Language Processing to ensure that NLP systems can handle text data in multiple languages effectively. Here are some key points to address language diversity and multilingualism:
 Multilingual Corpora: Multilingual corpora consist of text data in various languages and serve as valuable resources for training NLP models and systems.
 Cross-Lingual Transfer Learning: These techniques transfer knowledge learned from one language to another.
 Language Identification: Design language identification models to automatically detect the language of a given text (see the sketch after this list).
 Machine Translation: Machine translation enables communication and information access across language barriers and can be used as a preprocessing step for multilingual NLP tasks.
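A minimal language-identification sketch; it assumes the third-party langdetect package (pip install langdetect) is available.
# Minimal language-identification sketch with the langdetect package
from langdetect import detect

print(detect("Natural language processing is fascinating."))           # e.g. 'en'
print(detect("El procesamiento del lenguaje natural es fascinante."))  # e.g. 'es'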
9. Reducing Uncertainty and False Positives in NLP
Reducing uncertainty and false positives is a crucial task in Natural Language Processing (NLP) because it improves the accuracy and reliability of NLP models. Here are some key points to approach the solution:
 Probabilistic Models: Use probabilistic models to quantify the uncertainty in predictions. Probabilistic models such as Bayesian networks give probabilistic estimates of outputs, which allow uncertainty quantification and better decision making.
 Confidence Scores: Confidence scores or probability estimates are calculated for NLP predictions to assess how certain the model's output is. Confidence scores help us identify cases where the model is uncertain or likely to produce false positives.
 Threshold Tuning: For classification tasks, decision thresholds are adjusted to balance sensitivity (recall) against specificity. False positives can be reduced by setting appropriate thresholds (see the sketch after this list).
 Ensemble Methods: Apply ensemble learning techniques to combine multiple models and reduce uncertainty.
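A minimal sketch of confidence scores and threshold tuning, assuming scikit-learn is installed; the toy texts and the 0.8 threshold are illustrative choices only.
# Confidence-score / threshold-tuning sketch with scikit-learn
# (the toy data and the 0.8 threshold are illustrative only)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product", "loved it", "terrible service", "awful experience"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Probability that a new text is positive (class 1)
proba = clf.predict_proba(vec.transform(["great service"]))[0][1]
print("P(positive) =", proba)

# Raising the threshold above the default 0.5 trades recall for fewer false positives
prediction = 1 if proba >= 0.8 else 0
print("prediction at threshold 0.8:", prediction)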
10. Facilitating Continuous Conversations with NLP
Facilitating continuous conversations with NLP involves developing systems that understand and respond to human language in real time, enabling seamless interaction between users and machines. Implementing real-time NLP pipelines gives the capability to analyze and interpret user input as it is received; algorithms and systems are optimized for low-latency processing to ensure quick responses to user queries and inputs.
It also requires building NLP models that can maintain context throughout a conversation. Understanding context enables systems to interpret user intent, track conversation history, and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are applied to find the underlying goals and intentions expressed by users in their messages.
How to Overcome NLP Challenges
Overcoming the challenges in NLP requires a combination of innovative technologies, domain experts, and methodological approaches. Here are some key points:
 Quantity and quality of data: High-quality, diverse data is needed to train NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are techniques for addressing data scarcity.
 Ambiguity: NLP algorithms should be trained to disambiguate words and phrases.
 Out-of-vocabulary words: Techniques such as subword tokenization, character-level modeling, and vocabulary expansion are used to handle out-of-vocabulary words.
 Lack of annotated data: Techniques such as transfer learning and pre-training can be used to transfer knowledge from large datasets to specific tasks with limited labeled data.

Lecture - 2
Natural Language Processing (NLP) vs Programming Language




In the world of computers there are, broadly, two kinds of language technology: Natural Language Processing (NLP) and programming languages. NLP is all about understanding human language, while programming languages let us tell computers what to do. But as technology grows, these two areas are starting to overlap in interesting ways, changing how we interact with machines.
This article explains the basics of both NLP and programming languages, their differences, and how they are beginning to work together. Let's dive into the topic to understand the basic difference between them.
What is Natural language processing?
Natural language processing is an area of research in computer science and artificial
intelligence (AI) concerned with processing natural languages such as English or Mandarin.
This processing generally involves translating natural language into data (numbers) that a
computer can use to learn about the world. This understanding of the world is sometimes used
to generate natural language text that reflects that understanding. A natural language
processing system is often referred to as a pipeline because it usually involves several stages of
processing where natural language flows in one end and the processed output flows out the
other.
How does Natural Language Processing (NLP) work?
1. Understand human language
2. Convert it into a computer-readable representation
3. Process the representation and learn from it
4. Generate human-readable text
Natural language processing works in multiple stages, just like a production line.
What is a Programming Language?
Programming languages are formal, structured ways of instructing a computer to perform certain tasks or manipulate data. Programming languages give humans a way to communicate with machines. Each programming language has its own set of rules and syntax that dictate how instructions are written for execution. There are several programming languages available, each designed for specific purposes, such as web development (JavaScript, Python), system programming (C, C++), data analysis (R, Python), etc.
How does a Programming Language work?
1. We decide what instructions to give the computer
2. We write code expressing those instructions
3. The code communicates the instructions to the machine
4. Multiple languages exist for giving instructions to the computer
5. Developers have multiple tools, such as code editors, to help
Differences Between Natural Language Processing and Programming Language

Parameter | Natural Language Processing | Programming Language
Purpose | Concerned with processing human natural language; one of the sub-categories of AI | A way of writing instructions to the computer
Syntax | Flexible, follows human language syntax | Strict syntax for every language
Aim | Enable computers to interact with human language | Solve computational problems and perform manipulation
Works on | Unstructured text and speech data | Structured data, variables, and program logic
Used by | Data scientists, computational linguists, and NLP experts | Programmers, software developers
Communication | Focuses on processing and understanding human language text data | Used for specifying algorithms and manipulation
Application | Chatbots, language translation, speech recognition, etc. | Developing software, applications, and algorithms
Examples | Machine translation, sentiment analysis | C, C++, Java, Python, etc.
Error management | Uses probabilistic models | Through try-catch blocks
Tools | NLTK, TensorFlow | IDEs (Integrated Development Environments), compilers
Similarities between Natural Language Processing and Programming Language

Despite so many differences, there are multiple similarities between Natural Language Processing (NLP) and programming languages:
1. Both NLP and programming languages have their own sets of syntax and rules.
2. Both provide multiple levels of abstraction.
3. Both have tools and libraries.
4. Both rely on pattern recognition algorithms.
5. Both NLP and programming languages are abstractions over complexity.
Similarities in Programming Language and Natural Language
There are multiple similarities between learning a programming language and a natural language in terms of structure, syntax, rules, and, most importantly, message transfer. However, there are basic differences in aim and purpose:
 The actual aim of natural language is to convey our message, thoughts, and emotions to others via languages like Hindi, English, etc., while programming languages are formal languages made for instructing computers and are used to build software, apps, and algorithms.
 As far as structure or syntax is concerned, natural languages have flexible syntax with a wide range of expressions that follow a particular language's grammar, while programming languages have strict rules to ensure precise instructions for computers.

Lecture -3
Ambiguity in NLP and how to address them



Ambiguity in Natural Language Processing (NLP) arises because human language can have multiple meanings. Computers sometimes struggle to understand exactly what we mean: unlike humans, who can use intuition and background knowledge to infer meaning, computers rely on precise algorithms and statistical patterns.
The sentence "The chicken is ready to eat" is ambiguous because it can be interpreted in two
different ways:
1. The chicken is cooked and ready to be eaten.
2. The chicken is hungry and ready to eat food.
This dual meaning arises from the structure of the sentence, which does not clarify the subject's
role (the eater or the one being eaten). Resolving such ambiguities is essential for accurate
NLP applications like chatbots, translation, and sentiment analysis.
This article explores types of ambiguity in NLP and methods to address them effectively.
Types of Ambiguity in NLP
The meaning of an ambiguous expression often depends on the situation, prior knowledge, or
surrounding words. For example: He is cool. This could mean he is calm under
pressure or he is fashionable depending on the context.
1. Lexical Ambiguity
Lexical ambiguity occurs when a single word has multiple meanings, making it unclear which
meaning is intended in a particular context. This is a common challenge in language.
For example, the word "bat" can have two different meanings. It could refer to a flying
mammal, like the kind you might see at night. Alternatively, "bat" could also refer to a piece of
sports equipment used in games like baseball or cricket.
For computers, determining the correct meaning of such a word requires looking at the
surrounding context to decide which interpretation makes sense.
2. Syntactic Ambiguity
Syntactic ambiguity occurs when the structure or grammar of a sentence allows for more than
one interpretation. This happens because the sentence can be understood in different ways
depending on how it is put together.
For example, take the sentence, “The boy kicked the ball in his jeans.” This sentence can be
interpreted in two different ways: one possibility is that the boy was wearing jeans and he
kicked the ball while he was wearing them. Another possibility is that the ball was inside the
boy’s jeans, and he kicked the ball out of his jeans.
A computer or NLP system must carefully analyze the structure to figure out which
interpretation is correct, based on the context.
3. Semantic Ambiguity
Semantic ambiguity occurs when a sentence has more than one possible meaning because of
how the words are combined. This type of ambiguity makes it unclear what the sentence is
truly trying to say.
For example, take the sentence, “Visiting relatives can be annoying.” This sentence could be
understood in two different ways. One meaning could be that relatives who are visiting you
are annoying, implying that the relatives themselves cause annoyance. Another meaning could
be that the act of visiting relatives is what is annoying, suggesting that the experience of going
to see relatives is unpleasant.
The confusion comes from how the words "visiting relatives" can be interpreted: is it about the
relatives who are visiting, or is it about the action of visiting? In cases like this, semantic
ambiguity makes it hard to immediately understand the exact meaning of the sentence, and the
context is needed to clarify it.
4. Pragmatic Ambiguity
Pragmatic ambiguity occurs when the meaning of a sentence depends on the speaker’s
intent, tone, or the situation in which it is said. This type of ambiguity is common in everyday
conversations, and it can be tricky for computers to understand because it often requires
knowing the broader context.
For example, consider the sentence, “Can you open the window?” In one situation, it could be
understood as a literal question asking if the person is physically able to open the window.
However, in another context, it could be a polite request, where the speaker is asking the
listener to open the window, even though they’re not directly giving an order.
The meaning changes based on the tone of voice or social context, which is something that is
difficult for NLP systems to capture without understanding the surrounding situation.
5. Referential Ambiguity
Referential ambiguity occurs when a pronoun (like "he," "she," "it," or "they") or a phrase is
unclear about what or who it is referring to. This type of ambiguity happens when the sentence
doesn’t provide enough information to determine which person, object, or idea the pronoun is
referring to.
For example, consider the sentence, “Alice told Jane that she would win the prize.” In this
case, it’s unclear whether the pronoun "she" refers to Alice or Jane. Both could be possible
interpretations, and without further context, we can’t be sure. If the sentence was about a
competition, "she" could be referring to Alice, meaning Alice is telling Jane that she would
win the prize. However, it could also mean that Alice is telling Jane that Jane would win the
prize.
6. Ellipsis Ambiguity
Ellipsis ambiguity happens when part of a sentence is left out, making it unclear what the
missing information is. This often occurs in everyday conversation or writing when people try
to be brief and omit words that are understood from the context.
For example, consider the sentence, "John likes apples, and Mary does too." The word
"does" is a shortened form of "likes apples," but it’s not explicitly stated. This creates two
possible interpretations:
1. Mary likes apples just like John, meaning both John and Mary enjoy apples.
2. Mary likes something else (not apples), and the sentence is leaving out the specific thing
she likes.
The ambiguity arises because it's unclear from the sentence whether "does" refers to liking
apples or something else.
Addressing Ambiguity in Natural Language Processing
To address ambiguity in NLP, several methods are used to accurately interpret language.
 Contextual analysis is one of the key approaches, where surrounding words and context
help determine the correct meaning of a word or phrase.
 Word sense disambiguation (WSD) resolves lexical ambiguity by using context to identify which meaning of a word is being used (see the sketch after this list).
 Parsing and syntactic analysis help resolve syntactic ambiguity by breaking down
sentence structures to understand different grammatical interpretations.
 Coreference resolution is used to clarify what pronouns or phrases refer to, solving
referential ambiguity.
 Discourse and pragmatic modeling help capture speaker intent and the social context,
which resolves pragmatic ambiguity.
 Machine learning and deep learning techniques, like BERT and GPT, leverage large
datasets to learn language patterns, aiding in resolving ambiguity.
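As a minimal sketch of word sense disambiguation, the following uses NLTK's implementation of the classic Lesk algorithm (one simple approach among many); it assumes NLTK is installed and that the wordnet and punkt resources have been downloaded.
# Minimal WSD sketch using NLTK's Lesk implementation; run
# nltk.download('wordnet') and nltk.download('punkt') first
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), "bank")
# prints the WordNet synset Lesk picks for 'bank' and its dictionary gloss
print(sense, "-", sense.definition())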

Lecture – 4

Rule Based Approach in NLP



Natural Language Processing serves as a bridge between human language and computers. It is a subfield of Artificial Intelligence that helps machines process, understand, and generate natural language intuitively. Common NLP tasks are text and speech processing, language translation, sentiment analysis, etc. Use cases include spam detection, chatbots, text summarization, etc.
There are three types of NLP approaches:
1. Rule-based Approach - Based on linguistic rules and patterns
2. Machine Learning Approach - Based on statistical analysis
3. Neural Network Approach - Based on various artificial, recurrent, and convolutional neural
network algorithms
Rule-based approach in NLP
The rule-based approach is one of the oldest NLP methods, in which predefined linguistic rules are used to analyze and process textual data. It involves applying a particular set of rules or patterns to capture specific structures, extract information, or perform tasks such as text classification. Some common rule-based techniques include regular expressions and pattern matching.
Steps in the Rule-based approach in NLP:
1. Rule Creation: Based on the desired task, domain-specific linguistic rules are created, such as grammar rules, syntax patterns, semantic rules, or regular expressions.
2. Rule Application: The predefined rules are applied to the input data to capture matching patterns.
3. Rule Processing: The text data is processed according to the matched rules to extract information, make decisions, or perform other tasks.
4. Rule Refinement: The rules are iteratively refined through repeated processing to improve accuracy and performance. Based on feedback, the rules are modified and updated when needed.

Steps in Rule-Based Approach


Libraries that can be used for a rule-based approach include spaCy (best suited for production) and NLTK (less preferred nowadays).
In this article, we'll work with the spaCy library to demonstrate the rule-based approach. spaCy is an open-source software library designed for advanced Natural Language Processing (NLP) tasks. It is built in Python and provides a wide range of functionalities for processing and analyzing large volumes of text data.
A rule-matching engine in spaCy called the Matcher can work over tokens, entities, and phrases in a manner similar to regular expressions.
Spacy Installation:
# Spacy Installation
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm  # For English language

Example 1: Matching Token with Rule-based Approach


Step 1: The necessary modules are imported
# import modules
import spacy
# import the Matcher
from spacy.matcher import Matcher
# import the Span class
from spacy.tokens import Span
Step 2: The English language spaCy model is loaded
# The English model 'en_core_web_sm' is loaded
nlp = spacy.load("en_core_web_sm")
Step 3: The input text is added and all the tokens are separated.
# The input text as a Document object
txt = ("Natural Language Processing serves as an interrelationship between human language and "
       "computers. Natural Language Processing is a subfield of Artificial Intelligence that helps "
       "machines process, understand and generate natural language intuitively.")
doc = nlp(txt)
Tokens = []
for token in doc:
    Tokens.append(token)

print('Tokens:', Tokens)
print('Number of token :', len(Tokens))
Output:
Tokens: [Natural, Language, Processing, serves, as, an, interrelationship, between, human,
language, and, computers, ., Natural, Language, Processing, is, a, subfield, of, Artificial,
Intelligence, that, helps, machines, process, ,, understand, and, generate, natural,
language, intuitively, .]
Number of token : 34
Step 4: The rule-based matching engine 'Matcher' is loaded.
# Matcher class object instantiation
matcher = Matcher(nlp.vocab)
Step 5: The rule or the pattern to be searched in the text is added. Here the words 'language'
and 'human' are set as patterns.
#pattern to be searched
pattern = [[{'LOWER': 'language'}],[{'LOWER':'human'}]]
Step 6: The pattern is added to the matcher object using the 'add' method, with the first parameter as the match ID and the second parameter as the list of patterns.
# adding the pattern/rule to the matcher object
matcher.add("TokenMatch", pattern)
Step 7: The matcher object is called with the 'doc' object as input to match the pattern. The result is stored in the 'matches' variable.
#Matcher object called
#returns match_id, start and stop indexes of the matched words
matches = matcher(doc)
Step 8: The matched results are extracted and printed.
# Extracting matched results
for m_id, start, end in matches:
    string_id = nlp.vocab.strings[m_id]
    span = doc[start:end]
    print('match_id:{}, string_id:{}, Start:{}, End:{}, Text:{}'.format(
        m_id, string_id, start, end, span.text))
Output:
match_id:9580390278045680890, string_id:TokenMatch, Start:1, End:2, Text:Language
match_id:9580390278045680890, string_id:TokenMatch, Start:8, End:9, Text:human
match_id:9580390278045680890, string_id:TokenMatch, Start:9, End:10, Text:language
match_id:9580390278045680890, string_id:TokenMatch, Start:14, End:15, Text:Language
match_id:9580390278045680890, string_id:TokenMatch, Start:31, End:32, Text:language

Example 2: Matching Phrases with the Rule-based Approach

Step 1: The PhraseMatcher module is imported from spaCy
# import necessary modules
import spacy
from spacy.matcher import PhraseMatcher
Step 2: The English language spaCy model is loaded
# The English model 'en_core_web_sm' is loaded
nlp = spacy.load('en_core_web_sm')
Step 3: The input text is added as a 'doc' object
# The input text as a Document object
txt = ("Natural Language Processing serves as an interrelationship between human language and "
       "computers. Natural Language Processing is a subfield of Artificial Intelligence that helps "
       "machines process, understand and generate natural language intuitively.")
doc = nlp(txt)
print(doc)
Output:
Natural Language Processing serves as an interrelationship between human language and
computers.
Natural Language Processing is a subfield of Artificial Intelligence that helps machines process,
understand and generate natural language intuitively.
Step 4: The PhraseMatcher object is instantiated.
# PhraseMatcher object creation
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
Step 5: The list of phrases is added to term_list, which is converted to pattern objects using the 'make_doc' method to speed up processing.
# list of phrases
term_list = ["Language Processing", "human language"]
# phrases into document objects
patterns = [nlp.make_doc(t) for t in term_list]
Step 6: The created rule is added to the matcher object
# patterns added to the matcher object
matcher.add("Phrase Match", patterns)
Step 7: The matcher object is called on the input text 'doc' with the parameter 'as_spans=True', which returns Span objects directly. The extracted results are printed.
# Matcher object called. It returns Span objects directly
matches = matcher(doc, as_spans=True)
# Extracting matched results
for span in matches:
    print(span.text, ":-", span.label_)
Output:
Language Processing :- Phrase Match
human language :- Phrase Match
Language Processing :- Phrase Match

Example 3: Named Entity Recognition with spaCy

Step 1: Import spacy and load the English language spaCy model
# import spacy
import spacy
# Load the English language spaCy model
nlp = spacy.load("en_core_web_sm")
Step 2: Named Entity Recognition with spaCy
# The input text as a Document object
txt = """
My name is Pawan Kumar Gunjan. I live in India.
India, officially the Republic of India, is a country in South Asia.
It is the seventh-largest country by area and the second-most populous country.
Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest,
and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;
China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.
"""
doc = nlp(txt)
for entity in doc.ents:
    print('Text:{}, Label:{}'.format(entity.text, entity.label_))
Output:
Text:Pawan Kumar Gunjan, Label:PERSON
Text:India, Label:GPE
Text:India, Label:GPE
Text:the Republic of India, Label:GPE
Text:South Asia, Label:LOC
Text:seventh, Label:ORDINAL
Text:second, Label:ORDINAL
Text:the Indian Ocean, Label:LOC
Text:the Arabian Sea, Label:LOC
Text:the Bay of Bengal, Label:LOC
Text:Pakistan, Label:GPE
Text:China, Label:GPE
Text:Nepal, Label:GPE
Text:Bhutan, Label:GPE
Text:Bangladesh, Label:GPE
Text:Myanmar, Label:GPE
Advantages of the Rule-based approach:
 Easily interpretable as rules are explicitly defined
 Rule-based techniques can help semi-automatically annotate some data in domains where you don't have annotated data (for example, NER (Named Entity Recognition) tasks in a particular domain).
 Functions even with scant or poor training data
 Computation time is fast and it offers high precision
 Many times, deterministic solutions to various issues, such as tokenization, sentence
breaking, or morphology, can be achieved through rules (at least in some languages).
Disadvantages of the Rule-based approach:
 Labor-intensive as more rules are needed to generalize
 Generating rules for complex tasks is time-consuming
 Needs regular maintenance
 May not perform well in handling variations and exceptions in language usage
 May not have a high recall metric
Why combine the Rule-based Approach with Machine Learning and Neural Network Approaches?
1. Rule-based NLP usually handles edge cases well when combined with other approaches.
2. It helps to speed up data annotation. For instance, a rule-based technique can handle URL formats, date formats, etc., while a machine learning approach determines the position of text in a PDF file (including numerical data).
3. In languages other than English, annotated data is really scarce even for common tasks, which rule-based NLP can still carry out.
4. A rule-based approach also improves the computational performance of the pipeline.

Lecture - 5
Phases of Natural Language Processing (NLP)



Natural Language Processing (NLP) helps computers understand, analyze, and interact with human language. It involves a series of phases that work together to process language, and each phase helps in understanding the structure and meaning of human language. In this article, we will understand these phases.
Phases of NLP

1. Lexical and Morphological Analysis


Lexical Analysis
It focuses on identifying and processing words (or lexemes) in a text. It breaks down the input
text into individual tokens that are meaningful units of language such as words or phrases.
Key tasks in Lexical analysis:
1. Tokenization: The process of dividing a text into smaller chunks called tokens. For example, the sentence "I love programming" would be tokenized into ["I", "love", "programming"].
2. Part-of-Speech Tagging: Assigning parts of speech such as noun, verb, adjective to each
token in the sentence. This helps us to understand grammatical roles of words in the context.
Example: Consider the sentence: "I am reading a book."
 Tokenization: Sentence is broken down into individual tokens or words: ["I", "am",
"reading", "a", "book"]
 Part-of-Speech Tagging: Each token is assigned a part of speech: ["I" → Pronoun (PRP),
"am" → Verb (VBP), "reading" → Verb (VBG), "a" → Article (DT), "book" → Noun (NN)]
Importance of Lexical Analysis
 Word Identification: It breaks text into tokens which helps the system to understand
individual words for further processing.
 Text Simplification: It simplifies text through tokenization and stemming which improves
accuracy in NLP tasks.
Morphological Analysis
It deals with morphemes, the smallest units of meaning in a word. It is important for understanding the structure of words and their parts by identifying free morphemes (independent words like "cat") and bound morphemes (prefixes or suffixes such as "un-" or "-ing").
Key tasks in morphological analysis:
1. Stemming: Reducing words to their root form, like "running" to "run".
2. Lemmatization: Converting words to their base or dictionary form considering the context, like "better" becomes "good".
Importance of Morphological Analysis
1. Understanding Word Structure: It helps in breaking down the composition of complex words.
2. Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing, and machine translation.
By identifying and analyzing morphemes, a system can interpret text correctly at the most basic level, which supports more advanced NLP applications.
2. Syntactic Analysis (Parsing)
Syntactic Analysis helps in understanding how the words in a sentence are arranged according to grammar rules. It checks that the sentence follows correct grammar, which makes the meaning clearer. The goal is to create a parse tree, a diagram showing the structure of the sentence. It breaks the sentence into parts like the subject, verb, and object and shows how these parts are connected. This helps machines understand the relationships between words in the sentence.
Key components of syntactic analysis include:
 POS Tagging: Assigning parts of speech (noun, verb, adjective) to words in a sentence, as discussed earlier.
 Ambiguity Resolution: Handling words that have multiple meanings (e.g., "book" can be a noun or a verb).
Examples
Consider the following sentences:
 Correct Syntax: "John eats an apple."
 Incorrect Syntax: "Apple eats John an."
Despite using the same words, only the first sentence is grammatically correct and makes sense. The correct arrangement of words according to grammatical rules is what makes the sentence meaningful. By analyzing sentence structure, NLP systems can better understand and generate human language. This helps in tasks like machine translation, sentiment analysis, and information retrieval by making the text clearer and reducing confusion.
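A minimal dependency-parsing sketch for the correct sentence above, using spaCy (the parse labels depend on the trained model):
# Dependency-parsing sketch with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John eats an apple.")
for token in doc:
    # each word, its syntactic role, and the word it attaches to
    print(token.text, token.dep_, token.head.text)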
3. Semantic Analysis
Semantic Analysis focuses on understanding the meaning behind words and sentences. It ensures that the text is not only grammatically correct but also logically coherent and contextually relevant. It aims to understand both the dictionary definitions of words and their usage in context, and to determine whether the arrangement of words in a sentence makes logical sense.
Key Tasks in Semantic Analysis
1. Named Entity Recognition (NER): It identifies and classifies entities such as names of
people, locations, organizations, dates and more. These entities provide important meaning in
the text and help in understanding the context. For example in the sentence "Tesla announced
its new electric vehicle in California," NER would identify "Tesla" as an organization and
"California" as a location.
2. Word Sense Disambiguation (WSD): Many words have multiple meanings depending on
the context in which they are used. It identifies the correct meaning of a word based on its
surrounding text. For example word "bank" can refer to a financial institution or the side of a
river. It uses context to identify which meaning applies in a given sentence which ensures
that interpretation is accurate.
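A minimal NER sketch for the example in point 1, using spaCy (the entity labels depend on the trained model):
# Named-entity-recognition sketch with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla announced its new electric vehicle in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Tesla ORG, California GPE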
Example of Semantic Analysis
"Apple eats a John." while grammatically correct this sentence doesn’t make sense semantically
because an apple cannot "eat" a person. Semantic analysis ensures that the meaning is logically
sound and contextually appropriate. It is important for various NLP applications including
machine translation, information retrieval and question answering.
4. Discourse Integration
It is the process of understanding how individual sentences or segments of text connect and relate to each other within a broader context. This phase ensures that the meaning of a text is consistent and coherent across multiple sentences or paragraphs. It is important for understanding long or complex texts, where meaning depends on previous statements.
Key aspects of discourse integration:
 Anaphora Resolution: Anaphora refers to the use of pronouns or other references that depend on earlier parts of the text. For example, in the sentence "Taylor went to the store. She bought groceries," the pronoun "She" refers back to "Taylor." Anaphora resolution ensures that references like these are correctly understood by linking them to their antecedents.
 Contextual References: Many words or phrases can only be fully understood in the context of the surrounding sentences. Discourse integration helps in interpreting how such words or phrases depend on that context. For example, "It was a great day" is clearer when you know what event or situation is being discussed.
Example of Discourse Integration
1. "Taylor went to the store to buy some groceries. She realized she forgot her wallet." Understanding that "Taylor" is the antecedent of "she" is important for understanding the sentence's meaning.
2. "This is unfair!" To understand what "this" refers to, we need to look at the surrounding sentences. Without context, the statement's meaning remains unclear.
Discourse integration is important for NLP applications like machine translation, chatbots, and text summarization. It ensures that meaning remains consistent across sentences, which helps machines understand context. This enables accurate and natural responses in applications like conversational AI and document translation.
5. Pragmatic Analysis
Pragmatic analysis helps in understanding the deeper meaning behind words and sentences by looking beyond their literal meanings. While semantic analysis looks at the direct meaning, pragmatic analysis considers the speaker's or writer's intentions, tone, and the context of the communication.
Key tasks in pragmatic analysis:
 Understanding Intentions: Sometimes language doesn't mean what it says literally. For example, when someone asks "Can you pass the salt?" it's not about ability but a polite request. Pragmatic analysis helps uncover the true intention behind such expressions.
 Figurative Meaning: Language often uses idioms or metaphors that can't be taken literally.
Examples of Pragmatic Analysis
 "Hello! What time is it?" might be a straightforward request for the current time, but it could also imply concern about being late.
 "I'm falling for you" means "I love you," not literally falling. Pragmatic analysis helps interpret these non-literal meanings.
Pragmatic analysis is important for NLP tasks like sentiment analysis, chatbots, and conversation-based AI. It helps machines understand the speaker's intentions, tone, and context, which go beyond the literal meaning of words. By identifying sarcasm and emotions, it helps systems respond naturally, improving human-computer interaction. By combining all of these phases, NLP systems can effectively interpret, analyze, and generate human language, enabling more intelligent and natural interactions between humans and machines.

Applications of NLP
Machine Translation (MT)
Translating text/speech from one language to another.
Example: Google Translate converting 'Good morning' into another language.
Use case: Cross-language communication, multilingual websites.

Information Extraction (IE)


Extracting entities, relations, events from text.
Example: From 'Apple acquired Beats in 2014 for $3 billion' → Entities: Apple, Beats, 2014, $3
billion.
Use case: Resume parsing, financial news analysis, knowledge graph building.

Question Answering (QA)


Answering user queries in natural language.
Example: Q: 'Who is the current president of India?' → A: Droupadi Murmu.
Use case: Search engines, customer service bots, digital assistants.

Sentiment Analysis / Opinion Mining


Identifying emotions or polarity (positive, negative, neutral) in text.
Example: 'The movie was amazing!' → Positive sentiment.
Use case: Brand monitoring, customer feedback analysis.

Text Summarization
Generating concise summaries from documents.
Example: Summarizing a 5-page article into key bullet points.
Use case: News apps, research paper summarizers.

Speech Recognition & Synthesis


Converting speech ↔ text.
Example: ASR: Google Voice Typing, TTS: Screen readers.
Use case: Voice assistants, transcription services.

Chatbots & Conversational Agents


Interacting with humans in natural language.
Example: Banking chatbots, ChatGPT.
Use case: Customer service, healthcare advice, education.

Document Classification
Automatically categorizing documents.
Example: Spam email detection, news article classification.
Use case: Email filtering, digital libraries.

Named Entity Recognition (NER)


Identifying proper names of people, organizations, places.
Example: 'Elon Musk founded SpaceX in 2002' → Entities: Elon Musk (Person), SpaceX
(Organization), 2002 (Date).
Use case: Search indexing, medical records analysis.

Text Generation
Producing human-like text.
Example: ChatGPT generating essays, code, or stories.
Use case: Creative writing, report automation, coding assistants.

Challenges in NLP
Ambiguity in Language
• Lexical ambiguity: Word has multiple meanings (e.g., 'bank' → river bank vs. financial bank).
• Syntactic ambiguity: Multiple parse structures (e.g., 'I saw the man with a telescope').
• Semantic ambiguity: Meaning confusion (e.g., 'Visiting relatives can be annoying').
• Pragmatic ambiguity: Context-dependent (e.g., 'Can you open the window?' → request, not
ability check).

Variability of Natural Language


Many ways to say the same thing (e.g., 'I’m fine', 'Doing good', 'All okay').
This makes text understanding harder.

Data Sparsity
Rare words, idioms, and domain-specific terms are not seen often in training corpora.
Example: Technical jargon, medical or legal terms.

Lack of World Knowledge (Commonsense Reasoning)


Machines often fail at commonsense reasoning.
Example: 'The trophy doesn’t fit into the suitcase because it is too small.' → 'it' refers to
suitcase, not trophy.

Context Understanding
Words change meaning depending on context.
Example: 'Apple' → fruit vs. 'Apple' → company.

Multilinguality & Code-Mixing


Handling multiple languages and mixed sentences.
Example: 'Kal I went to market and bought apples' (Hindi + English mix).

Low-Resource Languages
Most NLP progress is for English; many languages lack large datasets.
Challenge in building fair, global NLP systems.

Ethical Issues & Bias


Models trained on biased data may produce biased or offensive outputs.
Example: Gender or racial bias in text generation.

Computational Complexity
Modern NLP models (Transformers, GPT, BERT) need huge memory, data, and compute power.
Difficult for small organizations or low-resource environments.
