Unit I

INTRODUCTION

Origins and challenges of NLP – Language Modeling: Grammar-based LM, Statistical LM – Regular
Expressions, Finite-State Automata – English Morphology, Transducers for lexicon and rules,
Tokenization, Detecting and Correcting Spelling Errors, Minimum Edit Distance

Introduction to NLP
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction
between computers and human languages. The goal is to enable machines to read, understand, generate, and
interpret natural language.
The ability of machines to interpret human language is now at the core of many applications that we
use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice
assistants, and social language translators. The input and output of an NLP system can be speech or written
text.
Applications and Purpose of NLP

● To automate tasks like translation (e.g., Google Translate), speech recognition (e.g., Siri, Alexa),
sentiment analysis, question answering, etc.
● Bridge the gap between human communication (natural languages) and computer understanding
(machine languages).

Origins of NLP
Understand the historical evolution of Natural Language Processing (NLP) – how it moved from rule-based
symbolic systems to modern deep learning models like GPT and BERT.
1. 1950s–1960s: The Rule-Based Era & Symbolic AI
NLP started as part of Artificial Intelligence (AI). The Turing Test (1950), proposed by Alan Turing, asked
whether a machine could imitate human language convincingly. In the early days of NLP, systems were built
entirely using explicit grammar rules written by human experts. These systems could analyse sentence structure
but had no learning capability or understanding of meaning.
Used In:

● Machine translation experiments (e.g., Georgetown-IBM project)

● Syntax-based parsers

● Formal language models using Context-Free Grammars (CFG)

● Development of Turing Test concept

Limitations:

● Could not deal with ambiguity, context, or exceptions in natural language


● No learning: All knowledge had to be manually encoded

● Lacked understanding of meaning or semantics

2. 1970s–1980s: Knowledge-Based NLP & Semantic Parsing


With the rise of AI planning and expert systems, NLP systems began to include world knowledge to understand
context better. The idea was that for a machine to understand language, it must also understand the meaning
and the real-world context. Systems used frames, ontologies, and logic programming (like Prolog) to represent
knowledge and infer meaning.
Used In:

● SHRDLU (robot in blocks world)

● Expert systems using semantic networks

● Prolog-based NLP systems

● Dialogue systems in limited domains

● Morphological analysis tools

Limitations:

● Hard to scale

● Required domain-specific knowledge

● Could not generalize well across languages or tasks

3. 1990s: The Statistical Revolution in NLP


A major shift occurred when researchers began using statistical and machine learning techniques to
model language. Instead of relying on hand-coded rules, systems were trained on large text corpora. This
marked the beginning of data-driven NLP, especially with the use of probability and frequency to model
words and phrases.
Used In:

● N-gram language models (Unigram, Bigram, Trigram)

● POS tagging using Hidden Markov Models (HMMs)

● Spoken language processing

● Statistical machine translation

● Corpus-based learning (e.g., Brown Corpus, Penn Treebank)

Limitations:
● Required large labelled data

● Struggled with long-distance dependencies


● Data sparsity issues (never-seen word combinations)

4. 2000s–Present: Machine Learning to Deep Learning to Large Language Models (LLMs)

NLP was transformed first by machine learning and later by deep learning, with the introduction of neural
networks, especially Recurrent Neural Networks (RNNs) and, from 2017, the Transformer architecture.
Transformer-based systems became capable of modelling long-range context, semantics, and language
generation. These models are trained on massive datasets and often fine-tuned for specific tasks.

Used In:

● BERT (2018) – Bidirectional understanding of text

● GPT-1, 2, 3, 4 – Text generation, conversation

● Machine translation (e.g., Google Translate)

● Speech-to-text systems (e.g., Siri, Alexa)

● Sentiment analysis, summarization, chatbots

● Large Language Models (LLMs) like Claude, LLaMA, Gemini

Limitations:

● Requires large compute and data

● Biased, opaque, sometimes hallucinate

● Ethical concerns in deployment

Challenges of Natural Language Processing (NLP)


Natural Language Processing aims to enable machines to understand and generate human language.
However, achieving this is extremely complex due to several inherent challenges in language itself.
Below are the key issues faced in NLP:

1. Ambiguity

Ambiguity occurs when a word, phrase, or sentence can have multiple interpretations.

For example, the sentence “He saw the man with the telescope” can mean:

● He used a telescope to see the man.

● He saw a man who was holding a telescope.


NLP systems must analyze grammar, context, and word usage to resolve such ambiguities, which is
not easy for machines.

2. Context Understanding

Human language is highly context-dependent.

For example, the word "bank" can mean:

● A financial institution (as in “He went to the bank”), or

● The side of a river (as in “He sat on the river bank”).

Without understanding the surrounding context, machines cannot accurately interpret the meaning.

3. Morphology

Morphology is the study of word formation and structure. Words can exist in multiple forms (e.g., run,
runs, ran, running), and NLP must recognise that these all relate to the same root word (lemma).
Handling these variations accurately is crucial for tasks like searching, translation, or
sentiment analysis.

4. Syntax vs. Semantics

Syntax refers to grammatical structure, while semantics relates to meaning. A sentence can be
grammatically correct but semantically nonsensical.

● Example: “Colourless green ideas sleep furiously.” (Correct syntax, but meaningless)

NLP systems must handle both aspects to understand human language truly.

5. World Knowledge (Common Sense)

Humans use background knowledge and common sense to understand language, but machines do not
have real-world experience.

Example: “John dropped the glass. It broke.” – We understand "it" refers to "glass" due to common
sense, but a computer needs logical reasoning or trained data to infer this.

6. Multilingualism

There are thousands of languages worldwide, each with unique grammar rules, writing systems,
idioms, and cultural nuances.

Designing NLP systems that support multilingual communication, code-mixing (e.g., "Tanglish"), and
translation is extremely challenging.

7. Data Sparsity

Language has infinite possible word combinations, but training data is always finite.
Some phrases or sentences might never appear in the training data, making it hard for statistical or
machine learning models to assign probabilities or make predictions.

This is especially problematic for rare words or low-resource languages.

8. Spelling, Slang, and Noisy Text

Real-world data (like social media, chats, or forums) often contains typos, informal abbreviations,
emojis, and slang.

● Example: “u r gr8” instead of “You are great”

NLP models must be trained to normalize and interpret noisy language effectively.

Language Model (LM)

A Language Model (LM) is a statistical or probabilistic tool that assigns a probability to a sequence of
words. In simple terms, it helps predict what word is likely to come next in a sentence. Language models
are fundamental to many NLP applications, such as speech recognition, machine translation, text
prediction, spelling correction, and chatbots.

For example,

If we type “I want to eat…” a good language model might predict “pizza” or “food” as the most
probable next word.
1. Grammar-Based Language Models
Grammar-based LMs use predefined rules and syntax to define how sentences can be structured in a
language. These are based on formal grammar systems like Context-Free Grammar (CFG). This
method is rule-based and is often used for parsing tasks, where the structure of the sentence (syntax tree)
is important.

Example:

NP → Det N

VP → V NP

Det → the | a

N → cat | dog

V → eats | drinks

This allows the system to generate or validate syntactically correct sentences. However, grammar-based
models struggle with real-world language data due to their strictness and inability to handle ambiguity and
variations naturally found in human language.
Advantages:

● Precise and explainable.

● Good for rule-based applications (e.g., programming languages, compilers).

Disadvantages:
● Hard to scale to real human language.

● Cannot model probabilities (i.e., what is common or rare).
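
Below is a minimal sketch (not from the original text) of a grammar-based model using NLTK's CFG tools. It adds an assumed top-level rule S → NP VP so that complete sentences can be parsed, and it assumes the nltk package is installed.

import nltk

# Toy grammar mirroring the rules above, plus an assumed S -> NP VP rule.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'eats' | 'drinks'
""")

parser = nltk.ChartParser(grammar)
sentence = "the cat eats a dog".split()

# Print every parse tree the grammar licenses; an ungrammatical
# string such as "cat the eats" would produce no trees at all.
for tree in parser.parse(sentence):
    print(tree)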

2. Statistical Language Models (SLM) – Detailed Explanation


A Statistical Language Model (SLM) is a type of language model that uses mathematics and probability
to predict how likely a sequence of words is to occur in a given language. Instead of relying on hand-written
grammar rules, SLMs learn patterns from large text corpora (datasets) and estimate the probability of
word sequences based on how often they appear in the training data.
At the heart of statistical language modeling is the chain rule of probability, which is used to calculate the
probability of a sequence of words. Suppose we want to compute the probability of a sentence like:
P("I want to eat pizza")
We can break it into conditional probabilities:
P(I) × P(want|I) × P(to|I want) × P(eat|I want to) × P(pizza|I want to eat)
However, calculating this full sequence is computationally expensive and often impractical due to data
sparsity (not all word combinations appear in the corpus). To simplify, we use N-gram models, which
assume that the probability of a word depends only on the previous N–1 words. For example:

● A unigram model assumes each word is independent:

P(w1, w2, w3) = P(w1) × P(w2) × P(w3)

● A bigram model uses the previous word:

P(w3 | w2)

● A trigram model uses the previous two words:

P(w3 | w1, w2)


In a bigram model, the probability of the sentence "I want pizza" would be computed as:
P(I) × P(want | I) × P(pizza | want)
These probabilities are calculated using frequency counts from the training data:

P(Wn | Wn−1) = Count(Wn−1, Wn) / Count(Wn−1)

Example:
Given a corpus:

● Count("want pizza") = 30

● Count("want") = 100
Then:

P(pizza | want) = 30 / 100 = 0.3


So, the model assigns a 30% chance that “pizza” follows “want”.
Limitations:
● Data Sparsity: Some combinations of words may never appear, even if they are valid.

● Fixed Context Window: N-gram models only look at a limited number of previous words.

● Poor long-term dependencies: They don’t remember what was said earlier in the sentence if it’s
more than N words away.
Input Sentence: I want to eat _____

Bigram Model: P(want|I) × P(to|want) × P(eat|to) × P(pizza|eat)

Most probable next word: pizza
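
As an illustration only, the sketch below builds bigram counts from a tiny made-up corpus and computes P(next | previous) with plain maximum-likelihood estimates; the corpus and the resulting probabilities are assumptions for demonstration.

from collections import Counter, defaultdict

corpus = [
    "i want to eat pizza",
    "i want to eat food",
    "i want pizza",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(curr, prev):
    # P(curr | prev) = Count(prev, curr) / Count(prev)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("eat", "to"))              # 1.0 in this toy corpus
print(bigram_counts["eat"].most_common(1))   # most likely word after "eat"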

Regular Expressions (Regex)


A Regular Expression (Regex) is a sequence of characters that defines a search pattern used for
matching strings. It is widely used in text processing, search engines, syntax checking, NLP tasks, and
more. In NLP, regex is used to match patterns in natural language texts, such as identifying dates,
phone numbers, email IDs, or even specific word structures.
Components of a Regex
(a) Literals

● Exact characters to match.


Example: cat matches "cat" only.

(b) Metacharacters (Special meaning)


Symbol   Meaning                  Example

.        Any single character     a.c matches abc, axc

^        Start of string          ^Hello matches "Hello world"

$        End of string            world$ matches "Hello world"

|        OR between expressions
(c) Quantifiers
Symbol Meaning

* 0 or more times

+ 1 or more times

? 0 or 1 time

{n} Exactly n times

{n,} At least n times

{n,m} Between n and m times


(d) Character Classes
Pattern Matches

[abc] a, b, or c

[^abc] NOT a, b, or c

[a-z] Lowercase letters

\d Digits

\D Non-digits

\w Word characters (letters, digits, _)

\W Non-word characters

\s Whitespace

\S Non-whitespace
Basic Components of Regular Expressions:

.      Matches any single character (except newline). Example: a.c matches "abc", "a9c", "a_c", but not "ac"

*      Matches zero or more repetitions of the preceding character or group. Example: ab* matches "a", "ab", "abb", "abbb", etc.

+      Matches one or more repetitions of the preceding character or group. Example: ab+ matches "ab", "abb", but not "a"

?      Matches zero or one occurrence of the preceding character or group. Example: ab? matches "a" or "ab"

|      Acts as a logical OR between expressions

[]     Matches any one character inside the brackets. Example: [aeiou] matches any vowel: "a", "e", "i", "o", "u"

[^ ]   Matches any character except those inside the brackets. Example: [^aeiou] matches any consonant

()     Groups expressions together for applying operators. Example: (ab)* matches "", "ab", "abab", "ababab", etc.

^      Matches the beginning of a string. Example: ^Hello matches "Hello world" but not "Say Hello"

$      Matches the end of a string. Example: end$ matches "This is the end" but not "end of line"
Regex in NLP
● Tokenization: Split sentences into words using patterns like \w+.
● Removing Punctuation: Replace [^a-zA-Z0-9\s] with an empty string.
● Detecting Email/URLs: Extract structured information from raw text.

Python Regex
Python provides the re module for working with regular expressions:
● re.compile() – compiles the regex pattern into a regex object (like Pattern in Java).
● re.finditer() – returns an iterator of match objects (like Matcher in Java).
Example code:
import re                                # Import Python's regular expression module

text = "Roll numbers: 101, 102, 103"     # Sample text containing roll numbers
pattern = r"\d{3}"                       # Regex pattern to match exactly 3 digits

matches = re.finditer(pattern, text)     # Find all non-overlapping matches
for match in matches:                    # Loop through each match object
    print("Found:", match.group())       # Print the matched string
Output
Found: 101
Found: 102
Found: 103

Finite Automata

Finite automata are abstract machines used to recognize patterns in input sequences, forming the basis for
understanding regular languages in computer science. They consist of states, transitions, and input symbols,
processing each symbol step-by-step. If the machine ends in an accepting state after processing the input, it
is accepted; otherwise, it is rejected. Finite automata come in two forms, deterministic (DFA) and
non-deterministic (NFA), both of which recognize the same set of regular languages. They are widely used in
text processing, compilers, and network protocols.

Features of Finite Automata


● Input: A Set of symbols or characters provided to the machine.
● Output: Accept or reject based on the input pattern.

● States of Automata: The conditions or configurations of the machine.

● State Relation: The transitions between states.

● Output Relation: Based on the final state, the output decision is made.

Formal Definition of Finite Automata


A finite automaton can be defined as a tuple:
{ Q, Σ, q, F, δ }, where:

● Q: Finite set of states

● Σ: Set of input symbols

● q: Initial state

● F: Set of final states

● δ: Transition function

Types of Finite Automata


There are two types of finite automata:

● Deterministic Finite Automata (DFA)

● Non-Deterministic Finite Automata (NFA)

1. Deterministic Finite Automata (DFA)

A DFA is represented as {Q, Σ, q, F, δ}. In a DFA, for each input symbol, the machine transitions to one and
only one state. A DFA does not allow any null transitions, meaning every state must have a transition defined
for every input symbol.
A DFA consists of 5 tuples {Q, Σ, q, F, δ}:
Q : set of all states
Σ : set of input symbols (the symbols the machine takes as input)
q : initial state (the starting state of the machine)
F : set of final states
δ : transition function, defined as δ : Q × Σ → Q

Example:
Construct a DFA that accepts all strings ending with 'a'.

Given:
Σ = {a, b},
Q = {q0, q1},
F = {q1}

State\Symbol a b

q0 q1 q0

q1 q1 q0

In this example, if the string ends in 'a', the machine reaches state q1, which is an accepting state.
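
The following is a small Python sketch of simulating the DFA above; the transition table and the accepting state are taken directly from the example (strings over {a, b} that end in 'a' are accepted).

def dfa_accepts(string):
    # Transition table from the example above.
    transitions = {
        ("q0", "a"): "q1", ("q0", "b"): "q0",
        ("q1", "a"): "q1", ("q1", "b"): "q0",
    }
    state = "q0"                      # initial state
    for symbol in string:
        state = transitions[(state, symbol)]
    return state == "q1"              # q1 is the only accepting state

print(dfa_accepts("abba"))   # True  (ends in 'a')
print(dfa_accepts("abab"))   # False (ends in 'b')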

2) Non-Deterministic Finite Automata (NFA)

NFA is similar to DFA but includes the following features:

● It can transition to multiple states for the same input.

● It allows null (ϵ) moves, where the machine can change states without consuming any input.

Example:
Construct an NFA that accepts strings ending in 'a'.
Given:
Σ = {a, b},
Q = {q0, q1},
F = {q1}
[Fig 2: State transition diagram for the NFA with Σ = {a, b}]
State transition table for the above automaton:

State\Symbol a b

q0 {q0,q1} q0

q1 φ φ

In an NFA, if any transition leads to an accepting state, the string is accepted.
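
A corresponding sketch for the NFA tracks the set of all states the machine could be in after each symbol; the transition table is the one given above, and the string is accepted if any reachable state is an accepting state.

def nfa_accepts(string):
    # Transition table from the example above (phi is the empty set).
    transitions = {
        ("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
        ("q1", "a"): set(),        ("q1", "b"): set(),
    }
    states = {"q0"}                   # set of currently active states
    for symbol in string:
        next_states = set()
        for s in states:
            next_states |= transitions[(s, symbol)]
        states = next_states
    return "q1" in states             # accept if any path reaches q1

print(nfa_accepts("ba"))   # True  (ends in 'a')
print(nfa_accepts("ab"))   # False (ends in 'b')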


Comparison of DFA and NFA
Although NFAs appear more flexible, they do not have more computational power than DFAs. Every NFA
can be converted to an equivalent DFA, although the resulting DFA may have more states.

● DFA: Single transition for each input symbol, no null moves.

● NFA: Multiple transitions and null moves allowed.

● Power: Both DFA and NFA recognize the same set of regular languages.

Introduction to Morphological Analysis


Morphology is the branch of linguistics concerned with the structure and form of words in a language.
Morphological analysis, in the context of NLP, refers to the computational processing of word structures. It
aims to break down words into their constituent parts, such as roots, prefixes, and suffixes, and understand
their roles and meanings. This process is essential for various NLP tasks, including language modeling, text
analysis, and machine translation.

Importance of Morphological Analysis


Morphological analysis is a critical step in NLP for several reasons:

1. Understanding Word Formation: It helps in identifying the basic building blocks of words,
which is crucial for language comprehension.
2. Improving Text Analysis: By breaking down words into their roots and affixes, it enhances the
accuracy of text analysis tasks like sentiment analysis and topic modeling.
3. Enhancing Language Models: Morphological analysis provides detailed insights into word
formation, improving the performance of language models used in tasks like speech recognition
and text generation.
4. Facilitating Multilingual Processing: It aids in handling the morphological diversity of different
languages, making NLP systems more robust and versatile.
Key Techniques used in Morphological Analysis for NLP Tasks
Morphological analysis involves breaking down words into their constituent morphemes (the smallest units
of meaning) and understanding their structure and formation. Various techniques can be employed to
perform morphological analysis, each with its own strengths and applications.
Here are some of the key techniques used in morphological analysis:
1. Stemming
Stemming reduces words to their base or root form, usually by removing suffixes. The resulting stems are
not necessarily valid words but are useful for text normalization.
Common ways to implement stemming in python:

● Porter Stemmer: One of the most popular stemming algorithms, known for its simplicity and
efficiency.

● Snowball Stemmer: An improvement over the Porter Stemmer, supporting multiple languages.

● Lancaster Stemmer: A more aggressive stemming algorithm, often resulting in shorter stems.
2. Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma). It considers the context and part of
speech, producing valid words. To implement lemmatization in python, WordNet Lemmatizer is used,
which leverages the WordNet lexical database to find the base form of words.
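
A short sketch contrasting the two techniques with NLTK's PorterStemmer and WordNetLemmatizer follows (this assumes nltk and its WordNet data are installed; the outputs in the comments are what these tools typically produce).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]

# Stemming chops suffixes; the result may not be a valid word.
print([stemmer.stem(w) for w in words])                   # ['run', 'studi', 'better']

# Lemmatization returns dictionary forms, here treating each word as a verb.
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # ['run', 'study', 'better']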
3. Morphological Parsing
Morphological parsing involves analyzing the structure of words to identify their morphemes (roots,
prefixes, suffixes). It requires knowledge of morphological rules and patterns. Finite-State Transducers
(FSTs) is uses as a tool for morphological parsing.
Finite-State Transducers (FSTs)
FSTs are computational models used to represent and analyze the morphological structure of words. They
consist of states and transitions, capturing the rules of word formation.
Applications:

● Morphological Analysis: Parsing words into their morphemes.

● Morphological Generation: Generating word forms from morphemes.


4. Neural Network Models
Neural network models, especially deep learning models, can be trained to perform morphological analysis
by learning patterns from large datasets.
Types of Neural Network

● Recurrent Neural Networks (RNNs): Useful for sequential data like text.
● Convolutional Neural Networks (CNNs): Can capture local patterns in the text.
● Transformers: Advanced models like BERT and GPT that understand context and semantics.

5. Rule-Based Methods

Rule-based methods rely on manually defined linguistic rules for morphological analysis. These rules can
handle specific language patterns and exceptions.
Applications:

● Affix Stripping: Removing known prefixes and suffixes to find the root form.

● Inflectional Analysis: Identifying grammatical variations like tense, number, and case.

6. Hidden Markov Models (HMMs)


Hidden Markov Models (HMMs) are probabilistic models that can be used to analyze sequences of data,
such as morphemes in words. HMMs consist of a set of hidden states, each representing a possible state of
the system, and observable outputs generated from these states. In the context of morphological analysis,
HMMs can be used to model the probabilistic relationships between sequences of morphemes, helping to
predict the most likely sequence of morphemes for a given word.
Components of Hidden Markov Models (HMMs):

● States: Represent different parts of words (e.g., prefixes, roots, suffixes).

● Observations: The actual characters or morphemes in the words.

● Transition Probabilities: Probabilities of moving from one state to another.

● Emission Probabilities: Probabilities of an observable output being generated from a state.

Applications:

● Morphological Segmentation: Breaking words into morphemes.

● Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence.

● Sequence Prediction: Predicting the most likely sequence of morphemes for a given word.

Applications of Morphological Analysis


Morphological analysis has numerous applications in NLP, contributing to the advancement of various
technologies and systems:

1. Information Retrieval: Enhances search engines by improving the matching of query terms with
relevant documents, even if they are in different morphological forms.
2. Machine Translation: Facilitates accurate translation by understanding and generating correct
word forms in different languages.
3. Text-to-Speech Systems: Improves pronunciation and intonation by accurately identifying word
structures and their stress patterns.
4. Spell Checkers and Grammar Checkers: Detects and suggests corrections for misspelled
words and grammatical errors by analyzing word forms and their usage.
5. Named Entity Recognition (NER): Helps in identifying and classifying named entities in text
by understanding their morphological variations.

Tokenization in NLP
Tokenization is a fundamental step in Natural Language Processing (NLP). It involves dividing a textual
input into smaller units known as tokens. These tokens can be in the form of words, characters, sub-words,
or sentences. Tokenization improves the interpretability of text for different models. Let's understand how
tokenization works.

What is Tokenization in NLP?


Natural Language Processing (NLP) is a subfield of Artificial Intelligence, information engineering, and
human-computer interaction. It focuses on how to process and analyze large amounts of natural language
data efficiently. It is difficult to perform as the process of reading and understanding languages is far more
complex than it seems at first glance.

● Tokenization is a foundation step in NLP pipeline that shapes the entire workflow.

● Involves dividing a string or text into a list of smaller units known as tokens.

● Uses a tokenizer to segment unstructured data and natural language text into distinct chunks of
information, treating them as different elements.

● Tokens: Words or Sub-words in the context of natural language processing. Example: A word is
a token in a sentence, A character is a token in a word, etc.

● Application: Multiple NLP tasks, text processing, language modelling, and machine translation.
Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here are some types of
tokenization:
1. Word Tokenization

Word tokenization is the most commonly used method where text is divided into individual words. It works
well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating"
becomes:
Input before tokenization: ["Machine Learning is fascinating"]
Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]

2. Character Tokenization

In Character Tokenization, the textual data is split and converted to a sequence of individual characters. This
is beneficial for tasks that require a detailed analysis, such as spelling correction or for tasks with unclear
boundaries. It can also be useful for modelling character-level language.
Example
Input before tokenization: ["You are helpful"]
Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]

3. Sub-word Tokenization

This strikes a balance between word and character tokenization by breaking down text into units that are
larger than a single character but smaller than a full word. This is useful when dealing with morphologically
rich languages or rare words.
Example
["Time", "table"]
["Rain", "coat"]
["Grace", "fully"]
["Run", "way"]
Sub-word tokenization helps to handle out-of-vocabulary words in NLP tasks and for languages that form
words by combining smaller units.

4. Sentence Tokenization

Sentence tokenization is a common technique used to divide paragraphs or large bodies of text into
separate sentences as tokens. This is useful for tasks requiring individual sentence analysis or
processing.
Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is
fascinating. Computer Vision handles images. "]
Output when tokenized by sentences ["Artificial Intelligence is an emerging technology.", "Machine
learning is fascinating.", "Computer Vision handles images."]

5. N-gram Tokenization

N-gram tokenization splits text into overlapping sequences of n consecutive tokens (for example, pairs of adjacent words when n = 2).


Input before tokenization: ["Machine learning is powerful"]
Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
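One simple way to produce the bigram output above is NLTK's ngrams utility, assuming the text has already been split into words:

from nltk.util import ngrams

text = "Machine learning is powerful"
print(list(ngrams(text.split(), 2)))
# [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]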
Need of Tokenization
Tokenization is an essential step in text processing and natural language processing (NLP) for several
reasons. Some of these are listed below:

● Effective Text Processing: Reduces the size of raw text, resulting in easy and efficient statistical
and computational analysis.

● Feature extraction: Text data can be represented numerically for algorithmic comprehension by
using tokens as features in ML models.

● Information Retrieval: Tokenization is essential for indexing and searching in systems that
store and retrieve information efficiently based on words or phrases.

● Text Analysis: Used in sentiment analysis and named entity recognition, to determine the
function and context of individual words in a sentence.

● Vocabulary Management: Generates a list of distinct tokens, Helps manage a corpus's


vocabulary.

● Task-Specific Adaptation: Adapts to need of particular NLP task, Good for summarization and
machine translation.

Implementation for Tokenization

Sentence Tokenization using sent_tokenize

The code snippet uses sent_tokenize function from NLTK library. The sent_tokenize function is used to
segment a given text into a list of sentences.
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']

How sent_tokenize works: The sent_tokenize function uses an instance of PunktSentenceTokenizer from
the nltk.tokenize.punkt module, which has already been trained and therefore knows at which characters and
punctuation marks a sentence begins and ends.

Sentence Tokenization using PunktSentenceTokenizer


It is efficient to use the PunktSentenceTokenizer from the NLTK library. The Punkt tokenizer is a data-
driven sentence tokenizer that comes with NLTK. It is trained on a large corpus of text to identify sentence
boundaries.
import nltk.data
# Loading PunktSentenceTokenizer using English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']

Tokenize sentence of different language

Sentences from different languages can also be tokenized using different pickle file other than English.

● In the following code snippet, we have used NLTK library to tokenize a Spanish text into
sentences using pre-trained Punkt tokenizer for Spanish.

● The Punkt tokenizer: Data-driven ML-based tokenizer to identify sentence boundaries.

import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output:
['Hola amigo.',
'Estoy bien.']

Word Tokenization using word_tokenize

The code snippet uses the word_tokenize function from the NLTK library to tokenize a given text into
individual words.

● The word_tokenize function is helpful for breaking down a sentence or text into its constituent
words.

● Eases analysis or processing at the word level in natural language processing tasks.

from nltk.tokenize import word_tokenize


text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)
Output:
['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']
How word_tokenize works: The word_tokenize() function is a wrapper function that calls tokenize() on an
instance of the TreebankWordTokenizer class.

Word Tokenization Using TreebankWordTokenizer

The code snippet uses the TreebankWordTokenizer from the Natural Language Toolkit (NLTK) to tokenize
a given text into individual words.
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)
Output:
['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']
These tokenizers work by separating the words using punctuation and spaces. As the code outputs above
show, they do not discard the punctuation, allowing the user to decide what to do with the punctuation
during pre-processing.

Word Tokenization using WordPunctTokenizer

The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on punctuation boundaries.
Each punctuation mark is treated as a separate token.
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")
Output:
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

Word Tokenization using Regular Expression

The code snippet uses the RegexpTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given
text based on a regular expression pattern.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)
Output:
['Let', 's', 'see', 'how', 'it', 's', 'working']
Using regular expressions allows for more fine-grained control over tokenization, and you can customize the
pattern based on your specific requirements.
More Techniques for Tokenization

We have discussed the ways to implement how we can perform tokenization using the NLTK library. We
can also implement tokenization using the following methods and libraries:

● Spacy: Spacy is an NLP library that provides robust tokenization capabilities.

● BERT tokenizer: BERT uses the WordPiece tokenizer, a type of sub-word tokenizer, for
tokenizing input text.

● Byte-Pair Encoding: Byte Pair Encoding (BPE) is a data compression algorithm that has also
found applications in the field of natural language processing, specifically for tokenization. It is a
Sub-word Tokenization technique that works by iteratively merging the most frequent pairs of
consecutive bytes (or characters) in a given corpus.

● Sentence Piece: Sentence Piece is another sub-word tokenization algorithm commonly used for
natural language processing tasks. It is designed to be language-agnostic and works by iteratively
merging frequent sequences of characters or subwords in a given corpus.

Limitations of Tokenization
● Tokenization alone cannot capture the meaning of a sentence, which can result in ambiguity.

● Chinese, Japanese, and Arabic lack distinct spaces between words. Hence, the absence of clear
boundaries complicates the process of tokenization.

● It is difficult to decide how to tokenize text containing units such as email addresses, URLs,
and special symbols.

Detecting and Correcting Spelling Errors

Spelling correction is the process of substituting an erroneously spelt word with the most likely intended one
in a document produced in any natural language. It has been a focus of ongoing study in the field of Natural
Language Processing (NLP). The NLP community's mission is to create and construct software that will analyse,
comprehend, and generate natural human languages. NLP supports various applications such as text
summarisation, question answering, machine translation, parsing, information retrieval, and optical character
recognition. The basic aim of spell checking is to discover problems in written text. Dictionary lookup and
N-gram analysis are two ways to detect errors. The technique for detecting errors typically entails
determining whether or not an input string is a legitimate dictionary word, and efficient approaches have been
developed for this. Most spell checkers use dictionary lookup and n-gram approaches. When a word in a written
document is identified as a mistake, spelling correction procedures are used to correct the word or offer
suitable alternatives. For text error correction, many techniques are available, such as rule-based
techniques, edit distance techniques, N-gram techniques, and deep learning techniques.

Spelling Error Detection and Correction

Introduction

There are many commercial as well as non-commercial spelling error detection and correction tools
available in the market for almost all popular languages.
Every tool generally works at the word level, using an integral dictionary or WordNet as the backend
database for correction and detection.

When processing text:

● Every word is looked up in the speller lexicon (dictionary).

● If a word is not found in the dictionary, it is detected as an error.

● To correct the error, a spell checker searches the dictionary or WordNet for the word that most
closely resembles the erroneous word.

● These suggested words are presented to the user to choose the intended correct word.

Spelling checking is widely used in applications such as:

● Machine Translation

● Search Engines

● Information Retrieval

● Text Editors

The spell checking process consists of two stages:

1. Error Detection

2. Error Correction

Types of Spelling Errors

According to Damerau’s studies, spelling errors can generally be divided into two main types based on error
patterns:

Typographic Errors (Non-Word Errors)

● Occur when the correct spelling of a word is known but mistyped by mistake.

● Mostly related to keyboard errors and do not follow linguistic rules.


● Example:

○ Intended: "because"

○ Typed: "becuase”

Cognitive Errors (Real-Word Errors)

● Occur when the correct spelling of a word is not known.

● Often result in words that are spelled correctly but used incorrectly.

● Pronunciation of the misspelled word is the same or similar to the correct word.

● Example:

○ Intended: "their" (possessive)

○ Typed: "there" (adverb)

Error Detection Process


For error detection, each word in a sentence or paragraph is tokenised using a tokeniser and checked for
validity. The candidate word is valid if it has a meaning; otherwise, it is a non-word. Two commonly used
techniques for error detection are:

○ N-gram Analysis
○ Dictionary / WordNet Lookup

1. N-gram Analysis

N-gram analysis is a method to find incorrectly spelt words in a large body of text.
Instead of comparing each entire word to a dictionary, this method checks n-grams (substrings of length n).

Process
1. An n-gram is a set of consecutive characters taken from a string, where n is a chosen length.
2. Real n-gram frequencies are stored in an n-dimensional matrix.
3. When processing text:

○ If a non-existent or rare n-gram is found, the word is flagged as a misspelling.

○ If all n-grams are frequent enough, the word is assumed to be correct.

4. Each string is split into sets of adjacent n-grams.


5. The similarity between two strings is calculated as:

   Similarity Coefficient = (Number of common n-grams, i.e. the intersection) / (Total number of n-grams, i.e. the union)

Advantages

● Language independent — no need for linguistic knowledge.

● Can detect words not in dictionaries.


2. Dictionary / WordNet Lookup

A dictionary or WordNet is a lexical resource containing a list of correct words in a particular language.

Process

● Each word in the text is checked against the dictionary.

● If the word does not exist in the dictionary, it is flagged as an error.

Drawbacks

1. Keeping the dictionary up-to-date is difficult.

2. May not cover all possible words (especially technical terms, slang, or new words).

WordNet

● A large-scale lexical database with a hierarchical structure based on relationships between concepts.

● Covers nouns, verbs, adjectives, and adverbs.

● Widely used in NLP applications.

Error Correction

Error correction is the process of replacing detected misspellings with the correct words.

Steps
1. Generation of Candidate Corrections
○ Uses a precompiled table of legal n-grams to find potential correction terms.
○ Can also use dictionary-based lookups and edit distance algorithms.

2. Ranking of Candidate Corrections


○ Uses lexical similarity measures (e.g., Levenshtein distance, Jaro-Winkler similarity).
○ Uses probabilistic models to estimate the likelihood of each correction.
○ Candidates are ranked and presented in order of confidence.

In some cases, ranking is skipped, and the user selects the correct word from the list of
candidates.
Error Correction Techniques

Error correction in spell checking involves selecting the most likely correct word from a set of candidates
generated after error detection. Several approaches are used to determine the best correction.

1. Dictionary Lookup
● Idea:
Compare each word in the text against a list of valid words (dictionary).
● Process:

○ Tokenize the text into words.


○ Check if each token exists in the dictionary.

○ If a token is not found, mark it as a possible error.
● Advantages:

○ Simple to implement.
○ Very fast for small dictionaries.
● Drawbacks:

○ Keeping dictionary up to date is difficult.


○ May fail for:
■ Proper nouns
■ Slang words
■ New words not yet in dictionary

Example:

Dictionary: {cat, dog, apple, run}

Word: "appl" Not found flagged as misspelling.

2. Edit Distance (Levenshtein Distance)


● Idea:
Measures the minimum number of operations needed to transform one word into another.
● Operations:

1. Insertion (add a letter)


2. Deletion (remove a letter)
3. Substitution (replace a letter)
● Formula:
Levenshtein distance = minimum edits required.

Example:

"kitten" → "sitting"

Substitution: k → s

Substitution: e → i

Insertion: g at the end

Distance = 3

Usage in correction:
Choose the dictionary word with the smallest edit distance to the misspelled word.

3. Soundex / Phonetic Matching

● Idea:
Correct words based on how they sound, not just spelling.

Why:
Many misspellings preserve the pronunciation of the intended word, so phonetically similar words can be
matched even when their spellings differ.

Example:
Soundex("Robert") = R163
Soundex("Rupert") = R163 → likely match.

4. Minimum Edit Distance with Weights

● Improvement over basic edit distance:

○ Assign different costs for operations (e.g., substituting 'a' with 'e' costs less than substituting
'a' with 'z').

○ Based on likelihood of typing errors.


Example: substituting 'a' with 'e' is assigned a lower cost than substituting 'a' with 'z', because 'a' and 'e' are far more likely to be confused while typing.

5.N-gram Based Similarity

This approach measures how many n-grams the misspelled word and a dictionary word have in common,
weighted by the length of the words.

● Process:

1. Split both the misspelled word and each candidate word into n-grams.
2. Count the common n-grams between them.
3. Apply a similarity score (e.g., Jaccard coefficient or cosine similarity) that is normalized by
word length

Example:

Misspelled word: wirld → bigrams: "wi", "ir", "rl", "ld"

Candidate word: world → bigrams: "wo", "or", "rl", "ld"

Common bigrams: "rl", "ld"

Similarity score: common / total = 2 / 6 ≈ 0.33
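
A small sketch of this idea using character bigrams and the Jaccard coefficient (one of the similarity scores mentioned above):

def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(w1, w2):
    b1, b2 = bigrams(w1), bigrams(w2)
    return len(b1 & b2) / len(b1 | b2)

print(bigrams("wirld"))                      # {'wi', 'ir', 'rl', 'ld'}
print(bigrams("world"))                      # {'wo', 'or', 'rl', 'ld'}
print(round(jaccard("wirld", "world"), 2))   # 0.33 (2 common out of 6 total)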

6. Probabilistic Techniques

Uses probability to determine which candidate is most likely the correct spelling.

● Notation:
○ w = original (misspelled) word
○ c = candidate correction word

Goal:
Choose the candidate c that maximises P(c | w) ∝ P(w | c) × P(c), where:

○ P(c): probability of the candidate word occurring in the language

○ P(w | c): probability of making the specific spelling error w when the intended word is c

Example:
w = "wird"

c candidates = "word", "weird", "wired"


Choose c with the highest P(c|w).

Drawback:

○ Can never be 100% certain; it’s always about most likely choice.
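
As a toy sketch of this ranking, the snippet below scores each candidate by P(c) × P(w|c); the probability values are made up purely for illustration.

word = "wird"
candidates = {
    # candidate: (P(c) from a language model, P(w|c) from an error model)
    "word":  (0.008, 0.010),
    "weird": (0.001, 0.020),
    "wired": (0.002, 0.015),
}

for c, (p_c, p_w_given_c) in candidates.items():
    print(c, p_c * p_w_given_c)

best = max(candidates, key=lambda c: candidates[c][0] * candidates[c][1])
print("Best correction:", best)   # "word" under these toy numbers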

7. Neural Networks

A machine learning approach using backpropagation networks.

● Method:
○ One output node per word in the dictionary.
○ Input nodes for every possible n-gram in every word position (n usually 1 or 2).
○ Network predicts which output (word) is most likely correct.

● Limitations:

○ Works well only for small dictionaries (< 1000 words).

○ Very high time and computational cost for large vocabularies.

○ The learning phase is resource-intensive on traditional hardware.

8. Noisy Channel Model

An older but foundational statistical model for error correction.

● Intuition:
○ Treat the misspelled word as if it was a correct word distorted by noise during transmission
(like in a communication channel).
○ Noise could be substitutions, insertions, deletions, or transpositions of letters.

● Goal:
Find the true word by:

○ Building a model of the noisy channel.


○ Simulating how words might get distorted.
○ Choosing the candidate that is most likely to have produced the misspelled word.


Bayesian Formulation:

○ ĉ = argmax over all candidate words c of P(w | c) × P(c)

○ where ĉ is the chosen correction.


● History:
○ Applied to spelling correction by AT&T Bell Laboratories (1990–1991) and IBM Watson
Research (1991).

Spell Checking – Error Detection & Correction Workflow

flowchart TD

A[Input Text] --> B[Tokenization]

B --> C[Error Detection]

C --> D[Candidate Generation]

D --> E[Error Correction Methods]


E --> F1[N-gram Similarity]

E --> F2[Probabilistic Model]

E --> F3[Neural Network]

E --> F4[Noisy Channel Model]

F1 --> G[Best Correction]

F2 --> G

F3 --> G

F4 --> G

G --> H[Output Corrected Text]

Minimum Edit Distance


Given two strings s1 and s2 and below operations that can be performed on s1. The task is to find the
minimum number of edits (operations) to convert 's1' into 's2'.
● Insert: Insert any character before or after any index of s1
● Remove: Remove a character of s1
● Replace: Replace a character at any index of s1 with some other character.

Note: All of the above operations are of equal cost.


Examples:
Input: s1 = "geek", s2 = "gesek"
Output: 1
Explanation: We can convert s1 into s2 by inserting an 's' between two consecutive 'e' in s1.

Input: s1 = "gfg", s2 = "gfg"


Output: 0
Explanation: Both strings are same.

Input: s1 = "abcd", s2 = "bcfe"


Output: 3
Explanation: We can convert s1 into s2 by removing 'a', replacing 'd' with 'f' and inserting 'e' at the end.
Illustration of Edit Distance:
Let's suppose we have s1="GEEXSFRGEEKKS" and s2="GEEKSFORGEEKS"
Now to convert s1 into s2 we would require 3 minimum operations:
Operation 1: Replace 'X' with 'K'
Operation 2: Insert 'O' between 'F' and 'R'
Operation 3: Remove the second-to-last character, i.e. 'K'

[Naive Approach] Using Recursion – O(3^(m+n)) time and O(m+n) auxiliary space

The idea is to process all characters one by one starting from either from left or right sides of both strings.

Let us process from the right end of the strings. There are two possibilities for every pair of characters being
traversed: either they match or they don't. If the last characters of both strings match, we simply
recursively calculate the answer for the rest of the strings. When the last characters do not match, we perform
all three operations to match the last characters, i.e. insert, replace, and remove, and then recursively
calculate the result for the remaining part of the string. Upon completion of these operations, we select
the minimum answer and add 1 to it.

Below is the recursive structure for this problem, considering the case when the last characters never match.
When the last characters of the strings match, make a recursive call editDistance(m-1, n-1) to calculate the
answer for the remaining part of the strings.

When the last characters of the strings don't match, make three recursive calls as shown below:

● Insert s2[n-1] at last of s1 : editDistance(m, n-1)


● Replace s1[m-1] with s2[n-1] : editDistance(m-1, n-1)
● Remove s1[m-1] : editDistance(m-1, n)

Recurrence Relations for Edit Distance:

● Case 1: When the last character of both the strings are same. editDistance(s1, s2, m, n) =
editDistance(s1, s2, m-1, n-1)
● Case 2: When the last characters are different
○ editDistance(s1, s2, m, n) = 1 + Minimum{ editDistance(s1, s2 ,m,n-1),
editDistance(s1, s2 ,m-1, n), editDistance(s1, s2 , m-1, n-1)}

Base Case for Edit Distance:

● Case 1: When s1 becomes empty, i.e. m = 0

○ return n, as it requires n insertions to convert an empty string to s2 of size n

● Case 2: When s2 becomes empty, i.e. n = 0

○ return m, as it requires m removals to convert s1 of size m to an empty string.
# A Naive recursive Python program to find minimum number
# of operations to convert s1 to s2.

# Recursive function to find number of operations
# needed to convert s1 into s2.
def editDistRec(s1, s2, m, n):

    # If first string is empty, the only option is to
    # insert all characters of second string into first
    if m == 0:
        return n

    # If second string is empty, the only option is to
    # remove all characters of first string
    if n == 0:
        return m

    # If last characters of two strings are same, nothing
    # much to do. Get the count for remaining strings.
    if s1[m - 1] == s2[n - 1]:
        return editDistRec(s1, s2, m - 1, n - 1)

    # If last characters are not same, consider all three
    # operations on last character of first string,
    # recursively compute minimum cost for all three
    # operations and take minimum of three values.
    return 1 + min(editDistRec(s1, s2, m, n - 1),
                   editDistRec(s1, s2, m - 1, n),
                   editDistRec(s1, s2, m - 1, n - 1))

# Wrapper function to initiate the recursive calculation
def editDistance(s1, s2):
    return editDistRec(s1, s2, len(s1), len(s2))

if __name__ == "__main__":
    s1 = "abcd"
    s2 = "bcfe"
    print(editDistance(s1, s2))

Output:

3
[Better Approach 1] Using Top-Down DP (Memoization) – O(m*n) time and O(m*n) space

In the above recursive approach, there are several overlapping subproblems:

editDist(m-1, n-1) is called three times
editDist(m-1, n-2) is called two times
editDist(m-2, n-1) is called two times, and so on...

So, we can use memoization to store the result of each subproblem to avoid recalculating it again
and again.

Below is the illustration of overlapping subproblems during the recursion.

# Python program to find minimum number
# of operations to convert s1 to s2

# Recursive function to find number of operations
# needed to convert s1 into s2.
def editDistRec(s1, s2, m, n, memo):

    # If first string is empty, the only option is to
    # insert all characters of second string into first
    if m == 0:
        return n

    # If second string is empty, the only option is to
    # remove all characters of first string
    if n == 0:
        return m

    # If value is memoized
    if memo[m][n] != -1:
        return memo[m][n]

    # If last characters of two strings are same, nothing
    # much to do. Get the count for remaining strings.
    if s1[m - 1] == s2[n - 1]:
        memo[m][n] = editDistRec(s1, s2, m - 1, n - 1, memo)
        return memo[m][n]

    # If last characters are not same, consider all three
    # operations on last character of first string,
    # recursively compute minimum cost for all three
    # operations and take minimum of three values.
    memo[m][n] = 1 + min(
        editDistRec(s1, s2, m, n - 1, memo),
        editDistRec(s1, s2, m - 1, n, memo),
        editDistRec(s1, s2, m - 1, n - 1, memo)
    )
    return memo[m][n]

# Wrapper function to initiate the recursive calculation
def editDistance(s1, s2):
    m, n = len(s1), len(s2)
    memo = [[-1 for _ in range(n + 1)] for _ in range(m + 1)]
    return editDistRec(s1, s2, m, n, memo)

if __name__ == "__main__":
    s1 = "abcd"
    s2 = "bcfe"
    print(editDistance(s1, s2))

Output:

3

[Better Approach 2] Using Bottom-Up DP (Tabulation) – O(m*n) time and O(m*n) space

Use a table to store the solutions of subproblems to avoid recalculating the same subproblems multiple times.
By doing this, if the same subproblem repeats, we retrieve its solution from the table itself.

Below are the steps to convert the recursive approach to the bottom-up approach (see the sketch after these steps):
1. Choosing dimensions of the table: The state of smaller sub-problems depends on the input parameters m and
n, because at least one of them decreases after each recursive call. So we construct a 2D table dp[][]
to store the solutions of the sub-problems.
2. Choosing the correct size of the table: The range of the parameters goes from 0 to m and 0 to n, so we
choose the dimensions as (m + 1) × (n + 1).
3. Filling the table: This consists of two stages, table initialisation and building the solution from the
smaller subproblems:
● Table initialisation: Before building the solution, we need to initialise the table with the known
values, i.e. the base case. Here m = 0 and n = 0 is the base case, so we initialise the first
column dp[i][0] with i and the first row dp[0][j] with j.
● Building the solution of larger problems from the smaller subproblems: We can easily define the
iterative structure by using the recursive structure of the above recursive solution.
● if (s1[i - 1] == s2[j - 1]) dp[i][j] = dp[i - 1][j - 1]
● if (s1[i - 1] != s2[j - 1]) dp[i][j] = 1 + min(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1])

4. Returning the final solution: After filling the table iteratively, the final solution is stored at the
bottom-right corner of the 2D table, i.e. we return dp[m][n] as the output.
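
The steps above translate into the following bottom-up sketch, which fills the (m + 1) × (n + 1) table and returns dp[m][n]:

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases: converting to/from the empty string
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i][j - 1],       # insert
                                   dp[i - 1][j],       # remove
                                   dp[i - 1][j - 1])   # replace
    return dp[m][n]

if __name__ == "__main__":
    print(edit_distance("abcd", "bcfe"))   # 3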
Real-World Applications of Edit Distance

● Spell Checking and Auto-Correction

● DNA Sequence Alignment

● Plagiarism Detection

● Natural Language Processing

● Version Control Systems

● String Matching

Edit Distance and Longest Common Subsequence

If we do not consider the replace operation, then the edit distance problem reduces to the Longest Common
Subsequence (LCS) problem: with only insert and delete operations allowed, the edit distance between two
strings is (M + N − 2 × LCS), where LCS is the length of the longest common subsequence.
