Unit I - Natural Language Processing
INTRODUCTION
Origins and challenges of NLP – Language Modeling: Grammar-based LM, Statistical LM – Regular
Expressions, Finite-State Automata – English Morphology, Transducers for lexicon and rules,
Tokenization, Detecting and Correcting Spelling Errors, Minimum Edit Distance
Introduction to NLP
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction
between computers and human languages. The goal is to enable machines to read, understand, generate, and
interpret natural language.
The ability of machines to interpret human language is now at the core of many applications that we
use every day - chatbots, Email classification and spam filters, search engines, grammar checkers, voice
assistants, and social language translators. The input and output of an NLP system can be Speech or Written
Text.
Applications and Purpose of NLP
● To automate tasks like translation (e.g., Google Translate), speech recognition (e.g., Siri, Alexa),
sentiment analysis, question answering, etc.
● Bridge the gap between human communication (natural languages) and computer understanding
(machine languages).
Origins of NLP
Understand the historical evolution of Natural Language Processing (NLP) – how it moved from rule-based
symbolic systems to modern deep learning models like GPT and BERT.
1. 1950s–1960s: The Rule-Based Era & Symbolic AI
NLP started as part of Artificial Intelligence (AI). Turing Test (1950) – Alan Turing proposed a test to
determine if a machine could imitate human language. In the early days of NLP, systems were built entirely
using explicit grammar rules written by human experts. These systems could analyse sentence structure but
had no learning capability or understanding of meaning.
Used In:
● Syntax-based parsers
Limitations:
● Hard to scale
● Required large labelled data
2. Machine Learning & Deep Learning Era
NLP was later transformed by machine learning and then deep learning, with the introduction of neural networks,
especially Recurrent Neural Networks (RNNs) and Transformers. With the introduction of the
Transformer architecture (2017), NLP systems became capable of understanding long-range context,
semantics, and language generation. These models are trained on massive datasets and often fine-tuned for
specific tasks.
Challenges of NLP
1. Ambiguity
Ambiguity occurs when a word, phrase, or sentence can have multiple interpretations.
For example, the sentence “He saw the man with the telescope” can mean:
○ he used a telescope to see the man, or
○ he saw a man who was holding a telescope.
2. Context Understanding
Without understanding the surrounding context, machines cannot accurately interpret the meaning.
3. Morphology
Morphology is the study of word formation and structure. Words can exist in multiple forms (e.g., run,
runs, ran, running), and NLP must recognise that these all relate to the same root word
(lemma). Handling these variations accurately is crucial for tasks like searching, translation, or
sentiment analysis.
4. Syntax and Semantics
Syntax refers to grammatical structure, while semantics relates to meaning. A sentence can be
grammatically correct but semantically nonsensical.
● Example: “Colourless green ideas sleep furiously.” (Correct syntax, but meaningless)
NLP systems must handle both aspects to understand human language truly.
5. World Knowledge and Common Sense
Humans use background knowledge and common sense to understand language, but machines do not
have real-world experience.
Example: “John dropped the glass. It broke.” – We understand "it" refers to "glass" due to common
sense, but a computer needs logical reasoning or trained data to infer this.
6. Multilingualism
There are thousands of languages worldwide, each with unique grammar rules, writing systems,
idioms, and cultural nuances.
Designing NLP systems that support multilingual communication, code-mixing (e.g., "Tanglish"), and
translation is extremely challenging.
7. Data Sparsity
Language has infinite possible word combinations, but training data is always finite.
Some phrases or sentences might never appear in the training data, making it hard for statistical or
machine learning models to assign probabilities or make predictions.
8. Noisy and Informal Text
Real-world data (like social media, chats, or forums) often contains typos, informal abbreviations,
emojis, and slang.
NLP models must be trained to normalize and interpret noisy language effectively.
Language Modeling
A Language Model (LM) is a statistical or probabilistic tool that assigns a probability to a sequence of
words. In simple terms, it helps predict what word is likely to come next in a sentence. Language models
are fundamental to many NLP applications, such as speech recognition, machine translation, text
prediction, spelling correction, and chatbots.
For example,
If we type “I want to eat…” a good language model might predict “pizza” or “food” as the most
probable next word.
1. Grammar-Based Language Models
Grammar-based LMs use predefined rules and syntax to define how sentences can be structured in a
language. These are based on formal grammar systems like Context-Free Grammar (CFG). This
method is rule-based and is often used for parsing tasks, where the structure of the sentence (syntax tree)
is important.
Example:
NP → Det N
VP → V NP
Det → the | a
N → cat | dog
V → eats | drinks
This allows the system to generate or validate syntactically correct sentences. However, grammar-based
models struggle with real-world language data due to their strictness and inability to handle ambiguity and
variations naturally found in human language.
Advantages:
● Can generate and validate syntactically correct sentences.
Disadvantages:
● Hard to scale to real human language.
2. Statistical Language Models
Statistical language models learn word probabilities from counts in a text corpus. An N-gram model conditions each word on the previous N-1 words; for a bigram model, the probability of a word given the previous word is estimated as:
P(wn | wn-1) = Count(wn-1 wn) / Count(wn-1)
Example:
Given a corpus:
● Count("want pizza") = 30
● Count("want") = 100
Then:
P(pizza | want) = Count("want pizza") / Count("want") = 30 / 100 = 0.3
Limitations of N-gram models:
● Fixed Context Window: N-gram models only look at a limited number of previous words.
● Poor long-term dependencies: They don’t remember what was said earlier in the sentence if it’s
more than N words away.
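As a concrete sketch of the bigram formula above (the toy corpus and the helper function below are assumptions for illustration):
from collections import Counter, defaultdict

# Hypothetical toy corpus used to estimate bigram probabilities
corpus = "i want pizza . i want food . i want to eat pizza .".split()

unigram_counts = Counter(corpus)
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):   # count adjacent word pairs
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    # P(curr | prev) = Count(prev curr) / Count(prev)
    return bigram_counts[prev][curr] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(bigram_prob("want", "pizza"))                               # relative frequency of "pizza" after "want"
print(max(bigram_counts["want"], key=bigram_counts["want"].get))  # a most likely next word after "want"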
For example, given the input sentence "I want to eat _____", the model predicts the most probable next word only from the limited preceding context.
Regular Expressions
A regular expression (regex) is a pattern that describes a set of strings. In NLP, regular expressions are widely used to search, match, and clean text.
Quantifiers and Character Classes
Symbol Meaning
* 0 or more times
+ 1 or more times
? 0 or 1 time
[abc] a, b, or c
[^abc] NOT a, b, or c
\d Digits
\D Non-digits
\W Non-word characters
\s Whitespace
\S Non-whitespace
Basic Components of Regular Expressions:
.     Matches any single character (except newline). Example: a.c matches "abc", "a9c", "a_c", but not "ac"
*     Matches zero or more repetitions of the preceding character or group. Example: ab* matches "a", "ab", "abb", "abbb", etc.
+     Matches one or more repetitions of the preceding character or group. Example: ab+ matches "ab", "abb", but not "a"
?     Matches zero or one occurrence of the preceding character or group. Example: ab? matches "a" or "ab"
|     Acts as a logical OR between expressions
[]    Matches any one character inside the brackets. Example: [aeiou] matches any vowel: "a", "e", "i", "o", "u"
[^ ]  Matches any character except those inside the brackets. Example: [^aeiou] matches any consonant
()    Groups expressions together for applying operators. Example: (ab)* matches "", "ab", "abab", "ababab", etc.
^     Matches the beginning of a string. Example: ^Hello matches "Hello world" but not "Say Hello"
$     Matches the end of a string. Example: end$ matches "This is the end" but not "end of line"
Regex in NLP
● Tokenization: Split sentences into words using patterns like \w+.
● Removing Punctuation: Replace [^a-zA-Z0-9\s] with an empty string.
● Detecting Email/URLs: Extract structured information from raw text.
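A small illustrative sketch of these uses with Python's re module (the sample text and the simple email pattern are assumptions for illustration):
import re

text = "Hello, NLP world! Contact: test@example.com"

print(re.findall(r"\w+", text))                      # Tokenization: extract word tokens
print(re.sub(r"[^a-zA-Z0-9\s]", "", text))           # Removing punctuation
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text))  # Detecting a simple email pattern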
Python Regex
Python provides the re module for working with regular expressions:
● re.compile() Compiles the regex pattern into a regex object (like Pattern in Java).
● finditer() Returns an iterator of match objects (like Matcher in Java).
Example code:
import re # Import Python's regular expression module
text = "Roll numbers: 101, 102, 103" # Sample text containing roll numbers
pattern = r"\d{3}" # Regex pattern to match exactly 3 digits
matches = re.finditer(pattern, text) # Find all non-overlapping matches
for match in matches: # Loop through each match object
print("Found:", match.group()) # Print the matched string
Output
Found: 101
Found: 102
Found: 103
Finite Automata
Finite automata are abstract machines used to recognize patterns in input sequences, forming the basis for
understanding regular languages in computer science. They consist of states, transitions, and input symbols,
processing each symbol step-by-step. If the machine ends in an accepting state after processing the input, it
is accepted; otherwise, it is rejected. Finite automata come in deterministic (DFA) and non-deterministic
(NFA), both of which can recognize the same set of regular languages. They are widely used in text
processing, compilers, and network protocols.
A finite automaton is formally defined by:
● Q: Set of states
● Σ: Set of input symbols
● q: Initial state
● F: Set of final (accepting) states
● δ: Transition function
● Output decision: Based on the final state reached after the input is consumed, the string is accepted or rejected.
Deterministic Finite Automaton (DFA)
A DFA is represented as {Q, Σ, q, F, δ}. In DFA, for each input symbol, the machine transitions to one and
only one state. DFA does not allow any null transitions, meaning every state must have a transition defined
for every input symbol.
DFA consists of 5 tuples {Q, Σ, q, F, δ}.
Q : set of all states.
Σ : set of input symbols. ( Symbols which machine takes as input )
q : Initial state. ( Starting state of a machine )
F : set of final state.
δ : Transition Function, defined as δ : Q X Σ --> Q.
Example:
Construct a DFA that accepts all strings ending with 'a'.
Given:
Σ = {a, b},
Q = {q0, q1},
F = {q1}
State\Symbol a b
q0 q1 q0
q1 q1 q0
In this example, if the string ends in 'a', the machine reaches state q1, which is an accepting state.
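The DFA above can be simulated directly from its transition table; the following is a minimal sketch (the dictionary encoding and the function name are illustrative, not a standard API):
# Transition table of the DFA that accepts strings over {a, b} ending in 'a'
transitions = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q0",
}
accepting = {"q1"}

def dfa_accepts(string):
    state = "q0"                              # start state
    for symbol in string:
        state = transitions[(state, symbol)]  # exactly one move per symbol
    return state in accepting                 # accept if we end in a final state

print(dfa_accepts("abba"))  # True  (ends in 'a')
print(dfa_accepts("aab"))   # False (ends in 'b')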
Non-deterministic Finite Automaton (NFA)
In an NFA, for a given state and input symbol, the machine may move to zero, one, or several possible next states.
● It allows null (ϵ) moves, where the machine can change states without consuming any input.
Example:
Construct an NFA that accepts strings ending in 'a'.
Given:
Σ = {a, b},
Q = {q0, q1},
F = {q1}
Fig 2. State Transition Diagram for NFA with Σ = {a, b}
State Transition Table for above Automaton,
State\Symbol a b
q0 {q0,q1} q0
q1 φ φ
● Power: Both DFA and NFA recognize the same set of regular languages.
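An NFA can be simulated by tracking the set of states that are active at the same time; below is a minimal sketch of the NFA above (no epsilon moves are used here, and the encoding is illustrative):
# Transition table of the NFA that accepts strings over {a, b} ending in 'a'
nfa_transitions = {
    ("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
    ("q1", "a"): set(),        ("q1", "b"): set(),
}
nfa_accepting = {"q1"}

def nfa_accepts(string):
    current = {"q0"}                      # set of active states
    for symbol in string:
        nxt = set()
        for state in current:
            nxt |= nfa_transitions.get((state, symbol), set())
        current = nxt
    return bool(current & nfa_accepting)  # accept if any active state is final

print(nfa_accepts("abba"))  # True
print(nfa_accepts("abb"))   # False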
English Morphology
Morphological analysis is important in NLP for the following reasons:
1. Understanding Word Formation: It helps in identifying the basic building blocks of words,
which is crucial for language comprehension.
2. Improving Text Analysis: By breaking down words into their roots and affixes, it enhances the
accuracy of text analysis tasks like sentiment analysis and topic modeling.
3. Enhancing Language Models: Morphological analysis provides detailed insights into word
formation, improving the performance of language models used in tasks like speech recognition
and text generation.
4. Facilitating Multilingual Processing: It aids in handling the morphological diversity of different
languages, making NLP systems more robust and versatile.
Key Techniques used in Morphological Analysis for NLP Tasks
Morphological analysis involves breaking down words into their constituent morphemes (the smallest units
of meaning) and understanding their structure and formation. Various techniques can be employed to
perform morphological analysis, each with its own strengths and applications.
Here are some of the key techniques used in morphological analysis:
1. Stemming
Stemming reduces words to their base or root form, usually by removing suffixes. The resulting stems are
not necessarily valid words but are useful for text normalization.
Common ways to implement stemming in python:
● Porter Stemmer: One of the most popular stemming algorithms, known for its simplicity and
efficiency.
● Snowball Stemmer: An improvement over the Porter Stemmer, supporting multiple languages.
● Lancaster Stemmer: A more aggressive stemming algorithm, often resulting in shorter stems.
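For illustration, a short sketch using NLTK's PorterStemmer, one of the stemmers listed above (assuming NLTK is installed; the word list is arbitrary):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "easily", "fairly"]
# Stems are not always valid dictionary words
print([stemmer.stem(w) for w in words])   # e.g. ['run', 'run', 'ran', 'easili', 'fairli']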
2. Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma). It considers the context and part of
speech, producing valid words. To implement lemmatization in python, WordNet Lemmatizer is used,
which leverages the WordNet lexical database to find the base form of words.
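A short sketch using NLTK's WordNetLemmatizer (assuming NLTK and the WordNet data are installed; the example words are arbitrary):
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet') may be needed the first time

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # typically 'run'  (verb)
print(lemmatizer.lemmatize("better", pos="a"))   # typically 'good' (adjective)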
3. Morphological Parsing
Morphological parsing involves analyzing the structure of words to identify their morphemes (roots,
prefixes, suffixes). It requires knowledge of morphological rules and patterns. Finite-State Transducers
(FSTs) are used as a tool for morphological parsing.
Finite-State Transducers (FSTs)
FSTs are computational models used to represent and analyze the morphological structure of words. They
consist of states and transitions, capturing the rules of word formation.
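Full FST toolkits are beyond the scope of these notes, but the following hand-rolled sketch imitates the kind of output an FST-style lexicon-plus-rules analyser produces; the tiny lexicon, the +N/+SG/+PL tags, and the suffix rules are all assumptions for illustration.
# Simplified stand-in for an FST that maps surface forms to morphological analyses
lexicon = {"cat", "dog", "fox"}

def analyze(word):
    if word in lexicon:                               # e.g. "cat"   -> "cat +N +SG"
        return f"{word} +N +SG"
    if word.endswith("es") and word[:-2] in lexicon:  # e.g. "foxes" -> "fox +N +PL"
        return f"{word[:-2]} +N +PL"
    if word.endswith("s") and word[:-1] in lexicon:   # e.g. "cats"  -> "cat +N +PL"
        return f"{word[:-1]} +N +PL"
    return word + " +?"                               # unknown form

for w in ["cat", "cats", "foxes", "ran"]:
    print(w, "->", analyze(w))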
4. Neural Network Approaches
Neural models learn morphological patterns directly from data. Commonly used architectures:
● Recurrent Neural Networks (RNNs): Useful for sequential data like text.
● Convolutional Neural Networks (CNNs): Can capture local patterns in the text.
● Transformers: Advanced models like BERT and GPT that understand context and semantics.
5. Rule-Based Methods
Rule-based methods rely on manually defined linguistic rules for morphological analysis. These rules can
handle specific language patterns and exceptions.
Applications:
● Affix Stripping: Removing known prefixes and suffixes to find the root form.
● Inflectional Analysis: Identifying grammatical variations like tense, number, and case.
Applications:
● Sequence Prediction: Predicting the most likely sequence of morphemes for a given word.
Applications of Morphological Analysis in NLP
1. Information Retrieval: Enhances search engines by improving the matching of query terms with
relevant documents, even if they are in different morphological forms.
2. Machine Translation: Facilitates accurate translation by understanding and generating correct
word forms in different languages.
3. Text-to-Speech Systems: Improves pronunciation and intonation by accurately identifying word
structures and their stress patterns.
4. Spell Checkers and Grammar Checkers: Detects and suggests corrections for misspelled
words and grammatical errors by analyzing word forms and their usage.
5. Named Entity Recognition (NER): Helps in identifying and classifying named entities in text
by understanding their morphological variations.
Tokenization in NLP
Tokenization is a fundamental step in Natural Language Processing (NLP). It involves dividing a Textual
input into smaller units known as tokens. These tokens can be in the form of words, characters, sub-words,
or sentences. It helps in improving interpretability of text by different models. Let's understand How
Tokenization Works.
● Tokenization is a foundation step in NLP pipeline that shapes the entire workflow.
● Involves dividing a string or text into a list of smaller units known as tokens.
● Uses a tokenizer to segment unstructured data and natural language text into distinct chunks of
information, treating them as different elements.
● Tokens: Words or Sub-words in the context of natural language processing. Example: A word is
a token in a sentence, A character is a token in a word, etc.
● Application: Multiple NLP tasks, text processing, language modelling, and machine translation.
Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here are some types of
tokenization:
1. Word Tokenization
Word tokenization is the most commonly used method where text is divided into individual words. It works
well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating"
becomes:
Input before tokenization: ["Machine Learning is fascinating"]
Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]
2. Character Tokenization
In Character Tokenization, the textual data is split and converted to a sequence of individual characters. This
is beneficial for tasks that require a detailed analysis, such as spelling correction or for tasks with unclear
boundaries. It can also be useful for modelling character-level language.
Example
Input before tokenization: ["You are helpful"]
Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]
3. Sub-word Tokenization
This strikes a balance between word and character tokenization by breaking down text into units that are
larger than a single character but smaller than a full word. This is useful when dealing with morphologically
rich languages or rare words.
Example
["Time", "table"]
["Rain", "coat"]
["Grace", "fully"]
["Run", "way"]
Sub-word tokenization helps to handle out-of-vocabulary words in NLP tasks and for languages that form
words by combining smaller units.
4. Sentence Tokenization
Sentence tokenization is also a common technique used to make a division of paragraphs or large set of
sentences into separated sentences as tokens. This is useful for tasks requiring individual sentence analysis or
processing.
Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is
fascinating. Computer Vision handles images. "]
Output when tokenized by sentences ["Artificial Intelligence is an emerging technology.", "Machine
learning is fascinating.", "Computer Vision handles images."]
5. N-gram Tokenization
N-gram tokenization splits text into overlapping sequences of n consecutive tokens (e.g., bigrams or trigrams), which helps capture local word context.
Example
Input before tokenization: ["Machine learning is fascinating"]
Output when tokenized into bigrams: ["Machine learning", "learning is", "is fascinating"]
Need of Tokenization
● Effective Text Processing: Reduces the size of raw text, resulting in easy and efficient statistical
and computational analysis.
● Feature extraction: Text data can be represented numerically for algorithmic comprehension by
using tokens as features in ML models.
● Information Retrieval: Tokenization is essential for indexing and searching in systems that
store and retrieve information efficiently based on words or phrases.
● Text Analysis: Used in sentiment analysis and named entity recognition, to determine the
function and context of individual words in a sentence.
● Task-Specific Adaptation: Adapts to the needs of a particular NLP task; useful for summarization and machine translation.
The following code snippet uses the sent_tokenize function from the NLTK library. The sent_tokenize function is used to segment a given text into a list of sentences.
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']
How sent_tokenize works: The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This tokenizer comes pre-trained, so it knows which characters and punctuation mark the beginning and end of a sentence.
Sentences in languages other than English can also be tokenized by loading the corresponding language-specific pickle file instead of the English one.
● In the following code snippet, we have used NLTK library to tokenize a Spanish text into
sentences using pre-trained Punkt tokenizer for Spanish.
import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output:
['Hola amigo.',
'Estoy bien.']
The next code snippet uses the word_tokenize function from the NLTK library to tokenize a given text into individual words.
● The word_tokenize function is helpful for breaking down a sentence or text into its constituent
words.
● Eases analysis or processing at the word level in natural language processing tasks.
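A minimal sketch of such a snippet (the sample sentence reuses the earlier example; nltk.download('punkt') may be required the first time):
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
print(word_tokenize(text))
# e.g. ['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']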
The code snippet uses the TreebankWordTokenizer from the Natural Language Toolkit (NLTK) to tokenize
a given text into individual words.
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
text = "Hello everyone. Welcome to GeeksforGeeks."  # sample text matching the output shown below
tokenizer.tokenize(text)
Output:
['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']
These tokenizers work by separating words using punctuation and spaces. As the code outputs above show, they do not discard the punctuation, allowing the user to decide what to do with it at pre-processing time.
The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on punctuation boundaries.
Each punctuation mark is treated as a separate token.
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")
Output:
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
The code snippet uses the RegexpTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given
text based on a regular expression pattern.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)
Output:
['Let', 's', 'see', 'how', 'it', 's', 'working']
Using regular expressions allows for more fine-grained control over tokenization, and you can customize the
pattern based on your specific requirements.
More Techniques for Tokenization
We have discussed the ways to implement how we can perform tokenization using the NLTK library. We
can also implement tokenization using the following methods and libraries:
● BERT tokenizer: BERT uses a WordPiece tokenizer, which is a type of sub-word tokenizer, for tokenizing input text.
● Byte-Pair Encoding: Byte Pair Encoding (BPE) is a data compression algorithm that has also
found applications in the field of natural language processing, specifically for tokenization. It is a
Sub-word Tokenization technique that works by iteratively merging the most frequent pairs of
consecutive bytes (or characters) in a given corpus (a one-step sketch is shown after this list).
● SentencePiece: SentencePiece is another sub-word tokenization algorithm commonly used for natural language processing tasks. It is designed to be language-agnostic and works by iteratively merging frequent sequences of characters or subwords in a given corpus.
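To illustrate the core idea behind BPE, here is a minimal sketch of a single merge step on a toy vocabulary (the word counts and the helper functions are assumptions; real BPE implementations repeat this step many times and track the learned merges):
from collections import Counter

# Toy vocabulary: each word is a tuple of symbols with a frequency
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w"): 6, ("n", "e", "w", "e", "r"): 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])   # merge the pair into one symbol
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(vocab)   # the most frequent adjacent pair in this toy corpus
print(pair)
print(merge_pair(vocab, pair))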
Limitations of Tokenization
● Tokenization alone cannot capture the meaning of a sentence, which can result in ambiguity.
● Chinese, Japanese, and Arabic lack distinct spaces between words. Hence, the absence of clear
boundaries complicates the process of tokenization.
● Tough to decide how to tokenize text that may include more than one word, for example, email
addresses, URLs and special symbols
Detecting and Correcting Spelling Errors
Spelling correction is the process of substituting an erroneously spelt word with the most likely intended one in a document produced in any natural language. It has been a focus of ongoing study in the field of Natural Language Processing (NLP), whose broader goal is to create software that can analyse, comprehend, and generate natural human languages. NLP covers many applications, such as Text Summarisation, Question Answering, Machine Translation, Parsing, Information Retrieval and Optical Character Recognition. The basic aim of spell checking is to discover problems in written text. Dictionary lookup and
N-gram analysis are two ways for error detection. The technique for detecting errors often entails
determining whether or not an input string is a legitimate dictionary word. For detecting such errors,
efficient approaches have been developed. Dictionary lookup and n-gram approaches are used by most
spellcheckers. When a word in a written document is discovered as a mistake, spelling correction procedures
are used to correct the word or provide the right options. For text error correction, many techniques are available, such as rule-based techniques, edit distance techniques, N-gram techniques, and deep learning techniques.
Introduction
There are many commercial as well as non-commercial spelling error detection and correction tools
available in the market for almost all popular languages.
Every tool generally works at the word level, using an integral dictionary or WordNet as the backend
database for correction and detection.
● To correct the error, a spell checker searches the dictionary or WordNet for the word that most
closely resembles the erroneous word.
● These suggested words are presented to the user to choose the intended correct word.
Applications of spelling correction include:
● Machine Translation
● Search Engines
● Information Retrieval
● Text Editors
Spell checking involves two main stages:
1. Error Detection
2. Error Correction
According to Damerau’s studies, spelling errors can generally be divided into two main types based on error patterns:
1. Typographic Errors
● Occur when the correct spelling of a word is known but mistyped by mistake.
● Example:
○ Intended: "because"
○ Typed: "becuase"
2. Cognitive (Phonetic) Errors
● Occur when the writer does not know or misremembers the correct spelling.
● Often result in words that are spelled correctly but used incorrectly.
● Pronunciation of the misspelled word is the same or similar to the correct word.
● Example:
○ Intended: "piece", Typed: "peace"
Error Detection Techniques
Two approaches are commonly used for error detection:
○ N-gram Analysis
○ Dictionary / WordNet Lookup
1. N-gram Analysis
N-gram analysis is a method to find incorrectly spelt words in a large body of text.
Instead of comparing each entire word to a dictionary, this method checks n-grams (substrings of length n).
Process
1. An n-gram is a set of consecutive characters taken from a string, where n is a chosen length.
2. Real n-gram frequencies are stored in an n-dimensional matrix.
3. When processing text, each n-gram of an input word is looked up in this matrix; n-grams that are rare or never occur indicate a likely misspelling.
Advantages
● Does not require comparing every whole word against a dictionary; unusual n-gram patterns alone signal a likely error.
2. Dictionary / WordNet Lookup
A dictionary or WordNet is a lexical resource containing a list of correct words in a particular language.
Process
1. Each word of the input text is looked up in the dictionary; a word that is not found is flagged as a possible error.
Drawbacks
1. Cannot detect real-word errors (a misspelling that happens to be another valid word).
2. May not cover all possible words (especially technical terms, slang, or new words).
WordNet
● A large-scale lexical database with a hierarchical structure based on relationships between concepts.
Error Correction
Error correction is the process of replacing detected misspellings with the correct words.
Steps
1. Generation of Candidate Corrections
○ Uses a precompiled table of legal n-grams to find potential correction terms.
○ Can also use dictionary-based lookups and edit distance algorithms.
2. Ranking of Candidate Corrections
○ Candidates are ranked by how likely they are to be the intended word (e.g., by edit distance, frequency, or probability).
○ In some cases, ranking is skipped, and the user selects the correct word from the list of candidates.
Error Correction Techniques
Error correction in spell checking involves selecting the most likely correct word from a set of candidates
generated after error detection. Several approaches are used to determine the best correction.
1. Dictionary Lookup
● Idea:
Compare each word in the text against a list of valid words (dictionary).
● Advantages:
○ Simple to implement.
○ Very fast for small dictionaries.
● Drawbacks:
○ Becomes slow and less practical as the dictionary grows large.
2. Minimum Edit Distance
● Idea:
Count the minimum number of single-character insertions, deletions, and substitutions needed to transform the misspelled word into a dictionary word.
● Example:
"kitten" → "sitting"
Substitution: k → s
Substitution: e → i
Insertion: g (at the end)
Distance = 3
Usage in correction:
Choose the dictionary word with the smallest edit distance to the misspelled word.
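As a quick sketch, Python's standard difflib module can generate similar-looking candidates in this spirit (it ranks by sequence similarity rather than pure edit distance; the toy dictionary is an assumption):
import difflib

dictionary = ["because", "become", "beacon", "caused"]
misspelled = "becuase"

# Returns the closest dictionary entries above the similarity cutoff
print(difflib.get_close_matches(misspelled, dictionary, n=3, cutoff=0.6))
# e.g. ['because', ...]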
3. Similarity Key (Phonetic) Techniques
● Idea:
Correct words based on how they sound, not just how they are spelled.
● Why:
Many misspellings, especially cognitive errors, preserve the pronunciation of the intended word.
● Example:
Soundex("Robert") = R163
Soundex("Rupert") = R163 → Likely match.
4. Rule-Based Techniques
● Idea:
○ Encode knowledge of common error patterns as rules, for example by assigning different costs for operations (e.g., substituting 'a' with 'e' costs less than substituting 'a' with 'z').
● Example:
5. N-gram-Based Techniques
This approach measures how many n-grams the misspelled word and a dictionary word have in common,
weighted by the length of the words.
● Process:
1. Split both the misspelled word and each candidate word into n-grams.
2. Count the common n-grams between them.
3. Apply a similarity score (e.g., Jaccard coefficient or cosine similarity) that is normalized by
word length
Example:
6. Probabilistic Techniques
Uses probability to determine which candidate is most likely the correct spelling.
● Notation:
○ w = original (misspelled) word
○ c = candidate correction word
● Goal:
Choose the candidate c that maximises P(c | w), which by Bayes' rule is proportional to P(w | c) · P(c), where:
○ P(c): prior probability of the candidate word c (how common it is in a corpus)
○ P(w | c): probability of making the specific spelling error from c to w
Example:
w = "wird"
Drawback:
○ Can never be 100% certain; it’s always about most likely choice.
7. Neural Networks
● Method:
○ One output node per word in the dictionary.
○ Input nodes for every possible n-gram in every word position (n usually 1 or 2).
○ Network predicts which output (word) is most likely correct.
● Limitations:
Works well only for small dictionaries (< 1000 words).
Noisy Channel Model
● Intuition:
○ Treat the misspelled word as if it was a correct word distorted by noise during transmission
(like in a communication channel).
○ Noise could be substitutions, insertions, deletions, or transpositions of letters.
● Goal:
Find the true word by reversing the noise: choose the candidate c that maximises P(c | w).
● Bayesian Formulation:
ĉ = argmax over candidates c of P(w | c) · P(c)
Minimum Edit Distance
The minimum edit distance between two strings is the minimum number of operations (insertions, deletions, and substitutions of a single character) required to convert one string into the other.
The idea is to process the characters one by one, starting from either the left or the right end of both strings. Let us process from the right end. For every pair of characters being compared there are two possibilities: either they match or they do not. If the last characters of both strings match, we simply recurse on the remaining parts of the strings. When the last characters do not match, we consider all three operations that could make them match, i.e. insert, replace, and remove, recursively compute the result for the remaining part of the string in each case, take the minimum of the three answers, and add 1 to it.
Below is the recursive tree for this problem, considering the case when the last characters never match.
When the last characters of the strings match, make a recursive call editDistance(m-1, n-1) to calculate the answer for the remaining part of the strings.
When the last characters of the strings don't match, make three recursive calls as shown below:
● Case 1: When the last characters of both strings are the same:
editDistance(s1, s2, m, n) = editDistance(s1, s2, m-1, n-1)
● Case 2: When the last characters are different:
editDistance(s1, s2, m, n) = 1 + minimum{ editDistance(s1, s2, m, n-1), editDistance(s1, s2, m-1, n), editDistance(s1, s2, m-1, n-1) }
# Recursive solution: minimum number of operations to convert s1[0..m-1] into s2[0..n-1]
def editDistance(s1, s2, m, n):
    if m == 0:
        return n
    if n == 0:
        return m
    # If last characters match, recur for the remaining strings.
    if s1[m - 1] == s2[n - 1]:
        return editDistance(s1, s2, m - 1, n - 1)
    # Otherwise try insert, remove and replace; take the minimum and add 1.
    return 1 + min(editDistance(s1, s2, m, n - 1),      # insert
                   editDistance(s1, s2, m - 1, n),      # remove
                   editDistance(s1, s2, m - 1, n - 1))  # replace

if __name__ == "__main__":
    s1 = "abcd"
    s2 = "bcfe"
    print(editDistance(s1, s2, len(s1), len(s2)))
Output
3
The above recursive solution recomputes the same subproblems many times. So, we can use Memoization to store the result of each subproblem to avoid recalculating it again and again.
# Memoized (top-down) solution: minimum number
# of operations to convert s1 to s2
def editDistRec(s1, s2, m, n, memo):
    if m == 0:
        return n
    if n == 0:
        return m

    # If value is memoized, reuse it
    if memo[m][n] != -1:
        return memo[m][n]

    # If last characters match, recur for the remaining strings.
    if s1[m - 1] == s2[n - 1]:
        memo[m][n] = editDistRec(s1, s2, m - 1, n - 1, memo)
        return memo[m][n]

    # Otherwise consider insert, remove and replace
    memo[m][n] = 1 + min(
        editDistRec(s1, s2, m, n - 1, memo),      # insert
        editDistRec(s1, s2, m - 1, n, memo),      # remove
        editDistRec(s1, s2, m - 1, n - 1, memo))  # replace
    return memo[m][n]

# Wrapper function to initiate the recursive calculation
def editDistance(s1, s2):
    m, n = len(s1), len(s2)
    memo = [[-1] * (n + 1) for _ in range(m + 1)]
    return editDistRec(s1, s2, m, n, memo)

if __name__ == "__main__":
    s1 = "abcd"
    s2 = "bcfe"
    print(editDistance(s1, s2))
OUTPUT:
3
Use a table to store the solutions of subproblems so that we avoid recalculating the same subproblem multiple times. By doing this, whenever a subproblem repeats, we retrieve its solution from the table itself.
Below are the steps to convert the recursive approach to Bottom up approach:
1. Choosing Dimensions of Table: The state of the smaller sub-problems depends on the input parameters m and n, because at least one of them will decrease after each recursive call. So we need to construct a 2D table dp[][] to store the solutions of the sub-problems.
2. Choosing Correct size of Table: The range of parameters goes from 0 to m and 0 to n. So we choose
dimensions as (m + 1)*(n + 1)
3. Filling the table: It consists of two stages, table initialisation and building the solution from the smaller subproblems:
● Table initialisation: Before building the solution, we need to initialise the table with the known values, i.e. the base case. Here m = 0 and n = 0 correspond to the base case, so we initialise the first column dp[i][0] with i and the first row dp[0][j] with j.
● Building the solution of larger problems from the smaller subproblems: We can easily define the
iterative structure by using the recursive structure of the above recursive solution.
● if (s1[i - 1] == s2[j - 1]) dp[i][j] = dp[i - 1][j - 1];
● if (s1[i - 1] != s2[j - 1]) dp[i][j] = 1 + min(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1]);
4. Returning final solution: After filling the table iteratively, our final solution gets stored at the bottom right
corner of the 2-D table i.e. we return dp[m][n] as an output.
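A compact sketch of the bottom-up (tabulation) approach described in the steps above (the function name is illustrative):
def edit_distance_dp(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between the first i chars of s1 and the first j chars of s2
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Table initialisation (base cases)
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of s1
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of s2

    # Build solutions of larger problems from smaller subproblems
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i][j - 1],      # insert
                                   dp[i - 1][j],      # remove
                                   dp[i - 1][j - 1])  # replace
    return dp[m][n]

print(edit_distance_dp("abcd", "bcfe"))  # 3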
Applications of Edit Distance
● Plagiarism Detection
● String Matching
If we do not consider the replace operation, then the edit distance problem is the same as the Longest Common Subsequence (LCS) problem. With only insert and delete operations allowed, the edit distance between two strings of lengths m and n is (m + n - 2*LCS).
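For example, taking s1 = "abcd" and s2 = "bcfe" (as in the code above), the LCS is "bc" with length 2, so the insert/delete-only edit distance is 4 + 4 - 2*2 = 4, one more than the distance of 3 obtained when substitutions are allowed.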