Natural Language
Processing
(PMDS606L)
MODULE 2: Text processing
Dr. Reya Sharma
Assistant Professor
Dept. of Analytics, SCOPE, VIT
Text Pre-Processing
Text pre-processing is the first step in any NLP project.
It involves cleaning and preparing text data so that a machine can
understand and work with it.
Text Pre-Processing
Why is it needed?
• Raw text is messy, inconsistent, and full of unnecessary elements
(punctuation, numbers, emojis, special characters, etc.)
• Computers can't process natural language as humans do.
• Clean, organized text improves accuracy in text analysis tasks like
classification, translation, and sentiment analysis.
Text Pre-Processing
"Heyyy, r u freee 2nite?? ”
Common Text Pre-Processing Steps
Step | What it Does | Example
Lowercasing | Converts all text to lowercase | "Hello" → "hello"
Removing Punctuation | Removes symbols like . , ? ! ; etc. | "Hello, World!" → "Hello World"
Removing Numbers | Removes numerical values if not required | "I have 2 cats" → "I have cats"
Removing Stopwords | Removes common words that don't add much meaning | "is", "the", "and", "a"
Tokenization | Splits text into individual words or sentences | "Hello World" → ["Hello", "World"]
Stemming | Removes word suffixes to get the base form | "running", "runs" → "run"; "happiness" → "happi"
Lemmatization | Converts words to their dictionary form | "am", "are", "is" → "be"; "better" → "good"
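A minimal sketch of these steps in Python using NLTK (the sample sentence and the exact order of steps are illustrative assumptions, not a fixed recipe):

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "Heyyy, I have 2 cats and I am loving NLP!"
text = text.lower()                                   # lowercasing
text = re.sub(r"[^\w\s]", "", text)                   # remove punctuation
text = re.sub(r"\d+", "", text)                       # remove numbers
tokens = word_tokenize(text)                          # tokenization
tokens = [t for t in tokens if t not in stopwords.words("english")]   # remove stopwords
print([PorterStemmer().stem(t) for t in tokens])      # stemming, e.g. ['heyyy', 'cat', 'love', 'nlp']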
Challenges in Text Pre-Processing
Challenge | Explanation | Example
Ambiguity | Words can have multiple meanings | "Bank" (river bank or financial bank)
Slang & Abbreviations | People use casual language, emojis, and abbreviations in digital text | "u", "gr8", "idk"
Code Mixing / Multilingual Text | Mixed languages in one sentence (common in India) | "Kal class hai bro!"
Special Characters & Emojis | Need to decide whether to remove, keep, or convert them | emojis, #, @
Domain-Specific Words | Different fields use different vocabularies | Medical reports vs. Twitter comments
Spelling Mistakes & Typos | Errors can mislead NLP systems | "recieve" instead of "receive"
Tokenization
• Tokenization is the process of breaking down text (like a sentence or
paragraph) into smaller units called tokens.
• These tokens can be:
• Words (e.g., "I love NLP" → ["I", "love", "NLP"])
• Sub-words (used in deep learning: "unhappiness" → ["un", "happi", "ness"])
• Sentences (when breaking a paragraph)
• Characters (rarely used but important in some models)
Tokenization
Types of Tokenization : Tokenization can be classified into several types based on
how the text is segmented. Here are some types of tokenization:
1. Whitespace Tokenizer
• Splits text using spaces.
• Example: "I like NLP." → ["I", "like", "NLP."]
• Issue: Keeps punctuation attached.
2. Word Tokenizer
• Word tokenization is the most commonly used method where text is divided into individual
words.
• Handles punctuation better.
• Example: "I like NLP." → ["I", "like", "NLP", "."]
3. Character Tokenizer
• Splits into individual characters.
• Example: "NLP" → ["N", "L", "P"]
Tokenization
4. Sub-word Tokenizer
• This strikes a balance between word and character tokenization by breaking down
text into units that are larger than a single character but smaller than a full word.
• Breaks rare or unknown words into parts.
• Helps models handle new or complex words.
• Example: "unhappiness" → ["un", "happi", "ness"]
5. Sentence Tokenizer
• Breaks a paragraph into sentences.
• Example: "He studies AI. She loves NLP." → ["He studies AI.", "She loves NLP."]
6. Regular Expression Tokenizer
• Uses custom patterns to split text.
• Example: Split only on punctuation or digits
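A short sketch of sentence and regular-expression tokenization with NLTK (the custom pattern below, letters only, is just one possible choice):

import nltk
from nltk.tokenize import sent_tokenize, RegexpTokenizer
nltk.download("punkt", quiet=True)

print(sent_tokenize("He studies AI. She loves NLP."))   # ['He studies AI.', 'She loves NLP.']
tokenizer = RegexpTokenizer(r"[A-Za-z]+")               # keep only runs of letters
print(tokenizer.tokenize("Room 101, please!"))          # ['Room', 'please']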
Tokenization
7. N-gram Tokenization
• N-gram tokenization splits words into fixed-sized chunks (size = n) of data.
• In this technique text is broken into overlapping sequences of ‘n’ items (usually words or
characters).
➢ If n = 2, it's called a bigram
➢ If n = 3, it's called a trigram
➢ If n = 1, it's just individual words (also called unigrams)
Example:
• Input before tokenization: ["Machine learning is powerful"]
• Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is',
'powerful')]
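A minimal bigram/trigram sketch (shown both with a plain list comprehension and with NLTK's ngrams helper):

from nltk.util import ngrams

words = "Machine learning is powerful".split()
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
print(bigrams)                 # [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
print(list(ngrams(words, 3)))  # [('Machine', 'learning', 'is'), ('learning', 'is', 'powerful')]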
Tokenization
Need of Tokenization:
• Effective Text Processing: Reduces the size of raw text, resulting in easy
and efficient statistical and computational analysis.
• Feature extraction: Text data can be represented numerically for
algorithmic comprehension by using tokens as features in ML models.
• Information Retrieval: Tokenization is essential for indexing and searching
in systems that store and retrieve information efficiently based on words or
phrases.
• Text Analysis: Used in sentiment analysis and named entity recognition, to
determine the function and context of individual words in a sentence.
• Vocabulary Management: Generates a list of distinct tokens, which helps manage a corpus's vocabulary.
• Task-Specific Adaptation: Adapts to the needs of a particular NLP task, e.g., summarization and machine translation.
Tokenization
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Sentence Segmentation
• Sentence segmentation divides a text into individual sentences, deciding where each sentence begins and ends.
• Sentence boundaries are usually found from punctuation, which falls into two cases:
➢ Clear markers such as ? or !, which are relatively unambiguous.
➢ Ambiguous markers such as the period (.), which has several other uses besides ending a sentence (e.g., abbreviations).
Period “.” is quite ambiguous
• Sentence boundary
• Abbreviations like etc. or Dr.
• Numbers like .02% or 4.3
Sentence Segmentation
• Build a binary classifier
➢ Looks at a “.”
➢ Decides EndOfSentence/NotEndOfSentence
➢ Classifiers: hand-written rules, regular expressions, or machine-learning.
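A toy hand-written-rule version of such a classifier (the abbreviation list and the rules are illustrative assumptions only, not a complete system):

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

def is_end_of_sentence(word_with_period, next_word):
    w = word_with_period.lower()
    if w in ABBREVIATIONS:                 # "Dr." -> probably not end of sentence
        return False
    if any(ch.isdigit() for ch in w):      # "4.3" or ".02%" -> not end of sentence
        return False
    return next_word[:1].isupper()         # capitalized next word -> likely end of sentence

print(is_end_of_sentence("AI.", "She"))    # True
print(is_end_of_sentence("Dr.", "Smith"))  # False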
Sentence Segmentation
• Determining if a word is end-of-sentence: a Decision Tree
Sentence Segmentation
More sophisticated decision tree features: look at the case of the words before and after the period to decide whether it marks EOS (end of sentence), and also at the length of the word containing the period.
• Case of word with ".": Upper, Lower, Cap, Number
• Case of word after ".": Upper, Lower, Cap, Number
• Numeric features
➢ Length of word with "."
➢ Probability that the word with "." occurs at end-of-sentence
➢ Probability that the word after "." occurs at beginning-of-sentence
Sentence Segmentation
Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
• Hand-building only possible for very simple features, domains
• Instead, structure usually learned by machine learning from a training
corpus
Sentence Segmentation
Decision Trees and other classifiers
• We can think of the questions in a decision tree
• As features that could be exploited by any kind of classifier
➢ Logistic regression
➢ SVM
➢ Neural Nets, etc.
Regular Expressions
• A Regular Expression is a pattern that describes text you want to find
in a document. Think of it like a smart search function.
• Instead of searching for just one word like "cat", you can search for a pattern like "any word ending in 'ing'".
Why Do We Need REs in NLP?
➢ To find certain words or phrases in large text files.
➢ To clean and normalize messy text data.
➢ To extract information (like names, dates, or prices).
➢ Used in chatbots, text editors, search engines, etc.
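For instance, a quick sketch with Python's re module, finding every word ending in "ing" (the sentence is made up):

import re
text = "She was running and singing while they were watching."
print(re.findall(r"\b\w+ing\b", text))   # ['running', 'singing', 'watching']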
Regular Expressions
• Basics of Regular Expressions
1. Simple Matching: The simplest kind of regular expression is a
sequence of simple characters. To search for woodchuck, we type:
➢ /woodchuck/ → finds the word “woodchuck” in a text
➢ /!/ → finds exclamation marks
Fig. Some simple regex searches.
Regular Expressions
2. Case Sensitivity:
• REs are case-sensitive: /s/ ≠ /S/.
• This means that the pattern /woodchucks/ will not match the string
Woodchucks.
• We can solve this problem with the use of the square braces [ and ].
• The string of characters inside the braces specifies a disjunction of
characters to match.
• Use brackets to include both: /[sS]/
Regular Expressions
3. Character Sets
Fig. The use of the brackets [] to specify a disjunction of characters.
• The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in
expressions, they can get awkward.
• It’s inconvenient to specify /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ range to
mean “any capital letter”.
Regular Expressions
4. Ranges and Negation
• In cases where there is a well-defined sequence associated with a set of
characters, the brackets can be used with the dash (-) to specify any one
character in a range.
Fig. The use of the brackets [ ] plus the dash (-) to specify a range.
• The square braces can also be used to specify what a single character cannot be,
by use of the caret ˆ.
Regular Expressions
• The caret ˆ means negation only when it is the first symbol inside [ ].
• If the caret ˆ is the first symbol after the open square brace [, the resulting
pattern is negated. For example, the pattern /[ˆa]/ matches any single character
(including special characters) except a.
Fig. Uses of the caret ˆ for negation or just to mean ˆ
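A small sketch of character classes, ranges, and negation with Python's re module (in Python the caret is written ^; the example strings are made up):

import re
print(re.findall(r"[sS]", "Sam saw seven Sharks"))   # ['S', 's', 's', 'S', 's']
print(re.findall(r"[0-9]", "Room 42, floor 7"))      # ['4', '2', '7']
print(re.findall(r"[^A-Za-z ]", "Hi! Cost: $5"))     # ['!', ':', '$', '5']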
Regular Expressions
5. Optional Characters
• Sometimes we want to match a word with or without a certain letter, like
woodchuck and woodchucks.
• Square brackets [ ] can’t help here because they only choose one character from
the list — they can’t say "or nothing".
• To make a character optional, we use the question mark ?.
• ? = “zero or one” of the previous character
• Example: /woodchucks?/ → matches "woodchuck" or "woodchucks".
Fig. The question mark ? marks optionality of the previous expression
Regular Expressions
6. Repetition:
• The Kleene star (*) means "zero or more of the character just before it".
• Examples:
• * → “zero or more” (e.g., /a*/ = "", "a", "aaa")
• + → “one or more” (e.g., /a+/ = "a", "aaa")
• /[0-9]+/ → matches one or more digits.
• Strings like “Off Minor” match /a*/ because they contain zero a’s.
• To match one or more a’s, we use /aa*/ → One 'a' followed by zero or more 'a’s.
• You can repeat more complex patterns too:
o /[ab]*/ matches zero or more a’s or b’s.
o It matches strings like: "aaaa", "ababab", "bbbb", or even "".
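Optionality and repetition can be sketched in Python as follows (the example strings are made up):

import re
print(re.findall(r"woodchucks?", "a woodchuck and two woodchucks"))  # ['woodchuck', 'woodchucks']
print(re.findall(r"[0-9]+", "call 42 or 7"))                         # ['42', '7']
print(bool(re.search(r"a*", "Off Minor")))                           # True: a* matches zero a's
print(re.findall(r"aa*", "baaad"))                                   # ['aaa']: one a followed by zero or more a's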
Regular Expressions
7. Wildcard
• The period . is a special character in regular expressions.
• It acts as a wildcard that matches any single character (except a line break).
• So, /./ will match letters, numbers, symbols—any one character.
• It is often used with the Kleene star * to mean → “any number of any
characters”.
• Example:
o /.*./ matches any string containing at least one character (the trailing . requires one character).
o To find a word like "aardvark" appearing twice in a line, we use: → /aardvark.*aardvark/
Fig. The use of the period . to specify any character.
Regular Expressions
8. Anchors
• Anchors are special symbols in regular expressions used to match specific
positions in a line of text.
• The two most common anchors are:
➢ Caret ^ – matches the start of a line.
➢ Dollar sign $ – matches the end of a line
Pattern | Matches
^[A-Z] | Palo Alto
^[^A-Za-z] | 1 "Hello"
\.$ | The end.
.$ | The end? The end!
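A short sketch of the wildcard and anchors in Python (re.MULTILINE makes ^ and $ apply per line; the sample text is made up):

import re
text = "Palo Alto\n1 dollar\nThe end."
print(re.findall(r"^[A-Z].*$", text, re.MULTILINE))   # ['Palo Alto', 'The end.'] - lines starting with a capital
print(re.findall(r"\.$", text, re.MULTILINE))         # ['.'] - only the line that ends with a period
print(bool(re.search(r"aardvark.*aardvark", "an aardvark met another aardvark")))  # True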
Regular Expressions
• There are also two other anchors:
➢ \b matches a word boundary,
➢ \B matches a non-boundary (i.e., in the middle of a word).
• Example: /\bthe\b/ matches the word the but not the word other
• A word for the purposes of a regular expression is defined as any sequence of:
➢ Letters (a–z, A–Z)
➢ Digits (0–9)
➢ Underscore (_)
• Example: /\b99\b/
➢ will match the string 99 in “There are 99 bottles of beer on the wall” (because 99 follows a
space)
➢ but not 99 in “There are 299 bottles of beer on the wall” (since 99 follows a number).
➢ but it will match 99 in "$99" (since 99 follows a dollar sign, which is not a digit, underscore, or letter).
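These word-boundary cases can be checked in Python (the example sentences are made up):

import re
print(re.findall(r"\b99\b", "There are 99 bottles"))   # ['99']
print(re.findall(r"\b99\b", "There are 299 bottles"))  # []
print(re.findall(r"\b99\b", "It costs $99 today"))     # ['99'] - $ is not a word character
print(re.findall(r"\bthe\b", "the other theology"))    # ['the'] - not matched inside "other"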
Regular Expressions
9. OR (Disjunction)
• Suppose we want to search for texts about pets, especially cats or dogs.
• We want to match either the word "cat" or "dog".
• We cannot use square brackets like /[catdog]/ because:
• [catdog] means "any one character" from the list c, a, t, d, o, g.
• It does not mean the full word "cat" or "dog".
• To search for entire words like "cat" or "dog", we use the pipe symbol |, called
the disjunction operator.
• The pattern /cat|dog/ matches either "cat" or "dog".
Regular Expressions
10. Grouping with ()
• Sometimes we want to use OR (|) inside a bigger word or pattern.
➢ Example: We want to match both "guppy" and "guppies".
• We can’t write /guppy|ies/ because:
• It matches either "guppy" or just "ies", not "guppies".
• This happens because the | (OR) operator has low precedence — meaning:
➢ It treats "guppy" and "ies" as completely separate patterns.
• To fix this, we use parentheses () to group the variable part.
• So, we write: /gupp(y|ies)/
➢ This means: "guppy" OR "guppies", using one base pattern gupp with two suffix options.
• Parentheses make grouped patterns act like a single unit, so | applies only within
them.
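A quick sketch in Python (here (?: ... ) is Python's non-capturing form of the grouping parentheses, used so findall returns the whole match):

import re
print(re.findall(r"cat|dog", "my cat chased the dog"))         # ['cat', 'dog']
print(re.findall(r"gupp(?:y|ies)", "one guppy, two guppies"))  # ['guppy', 'guppies']
print(re.findall(r"guppy|ies", "one guppy, two guppies"))      # ['guppy', 'ies'] - why plain | fails here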
Regular Expressions
• In regular expressions, some operators are applied before others — this is
known as operator precedence.
• This ordering of how operators are prioritized is called the regular expression
operator precedence hierarchy.
• The following list gives the order of RE operator precedence, from highest precedence to lowest precedence:
1. Parenthesis ( )
2. Counters * + ? { }
3. Sequences and anchors (e.g., the ˆmy end$)
4. Disjunction |
Regular Expression Example: Find me all instances
of the word “the” in a text.
• A simple (but incorrect) pattern might be: /the/
• One problem is that this pattern will miss the word when it begins a
sentence and hence is capitalized (i.e., The).
• This might lead us to the following pattern: /[tT]he/
• But we will still incorrectly return texts with the embedded in other words
(e.g., other or theology).
• So we need to specify that we want instances with a word boundary on
both sides: /\b[tT]he\b/
• Suppose we wanted to do this without the use of /\b/.
• We might want this since /\b/ won't treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underscores or numbers nearby (the_ or the25).
Regular Expression Example: Find me all
instances of the word “the” in a text.
• We need to specify that we want instances in which there are no alphabetic
letters on either side of the: /[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
• But there is still one more problem with this pattern: it won’t find the word the
when it begins a line.
• We can avoid this by specifying that before the we require either the beginning-
of-line or a non-alphabetic character, and the same at the end of the line:
/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/
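The successive patterns can be tried in Python (the ˆ glyph on the slides is the ordinary ^ in code; the test sentence is made up):

import re
line = "The cat sat on the other mat"
print(re.findall(r"the", line))          # ['the', 'the'] - misses "The", matches inside "other"
print(re.findall(r"[tT]he", line))       # ['The', 'the', 'the'] - still matches inside "other"
print(re.findall(r"\b[tT]he\b", line))   # ['The', 'the'] - only the standalone word
print(re.findall(r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)", line))   # each match is a (left, the, right) tuple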
Text Normalization
• Text normalization reduces the word to its base form by removing the inflectional
part.
• There are two main approaches toward text normalization in NLP:
Stemming and Lemmatization
• Text normalization is a crucial step in Natural Language Processing (NLP).
• It is widely used in various NLP applications like:
➢ Speech recognition
➢ Text-to-speech systems
➢ Spam email detection
➢ Sentiment analysis, etc.
Text Normalization
• Example: Words like "collection", "collective", "collect", and "collectively" are
all variations of the base word "collect".
• Human language is inherently random, and we often use different word forms
for the same concept.
• Computers struggle to handle this randomness effectively without
normalization.
• Text normalization removes this randomness by converting words to a
standard format.
• This process helps in:
• Reducing variations of the same word
• Lowering the number of input features in the model
• Improving model efficiency
Text Normalization
Stemming
• Stemming is a basic technique for text normalization in NLP.
• It works by removing prefixes and suffixes (inflectional parts) from words.
• The goal is to obtain the stem/root of the word.
Example: "connection", "connected", "connecting" all reduce to the common stem "connect".
• It does not consider the semantic meaning of the word.
• Drawback: The resulting stem may not be a meaningful word.
Example: "laziness" → "lazi" (not "lazy")
• Can lead to loss of meaning or incorrect interpretations.
• In Python's NLTK library, PorterStemmer is used for stemming.
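For example, a minimal PorterStemmer sketch reproducing the examples above:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for w in ["connection", "connected", "connecting", "running", "happiness", "laziness"]:
    print(w, "->", stemmer.stem(w))
# connect, connect, connect, run, happi, lazi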
Text Normalization
• Stemming must be applied carefully to avoid Over-stemming and Under-stemming.
a) Over-stemming (too much chopping)
• Over-stemming occurs when the stemmer removes too many characters from a word.
• This leads to different words being incorrectly treated as the same.
• Example: "university" and "universe" both reduced to "univers".
• This falsely implies they are semantically the same, which they are not.
b) Under-stemming (not reducing enough)
• Occurs when the stemmer does not reduce words enough.
• Results in related words being treated as different.
• Example: "data" → "dat" and "datum" → "datu“ (instead of the same stem “dat”)
• The stemmer fails to recognize that both words come from the same root
Text Normalization
Porter’s algorithm: The most common English stemmer
• Porter Stemmer is a widely used stemming algorithm for English developed by
Martin Porter.
• Involves a step-by-step rule-based approach to strip suffixes.
Text Normalization
Handling Past Tense & Gerunds
• Rules such as (*v*)ing → ε and (*v*)ed → ε strip -ing and -ed endings.
• They apply only if the word has a vowel before the suffix (*v* means a vowel exists), so "walking" → "walk" but "sing" is left unchanged.
Text Normalization
Lemmatization
• Improves upon stemming by addressing its drawbacks.
• Uses vocabulary, grammar rules, and part-of-speech (POS) tags to reduce words
to their base form (lemma).
• Removes inflectional endings more accurately than simple suffix chopping.
• Relies on a predefined dictionary (lexicon) to identify the correct lemma based
on context.
• Ensures that the meaning of the word is preserved during normalization.
• In Python's NLTK library, WordNetLemmatizer is used for lemmatization.
• More suitable for tasks that require semantic accuracy, like: Text classification,
Question answering, Named Entity Recognition (NER)
Text Normalization
• Uses vocabulary + morphological analysis.
• Reduce inflections or variant forms to base form
• Examples:
➢ am, are, is → be
➢ car, cars, car's, cars' → car
➢ better → good
➢ running → run
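For example, a minimal WordNetLemmatizer sketch reproducing the examples above (note that the correct lemma depends on the part-of-speech tag passed in: "v" for verbs, "a" for adjectives):

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))              # car
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("is", pos="v"))       # be
print(lemmatizer.lemmatize("better", pos="a"))   # good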