Natural Language
Processing
(PMDS606L)
MODULE 2: Text processing
Dr. Reya Sharma
Assistant Professor
Dept. of Analytics, SCOPE, VIT
Text Pre-Processing
Text pre-processing is the first step in any NLP project.
It involves cleaning and preparing text data so that a machine can
understand and work with it.
Text Pre-Processing
Why is it needed?
• Raw text is messy, inconsistent, and full of unnecessary elements
(punctuation, numbers, emojis, special characters, etc.)
• Computers can't process natural language as humans do.
• Clean, organized text improves accuracy in text analysis tasks like
classification, translation, and sentiment analysis.
Text Pre-Processing
"Heyyy, r u freee 2nite?? ”
Common Text Pre-Processing Steps
Step | What it Does | Example
Lowercasing | Converts all text to lowercase | "Hello" → "hello"
Removing Punctuation | Removes symbols like . , ? ! ; etc. | "Hello, World!" → "Hello World"
Removing Numbers | Removes numerical values if not required | "I have 2 cats" → "I have cats"
Removing Stopwords | Removes common words that don't add much meaning | "is", "the", "and", "a"
Tokenization | Splits text into individual words or sentences | "Hello World" → ["Hello", "World"]
Stemming | Removes word suffixes to get the base form | "running", "runs" → "run"; "happiness" → "happi"
Lemmatization | Converts words to their dictionary form | "am", "are", "is" → "be"; "better" → "good"
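A minimal sketch of these steps in Python using NLTK (the sample sentence and the exact order of steps are illustrative assumptions, not a fixed recipe):

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "Heyyy, I have 2 cats and I am loving NLP!"
text = text.lower()                                   # lowercasing
text = re.sub(r"[^\w\s]", "", text)                   # remove punctuation
text = re.sub(r"\d+", "", text)                       # remove numbers
tokens = word_tokenize(text)                          # tokenization
tokens = [t for t in tokens if t not in stopwords.words("english")]   # remove stopwords
print([PorterStemmer().stem(t) for t in tokens])      # stemming, e.g. ['heyyy', 'cat', 'love', 'nlp']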
Challenges in Text Pre-Processing
Challenge | Explanation | Example
Ambiguity | Words can have multiple meanings | "Bank" (river bank or financial bank)
Slang & Abbreviations | People use casual language, emojis, and abbreviations in digital text | "u", "gr8", "idk"
Code Mixing / Multilingual Text | Mixed languages in one sentence (common in India) | "Kal class hai bro!"
Special Characters & Emojis | Need to decide whether to remove, keep, or convert them | emojis, #, @
Domain-Specific Words | Different fields use different vocabularies | Medical reports vs. Twitter comments
Spelling Mistakes & Typos | Errors can mislead NLP systems | "recieve" instead of "receive"
Tokenization
• Tokenization is the process of breaking down text (like a sentence or
paragraph) into smaller units called tokens.
• These tokens can be:
• Words (e.g., "I love NLP" → ["I", "love", "NLP"])
• Sub-words (used in deep learning: "unhappiness" → ["un", "happi", "ness"])
• Sentences (when breaking a paragraph)
• Characters (rarely used but important in some models)
Tokenization
Types of Tokenization : Tokenization can be classified into several types based on
how the text is segmented. Here are some types of tokenization:
1. Whitespace Tokenizer
• Splits text using spaces.
• Example: "I like NLP." → ["I", "like", "NLP."]
• Issue: Keeps punctuation attached.
2. Word Tokenizer
• Word tokenization is the most commonly used method where text is divided into individual
words.
• Handles punctuation better.
• Example: "I like NLP." → ["I", "like", "NLP", "."]
3. Character Tokenizer
• Splits into individual characters.
• Example: "NLP" → ["N", "L", "P"]
Tokenization
4. Sub-word Tokenizer
• This strikes a balance between word and character tokenization by breaking down
text into units that are larger than a single character but smaller than a full word.
• Breaks rare or unknown words into parts.
• Helps models handle new or complex words.
• Example: "unhappiness" → ["un", "happi", "ness"]
5. Sentence Tokenizer
• Breaks a paragraph into sentences.
• Example: "He studies AI. She loves NLP." → ["He studies AI.", "She loves NLP."]
6. Regular Expression Tokenizer
• Uses custom patterns to split text.
• Example: Split only on punctuation or digits
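A short sketch of sentence and regular-expression tokenization with NLTK (the custom pattern below, letters only, is just one possible choice):

import nltk
from nltk.tokenize import sent_tokenize, RegexpTokenizer
nltk.download("punkt", quiet=True)

print(sent_tokenize("He studies AI. She loves NLP."))   # ['He studies AI.', 'She loves NLP.']
tokenizer = RegexpTokenizer(r"[A-Za-z]+")               # keep only runs of letters
print(tokenizer.tokenize("Room 101, please!"))          # ['Room', 'please']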
Tokenization
7. N-gram Tokenization
• N-gram tokenization splits words into fixed-sized chunks (size = n) of data.
• In this technique text is broken into overlapping sequences of ‘n’ items (usually words or
characters).
➢ If n = 2, it's called a bigram
➢ If n = 3, it's called a trigram
➢ If n = 1, it's just individual words (also called unigrams)
Example:
• Input before tokenization: ["Machine learning is powerful"]
• Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is',
'powerful')]
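A minimal bigram/trigram sketch (shown both with a plain list comprehension and with NLTK's ngrams helper):

from nltk.util import ngrams

words = "Machine learning is powerful".split()
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
print(bigrams)                 # [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
print(list(ngrams(words, 3)))  # [('Machine', 'learning', 'is'), ('learning', 'is', 'powerful')]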
Tokenization
Need of Tokenization:
• Effective Text Processing: Reduces the size of raw text, resulting in easy
and efficient statistical and computational analysis.
• Feature extraction: Text data can be represented numerically for
algorithmic comprehension by using tokens as features in ML models.
• Information Retrieval: Tokenization is essential for indexing and searching
in systems that store and retrieve information efficiently based on words or
phrases.
• Text Analysis: Used in sentiment analysis and named entity recognition, to
determine the function and context of individual words in a sentence.
• Vocabulary Management: Generates a list of distinct tokens, which helps manage a corpus's vocabulary.
• Task-Specific Adaptation: Adapts to the needs of a particular NLP task, e.g., summarization and machine translation.
Tokenization
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Sentence Segmentation
• Sentence segmentation divides a text into individual sentences, deciding where each sentence begins and ends.
• Sentence boundaries are usually found from punctuation, which falls into two cases:
➢ Clear markers such as ? or !, which are relatively unambiguous.
➢ Ambiguous markers such as the period (.), which has several other uses besides ending a sentence (e.g., abbreviations).
Period “.” is quite ambiguous
• Sentence boundary
• Abbreviations like etc. or Dr.
• Numbers like .02% or 4.3
Sentence Segmentation
• Build a binary classifier
➢ Looks at a “.”
➢ Decides EndOfSentence/NotEndOfSentence
➢ Classifiers: hand-written rules, regular expressions, or machine-learning.
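A toy hand-written-rule version of such a classifier (the abbreviation list and the rules are illustrative assumptions only, not a complete system):

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

def is_end_of_sentence(word_with_period, next_word):
    w = word_with_period.lower()
    if w in ABBREVIATIONS:                 # "Dr." -> probably not end of sentence
        return False
    if any(ch.isdigit() for ch in w):      # "4.3" or ".02%" -> not end of sentence
        return False
    return next_word[:1].isupper()         # capitalized next word -> likely end of sentence

print(is_end_of_sentence("AI.", "She"))    # True
print(is_end_of_sentence("Dr.", "Smith"))  # False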
Sentence Segmentation
• Determining if a word is end-of-sentence: a Decision Tree
Sentence Segmentation
More sophisticated decision tree features: look at the case of the words before and after the period to decide whether it marks EOS (end of sentence), and also at the length of the word containing the period.
• Case of word with ".": Upper, Lower, Cap, Number
• Case of word after ".": Upper, Lower, Cap, Number
• Numeric features
➢ Length of word with "."
➢ Probability that the word with "." occurs at end-of-sentence
➢ Probability that the word after "." occurs at beginning-of-sentence
Sentence Segmentation
Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
• Hand-building only possible for very simple features, domains
• Instead, structure usually learned by machine learning from a training
corpus
Sentence Segmentation
Decision Trees and other classifiers
• We can think of the questions in a decision tree
• As features that could be exploited by any kind of classifier
➢ Logistic regression
➢ SVM
➢ Neural Nets, etc.
Regular Expressions
• A Regular Expression is a pattern that describes text you want to find
in a document. Think of it like a smart search function.
• Instead of searching for just one word like "cat", you can search for a pattern like "any word ending in 'ing'".
Why Do We Need REs in NLP?
➢ To find certain words or phrases in large text files.
➢ To clean and normalize messy text data.
➢ To extract information (like names, dates, or prices).
➢ Used in chatbots, text editors, search engines, etc.
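For instance, a quick sketch with Python's re module, finding every word ending in "ing" (the sentence is made up):

import re
text = "She was running and singing while they were watching."
print(re.findall(r"\b\w+ing\b", text))   # ['running', 'singing', 'watching']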
Regular Expressions
• Basics of Regular Expressions
1. Simple Matching: The simplest kind of regular expression is a
sequence of simple characters. To search for woodchuck, we type:
➢ /woodchuck/ → finds the word “woodchuck” in a text
➢ /!/ → finds exclamation marks
Fig. Some simple regex searches.
Regular Expressions
2. Case Sensitivity:
• REs are case-sensitive: /s/ ≠ /S/.
• This means that the pattern /woodchucks/ will not match the string
Woodchucks.
• We can solve this problem with the use of the square braces [ and ].
• The string of characters inside the braces specifies a disjunction of
characters to match.
• Use brackets to include both: /[sS]/
Regular Expressions
3. Character Sets
Fig. The use of the brackets [] to specify a disjunction of characters.
• The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in
expressions, they can get awkward.
• It’s inconvenient to specify /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ range to
mean “any capital letter”.
Regular Expressions
4. Ranges and Negation
• In cases where there is a well-defined sequence associated with a set of
characters, the brackets can be used with the dash (-) to specify any one
character in a range.
Fig. The use of the brackets [ ] plus the dash (-) to specify a range.
• The square braces can also be used to specify what a single character cannot be,
by use of the caret ˆ.
Regular Expressions
• The caret ˆ means negation only when it is the first symbol inside [ ].
• If the caret ˆ is the first symbol after the open square brace [, the resulting
pattern is negated. For example, the pattern /[ˆa]/ matches any single character
(including special characters) except a.
Fig. Uses of the caret ˆ for negation or just to mean ˆ
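A small sketch of character classes, ranges, and negation with Python's re module (in Python the caret is written ^; the example strings are made up):

import re
print(re.findall(r"[sS]", "Sam saw seven Sharks"))   # ['S', 's', 's', 'S', 's']
print(re.findall(r"[0-9]", "Room 42, floor 7"))      # ['4', '2', '7']
print(re.findall(r"[^A-Za-z ]", "Hi! Cost: $5"))     # ['!', ':', '$', '5']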
Regular Expressions
5. Optional Characters
• Sometimes we want to match a word with or without a certain letter, like
woodchuck and woodchucks.
• Square brackets [ ] can’t help here because they only choose one character from
the list — they can’t say "or nothing".
• To make a character optional, we use the question mark ?.
• ? = “zero or one” of the previous character
• Example: /woodchucks?/ → matches "woodchuck" or "woodchucks".
Fig. The question mark ? marks optionality of the previous expression
Regular Expressions
6. Repetition:
• The Kleene star (*) means "zero or more of the character just before it".
• Examples:
• * → “zero or more” (e.g., /a*/ = "", "a", "aaa")
• + → “one or more” (e.g., /a+/ = "a", "aaa")
• /[0-9]+/ → matches one or more digits.
• Strings like “Off Minor” match /a*/ because they contain zero a’s.
• To match one or more a’s, we use /aa*/ → One 'a' followed by zero or more 'a’s.
• You can repeat more complex patterns too:
o /[ab]*/ matches zero or more a’s or b’s.
o It matches strings like: "aaaa", "ababab", "bbbb", or even "".
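Optionality and repetition can be sketched in Python as follows (the example strings are made up):

import re
print(re.findall(r"woodchucks?", "a woodchuck and two woodchucks"))  # ['woodchuck', 'woodchucks']
print(re.findall(r"[0-9]+", "call 42 or 7"))                         # ['42', '7']
print(bool(re.search(r"a*", "Off Minor")))                           # True: a* matches zero a's
print(re.findall(r"aa*", "baaad"))                                   # ['aaa']: one a followed by zero or more a's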
Regular Expressions
7. Wildcard
• The period . is a special character in regular expressions.
• It acts as a wildcard that matches any single character (except a line break).
• So, /./ will match letters, numbers, symbols—any one character.
• It is often used with the Kleene star * to mean → “any number of any
characters”.
• Example:
o /.*./ matches any string containing at least one character (the trailing . requires one character).
o To find a word like "aardvark" appearing twice in a line, we use: → /aardvark.*aardvark/
Fig. The use of the period . to specify any character.
Regular Expressions
8. Anchors
• Anchors are special symbols in regular expressions used to match specific
positions in a line of text.
• The two most common anchors are:
➢ Caret ^ – matches the start of a line.
➢ Dollar sign $ – matches the end of a line
Pattern | Matches
^[A-Z] | Palo Alto
^[^A-Za-z] | 1 "Hello"
\.$ | The end.
.$ | The end? The end!
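A short sketch of the wildcard and anchors in Python (re.MULTILINE makes ^ and $ apply per line; the sample text is made up):

import re
text = "Palo Alto\n1 dollar\nThe end."
print(re.findall(r"^[A-Z].*$", text, re.MULTILINE))   # ['Palo Alto', 'The end.'] - lines starting with a capital
print(re.findall(r"\.$", text, re.MULTILINE))         # ['.'] - only the line that ends with a period
print(bool(re.search(r"aardvark.*aardvark", "an aardvark met another aardvark")))  # True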
Regular Expressions
• There are also two other anchors:
➢ \b matches a word boundary,
➢ \B matches a non-boundary (i.e., in the middle of a word).
• Example: /\bthe\b/ matches the word the but not the word other
• A word for the purposes of a regular expression is defined as any sequence of:
➢ Letters (a–z, A–Z)
➢ Digits (0–9)
➢ Underscore (_)
• Example: /\b99\b/
➢ will match the string 99 in “There are 99 bottles of beer on the wall” (because 99 follows a
space)
➢ but not 99 in “There are 299 bottles of beer on the wall” (since 99 follows a number).
➢ but it will match 99 in "$99" (since 99 follows a dollar sign, which is not a digit, underscore, or letter).
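These word-boundary cases can be checked in Python (the example sentences are made up):

import re
print(re.findall(r"\b99\b", "There are 99 bottles"))   # ['99']
print(re.findall(r"\b99\b", "There are 299 bottles"))  # []
print(re.findall(r"\b99\b", "It costs $99 today"))     # ['99'] - $ is not a word character
print(re.findall(r"\bthe\b", "the other theology"))    # ['the'] - not matched inside "other"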
Regular Expressions
9. OR (Disjunction)
• Suppose we want to search for texts about pets, especially cats or dogs.
• We want to match either the word "cat" or "dog".
• We cannot use square brackets like /[catdog]/ because:
• [catdog] means "any one character" from the list c, a, t, d, o, g.
• It does not mean the full word "cat" or "dog".
• To search for entire words like "cat" or "dog", we use the pipe symbol |, called
the disjunction operator.
• The pattern /cat|dog/ matches either "cat" or "dog".
Regular Expressions
10. Grouping with ()
• Sometimes we want to use OR (|) inside a bigger word or pattern.
➢ Example: We want to match both "guppy" and "guppies".
• We can’t write /guppy|ies/ because:
• It matches either "guppy" or just "ies", not "guppies".
• This happens because the | (OR) operator has low precedence — meaning:
➢ It treats "guppy" and "ies" as completely separate patterns.
• To fix this, we use parentheses () to group the variable part.
• So, we write: /gupp(y|ies)/
➢ This means: "guppy" OR "guppies", using one base pattern gupp with two suffix options.
• Parentheses make grouped patterns act like a single unit, so | applies only within
them.
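A quick sketch in Python (here (?: ... ) is Python's non-capturing form of the grouping parentheses, used so findall returns the whole match):

import re
print(re.findall(r"cat|dog", "my cat chased the dog"))         # ['cat', 'dog']
print(re.findall(r"gupp(?:y|ies)", "one guppy, two guppies"))  # ['guppy', 'guppies']
print(re.findall(r"guppy|ies", "one guppy, two guppies"))      # ['guppy', 'ies'] - why plain | fails here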
Regular Expressions
• In regular expressions, some operators are applied before others — this is
known as operator precedence.
• This ordering of how operators are prioritized is called the regular expression
operator precedence hierarchy.
• The following list gives the order of RE operator precedence, from highest precedence to lowest precedence:
1. Parenthesis ( )
2. Counters * + ? { }
3. Sequences and anchors (e.g., the ˆmy end$)
4. Disjunction |
Regular Expression Example: Find me all instances
of the word “the” in a text.
• A simple (but incorrect) pattern might be: /the/
• One problem is that this pattern will miss the word when it begins a
sentence and hence is capitalized (i.e., The).
• This might lead us to the following pattern: /[tT]he/
• But we will still incorrectly return texts with the embedded in other words
(e.g., other or theology).
• So we need to specify that we want instances with a word boundary on
both sides: /\b[tT]he\b/
• Suppose we wanted to do this without the use of /\b/.
• We might want this since /\b/ won't treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underscores or numbers nearby (the_ or the25).
Regular Expression Example: Find me all
instances of the word “the” in a text.
• We need to specify that we want instances in which there are no alphabetic
letters on either side of the: /[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
• But there is still one more problem with this pattern: it won’t find the word the
when it begins a line.
• We can avoid this by specifying that before the we require either the beginning-
of-line or a non-alphabetic character, and the same at the end of the line:
/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/
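The successive patterns can be tried in Python (the ˆ glyph on the slides is the ordinary ^ in code; the test sentence is made up):

import re
line = "The cat sat on the other mat"
print(re.findall(r"the", line))          # ['the', 'the'] - misses "The", matches inside "other"
print(re.findall(r"[tT]he", line))       # ['The', 'the', 'the'] - still matches inside "other"
print(re.findall(r"\b[tT]he\b", line))   # ['The', 'the'] - only the standalone word
print(re.findall(r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)", line))   # each match is a (left, the, right) tuple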
Text Normalization
• Text normalization reduces the word to its base form by removing the inflectional
part.
• There are two main approaches toward text normalization in NLP:
Stemming and Lemmatization
• Text normalization is a crucial step in Natural Language Processing (NLP).
• It is widely used in various NLP applications like:
➢ Speech recognition
➢ Text-to-speech systems
➢ Spam email detection
➢ Sentiment analysis, etc.
Text Normalization
• Example: Words like "collection", "collective", "collect", and "collectively" are
all variations of the base word "collect".
• Human language is inherently random, and we often use different word forms
for the same concept.
• Computers struggle to handle this randomness effectively without
normalization.
• Text normalization removes this randomness by converting words to a
standard format.
• This process helps in:
• Reducing variations of the same word
• Lowering the number of input features in the model
• Improving model efficiency
Text Normalization
Stemming
• Stemming is a basic technique for text normalization in NLP.
• It works by removing prefixes and suffixes (inflectional parts) from words.
• The goal is to obtain the stem/root of the word.
Example: "connection", "connected", "connecting" all reduce to the common stem "connect".
• It does not consider the semantic meaning of the word.
• Drawback: The resulting stem may not be a meaningful word.
Example: "laziness" → "lazi" (not "lazy")
• Can lead to loss of meaning or incorrect interpretations.
• In Python's NLTK library, PorterStemmer is used for stemming.
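For example, a minimal PorterStemmer sketch reproducing the examples above:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for w in ["connection", "connected", "connecting", "running", "happiness", "laziness"]:
    print(w, "->", stemmer.stem(w))
# connect, connect, connect, run, happi, lazi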
Text Normalization
• Stemming must be applied carefully to avoid Over-stemming and Under-stemming.
a) Over-stemming (too much chopping)
• Over-stemming occurs when the stemmer removes too many characters from a word.
• This leads to different words being incorrectly treated as the same.
• Example: "university" and "universe" both reduced to "univers".
• This falsely implies they are semantically the same, which they are not.
b) Under-stemming (not reducing enough)
• Occurs when the stemmer does not reduce words enough.
• Results in related words being treated as different.
• Example: "data" → "dat" and "datum" → "datu“ (instead of the same stem “dat”)
• The stemmer fails to recognize that both words come from the same root
Text Normalization
Porter’s algorithm: The most common English stemmer
• Porter Stemmer is a widely used stemming algorithm for English developed by
Martin Porter.
• Involves a step-by-step rule-based approach to strip suffixes.
Text Normalization
Handling Past Tense & Gerunds
• Rules such as (*v*)ing → ε and (*v*)ed → ε strip -ing and -ed endings.
• They apply only if the word has a vowel before the suffix (*v* means a vowel exists), so "walking" → "walk" but "sing" is left unchanged.
Text Normalization
Lemmatization
• Improves upon stemming by addressing its drawbacks.
• Uses vocabulary, grammar rules, and part-of-speech (POS) tags to reduce words
to their base form (lemma).
• Removes inflectional endings more accurately than simple suffix chopping.
• Relies on a predefined dictionary (lexicon) to identify the correct lemma based
on context.
• Ensures that the meaning of the word is preserved during normalization.
• In Python's NLTK library, WordNetLemmatizer is used for lemmatization.
• More suitable for tasks that require semantic accuracy, like: Text classification,
Question answering, Named Entity Recognition (NER)
Text Normalization
• Uses vocabulary + morphological analysis.
• Reduce inflections or variant forms to base form
• Examples:
➢ am, are, is → be
➢ car, cars, car's, cars' → car
➢ better → good
➢ running → run
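For example, a minimal WordNetLemmatizer sketch reproducing the examples above (note that the correct lemma depends on the part-of-speech tag passed in: "v" for verbs, "a" for adjectives):

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))              # car
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("is", pos="v"))       # be
print(lemmatizer.lemmatize("better", pos="a"))   # good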