Text Processing
Word Tokenization
Tokenization is the process of segmenting a string of characters into
tokens (words).
An example
I have a can opener; but I can’t open these cans.
Word Tokens: 11
Word Types: 10
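The token/type counts above can be reproduced with a simple regex tokenizer (a rough sketch; real tokenizers such as NLTK's handle punctuation and clitics like "can't" more carefully):

```python
import re

# A rough regex tokenizer: a "word" is a maximal run of letters
# and apostrophes. Punctuation such as ';' and '.' is dropped.
text = "I have a can opener; but I can't open these cans."
tokens = re.findall(r"[A-Za-z']+", text)
types = set(t.lower() for t in tokens)

print(len(tokens))  # 11 word tokens
print(len(types))   # 10 word types ("I" occurs twice)
```

Note that "can", "cans", and "can't" count as three distinct types here; whether they should is exactly the normalization question taken up later.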
1/24/2022 Basic Text Processing
Several tokenization libraries
NLTK Toolkit (Python)
Spacy (Python)
Polyglot (Python)
Stanford CoreNLP (Java)
Unix Commands
Issues in Tokenization
Common examples
Finland’s → Finland Finlands Finland’s ?
What’re, I’m, shouldn’t → What are, I am, should not ?
San Francisco → one token or two?
m.p.h. → ??
Hyphenation
End-of-Line Hyphen: Used for splitting whole words into parts for text
justification, e.g. “... apparently, mid-dle English followed this practice...”
Lexical Hyphen: Certain prefixes are often written hyphenated, e.g. co-,
pre-, meta-, multi-, etc.
Sententially Determined Hyphenation: Mainly to prevent incorrect
parsing of the phrase. e.g. State-of-the-art, three-to-five-year, etc.
Language Specific Issues
French
l’ensemble: want to match with un ensemble
German
Noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
Sanskrit
Very long compound words
Language Specific Issues
Chinese
No space between words
Japanese
Further complications with multiple alphabets intermingled.
Word Tokenization in Chinese
Maximum Matching (Greedy Algorithm)
Start a pointer at the beginning of the string
Find the longest word in the dictionary that matches the string starting
at the pointer
Move the pointer past the matched word
Will the above scheme work for English?
No: Thetabledownthere
Yes: #ThankYouSachin, #musicmonday etc.
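Maximum matching can be sketched in a few lines of Python (the dictionary here is a toy, chosen to show exactly how the greedy strategy goes wrong on English strings like "Thetabledownthere"):

```python
def max_match(text, dictionary):
    """Greedy maximum matching: at each pointer position, take the
    longest dictionary word that starts there; fall back to a
    single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# A toy dictionary; "theta" is deliberately included to show how
# greedy matching grabs it instead of "the" + "table".
vocab = {"the", "table", "down", "there", "theta", "thank", "you"}
print(max_match("thetabledownthere", vocab))
# → ['theta', 'b', 'l', 'e', 'down', 'there']  (wanted: the table down there)
```

For hashtag-style strings with no such ambiguity (e.g. "thankyou"), the same greedy scheme recovers the intended words, which is why it fares better on #ThankYouSachin than on ordinary English.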
Sentence Segmentation
Can we decide where the sentences begin and end?
Why is it difficult?
Are ‘!’ and ‘?’ ambiguous? No
Is the period “.” ambiguous? Yes
Abbreviations (Dr., Mr., m.p.h.)
Numbers (2.4%, 4.3)
Can we build a binary classifier for “period” classification?
For each “.”, decide EndOfSentence / NotEndOfSentence
Classifiers can be: hand-written rules, regular expressions, or
machine learning
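The hand-written-rules option can be sketched as a toy period classifier (the abbreviation list and the rules themselves are illustrative, not a complete system):

```python
import re

# Illustrative abbreviation list; a real system would use a much
# larger, corpus-derived one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "m.p.h.", "u.s.a."}

def is_end_of_sentence(word_with_period, next_word):
    """Classify a '.' as EndOfSentence (True) / NotEndOfSentence (False)."""
    if word_with_period.lower() in ABBREVIATIONS:
        return False                               # known abbreviation
    if re.fullmatch(r"\d+\.\d+%?", word_with_period):
        return False                               # a number such as 2.4%
    # A capitalized following word suggests a sentence boundary.
    return bool(next_word) and next_word[0].isupper()

print(is_end_of_sentence("m.p.h.", "The"))  # False
print(is_end_of_sentence("cans.", "The"))   # True
```

The same decision function could equally be realized as a regular-expression cascade or learned from data, which is the route the decision-tree slides below take.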
Sentence Segmentation: Decision Tree
Example: a decision tree deciding, for each word, “Is this word the end-of-sentence (E-O-S)?”
Other Important Features
Case of word with “.”: Upper, Lower, Number
Case of word after “.”: Upper, Lower, Number
Numeric Features
Length of word with “.”
Probability (word with “.” occurs at end-of-sentence)
Probability (word after “.” occurs at beginning-of-sentence)
Implementing Decision Trees
Just an if-then-else statement
Choosing the features is more important
For numeric features, thresholds have to be picked
As the number of features grows, especially numeric ones, it becomes
difficult to set up the tree structure by hand
The decision tree structure can instead be learned using machine learning
over a training corpus
Basic Idea
Usually works top-down, by choosing a variable at each step that best
splits the set of items.
Popular algorithms: ID3, C4.5, CART
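The "just an if-then-else statement" view can be made concrete with a toy hand-built tree (the feature names and the threshold on word length are illustrative; a learner such as ID3, C4.5, or CART would choose them from a training corpus):

```python
def eos_tree(word_is_abbrev, next_word_capitalized, word_length):
    """A hand-built decision tree for E-O-S detection, written as
    nested if/else over three features."""
    if word_is_abbrev:
        return "NotEOS"
    if not next_word_capitalized:
        return "NotEOS"
    if word_length <= 2:        # a picked numeric threshold
        return "NotEOS"
    return "EOS"

print(eos_tree(False, True, 5))  # EOS
print(eos_tree(True, True, 5))   # NotEOS
```

Each root-to-leaf path is one conjunction of feature tests; learning the tree amounts to choosing which test (and which threshold) to place at each node.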
Other Classifiers
Support Vector Machines
Logistic regression
Neural Networks
Normalization
Why “normalize”?
Indexed text and query terms must have the same form.
U.S.A. and USA should be matched
We implicitly define equivalence classes of terms
Case Folding
Reduce all letters to lower case
Some caveats (task dependent):
Upper case mid-sentence may indicate a named entity (e.g. General
Motors)
For MT and information extraction, case can be informative (US vs.
us)
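A minimal sketch of case folding, showing the named-entity caveat:

```python
tokens = ["General", "Motors", "announced", "new", "US", "tariffs"]
folded = [t.lower() for t in tokens]
print(folded)
# Caveat: "US" folds to "us", and "General Motors" loses the
# capitalization cue that marks it as a named entity.
```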
Python tokenization example
http://text-processing.com/demo/tokenize/
Simple Tokenization in UNIX
Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < file_name
| sort
| uniq -c
| sort -rn
Change all non-alphabetic characters to newline
Sort in alphabetical order
Merge and count each type
Sort based on the count
For more info: execute ‘man tr’
Token normalization
We may want the same token for different forms of the word
• wolf, wolves → wolf
• talk, talks → talk
Stemming
• A process of removing and replacing suffixes to get to the root
form of the word, which is called the stem
• Usually refers to heuristics that chop off suffixes
Lemmatization
• Usually refers to doing things properly with the use of a
vocabulary and morphological analysis
• Returns the base or dictionary form of a word,
which is known as the lemma
Lemmatization example
WordNet lemmatizer
• Uses the WordNet Database to lookup lemmas
• nltk.stem.WordNetLemmatizer
• Examples:
− feet → foot, cats → cat
− wolves → wolf, talked → talked
• Problems: not all forms are reduced
• Takeaway: we need to try both stemming and lemmatization and
choose the best for our task
Lemmatization
Reduce inflections or variant forms to base form:
am, are, is → be
car, cars, car’s, cars’ → car
Have to find the correct dictionary headword form
Lemmatization in Python
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
'church'
>>> wordnet_lemmatizer.lemmatize('abaci')
'abacus'
Morphology
Morphology studies the internal structure of words, how words are built
up from smaller meaningful units called morphemes
Morphemes are divided into two categories
Stems: The core meaning bearing units
Affixes: Bits and pieces adhering to stems to change their meanings
and grammatical functions
Prefix: un-, anti-, etc (a-, ati-, pra- etc.)
Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)
Stemming
Reducing terms to their stems
Used in information retrieval
Crude chopping of affixes
Language dependent
Porter’s algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
...
If the first two rules of Step 1b are successful, the following is
done: AT → ATE (conflat(ed) → conflate)
BL → BLE (troubl(ed) → trouble)
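Step 1a can be sketched as an ordered list of regex rewrite rules, where the first matching pattern fires and the rest are skipped (a simplification covering only the Step 1a rules shown above, not the full algorithm):

```python
import re

# Ordered Step 1a rules: longest/most specific suffix first, so
# "caresses" hits sses->ss before the bare s->"" rule can apply.
STEP_1A = [
    (r"sses$", "ss"),   # caresses -> caress
    (r"ies$",  "i"),    # ponies   -> poni
    (r"ss$",   "ss"),   # caress   -> caress
    (r"s$",    ""),     # cats     -> cat
]

def step_1a(word):
    for pattern, replacement in STEP_1A:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("cats"))      # cat
```

Later steps (1b, 2, 3, ...) follow the same rewrite-rule pattern but add conditions such as (*v*), requiring a vowel in the remaining stem.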
Porter’s algorithm
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
...
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
...
Complete Algorithm is available at:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
Stemming in Python
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'