
CS769 Spring 2010 Advanced Natural Language Processing

Basic Text Process

Lecturer: Xiaojin Zhu jerryzhu@[Link]

1 Common Preprocessing Steps


Counting words alone gives interesting information. This is known as unigram word count (or word frequency,
when normalized). For example, the Amazon concordance for the book The Very Hungry Caterpillar by Eric
Carle shows the high-frequency content words “hungry, ate, still, caterpillar, slice, ...” Another interesting
website is [Link].
How do we count words? First we need to define what a word is. This is highly non-trivial for languages
written without spaces, such as Chinese. Even for seemingly simple English, getting text into a form where
you can count words is quite involved, as we see below.
Preprocessing text is called tokenization or text normalization. Things to consider include
• Throw away unwanted stuff (e.g., HTML tags – though sometimes they are valuable – UUencoding, etc.)
• Word boundaries: white space and punctuation – but words like Ph.D., isn’t, e-mail, C|net or $19.99
are problematic. If you like, we can spend a whole semester on this... This is fairly domain dependent,
and people typically use manually created regular expression rules (see the sketch after this list).
• Stemming (Lemmatization): This is optional. English words like ‘look’ can be inflected with a
morphological suffix to produce ‘looks, looking, looked’. They share the same stem ‘look’. Often (but
not always) it is beneficial to map all inflected forms into the stem. This is a complex process, since
there can be many exceptional cases (e.g., department vs. depart, be vs. were). The most commonly
used stemmer is the Porter Stemmer. There are many others. They are not perfect and they do make
mistakes. Many are not designed for special domains like biological terms. Some other languages (e.g.,
Turkish) are particularly hard.
• Stopword removal: the most frequent words often do not carry much meaning. Examples: “the, a, of,
for, in, ...” You can also create your own stopword list for your application domain. If we count the
words in Tom Sawyer, the most frequent word types are (from [MS p.21]):

Word Count
the 3332
and 2972
a 1775
to 1725
of 1440
was 1161
it 1027
in 906
that 877
he 877

For many NLP purposes (e.g., text categorization) they are a nuisance, and stopword removal is a
common preprocessing step. SMART is one such stopword list.


• Capitalization, case folding: often it is convenient to lower case every character. Counterexamples
include ‘US’ vs. ‘us’. Use with care.
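To make these steps concrete, here is a minimal Python sketch of such a pipeline: a hand-written regular-expression tokenizer, case folding, removal of a tiny illustrative stopword list, and optional Porter stemming via NLTK. The token pattern, the stopword list, and the example sentence are illustrative assumptions, not a standard recipe, and the stemming branch assumes NLTK is installed.

import re

# Rough tokenizer: keeps prices like $19.99, abbreviations like Ph.D.,
# contractions like isn't, and hyphenated words like e-mail.
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|(?:[A-Za-z]+\.){2,}|[A-Za-z]+(?:['-][A-Za-z]+)*")

# Tiny illustrative stopword list; real systems use lists such as SMART.
STOPWORDS = {"the", "a", "of", "for", "in", "and", "to", "was", "it", "that", "he"}

def preprocess(text, stem=False):
    tokens = TOKEN_RE.findall(text)                      # tokenization
    tokens = [t.lower() for t in tokens]                 # case folding ('US' vs 'us': use with care)
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    if stem:                                             # optional stemming
        from nltk.stem import PorterStemmer              # assumes NLTK is installed
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]       # e.g. 'looking' -> 'look'
    return tokens

print(preprocess("The caterpillar was still hungry, and it ate one slice of cake."))
# -> ['caterpillar', 'still', 'hungry', 'ate', 'one', 'slice', 'cake']

Real tokenizers are considerably more elaborate; the point here is only that each step is a simple, inspectable transformation.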

People devote a large amount of effort to create good text normalization systems.
Now that you have clean text, there are two concepts:

• Word token: occurrences of a word.

• Word type: unique word as a dictionary entry.

For example, “The dog chases the cat.” has 5 word tokens but 4 word types. There are two tokens of the
word type “the”.
A vocabulary lists the word types. A typical vocabulary has 10,000 or more words (types). Sometimes
also in the vocabulary are <s> for the special start-of-sentence symbol, </s> for end-of-sentence, and <unk>
for all out-of-vocabulary words (they all map to this symbol).
A corpus is a large collection of text, e.g., several years’ newspapers. A vocabulary can be created from
a corpus. Often people apply a frequency cutoff to exclude word types with small counts (see below). The
cutoff is usually determined empirically (anywhere from one to tens or more).
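As a small sketch of this bookkeeping, assuming whitespace-tokenized input and an arbitrary cutoff value:

from collections import Counter

# Count word tokens vs. word types, then build a vocabulary with a frequency
# cutoff; word types below the cutoff (and anything unseen) map to <unk>.

def build_vocab(tokens, cutoff=1):
    counts = Counter(tokens)
    print("word tokens:", len(tokens))    # occurrences
    print("word types: ", len(counts))    # distinct dictionary entries
    vocab = {"<s>", "</s>", "<unk>"}      # special symbols
    vocab.update(w for w, c in counts.items() if c >= cutoff)
    return vocab

def lookup(word, vocab):
    return word if word in vocab else "<unk>"

tokens = "the dog chases the cat".split()             # 5 tokens, 4 types
vocab = build_vocab(tokens, cutoff=1)
print(lookup("dog", vocab), lookup("zebra", vocab))   # -> dog <unk>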

1.1 Zipf’s Law


If we rank word types by their count in Tom Sawyer, and compute count × rank, we see an interesting
pattern:
Word         Count f    Rank r    f · r
the          3332       1         3332
and          2972       2         5944
a            1775       3         5325
he           877        10        8770
but          410        20        8200
be           294        30        8820
there        222        40        8880
one          172        50        8600
two          104        100       10400
turned       51         200       10200
comes        16         500       8000
family       8          1000      8000
brushed      4          2000      8000
Could        2          4000      8000
Applausive   1          8000      8000

We see that f · r ≈ constant, or f ∝ 1/r. If we plot log(r) on the x-axis and log(f) on the y-axis, the words
roughly form a line from upper left to lower right. Note that f can be the frequency (count divided by corpus
size) and the relation still holds. This relation is known as Zipf’s law. It holds for a variety of corpora.
Mandelbrot generalizes Zipf’s law with more parameters P, ρ, B so that it is more flexible: f = P (r + ρ)^{−B}.
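One way to see this empirically is to rank word types by count and inspect count × rank. The sketch below does this for any plain-text corpus; corpus.txt is a placeholder file name, and the crude letters-only tokenizer is an assumption for brevity.

import re
from collections import Counter

# Rank word types by count and print count * rank at a few ranks.
with open("corpus.txt", encoding="utf-8") as fin:     # placeholder file name
    tokens = re.findall(r"[a-z]+", fin.read().lower())

ranked = Counter(tokens).most_common()                # [(word, count)], sorted by count

for r in (1, 2, 3, 10, 50, 100, 1000, 8000):
    if r <= len(ranked):
        word, count = ranked[r - 1]
        print(f"rank {r:>5}  {word:<12}  count {count:>6}  f*r = {count * r}")

# On a log-log plot of count against rank, the points fall roughly on a line of
# slope -1 (Zipf); Mandelbrot's f = P (r + rho)^(-B) usually gives a closer fit.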

1.2 Miller’s Monkeys


If we promise a monkey some bananas and ask it to type tirelessly on a computer keyboard, what do we
get?¹ For simplicity, let us assume the keyboard has 27 keys: a to z, and white space. We also assume the
monkey hits each key with equal probability. Let us call a sequence of letters separated by white space a
‘word’. What frequency and rank relation do such monkey words possess?

¹ No, not a software engineer.

The probability that a specific monkey word type has length i is

P(i) = (1/27)^i · (1/27) = (1/27)^{i+1}.    (1)

As we can see, the longer the word, the lower its probability, and therefore the lower its expected count in
the monkey corpus. Let us rank all monkey words by their probability. The number of monkey word types
of length i is 26^i. The rank r_i of a word of length i thus satisfies

Σ_{j=1}^{i−1} 26^j < r_i ≤ Σ_{j=1}^{i} 26^j.    (2)

Let us consider the word with rank r = Σ_{j=1}^{i} 26^j. The word actually has length i, but from

r = Σ_{j=1}^{i} 26^j = (26/25)(26^i − 1),    (3)

we can derive a ‘fractional length’ i′:

i′ = log((25/26) r + 1) / log 26.    (4)
The frequency of this word is

p(i′) = (1/27)^{i′+1}    (5)
      = (1/27)^{log((25/26) r + 1)/log 26 + 1}    (6)
      = (1/27) · ((25/26) r + 1)^{−log 27/log 26}    (using the fact a^{log b} = b^{log a})    (7)
      ≈ 0.04 (r + 1.04)^{−1.01},    (8)

which fits Mandelbrot’s law, and is fairly close to Zipf’s law.
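A small simulation makes this concrete. The sketch below draws two million keystrokes uniformly from the 27-key alphabet, splits them on white space, and ranks the resulting monkey words by count; the corpus size and the printed ranks are arbitrary choices.

import random
from collections import Counter

# Simulate Miller's monkey: every keystroke is one of 26 letters or a space,
# all equally likely.  Words are the maximal letter runs between spaces.
random.seed(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "              # 27 equally likely keys
keystrokes = "".join(random.choices(alphabet, k=2_000_000))

words = [w for w in keystrokes.split(" ") if w]       # drop empty strings
ranked = Counter(words).most_common()

for r in (1, 26, 27, 100, 702, 2000):
    if r <= len(ranked):
        word, count = ranked[r - 1]
        print(f"rank {r:>5}  {word!r:<8}  count {count:>5}  f*r = {count * r}")

# All words of a given length have the same probability, so their counts
# cluster; on a log-log plot of count against rank this gives a staircase that
# tracks a line of slope about -1, consistent with the fit derived above.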


In light of the above analysis, Zipf’s law may not reflect any deep property of language. Nonetheless,
it still points to an important empirical observation: almost all words are rare. This is also known as
the heavy-tail property.
