Text Processing
Word Tokenization
Tokenization is the process of segmenting a string of characters into
tokens (words).
An example
I have a can opener; but I can’t open these cans.
Word Tokens: 11
Word Types: 10
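The token/type counts above can be reproduced with a simple regex tokenizer (a rough sketch; real tokenizers such as NLTK's handle punctuation and clitics like "can't" more carefully):

```python
import re

# A rough regex tokenizer: a "word" is a maximal run of letters
# and apostrophes. Punctuation such as ';' and '.' is dropped.
text = "I have a can opener; but I can't open these cans."
tokens = re.findall(r"[A-Za-z']+", text)
types = set(t.lower() for t in tokens)

print(len(tokens))  # 11 word tokens
print(len(types))   # 10 word types ("I" occurs twice)
```

Note that "can", "cans", and "can't" count as three distinct types here; whether they should is exactly the normalization question taken up later.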
1/24/2022 Basic Text Processing
Several tokenization libraries
NLTK Toolkit (Python)
Spacy (Python)
Polyglot (Python)
Stanford CoreNLP (Java)
Unix Commands
Issues in Tokenization
Common examples
Finland’s → Finland Finlands Finland’s ?
What’re, I’m, shouldn’t → What are, I am, should not ?
San Francisco → one token or two?
m.p.h. → ??
Hyphenation
End-of-Line Hyphen: Used for splitting whole words into parts for text
justification, e.g. “... apparently, mid-dle English followed this practice...”
Lexical Hyphen: Certain prefixes are often written hyphenated, e.g. co-,
pre-, meta-, multi-, etc.
Sententially Determined Hyphenation: Mainly to prevent incorrect
parsing of the phrase. e.g. State-of-the-art, three-to-five-year, etc.
Language Specific Issues
French
l’ensemble: want to match with un ensemble
German
Noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
Sanskrit
Very long compound words
Language Specific Issues
Chinese
No space between words
Japanese
Further complications with multiple alphabets intermingled.
Word Tokenization in Chinese
Maximum Matching (Greedy Algorithm)
Start a pointer at the beginning of the string
Find the longest word in the dictionary that matches the string starting
at the pointer
Move the pointer past the matched word
Will the above scheme work for English?
No: Thetabledownthere
Yes: #ThankYouSachin, #musicmonday etc.
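Maximum matching can be sketched in a few lines of Python (the dictionary here is a toy, chosen to show exactly how the greedy strategy goes wrong on English strings like "Thetabledownthere"):

```python
def max_match(text, dictionary):
    """Greedy maximum matching: at each pointer position, take the
    longest dictionary word that starts there; fall back to a
    single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# A toy dictionary; "theta" is deliberately included to show how
# greedy matching grabs it instead of "the" + "table".
vocab = {"the", "table", "down", "there", "theta", "thank", "you"}
print(max_match("thetabledownthere", vocab))
# → ['theta', 'b', 'l', 'e', 'down', 'there']  (wanted: the table down there)
```

For hashtag-style strings with no such ambiguity (e.g. "thankyou"), the same greedy scheme recovers the intended words, which is why it fares better on #ThankYouSachin than on ordinary English.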
Sentence Segmentation
Can we decide where the sentences begin and end?
Why is it difficult?
Are ‘!’ and ‘?’ ambiguous? No
Is the period “.” ambiguous? Yes
Abbreviations (Dr., Mr., m.p.h.)
Numbers (2.4%, 4.3)
Can we build a binary classifier for “period” classification?
For each “.”, decide EndOfSentence / NotEndOfSentence
Classifiers can be: hand-written rules, regular expressions, or
machine learning
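The hand-written-rules option can be sketched as a toy period classifier (the abbreviation list and the rules themselves are illustrative, not a complete system):

```python
import re

# Illustrative abbreviation list; a real system would use a much
# larger, corpus-derived one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "m.p.h.", "u.s.a."}

def is_end_of_sentence(word_with_period, next_word):
    """Classify a '.' as EndOfSentence (True) / NotEndOfSentence (False)."""
    if word_with_period.lower() in ABBREVIATIONS:
        return False                               # known abbreviation
    if re.fullmatch(r"\d+\.\d+%?", word_with_period):
        return False                               # a number such as 2.4%
    # A capitalized following word suggests a sentence boundary.
    return bool(next_word) and next_word[0].isupper()

print(is_end_of_sentence("m.p.h.", "The"))  # False
print(is_end_of_sentence("cans.", "The"))   # True
```

The same decision function could equally be realized as a regular-expression cascade or learned from data, which is the route the decision-tree slides below take.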
Sentence Segmentation: Decision Tree
Example: a decision tree deciding, for each word, “Is this word the end-of-sentence (E-O-S)?”
Other Important Features
Case of word with “.”: Upper, Lower, Number
Case of word after “.”: Upper, Lower, Number
Numeric Features
Length of word with “.”
Probability (word with “.” occurs at end-of-sentence)
Probability (word after “.” occurs at beginning-of-sentence)
Implementing Decision Trees
Just an if-then-else statement
Choosing the features is more important
For numeric features, thresholds have to be picked
As the number of features grows, especially numeric ones, it becomes
difficult to set up the tree structure by hand
The decision tree structure can instead be learned using machine learning
over a training corpus
Basic Idea
Usually works top-down, by choosing a variable at each step that best
splits the set of items.
Popular algorithms: ID3, C4.5, CART
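The "just an if-then-else statement" view can be made concrete with a toy hand-built tree (the feature names and the threshold on word length are illustrative; a learner such as ID3, C4.5, or CART would choose them from a training corpus):

```python
def eos_tree(word_is_abbrev, next_word_capitalized, word_length):
    """A hand-built decision tree for E-O-S detection, written as
    nested if/else over three features."""
    if word_is_abbrev:
        return "NotEOS"
    if not next_word_capitalized:
        return "NotEOS"
    if word_length <= 2:        # a picked numeric threshold
        return "NotEOS"
    return "EOS"

print(eos_tree(False, True, 5))  # EOS
print(eos_tree(True, True, 5))   # NotEOS
```

Each root-to-leaf path is one conjunction of feature tests; learning the tree amounts to choosing which test (and which threshold) to place at each node.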
Other Classifiers
Support Vector Machines
Logistic regression
Neural Networks
Normalization
Why “normalize”?
Indexed text and query terms must have the same form.
U.S.A. and USA should be matched
We implicitly define equivalence classes of terms
Case Folding
Reduce all letters to lower case
Some caveats (task dependent):
Upper case mid-sentence may indicate a named entity (e.g. General
Motors)
For MT and information extraction, case can be informative (US vs.
us)
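A minimal sketch of case folding, showing the named-entity caveat:

```python
tokens = ["General", "Motors", "announced", "new", "US", "tariffs"]
folded = [t.lower() for t in tokens]
print(folded)
# Caveat: "US" folds to "us", and "General Motors" loses the
# capitalization cue that marks it as a named entity.
```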
Python tokenization example
http://text-processing.com/demo/tokenize/
Simple Tokenization in UNIX
Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < file_name
| sort
| uniq -c
| sort -rn
Change all non-alphabetic characters to newline
Sort in alphabetical order
Merge and count each type
Sort based on the count
For more info: execute ‘man tr’
Token normalization
We may want the same token for different forms of the word
• wolf, wolves → wolf
• talk, talks → talk
Stemming
• A process of removing and replacing suffixes to get to the root
form of the word, which is called the stem
• Usually refers to heuristics that chop off suffixes
Lemmatization
• Usually refers to doing things properly with the use of a
vocabulary and morphological analysis
• Returns the base or dictionary form of a word,
which is known as the lemma
Lemmatization example
WordNet lemmatizer
• Uses the WordNet Database to lookup lemmas
• nltk.stem.WordNetLemmatizer
• Examples:
− feet → foot, cats → cat
− wolves → wolf, talked → talked
• Problems: not all forms are reduced
• Takeaway: we need to try both stemming and lemmatization and
choose the best for our task
Lemmatization
Reduce inflections or variant forms to base form:
am, are, is → be
car, cars, car’s, cars’ → car
Have to find the correct dictionary headword form
Lemmatization in Python
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
'church'
>>> wordnet_lemmatizer.lemmatize('abaci')
'abacus'
Morphology
Morphology studies the internal structure of words, how words are built
up from smaller meaningful units called morphemes
Morphemes are divided into two categories
Stems: The core meaning bearing units
Affixes: Bits and pieces adhering to stems to change their meanings
and grammatical functions
Prefix: un-, anti-, etc (a-, ati-, pra- etc.)
Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)
Stemming
Reducing terms to their stems
Used in information retrieval
Crude chopping of affixes
Language dependent
Porter’s algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
...
If the first two rules of Step 1b are successful, the following is
done: AT → ATE (conflat(ed) → conflate)
BL → BLE (troubl(ed) → trouble)
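Step 1a can be sketched as an ordered list of regex rewrite rules, where the first matching pattern fires and the rest are skipped (a simplification covering only the Step 1a rules shown above, not the full algorithm):

```python
import re

# Ordered Step 1a rules: longest/most specific suffix first, so
# "caresses" hits sses->ss before the bare s->"" rule can apply.
STEP_1A = [
    (r"sses$", "ss"),   # caresses -> caress
    (r"ies$",  "i"),    # ponies   -> poni
    (r"ss$",   "ss"),   # caress   -> caress
    (r"s$",    ""),     # cats     -> cat
]

def step_1a(word):
    for pattern, replacement in STEP_1A:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("cats"))      # cat
```

Later steps (1b, 2, 3, ...) follow the same rewrite-rule pattern but add conditions such as (*v*), requiring a vowel in the remaining stem.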
Porter’s algorithm
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
...
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
...
Complete Algorithm is available at:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
Stemming in Python
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'