Text Processing
Expression
Md Shad Akhtar
[Link]@[Link]
Shadakhtar:nlp:iiit[Link]text:processing:regex
Word
● Words are the building blocks of language.
● Each language has a fixed set of valid words, a.k.a. its vocabulary.
Word
● How do we identify a valid word, e.g., bank?
○ Dictionary lookup
● What about the following?
○ Bank, BANK: can be handled through case-folding.
○ Banks: in general, no dictionary lists plural forms explicitly.
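A minimal sketch of dictionary lookup with case-folding. The word list `DICTIONARY` and the helper `is_valid_word` are hypothetical and only illustrate the idea; a real system would load a full lexicon.

```python
# Hypothetical toy dictionary; a real system would load a complete word list.
DICTIONARY = {"bank", "river", "money"}


def is_valid_word(word: str) -> bool:
    """Return True if the case-folded word is listed in the dictionary."""
    return word.casefold() in DICTIONARY


for candidate in ["bank", "Bank", "BANK", "Banks"]:
    print(candidate, is_valid_word(candidate))
```

Case-folding recovers Bank and BANK, but Banks is still rejected because the plural form is not listed.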
Pattern Matching
● Searching/Matching a pattern (string) is very frequent in text processing.
○ Web-search or IR based system
○ Word-processing applications.
● A simple yet powerful solution is to use regular expressions for pattern matching.
An example
● Document:
The recent attempt by the police to retain their current rates of pay has not gathered
much favor with the southern factions.
● Write a regular expression that finds all occurrences of the word ‘the’ (case-insensitive).
● RegEx: /[Tt]he/
● Output: The recent attempt by the police to retain their current rates of pay has not
gathered much favor with the southern factions.
● True positives: The, the, the. False positives: the substring ‘the’ inside their, gathered, and southern.
An example
● Document:
The recent attempt by the police to retain their current rates of pay has not gathered
much favor with the southern factions.
● Write a regular expression that finds all occurrences of the word ‘the’ (case-insensitive).
● RegEx: /\b[Tt]he\b/
● Output: The recent attempt by the police to retain their current rates of pay has not
gathered much favor with the southern factions.
● Success! The word boundaries (\b) restrict the match to the standalone word.
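The two patterns can be compared with Python's `re` module. This is a small sketch; the variable names are illustrative only.

```python
import re

DOCUMENT = (
    "The recent attempt by the police to retain their current rates of pay "
    "has not gathered much favor with the southern factions."
)

# Without word boundaries the pattern also fires inside longer words.
naive = re.findall(r"[Tt]he", DOCUMENT)

# \b anchors the match at word boundaries on both sides.
bounded = re.findall(r"\b[Tt]he\b", DOCUMENT)

print(len(naive), naive)
print(len(bounded), bounded)
```

The first pattern returns six matches (three of them inside their, gathered, and southern), while the second returns only the three standalone words.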
Regular Expression is more powerful than it seems!
ELIZA [Weizenbaum, 1964-66, MIT AI lab]
● The first chatbot.
● Pattern matching system
ELIZA
● Substitution with regex
○ s / RE / SubText /
● Three steps:
○ Change the first-person mentions to uppercase second-person mentions
■ I’m → YOU ARE
○ Create a set of all possible substitutions.
○ Rank the possible outputs to respond to the user.
s/.* YOU ARE (depressed|sad) .*/ I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/ WHY DO YOU THINK YOU ARE \1/
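The substitution rules above can be sketched with `re.sub`. This is only an illustration, not Weizenbaum's implementation: the helper names are hypothetical, the input is padded with spaces so that the patterns (which expect a space on both sides) can match, and the ranking step is left out.

```python
import re

# The two substitution rules from the slide: (pattern, replacement).
RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
]


def to_second_person(utterance: str) -> str:
    """Step 1: rewrite first-person mentions as uppercase second person."""
    return re.sub(r"\bI['’]m\b|\bI am\b", "YOU ARE", utterance, flags=re.IGNORECASE)


def candidate_responses(utterance: str) -> list:
    """Step 2: collect every substitution whose pattern matches."""
    text = f" {to_second_person(utterance)} "  # padding is an assumption
    return [
        re.sub(pattern, reply, text)
        for pattern, reply in RULES
        if re.search(pattern, text)
    ]


print(candidate_responses("I'm depressed today"))
```

A ranking function would then choose one of the returned candidates as the reply.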
Validating words
● Bank
● bank
● Banks
● banks
Regular Expression
Regular Expression: Kleene closure
RE: /a*/
Matches: 0 or more occurrences of a
Examples (any text containing these symbols): 𝛆 (the empty string), a, aa, aaa, aaaa, ...
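A quick check of the Kleene closure with Python's `re` module. `fullmatch` tests whether the whole string belongs to the language of `a*`, while `search` looks for a match anywhere in a text.

```python
import re

kleene = re.compile(r"a*")

# fullmatch: is the entire string made of zero or more a's?
for text in ["", "a", "aa", "aaa", "b", "aab"]:
    print(repr(text), bool(kleene.fullmatch(text)))

# search: since zero occurrences are allowed, an empty match is found in any text.
empty_match = kleene.search("b")
print(repr(empty_match.group()))
```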
Regular Expression and Finite Automata
● Regular expressions can be implemented as finite-state automata.
● The set of strings accepted by a regular expression, or by the corresponding finite automaton, is
called a regular language.
Finite Automata
● Given an input x, the automaton M outputs whether or not x is accepted.
○ M(x) → {Accepted, Not accepted}
● Finite automata
○ Deterministic Finite Automata (DFA)
○ Non-deterministic Finite Automata (NFA)
Deterministic Finite Automata (DFA)
● A finite automaton can be represented as a quintuple M = <Q, Σ, S, F, 𝛿>, where
○ Q = Finite set of states
○ Σ = Finite set of input symbols
○ S = Start state, S ∈ Q
○ F = Finite set of final/accepting states, F ⊆ Q
○ 𝛿 = Transition function, 𝛿: qi →a qj, where qi, qj ∈ Q and a ∈ Σ
● Example: an automaton that accepts bank, Bank, banks, and Banks
○ Σ = {a-z, A-Z}
○ Q = {q1, q2, q3, q4, q5, q6}
○ S = q1
○ F = {q5, q6}
○ 𝛿: q1 →b q2, q1 →B q2, q2 →a q3, q3 →n q4, q4 →k q5, q5 →s q6
Transition diagram and table
Q/Σ  a   b   ..  k   n   s   ..  A   B   ...
q1       q2                          q2
q2   q3
q3                   q4
q4               q5
q5                       q6
q6
Acceptance procedure:
1. While all the input symbols are not consumed:
a. At the current state, if the current input symbol matches any of its outlink labels,
i. Consume the input, move along the link to another state, and goto step 1.
b. Else
i. Announce failure
2. If the current state is one of the final states
a. Announce success
3. Else
a. Announce failure
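The acceptance procedure can be sketched as a table-driven loop. The transition table below encodes the bank/Bank/banks/Banks automaton from the slides; the names `TRANSITIONS`, `START`, `FINAL`, and `accepts` are illustrative choices.

```python
# Transition function as a dictionary: (state, symbol) -> next state.
TRANSITIONS = {
    ("q1", "b"): "q2",
    ("q1", "B"): "q2",
    ("q2", "a"): "q3",
    ("q3", "n"): "q4",
    ("q4", "k"): "q5",
    ("q5", "s"): "q6",
}
START = "q1"
FINAL = {"q5", "q6"}


def accepts(x: str) -> bool:
    """Run the automaton on x and report whether it ends in a final state."""
    state = START
    for symbol in x:
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            return False  # no outgoing link for this symbol: announce failure
        state = next_state
    return state in FINAL  # all input consumed: success only in a final state


for word in ["bank", "Bank", "banks", "Banks", "BANK", "ban"]:
    print(word, accepts(word))
```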
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization
India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to
23,96,637. However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country
from the highly-contagious disease, government data this morning showed. With the death of 942 patients in the last 24
hours, the county's fatality count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the
pandemic after the United States and Brazil.
Sentence Segmentation
● Usual sentence end-marker: ‘.’, ‘?’, ‘!’
● Exclamation and Question marks are usually unambiguous.
○ Interjections: Oh!
Sentence Segmentation
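● Disambiguation
○ Rule-based
○ RegEx based
○ ML based
● Decision tree for a candidate end-of-sentence (EOS) punctuation mark
○ Blank line(s) after it? Yes: EOS. No: check the next question.
○ Is the punctuation ? or ! ? Yes: EOS. No: check the next question.
○ Is the punctuation a period? No: Not EOS. Yes: check the next question.
○ Abbreviation? Yes: Not EOS. No: EOS.
● Some other useful features
○ Case of word with “.”: Upper, Lower, Cap, Number
○ Case of word after “.”: Upper, Lower, Cap, Number
○ Numeric features
■ Length of word with “.”
■ Probability(word with “.” occurs at end of sentence)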
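A rule-based sketch of the end-of-sentence decision. The abbreviation list is a tiny hypothetical sample, the function names are illustrative, and the blank-line test is passed in as a flag because whitespace splitting discards line breaks.

```python
# Hypothetical, deliberately tiny abbreviation list (lowercased).
ABBREVIATIONS = {"mr.", "dr.", "ph.d.", "etc."}


def is_eos(token: str, followed_by_blank_line: bool = False) -> bool:
    """Decide whether a whitespace-delimited token ends a sentence."""
    if followed_by_blank_line:
        return True
    if token.endswith(("?", "!")):
        return True
    if not token.endswith("."):
        return False
    return token.lower() not in ABBREVIATIONS


def split_sentences(text: str) -> list:
    """Group tokens into sentences using is_eos on each token."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if is_eos(token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Rao arrived. Did he speak? Yes!"))
```

Numbers such as 70.76 are not split because the token does not end with a period; an ML-based segmenter would instead learn such decisions from features like the ones listed above.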
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation (the output for the example paragraph is listed below)
○ Tokenization
○ Normalization
● India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
● However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed.
● With the death of 942 patients in the last 24 hours, the county's fatality count rose 47,033, the Union Health Ministry said.
● India is the third worst-hit country by the pandemic after the United States and Brazil.
Tokenization
● Usual delimiters: Whitespace, Dot, Comma, Question mark, Exclamation, etc.
● Some of the special cases
○ Abbreviation: Ph.D., AT&T, Mr., etc.
○ Time and date: [Link], 01.08.2020, or 01/08/2020
○ Apostrophe and clitics: didn’t, it’s
○ A clitic cannot stand on its own.
○ Hyphens: Covid-19, Delhi-based
Tokenization: spaCy
● Rule-based
○ Split the sentence into tokens using whitespace characters (space, tab, etc.)
○ From left to right
■ Does the token require special attention?
● Yes: check whether some prefix, suffix, or infix can be split.
● No: continue.
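The procedure above can be approximated in plain Python. This is a simplified sketch and not spaCy's actual API or rule set: the prefix, suffix, and infix patterns and the special-case list are assumptions chosen to cover the examples in these slides.

```python
import re

# Hypothetical special cases that must stay in one piece.
SPECIAL_CASES = {"Ph.D.", "Mr.", "AT&T"}
PREFIX_RE = re.compile(r"^[(\[\"']")
SUFFIX_RE = re.compile(r"(?:[)\]\"'.,!?]|'s|n't)$")
# Split a hyphen only when it sits between two letters (keeps COVID-19 whole).
INFIX_RE = re.compile(r"(?<=[A-Za-z])(-)(?=[A-Za-z])")


def split_chunk(chunk: str) -> list:
    """Peel prefixes and suffixes off one whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk and chunk not in SPECIAL_CASES:
        match = PREFIX_RE.search(chunk)
        if match:
            prefixes.append(match.group())
            chunk = chunk[match.end():]
            continue
        match = SUFFIX_RE.search(chunk)
        if match:
            suffixes.insert(0, match.group())
            chunk = chunk[:match.start()]
            continue
        break
    if chunk in SPECIAL_CASES:
        middle = [chunk]
    else:
        middle = [part for part in INFIX_RE.split(chunk) if part]
    return prefixes + middle + suffixes


def tokenize(sentence: str) -> list:
    """Whitespace split, then rule-based splitting of each chunk."""
    tokens = []
    for chunk in sentence.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(tokenize("Mr. Rao didn't see the single-day spike of 66,999 COVID-19 cases."))
```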
Tokenization: Special cases
● Tokenization in other languages is much more complex:
○ Chinese: No space between words
■ 姚明进入总决赛 (Yao Ming reaches the finals)
■ 姚明 (Yao Ming) 进入 (reaches) 总决赛 (the finals) [Chinese Treebank segmentation]
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization (the output for the example paragraph is listed below)
○ Normalization
● India has recorded the biggest single - day spike of 66,999 COVID-19 cases , taking the total number of infections to 23,96,637 .
● However , the recovery rate has gone up to 70.76 per cent , with 16,95,982 people recovering in the country from the highly -
contagious disease , government data this morning showed .
● With the death of 942 patients in the last 24 hours , the county 's fatality count rose 47,033 , the Union Health Ministry said .
● India is the third worst - hit country by the pandemic after the United States and Brazil .
Unknown or Rare words
● Recall, the purpose of tokenization is to split the sentence into meaningful entities for
downstream tasks.
● What if there is an unknown word at inference time? The system will not have a clue about it.
Corpus → simple tokenization → low, lowest, new, newer, wider
Initial vocabulary: $, d, e, i, l, n, o, r, s, t, w
● Find the most frequent pair of characters and merge them.
● After merging r and $ into r$:
l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-e-w-$ 2
n-e-w-e-r$ 6
w-i-d-e-r$ 3
Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$
● After merging e and r$ into er$:
l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-e-w-$ 2
n-e-w-er$ 6
w-i-d-er$ 3
Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$, er$
● Continue for k iterations:
low$ 5
low-e-s-t-$ 2
new-$ 2
newer$ 6
w-i-d-er$ 3
Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$, er$, ew, new, lo, low, newer$, low$
● Merge rules, applied in this order at inference: r$, er$, ew, new, lo, low, newer$, low$
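The merge loop can be sketched as follows; this procedure is commonly known as byte-pair encoding (BPE). The function names are illustrative. Ties between equally frequent pairs are broken here by first occurrence, which is an assumption: this sketch merges e + r before er + $, so the learned merge order differs slightly from the one listed above.

```python
from collections import Counter

WORD_COUNTS = {"low": 5, "lowest": 2, "new": 2, "newer": 6, "wider": 3}


def get_pair_counts(corpus: dict) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(corpus: dict, pair: tuple) -> dict:
    """Replace every occurrence of the pair by the merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_counts: dict, k: int):
    """Learn k merge rules; $ marks the end of a word."""
    corpus = {tuple(word) + ("$",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first maximal pair wins a tie
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus


def encode(word: str, merges: list) -> list:
    """Inference: apply the learned merges, in order, to a new word."""
    corpus = {tuple(word) + ("$",): 1}
    for pair in merges:
        corpus = merge_pair(corpus, pair)
    return list(next(iter(corpus)))


merges, corpus = learn_bpe(WORD_COUNTS, k=8)
print(["".join(pair) for pair in merges])
print(encode("lower", merges))
```

An unseen word such as lower is then segmented into known subwords instead of being treated as unknown.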
Corpus: ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
Character boundary: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)
Initial Vocabulary: ("b", 4), ("h", 15), ("p", 17), ("##g", 20), ("##n", 16), ("##s", 5), ("##u", 36)
Pair Frequencies: ("##u", "##g") = 20, ("##g", "##s") = 5, ("##u", "##n") = 16, ("##u", "##s") = 0, ….
Compute likelihood: score = (freq_of_pair) / (freq_of_first_element × freq_of_second_element)
("##u", "##g") = 20 / (36 × 20) = 1 / 36
("##u", "##n") = 16 / (36 × 16) = 1 / 36
("##u", "##s") = 0 / (36 × 5) = 0
("##g", "##s") = 5 / (20 × 5) = 1 / 20 (Max)
1st Merge ("##g", "##s") → "##gs": Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs" ]
Token boundary: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)
2nd Merge ("h", "##u") → "hu": Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu" ]
3rd Merge ("hu", "##g") → "hug": Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug" ]
Repeat the process until you reach the desired vocabulary size or adequate subword representation.
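The first scoring step can be reproduced with a short sketch. The ## prefix and the score formula follow the slide (this style of tokenizer is commonly called WordPiece); the function names are illustrative, and only the first merge is computed because later steps involve ties whose resolution is a design choice.

```python
import math
from collections import Counter

CORPUS = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def split_word(word: str) -> list:
    """The first character stays bare; later characters get the ## prefix."""
    return [word[0]] + ["##" + ch for ch in word[1:]]


def pair_scores(splits: dict, counts: dict):
    """score = freq_of_pair / (freq_of_first_element * freq_of_second_element)."""
    unit_freq, pair_freq = Counter(), Counter()
    for word, freq in counts.items():
        units = splits[word]
        for unit in units:
            unit_freq[unit] += freq
        for pair in zip(units, units[1:]):
            pair_freq[pair] += freq
    scores = {
        (a, b): freq / (unit_freq[a] * unit_freq[b])
        for (a, b), freq in pair_freq.items()
    }
    return scores, unit_freq


def merge_units(units: list, pair: tuple) -> list:
    """Merge each occurrence of the pair, dropping the inner ## prefix."""
    first, second = pair
    tail = second[2:] if second.startswith("##") else second
    merged, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(first + tail)
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return merged


splits = {word: split_word(word) for word in CORPUS}
scores, unit_freq = pair_scores(splits, CORPUS)
best = max(scores, key=scores.get)
splits = {word: merge_units(units, best) for word, units in splits.items()}
print(best, scores[best])
print(splits["hugs"])
```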
Other Tokenizers
● Unigram [Kudo, 2018] [Link]
○ Initialize the vocabulary with a large number of symbols: all pre-tokenized words and
common substrings.
○ Iteratively remove the symbols whose removal increases the loss the least.
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Case-folding
● india has recorded the biggest single - day spike of 66,999 covid-19 cases , taking the total number of infections to 23,96,637 .
● however , the recovery rate has gone up to 70.76 per cent , with 16,95,982 people recovering in the country from the highly -
contagious disease , government data this morning showed .
● with the death of 942 patients in the last 24 hours , the county 's fatality count rose 47,033 , the union health ministry said .
● india is the third worst - hit country by the pandemic after the united states and brazil .
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Lemmatization
● india have record the big single - day spike of 66,999 covid-19 case , take the total number of infection to 23,96,637 .
● however , the recovery rate have go up to 70.76 per cent , with 16,95,982 people recover in the country from the highly -
contagious disease , government data this morning show .
● with the death of 942 patient in the last 24 hour , the county 's fatality count rise 47,033 , the union health ministry say .
● india be the third worst - hit country by the pandemic after the united state and brazil .
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Stemming
● india ha record the biggest single - day spike of 66,999 covid-19 case , take the total number of infect to 23,96,637 .
● howev , the recoveri rate ha gone up to 70.76 per cent , with 16,95,982 peopl recov in the countri from the highly - contagi
diseas , govern data thi morn show .
● with the death of 942 patient in the last 24 hour , the counti 's fatal count rose 47,033 , the union health ministri said .
● india is the third worst - hit countri by the pandem after the unit state and brazil .
Classes of Morphology
Inflectional: No changes in the word class
● Serves grammatical/semantic purposes different than the original form
● Easy to predict the meaning
● E.g., adding “s” or “es” to a noun marks the plural
● Highly systematic, though some irregularities/exceptions are there
○ Mouse + plural → Mice
Derivational: Changes the word class
● Combination of a stem with other morphemes changes the class
● Formation of noun from verb/adjective (nominalization)
○ Summarize + ation → Summarization
○ Trust + ee → Trustee
● Formation of adjective from noun/verb
○ Computation + al → Computational
○ Trust + able → Trustable
shadakhtar:nlp:iiit[Link]morphology
Morphology Parsing
● Two questions
○ What is the plural of cat?
○ What does cats mean?
Morphology Parsing
● Khāyegā → Khā + ye + g + ā
○ eat + 3rd person + future + male
● Khāyegī → Khā + ye + g + ī
○ eat + 3rd person + future + female
● Khāongā → Khā + on + g + ā
○ eat + 1st person + future + male
● Synthesis (generation) is easier than analysis (recognition/parsing)
○ Why?
■ Ambiguity: analysis has to utilize external evidence, e.g., context.
Morphology Parsing
● Ingredients for the morphological parser
○ Lexicon
■ List of stems, affixes, and other information (POS tag of stem, etc.)
○ Rules
■ Rules for morpheme ordering, e.g., plural -s should follow a noun/verb stem
■ Rules that define changes in characters, e.g., city + s → cities
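A toy sketch of how a lexicon and rules combine for synthesis (generating a plural). The lexicon entries, the irregular-form table, and the function `pluralize` are hypothetical and cover only a few English spelling rules.

```python
# Hypothetical lexicon: stem -> part-of-speech tag.
LEXICON = {"cat": "N", "city": "N", "fox": "N", "mouse": "N", "boy": "N"}
# Irregular forms are listed explicitly rather than derived by rule.
IRREGULAR_PLURALS = {"mouse": "mice"}
VOWELS = "aeiou"


def pluralize(stem: str) -> str:
    """Attach the plural morpheme to a noun stem, applying spelling rules."""
    if LEXICON.get(stem) != "N":
        raise ValueError(f"{stem!r} is not a known noun stem")
    if stem in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[stem]
    # Spelling rule: consonant + y -> ies (city + s -> cities).
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"
    # Spelling rule: insert e after sibilant endings (fox + s -> foxes).
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"


for noun in ["cat", "city", "fox", "mouse", "boy"]:
    print(noun, "->", pluralize(noun))
```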
Are lexicon and rules mandatory?
● Lexicon-only morphology
○ Lists all surface level and lexical level pairs
■ Surface ←→ Lexical
○ No rules
○ Analysis and Synthesis are easy.
○ Difficult to record all possibilities for any language
● Stemming (Lexicon-free)
○ Set of rules
○ Interested in stems.
■ Don’t care about the structure of the word
■ Don’t care about the right stem, as long as we get a consistent stem.
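Both extremes can be sketched in a few lines. The lexicon-only tables list surface/lexical pairs explicitly, while the stemmer uses only suffix rules. The tag notation (+N, +PL) and the rule list are assumptions for illustration; this is a toy stemmer, not the Porter algorithm.

```python
# Lexicon-only morphology: every surface form is listed with its analysis.
SURFACE_TO_LEXICAL = {
    "cat": "cat +N +SG",
    "cats": "cat +N +PL",
    "cities": "city +N +PL",
}
LEXICAL_TO_SURFACE = {lexical: surface for surface, lexical in SURFACE_TO_LEXICAL.items()}

# Lexicon-free stemming: ordered (suffix, replacement) rules, first match wins.
SUFFIX_RULES = [
    ("ies", "i"),
    ("sses", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
    ("y", "i"),
]


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word


print(SURFACE_TO_LEXICAL["cats"], "|", LEXICAL_TO_SURFACE["city +N +PL"])
for token in ["country", "countries", "recovering", "recovered", "has", "is"]:
    print(token, "->", stem(token))
```

The stems need not be real words (countri, ha); what matters is that related forms such as country and countries map to the same stem.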
Thanks