Text Processing and Regular Expression
Md Shad Akhtar
[Link]@[Link]

Shadakhtar:nlp:iiit[Link]text:processing:regex
Word
● Words are the building blocks of language.
● Each language has a fixed number of valid words, a.k.a. its vocabulary.

Word
● How do we identify a valid word, e.g., bank?
○ Dictionary lookup
● What about?
○ Bank, BANK: can be handled through case-folding.
○ Banks: in general, no dictionary lists plural forms explicitly.
Pattern Matching
● Searching/Matching a pattern (string) is very frequent in text processing.
○ Web-search or IR based system
○ Word-processing applications.

● A simple yet powerful solution is to use regular expressions for pattern matching.

An example
● Document:

The recent attempt by the police to retain their current rates of pay has not gathered
much favor with the southern factions.

● Write a regular expression that finds all occurrences of the word ‘the’ (case-insensitive).

● RegEx: /[Tt]he/

● Output (true positives in brackets): [The] recent attempt by [the] police to retain their current rates of pay has not gathered much favor with [the] southern factions.
An example
● RegEx: /[Tt]he/

● Output (false positives in brackets): The recent attempt by the police to retain [the]ir current rates of pay has not ga[the]red much favor with the sou[the]rn factions.

● Fix: add word boundaries to the RE.
An example
● RegEx: /\b[Tt]he\b/

● Output (matches in brackets): [The] recent attempt by [the] police to retain their current rates of pay has not gathered much favor with [the] southern factions.
Success!
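
The effect of the word boundary can be checked quickly in Python. This is a minimal sketch using the standard `re` module; the variable names are illustrative and not part of the slides.

```python
import re

text = ("The recent attempt by the police to retain their current rates of pay "
        "has not gathered much favor with the southern factions.")

# Without word boundaries the pattern also fires inside longer words
# such as "their", "gathered", and "southern".
naive = re.findall(r"[Tt]he", text)

# \b anchors the match to word boundaries, so only the standalone word matches.
bounded = re.findall(r"\b[Tt]he\b", text)

print(len(naive), len(bounded))
```

Comparing the two counts separates the true positives from the false positives introduced by substring matches.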
Regular Expression is more powerful than it seems!

ELIZA [Weizenbaum, 1964-66, MIT AI lab]
● The first chatbot.
● Pattern matching system

ELIZA
● Substitution with regex
○ s / RE / SubText /
● Three steps:
○ Change first-person mentions to uppercase second-person mentions
■ I’m → YOU ARE
○ Create a set of all possible substitutions.
○ Rank the possible outputs to respond to the user.

s/.* YOU ARE (depressed|sad) .*/ I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/ WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
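
The substitution rules above can be approximated in Python with the `re` module. This is only a sketch under stated assumptions: the patterns are loosened with `\b` so that they also match at the start or end of an utterance, the rules are tried in a fixed order instead of being ranked, and the `PLEASE GO ON` fallback and the function name `respond` are hypothetical.

```python
import re

# (pattern, response template) pairs, tried in order.
RULES = [
    (re.compile(r".*\bYOU ARE (depressed|sad)\b.*"), r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".*\ball\b.*"), "IN WHAT WAY"),
    (re.compile(r".*\balways\b.*"), "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    # Step 1: rewrite first-person mentions as uppercase second-person mentions.
    text = re.sub(r"\bI['’]m\b|\bI am\b", "YOU ARE", utterance)
    # Steps 2-3 (simplified): return the response of the first rule that matches.
    for pattern, template in RULES:
        match = pattern.fullmatch(text)
        if match:
            return match.expand(template)
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides

print(respond("I'm depressed these days"))
print(respond("They are always late"))
```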

Regular expressions and Finite automata

Validating words
● Bank
● bank
● Banks
● banks

Regular Expression

Examples:

RE → Matches any text containing these symbols → Example matches
/bank/ → Ordered sequence of chars b, a, n, and k → bank
/Banks/ → Ordered sequence of chars B, a, n, k, and s → Banks
/[Bb]ank/ → Either uppercase or lowercase b → bank, Bank
/[Bb]anks?/ → Char s is optional (0 or 1) → bank, Bank, banks, Banks

Regular Expression: Kleene closure
Examples:

RE → Matches any text containing these symbols → Example matches
/a*/ → 0 or more occurrences of a → ε, a, aa, aaa, aaaa, ...
/a+/ → 1 or more occurrences of a → a, aa, aaa, aaaa, ...
/[ab]*/ → 0 or more a's or b's → ε, a, b, aaaa, abba, baab
/baa+!/ → ba followed by 1 or more a's and ending with ! → baa!, baaaaa!, …
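
The patterns on these two slides can be tried out directly. The sketch below uses `re.fullmatch`, so it tests whether a whole string belongs to the language of the pattern, which is stricter than searching for the pattern inside a longer text; the helper name `accepts` and the sample strings are illustrative assumptions.

```python
import re

def accepts(pattern, string):
    """True if the whole string (not just a substring) matches the pattern."""
    return re.fullmatch(pattern, string) is not None

examples = {
    r"[Bb]anks?": ["bank", "Banks", "banker"],
    r"a*": ["", "aaa", "ab"],
    r"a+": ["", "a"],
    r"[ab]*": ["abba", "abc"],
    r"baa+!": ["baa!", "ba!", "baaaaa!"],
}

for pattern, strings in examples.items():
    print(pattern, {s: accepts(pattern, s) for s in strings})
```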

Regular Expression and Finite Automata
● Regular expressions can be implemented as finite-state automata.
● The set of strings accepted by a regular expression, or by the corresponding finite automaton, is called a regular language.
● A Regular Expression (RE), a Finite Automaton (M), and a Regular grammar (G) are three equivalent ways of describing a Regular language (L).

● Example:
○ L: {a, b, ab, aa, ba, aba, …}
○ RE: (a|b)+
○ G: S → a | b | aS | bS
○ M: q1 →(a, b) q2, with a self-loop on q2 labelled a, b; q2 is the accepting state

Finite Automata
● Given an input x, it outputs whether or not the input x is accepted by the automaton M.
○ M(x) → {Accepted, Not accepted}

● Types of finite automata
○ Deterministic Finite Automata (DFA)
○ Non-deterministic Finite Automata (NFA)

Deterministic Finite Automata (DFA)
● A finite automaton can be represented as a quintuple M = <Q, Σ, S, F, δ>, where
○ Q = Finite set of states
○ Σ = Finite set of input symbols
○ S = Start state, S ∈ Q
○ F = Finite set of final/accepting states, F ⊆ Q
○ δ = Transition function. δ: qi →a qj, where qi, qj ∈ Q and a ∈ Σ

Regular expression: /[Bb]anks?/

Σ = {a-zA-Z}
Q = {q1, q2, q3, q4, q5, q6}
S = q1
F = {q5, q6}
δ: q1 →B q2, q1 →b q2, q2 →a q3, q3 →n q4, q4 →k q5, q5 →s q6
Transition diagram and table
Q/Σ: a, b, .., k, n, s, .., A, B, ...

q1: b → q2, B → q2
q2: a → q3
q3: n → q4
q4: k → q5
q5: s → q6

*All other entries (blank in the table) point to the dead (D) state
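
A transition table like this maps naturally onto a nested dictionary. The sketch below encodes the DFA for /[Bb]anks?/; the names `DELTA`, `DEAD`, `run`, and `accepts` are assumptions made for illustration, and every missing entry is treated as a move to the dead state.

```python
DEAD = "D"

# Transition table: state -> {input symbol -> next state}.
DELTA = {
    "q1": {"B": "q2", "b": "q2"},
    "q2": {"a": "q3"},
    "q3": {"n": "q4"},
    "q4": {"k": "q5"},
    "q5": {"s": "q6"},
}
START = "q1"
FINAL = {"q5", "q6"}

def run(word):
    """Return the state reached after consuming every symbol of the word."""
    state = START
    for symbol in word:
        # Blank table entries (and anything leaving D or q6) lead to the dead state.
        state = DELTA.get(state, {}).get(symbol, DEAD)
    return state

def accepts(word):
    return run(word) in FINAL

print([w for w in ["bank", "Bank", "banks", "Banks", "ban", "bankss"] if accepts(w)])
```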

DFA: Example 2
Regular expression: /baa+!/
Transitions: q1 →b q2 →a q3 →a q4 →! q5, with a self-loop on q4 labelled a
Valid inputs: baa!, baaa!, baaaaaa!, …

Acceptance procedure:
1. While all the input symbols are not consumed:
a. At the current state, if the current input symbol matches any of its outlink labels,
i. Consume the input, move along the link to another state, and goto step 1.
b. Else
i. Announce failure
2. If the current state is one of the final states
a. Announce success
3. Else
a. Announce failure
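
The acceptance procedure can be sketched as a short loop over the input. The transition dictionary below encodes the /baa+!/ automaton with q5 taken as the only final state; the names `TRANSITIONS` and `accepts` are hypothetical.

```python
# (current state, input symbol) -> next state
TRANSITIONS = {
    ("q1", "b"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q4",
    ("q4", "a"): "q4",  # self-loop: any number of extra a's
    ("q4", "!"): "q5",
}

def accepts(tape, start="q1", finals=frozenset({"q5"})):
    state = start
    for symbol in tape:                      # step 1: consume the input left to right
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:               # step 1.b: no matching outlink
            return False
        state = next_state                   # step 1.a: move along the link
    return state in finals                   # steps 2-3: success only in a final state

print(accepts("baaa!"), accepts("ba!"))
```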
Text Processing

Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to
23,96,637. However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country
from the highly-contagious disease, government data this morning showed. With the death of 942 patients in the last 24
hours, the county's fatality count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the
pandemic after the United States and Brazil.

Sentence Segmentation
● Usual sentence end-markers: ‘.’, ‘?’, ‘!’
● Exclamation and question marks are usually unambiguous.
○ Interjections: Oh!

● Period is quite ambiguous
○ Sentence end marker
○ Acronyms or abbreviations: Ph.D. , [Link] , Mr. , Dr.
○ Numbers: 23.11
○ Ellipsis: …
○ URLs and email IDs: [Link], abc@[Link]

Sentence Segmentation
● Disambiguation
○ Rule-based
○ RegEx based
○ ML based

● Decision tree for a candidate end-of-sentence (EOS) punctuation mark:
1. Blank line(s) after it? Yes → EOS. No → go to 2.
2. Is the punctuation ? or ! ? Yes → EOS. No → go to 3.
3. Is the punctuation . ? No → Not EOS. Yes → go to 4.
4. Abbreviation? Yes → Not EOS. No → EOS.

Some other useful features
● Case of word with “.”: Upper, Lower, Cap, Number
● Case of word after “.”: Upper, Lower, Cap, Number
● Numeric features
○ Length of word with “.”
○ Probability(word with “.” occurs at end-of-sentence)
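
The rule-based decision tree can be written as a few conditionals. This is a minimal sketch: the abbreviation list is a tiny hypothetical stand-in for a real lexicon, and, like the tree itself, it has no rule for numbers such as 23.11 or for ellipses, so those periods would still be treated as sentence ends.

```python
import re

ABBREVIATIONS = {"mr.", "dr.", "ph.d."}  # hypothetical, deliberately tiny list

def token_at(text, i):
    """Return the whitespace-delimited token that covers position i."""
    for match in re.finditer(r"\S+", text):
        if match.start() <= i < match.end():
            return match.group()
    return ""

def is_eos(text, i):
    """Apply the decision tree to the punctuation mark at text[i]."""
    if re.match(r"[ \t]*\n[ \t]*\n", text[i + 1:]):   # blank line(s) after it?
        return True
    if text[i] in "?!":                               # ? or ! ends the sentence
        return True
    if text[i] != ".":                                # any other punctuation
        return False
    return token_at(text, i).lower() not in ABBREVIATIONS

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".?!" and is_eos(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Rao met Mr. Sen. They talked! Did it help?"))
```

The extra features listed above (case of the neighbouring words, word length, end-of-sentence probability) are what an ML-based segmenter would use to handle the cases this rule set misses.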
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
● However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed.
● With the death of 942 patients in the last 24 hours, the county's fatality count rose 47,033, the Union Health Ministry said.
● India is the third worst-hit country by the pandemic after the United States and Brazil.
Tokenization
● Usual delimiters: Whitespace, Dot, Comma, Question mark, Exclamation, etc.
● Some of the special cases
○ Abbreviation: Ph.D., AT&T, Mr., etc.
○ Time and date: [Link], 01.08.2020, or 01/08/2020
○ Apostrophe and clitics: didn’t, it’s
■ A clitic cannot stand on its own.
○ Hyphens: Covid-19, Delhi-based

● Standard for tokenization: LDC Penn Treebank tokenization standard
○ didn’t → did + n’t
○ it’s → it + ’s
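
The two Penn Treebank splits shown above can be imitated with a pair of regular expressions. This is only an illustrative sketch, not the official tokenizer: the function name `split_clitics` is hypothetical, and the clitics beyond n’t and ’s (’re, ’ve, ’ll, ’d, ’m) are an assumption added for completeness.

```python
import re

# n't is split off together with the n; other clitics start at the apostrophe.
NEGATION = re.compile(r"(.+)(n['’]t)", re.IGNORECASE)
CLITIC = re.compile(r"(.+)(['’](?:s|re|ve|ll|d|m))", re.IGNORECASE)

def split_clitics(token):
    for pattern in (NEGATION, CLITIC):
        match = pattern.fullmatch(token)
        if match:
            return [match.group(1), match.group(2)]
    return [token]  # no clitic found: keep the token whole

print(split_clitics("didn't"), split_clitics("it's"), split_clitics("bank"))
```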

Tokenization: spaCy
● Rule-based
○ Split the sentence into tokens using whitespace chars (space, tab, etc.)
○ From left to right
■ Does the token require special attention?
● Yes
○ Check whether some prefix, suffix, or infix can be split.
● No
○ continue;
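
The loop described above can be sketched without any library. This is not spaCy's implementation or API; it is a simplified imitation of the idea, and the exception list, the punctuation sets, and the names `split_chunk` and `tokenize` are all assumptions.

```python
import re

SPECIAL_CASES = {"Mr.", "Dr.", "Ph.D."}             # hypothetical exception list
PREFIX = re.compile(r"""^[(\["'“‘]""")              # opening brackets and quotes
SUFFIX = re.compile(r"""[)\]"'”’,.!?;:]$""")        # closing punctuation
INFIX = re.compile(r"(?<=[A-Za-z])(-)(?=[A-Za-z])") # hyphen between letters

def split_chunk(chunk):
    """Peel prefixes and suffixes off one whitespace-delimited chunk, then split infixes."""
    prefixes, suffixes = [], []
    while chunk and chunk not in SPECIAL_CASES:
        if PREFIX.search(chunk):
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif SUFFIX.search(chunk):
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    middle = [piece for piece in INFIX.split(chunk) if piece]
    return prefixes + middle + suffixes

def tokenize(sentence):
    return [tok for chunk in sentence.split() for tok in split_chunk(chunk)]

print(tokenize("the biggest single-day spike of 66,999 COVID-19 cases, said Dr. Rao."))
```

Because the infix rule only fires between two letters, "single-day" is split while "COVID-19" and "66,999" stay whole.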

Tokenization: Special cases
● Tokenization in other languages is much more complex:
○ Chinese: No space between words
■ 姚明进入总决赛 (Yao Ming reaches the finals)
■ 姚明 (Yao Ming) 进入 (reaches) 总决赛 (the finals) [Chinese Treebank segmentation]

○ Each Chinese character represents a single unit of meaning (morpheme) and is pronounceable as a single syllable.

Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● India has recorded the biggest single - day spike of 66,999 COVID-19 cases , taking the total number of infections to 23,96,637 .
● However , the recovery rate has gone up to 70.76 per cent , with 16,95,982 people recovering in the country from the highly - contagious disease , government data this morning showed .
● With the death of 942 patients in the last 24 hours , the county 's fatality count rose 47,033 , the Union Health Ministry said .
● India is the third worst - hit country by the pandemic after the United States and Brazil .
Unknown or Rare words
● Recall, the purpose of tokenization is to split the sentence into meaningful entities for downstream tasks.
● What if there is an unknown word at inference time? The system will not have a clue about it.

● Train Corpus: low, lowest, new, newer, wider
● Model: it knows tokens of the train data only.
● Test Corpus: lower → ??

Can we do better with the tokenization?

Byte-pair Encoding (BPE) [Link]
● Answer: Yes. Subword tokens might help.
● Subword-based tokenization
● Provides a solution to handle unknown or rare words in downstream tasks.
● Learns from data what the tokens should be.
● Iteratively merges frequent pairs of characters to form subword/word tokens.

Byte-pair Encoding (BPE)
● Corpus → simple tokenization: low, lowest, new, newer, wider
● Word → sequence of characters
○ Append special symbol ($) at the end
○ *Hyphen (-) denotes character split
● Record frequency of each word in the corpus

l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-e-w-$ 2
n-e-w-e-r-$ 6
w-i-d-e-r-$ 3

Vocabulary: $, d, e, i, l, n, o, r, s, t, w

● Find the most frequent pair of characters and merge them.

Merge 1: r + $ → r$

l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-e-w-$ 2
n-e-w-e-r$ 6
w-i-d-e-r$ 3

Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$

Merge 2: e + r$ → er$

l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-e-w-$ 2
n-e-w-er$ 6
w-i-d-er$ 3

Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$, er$

Merge 3: e + w → ew

l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-ew-$ 2
n-ew-er$ 6
w-i-d-er$ 3

Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$, er$, ew

Merge 4: n + ew → new

l-o-w-$ 5
l-o-w-e-s-t-$ 2
new-$ 2
new-er$ 6
w-i-d-er$ 3

Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$, er$, ew, new

● Continue for k iterations

low$ 5
low-e-s-t-$ 2
new-$ 2
newer$ 6
w-i-d-er$ 3

Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$, er$, ew, new, lo, low, newer$, low$

Merge rules (in learned order): r$, er$, ew, new, lo, low, newer$, low$

Inference
● Apply the learned merges to the test sentence, in the order in which they were learned.

● E.g.,
○ newer
■ n-e-w-e-r-$ → n-e-w-e-r$ → n-e-w-er$ → n-ew-er$ → new-er$ → newer$
○ lower
■ l-o-w-e-r-$ → l-o-w-e-r$ → l-o-w-er$ → lo-w-er$ → low-er$
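
The training and inference steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and ties between equally frequent pairs are broken by first occurrence, so the intermediate merge order may differ from the worked example even though the final segmentations of "newer" and "lower" agree.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    # Each word becomes a tuple of characters plus the end-of-word symbol $.
    corpus = {tuple(word) + ("$",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges

def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in learned order."""
    symbols = tuple(word) + ("$",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe({"low": 5, "lowest": 2, "new": 2, "newer": 6, "wider": 3}, 8)
print(apply_bpe("newer", merges), apply_bpe("lower", merges))
```

The unseen word "lower" is no longer unknown: it is covered by the subwords learned from "low" and from the "er$" ending.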
Word-Piece Tokenizer [Schuster et al., 2012]
● Uses word boundary token at the beginning of the word instead of at the end.
● Rather than merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data once added to the vocabulary.

Corpus: ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Character boundary: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)

Initial Vocabulary: ("b", 4), ("h", 15), ("p", 17), ("##g", 20), ("##n", 16), ("##s", 5), ("##u", 36)

Pair Frequencies: ("##u", "##g") = 20, ("##g", "##s") = 5, ("##u", "##n") = 16, ("##u", "##s") = 0, …

Compute likelihood: score = (freq_of_pair) / (freq_of_first_element × freq_of_second_element)

Word-Piece Tokenizer [Schuster et al., 2012]
Compute likelihood: score = (freq_of_pair) / (freq_of_first_element × freq_of_second_element)
○ ("##u", "##g") = 20 / (36 × 20) = 1 / 36
○ ("##u", "##n") = 16 / (36 × 16) = 1 / 36
○ ("##u", "##s") = 0 / (36 × 5) = 0
○ ("##g", "##s") = 5 / (20 × 5) = 1 / 20 ← Max

1st Merge ("##g", "##s") → "##gs": Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs" ]

Token boundary: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)

2nd Merge ("h", "##u") → "hu" (all remaining pairs tie at 1 / 36, so the first pair is merged): Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu" ]

3rd Merge ("hu", "##g") → "hug": Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug" ]

Repeat the process until you reach the desired vocabulary size or adequate subword representation.
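
The score for the first merge can be reproduced with a short script. This is a sketch of the scoring formula only, not a full WordPiece trainer; the helper names `to_pieces` and `pair_scores` are assumptions, and `Fraction` is used so the scores print as exact ratios.

```python
from collections import Counter
from fractions import Fraction

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def to_pieces(word):
    """Split a word into characters, marking non-initial ones with ##."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

def pair_scores(splits, freqs):
    """score = freq_of_pair / (freq_of_first_element * freq_of_second_element)."""
    piece_freq, pair_freq = Counter(), Counter()
    for word, pieces in splits.items():
        for piece in pieces:
            piece_freq[piece] += freqs[word]
        for pair in zip(pieces, pieces[1:]):
            pair_freq[pair] += freqs[word]
    return {pair: Fraction(count, piece_freq[pair[0]] * piece_freq[pair[1]])
            for pair, count in pair_freq.items()}

splits = {word: to_pieces(word) for word in corpus}
scores = pair_scores(splits, corpus)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Dividing by the individual frequencies penalises pairs built from very common pieces such as "##u", which is why the rarer pair ("##g", "##s") wins the first merge even though ("##u", "##g") is more frequent.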

Other Tokenizers
● Unigram [Kudo, 2018] [Link]
○ Initialize the vocabulary to include a large number of symbols: all pre-tokenized words and common substrings.
○ Iteratively remove the symbols whose removal increases the loss the least.

● Sentence-piece [Kudo and Richardson, 2018] [Link]
○ Instead of pre-tokenized words, it operates on the sequence of characters of a sentence. (Space is a valid symbol.)
○ Employs the BPE or Unigram tokeniser
Normalization
● Putting words/tokens in a standard format
● Types:
○ Case-Folding
■ CAT, Cat, cat → cat or CAT
○ Lemmatization: Finding the root form of the word
■ is, are, was → be
■ cat, cats, goose, geese, fly, flies → cat, cat, goose, goose, fly, fly
○ Stemming: Chopping off the affixes
■ is, are, was → is, are, wa
■ cat, cats, goose, geese, fly, flies → cat, cat, goos, gees, fli, fli
○ Others
■ Equivalence class
● Normalization vs Normalisation, color vs colour, center vs centre
● U.S.A. vs USA vs US
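
Case-folding and equivalence classes are the simplest of these to sketch. The mapping table below is a hypothetical, hand-written example; a real system would need a much larger table, and "US" is left out on purpose because after case-folding it collides with the pronoun "us".

```python
# Hypothetical equivalence classes: variant spelling -> canonical form.
EQUIVALENCE = {
    "normalisation": "normalization",
    "colour": "color",
    "centre": "center",
    "u.s.a.": "usa",
}

def normalize(token):
    folded = token.casefold()               # case-folding: CAT, Cat, cat -> cat
    return EQUIVALENCE.get(folded, folded)  # then map variants to one form

print([normalize(t) for t in ["CAT", "Colour", "U.S.A.", "USA"]])
```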

Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Case-folding

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● india has recorded the biggest single - day spike of 66,999 covid-19 cases , taking the total number of infections to 23,96,637 .
● however , the recovery rate has gone up to 70.76 per cent , with 16,95,982 people recovering in the country from the highly -
contagious disease , government data this morning showed .
● with the death of 942 patients in the last 24 hours , the county 's fatality count rose 47,033 , the union health ministry said .
● india is the third worst - hit country by the pandemic after the united states and brazil .
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Lemmatization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● india have record the big single - day spike of 66,999 covid-19 case , take the total number of infection to 23,96,637 .
● however , the recovery rate have go up to 70.76 per cent , with 16,95,982 people recover in the country from the highly -
contagious disease , government data this morning show .
● with the death of 942 patient in the last 24 hour , the county 's fatality count rise 47,033 , the union health ministry say .
● india be the third worst - hit country by the pandemic after the united state and brazil .
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Stemming

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● india ha record the biggest single - day spike of 66,999 covid-19 case , take the total number of infect to 23,96,637 .
● howev , the recoveri rate ha gone up to 70.76 per cent , with 16,95,982 peopl recov in the countri from the highly - contagi diseas , govern data thi morn show .
● with the death of 942 patient in the last 24 hour , the counti 's fatal count rose 47,033 , the union health ministri said .
● india is the third worst - hit countri by the pandem after the unit state and brazil .
[Link]

Porter’s Stemmer [Porter, 1980]: Applies a series of rules

Notation: m = measure of the stem (number of vowel-consonant sequences), *v* = stem contains a vowel, *d = stem ends with a double consonant, *o = stem ends consonant-vowel-consonant where the last consonant is not w, x, or y, *L / *S / *Z = stem ends with l / s / z.

Rule 1.a.
1. SSES → SS: caresses → caress
2. IES → I: ponies → poni, ties → ti
3. SS → SS: caress → caress
4. S → ε: cats → cat

Rule 1.b.
1. (m>0) EED → EE: feed → feed, agreed → agree
2. (*v*) ED → ε: plastered → plaster, bled → bled
3. (*v*) ING → ε: motoring → motor, sing → sing

If Rule 1.b.2 or 1.b.3 is successful, the following is done:
1. AT → ATE: conflat(ed) → conflate
2. BL → BLE: troubl(ed) → trouble
3. IZ → IZE: siz(ed) → size
4. (*d and not (*L or *S or *Z)) → single letter: hopp(ing) → hop, tann(ed) → tan, fall(ing) → fall, hiss(ing) → hiss, fizz(ed) → fizz
5. (m=1 and *o) → E: fail(ing) → fail, fil(ing) → file

Rule 1.c.
1. (*v*) Y → I: happy → happi, sky → sky

Rule 2
1. ATIONAL → ATE: relational → relate
2. TIONAL → TION: conditional → condition, rational → rational
3. ATOR → ATE: operator → operate

Rule 3 …. Rule 4 …. Rule 5 ….
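
Rule 1.a is simple enough to write out directly, since it has no conditions on the stem. The sketch below covers only that rule (the first matching suffix wins, longest first); the function name is hypothetical, and the later rules would additionally need the measure m and the vowel checks.

```python
def porter_step_1a(word):
    """Apply Porter's Rule 1.a: try the suffixes from longest to shortest."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word

print([porter_step_1a(w) for w in ["caresses", "ponies", "ties", "caress", "cats"]])
```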
shadakhtar:nlp:iiit[Link]morphology
Morphology
A quick review

Classes of Morphology
Inflectional: No changes in the word class
● Serves grammatical/semantic purposes different than the original form
● Easy to predict the meaning
● E.g., “s” or “es” added to a noun → marks the plural
● Highly systematic, though some irregularities/exceptions are there
○ Mouse + plural → Mice

Derivational: Changes the word class
● Combination of a stem with other morphemes changes the class
● Formation of noun from verb/adjective (nominalization)
○ Summarize + ation → Summarization
○ Trust + ee → Trustee
● Formation of adjective from noun/verb
○ Computation + al → Computational
○ Trust + able → Trustable

Morphology Parsing
● Two questions
○ What is the plural of cat?
○ What does cats mean?

● Mapping the surface-level form to the lexical-level form

○ cats → Surface-level
○ cat + N + PL → Lexical-level

● Morphological recognition / analysis: Surface to lexical

● Morphological generation / synthesis: Lexical to surface

Morphology Parsing

● Khāyegā → Khā (eat) + ye (3rd person) + g (future) + ā (male)
● Khāyegī → Khā (eat) + ye (3rd person) + g (future) + ī (female)
● Khāongā → Khā (eat) + on (1st person) + g (future) + ā (male)

Ambiguity
● Synthesis (generation) is easier than Analysis (recognition/parsing)
○ Why?
■ Ambiguity: Utilize external evidence, e.g., context.

Morphology Parsing
● Ingredients for the morphological parser
○ Lexicon
■ List of stems, affixes, and other information (POS tag of stem, etc.)

○ Rules
■ Rules for morpheme ordering, e.g., plural -s should follow a noun/verb stem
■ Rules that define changes in characters, e.g., city + s → cities
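
A toy parser shows how the two ingredients work together. Everything here is a simplifying assumption: the three-word lexicon, the function names, and the spelling rules (consonant + y → ies, and -es after s, x, z, ch, sh) cover only regular English noun plurals.

```python
LEXICON = {"cat": "N", "city": "N", "fox": "N"}  # hypothetical stems with POS tags

def pluralize(stem):
    """Generation (lexical → surface): attach plural -s and apply spelling rules."""
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"          # city + s → cities
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                # fox + s → foxes
    return stem + "s"                     # cat + s → cats

def analyze(surface):
    """Analysis (surface → lexical): find the stems whose plural is this form."""
    return [f"{stem} + {pos} + PL"
            for stem, pos in LEXICON.items()
            if pos == "N" and pluralize(stem) == surface]

print(pluralize("city"), analyze("cats"), analyze("foxes"))
```

Analysis is done here by generating and comparing, which works for a tiny lexicon but illustrates why recognition is the harder direction: the parser has to consider every stem that could have produced the surface form.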

Are lexicon and rules mandatory?
● Lexicon-only morphology
○ Lists all surface level and lexical level pairs
■ Surface ←→ Lexical
○ No rules
○ Analysis and Synthesis are easy.
○ Difficult to record all possibilities for any language

● Stemming (Lexicon-free)
○ Set of rules
○ Interested in stems.
■ Don’t care about the structure of the word
■ Don’t care about the right stem, as long as we get a consistent stem.

Thanks
