
Text Processing and Regular Expression
Md Shad Akhtar
[Link]@[Link]

shadakhtar:nlp:iiit[Link]text:processing:regex
Word
● Words are the building blocks of language.
● Each language has a fixed number of valid words, aka. vocabulary.

Word
● How do we identify a valid word, e.g., bank?
○ Dictionary lookup
● What about?
○ Bank, BANK: can be handled through case-folding
○ Banks: in general, no dictionary manages plural forms explicitly
Pattern Matching
● Searching/Matching a pattern (string) is very frequent in text processing.
○ Web-search or IR based system
○ Word-processing applications.

● A simple yet powerful solution is to use regular expressions for pattern matching.

An example
● Document:

The recent attempt by the police to retain their current rates of pay has not gathered
much favor with the southern factions.

● Write a regular expression that finds all occurrences of the word ‘the’ (case-insensitive).

● RegEx: /[Tt]he/
True positives

● Output: The recent attempt by the police to retain their current rates of pay has not
gathered much favor with the southern factions.
An example
● Document:

The recent attempt by the police to retain their current rates of pay has not gathered
much favor with the southern factions.

● Write a regular expression that finds all occurrences of the word ‘the’ (case-insensitive).

● RegEx: /[Tt]he/
False positives: add a word boundary to the RE.

● Output: The recent attempt by the police to retain their current rates of pay has not
gathered much favor with the southern factions.
An example
● Document:

The recent attempt by the police to retain their current rates of pay has not gathered
much favor with the southern factions.

● Write a regular expression that finds all occurrences of the word ‘the’ (case-insensitive).

● RegEx: /\b[Tt]he\b/

● Output: The recent attempt by the police to retain their current rates of pay has not
gathered much favor with the southern factions.
Success!
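The three attempts above can be checked directly with Python's `re` module (a quick sketch; `doc` is the sample sentence from the slide):

```python
import re

doc = ("The recent attempt by the police to retain their current rates of pay "
       "has not gathered much favor with the southern factions.")

# Without word boundaries: also matches 'the' inside 'their',
# 'gathered', and 'southern' (false positives).
print(re.findall(r"[Tt]he", doc))

# With \b word boundaries: only the standalone word 'the'/'The'.
print(re.findall(r"\b[Tt]he\b", doc))
```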
Regular Expression is more powerful than it seems!

ELIZA [Weizenbaum, 1964-66, MIT AI lab]
● The first chatbot.
● Pattern matching system

ELIZA
● Substitution with regex
○ s / RE / SubText /
● Three steps:
○ Changing the first person mentions to the uppercase second person mentions
■ I’m → YOU ARE
○ Create a set of all possible substitutions.
○ Rank the possible outputs to respond to the user.

s/.* YOU ARE (depressed|sad) .*/ I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/ WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
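A minimal sketch of one such substitution-based turn in Python. The `eliza_respond` helper and its tiny rule list are illustrative (uppercased throughout for simplicity), not Weizenbaum's actual rule set:

```python
import re

def eliza_respond(utterance: str) -> str:
    text = utterance.upper()
    # Step 1: first-person mentions -> uppercase second person (I'm -> YOU ARE).
    text = re.sub(r"\bI'?M\b", "YOU ARE", text)
    # Steps 2-3: candidate substitutions, tried in rank order.
    rules = [
        (r".* YOU ARE (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".* ALL .*", "IN WHAT WAY"),
        (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    ]
    for pattern, response in rules:
        if re.match(pattern, text):
            return re.sub(pattern, response, text)
    return "PLEASE GO ON"

print(eliza_respond("Well, I'm depressed about it"))
```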


Regular expressions and Finite automata

Validating words
● Bank
● bank
● Banks
● banks

Regular Expression

Examples:

RE           Matches                                    Any text containing these symbols
/bank/     → Ordered seq of chars b, a, n, & k          bank
/Banks/    → Ordered seq of chars B, a, n, k, & s       Banks
/[Bb]ank/  → Either uppercase or lowercase b            bank, Bank
/[Bb]anks? → Char s is optional (0 or 1)                bank, Bank, banks, Banks
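The patterns in the table behave as described; a quick check with Python's `re` module:

```python
import re

# /[Bb]ank/ accepts either case of the first letter.
assert re.fullmatch(r"[Bb]ank", "bank") and re.fullmatch(r"[Bb]ank", "Bank")

# s? makes the plural marker optional (0 or 1 occurrence).
for word in ["bank", "Bank", "banks", "Banks"]:
    assert re.fullmatch(r"[Bb]anks?", word)

# fullmatch rejects anything beyond the pattern.
assert not re.fullmatch(r"[Bb]anks?", "banking")
print("all patterns behave as in the table")
```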

Regular Expression: Kleene closure

Examples:

RE        Matches                                       Any text containing these symbols
/a*/    → 0 or more occurrences of a                    ε, a, aa, aaa, …
/a+/    → 1 or more occurrences of a                    a, aa, aaa, …
/[ab]*/ → 0 or more as or bs                            ε, a, b, aaaa, abba, baab
/baa+!/ → ba followed by 1 or more as, ending with !    baa!, baaaaa!, …

Regular Expression and Finite Automata
● Regular expressions can be implemented as finite-state automata.
● The set of strings accepted by a regular expression, or by the corresponding finite automaton, is termed a regular language.

Example (equivalent views of one regular language):
• L: {a, b, ab, aa, ba, aba, …}
• Regular Expression (RE): (a|b)+
• Regular grammar (G): S → a | b | aS | bS
• Finite Automaton (M): q1 →a,b q2, with a self-loop on a, b at q2

Finite Automata
● Given an input x, a finite automaton M outputs whether or not x is accepted.
○ M(x) → {Accepted, Not accepted}

● Types of finite automata:
○ Deterministic Finite Automata (DFA)
○ Non-deterministic Finite Automata (NFA)

Deterministic Finite Automata (DFA)
● A finite automaton can be represented as a quintuple M = <Q, Σ, S, F, δ>, where
○ Q = Finite set of states
○ Σ = Finite set of input symbols
○ S = Start state(s), S ⊆ Q
○ F = Finite set of final/accepting states, F ⊆ Q
○ δ = Transition function: δ(qi, a) = qj, where qi, qj ∈ Q and a ∈ Σ

Regular expression: /[Bb]anks?/
Σ = {a-z, A-Z}
Q = {q1, q2, q3, q4, q5, q6}
S = {q1}
F = {q5, q6}
δ: q1 →B/b q2 →a q3 →n q4 →k q5 →s q6
Transition diagram and table

Transition table for /[Bb]anks?/ (rows: states; columns: input symbols Σ):

q1: b → q2, B → q2
q2: a → q3
q3: n → q4
q4: k → q5
q5: s → q6
q6: (no out-transitions)

*All blank entries point to the dead (D) state


DFA: Example 2
Regular expression: /baa+!/
Valid inputs: baa!, baaa!, baaaaaa!

Transitions: q1 →b q2 →a q3 →a q4 →! q5, with a self-loop on a at q4.

Acceptance procedure:
1. While all the input symbols are not consumed:
   a. At the current state, if the current input symbol matches any of its out-link labels:
      i. Consume the input, move along the link to another state, and go to step 1.
   b. Else:
      i. Announce failure.
2. If the current state is one of the final states:
   a. Announce success.
3. Else:
   a. Announce failure.
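The acceptance procedure can be sketched as a table-driven simulation. `DELTA` and `FINAL` here encode the /baa+!/ automaton above (illustrative names; missing entries play the role of the dead state):

```python
# Transition table: (state, symbol) -> next state.
DELTA = {
    ("q1", "b"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q4",
    ("q4", "a"): "q4",   # self-loop: one or more additional a's
    ("q4", "!"): "q5",
}
FINAL = {"q5"}

def accepts(s: str) -> bool:
    state = "q1"
    for symbol in s:
        if (state, symbol) not in DELTA:   # no matching out-link: failure
            return False
        state = DELTA[(state, symbol)]     # consume the symbol, move on
    return state in FINAL                  # success only in a final state

print(accepts("baa!"), accepts("baaaaaa!"), accepts("ba!"))
```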
Text Processing

Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to
23,96,637. However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country
from the highly-contagious disease, government data this morning showed. With the death of 942 patients in the last 24
hours, the county's fatality count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the
pandemic after the United States and Brazil.

Sentence Segmentation
● Usual sentence end-marker: ‘.’, ‘?’, ‘!’
● Exclamation and Question marks are usually unambiguous.
○ Interjections: Oh!

● Period is quite ambiguous


○ Sentence end marker
○ Acronyms or Abbreviations: Ph.D. , [Link] , Mr. , Dr.
○ Numbers: 23.11
○ Ellipsis: …
○ URLs and email IDs: [Link], abc@[Link]

Sentence Segmentation
● Disambiguation
○ Rule-based
○ RegEx based
○ ML based

Decision tree for a candidate end-marker:
1. Blank line(s) after it? Yes → EOS; No → step 2.
2. Is the punctuation ? or ! ? Yes → EOS; No → step 3.
3. Is the punctuation . ? No → Not EOS; Yes → step 4.
4. Is it an abbreviation? Yes → Not EOS; No → EOS.

Some other useful features:
● Case of word with “.”: Upper, Lower, Cap, Number
● Case of word after “.”: Upper, Lower, Cap, Number
● Numeric features
○ Length of word with “.”
○ Probability(word with “.” occurs at end-of-sentence)
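The decision tree above can be sketched as a rule-based function. `is_eos` and the toy `ABBREVS` list are illustrative, and cases such as numbers, ellipses, and URLs are deliberately not handled:

```python
# Toy abbreviation list; a real system would use a much larger one.
ABBREVS = {"Dr.", "Mr.", "Ph.D.", "etc."}

def is_eos(token: str, next_is_blank: bool = False) -> bool:
    if next_is_blank:                 # blank line(s) after it?
        return True
    if token.endswith(("?", "!")):    # ? and ! are usually unambiguous
        return True
    if token.endswith("."):
        return token not in ABBREVS   # abbreviation -> not end of sentence
    return False

print(is_eos("said."), is_eos("Dr."), is_eos("ready?"))
```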
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
● However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed.
● With the death of 942 patients in the last 24 hours, the county's fatality count rose 47,033, the Union Health Ministry said.
● India is the third worst-hit country by the pandemic after the United States and Brazil.
Tokenization
● Usual delimiters: whitespace, dot, comma, question mark, exclamation, etc.
● Some of the special cases:
○ Abbreviations: Ph.D., AT&T, Mr., etc.
○ Time and date: [Link], 01.08.2020, or 01/08/2020
○ Apostrophes and clitics: didn’t, it’s (a clitic cannot stand on its own)
○ Hyphens: Covid-19, Delhi-based

● Standards for tokenization, e.g., the LDC Penn Treebank tokenization standard:
○ didn’t → did + n’t
○ it’s → it + ‘s
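Clitic splitting in the Penn Treebank style can be sketched with regexes. `split_clitics` is a hypothetical helper, not the official LDC tokenizer:

```python
import re

def split_clitics(token: str) -> list:
    # Negation clitic: didn't -> did + n't
    m = re.fullmatch(r"(\w+)(n't)", token)
    if m:
        return [m.group(1), m.group(2)]
    # Other common clitics: it's -> it + 's, we're -> we + 're, etc.
    m = re.fullmatch(r"(\w+)('s|'re|'ve|'m|'ll|'d)", token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

print(split_clitics("didn't"), split_clitics("it's"), split_clitics("cat"))
```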

Tokenization: Spacy
● Rule-based
○ Split the sentence into tokens using whitespace chars (space, tab, etc.)
○ From left to right:
■ Does the token require special attention?
● Yes: check whether some prefix, suffix, or infix can be split off.
● No: continue.
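The loop described above can be sketched in pure Python. This is a toy stand-in for spaCy's tokenizer: the prefix/suffix lists are illustrative, and special cases such as clitics and infixes are not handled:

```python
PREFIXES = ("(", '"', "'")
SUFFIXES = (")", '"', "'", ".", ",", "!", "?")

def tokenize(sentence: str) -> list:
    tokens = []
    for chunk in sentence.split():            # whitespace split, left to right
        prefix_toks, suffix_toks = [], []
        while chunk.startswith(PREFIXES):     # peel off splittable prefixes
            prefix_toks.append(chunk[0])
            chunk = chunk[1:]
        while chunk.endswith(SUFFIXES):       # peel off splittable suffixes
            suffix_toks.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens += prefix_toks + ([chunk] if chunk else []) + suffix_toks
    return tokens

print(tokenize('She said, "Let\'s go!"'))
```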
Tokenization: Special cases
● Tokenization in other languages is much more complex:
○ Chinese: No space between words
■ 姚明进 总决赛 (Yao Ming reaches the finals)
■ 姚明 (Yao Ming) 进 (reaches) 总决赛 (the finals) [Chinese Treebank segmentation]
○ Each Chinese character → a single unit of meaning (morpheme), pronounceable as a single syllable.



Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● India has recorded the biggest single - day spike of 66,999 COVID-19 cases , taking the total number of infections to 23,96,637
.
● However , the recovery rate has gone up to 70.76 per cent , with 16,95,982 people recovering in the country from the highly -
contagious disease , government data this morning showed .
● With the death of 942 patients in the last 24 hours , the county 's fatality count rose 47,033 , the Union Health Ministry said .
● India is the third worst - hit country by the pandemic after the United States and Brazil .
Unknown or Rare words
● Recall, the purpose of tokenization is to split the sentence into meaningful entities for downstream tasks.
● What if there is an unknown word at inference time? The system will not have a clue about it.

[Figure: the train corpus contains low, lowest, new, newer, wider; the test corpus contains the unseen word "lower". The model knows tokens of the train data only.]

Can we do better with the tokenization?
● Answer: Yes. Subword tokens might help.
Byte-pair Encoding (BPE) [Link]
● Subword-based tokenization
● Provides a solution for handling unknown or rare words in downstream tasks.
● Learns from the data what the tokens should be.
● Iteratively merges frequent pairs of characters to form subword/word tokens.

Byte-pair Encoding: Example

Setup:
● Record the frequency of each word in the corpus (simple tokenization: low, lowest, new, newer, wider).
● Represent each word as a sequence of characters; append a special end-of-word symbol ($). Hyphen (-) denotes a character split.

l-o-w-$ 5
l-o-w-e-s-t-$ 2
n-e-w-$ 2
n-e-w-e-r-$ 6
w-i-d-e-r-$ 3

Initial vocabulary: $, d, e, i, l, n, o, r, s, t, w

Learning: repeatedly find the most frequent pair of adjacent symbols and merge them.

Merge r$:  n-e-w-e-r$ 6, w-i-d-e-r$ 3    Vocabulary: $, d, e, i, l, n, o, r, s, t, w, r$
Merge er$: n-e-w-er$ 6, w-i-d-er$ 3      Vocabulary: …, r$, er$
Merge ew:  n-ew-$ 2, n-ew-er$ 6          Vocabulary: …, r$, er$, ew
Merge new: new-$ 2, new-er$ 6            Vocabulary: …, r$, er$, ew, new
Further merges: lo, low, newer$, low$    Vocabulary: …, r$, er$, ew, new, lo, low, newer$, low$

Continue for k iterations.

Merge rules (in order): r$, er$, ew, new, lo, low, newer$, low$

Inference
● Apply the merges we learned, in the same order, on the test word.
● E.g.,
○ newer: n-e-w-e-r-$ → n-e-w-e-r$ → n-e-w-er$ → n-ew-er$ → new-er$ → newer$
○ lower: l-o-w-e-r-$ → l-o-w-e-r$ → l-o-w-er$ → lo-w-er$ → low-er$
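The learning procedure can be sketched compactly. `learn_bpe` is an illustrative implementation, and because several pairs tie in frequency on this corpus, its merge order can differ from the slide's (ties here are broken by first occurrence):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # Each word becomes a tuple of symbols with the end-of-word marker '$'.
    vocab = {tuple(word) + ("$",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lowest": 2, "new": 2, "newer": 6, "wider": 3}
print(learn_bpe(corpus, 4))
```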
Word-Piece Tokenizer [Schuster et al., 2012]
● Uses word boundary token at the beginning of the word instead of at the end.
● Rather than merging two most-frequent pairs, it merges if the pair maximizes the likelihood of the
training data once added to the vocabulary.

Corpus: (“hug”, 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Character boundary: (“h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)

Initial Vocabulary: (“b”, 4), (“h”, 5), (“p”, 17), (“##g”, 20), (“##n”, 16), (“##s”, 5), (“##u”, 36)

Pair Frequencies: (“##u", "##g") = 20, ("##g", "##s") = 5, ("##u", "##n") = 16, ("##u", "##s") = 0, ….

Compute likelihood: score = (freq_of_pair) / (freq_of_first_element × freq_of_second_element)


Word-Piece Tokenizer [Schuster et al., 2012]
Corpus: (“hug”, 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
Token boundary: (“h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)
Initial Vocabulary: (“b”, 4), (“h”, 5), (“p”, 17), (“##g”, 20), (“##n”, 16), (“##s”, 5), (“##u”, 36)

Pair Frequencies: ("##u", "##g") = 20, ("##g", "##s") = 5, ("##u", "##n") = 16, ("##u", "##s") = 0, ….

Compute likelihood: score = freq_of_pair / (freq_of_first_element × freq_of_second_element)
("##u", "##g") = 20 / (36 × 20) = 1/36
("##u", "##n") = 16 / (36 × 16) = 1/36
("##u", "##s") = 0 / (36 × 5) = 0
("##g", "##s") = 5 / (20 × 5) = 1/20 ← Max

1st Merge ("##g", "##s") → "##gs": Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs" ]

Token boundary: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)

2nd Merge ("h", "##u") → “hu”: Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu" ]

3rd Merge ("hu", "##g") → “hug”: Vocabulary Update: [ "b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug" ]

Repeat the process until you reach the desired vocabulary size or an adequate subword representation.
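The scoring step can be sketched with the slide's numbers; `token_freqs` and `pair_freqs` copy the example above:

```python
# WordPiece pair score: freq(pair) / (freq(first) * freq(second)).
token_freqs = {"b": 4, "h": 5, "p": 17, "##g": 20, "##n": 16, "##s": 5, "##u": 36}
pair_freqs = {("##u", "##g"): 20, ("##g", "##s"): 5, ("##u", "##n"): 16, ("##u", "##s"): 0}

def score(pair):
    first, second = pair
    return pair_freqs[pair] / (token_freqs[first] * token_freqs[second])

# The best-scoring pair is merged first: ("##g", "##s") with score 1/20,
# even though ("##u", "##g") has a higher raw frequency.
best = max(pair_freqs, key=score)
print(best, score(best))
```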

Other Tokenizers
● Unigram [Kudo, 2018] [Link]
○ Initializes the vocabulary with a large number of symbols: all pre-tokenized words and common substrings.
○ Iteratively removes symbols from the vocabulary whose removal increases the loss the least.

● SentencePiece [Kudo and Richardson, 2018] [Link]
○ Instead of pre-tokenizing into words, it operates on the raw character sequence of a sentence (space is a valid symbol).
○ Employs the BPE or Unigram tokenizer.
Normalization
● Putting words/tokens in a standard format
● Types:
○ Case-folding
■ CAT, Cat, cat → cat (or CAT)
○ Lemmatization: Finding the root form of the word
■ is, are, was → be
■ cat, cats, goose, geese, fly, flies → cat, cat, goose, goose, fly, fly
○ Stemming: Chopping off the affixes
■ is, are, was → is, are, wa
■ cat, cats, goose, geese, fly, flies → cat, cat, goos, gees, fli, fli
○ Others
■ Equivalence classes
● Normalization vs Normalisation, color vs colour, center vs centre
● U.S.A. vs USA vs US

Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Case-folding

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● india has recorded the biggest single - day spike of 66,999 covid-19 cases , taking the total number of infections to 23,96,637 .
● however , the recovery rate has gone up to 70.76 per cent , with 16,95,982 people recovering in the country from the highly -
contagious disease , government data this morning showed .
● with the death of 942 patients in the last 24 hours , the county 's fatality count rose 47,033 , the union health ministry said .
● india is the third worst - hit country by the pandemic after the united states and brazil .
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Lemmatization

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

● india have record the big single - day spike of 66,999 covid-19 case , take the total number of infection to 23,96,637 .
● however , the recovery rate have go up to 70.76 per cent , with 16,95,982 people recover in the country from the highly -
contagious disease , government data this morning show .
● with the death of 942 patient in the last 24 hour , the county 's fatality count rise 47,033 , the union health ministry say .
● india be the third worst - hit country by the pandemic after the united state and brazil .
Text Processing
● Three most frequent preprocessing steps
○ Sentence segmentation
○ Tokenization
○ Normalization: Stemming

India has recorded the biggest single-day spike of 66,999 COVID-19 cases, taking the total number of infections to 23,96,637.
However, the recovery rate has gone up to 70.76 per cent, with 16,95,982 people recovering in the country from the highly-
contagious disease, government data this morning showed. With the death of 942 patients in the last 24 hours, the county's fatality
count rose 47,033, the Union Health Ministry said. India is the third worst-hit country by the pandemic after the United States and
Brazil.

• india ha record the biggest single - day spike of 66,999 covid-19 case , take the total number of infect to 23,96,637 .
• howev , the recoveri rate ha gone up to 70.76 per cent , with 16,95,982 peopl recov in the countri from the highly - contagi
diseas , govern data thi morn show .
• with the death of 942 patient in the last 24 hour , the counti 's fatal count rose 47,033 , the union health ministri said .
• india is the third worst - hit countri by the pandem after the unit state and brazil .
[Link]

Porter’s Stemmer [Porter, 1980] — Applies a series of rules

Rule 1.a:
1. SSES → SS    caresses → caress
2. IES → I      ponies → poni, ties → ti
3. SS → SS      caress → caress
4. S → ε        cats → cat

Rule 1.b:
1. (m>0) EED → EE    feed → feed, agreed → agree
2. (*v*) ED → ε      plastered → plaster, bled → bled
3. (*v*) ING → ε     motoring → motor, sing → sing

If Rule 1.b.2 or 1.b.3 is successful, the following is done:
1. AT → ATE    conflat(ed) → conflate
2. BL → BLE    troubl(ed) → trouble
3. IZ → IZE    siz(ed) → size
4. (*d and not (*L or *S or *Z)) → single letter
   hopp(ing) → hop, tann(ed) → tan, fall(ing) → fall, hiss(ing) → hiss, fizz(ed) → fizz
5. (m=1 and *o) → E    fail(ing) → fail, fil(ing) → file

Rule 1.c:
1. (*v*) Y → I    happy → happi, Sky → sky

Rule 2:
1. ATIONAL → ATE    relational → relate
2. TIONAL → TION    conditional → condition, rational → rational
3. ATOR → ATE       operator → operate

Rule 3 …. Rule 4 …. Rule 5 ….
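Rule 1.a alone can be sketched as ordered suffix checks. `rule_1a` is an illustrative fragment, not the full Porter stemmer (no measure (m) or vowel (*v*) conditions):

```python
def rule_1a(word: str) -> str:
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS   caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # IES  -> I    ponies -> poni
    if word.endswith("ss"):
        return word               # SS   -> SS   caress -> caress
    if word.endswith("s"):
        return word[:-1]          # S    -> epsilon   cats -> cat
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", rule_1a(w))
```

Note that the rules must be tried in this order: checking the bare `s` suffix first would wrongly strip `caress` to `cares`.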
Morphology
A quick review

Classes of Morphology

Inflectional: No change in the word class
● Serves grammatical/semantic purposes different than the original form
● Easy to predict the meaning
○ E.g., adding “s” or “es” to a noun → marks plurality
● Highly systematic, though some irregularities/exceptions exist
○ Mouse + plural → Mice

Derivational: Changes the word class
● Combining a stem with other morphemes changes the class
● Formation of a noun from a verb/adjective (nominalization)
○ Summarize + ation → Summarization
○ Trust + ee → Trustee
● Formation of an adjective from a noun/verb
○ Computation + al → Computational
○ Trust + able → Trustable
Morphology Parsing
● Two questions
○ What is the plural of cat?
○ What does cats mean?

● Mapping the surface-level form to the lexical-level form


○ cats → Surface-level
○ cat + N + PL → Lexical-level

● Morphological recognition / analysis: Surface to lexical


● Morphological generation / synthesis: Lexical to surface

Morphology Parsing

● Khāyegā → Khā + ye + g + ā    (eat, 3rd person, future, male)
● Khāyegī → Khā + ye + g + ī    (eat, 3rd person, future, female)
● Khāongā → Khā + on + g + ā    (eat, 1st person, future, male)

Ambiguity
● Synthesis (generation) is easier than Analysis (recognition/parsing)
○ Why? Ambiguity: analysis must utilize external evidence, e.g., context.

Morphology Parsing
● Ingredients for the morphological parser
○ Lexicon
■ List of stems, affixes, and other information (POS tag of stem, etc.)

○ Rules
■ Rules for morpheme ordering, e.g., plural -s should follow a noun/verb stem
■ Rules that define changes in characters, e.g., city + s → cities

Are lexicon and rules mandatory?
● Lexicon-only morphology
○ Lists all surface level and lexical level pairs
■ Surface ←→ Lexical
○ No rules
○ Analysis and Synthesis are easy.
○ Difficult to record all possibilities for any language

● Stemming (Lexicon-free)
○ Set of rules
○ Interested in stems.
■ Don’t care about the structure of the word
■ Don’t care about the right stem, as long as we get a consistent stem.

Thanks

