Module2 Lecture2 FST

The document discusses Regular Expressions (RE) and their applications in text processing, including information retrieval and word processing. It explains basic RE patterns, substitution techniques, and the relationship between RE and Finite State Automata (FSA). Additionally, it covers morphological analysis, parsing, and the use of finite-state transducers in linguistic applications.


Regular Expression

Regular Expressions

 Standard Notation – characterising text sequences


 Useful in Information Retrieval applications
 Word Processing
 Computing Frequencies from Corpora
 Implemented using finite-state-automaton

 One of the unsung successes in CS


 RE – a language for specifying text search strings
 RE – first developed by Kleene in 1956
 RE – a formula in a special language for specifying simple classes of strings
 String – a sequence of symbols
 Symbols – letters, numbers, spaces, tabs, punctuation
 RE – an algebraic notation to specify strings / to define a language
 RE input – a pattern and a corpus
 You can guess how it would work!!
Basic RE Patterns
 Simple RE – a sequence of characters
 To search for unicorn – type /unicorn/
 Matches any string with unicorn as substring
 REs are Case-Sensitive
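A quick sketch of this in Python's re module (the test strings here are made up for illustration):

```python
import re

# /unicorn/ matches any string containing "unicorn" as a substring; REs are case-sensitive.
for text in ["I saw a unicorn today", "Unicorns are mythical", "no match here"]:
    print(text, "->", bool(re.search(r"unicorn", text)))
# Only the first string matches: the capital U in "Unicorns" blocks the second.
```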
Specifying Disjunction
Specifying Ranges and Negation
Options
Kleene * – zero or more occurrences of the previous char
 Pronounced “cleeny star”
 /a*/ means ‘any string of zero or more a’s’
 – will match a (one a) or aaaa (many a’s) or creepy (zero a’s)
 So how do we match one or more a’s?
 /[ab]*/ – zero or more a’s or b’s => not ‘zero or more right square brackets’!!
 – will match strings like aaaa or ababab or bbbb
Kleene + – one or more occurrences of the previous char
 RE for an individual digit
 /[0-9]/
 RE for a string of digits
 /[0-9][0-9]*/
 Why is it not just /[0-9]*/ ? (because * also matches zero digits, i.e. the empty string)
 Writing the class twice is annoying!!
 So Kleene + comes in handy: it means one or more of the previous char
 RE for a sequence of digits: /[0-9]+/
 Decode /baa+!/ (see the sketch below)
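The same operators in Python's re module (a minimal sketch; the sample strings are illustrative):

```python
import re

print(re.fullmatch(r"a*", "") is not None)            # True  - Kleene * allows zero a's
print(re.fullmatch(r"[ab]*", "ababab") is not None)   # True  - zero or more a's or b's
print(re.fullmatch(r"[0-9]+", "1234") is not None)    # True  - Kleene + needs at least one digit
print(re.fullmatch(r"[0-9]+", "") is not None)        # False - the empty string has no digit

# Decoding /baa+!/ : a b, one a, then one or more a's, then !
for s in ["baa!", "baaaa!", "ba!"]:
    print(s, "->", bool(re.fullmatch(r"baa+!", s)))   # True, True, False
```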
Wild Card
/./
- Matches any single character except a carriage return
- How can it be used with Kleene * to mean ‘any string of characters’?
- Please explore other operators for counting, prefixing, grouping, matching at the beginning, etc.
RE Substitution, Memory, ELIZA
 Substitution – an important use of RE
 Suppose we want to put angle brackets around all integers in a text
 Example 1
 the 35 boxes → the <35> boxes
 To do this, ( ) is put around the first pattern, and the number operator \1 in the second pattern refers back to it
 s/([0-9]+)/<\1>/
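In Python, the same substitution can be sketched with re.sub (the sentence is illustrative):

```python
import re

text = "the 35 boxes were moved into 7 rooms"
# The parentheses capture the integer; \1 in the replacement refers back to it.
print(re.sub(r"([0-9]+)", r"<\1>", text))
# -> the <35> boxes were moved into <7> rooms
```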
RE Substitution, Memory, ELIZA
 Example 2
 Suppose we are looking for the pattern
 ‘the Xer they were, the Xer they will be’
 We want both X’s to be the same string
 So, we surround the first X with parentheses and replace the second X with the number operator \1
 SUBSTITUTION:
/the (.*)er they were, the \1er they will be/
This will match the bigger they were, the bigger they will be and
not the bigger they were, the faster they will be
RE Substitution, Memory, ELIZA
 Two different sets of parentheses can be matched:
/the (.*)er they (.*), the \1er they \2/
 Will match
 the bigger they were, the bigger they were
 Not
 the bigger they were, the bigger they will be
 These numbered memories are called registers (register 1, register 2, etc.)
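A small sketch of these registers in Python (lower-case test sentences, for illustration):

```python
import re

pattern = r"the (.*)er they (.*), the \1er they \2"

# \1 and \2 force both halves to repeat exactly, so only the first sentence matches.
print(bool(re.search(pattern, "the bigger they were, the bigger they were")))     # True
print(bool(re.search(pattern, "the bigger they were, the bigger they will be")))  # False
```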
Substitutions Using Memory
In a simple NL application – ELIZA (simulates a psychologist and could have conversations like the following):

User: Humans are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
User: Well, my friend made me come here.
ELIZA: YOUR FRIEND MADE YOU COME HERE
User: She says I’m depressed most of the time
ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED.
RE substitutions in the Working of ELIZA

 Applies a cascade of RE substitutions
 my to YOUR
 I’m to YOU ARE
 s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
 s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
 s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
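A toy Python sketch of such a cascade (only a handful of the rules above, with a made-up upper-casing step; eliza_respond is an illustrative name, not ELIZA's actual code):

```python
import re

def eliza_respond(line):
    line = " " + line.upper() + " "          # crude normalisation so the .* patterns apply
    line = re.sub(r"\bMY\b", "YOUR", line)
    line = re.sub(r"\bI'M\b", "YOU ARE", line)
    line = re.sub(r".* YOU ARE (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1", line)
    line = re.sub(r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE", line)
    return line.strip()

print(eliza_respond("I'm depressed most of the time"))
# -> I AM SORRY TO HEAR YOU ARE DEPRESSED
print(eliza_respond("They're always bugging us about something"))
# -> CAN YOU THINK OF A SPECIFIC EXAMPLE
```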
Finite State Automata
RE & FSA

 RE – a metalanguage for text searching
 RE – a way of describing FSA
 RE together with FSA can be used to characterise a Formal Language
FSA for Sheeptalk – any string from the following infinite set:
baa!  baaa!  baaaa!  baaaaa!  …
FSA - Sheeptalk

 RE - /baa+!/
 The figure shown is an automaton for modelling this RE
 Automaton recognises the set of strings characterising sheeptalk
 Automaton as a directed Graph
 A finite set of vertices (nodes) - Circles
 Directed links between vertices called arcs – Arrows
 5 States
 State 0 - start state (incoming arrow)
 State 4 – final state/accepting state (double circle)
 4 Transitions (arcs)
Working – input on a tape
 The machine starts in state q0
 Set Current_state to Start_state
 Repeat
 Check the next letter in the input
 If it matches the symbol on an arc leaving Current_state
 Move to the next state and update Current_state
 If Current_state is an accepting state and we have run out of input symbols
 then we have successfully recognized an instance of sheeptalk – exit
 Else advance one symbol in the input
 Exit (reject) if the machine never gets to the final state (because there was no matching arc, it ran out of input, or it got stuck in some non-final state)
 (A Python sketch of this loop follows below)
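This loop can be sketched directly in Python, with the transition table from the next slide written as a dictionary (a missing entry stands for the illegal transition ∅; the names TABLE and recognize are illustrative):

```python
# Sheeptalk FSA for /baa+!/ : states 0-4, accepting state 4.
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPTING = {4}

def recognize(tape, state=0):
    for symbol in tape:
        state = TABLE[state].get(symbol)   # follow the arc labelled with this symbol
        if state is None:                  # no such arc: reject
            return False
    return state in ACCEPTING              # accept only if we end in a final state

for s in ["baa!", "baaaaa!", "ba!", "baa"]:
    print(s, recognize(s))                 # True, True, False, False
```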
State Transition Table – FSA

 Start State
 Accepting State(s)
 What transitions leave each state with what symbols
 --------------------------------------------------------------------
 State 4 – marked with a colon “:” (accepting state)
 ∅ indicates an illegal or missing transition
 We read the first row as:
 If we are in State 0 and we see the input b, we go to State 1
 If we are in State 0 and we see the input a or !, we fail
Formal Definition – FSA
 Defined by 5 parameters:
 Q = {q0, q1, q2, q3, q4} – a finite set of states
 Σ = {a, b, !} – a finite input alphabet
 q0 – the start state
 F = {q4} – the set of final (accepting) states
 δ(q, i) – the transition function, defined by the transition table
Adding a fail state…
Another Example: 1-99 in English
Non-Deterministic FSA (NFSA)

 Consider the automaton given:
 (the difference between this one and the previous one is that the self-loop is on state 2 and not on state 3)
 Non-deterministic: when we get to state 2, we don’t know whether to stay there or go to state 3
DFSA Vs NFSA

 DFSA – its behaviour during recognition is fully determined by the state it is in and the symbol it is looking at
 This is not the case with an NFSA
NFSA #2

 Non-determinism can also be caused by arcs without symbols on them (ε-transitions)
 The ε arc is interpreted as follows:
 if we are in state 3, we are allowed to move to state 2 without looking at the input or advancing the input pointer
NFSA #1
Formal Languages
 A formal language – a set of strings (defined, e.g., by an FSA)
 ∅ – the empty set
Morphology

 Study of the composition of words using morphemes
 A morpheme is the smallest unit of language that has meaning
 Types of Morphemes
Stems
Affixes
Prefix
Suffix
Infix
Circumfix
Stems and Affixes
 Stems – Free Morphemes (can exist without affixes)
 Affixes – Bound Morphemes (exist with free morphemes)
 Eg: Unbelievable
 believe - free
 Un & able – affixes/bound morphemes
 Types of Languages
 Isolating (only free forms – e.g. Mandarin)
 Agglutinative (small morphemes combine into one word, each carrying its own information – e.g. Turkish)
 Inflecting (words can be broken down into smaller units carrying different meanings)
Inflection in Morphology

 Inflection – transforming a word into a form representing:
 Person
 Number
 Tense
 Gender
 Case
 Aspect
 Mood
 But the syntactic category of the token remains the same
Derivation in Morphology

 The syntactic category of the word is changed (e.g. believe (Verb) → believable (Adjective))
Morphological Analyser (MA)

 MA – the process of obtaining grammatical information from tokens, given their suffix information
 MA generates morphological information, such as gender, number, class, and so on, as output

#take a text, tokenise it
#use the following hints to find the category of a word
1. Morphological hints – e.g. suffix information like -ness or -ment implies the token is a noun (see the sketch below)
2. Syntactic hints – if we find that a word is of noun category, an adjective may occur before or after it
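A toy sketch of the morphological hint in Python (the suffix-to-category table is a tiny illustrative sample, not a real lexical resource):

```python
# Guess a coarse category for a token from its suffix.
SUFFIX_HINTS = {
    "ness": "NOUN",    # e.g. happiness
    "ment": "NOUN",    # e.g. development
    "ing":  "VERB",    # e.g. running (progressive)
    "ly":   "ADVERB",  # e.g. quickly
}

def guess_category(token):
    for suffix, category in SUFFIX_HINTS.items():
        if token.lower().endswith(suffix):
            return category
    return "UNKNOWN"

for w in ["happiness", "development", "running", "quickly", "fox"]:
    print(w, guess_category(w))
```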
Morphological Parsing Using Finite-State Transducers (FST)
Plurals

 Woodchuck –Woodchucks
 fox
 peccary
 aardvark
 goose
 fish
Kinds of Knowledge

 Morphological Rules
 fish – null plural
 goose – geese by changing vowel
 Spelling Rules
 -y to -i + -es
 Morphological Parsing
 fox -> foxes
 Two morphemes: fox and -es
 (Parsing means taking an input and producing some sort of structure for it)
 Kinds of parsing: Morphological; Syntactic; Semantic; Pragmatic
Types

 Concatenative Morphology
 Prefix and suffix
 Non-Concatenative Morphology
 Complex
 Templatic Morphology or Root-Pattern Morphology
 Hebrew
 Verb – built from two components:
 a root of 3 consonants (CCC) carrying the basic meaning
 a template – the ordering of consonants and vowels determines the verb form (see the sketch below)
 ‘lmd’ – learn
 CaCaC – ‘lamad’ – he studied
 CiCeC – ‘limed’ – he taught
 CuCaC – ‘lumad’ – he was taught
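A small sketch of this root-and-pattern combination in Python (apply_template is an illustrative helper; each C slot in the template takes the next root consonant):

```python
def apply_template(root, template):
    consonants = iter(root)
    # Copy vowels from the template; fill each 'C' slot with the next root consonant.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
# CaCaC -> lamad, CiCeC -> limed, CuCaC -> lumad
```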
Considerations:-
Let’s look at ‘unbelievably’
 affixes – the prefix ‘un-’ and the suffixes ‘-able’, ‘-ly’ (around the stem ‘believe’)
 English can have a max of 4-5 suffixes
 Turkish – 9 or 10 suffixes
Morphological Parsing

 Parsing the plural (-s) and progressive (-ing) forms
 First column – words; second column – morphological features
Features

 +N (noun)
 +SG (Singular)
 +PL (Plural)
Morphological Parsing needs:

 Lexicon – a list of stems and affixes, plus basic information about them (noun stem or verb stem)
 Morphotactics – a model of morpheme ordering (which classes of morphemes can follow which other classes)
 e.g. the plural morpheme follows a noun rather than preceding it
 Orthographic or Spelling Rules – spelling changes that occur in a word when two morphemes combine
 e.g. y → ie (city + -s => cities)
Modeling Morphotactics Using FSA
 Regular Noun (-s plural)
 Irregular-Singular-Noun (goose; mouse)
 Irregular Plural Noun (geese; mice)
Verbal Inflection
FSA – Derivational Morphology
 Noun-ize-ation
 e.g. Fossilization: q0–q1–q2
 Adjectives in -al or -able at q5
 e.g. equal, formal, realizable
 then suffix -ity or -ness at q6
 e.g. reality, equality; naturalness, casualness
FSA for Morphological Recognition

 Checking whether a string of letters is a legal English word or not
 Procedure:
 Take the morphotactic FSA and plug each ‘sub-lexicon’ into it
 i.e. expand each arc (e.g. reg-noun-stem) with all the morphemes that make up the set of reg-noun-stems
 => The resulting FSA operates at the level of individual letters
 (a rough sketch follows below)
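A rough word-level sketch of this idea in Python, collapsing the letter-level FSA into set lookups (the tiny sub-lexicons are sample stems from the earlier slides; spelling rules are deliberately ignored here):

```python
# Nominal morphotactics: reg-noun (+ optional plural -s), irreg-sg-noun, irreg-pl-noun.
REG_NOUN      = {"fox", "cat", "aardvark"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}

def is_english_noun(word):
    if word in REG_NOUN | IRREG_SG_NOUN | IRREG_PL_NOUN:
        return True
    # the reg-noun arc followed by the plural -s arc
    return word.endswith("s") and word[:-1] in REG_NOUN

for w in ["cats", "goose", "geese", "gooses", "foxs"]:
    print(w, is_english_noun(w))
# cats True, goose True, geese True, gooses False,
# foxs True -- the missing e shows why spelling rules are needed later
```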
What we have seen so far is Morphological Recognition; we will now look at Morphological Parsing
Recall the table with parsed morphological output:
For cats, the output should be cat +N +PL
Morphological Parsing Using Finite State Transducers
 Two-level Morphology
 A word is represented as a correspondence between
 LEXICAL LEVEL (a simple concatenation of the morphemes making up the word)
 SURFACE LEVEL (the actual spelling of the final word)
 Example:
 cat +N +PL => Lexical Level
 cats => Surface Level
Morphological Parsing

 Done by building mapping rules between
 letter sequences (the surface level) and
 feature sequences (the lexical level)
FSA used to provide mapping - FST

 A transducer maps between one set of symbols and another
 An FST does this via a finite automaton
 An FST is a two-tape automaton that recognizes (or generates) pairs of strings
 So, an FSA defines a formal language by defining a set of strings
 An FST defines a relation between sets of strings
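A minimal sketch of a transducer in Python: a single-state machine whose arcs carry input:output pairs (the lower-case-to-upper-case relation here is just a toy; a real morphological FST would carry lexical and surface symbols instead):

```python
# Arcs: (state, upper-tape symbol) -> (next state, lower-tape symbol)
ARCS = {
    (0, "a"): (0, "A"),
    (0, "b"): (0, "B"),
}
START, FINAL = 0, {0}

def transduce(upper):
    state, lower = START, []
    for symbol in upper:
        if (state, symbol) not in ARCS:
            return None                        # the pair of strings is not in the relation
        state, out = ARCS[(state, symbol)]
        lower.append(out)
    return "".join(lower) if state in FINAL else None

print(transduce("abba"))   # ABBA  -- the FST relates "abba" to "ABBA"
print(transduce("abc"))    # None  -- no arc for "c", so no output string exists
```

Because this toy FST is deterministic, "generating" the lower tape is a single pass; in general an FST defines a relation, so a given upper string may correspond to zero, one, or many lower strings.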
FST – Mealy Machine Extension to FSA
FSA and FST of Sheep Language

 An FSA accepts a language defined over single symbols, e.g. the sheep language
 An FST accepts a language defined over pairs of symbols, as in Two-level Morphology (an FST has two tapes)
 a:b
 expresses how the symbol a on one tape is mapped to the symbol b on the other tape
 a:ε => a on the upper tape corresponds to nothing on the lower tape
 a:a => a symbol that maps to itself (which is common) is called a DEFAULT PAIR
 Upper or Lexical Tape
 composed of characters from the LEFT side of the a:b pairs
 Lower or Surface Tape
 composed of characters from the RIGHT side of the a:b pairs
FST Morphological Parser = Morphotactic FSA +
Lexical Tape
Lexicon should also have two levels:
Cascading Two Automata

 geese => goose +N +PL
 Need to cascade the lexicon (in the previous slide) with the Singular/Plural automaton (in slide 38)
 Cascading two automata means running them in series, with the output of the first feeding the input of the second
Automaton for Lexicon of Stems
cats => cat +N +PL
 c maps to itself (a default pair)
 likewise for ‘a’ and ‘t’
 +N maps to ε (the empty string) (+N means Noun)
 +PL maps to ^s (^ indicates a morpheme boundary)
 # indicates a word boundary
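A small sketch of this mapping in Python (PAIRS and lexical_to_intermediate are illustrative names; the pairs are exactly the ones listed above, with # appended as the word boundary):

```python
# Lexical symbol -> lower-tape string for "cat +N +PL"
PAIRS = {
    "c": "c", "a": "a", "t": "t",   # default pairs: each symbol maps to itself
    "+N": "",                        # +N maps to the empty string (epsilon)
    "+PL": "^s",                     # +PL maps to ^s (^ = morpheme boundary)
}

def lexical_to_intermediate(symbols):
    return "".join(PAIRS[sym] for sym in symbols) + "#"   # # marks the word boundary

print(lexical_to_intermediate(["c", "a", "t", "+N", "+PL"]))   # -> cat^s#
```

A spelling-rule FST (such as the e-insertion rule in the next slides) then maps a form like fox^s# to the surface form foxes.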
Orthographic Rules & FST

 English requires spelling changes at morpheme boundaries, introduced by spelling rules (Orthographic Rules)
 FSTs can be used to implement such rules
E- Insertion Rule
E- Insertion Rule
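One standard way to state the e-insertion rule in two-level rewrite notation:

ε → e / {x, s, z} ^ __ s #

i.e. insert an e at a morpheme boundary (^) when the stem ends in x, s, or z and the plural s plus word boundary (#) follow, so fox^s# is realised as foxes.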
E – Insertion Rule Automaton
COMBINING FST LEXICON AND RULES
