Regular Expression
Regular Expressions
Standard Notation – characterising text sequences
Useful in Information Retrieval applications
Word Processing
Computing Frequencies from Corpora
Implemented using finite-state-automaton
RE -2
One of the unsung successes in CS
RE – language for specifying search text string
RE – first developed by Kleene in 1956
RE – formula in a special language for specifying simple classes
of strings
String – A sequence of symbols
Symbols – letters, numbers, spaces, tabs, punctuations
RE – algebraic notation to specify strings || define a language
RE Input – Pattern and a Corpus
You can guess how it would work!!
RE – 3
Basic RE Patterns
Simple RE – a sequence of characters
To search for unicorn – type /unicorn/
Matches any string with unicorn as substring
REs are Case-Sensitive
RE – 4
Specifying Disjunction
RE -5
Specifying Ranges and Negation
RE -6
Options
RE – 7
Kleene * - zero or more occurrences of prev char
Pronounced as “Cleany star”
/a*/ - means ‘any string of zero or more a’s
- will match a ( one a) or aaaa (many a’s) or creepy (zero a)
So to match one or more a’s ?
/[ab]*/ - zero or more a’s or b’s => Not zero or more right square
braces!!
- will match strings like aaaa or
ababab or bbbb
RE -8
Kleene + one or more occurrences of prev
char
RE for Individual Digit
/[0-9]/
RE for a String of Digits
/[0-9][0-9]*/
Why is it not just /[0-9]*/ ?
Writing twice is annoying!!
So, Kleene + comes in handy…means…one or more of previous
char
RE for a sequence of digits is /[0-9]+/
Decode /baa+!/
RE – 9
Wild Card
/./
- Matches any single character except carriage return
- How can it be used with kleene * to mean ‘any string of
characters’
- Please explore other operators for Counting, Prefixing,
Grouping, Beginning etc.,
RE -10
RE Substitution, Memory, ELIZA
Substitution – an important use of RE
Suppose, we want to put angle brackets around all
integers in a text
Example 1
the 35 boxes to the <35> boxes
To do this, ( ) is put around the first pattern and number
operator \1 in the second pattern to refer back
s/([0-9]+)/ <\1>/
RE -11
RE Substitution, Memory, ELIZA
Example 2
Suppose if we are looking for the pattern
the Xer they were, the Xer they will be’
We want both X to be the same string
So, we surround the first X with parenthesis and replace the
second X with number operator
SUBSTITUTION:
/ the (.*)er they were, the \1er they will be/
This will fetch the bigger they were, the bigger they will be and
not the bigger they were the faster they will be
RE-12
RE Substitution, Memory, ELIZA
Two different sets of parenthesis can be matched:
/the (.*)er they (.*), the \1er they \2/
Will match
The bigger they were, the bigger they were
Not
The bigger they were, the bigger they will
These numbered memories are registers (register 1, register 2
etc.,)
Substitutions Using Memory
In a Simple NL Application – ELIZA (Simulates a
Psychologist and could have conversations like:- )
User : Humans are all alike
ELIZA : IN WHAT WAY
User : They’re always bugging us about something
ELIZA : CAN YOU THINK OF A SPECIFIC EXAMPLE
User :Well, my friend made me come here.
ELIZA : YOUR FRIEND MADE YOU COME HERE
User : She says I’m depressed most of the time
ELIZA : I AM SORRY TO HEAR YOU ARE DEPRESSED.
RE substitutions in the Working of ELIZA
Applies a cascade of RE substitutions
my to YOUR
I’m to YOU ARE
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU
ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU
ARE \1/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Finite State Automata
RE & FSA
RE – metalanguage for text searching
RE – way of describing FSA
RE along with FSA can be used to characterise a Formal
Language
FSA for Sheep talk – any string from the
following infinite set
FSA - Sheeptalk
RE - /baa+!/
The figure shown is an automaton for modelling this RE
Automaton recognises the set of strings characterising sheeptalk
Automaton as a directed Graph
A finite set of vertices (nodes) - Circles
Directed links between vertices called arcs – Arrows
5 States
State 0 - start state (incoming arrow)
State 4 – final state/accepting state (double circle)
4 Transitions (arcs)
Working - Input in tape
The machine starts in State q0
Set Current_state as Start_state
Repeat
Check the next letter in the input
If it matches the symbol on an arc
Move to the next state
Update Current_state
Check if it is the accepting state and if we have run out of symbols in input
Then we have successfully recognized instance of sheeptalk, exit
Else Advance one symbol in the input
Exit if
Machine never gets to the final stage (because it there was no match or it ran out of input or just stuck in some non-final
state)
State Transition Table – FSA
Start State
Accepting State(s)
What transitions leave each state with what symbols
--------------------------------------------------------------------
State 4 – marked with : (accepting state)
Ф indicates a illegal or missing transition
We read the first row as:
If we are in State 0 and we see the input b, we go to State 1
If we are in State 0 and we see the input a or !, we fail
Formal Definition – FSA
Defined by 5 Parameters:
Q = {q0, q1, q2, q3, q4} = {a, b, !} F = {q4}
= defined by the transition tables
Adding a fail state…
Another Example: 1-99 in English
Non-Deterministic FSA (NFSA)
Consider the automata given:
(Difference between this one and previous one is that – self loop is on
state 2 and not in state 3)
Non-deterministic: when we get to state 2, we don’t know if we have to
stay there or go to state 3
DFSA Vs NFSA
DFSA – Its behaviour during recognition is fully determined
by the state it is in and the symbol it is looking at.
This is not the case with NFSA
NFSA #2
Non-determinism - caused by arcs without symbols on them (ε
– transitions)
The ε arc is interpreted as follows:
if we are in state 3, we are allowed to move to state 2 without
looking at the input or advancing the input pointer
NFSA #1
Formal Languages
empty set
Morphology
Study of composition of words using Morphemes
A morpheme is the smallest unit of language that has meaning
Types of Morphemes
Stems
Affixes
Prefix
Suffix
Infix
Circumfix
Stems and Affixes
Stems – Free Morphemes (can exist without affixes)
Affixes – Bound Morphemes (exist with free morphemes)
Eg: Unbelievable
believe - free
Un & able – affixes/bound morphemes
Types of Languages
Isolating (Only free forms – Mandarin)
Agglutinative (small words combine to form compound info – Turkish)
Inflecting (words are broken down to smaller units – different
meanings)
Inflection in Morphology
Inflection – transforming a word into a form – representing
Person
Number
Tense
Gender
Case
Aspect
Mood
But the syntactic category of the token remains the same
Derivation in Morphology
Syntactic category of the word is changed
Morphological Analyser (MA)
process of obtaining grammatical information from tokens, given their
suffix information
MA - generates morphological information, such as gender, number, class,
and so on, as an output.
#take a text, tokenise it
#use the following hints to find the category of a word
1. Morphological hints – eg. – suffix information like –ness –ment , imply
they exist with nouns
2. Syntactic hints – if we find that the word is noun category, an adjective
may occur before or after it
Morphological Parsing Using
Finite-State-Transducers
(FST)
Plurals
Woodchuck –Woodchucks
fox
peccary
aardvark
goose
fish
Kinds of Knowledge
Morphological Rules
fish – null plural
goose – geese by changing vowel
Spelling Rules
-y to -i + -es
Morphological Parsing
fox -> foxes
Two morphemes fox and es
(Parsing means => taking an input and producing some sort of structure for it)
Morphological; Syntactic; Semantic; Pragmatic
Types
Concatenative Morphology
Prefix and suffix
Non-Concatenative Morphology
Complex
Templatic Morphology or Root-Pattern Morphology
Hebrew
Verb – Two components
root + 3 Consonants (CCC)
Ordering of consonants and vowels determine Verb form
‘lmd’ - learn
CaCaCa – ‘lamad’ - he studied
CiCeC –’limed’ - he taught
CuCaC – ‘lumad’ – he was taught
Considerations:-
Lets look at ‘unbelievably’
suffixes – ‘un’, ‘able’ , ‘ly’
English can have a max of 4-5 suffixes
Turkish – 9 or 10 suffixes
Morphological Parsing
Parsing plural (-s) and Progressive (-ing)
First Column – Words; Second Column –Morphological Features
Features
+N (noun)
+SG (Singular)
+PL (Plural)
Morphological Parsing needs:
Lexicon - List of Stems and Affixes + Basic Information
about them (Noun Stem or Verb Stem)
Morphotactics – Model of Morpheme Ordering (which
classes of morphemes follow which other classes)
Plural morpheme follows a noun rather than preceding it
Orthographic or Spelling Rules – Spelling rules that occur
in a word when two morphemes combine
y –ie (City + s => Cities)
Modeling Morphotactic Using FSA
Regular Noun (-s plural)
Irregular-Singular-Noun (goose; mouse)
Irregular Plural Noun (geese; mice)
Verbal Inflection
FSA - Derivational
Morphology
Noun-ize-ation
Fossilization q0-q1-q2
Adjectives-al-able at q5
Equal, formal, realizable
suffix -ity or -ness at q6
reality; equality
naturalness, casualness
FSA for Morphological Recognition
Checking if a string of letters is a legal word in English or
not…
Procedure:
By taking Morphotactic FSA and plugging in each ‘sub-lexicon’
into the FSA
i.e by expanding each arc (eg reg-stem-noun) with all morphemes
that make up set of reg-stem-noun
=> The resulting FSA can be at the level of individual letter
What we have seen so far is Morphological
Recognition, we will now look at
Morphological Parsing
Recall, the table with Parsed Morphological
Output…
For Cats, the output should be CAT +N
+PL
Morphological Parsing Using Finite
State Transducers
Two-level Morphology
A word is represented as a correspondence between
LEXICAL LEVEL (Simple concatenation of morphemes making up
a word)
SURFACE LEVEL (Actual spelling of the final word)
Example:
Cat +N +PL => Lexical Level
Cats => Surface Level
Morphological Parsing
Done by building mapping rules between
Letter sequences (surface level)
Feature sequences (lexical level)
FSA used to provide mapping - FST
A transducer maps between
One set of symbols and another
FST does this via FSA
A FST is a two-tape automaton that recognizes (or
generates) pairs of strings
So, FSA defines formal languages, by defining a set of
strings
FST defines relations between sets of strings
FST – Mealy Machine Extension to FSA
FSA and FST of Sheep Language
FSA accepts a language defined with symbols, for eg.,
Sheep language
FST accepts a language defined over pairs of symbols as in
Two-level Morphology (FST with two
tapes)
a:b
Expresses how the symbol a from one tape is mapped to the symbol b
on another tape
a : ε => a on the upper tape will correspond to nothing on the lower
tape
a : a => if a symbol maps to itself, which will be common, it is
DEFAULT PAIR
Upper or Lexical Tape
Composed from characters from LEFT side of a:b pairs
Lower or Surface Tape
Composed from characters from RIGHT side of a:b pairs
FST Morphological Parser = Morphotactic FSA +
Lexical Tape
Lexicon should also have two levels:
Cascading Two Automata
geese => goose +N +PL
Need to Cascade the lexicon (in previous slide) with
Singular/Plural Automaton (in slide 38)
Cascading two automata means running them in series
with the output of first feeding the input of second.
Automaton for Lexicon of Stems
cats => cats +N +PL
C maps to itself (default pair)
Likewise for ‘a’ and ‘t’
+N maps with ε transition (Empty String) (+N means Noun)
+PL maps to ^S ( ^ indicates Morpheme
boundary)
# indicates Word Boundary
Orthographic Rules & FST
English requires Spelling changes at morpheme
boundaries by introducing spelling rules (Orthographic
Rules)
FST for writing such rules
E- Insertion Rule
E- Insertion Rule
E – Insertion Rule Automaton
COMBINING FST LEXICON AND RULES