0% found this document useful (0 votes)

9 views69 pages

Module2 Lecture2 FST

The document discusses Regular Expressions (RE) and their applications in text processing, including information retrieval and word processing. It explains basic RE patterns, substitution techniques, and the relationship between RE and Finite State Automata (FSA). Additionally, it covers morphological analysis, parsing, and the use of finite-state transducers in linguistic applications.

Uploaded by

Sachin Kumar N

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views69 pages

Module2 Lecture2 FST

Uploaded by

Sachin Kumar N

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Regular Expression

Regular Expressions

 Standard Notation – characterising text sequences

 Useful in Information Retrieval applications
 Word Processing
 Computing Frequencies from Corpora
 Implemented using finite-state-automaton
RE -2

 One of the unsung successes in CS

 RE – language for specifying search text string
 RE – first developed by Kleene in 1956
 RE – formula in a special language for specifying simple classes
of strings
 String – A sequence of symbols
 Symbols – letters, numbers, spaces, tabs, punctuations
 RE – algebraic notation to specify strings || define a language
 RE Input – Pattern and a Corpus
 You can guess how it would work!!
RE – 3
Basic RE Patterns
 Simple RE – a sequence of characters
 To search for unicorn – type /unicorn/
 Matches any string with unicorn as substring
 REs are Case-Sensitive
RE – 4
Specifying Disjunction
RE -5
Specifying Ranges and Negation
RE -6
Options
RE – 7
Kleene * - zero or more occurrences of prev char
 Pronounced as “Cleany star”
 /a*/ - means ‘any string of zero or more a’s
 - will match a ( one a) or aaaa (many a’s) or creepy (zero a)
 So to match one or more a’s ?
 /[ab]*/ - zero or more a’s or b’s => Not zero or more right square
braces!!
 - will match strings like aaaa or
ababab or bbbb
RE -8
Kleene + one or more occurrences of prev
char
 RE for Individual Digit
 /[0-9]/
 RE for a String of Digits
 /[0-9][0-9]*/
 Why is it not just /[0-9]*/ ?
 Writing twice is annoying!!
 So, Kleene + comes in handy…means…one or more of previous
char
 RE for a sequence of digits is /[0-9]+/
 Decode /baa+!/
RE – 9
Wild Card
/./
- Matches any single character except carriage return

- How can it be used with kleene * to mean ‘any string of

characters’
- Please explore other operators for Counting, Prefixing,
Grouping, Beginning etc.,
RE -10
RE Substitution, Memory, ELIZA
 Substitution – an important use of RE
 Suppose, we want to put angle brackets around all
integers in a text
 Example 1
 the 35 boxes to the <35> boxes
 To do this, ( ) is put around the first pattern and number
operator \1 in the second pattern to refer back
 s/([0-9]+)/ <\1>/
RE -11
RE Substitution, Memory, ELIZA
 Example 2
 Suppose if we are looking for the pattern
 the Xer they were, the Xer they will be’
 We want both X to be the same string
 So, we surround the first X with parenthesis and replace the
second X with number operator
 SUBSTITUTION:
/ the (.*)er they were, the \1er they will be/
This will fetch the bigger they were, the bigger they will be and
not the bigger they were the faster they will be
RE-12
RE Substitution, Memory, ELIZA
 Two different sets of parenthesis can be matched:
/the (.*)er they (.*), the \1er they \2/
 Will match
 The bigger they were, the bigger they were
 Not
 The bigger they were, the bigger they will

 These numbered memories are registers (register 1, register 2

etc.,)
Substitutions Using Memory
In a Simple NL Application – ELIZA (Simulates a
Psychologist and could have conversations like:- )

User : Humans are all alike

ELIZA : IN WHAT WAY
User : They’re always bugging us about something
ELIZA : CAN YOU THINK OF A SPECIFIC EXAMPLE
User :Well, my friend made me come here.
ELIZA : YOUR FRIEND MADE YOU COME HERE
User : She says I’m depressed most of the time
ELIZA : I AM SORRY TO HEAR YOU ARE DEPRESSED.
RE substitutions in the Working of ELIZA

 Applies a cascade of RE substitutions

 my to YOUR
 I’m to YOU ARE
 s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU
ARE \1/
 s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU
ARE \1/
 s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Finite State Automata
RE & FSA

 RE – metalanguage for text searching

 RE – way of describing FSA
 RE along with FSA can be used to characterise a Formal
Language
FSA for Sheep talk – any string from the
following infinite set
FSA - Sheeptalk

 RE - /baa+!/
 The figure shown is an automaton for modelling this RE
 Automaton recognises the set of strings characterising sheeptalk
 Automaton as a directed Graph
 A finite set of vertices (nodes) - Circles
 Directed links between vertices called arcs – Arrows
 5 States
 State 0 - start state (incoming arrow)
 State 4 – final state/accepting state (double circle)
 4 Transitions (arcs)
Working - Input in tape
 The machine starts in State q0
 Set Current_state as Start_state
 Repeat
 Check the next letter in the input
 If it matches the symbol on an arc
 Move to the next state
 Update Current_state
 Check if it is the accepting state and if we have run out of symbols in input

 Then we have successfully recognized instance of sheeptalk, exit

 Else Advance one symbol in the input

 Exit if

 Machine never gets to the final stage (because it there was no match or it ran out of input or just stuck in some non-final
state)
State Transition Table – FSA

 Start State
 Accepting State(s)
 What transitions leave each state with what symbols
 --------------------------------------------------------------------
 State 4 – marked with : (accepting state)
 Ф indicates a illegal or missing transition
 We read the first row as:
 If we are in State 0 and we see the input b, we go to State 1
 If we are in State 0 and we see the input a or !, we fail
Formal Definition – FSA
 Defined by 5 Parameters:

 Q = {q0, q1, q2, q3, q4} = {a, b, !} F = {q4}

 = defined by the transition tables
Adding a fail state…
Another Example: 1-99 in English
Non-Deterministic FSA (NFSA)

 Consider the automata given:

 (Difference between this one and previous one is that – self loop is on
state 2 and not in state 3)

 Non-deterministic: when we get to state 2, we don’t know if we have to

stay there or go to state 3
DFSA Vs NFSA

 DFSA – Its behaviour during recognition is fully determined

by the state it is in and the symbol it is looking at.
 This is not the case with NFSA
NFSA #2

 Non-determinism - caused by arcs without symbols on them (ε

– transitions)

 The ε arc is interpreted as follows:

 if we are in state 3, we are allowed to move to state 2 without
looking at the input or advancing the input pointer
NFSA #1
Formal Languages
 empty set


Morphology

 Study of composition of words using Morphemes

 A morpheme is the smallest unit of language that has meaning
 Types of Morphemes
Stems
Affixes
Prefix
Suffix
Infix
Circumfix
Stems and Affixes
 Stems – Free Morphemes (can exist without affixes)
 Affixes – Bound Morphemes (exist with free morphemes)
 Eg: Unbelievable
 believe - free
 Un & able – affixes/bound morphemes
 Types of Languages
 Isolating (Only free forms – Mandarin)
 Agglutinative (small words combine to form compound info – Turkish)
 Inflecting (words are broken down to smaller units – different
meanings)
Inflection in Morphology

 Inflection – transforming a word into a form – representing

 Person
 Number
 Tense
 Gender
 Case
 Aspect
 Mood
 But the syntactic category of the token remains the same
Derivation in Morphology

 Syntactic category of the word is changed

Morphological Analyser (MA)

 process of obtaining grammatical information from tokens, given their

suffix information
 MA - generates morphological information, such as gender, number, class,
and so on, as an output.

#take a text, tokenise it

#use the following hints to find the category of a word
1. Morphological hints – eg. – suffix information like –ness –ment , imply
they exist with nouns
2. Syntactic hints – if we find that the word is noun category, an adjective
may occur before or after it
Morphological Parsing Using
Finite-State-Transducers
(FST)
Plurals

 Woodchuck –Woodchucks
 fox
 peccary
 aardvark
 goose
 fish
Kinds of Knowledge

 Morphological Rules
 fish – null plural
 goose – geese by changing vowel
 Spelling Rules
 -y to -i + -es
 Morphological Parsing
 fox -> foxes
 Two morphemes fox and es
 (Parsing means => taking an input and producing some sort of structure for it)
 Morphological; Syntactic; Semantic; Pragmatic
Types

 Concatenative Morphology
 Prefix and suffix
 Non-Concatenative Morphology
 Complex
 Templatic Morphology or Root-Pattern Morphology
 Hebrew
 Verb – Two components
 root + 3 Consonants (CCC)
 Ordering of consonants and vowels determine Verb form
 ‘lmd’ - learn
 CaCaCa – ‘lamad’ - he studied
 CiCeC –’limed’ - he taught
 CuCaC – ‘lumad’ – he was taught
Considerations:-
Lets look at ‘unbelievably’
 suffixes – ‘un’, ‘able’ , ‘ly’
 English can have a max of 4-5 suffixes
 Turkish – 9 or 10 suffixes
Morphological Parsing

 Parsing plural (-s) and Progressive (-ing)

 First Column – Words; Second Column –Morphological Features
Features

 +N (noun)
 +SG (Singular)
 +PL (Plural)
Morphological Parsing needs:

 Lexicon - List of Stems and Affixes + Basic Information

about them (Noun Stem or Verb Stem)
 Morphotactics – Model of Morpheme Ordering (which
classes of morphemes follow which other classes)
 Plural morpheme follows a noun rather than preceding it
 Orthographic or Spelling Rules – Spelling rules that occur
in a word when two morphemes combine
 y –ie (City + s => Cities)
Modeling Morphotactic Using FSA
 Regular Noun (-s plural)
 Irregular-Singular-Noun (goose; mouse)
 Irregular Plural Noun (geese; mice)
Verbal Inflection
FSA - Derivational
Morphology
 Noun-ize-ation
 Fossilization q0-q1-q2

 Adjectives-al-able at q5
 Equal, formal, realizable
suffix -ity or -ness at q6
reality; equality
naturalness, casualness
FSA for Morphological Recognition

 Checking if a string of letters is a legal word in English or

not…
 Procedure:
 By taking Morphotactic FSA and plugging in each ‘sub-lexicon’
into the FSA
 i.e by expanding each arc (eg reg-stem-noun) with all morphemes
that make up set of reg-stem-noun
 => The resulting FSA can be at the level of individual letter
What we have seen so far is Morphological
Recognition, we will now look at
Morphological Parsing
Recall, the table with Parsed Morphological
Output…
For Cats, the output should be CAT +N
+PL
Morphological Parsing Using Finite
State Transducers
 Two-level Morphology
 A word is represented as a correspondence between
 LEXICAL LEVEL (Simple concatenation of morphemes making up
a word)
 SURFACE LEVEL (Actual spelling of the final word)

 Example:
 Cat +N +PL => Lexical Level
 Cats => Surface Level
Morphological Parsing

 Done by building mapping rules between

 Letter sequences (surface level)
 Feature sequences (lexical level)
FSA used to provide mapping - FST

 A transducer maps between

 One set of symbols and another
 FST does this via FSA

 A FST is a two-tape automaton that recognizes (or

generates) pairs of strings
 So, FSA defines formal languages, by defining a set of
strings
 FST defines relations between sets of strings
FST – Mealy Machine Extension to FSA
FSA and FST of Sheep Language

 FSA accepts a language defined with symbols, for eg.,

Sheep language

 FST accepts a language defined over pairs of symbols as in

Two-level Morphology (FST with two
tapes)
 a:b
 Expresses how the symbol a from one tape is mapped to the symbol b
on another tape
 a : ε => a on the upper tape will correspond to nothing on the lower
tape
 a : a => if a symbol maps to itself, which will be common, it is
DEFAULT PAIR
 Upper or Lexical Tape
 Composed from characters from LEFT side of a:b pairs
 Lower or Surface Tape
 Composed from characters from RIGHT side of a:b pairs
FST Morphological Parser = Morphotactic FSA +
Lexical Tape
Lexicon should also have two levels:
Cascading Two Automata

 geese => goose +N +PL

 Need to Cascade the lexicon (in previous slide) with

Singular/Plural Automaton (in slide 38)
 Cascading two automata means running them in series
with the output of first feeding the input of second.
Automaton for Lexicon of Stems
cats => cats +N +PL

 C maps to itself (default pair)

 Likewise for ‘a’ and ‘t’
 +N maps with ε transition (Empty String) (+N means Noun)
 +PL maps to ^S ( ^ indicates Morpheme
boundary)
 # indicates Word Boundary
Orthographic Rules & FST

 English requires Spelling changes at morpheme

boundaries by introducing spelling rules (Orthographic
Rules)
 FST for writing such rules
E- Insertion Rule
E- Insertion Rule
E – Insertion Rule Automaton
COMBINING FST LEXICON AND RULES

Word Level Analysis
No ratings yet
Word Level Analysis
49 pages
Wordlevel Analysis - Chap2
No ratings yet
Wordlevel Analysis - Chap2
97 pages
Module 2 Chap1
No ratings yet
Module 2 Chap1
92 pages
Understanding Regular Expressions
No ratings yet
Understanding Regular Expressions
85 pages
Words
No ratings yet
Words
44 pages
Chapter 2 Lexical Analysis
No ratings yet
Chapter 2 Lexical Analysis
55 pages
Linguistics: Understanding Morphology
No ratings yet
Linguistics: Understanding Morphology
118 pages
ch-2.pdf 2
No ratings yet
ch-2.pdf 2
27 pages
2.chapter3 - Regular Expressions and Automata
No ratings yet
2.chapter3 - Regular Expressions and Automata
28 pages
Word Level Analysis NLP Mod 2
No ratings yet
Word Level Analysis NLP Mod 2
18 pages
NLP Module 2 - 1
100% (1)
NLP Module 2 - 1
86 pages
Module 2
No ratings yet
Module 2
78 pages
Scan 27 Nov 23 09 21 15
No ratings yet
Scan 27 Nov 23 09 21 15
11 pages
Unit II - Lexical Analysis-20-1-2021
No ratings yet
Unit II - Lexical Analysis-20-1-2021
49 pages
Regular Expressions & FSAs Guide
No ratings yet
Regular Expressions & FSAs Guide
40 pages
Morphological Analysis
No ratings yet
Morphological Analysis
35 pages
CH 2
No ratings yet
CH 2
36 pages
7-Morphology Part2
No ratings yet
7-Morphology Part2
28 pages
File 1675742677 110405 LexicalAnalysis-Continue1
No ratings yet
File 1675742677 110405 LexicalAnalysis-Continue1
39 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
20 pages
Module II
No ratings yet
Module II
47 pages
NLP - Sem
No ratings yet
NLP - Sem
31 pages
Lecture 3 (30-1-23)
No ratings yet
Lecture 3 (30-1-23)
11 pages
Slide-1 Hello Everyone!! Welcome You All at Centre For Future Technologies in This Session We Will Discuss About The Word Level Analysis
No ratings yet
Slide-1 Hello Everyone!! Welcome You All at Centre For Future Technologies in This Session We Will Discuss About The Word Level Analysis
9 pages
Lexical Analysis and Token Recognition
100% (3)
Lexical Analysis and Token Recognition
51 pages
Lect 03
No ratings yet
Lect 03
19 pages
2 - Compilers (Lexical Analysis)
No ratings yet
2 - Compilers (Lexical Analysis)
60 pages
Chapter Three Words and Transducers
No ratings yet
Chapter Three Words and Transducers
56 pages
Compiler Design: Lexical Analysis
No ratings yet
Compiler Design: Lexical Analysis
27 pages
Lec 03 - Finite Languages
No ratings yet
Lec 03 - Finite Languages
29 pages
RegexFSA
No ratings yet
RegexFSA
59 pages
NLP Chapter 2
No ratings yet
NLP Chapter 2
10 pages
1st Phase Lexical Analyzer
No ratings yet
1st Phase Lexical Analyzer
33 pages
Lecture 3
No ratings yet
Lecture 3
55 pages
CD - Unit1 - Lecture4 5 6 7
No ratings yet
CD - Unit1 - Lecture4 5 6 7
50 pages
Compilers CH 3
No ratings yet
Compilers CH 3
58 pages
Word-Level NLP Techniques and Regex
No ratings yet
Word-Level NLP Techniques and Regex
8 pages
Compiler Design - Lexical Analysis: University of Salford, UK
No ratings yet
Compiler Design - Lexical Analysis: University of Salford, UK
1 page
Regular Expression & Autometa
No ratings yet
Regular Expression & Autometa
62 pages
NLP Notes Complete
No ratings yet
NLP Notes Complete
99 pages
Nlplect 10
No ratings yet
Nlplect 10
18 pages
13-Application - Complete Prediction-02-02-2024
No ratings yet
13-Application - Complete Prediction-02-02-2024
49 pages
Chapter 3 - Lexical Analysis
No ratings yet
Chapter 3 - Lexical Analysis
51 pages
1 Toc-1
No ratings yet
1 Toc-1
95 pages
Token Recognition in Compiler Design
No ratings yet
Token Recognition in Compiler Design
51 pages
Compiler Construction Lecture 3-4
No ratings yet
Compiler Construction Lecture 3-4
78 pages
Chapter 3 - Lexical Analysis
100% (1)
Chapter 3 - Lexical Analysis
51 pages
Chapter 2
No ratings yet
Chapter 2
31 pages
Scanner (Lexical Analyzer) : The Structure of A Compiler
No ratings yet
Scanner (Lexical Analyzer) : The Structure of A Compiler
109 pages
Unit 2: Role of Lexical Analyzer
No ratings yet
Unit 2: Role of Lexical Analyzer
11 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
11 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
88 pages
Compiler Course: Lexical Analysis
No ratings yet
Compiler Course: Lexical Analysis
50 pages
Automata and Formal Languages Cheatsheet
100% (1)
Automata and Formal Languages Cheatsheet
5 pages
Compiler Construction Lecture Notes
No ratings yet
Compiler Construction Lecture Notes
21 pages
Chapter 05 Finite Autamata
No ratings yet
Chapter 05 Finite Autamata
22 pages
Chapter 2
No ratings yet
Chapter 2
56 pages
Lec 1 IntroToAutomataTheory
100% (1)
Lec 1 IntroToAutomataTheory
20 pages
Components
No ratings yet
Components
29 pages
Module5 Lecture1 LanguageModelling
No ratings yet
Module5 Lecture1 LanguageModelling
52 pages
Module3 Lecture1 ContextFreeGrammar
No ratings yet
Module3 Lecture1 ContextFreeGrammar
11 pages
Module2 Lecture4 HMM
No ratings yet
Module2 Lecture4 HMM
20 pages
The Allomorphy
No ratings yet
The Allomorphy
13 pages
Igbo (2nd Sem)
No ratings yet
Igbo (2nd Sem)
81 pages
Lesson 7
No ratings yet
Lesson 7
9 pages
Phonology The Sound of Language
No ratings yet
Phonology The Sound of Language
21 pages
Morphology-Basic Allomorphs
No ratings yet
Morphology-Basic Allomorphs
3 pages
Instructor's Answer Guide
No ratings yet
Instructor's Answer Guide
89 pages
Lexicology As A Science. The Object of Lexicology. Main Lexicological Problems
No ratings yet
Lexicology As A Science. The Object of Lexicology. Main Lexicological Problems
4 pages
Лексикологія тести
No ratings yet
Лексикологія тести
11 pages
Understanding Language: Forms and Definitions
No ratings yet
Understanding Language: Forms and Definitions
50 pages
Orthography Syntax and Morphemes in Cebuano
No ratings yet
Orthography Syntax and Morphemes in Cebuano
34 pages
Definition of Language
No ratings yet
Definition of Language
14 pages
ENG503 - Quiz For Mid-Term
No ratings yet
ENG503 - Quiz For Mid-Term
14 pages
GEC 105: Purposive Communication Module
No ratings yet
GEC 105: Purposive Communication Module
18 pages
Contrastive Linguistics
No ratings yet
Contrastive Linguistics
49 pages
Unit 3 Morphology
No ratings yet
Unit 3 Morphology
20 pages
Comprehensive Notes of Linguistics
No ratings yet
Comprehensive Notes of Linguistics
44 pages
MODULE 5 Morphology Lingusitics 1
No ratings yet
MODULE 5 Morphology Lingusitics 1
11 pages
Morphology Basics for Linguistics Students
No ratings yet
Morphology Basics for Linguistics Students
45 pages
Linguistics Notes
No ratings yet
Linguistics Notes
5 pages
LIN 101 Introduction To Lingusitics I-1
No ratings yet
LIN 101 Introduction To Lingusitics I-1
21 pages
Homework
No ratings yet
Homework
2 pages
Group 3 (Fika Nurul Fadillah & Nurhidayanti Badru) - Structural Morphology
No ratings yet
Group 3 (Fika Nurul Fadillah & Nurhidayanti Badru) - Structural Morphology
25 pages
Exercise 1 1 Identify The Free Morphemes
No ratings yet
Exercise 1 1 Identify The Free Morphemes
6 pages
Grammatical Units: Words, Phrases, Sentences
No ratings yet
Grammatical Units: Words, Phrases, Sentences
18 pages
S1 Morphemes
No ratings yet
S1 Morphemes
2 pages
BROWN - Development of The First Language
100% (3)
BROWN - Development of The First Language
10 pages
Exploring Affixation in English
No ratings yet
Exploring Affixation in English
10 pages
Underlying Representation Slidesx1 PDF
No ratings yet
Underlying Representation Slidesx1 PDF
50 pages
Task 2 English Morphology and Lexicology
No ratings yet
Task 2 English Morphology and Lexicology
3 pages
Understanding English Morphology Basics
No ratings yet
Understanding English Morphology Basics
21 pages

Module2 Lecture2 FST

Uploaded by

Module2 Lecture2 FST

Uploaded by

Regular Expression

 Standard Notation – characterising text sequences

 One of the unsung successes in CS

- How can it be used with kleene * to mean ‘any string of

 These numbered memories are registers (register 1, register 2

User : Humans are all alike

 Applies a cascade of RE substitutions

 RE – metalanguage for text searching

 Then we have successfully recognized instance of sheeptalk, exit

 Else Advance one symbol in the input

 Q = {q0, q1, q2, q3, q4} = {a, b, !} F = {q4}

 Consider the automata given:

 Non-deterministic: when we get to state 2, we don’t know if we have to

 DFSA – Its behaviour during recognition is fully determined

 Non-determinism - caused by arcs without symbols on them (ε

 The ε arc is interpreted as follows:

 Study of composition of words using Morphemes

 Inflection – transforming a word into a form – representing

 Syntactic category of the word is changed

 process of obtaining grammatical information from tokens, given their

#take a text, tokenise it

 Parsing plural (-s) and Progressive (-ing)

 Lexicon - List of Stems and Affixes + Basic Information

 Checking if a string of letters is a legal word in English or

 Done by building mapping rules between

 A transducer maps between

 A FST is a two-tape automaton that recognizes (or

 FSA accepts a language defined with symbols, for eg.,

 FST accepts a language defined over pairs of symbols as in

 geese => goose +N +PL

 Need to Cascade the lexicon (in previous slide) with

 C maps to itself (default pair)

 English requires Spelling changes at morpheme

You might also like