Compilation Techniques
(2)
Formal languages
Regular expressions
Finite Automata
Alphabet
A set of symbols, letters, digits,
punctuators, spaces, …
Noted as a set: A={…}
Binary:{0, 1}
Morse: { . , _ }
ASCII
UNICODE
String (sentence or word)
Finite sequence of symbols from an
alphabet
Empty sequence: ϵ
|s| – the length of a string (in characters)
Binarystrings: ϵ,0,1,01,10,11,1000,0011
ASCII strings: ϵ,speed, 3.14159, ”hi”
Parts of a string
PREFIX – any string obtained by removing zero or
more symbols from the end of the string.
( prefixes of “flower”: ϵ, f, flowe, flower )
SUFFIX – any string obtained by removing zero or
more symbols from the beginning of the string
( suffixes of “flower”: ϵ, r, lower, flower )
SUBSTRING – any string obtained by removing
any prefixes and suffixes from a string
( substrings of “flower”: ϵ, w, lowe, flower )
PROPER prefixes, suffixes or substrings – the
ones which are not ϵ and are not the same as the
original string
Language
A countable set of strings over an
alphabet
Noted as a set: L={…}
Formal language – a language
constrained by rules
L+ – a language without ϵ
The language of strictly two binary digits:
{00, 01, 10, 11}
A baby language: {ϵ, mom, dad}
Concatenation of strings
The operation to obtain a new string by
adjoining two strings (the juxtaposition of
two strings)
Let L a language and x,y,z ∈ L
xy is the concatenation of x and y
(xy)z = x(yz)
ϵx = xϵ = x
Ex: L={ϵ, sun, flower} => concatenated strings:
sun, flower, sunflower, flowerflower, …
The (Kleene) closure
The set of strings obtained by
concatenating a language’s strings one or
more times
Noted L*
By definition: L0={ϵ}
Inductively: Ln=Ln-1L
L+ - a closure without ϵ (ϵ is present in L+ only if
it is present in L)
Ex: let L={a,b,…,z}
L2={aa,ab,…az,ba,bb,…bz,…za,zb,…zz}
L*={…any sequence of letters…}
Union
A new language obtained by all
strings from other two languages
L ∪ M = {s | s ∈ L or s ∈ M }
Ex: L={a,b}, M={0,1,2}
L ∪ M = {a,b,0,1,2}
Regular expressions
A regular expression “e” denotes a language L(e) over an
alphabet A, obtained by one of the following rules:
◦ ϵ is a regular expression and L(ϵ)={ϵ}
◦ if a∈A, L(a)={a}
◦ e* – repetition zero or more times: L(e)*. The operator * has the
highest precedence.
◦ e1e2 – concatenation: L(e1)L(e2). Concatenation (sequence) has the
second highest precedence.
◦ e1|e2 – union: L(e1)∪L(e2). Union (alternative) has the lowest
precedence.
◦ (e) – parenthesis around e does not change e: L(e)
Note: all operators are left associative
A language that can be defined by a regular expression is called a
regular set. If two regular expressions r and s denote the same regular
set, we say they are equivalent: r=s (ex: a|b = b|a )
Extensions of regular expressions
e+ – one or more instances of e: L(e)+.
e*=e+|ϵ e+=ee*=e*e
e? – zero or one instance of e: L(e)∪{ϵ}
if a1,a2,…an∈A (an alphabet):
a1|a2|…|an=[a1a2…an] (character class)
if a1,a2,…an form a logical sequence (they are
consecutive in A):
a1|a2|…|an=[a1-an] (range)
[^a1…an] – any character except a1…an
\a – a loses any special significance and it will be
treated as a simple character
Examples of regular
expressions
flower – the sequence ‘f’ ‘l’ ‘o’ ‘w’ ‘e’ ‘r’
[0-9] – any digit
[^a-zA-Z] – any character except letters
[0-9]+(,[0-9]+)* –
a list of at least one integer, separated
by comma
[0-9]+(\.[0-9]+)?[eE][+\-]?[0-9]+ –
a number in scientific notation (mantissa
and exponent): 1.05e3, 7E-2
Finite automata
A finite set of states: S={S0,S1,…Sn-1}
A set of input symbols (the input alphabet): Σ={a1,a2,
…am}. (ϵ is never a member of Σ)
A state S0∈S, distinguished as start state (or initial
state)
A set of states F⊂S, distinguished as accepting states
(or final states)
A transition function
Nondeterministic Finite Automata (NFA)
From the same state multiple transitions are possible
with the same input symbol
ϵ is accepted as input symbol
The transition function is defined as:
T:Sx(Σ∪{ϵ})->P(S)
P(S) – the power set of S: the set of all subsets of S
S={a,b,c} => P(S)={{},{a},{b},{c},{a,b},{a,c},{b,c},{a,b,c}}
Deterministic Finite Automata (DFA)
For each state s and input symbol a, there is
exactly one transition from s labeled a
ϵ is not accepted as input symbol
The transition function is defined as:
T:SxΣ->S
Any DFA is a NFA
Finite automata representation
S={0,1,2,3}, Σ={a,b}, initial state 0, F={3} State a b
T given as a transition table:
0 0,1 0
for each state and input, 1 2
the possible transitions are given 2 3
3
Ifno possible transitions are possible from a given state
with a given input, that cell is empty (or ∅)
The initial state is figured sometimes with an arrow from
nowhere to it
a
a b b
0 1 2 3
b (a|b)*abb
NFA -> DFA (1)
DFA can use faster traversal algorithms
because in each state it is only one possible
transition for a given input character => it is
important to convert a NFA to a DFA
Rabin-Scott power set construction
For a state s, ϵ-closure(s): the set of all
reachable states from s, including itself, without
consuming any characters (by ϵ transitions)
SDFA – resulted DFA states. Initially S DFA will
contain sets of NFA states which will be
renamed later as new states
T – resulted DFA transitions
NFA -> DFA (2)
If s0 is the NFA initial state: SDFA={ϵ-closure(s0)}, TDFA={}
For each SDFA state (or set of states) s:
◦ For each input symbol a:
Let p=the set of all NFA transitions from s with a
Let c=ϵ-closure(p) in NFA
If c is not in SDFA, add it to SDFA
Add (c,a) to TDFA
Final states in SDFA are the ones those sets contains
final states in NFA
Rename all new sets in SDFA as new states
NFA -> DFA (3)
a
a b b
0 1 2 3
b
SDFA a b
0 {0,1} 0
{0,1} => 4 {0,1} {0,2}
{0,2} => 5 {0,1} {0,3}
{0,3} => 6 {0,1} 0
b
a
a b b
0 4 5 6
a
b
a
Regular expression -> NFA (1)
Inorder to implement regular expressions, these can be
converted to NFA
Thompson-McNaughton-Yamada algorithm
Each component of the regular expression is
represented as in the following figures. The rectangles
are substituted with the NFA corresponding to that
regular expression. ϵ
ϵ e
ϵ a ϵ
a e?
ϵ
ϵ ϵ ϵ
e1e2 e1 e2
ϵ e
ϵ
ϵ
e* ϵ
ϵ
ϵ e1
e1|e2 ϵ
ϵ
ϵ ϵ
e2 ϵ ϵ ϵ
e+ e
Regular expression -> NFA (2)
(a|b)*abb
ϵ
ϵ a
ϵ
ϵ ϵ
ϵ ϵ b
ϵ
ϵ
ϵ a ϵ b ϵ b
Optimization: if between two states is only an ϵ transition, these states can be
joined, preserving all the incoming/outgoing transitions of both states
a
a b b
0 1 2 3
b
Bibliography reading
Alfred V. Aho, Monica S. Lam, Ravi Sethi,
Jeffrey D. Ullman: Compilers. Principles,
Techniques and Tools, 2nd edition, Chapters
3.3, 3.6, 3.7