Chapter Two
Lexical Analysis
1
Outline
Introduction
Interaction of Lexical Analyzer with Parser
Token, pattern, lexeme
Specification of patterns using regular expressions
Regular expressions
Regular expressions for tokens
NFA and DFA
Conversion from RE to NFA to DFA…
Lex Scanner Generator
Creating a Lexical Analyzer with Lex
Regular Expressions in Lex
Lex specifications and examples
2
Introduction
The role of the lexical analyzer is:
• to read a sequence of characters from the source
program
• group them into lexemes and
• produce as output a sequence of tokens for each
lexeme in the source program.
The scanner can also perform the following
secondary tasks:
stripping out blanks, tabs, and newlines
stripping out comments
keeping track of line numbers (for error reporting)
3
Interaction of the Lexical Analyzer
with the Parser
[Diagram: the lexical analyzer reads characters from the source program ("get next char") and returns the next token to the syntax analyzer each time the parser asks for one ("get next token"); both phases consult the symbol table, which contains a record for each identifier.]
token: smallest meaningful sequence of characters
of interest in source program
4
Token, pattern, lexeme
A token is a sequence of characters from the source
program having a collective meaning.
A token is a classification of lexical units.
- For example: id and num
Lexemes are the specific character strings that make up
a token.
– For example: abc and 123A
Patterns are rules describing the set of lexemes
belonging to a token.
– For example: “letter followed by letters and digits”
Patterns are usually specified using regular expressions,
e.g. [a-zA-Z][a-zA-Z0-9]*
Example: printf("Total = %d\n", score);
5
Example
Token        Informal description                     Sample lexemes
if           the characters i, f                      if
else         the characters e, l, s, e                else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14159, 0, 6.02e23
literal      anything but ", surrounded by "          "core dumped"
• In general, in programming languages, the following
are tokens:
keywords, operators, identifiers, constants, literals,
punctuation symbols…
Attributes of tokens
When more than one pattern matches a lexeme, the
scanner must provide additional information about the
particular lexeme to the subsequent phases of the
compiler.
For example, both 0 and 1 match the pattern for the
token num.
But the code generator needs to know which number is
recognized.
The lexical analyzer collects information about tokens
into their associated attributes.
• Tokens influence parsing decisions;
• Attributes influence the translation of tokens after
parsing
7
Attributes of tokens…
Practically, a token has one attribute:
a pointer to the symbol table entry in which
information about the token is kept.
The symbol table entry contains various pieces of
information about the token,
such as its lexeme, its type, and the line number in which
it was first seen…
Ex. y = 31 + 28 * x. The tokens and their
attributes can be written as:
<id, pointer to symbol-table entry for y>
<assign_op>
<num, integer value 31>
<add_op>
<num, integer value 28>
<mult_op>
<id, pointer to symbol-table entry for x>
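As a rough illustration (not from the slides), a scanner written in C might represent a token as a small struct whose attribute field points into the symbol table; the token-kind names below are purely hypothetical.

#include <stddef.h>

/* Illustrative token classes for the expression example above. */
enum TokenKind { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_EOF };

/* A token carries its class plus one attribute: for identifiers, an index
 * into the symbol table; for numbers, the literal value itself. */
struct Token {
    enum TokenKind kind;
    size_t symtab_index;   /* attribute when kind == TOK_ID  */
    long   value;          /* attribute when kind == TOK_NUM */
};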
8
Errors
Very few errors are detected by the lexical
analyzer.
For example, if the programmer mistakes fi for
if, the lexical analyzer cannot detect the error
since it will simply treat fi as a valid identifier.
Nonetheless, if a certain sequence of
characters follows none of the specified
patterns, the lexical analyzer can detect the
error.
10
Errors…
When an error occurs, the lexical analyzer
recovers by:
skipping (deleting) successive characters from the
remaining input until the lexical analyzer can find a
well-formed token (panic-mode recovery)
deleting one character from the remaining input
inserting missing characters into the remaining input
replacing an incorrect character by a correct
character
transposing two adjacent characters
11
Specification of patterns using
regular expressions
Regular expressions
Regular expressions for tokens
12
Regular expression: Definitions
Represents patterns of strings of characters.
An alphabet Σ is a finite set of symbols
(characters)
A string s is a finite sequence of symbols
from Σ
|s| denotes the length of string s
ε denotes the empty string, thus |ε| = 0
A language L is a specific set of strings over
some fixed alphabet Σ
13
Regular expressions…
A regular expression is one of the following:
Symbol: a basic regular expression consisting of a single
character a, where a is from:
an alphabet Σ of legal characters;
the metacharacter ε; or
the metacharacter ø.
In the first case, L(a)={a};
in the second case, L(ε)= {ε};
in the third case, L(ø)= { }.
{ } – contains no strings at all.
{ε} – contains only the empty string, i.e. the single string consisting of no characters.
14
Regular expressions…
Union/Alternation: an expression of the form r|s,
where r and s are regular expressions.
In this case, L(r|s) = L(r) ∪ L(s)
Concatenation: An expression of the form rs, where r
and s are regular expressions.
In this case, L(rs) = L(r)L(s)
Repetition: An expression of the form r*, where r is a
regular expression.
In this case, L(r*) = L(r)* = {ε, r, rr, rrr, …}
15
Regular expression: Language Operations
Union of L and M:
  L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation of L and M:
  LM = {xy | x ∈ L and y ∈ M}
Exponentiation of L:
  L^0 = {ε};  L^i = L^(i-1) L
Kleene closure of L:
  L* = ∪ i=0,…,∞ L^i
Positive closure of L:
  L+ = ∪ i=1,…,∞ L^i
The following shorthands are often used:
  r+ = rr*
  r* = r+ | ε
  r? = r | ε
16
Regular expressions…
Examples:
1- a | b = {a,b}
2- (a|b)a = {aa,ba}
3- (ab) | ε ={ab, ε}
4- ((a|b)a)* = {ε, aa,ba,aaaa,baba,....}
Reverse direction: from a description of the language to a regular expression
1 – Even binary numbers (0|1)*0
2 – An alphabet consisting of just three alphabetic
characters: Σ = {a, b, c}. Consider the set of all strings
over this alphabet that contains exactly one b.
(a | c)*b(a|c)* {b, abc, abaca, baaaac, ccbaca, cccccb}
17
Exercises
Describe the languages denoted by the following
regular expressions:
1- a(a|b)*a
2- ((ε|a)b*)*
3- (a|b)*a(a|b)(a|b)
4- a*ba*ba*ba*
5- (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
18
Regular expressions for tokens
Regular expressions are used to specify the
patterns of tokens.
Each pattern matches a set of strings. It falls into
different categories:
Reserved (Key) words: They are represented by
their fixed sequence of characters,
Ex. if, while and do....
If we want to collect all the reserved words into
one definition, we could write it as follows:
Reserved = if | while | do |...
19
Regular expressions for tokens…
Special symbols: including arithmetic operators,
assignment and equality such as =, :=, +, -, *
Identifiers: defined to be a sequence of
letters and digits beginning with a letter;
we can express this in terms of regular definitions as
follows:
letter = A|B|…|Z|a|b|…|z
digit = 0|1|…|9
or
letter= [a-zA-Z]
digit = [0-9]
identifiers = letter(letter|digit)*
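As a quick illustration (not part of the slides), the identifier pattern can be checked directly with the POSIX regular-expression API in C; the ^ and $ anchors force the whole string to match.

#include <regex.h>
#include <stdio.h>

/* Returns 1 if s matches letter(letter|digit)*, written here as a POSIX ERE. */
static int is_identifier(const char *s) {
    regex_t re;
    if (regcomp(&re, "^[a-zA-Z][a-zA-Z0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                                /* pattern failed to compile */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0); /* 0 means "matched" */
    regfree(&re);
    return ok;
}

int main(void) {
    /* prints: 1 1 0 */
    printf("%d %d %d\n", is_identifier("D2"), is_identifier("score"),
           is_identifier("2pac"));
    return 0;
}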
20
Regular expressions for tokens…
Numbers: Numbers can be:
sequence of digits (natural numbers), or
decimal numbers, or
numbers with exponent (indicated by an e or E).
Example: 2.71E-2 represents the number 0.0271.
We can write regular definitions for these numbers as
follows:
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat(“.” nat)?(E signedNat)?
Literals or constants, which can include:
numeric constants such as 42, and
string literals such as "hello, world".
21
Regular expressions for tokens…
relop = < | <= | = | <> | > | >=
Comments: e.g. /* this is a C comment */
delimiter = newline | blank | tab | comment
whitespace = (delimiter)+
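Similarly, as an illustrative sketch (not from the slides), the number definition above can be written as a single POSIX extended regular expression; note that the literal dot must be escaped.

#include <regex.h>
#include <stdio.h>

/* number = signedNat ("." nat)? (E signedNat)?   where signedNat = (+|-)? nat */
static int is_number(const char *s) {
    regex_t re;
    if (regcomp(&re, "^[+-]?[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    /* prints: 1 1 1 0  -- 2.71E-2, 42 and +3 match, "3." does not */
    printf("%d %d %d %d\n", is_number("2.71E-2"), is_number("42"),
           is_number("+3"), is_number("3."));
    return 0;
}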
22
Design of a Lexical Analyzer/Scanner
Finite Automata
Lex – turns its input program (a token specification) into a lexical analyzer.
At the heart of this translation is the formalism known as
finite automata.
1. Finite automata are recognizers; they simply say "yes" or
"no" about each possible input string.
2. Finite automata come in two flavors:
a) Nondeterministic finite automata (NFA) have no restrictions
on the labels of their edges.
ε, the empty string, is a possible label.
b) Deterministic finite automata (DFA) have, for each state
and for each symbol of the input alphabet, exactly one edge
with that symbol leaving that state.
23
The Whole Scanner Generator Process
Overview
Direct construction of a nondeterministic finite
automaton (NFA) to recognize a given regular
expression.
Easy to build in an algorithmic way
Requires ε-transitions to combine regular subexpressions
Construct a deterministic finite automaton
(DFA) to simulate the NFA
Uses a set-of-states construction
Minimize the number of states in the DFA (optional)
Generate the scanner code.
24
Design of a Lexical Analyzer …
Token → Pattern
Pattern → Regular Expression
Regular Expression → NFA
NFA → DFA
DFAs or NFAs for all tokens → Lexical Analyzer
25
Non-Deterministic Finite Automata
(NFA)
Definition
An NFA M is a 5-tuple (Σ, S, T, s0, F):
A set of input symbols Σ, the input alphabet
a finite set of states S,
a transition function T: S × (Σ ∪ {ε}) → P(S) (sets of next states),
a start state s0 from S, and
a set of accepting/final states F from S.
The language accepted by M, written L(M), is defined as:
The set of strings of characters c1c2...cn, with each ci from Σ
∪ {ε}, such that there exist states s1 in T(s0, c1), s2 in
T(s1, c2), ..., sn in T(s(n-1), cn) with sn an element of F.
26
NFA…
It is a finite automaton which has a choice of
edges:
• The same symbol can label edges from one state to
several different states.
An edge may be labeled by ε, the empty
string
• We can have transitions without any input
character consumption.
27
Transition Graph
The transition graph for an NFA recognizing the
language of regular expression (a|b)*abb
all strings of a's and b's ending in the
particular string abb
[Transition graph: state 0 has edges labeled a and b back to itself and an edge labeled a to state 1; state 1 goes to state 2 on b; state 2 goes to state 3 on b; state 3 is accepting.]
S = {0, 1, 2, 3}
Σ = {a, b}
s0 = 0
F = {3}
28
Transition Table
The mapping T of an NFA can be represented
in a transition table:

State    a        b      ε
0        {0,1}    {0}    ø
1        ø        {2}    ø
2        ø        {3}    ø
3        ø        ø      ø

T(0,a) = {0,1}
T(0,b) = {0}
T(1,b) = {2}
T(2,b) = {3}

The language defined by an NFA is the set of input
strings it accepts, such as (a|b)*abb for the example
NFA.
29
Acceptance of input strings by NFA
An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states whose edge labels spell out x
The string aabb is accepted by the NFA:
0 -a-> 0 -a-> 1 -b-> 2 -b-> 3    YES (ends in an accepting state)
0 -a-> 0 -a-> 0 -b-> 0 -b-> 0    NO  (this path does not accept)
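A hedged sketch (not from the slides) of how this particular NFA, which has no ε-edges, can be simulated in C using the transition table above; each set of states is kept as a bitmask.

#include <stdio.h>

#define NSTATES 4            /* NFA for (a|b)*abb: states 0..3, accepting state 3 */

/* delta[state][symbol] = bitmask of successor states; symbol 0 = 'a', 1 = 'b' */
static const unsigned delta[NSTATES][2] = {
    { (1u << 0) | (1u << 1), 1u << 0 },   /* T(0,a) = {0,1},  T(0,b) = {0} */
    { 0,                     1u << 2 },   /* T(1,b) = {2} */
    { 0,                     1u << 3 },   /* T(2,b) = {3} */
    { 0,                     0       }
};

static int nfa_accepts(const char *x) {
    unsigned cur = 1u << 0;                        /* start in the set {0} */
    for (; *x; x++) {
        int sym = (*x == 'a') ? 0 : 1;
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (cur & (1u << s))
                next |= delta[s][sym];             /* union of all successor sets */
        cur = next;
    }
    return (cur & (1u << 3)) != 0;                 /* did any path reach state 3? */
}

int main(void) {
    printf("aabb: %s\n", nfa_accepts("aabb") ? "YES" : "NO");   /* YES */
    printf("aaba: %s\n", nfa_accepts("aaba") ? "YES" : "NO");   /* NO  */
    return 0;
}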
30
Another NFA
[NFA with ε-transitions: from the start state, one ε-edge leads to a branch that reads an a and then loops on a; another ε-edge leads to a branch that reads a b and then loops on b.]
An ε-transition is taken without consuming any character from
the input.
What does the NFA above accept?
aa*|bb*
31
Deterministic Finite Automata (DFA)
A deterministic finite automaton is a special
case of an NFA
No state has an ε-transition
For each state S and input symbol a there is at
most one edge labeled a leaving S
Each entry in the transition table is a single state
At most one path exists to accept a string
Simulation algorithm is simple
32
DFA example
A DFA that accepts (a|b)*abb
[DFA: states 0–3, start state 0, accepting state 3; moves: 0 -a-> 1, 0 -b-> 0, 1 -a-> 1, 1 -b-> 2, 2 -a-> 1, 2 -b-> 3, 3 -a-> 1, 3 -b-> 0.]
33
Simulating a DFA: Algorithm
How to apply a DFA to a string.
INPUT:
An input string x terminated by an end-of-file character eof.
A DFA D with start state s0, accepting states F, and
transition function move.
OUTPUT: Answer ''yes" if D accepts x; "no" otherwise
METHOD
Apply the algorithm on the next slide to the input string x.
The function move(s, c) gives the state to which there is an
edge from state s on input c.
The function nextChar() returns the next character of the
input string x.
34
Simulating a DFA
s = s0;
c = nextChar();
while ( c != eof ) {
    s = move(s, c);
    c = nextChar();
}
if ( s is in F ) return "yes";
else return "no";

[Figure: the DFA accepting (a|b)*abb shown on the previous slide.]
Given the input string ababb, this DFA enters the
sequence of states 0,1,2,1,2,3 and returns "yes"
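A minimal, self-contained C version of this simulation (an illustration, not the slides' own code), hard-coding the move table of the (a|b)*abb DFA described above:

#include <stdio.h>

/* move_tbl[state][symbol]: symbol 0 = 'a', 1 = 'b'; start state 0, F = {3} */
static const int move_tbl[4][2] = {
    { 1, 0 },   /* state 0 */
    { 1, 2 },   /* state 1 */
    { 1, 3 },   /* state 2 */
    { 1, 0 }    /* state 3 */
};

static const char *simulate(const char *x) {
    int s = 0;                                    /* s = s0 */
    for (; *x; x++)                               /* c = nextChar(); while c != eof */
        s = move_tbl[s][*x == 'a' ? 0 : 1];       /* s = move(s, c) */
    return (s == 3) ? "yes" : "no";               /* is s in F? */
}

int main(void) {
    printf("ababb -> %s\n", simulate("ababb"));   /* states 0,1,2,1,2,3 -> yes */
    printf("aabab -> %s\n", simulate("aabab"));   /* -> no */
    return 0;
}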
35
DFA: Exercise
Draw DFAs for the string matched by the
following definition:
digit = [0-9]
nat = digit+
signedNat = (+|-)? nat
number = signedNat("." nat)? (E signedNat)?
36
Design of a Lexical Analyzer Generator
Regular Expression → DFA
Two algorithms:
1- Translate a regular expression into an NFA
(Thompson’s construction)
2- Translate NFA into DFA
(Subset construction)
37
From regular expression to an NFA
It is known as Thompson’s construction.
Rules:
1- For the basic regular expressions ε and a single character a, construct:
[Figure: a start state with a single edge, labeled ε or a, to an accepting state.]
38
From regular expression to an NFA…
2- For a composition of regular expression:
Case 1: Alternation: for the regular expression s|r, assume
that NFAs equivalent to r and s have already been
constructed.
39
From regular expression to an NFA…
Case 2: Concatenation: for the regular expression sr
[Figure: the NFA for s and the NFA for r connected in sequence by an ε-edge.]
Case 3: Repetition: for the regular expression r*
[Figure: the NFA for r wrapped in new start and accepting states, with ε-edges that allow skipping r entirely or looping back to repeat it.]
40
From RE to NFA: Exercises
Construct NFA for token identifier.
letter(letter|digit)*
Construct NFA for the following regular
expression:
(a|b)*abb
41
From an NFA to a DFA
(subset construction algorithm)
Input: NFA N
Output: DFA D (both accept the same language)
Rules:
The start state of D is ε-closure(s0),
where s0 is the start state of N.
The start state of D is initially unmarked.
42
NFA to a DFA…
ε- closure
ε-closure(S') is the set of states with the following
characteristics:
1- S' ∈ ε-closure(S') itself
2- if t ∈ ε-closure(S') and there is an edge labeled
ε from t to v, then v ∈ ε-closure(S')
3- Repeat step 2 until no more states can be added
to ε-closure(S').
E.g: for NFA of (a|b)*abb
ε-closure (0)= {0, 1, 2, 4, 7}
ε-closure (1)= {1, 2, 4}
43
NFA to a DFA…
Algorithm
While there is an unmarked state X = {s0, s1, s2, ..., sn} of D do
Begin
    Mark X
    For each input symbol a do
    Begin
        Let T be the set of states to which there is a transition on a from some state si in X
        Y = ε-closure(T)
        If Y has not been added to the set of states of D then
            add Y as an unmarked state of D
        Add a transition from X to Y labeled a, if not already present
    End
End
44
NFA for identifier: letter(letter|digit)*
[Figure: Thompson-style NFA for letter(letter|digit)* with states 0–8: 0 goes to 1 on letter, 3 goes to 4 on letter, 5 goes to 6 on digit; ε-edges link 1, 2, 7 and 8 around the (letter|digit)* loop, including a skip edge into the accepting state 8 and a loop back from 7; state 8 is the accepting state.]
45
NFA to a DFA…
Example: Convert the following NFA into the corresponding
DFA: letter(letter|digit)*

A = {0}
B = {1, 2, 3, 5, 8}
C = {4, 7, 2, 3, 5, 8}
D = {6, 7, 8, 2, 3, 5}

[Resulting DFA: start state A; A goes to B on letter; B goes to C on letter and to D on digit; C goes to C on letter and to D on digit; D goes to C on letter and to D on digit; B, C and D are accepting.]
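A hedged C sketch of ε-closure and the subset construction applied to this identifier NFA (illustrative only; the ε-edge table below is an assumption consistent with the ε-closures used above, and state sets are stored as bitmasks). Running it reproduces the DFA states A–D.

#include <stdio.h>

#define NSTATES 9                     /* NFA states 0..8, accepting state 8 */
enum { LETTER = 0, DIGIT = 1, NSYMS = 2 };

/* eps[s]      = bitmask of states reachable from s by one ε-edge (assumed layout) */
/* trans[s][c] = bitmask of states reachable from s on symbol c                    */
static const unsigned eps[NSTATES] = {
    0,                                  /* 0 */
    1u << 2,                            /* 1 */
    (1u << 3) | (1u << 5) | (1u << 8),  /* 2 */
    0,                                  /* 3 */
    1u << 7,                            /* 4 */
    0,                                  /* 5 */
    1u << 7,                            /* 6 */
    (1u << 2) | (1u << 8),              /* 7 */
    0                                   /* 8 */
};
static const unsigned trans[NSTATES][NSYMS] = {
    { 1u << 1, 0 }, { 0, 0 }, { 0, 0 }, { 1u << 4, 0 }, { 0, 0 },
    { 0, 1u << 6 }, { 0, 0 }, { 0, 0 }, { 0, 0 }
};

static unsigned eps_closure(unsigned set) {
    unsigned prev;
    do {                                /* keep adding ε-successors until no change */
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s))
                set |= eps[s];
    } while (set != prev);
    return set;
}

static unsigned move_set(unsigned set, int sym) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s))
            out |= trans[s][sym];
    return out;
}

int main(void) {
    unsigned dstates[64];
    int ndstates = 0, marked = 0;
    dstates[ndstates++] = eps_closure(1u << 0);   /* start = ε-closure({0}) = A */
    while (marked < ndstates) {                   /* process unmarked DFA states */
        unsigned X = dstates[marked++];
        for (int c = 0; c < NSYMS; c++) {
            unsigned Y = eps_closure(move_set(X, c));
            if (Y == 0) continue;                 /* no transition on this symbol */
            int found = 0;
            for (int i = 0; i < ndstates; i++)
                if (dstates[i] == Y) found = 1;
            if (!found) dstates[ndstates++] = Y;  /* add Y as an unmarked state */
        }
    }
    for (int i = 0; i < ndstates; i++) {
        printf("DFA state %c = {", 'A' + i);
        for (int s = 0; s < NSTATES; s++)
            if (dstates[i] & (1u << s)) printf(" %d", s);
        printf(" }%s\n", (dstates[i] & (1u << 8)) ? "  (accepting)" : "");
    }
    return 0;
}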
46
Exercise: convert the NFA of (a|b)*abb into a DFA.
47
The Lexical- Analyzer Generator: Lex
The first phase in a compiler reads the
input source and converts strings in the source
to tokens.
Lex: generates a scanner (lexical analyzer or
lexer) given a specification of the tokens using
REs.
The input notation for the Lex tool is referred to as
the Lex language and
The tool itself is the Lex compiler.
The Lex compiler transforms the input patterns into a
transition diagram and generates code, in a file called
lex.yy.c, that simulates this transition diagram.
48
Lex…
By using regular expressions, we can specify
patterns to lex that allow it to scan and match
strings in the input.
Each pattern in lex has an associated action.
Typically an action returns a token, representing the
matched string, for subsequent use by the parser.
It uses patterns that match strings in the input and
converts the strings to tokens.
49
General Compiler Infra-structure
[Diagram: program source (a stream of characters) → Scanner/tokenizer (built with Lex) → tokens → Parser (built with Yacc) → parse tree → Semantic Routines → annotated/decorated tree → Analysis/Transformations/optimizations → IR (Intermediate Representation) → Code Generator → assembly code; the symbol and literal tables are shared by the phases.]
50
Scanner, Parser, Lex and Yacc
51
Generating a Lexical Analyzer using Lex
Lex is a scanner generator: it takes a lexical specification as
input and produces a lexical analyzer written in C.
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out (the lexical analyzer) → sequence of tokens
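As an illustrative sketch (not from the slides), a small C driver that could be compiled and linked with the generated lex.yy.c; it assumes the usual Lex/Flex conventions that yylex() returns the next token code (0 at end of input) and that yytext holds the matched lexeme (some Lex versions declare yytext as a char array rather than a pointer).

#include <stdio.h>

extern int yylex(void);     /* provided by the scanner generated in lex.yy.c */
extern char *yytext;        /* lexeme matched by the most recent rule        */

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)                  /* 0 signals end of input */
        printf("token code %d, lexeme \"%s\"\n", tok, yytext);
    return 0;
}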
52
Implementation
A DFA can be implemented by a 2D table T
One dimension is “states”
Other dimension is “input symbols”
For every transition Si -a-> Sk define T[i, a] = k
DFA “execution”
If in state Si on input a, read T[i, a] = k and
move to state Sk
Very efficient
53
Table Implementation of a DFA
[Figure: a DFA with states S, T and U; from every state, input 0 leads to T and input 1 leads to U.]

State    0    1
S        T    U
T        T    U
U        T    U
54
Implementation (Cont.)
NFA -> DFA conversion is at the heart of
tools such as flex or jflex
But, DFAs can be huge
In practice, flex-like tools trade off speed
for space in the choice of NFA and DFA
representations
55
Readings
Chapter 3 of the textbook: Aho et al., Compilers: Principles, Techniques, and Tools, 2nd edition.