Chapter 2: Lexical Analysis/Scanner
Compiler Design (Comp382)
By Esubalew Alemneh
Session 1
Lexical Analyzer
The Role of the Lexical Analyzer
What is a token?
Tokens, Patterns and Lexemes
Token Attributes
Lexical Errors
Input Buffering
Specification of Tokens
Recognition of Tokens
Lexical Analyzer
Lexical analysis is the process of converting a sequence of
characters into a sequence of tokens.
A program or function which performs lexical analysis is called a
lexical analyzer, lexer, or scanner.
A lexer often exists as a single function which is called by a
parser (syntax analyzer) or another function.
The tasks done by a scanner are:
Grouping input characters into tokens
Stripping out comments, white space, and other separators
Correlating error messages with the source program
Macro expansion
Passing tokens to the parser
The Role of a Lexical Analyzer
[Figure: the lexical analyzer reads characters from the source program; each time the parser requests the next token, it passes back a token and attribute value; both consult the symbol table for id entries.]
Lexical Analyzer
Sometimes lexical analyzers are divided into a cascade of two
phases: scanning and lexical analysis.
Advantages of separating lexical analysis from parsing are:
Simpler design - separating the two analyses gives
High cohesion / low coupling
Improved maintainability
Compiler efficiency
Liberty to apply specialized techniques that serve only lexical tasks,
not the whole parsing
Speedup of reading input characters using specialized buffering
techniques
Compiler portability (e.g., Linux to Windows)
Input device peculiarities are restricted to the lexical analyzer
Enables integration of third-party lexers (lexical analysis tools)
What is a token?
Tokens correspond to sets of strings:
Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits, e.g., 123 (but not 123.45, which contains a dot)
Keyword: else, if, begin, ...
Whitespace: a non-empty sequence of blanks, newlines, and tabs
Symbols:
+, -, *, /, =, <, >, ->, ...
Char/string literals: "Hello", 'c'
Comments, white space
Terminal symbols in a grammar
What is a token? Example
What do we want to do?
Example:
if (i == j)
    z = 0;
else
    z = 1;
The input is just a string of characters:
\t if (i == j) \n \t \t z = 0;\n \t else \n \t \t z = 1;
Goal: Partition input string into substrings
Where the substrings are tokens
Tokens: if, (, i, ==, j, ), z, =, 0, ;, else, 1
Tokens, Patterns and Lexemes
Lexeme
Actual sequence of characters that matches a pattern and has a given
token class.
Examples: Name, Data, x, 345, 2, 0, 629, ...
Token
A classification for a common set of strings
Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen,....
One token for all identifiers
Pattern
The rules that characterize the set of strings for a token
Examples: [0-9]+
identifier:
([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*
Tokens, Patterns and Lexemes: Example
Token Attributes
More than one lexeme can match a pattern, so
the lexical analyzer must provide information about the particular
lexeme that matched.
For example, the pattern for token number matches 0, 1, 934, and so on,
but the code generator must know which lexeme was found in the
source program.
Thus, the lexical analyzer returns to the parser a token name and an
attribute value.
For each lexeme the following type of output is produced
(token-name, attribute-value)
attribute-value points to an entry in the symbol table for this token
Example of Attribute Values
E = M * C ** 2
<id, pointer to symbol table entry for E>
<assign_op>
<id, pointer to symbol table entry for M>
<mult_op>
<id, pointer to symbol table entry for C>
<exp_op>
<number, integer value 2>
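A minimal sketch of how such (token-name, attribute-value) pairs might be represented in C; the type and field names here are illustrative, not from the slides:

    /* Hypothetical token representation; names are illustrative. */
    enum TokenName { ID, NUMBER, ASSIGN_OP, MULT_OP, EXP_OP };

    struct Token {
        enum TokenName name;      /* the token class returned to the parser */
        union {
            int  sym_index;       /* id: index of its symbol-table entry */
            long int_value;       /* number: integer value of the lexeme */
        } attr;                   /* attribute; its meaning depends on name */
    };

With this layout, the scanner would return {ID, sym_index of E} for E and {NUMBER, 2} for the literal 2.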
Lexical Errors
A lexical analyzer can't detect all errors; it needs other
components.
In what situations do errors occur?
When none of the patterns for tokens matches any prefix of the
remaining input.
However, look at: fi(a==f(x))
This generates no lexical error in C: fi is a valid lexeme for the token id
(perhaps a misspelled if, perhaps an undeclared function), so subsequent
phases of the compiler generate the error.
Possible error recovery actions:
Deleting or Inserting Input Characters
Replacing or Transposing Characters
Or, skip over to the next separator to ignore the problem
Input Buffering
Processing a large number of characters during the compilation of
a large source program consumes time, due to the overhead required to
process each input character.
How can we speed the reading of the source program?
Specialized buffering techniques have been developed:
a two-buffer scheme that handles large lookaheads safely
a sentinel improvement that saves the time of checking buffer ends
Buffer Pair
Buffer size N, N = size of a disk block (e.g., 4096 bytes)
Read N characters into a buffer with
one system call, i.e., not one call per character
If fewer than N characters are read, a special eof character is placed to mark the end of the source file
Buffer Pair
Two pointers into the input are maintained:
lexemeBegin marks the beginning of the current lexeme
forward scans ahead until a pattern match is found
Initially both pointers point to the first character of the next lexeme.
The forward pointer scans ahead; when a lexeme is found, forward has
moved one position past its last character,
so forward must be retracted one position to its left.
After processing the lexeme, both pointers are set to the character
immediately after the lexeme.
Sentinels
In the preceding scheme, for each character read we make two tests:
one to check whether we have moved off the end of a buffer, and
one to determine what character was read.
We can combine the two tests if we extend each buffer to hold a
sentinel character (e.g. eof ) at the end:
Note that eof retains its use as a marker for the end of the entire
input.
Any eof that appears other than at the end of a buffer means that
the input is at an end.
Sentinels: Algorithm for advancing forward
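The slide's figure is not reproduced here; below is a self-contained C sketch of the same idea following the standard two-buffer scheme. The buffer layout, helper names, and the use of EOF's byte as the sentinel are assumptions of this sketch:

    #include <stdio.h>

    #define N 4096                      /* each buffer half holds one disk block */
    static char buf[2 * N + 2];         /* two halves, each followed by a sentinel */
    static char *forward;               /* the scanning pointer */
    static FILE *src;                   /* the source file */
    /* initialization (not shown): src = fopen(...); reload(buf); forward = buf; */

    /* Refill the half starting at p; place the sentinel right after the data. */
    static void reload(char *p) {
        size_t n = fread(p, 1, N, src);
        p[n] = (char)EOF;               /* assumes EOF's byte never occurs in source */
    }

    /* Advance forward one character; only sentinel hits pay the boundary test. */
    static int advance(void) {
        char c = *forward++;
        if (c != (char)EOF)
            return (unsigned char)c;    /* common case: a single test per character */
        if (forward == buf + N + 1) {           /* sentinel at end of first half */
            reload(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) { /* sentinel at end of second half */
            reload(buf);
            forward = buf;
        } else {
            return EOF;                 /* eof inside a half: real end of input */
        }
        return advance();               /* re-read at the refilled position */
    }

Note how the common case costs one comparison; the expensive end-of-buffer check runs only when the sentinel is actually read.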
Specifying Tokens
Two issues in lexical analysis:
How to specify tokens? And how to recognize the tokens given a token
specification (i.e., how to implement the nexttoken() routine)?
How to specify tokens:
Tokens are specified by regular expressions.
Regular Expressions
Regular expressions represent patterns of strings of characters.
Although they cannot express all possible patterns, they are very effective in specifying
the types of patterns that we actually need for tokens.
The set of strings generated by a regular expression r is denoted L(r).
Strings and Alphabets
alphabet: a finite set of symbols, e.g., {a, b, c}
A string/sentence/word over an alphabet is a finite sequence of
symbols drawn from that alphabet
Operations:
concatenation of strings x and y: xy; the empty string ε is the identity: xε = εx = x
exponentiation: s^0 = ε, s^i = s^(i-1)s for i > 0
Do you remember words related to parts of a string:
prefix VS suffix
proper prefix VS proper suffix,
substring
subsequence
Languages
A language is any countable set of strings over some fixed
alphabet.
Example alphabets:
{0,1}
{a,b,c}
{A, ..., Z}
{A-Z, a-z, 0-9, +, -, <, >, ...}
Example languages:
{0, 10, 100, 1000, 10000, ...}
{0, 1, 100, 000, 111, ...}
{abc, aabbcc, aaabbbccc, ...}
{TEE, FORE, BALL}
{FOR, WHILE, GOTO}
{All legal PASCAL programs}
{All grammatically correct English sentences}
Special languages: ∅, the empty language, which contains no strings at all,
and {ε}, which contains the empty string only.
Operations on Languages
L = {A, B, C, D }   D = {1, 2, 3}
Union: L ∪ D = {A, B, C, D, 1, 2, 3 }
Concatenation: LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
Exponentiation: L2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, ..., DD } (16 strings)
L4 = L2 L2 = ??
Kleene closure: L* = { all possible concatenations of strings of L, plus ε }
Positive closure: L+ = L* − {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
Regular Expressions
Formal definition of a regular expression:
Given an alphabet Σ,
(1) ε is a regular expression that denotes {ε}, the set that
contains the empty string.
(2) For each a ∈ Σ, a is a regular expression denoting {a}, the set
containing the string a.
(3) If r and s are regular expressions, then
o r | s is a regular expression denoting L(r) ∪ L(s)
o rs is a regular expression denoting L(r)L(s)
o (r)* is a regular expression denoting (L(r))*
A regular expression is defined together with the language it
denotes.
Regular expressions are used to denote regular languages.
Regular Expressions: Example
All strings over the English alphabet that start with tab or end with bat:
tab(A|...|Z|a|...|z)* | (A|...|Z|a|...|z)*bat
All strings over the English alphabet in which 1, 2, 3 occur in ascending order:
(A|...|Z)*1(A|...|Z)*2(A|...|Z)*3(A|...|Z)*
Regular Expressions: Algebraic Properties
Standard laws include: r|s = s|r; r|(s|t) = (r|s)|t; r(st) = (rs)t; r(s|t) = rs|rt; εr = rε = r; (r*)* = r*.
Regular Definition
Gives names to regular expressions to construct more complicated
regular expressions.
If Σ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form:
d1 → r1
d2 → r2
d3 → r3
...
where:
Each di is a new symbol, not in Σ and not the same as any other of
the d's
Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.
Regular Definition: Example
Example
C identifiers are strings of letters, digits, and underscores. The regular
definition for the language of C identifiers:
letter → A | B | C | ... | Z | a | b | ... | z | _
digit → 0 | 1 | 2 | ... | 9
id → letter ( letter | digit )*
Unsigned numbers (integer or floating point) are strings such as 5280,
0.01234, 6.336E4, or 1.89E-4. The regular definition:
digit → 0 | 1 | 2 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
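As a quick sanity check of this definition, here is a small C program that tests strings against an equivalent POSIX extended regular expression; this program is an illustration, not part of the original slides:

    #include <regex.h>
    #include <stdio.h>

    /* POSIX ERE equivalent of: number -> digits optionalFraction optionalExponent */
    static const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";

    static int is_unsigned_number(const char *s) {
        regex_t re;
        int ok;
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
            return 0;                  /* pattern failed to compile */
        ok = (regexec(&re, s, 0, NULL, 0) == 0);
        regfree(&re);
        return ok;
    }

    int main(void) {
        const char *tests[] = { "5280", "0.01234", "6.336E4", "1.89E-4", "1.89E-" };
        for (int i = 0; i < 5; i++)
            printf("%-8s -> %s\n", tests[i],
                   is_unsigned_number(tests[i]) ? "number" : "no match");
        return 0;
    }

The first four test strings are the examples above; the last one fails because the exponent part requires digits after the sign.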
Regular Definition
Shorthand notations:
+: one or more
r* = r+ | ε and r+ = rr*
?: zero or one
r? = r | ε
[range]: set range of characters
[A-Z] = A|B|C|...|Z
Example
C variable name: [A-Za-z][A-Za-z0-9]*
Write regular definitions for
Phone numbers of the form (510) 643-1481
number → ( area ) exchange - phone
phone → digit digit digit digit
exchange → digit digit digit
area → digit digit digit
digit → 0 | 1 | 2 | 3 | ... | 9
Email Addresses (Exercise)
Regular Grammars
Basically, regular grammars are used to represent regular languages.
Regular grammars are either left or right linear:
Right regular grammars:
rules of the forms A → ε, A → a, A → aB; A, B: variables and a:
terminal
Left regular grammars:
rules of the forms A → ε, A → a, A → Ba; A, B: variables and a:
terminal
Example: S → aS | bA, A → cA | ε
This grammar produces the language produced by the regular
expression a*bc*
Recognition of tokens
Recognition of tokens is the second issue in lexical analysis
How to recognize the tokens given a token specification?
Given the following grammar of branching statements (as in the standard textbook example):
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | number
The terminals of the grammar, which are if, then, else, relop,
id, and number, are the names of tokens.
The patterns for the given tokens are given by regular definitions:
digit → [0-9]        digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]    id → letter (letter | digit)*
if → if    then → then    else → else
relop → < | > | <= | >= | = | <>
Recognition of tokens
Note that
the lexical analyzer also has the job of stripping out whitespace, by
recognizing the "token" ws defined by:
ws → ( blank | tab | newline )+
Recognition of tokens
The table below shows which token name is returned to the parser and what
attribute value is returned for each lexeme or family of lexemes (the standard token table):
Lexemes     | Token name | Attribute value
ws          | -          | -
if          | if         | -
then        | then       | -
else        | else       | -
any id      | id         | pointer to symbol-table entry
any number  | number     | pointer to symbol-table entry
<           | relop      | LT
<=          | relop      | LE
=           | relop      | EQ
<>          | relop      | NE
>           | relop      | GT
>=          | relop      | GE
Recognition of tokens
Transition diagrams
As an intermediate step in the construction of a lexical analyzer, we
first convert patterns into stylized flowcharts, called "transition
diagrams".
Example: Transition diagram for relop
* is used to indicate states in which
we must retract the input
one position (go back one character)
Recognition of tokens
Mapping transition diagrams into C code:
Token nexttoken() {
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c is white space) state = 0;
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 1:
            c = nextchar();
            if (c == '=') { state = 2; return; }        /* lexeme is <= */
            else if (c == '>') { state = 3; return; }   /* lexeme is <> */
            else { retract(); return; }                 /* lexeme is < */
        case 10: ...
        case 11:
            retract(1); insert(id);
            return;
        }
    }
}
Recognition of tokens
Recognition of Reserved Words and Identifiers
Recognizing keywords and identifiers presents a problem:
keywords like if or then are reserved, but they look like identifiers.
There are two ways we can handle this:
Install the reserved words in the symbol table initially.
If we find an identifier, installID is called to
place it in the symbol table (see the sketch below).
Create separate transition diagrams for each keyword,
e.g., the transition diagram for the reserved word then.
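A minimal C sketch of the first approach; the table contents and function name are illustrative:

    #include <string.h>

    /* Reserved words are "pre-installed"; a scanned identifier is looked up first. */
    static const char *reserved[] = { "if", "then", "else", NULL };

    /* Returns the keyword's index, or -1 if the lexeme is an ordinary id. */
    int lookup_reserved(const char *lexeme) {
        for (int i = 0; reserved[i] != NULL; i++)
            if (strcmp(lexeme, reserved[i]) == 0)
                return i;
        return -1;   /* not reserved: install it as an identifier (installID) */
    }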
Session 2
Finite Automata
Nondeterministic Finite Automata
Deterministic Finite Automata
From Regular Expressions to Automata
Conversion of NFA to DFA
State minimization
Design of a Lexical-Analyzer Generator
Implementing Scanners
Finite Automata
At the heart of the transition diagram is the formalism known as
finite automata
A finite automaton is a finite-state transition diagram that can be
used to model the recognition of a token type specified by a regular
expression
Finite automata are recognizers; they simply say "yes" or "no"
about each possible input string.
A finite automaton can be a
Non-deterministic finite automaton (NFA)
Deterministic finite automaton (DFA)
DFA & NFA are capable of recognizing the same languages
Nondeterministic Finite Automata
NFA: Transition Graph and Transition tables
An NFA can be diagrammatically represented by a labeled
directed graph called a transition graph
The set of states =
{A,B, C, D}
Input symbol = {0, 1}
Start state is A,
accepting state is D
The transition graph shown on the
preceding slide can be
represented by the transition
table below (rows: state, columns: input symbol):
State | 0     | 1
A     | {A,B} | {A,C}
B     | {D}   | --
C     | --    | {D}
D     | {D}   | {D}
Nondeterministic Finite Automata
Acceptance of Input Strings by Automata
An NFA accepts an input string str iff there is some path in the finite-state
transition diagram from the start state to some final state such that the edge
labels along this path spell out str
The language recognized by an NFA is the set of strings it accepts
Example
The language recognized by this NFA is aa*|bb*
Deterministic Finite Automata
A deterministic finite automaton (DFA) is a special case of an
NFA where:
There are no moves on input ε, and
for each state q and input symbol a, there is exactly one edge out of q
labeled a.
Example
The language recognized by
this DFA is also
(b*aa*(aa*b)+)
Implementing a DFA
Let us assume that the end of a string is marked with a special
symbol (say eos).
The algorithm for recognition is as follows (an efficient
implementation):
q = q0                       { start from the initial state }
c = nextchar                 { get the next character from the input string }
while (c != eos) do          { do until the end of the string }
begin
    q = move(q, c)           { transition function }
    c = nextchar
end
if (q in F) then             { if q is an accepting state }
    return "yes"
else
    return "no"
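A runnable C version of this loop, table-driven for the DFA that recognizes (a|b)*abb; the transition table and names are illustrative:

    #include <stdio.h>

    /* Transition table of the DFA for (a|b)*abb.
       States 0..3, state 3 accepting; column 0 = 'a', column 1 = 'b'. */
    static const int dtran[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0},   /* state 3 */
    };

    int accepts(const char *s) {
        int q = 0;                          /* q = q0 */
        for (; *s != '\0'; s++) {           /* while c != eos */
            if (*s != 'a' && *s != 'b')
                return 0;                   /* symbol not in the alphabet */
            q = dtran[q][*s - 'a'];         /* q = move(q, c) */
        }
        return q == 3;                      /* is q in F? */
    }

    int main(void) {
        printf("%d\n", accepts("aabb"));    /* 1: ends in abb */
        printf("%d\n", accepts("abab"));    /* 0 */
        return 0;
    }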
Implementing an NFA
Q = ε-closure({q0})            { set of all states accessible from q0 by ε-transitions }
c = nextchar
while (c != eos) do
begin
    Q = ε-closure(move(Q, c))  { set of all states accessible from a state in Q by a transition on c }
    c = nextchar
end
if (Q ∩ F != ∅) then           { if Q contains an accepting state }
    return "yes"
else
    return "no"
This algorithm is not efficient: each step manipulates whole sets of NFA states.
Computing ε-closure(Q)
ε-closure(Q): the set of states reachable from some state q in Q on ε-
transitions alone.
move(Q, c): the set of states to which there is a transition on input symbol
c from some state q in Q.
Computing ε-closure(Q) ... Example
Given the NFA below, find ε-closure(s),
i.e., all states that can be
reached from s through ε-transitions
only.
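A small worklist implementation of ε-closure in C; the NFA, its ε-edges, and all names here are made up for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    #define NSTATES 4
    /* ε-successor lists, one per state; -1 ends each list (illustrative NFA). */
    static const int eps[NSTATES][NSTATES] = {
        { 1, -1 },    /* state 0 --ε--> 1 */
        { 2, -1 },    /* state 1 --ε--> 2 */
        { -1 },       /* state 2: no ε-edges */
        { -1 },       /* state 3: no ε-edges */
    };

    /* Extends in_set to its ε-closure, in place, using a worklist (stack). */
    void eps_closure(bool in_set[NSTATES]) {
        int stack[NSTATES], top = 0;
        for (int q = 0; q < NSTATES; q++)
            if (in_set[q]) stack[top++] = q;        /* start from the given set */
        while (top > 0) {
            int q = stack[--top];
            for (int i = 0; i < NSTATES && eps[q][i] >= 0; i++)
                if (!in_set[eps[q][i]]) {           /* state newly reached via ε */
                    in_set[eps[q][i]] = true;
                    stack[top++] = eps[q][i];
                }
        }
    }

    int main(void) {
        bool set[NSTATES] = { true };               /* start with {0} */
        eps_closure(set);
        for (int q = 0; q < NSTATES; q++)
            if (set[q]) printf("%d ", q);           /* prints: 0 1 2 */
        printf("\n");
        return 0;
    }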
From Regular Expressions to NFA
For each kind of RE there is a corresponding NFA,
i.e., any regular expression can be converted to an NFA that defines the
same language.
The algorithm is syntax-directed, in the sense that it works
recursively up the parse tree for the regular expression.
For each sub-expression the algorithm constructs an NFA with a
single accepting state.
Algorithm: The McNaughton-Yamada-Thompson
algorithm to convert a regular expression to an NFA.
INPUT: A regular expression r over alphabet Σ.
OUTPUT: An NFA N accepting L(r).
From Regular Expressions to NFA
Method: Begin by parsing r into its constituent sub-expressions. The rules
for constructing an NFA consist of basis rules for handling sub-expressions
with no operators, and inductive rules for constructing larger NFA's from
the NFA's for the immediate sub-expressions of a given expression.
For expression ε, construct the NFA
For any symbol a in Σ, construct the NFA
NFA for the union of two regular expressions
Ex: a|b
From Regular Expressions to NFA
NFA for r1r2
NFA for r*
Example: (a|b)*
From Regular Expressions to NFA
Example: Constructing NFA for regular expression
r= (a|b)*abb
Step 1: construct NFAs for a and b
Step 2: construct a | b
Step 3: construct (a|b)*
Step 4: concatenate it with a, then b,
then b
Conversion of NFA to DFA
Why?
A DFA is difficult to construct directly from REs
An NFA is difficult to represent in a computer program and inefficient to
simulate
So: RE → NFA → DFA
Conversion algorithm: subset construction
The idea is that each DFA state corresponds to a set of NFA states.
After reading input a1a2...an, the DFA is in the state that represents the
subset Q of the states of the NFA that are reachable from the NFA's start state
along some path labeled a1a2...an.
INPUT: An NFA N
OUTPUT: A DFA D accepting the same language as N.
ALGORITHM:
Subset Construction Algorithm
Dstates := { ε-closure({q0}) }   // initially the only state, unmarked
while (there is an unmarked state Q in Dstates) do
begin
    mark Q;
    for (each input symbol a) do
    begin
        U := ε-closure(move(Q, a));
        if (U is not in Dstates) then
            add U as an unmarked state to Dstates;
        Dtran[Q, a] := U;
    end
end
Subset Construction Algorithm
Example NFA to DFA
The start state A of the equivalent DFA is ε-closure(0),
A = {0,1,2,4,7},
since these are exactly the states reachable from state 0 via a
path all of whose edges have label ε.
Note that a path can have zero edges, so state 0 is reachable from
itself by an ε-labeled path.
Subset Construction Algorithm
The input alphabet is {a, b}. Thus, our first step is to mark A and
compute
Dtran[A, a] = ε-closure(move(A, a)) and
Dtran[A, b] = ε-closure(move(A, b)).
Among the states 0, 1, 2, 4, and 7, only 2 and 7 have transitions
on a, to 3 and 8, respectively.
Thus, move(A, a) = {3,8}. Also, ε-closure({3,8}) = {1,2,3,4,6,7,8}, so
we call this set B and
let Dtran[A, a] = B.
Next we compute Dtran[A, b]. Among the states in A, only 4 has a
transition on b, and it goes to 5.
Subset Construction Algorithm
C = ε-closure({5}) = {1,2,4,5,6,7}, and Dtran[A, b] = C.
If we continue this process with the unmarked sets B and C, we eventually
reach a point where all the states of the DFA are marked.
NFA to DFA conversion: Example 2
[Figure: NFA with states 0-3 (no ε-moves) — state 0 loops on a and b,
0 --a--> 1, 1 --b--> 2, 2 --b--> 3; start state 0, accepting state 3,
recognizing (a|b)*abb]
ε-closure(0) = {0}
move(0, a) = {0,1}
move(0, b) = {0}
move({0,1}, a) = {0,1}
move({0,1}, b) = {0,2}
move({0,2}, a) = {0,1}
move({0,2}, b) = {0,3}
New states
A = {0}
B = {0,1}
C = {0,2}
D = {0,3}
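A compact C sketch of subset construction for this ε-free NFA, representing each state set as a bitmask (the encoding and names are illustrative; in general each U would also pass through ε-closure):

    #include <stdio.h>

    typedef unsigned int Set;       /* bit q set <=> NFA state q is in the set */

    #define NNFA 4
    /* delta[q][c]: set of NFA successor states; column 0 = 'a', 1 = 'b'. */
    static const Set delta[NNFA][2] = {
        { 1u<<0 | 1u<<1, 1u<<0 },   /* 0 --a--> {0,1}, 0 --b--> {0} */
        { 0,             1u<<2 },   /* 1 --b--> {2} */
        { 0,             1u<<3 },   /* 2 --b--> {3} */
        { 0,             0     },   /* 3: no moves */
    };

    static Set move(Set Q, int c) {
        Set R = 0;
        for (int q = 0; q < NNFA; q++)
            if (Q & (1u << q)) R |= delta[q][c];
        return R;
    }

    int main(void) {
        Set dstates[1 << NNFA];     /* DFA states discovered so far */
        int n = 0, marked = 0;
        dstates[n++] = 1u << 0;     /* start state {0}; no ε-moves here */
        while (marked < n) {        /* pick an unmarked DFA state Q */
            Set Q = dstates[marked++];
            for (int c = 0; c < 2; c++) {
                Set U = move(Q, c);
                int found = 0;
                for (int i = 0; i < n; i++)
                    if (dstates[i] == U) found = 1;
                if (!found) dstates[n++] = U;   /* add U, unmarked */
                printf("Dtran[%#x, %c] = %#x\n", Q, 'a' + c, U);
            }
        }
        return 0;
    }

In the printed output, the bitmasks 0x1, 0x3, 0x5, 0x9 encode A = {0}, B = {0,1}, C = {0,2}, D = {0,3} above.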
NFA to DFA conversion: Example 2
[Figure: the resulting DFA — A --a--> B, A --b--> A; B --a--> B, B --b--> C;
C --a--> B, C --b--> D; D --a--> B, D --b--> A; D is accepting]
NFA to DFA conversion: Example 3
[Figure: NFA with ε-transitions]
Due to ε-transitions, we
must compute ε-closure(S)
Example: ε-closure(0) =
{0,3}
Dstates := ε-closure(1) = {1,2}
U := ε-closure(move({1,2}, a)) = {3,4,5}
Add {3,4,5} to Dstates
U := ε-closure(move({1,2}, b)) = {}
ε-closure(move({3,4,5}, a)) = {5}
ε-closure(move({3,4,5}, b)) = {4,5}
ε-closure(move({4,5}, a)) = {5}
ε-closure(move({4,5}, b)) = {5}
NFA to DFA conversion: Example 3
[Figure: the resulting DFA]
Transition table (-- marks the empty set):
State        | a  | b
A = {1,2}    | B  | --
B = {3,4,5}  | D  | C
C = {4,5}    | D  | D
D = {5}      | -- | --
NFA to DFA conversion: Exercise
Convert the ε-NFA shown below into a DFA, using subset construction
Putting the Pieces Together
Regular Expression → [RE → NFA Conversion] → [NFA → DFA Conversion] → DFA
Input String w → [DFA Simulation] → Yes, if w ∈ L(R); No, if w ∉ L(R)
State minimization
If we implement a lexical analyzer as a DFA, we would generally
prefer a DFA with as few states as possible,
since each state requires entries in the table that describes the lexical
analyzer.
There is always a unique minimum state DFA for any regular
language.
Moreover, this minimum-state DFA can be constructed from any DFA
for the same language by grouping sets of equivalent states.
An Algorithm for Minimizing the Number of States of a DFA
INPUT: A DFA D with set of states S, input alphabet Σ, start state s0, and
set of accepting states F.
OUTPUT: A DFA D' accepting the same language as D and having as
few states as possible.
State minimization
An Algorithm for Minimizing the Number of States of a DFA
Example: the input set is {a,b}, with the DFA below
[Figure: DFA with states A, B, C, D, E; E is the only accepting state]
1. Initially the partition Π consists of the two groups:
non-final states {A, B, C, D},
final state {E}
2. To construct Πnew, group {E} cannot be split.
3. Group {A, B, C, D} can be split into {A, B, C}{D}, since D goes to a
member of group {E} on input b while A, B, C do not; Πnew for this
round is {A, B, C}{D}{E}.
Minimized DFA
In the next round, split {A, B, C} into {A, C}{B}, since A and C
each go to a member of {A, B, C} on input b, while B goes to a
member of another group, {D}. Thus, after the second round, Πnew
= {A, C}{B}{D}{E}.
For the third round, we cannot split the one remaining group with
more than one state, since A and C each go to the same state (and
therefore to the same group) on each input. Πfinal = {A,
C}{B}{D}{E}. The minimum-state version of the given DFA has four states.
[Figure: the minimized DFA — {A,C} --a--> B, {A,C} --b--> {A,C};
B --a--> B, B --b--> D; D --a--> B, D --b--> E; E --a--> B, E --b--> {A,C};
E is accepting]
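A self-contained C sketch of this partition refinement for the example above; the state numbering and transition table are reconstructed from the splits described, and all names are illustrative:

    #include <stdio.h>

    /* DFA of the example: states A..E = 0..4, alphabet {a,b}, E accepting. */
    #define NS 5
    #define NA 2
    static const int delta[NS][NA] = {
        {1, 2},  /* A: a->B, b->C */
        {1, 3},  /* B: a->B, b->D */
        {1, 2},  /* C: a->B, b->C */
        {1, 4},  /* D: a->B, b->E */
        {1, 2},  /* E: a->B, b->C */
    };
    static const int accepting[NS] = {0, 0, 0, 0, 1};

    int main(void) {
        int group[NS], next[NS];
        for (int q = 0; q < NS; q++)
            group[q] = accepting[q];          /* initial partition: final / non-final */
        for (int changed = 1; changed; ) {
            changed = 0;
            int ngroups = 0;
            for (int q = 0; q < NS; q++) next[q] = -1;
            /* Two states stay together iff they are in the same group and
               their successors fall in the same groups on every symbol. */
            for (int q = 0; q < NS; q++) {
                if (next[q] >= 0) continue;
                next[q] = ngroups;
                for (int r = q + 1; r < NS; r++) {
                    if (next[r] >= 0 || group[r] != group[q]) continue;
                    int same = 1;
                    for (int c = 0; c < NA; c++)
                        if (group[delta[q][c]] != group[delta[r][c]]) same = 0;
                    if (same) next[r] = ngroups;
                }
                ngroups++;
            }
            for (int q = 0; q < NS; q++) {
                if (next[q] != group[q]) changed = 1;
                group[q] = next[q];           /* adopt the refined partition */
            }
        }
        for (int q = 0; q < NS; q++)
            printf("state %c -> group %d\n", 'A' + q, group[q]);
        return 0;
    }

Running it reproduces the result above: A and C land in one group, and B, D, E each in their own, for four states total.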
Minimized DFA: Exercise
Minimize the states of the following NFA
(first convert it to a DFA)
Some Other Issues in Lexical Analyzer
The lexical analyzer has to recognize the longest possible string.
Ex: the identifier newval must be returned as one token, not as n, ne,
new, newv, or newva (see the sketch after this list).
What is the end of a token? Is there any character which marks the
end of a token? It is normally not defined.
If the number of characters in a token is fixed, there is no problem:
+ and - are ok, but < may be < or <> (in Pascal)
The end of an identifier: the characters that cannot occur in an identifier
mark the end of the token.
We may need a lookahead:
In Prolog: p :- X is 1.   versus   p :- X is 1.5.
The dot followed by a whitespace character can mark the end of a
number. But if that is not the case, the dot must be treated as
part of the number.
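A small C illustration of the longest-match (maximal munch) rule for identifiers; the function and its interface are made up for illustration:

    #include <ctype.h>
    #include <stdio.h>

    /* Consume the longest identifier at the start of input; returns its length. */
    int scan_identifier(const char *input, char *lexeme, int max) {
        int n = 0;
        if (!isalpha((unsigned char)input[0]))
            return 0;                        /* identifiers start with a letter */
        while (input[n] != '\0' && n < max - 1 &&
               (isalnum((unsigned char)input[n]) || input[n] == '_')) {
            lexeme[n] = input[n];            /* keep extending while possible */
            n++;
        }
        lexeme[n] = '\0';
        return n;
    }

    int main(void) {
        char lex[64];
        scan_identifier("newval = 5", lex, (int)sizeof lex);
        printf("%s\n", lex);                 /* prints newval, not n, ne, new, ... */
        return 0;
    }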
Implementing the Scanner
We have learnt the formalism for lexical analysis
Regular expression (RE)
How to actually get the lexical analyzer?
Solution 1: implement using a tool - Lex (for C), Flex (for
C++), JLex (for Java)
The programmer specifies the interesting tokens using REs
The tool generates the source code from the given REs
Solution 2: to write the code starting from scratch
This is essentially the same code that a tool generates
The code is table-driven and is based on a finite state automaton
Lex: a Tool for Lexical Analysis
Big difference from your previous coding experience
Writing REs instead of the code itself
Writing actions associated with each RE
Internal Structure of Lex
The final states of the DFA
are associated with actions
Details about Lex will be discussed in the laboratory sessions; a minimal specification is sketched below.
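A minimal Lex specification, just to show the shape of the input the tool expects; the token codes and rules are illustrative:

    %{
    /* Definitions section: C declarations available to the actions. */
    #define ID      1
    #define NUMBER  2
    %}

    digit   [0-9]
    letter  [A-Za-z_]

    %%
    [ \t\n]+                      { /* skip whitespace */ }
    {digit}+(\.{digit}+)?         { return NUMBER; }
    {letter}({letter}|{digit})*   { return ID; }
    .                             { /* ignore any other character */ }
    %%

    int yywrap(void) { return 1; }

From a specification like this, Lex generates a table-driven scanner (yylex) whose DFA final states run the attached actions.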
Assignment 1 (10%)
1. Implement a DFA and an NFA. The program should accept states,
alphabet, transition function, start state, and set of final states. Then
the program should check if a string provided by a user is accepted
or rejected. You can use either Java or C++ for the
implementation.
2. Convert the following NFA to DFA.
3. Minimize the number of states of the DFA you obtained as an
answer for question 2.
Reading Assignment
Constructing a DFA directly
from a regular expression