
Chapter – 2

Lexical Analysis/Scanner

Chapter – 2 : Lexical Analysis 1 Bahir Dar Institute of Technology


Session 1
▪ Lexical Analyzer
▪ The Role of the Lexical Analyzer
▪ What is a token?
▪ Tokens, Patterns and Lexemes
▪ Token Attributes
▪ Lexical Errors
▪ Input Buffering
▪ Specification of Tokens
▪ Recognition of Tokens



Lexical Analyzer
▪ Lexical analysis is the process of converting a sequence of
characters into a sequence of tokens.
▪ A program or function that performs lexical analysis is called a
lexical analyzer, lexer, or scanner.
▪ A lexer often exists as a single function that is called by a parser
(syntax analyzer) or another function.
▪ The Role of the Lexical Analyzer
• Read the input characters
• Grouping them into lexemes
• Output a sequence of tokens
• Others:
• Macro expansion: if the source program uses a macro
preprocessor, the expansion of macros may also be performed
by the lexical analyzer
The Role of the Lexical Analyzer cont…
• Stripping out comments, whitespace, and other separators
• Correlating error messages with the source program
• E.g. keep track of the number of newline characters seen, so
that a line number can be associated with each error message
• Pass token to parser
• Interact with:
▪ Parser: for syntax analysis
▪ Symbol table: for gathering information regarding the kind of
identifier, to assist in determining the proper token to pass to
the parser



Lexical Analyzer ….
▪ Sometimes lexical analyzers are divided into a cascade of two
phases: scanning and lexical analysis proper.

▪ Scanning consists of the simple processes that do not require
tokenization of the input
• (e.g. deletion of comments and compaction of consecutive
whitespace characters into one).

▪ Lexical analysis proper is the more complex portion, where the
scanner produces the sequence of tokens as
output.
Lexical Analyzer ….
▪ Advantages of separating lexical analysis from other phases
are:
• Simpler design - separating the two analyses
• High cohesion / low coupling
• Improved maintainability
• Compiler efficiency
• Liberty to apply specialized techniques that serve only
lexical tasks, not the whole parsing
• Speedup of reading input characters using specialized
buffering techniques
• Compiler portability (e.g. Linux to Windows)
• Input device peculiarities are restricted to the lexical analyzer
• Enables integration of third-party lexers [lexical analysis tools]
What is a token?
▪ A token is a pair consisting of a token name and an optional
attribute value.
▪ The token name is an abstract symbol representing a kind of lexical
unit
▪ Tokens correspond to sets of strings.
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits, e.g. 123
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, & tabs
• Symbols: +, -, *, /, =, <, >, ->, …
• Char (string) literals: “Hello”, ‘c’
• Comments
• Terminal symbols in a grammar
▪ The token names are the input symbols that the parser processes.



What is a token? …Example
▪ What do we want to do?
• Example:
if (i == j)
z = 0;
else
z = 1;
• The input is just a string of characters:
\t if (i == j) \n \t \t z = 0;\n \t else \n \t \t z = 1;
• Goal: Partition input string into substrings
• Where the substrings are tokens
• Tokens: if, (, i, ==, j, ), z, =, 0, ;, else, 1, …



Tokens, Patterns and Lexemes
▪ Lexeme: Actual sequence of characters that matches a pattern and
has a given token class.
• Examples: Name, Data, x, 345, 2, 0, 629, …
▪ Token
• A classification for a common set of strings
• Examples: Identifier, Integer, Float, Assign, LeftParen,
RightParen,....
• One token for all identifiers:
– e.g. id is the token for different identifiers: a, b, c, …
▪ Pattern: a description of the form that the lexemes of a token may
take.
• The rules that characterize the set of strings for a token
• Examples: [0-9]+
• identifier: ([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*
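As a sketch, the token/lexeme/pattern relationship can be coded directly in C: one function implements the patterns for the tokens id and integer, so that many different lexemes map to the same token class. The names `classify`, `TOK_ID`, and `TOK_INT` are illustrative, not from any real scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token classes (hypothetical names for illustration). */
enum token_class { TOK_ID, TOK_INT, TOK_UNKNOWN };

/* Pattern for id: letter (letter|digit)* ; pattern for integer: digit+ */
enum token_class classify(const char *lexeme) {
    size_t i, n = strlen(lexeme);
    if (n == 0) return TOK_UNKNOWN;
    if (isalpha((unsigned char)lexeme[0])) {
        for (i = 1; i < n; i++)
            if (!isalnum((unsigned char)lexeme[i])) return TOK_UNKNOWN;
        return TOK_ID;   /* lexemes a, b, count, x1 all map to the one token id */
    }
    for (i = 0; i < n; i++)
        if (!isdigit((unsigned char)lexeme[i])) return TOK_UNKNOWN;
    return TOK_INT;
}
```

The point of the sketch: `count`, `x1`, and `a` are distinct lexemes, but `classify` returns the single token class id for all of them.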
Tokens, Patterns and Lexemes: Example



Token Attributes
▪ More than one lexeme can match a pattern
• the lexical analyzer must provide the information about the
particular lexeme that matched.
• For example, the pattern for token number matches both 0, 1,
934, …
▪ But the code generator must know which lexeme was found
in the source program.
• Thus, the lexical analyzer returns to the parser a token name
and an attribute value
▪ For each lexeme the following type of output is produced
(token-name, attribute-value)
➔attribute-value points to an entry in the symbol table for
this token



Example of Attribute Values
▪ E = M * C ** 2 [Fortran source code]
▪ Token names and associated attribute values:
• <id, pointer to symbol table entry for E>
• < assign_op>
• <id, pointer to symbol table entry for M>
• <mult_op>
• <id, pointer to symbol table entry for C>
• <exp_op>
• <number, integer value 2>
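The pair sequence above can be reproduced by a toy C scanner. In this sketch the attribute value is simply the lexeme text rather than a symbol-table pointer, and the struct and function names are illustrative.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* A token is a pair (token-name, attribute-value); here the attribute
   is the lexeme itself instead of a symbol-table pointer. */
struct token { const char *name; char lexeme[16]; };

/* Scan an expression like "E = M * C ** 2"; returns the token count. */
int scan(const char *s, struct token out[], int max) {
    int n = 0;
    while (*s && n < max) {
        if (isspace((unsigned char)*s)) { s++; continue; }
        struct token *t = &out[n];
        int len = 0;
        if (isalpha((unsigned char)*s)) {
            t->name = "id";
            while (isalnum((unsigned char)*s)) t->lexeme[len++] = *s++;
        } else if (isdigit((unsigned char)*s)) {
            t->name = "number";
            while (isdigit((unsigned char)*s)) t->lexeme[len++] = *s++;
        } else if (*s == '*' && s[1] == '*') {     /* must test ** before * */
            t->name = "exp_op"; t->lexeme[len++] = *s++; t->lexeme[len++] = *s++;
        } else if (*s == '*') {
            t->name = "mult_op"; t->lexeme[len++] = *s++;
        } else if (*s == '=') {
            t->name = "assign_op"; t->lexeme[len++] = *s++;
        } else {
            t->name = "unknown"; t->lexeme[len++] = *s++;
        }
        t->lexeme[len] = '\0';
        n++;
    }
    return n;
}
```

Running it on "E = M * C ** 2" yields the seven pairs listed above, in the same order.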



Lexical Errors
▪ Lexical analyzer can’t detect all errors – needs other
components
▪ In what situations do errors occur?
• When no patterns for tokens matches any prefix of the
remaining input (character sequence).
▪ However, consider e.g. fi(a==f(x)) … in a C program:
• A lexical analyzer cannot tell whether fi is a
misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the
token id, the lexical analyzer must return the token id
to the parser.
• So this error will be detected by later phases of the compiler
Lexical Errors
▪ Possible error-recovery actions:
• Panic mode: successive characters are ignored until we reach
a well-formed token at the beginning of what input is left.
• i.e., in this method of recovering from the error, input
characters are discarded one at a time. E.g. int a, 5abcd, sum;
• Deleting or inserting input characters
• Replacing a character by another character
• Transposing two adjacent characters, e.g. fi to if
• Or, skip ahead to the next separator to ignore the problem



Input Buffering
▪ The lexical analyzer scans the characters of the source program one
at a time to discover tokens.
▪ Processing a large number of characters during the compilation of a
large source program consumes time.
▪ To reduce the time and cost of processing a single input
character, specialized buffering techniques have been developed:
• two-buffer scheme - handles large lookaheads safely
• sentinels - save the time of checking buffer ends
▪ Buffer Pair
• Each buffer is of the same size N, and N = the size of a disk
block (e.g. 4096 bytes)
• Read N characters into a buffer using one system read command,
rather than using one system call per character
• If fewer than N characters remain in the input file, a special
character eof marks the end of the source file
Buffer Pair
• Two pointers to the input are maintained:
• lexemeBegin - marks the beginning of the current lexeme
• forward - scans ahead until a pattern match is found
• Initially both pointers point to the first character of the next
lexeme
• The forward pointer scans ahead; when a lexeme is found, forward
is one position past its last character
• so forward must be retracted one position to its left.
• After processing the lexeme, both pointers are set to the
character immediately after the lexeme
Buffer Pair: algorithm
Code to advance the forward pointer:

if forward at end of first half then begin
    reload second half;
    forward := forward + 1;
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half;
end
else forward := forward + 1;
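The reload logic above can be sketched in C. In this sketch a string stands in for the source file, N is kept tiny to force reloads, and all names (`buffer_init`, `next_char`, `reload`) are illustrative.

```c
#include <assert.h>
#include <string.h>

#define N 4              /* half-buffer size; a real scanner would use a disk-block size */
#define EOF_CH '\0'

static char buf[2 * N];          /* the two halves live in one array */
static const char *src;          /* stands in for the source file    */

/* Reload one half of the buffer from the "file". */
static void reload(int half) {
    int i;
    for (i = 0; i < N; i++)
        buf[half * N + i] = *src ? *src++ : EOF_CH;
}

void buffer_init(const char *source) { src = source; reload(0); }

/* Return the character at forward, then advance it per the two-buffer scheme. */
int next_char(int *forward) {
    int c = buf[*forward];
    if (*forward == N - 1) {             /* at end of first half  */
        reload(1);                       /* reload second half    */
        *forward = N;                    /* forward := forward+1  */
    } else if (*forward == 2 * N - 1) {  /* at end of second half */
        reload(0);                       /* reload first half     */
        *forward = 0;                    /* back to the beginning */
    } else {
        (*forward)++;
    }
    return c;
}
```

With N = 4, reading ten characters crosses both half-buffer boundaries, yet the caller sees one uninterrupted character stream.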



Sentinels
▪ A sentinel is a special character at the buffer end
▪ In the preceding scheme, for each character read we make two tests:
• one to check whether we have moved off the end of a buffer, and
• one to determine what character is read
▪ We can combine the two tests if we extend each buffer to hold a
sentinel character (e.g. eof) at the end:

• Note that eof retains its use as a marker for the end of the entire
input:
• any eof that appears other than at the end of a buffer means that
the input is at an end.
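A sketch of the sentinel variant, under the same assumptions as before: a string stands in for the file, `'\0'` plays the role of the eof sentinel, and each half-buffer gets one extra sentinel slot. The fast path now performs a single comparison per character.

```c
#include <assert.h>
#include <string.h>

#define N 4
#define EOF_CH '\0'              /* stands in for the eof sentinel */

static char sbuf[2 * (N + 1)];   /* each half: N chars plus a sentinel slot */
static const char *ssrc;

static void sreload(int half) {
    char *p = &sbuf[half * (N + 1)];
    int i;
    for (i = 0; i < N; i++) p[i] = *ssrc ? *ssrc++ : EOF_CH;
    p[N] = EOF_CH;               /* sentinel at the end of the half */
}

void sbuffer_init(const char *source) { ssrc = source; sreload(0); }

/* One test per character: only on seeing the sentinel do we ask which case. */
int snext_char(int *forward) {
    int c = sbuf[(*forward)++];
    if (c == EOF_CH) {
        if (*forward - 1 == N) {                 /* sentinel ending first half  */
            sreload(1);
            *forward = N + 1;
            c = sbuf[(*forward)++];
        } else if (*forward - 1 == 2 * N + 1) {  /* sentinel ending second half */
            sreload(0);
            *forward = 0;
            c = sbuf[(*forward)++];
        }
        /* otherwise: a real end of input, c stays EOF_CH */
    }
    return c;
}
```

An eof anywhere other than a half-buffer boundary is a genuine end of input, exactly as the slide states.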



Sentinels…Algorithm for advancing forward



Specification of Tokens
▪ Two issues in lexical analysis:
• How to specify tokens? And
• How to recognize the tokens given a token specification? (i.e. how
to implement the nexttoken( ) routine)
▪ How to specify tokens:
• Tokens are specified by regular expressions.
▪ Regular Expressions:
• Represent patterns of strings of characters
• Regular expressions are an important part of specifying lexeme
patterns.
• Each pattern matches a set of strings, so regular expressions
serve as names for sets of strings.
• Although they cannot express all possible patterns, they are very
effective in specifying those types of patterns that we actually
need for tokens.
• The set of strings generated by a regular expression r is denoted L(r)
Strings and Alphabets
▪ Alphabet: any finite set of symbols.
• E.g. {a, b, c}; {0, 1} is the binary alphabet
▪ A string/sentence/word over an alphabet is a finite
sequence of symbols drawn from that alphabet.
▪ The length of a string s is denoted |s|
▪ Operations:
• concatenation of strings x and y: xy; εx = xε = x
• exponentiation: s^0 = ε, s^1 = s, s^2 = ss, …, s^i = s^(i-1)s, i > 0
▪ Do you remember the terms related to parts of a string?
• prefix vs suffix
• substring
• proper prefixes and suffixes
• subsequence



Languages
▪ A language is any countable set of strings over some fixed
alphabet.
Alphabet Language
{0,1} {0,10,100,1000,10000,…}
{0,1,100,000,111,…}
{a, b, c} {abc, aabbcc, aaabbbccc,…}
{A…Z} {TEE,FORE,BALL…}
{FOR,WHILE,GOTO…}
{A…Z,a…z,0…9, {All legal PASCAL progs}
+,-,…,<,>,…} {All grammatically correct English
Sentences}
Special languages: Φ - the empty language
{ε} - the language containing only the empty string ε



Operations on Languages

▪ Let L = {A, B, C, D} and D = {1, 2, 3}. Then:

▪ L ∪ D = {A, B, C, D, 1, 2, 3}
▪ LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
▪ L2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, DD}
▪ L4 = L2 L2 = ??
▪ L* = {all possible strings over L, plus ε}
▪ L+ = L* - {ε}
▪ L(L ∪ D) = {all strings of letters and digits beginning with a letter, each of
length 2}
▪ L(L ∪ D)* = {all strings of letters and digits starting with a letter, of length >= 1}
Regular Expressions
▪ Formal definition of a regular expression:
• Given an alphabet Σ:
• (1) ε is a regular expression that denotes {ε}, the set
containing only the empty string: L(ε) = {ε}
• (2) For each a ∈ Σ, a is a regular expression denoting {a},
the set containing the string a: L(a) = {a}
• (3) If r and s are regular expressions, then
• (r | s) is a regular expression denoting L(r) ∪ L(s)
• (rs) is a regular expression denoting L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• A regular expression is defined together with the language
it denotes.
• Regular expressions are used to denote regular languages



Regular Expressions: Example



Regular Expressions: Algebraic Properties

▪ All strings over the English alphabet that start with "tab" or end with
"bat": tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat
▪ All strings over the English alphabet in which 1, 2, 3 occur in
ascending order: {A,…,Z}* 1 {A,…,Z}* 2 {A,…,Z}* 3 {A,…,Z}*



Regular Definition
• Gives names to regular expressions in order to construct more
complicated regular expressions.
▪ If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
• d1 → r1
• d2 → r2
• d3 → r3
• …

▪ where:
• Each di is a new symbol, not in Σ and not the same as any other of
the d's
• Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}



Regular Definition: Example
• Example: C identifiers are strings of letters, digits, and
underscores.
• The regular definition for the language of C identifiers:
• letter → A | B | C | … | Z | a | b | … | z | _
• digit → 0 | 1 | 2 | … | 9
• id → letter ( letter | digit )*
▪ Unsigned numbers (integer or floating point) are strings such as
5280, 0.01234, 6.336E4, or 1.89E-4.
▪ The regular definition:
• digit → 0 | 1 | 2 | … | 9
• digits → digit digit*
• optionalFraction → . digits | ε
• optionalExponent → ( E ( + | - | ε ) digits ) | ε
• number → digits optionalFraction optionalExponent
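These regular definitions translate directly into C checking functions. The function names below are mine, and `_` is treated as a letter per the C identifier rule stated above.

```c
#include <assert.h>
#include <ctype.h>

/* id -> letter ( letter | digit )*   with '_' counted as a letter */
int is_c_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_')) return 0;
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_')) return 0;
    return 1;
}

/* number -> digits optionalFraction optionalExponent */
int is_unsigned_number(const char *s) {
    if (!isdigit((unsigned char)*s)) return 0;          /* digits */
    while (isdigit((unsigned char)*s)) s++;
    if (*s == '.') {                                    /* optionalFraction */
        s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    if (*s == 'E') {                                    /* optionalExponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '\0';
}
```

All four sample lexemes (5280, 0.01234, 6.336E4, 1.89E-4) satisfy `is_unsigned_number`, while a string such as "1." does not, because optionalFraction requires digits after the dot.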



Shorthand notations/extensions of regular expressions
• "+": one or more; r* = r+ | ε and r+ = rr* = r*r
• "?": zero or one; r? = r | ε
• [range]: character classes; a range of characters [a1|a2|…|an] = [a1-an]
• E.g. [A-Z] = A|B|C|…|Z; e.g. [abc] = a|b|c
• Example
• C variable name: [A-Za-z][A-Za-z0-9]*
▪ Write regular definitions for
• Phone numbers of the form (510) 643-1481
• number → '(' area ')' exchange '-' phone
• phone → digit4
• exchange → digit3
• area → digit3
• digit → 0 | 1 | 2 | … | 9
• Email Addresses (Exercise)
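A hedged C sketch of a checker for this phone-number form, assuming exactly one space between the closing parenthesis and the exchange (the slide's definition leaves that separator implicit); the function name is mine.

```c
#include <assert.h>
#include <ctype.h>

/* number -> '(' digit3 ')' ' ' digit3 '-' digit4   e.g. "(510) 643-1481" */
int is_phone(const char *s) {
    int i;
    if (*s++ != '(') return 0;
    for (i = 0; i < 3; i++) if (!isdigit((unsigned char)*s++)) return 0;  /* area */
    if (*s++ != ')') return 0;
    if (*s++ != ' ') return 0;        /* assumed single-space separator */
    for (i = 0; i < 3; i++) if (!isdigit((unsigned char)*s++)) return 0;  /* exchange */
    if (*s++ != '-') return 0;
    for (i = 0; i < 4; i++) if (!isdigit((unsigned char)*s++)) return 0;  /* phone */
    return *s == '\0';
}
```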
Shorthand notations/extensions of regular expressions
• Some useful regular expressions:

How do they differ from r1r2?



Regular Grammars
▪ Basically regular grammars are used to represent
regular languages
▪ Regular grammars are either left- or right-linear:
• Right regular grammars:
• Rules of the forms A → ε, A → a, A → aB
– A, B: variables (nonterminals); a: terminal
• Left regular grammars:
• Rules of the forms A → ε, A → a, A → Ba
– A, B: variables (nonterminals); a: terminal
• Example: S → aS | bA, A → cA | ε
• This grammar produces the language generated by the
regular expression a*bc*
Recognition of tokens
• Recognition of tokens is the second issue in lexical
analysis
• How to recognize the tokens giving a token specification?
• Given the grammar of branching statements:

▪ The terminals of the grammar, which are if, then, else, relop,
id, and number, are the names of tokens
Recognition of tokens…
❑ The patterns for the given tokens are described using
regular definitions as:

Note that:

 The lexical analyzer also has the job of stripping out
whitespace, by recognizing the "token" ws defined by:

ws → ( blank | tab | newline )+


Recognition of tokens…
❑ The table shows which token name is returned to
the parser and what attribute value for each
lexeme or family of lexemes.



Recognition of tokens…
▪ Transition diagrams
• As an intermediate step in the construction of a lexical analyzer, we first
convert patterns into stylized flowcharts, called "transition diagrams"
Example: transition diagram for relop

• * is used to indicate that we must retract the
input/forward pointer one position (back)
• i.e., the lexeme does not include the symbol
that got us to the accepting state



Recognition of tokens…
▪ Sketch of implementation of relop transition diagram



Recognition of tokens…
Mapping transition diagrams into C code:

Token nexttoken( ) {
  while (1) {
    switch (state) {
    case 0:
      c = nextchar();
      if (c is white space) state = 0;
      else if (c == '<') state = 1;
      else if (c == '=') state = 5;
      …
    case 1:
      c = nextchar();
      if (c == '=') { go to state 2; return; }
      else if (c == '>') { go to state 3; return; }
      else { retract; return; }
      …
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: …
    case 11:
      retract(1);
      insert(id);
      return;
    }
  }
}
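Alongside that pseudocode, a runnable C version of the relop transition diagram might look like this; the enum and function names are illustrative, and instead of a retract operation the function reports how many characters the lexeme actually consumed.

```c
#include <assert.h>

enum relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

/* Transition-diagram recognizer for relop ( < | <= | <> | = | > | >= ).
   *plen receives the number of characters consumed; taking one character
   instead of two plays the role of the retract on the * edges. */
enum relop relop_token(const char *s, int *plen) {
    if (s[0] == '<') {
        if (s[1] == '=') { *plen = 2; return LE; }
        if (s[1] == '>') { *plen = 2; return NE; }
        *plen = 1; return LT;      /* retract: next char is not part of the lexeme */
    }
    if (s[0] == '=') { *plen = 1; return EQ; }
    if (s[0] == '>') {
        if (s[1] == '=') { *plen = 2; return GE; }
        *plen = 1; return GT;      /* retract, as above */
    }
    *plen = 0; return NOT_RELOP;
}
```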
Recognition of tokens…
▪ Recognition of Reserved Words and Identifiers
• Recognizing keywords and identifiers presents a problem:
• keywords like if or then are reserved, but they look like
identifiers.
• There are two ways we can handle this:
• Install the reserved words in the symbol table initially.
When we find an identifier, installID is called to place it
in the symbol table.

• Create a separate transition diagram for each keyword;
– e.g. the transition diagram for the reserved word then

NB: A nonletter/digit edge is used to test for any character that cannot be the
continuation of an identifier


Session 2
▪ Finite Automata
▪ Nondeterministic Finite Automata
▪ Deterministic Finite Automata
▪ From Regular Expressions to Automata
▪ Conversion of NFA to DFA
▪ DFA state minimization
▪ Design of a Lexical-Analyzer Generator
▪ Implementing Scanners



Finite Automata
▪ At the heart of the transition diagram is the formalism
known as finite automata
▪ A finite automaton is a finite-state transition diagram
that can be used to model the recognition of a token type
specified by a regular expression
▪ Finite automata are recognizers; they simply say "yes" or
"no" about each possible input string.
▪ A finite automaton can be either a
• Nondeterministic finite automaton (NFA), or a
• Deterministic finite automaton (DFA)
▪ DFAs and NFAs are capable of recognizing the same
languages



Nondeterministic Finite Automata
▪ A nondeterministic finite automaton (NFA) consists of a finite set of
states, a set of input symbols, a transition function that maps a state
and an input symbol (or ε) to a set of states, a start state, and a set
of accepting (final) states.
▪ Nondeterminism: a state may have several transitions on the same
input symbol, as well as ε-transitions taken without consuming input.



NFA: Transition Graph and Transition Tables
▪ An NFA can be diagrammatically represented by a
labeled directed graph called a transition graph.
The set of states = {A, B, C, D}
Input symbols = {0, 1}
The start state is A; the accepting state is D

 The transition graph described above can be represented by the
following transition table:

State | on 0  | on 1
A     | {A,B} | {A,C}
B     | {D}   | --
C     | --    | {D}
D     | {D}   | {D}
Nondeterministic Finite Automata…
▪ Acceptance of Input Strings by Automata
• An NFA accepts an input string str iff there is some path in the
finite-state transition diagram from the start state to some final
state such that the edge labels along this path spell out str
• The language recognized by an NFA is the set of strings it
accepts
• Example

The language recognized by this NFA is aa*|bb*



Deterministic Finite Automata
▪ A deterministic finite automaton (DFA) is a
special case of an NFA where:
• there are no moves on input ε, and
• for each state q and input symbol a, there is exactly
one edge out of q labeled a.

• Example: the language recognized by this
DFA is also (b*aa*(a*b)+)



Implementing/simulating a DFA
▪ Let us assume that the end of a string is marked with a
special symbol (say eos).
▪ The algorithm for recognition will be as follows: (an efficient
implementation)
q = q0 { start from the initial state }
c = nextchar { get the next character from the input string }
while (c != eos) do { do until the end of the string }
begin
q = move(q, c) { transition function }
c = nextchar
end
if (q in F) then { if q is an accepting state }
return “yes”
else
return “no”
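The loop above, applied to a concrete transition table: the DFA below recognizes (a|b)*abb, with state numbering of my choosing (state 0 is the start state, state 3 the only accepting state).

```c
#include <assert.h>

/* DFA for (a|b)*abb: states 0..3, state 3 accepting.
   Transition table indexed by [state][input], input 0 = 'a', 1 = 'b'. */
static const int dtran[4][2] = {
    {1, 0},   /* state 0 (A) */
    {1, 2},   /* state 1 (B) */
    {1, 3},   /* state 2 (C) */
    {1, 0},   /* state 3 (D) */
};

int dfa_accepts(const char *w) {
    int q = 0;                          /* q = q0 */
    for (; *w; w++) {                   /* while c != eos */
        if (*w != 'a' && *w != 'b') return 0;
        q = dtran[q][*w - 'a'];         /* q = move(q, c) */
    }
    return q == 3;                      /* is q in F? */
}
```

Each character costs exactly one table lookup, which is why the DFA is the preferred execution form for a scanner.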
Implementing/simulating an NFA

Q = ε-closure({q0})            { set of all states accessible from q0 by ε-transitions }
c = nextchar
while (c != eos) do
begin
    Q = ε-closure(move(Q, c))  { set of all states accessible from a state in Q
                                 by a transition on c }
    c = nextchar
end
if (Q ∩ F != ∅) then           { if Q contains an accepting state }
    return "yes"
else
    return "no"
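A runnable C sketch of this simulation, using bitmasks for state sets and the ε-NFA for (a|b)*abb (states 0-10, the same machine used in the subset-construction example later in this chapter); all names are illustrative.

```c
#include <assert.h>

/* ε-NFA for (a|b)*abb: states 0..10, start 0, accepting state 10. */
#define NSTATES 11
static const unsigned eps[NSTATES] = {      /* ε-successors of each state */
    1u<<1 | 1u<<7, 1u<<2 | 1u<<4, 0, 1u<<6, 0, 1u<<6, 1u<<1 | 1u<<7, 0, 0, 0, 0
};
static const unsigned on_a[NSTATES] = { 0, 0, 1u<<3, 0, 0, 0, 0, 1u<<8, 0, 0, 0 };
static const unsigned on_b[NSTATES] = { 0, 0, 0, 0, 1u<<5, 0, 0, 0, 1u<<9, 1u<<10, 0 };

static unsigned eps_closure(unsigned Q) {
    unsigned prev;
    do {                                    /* iterate until no new states appear */
        prev = Q;
        for (int s = 0; s < NSTATES; s++)
            if (Q & (1u << s)) Q |= eps[s];
    } while (Q != prev);
    return Q;
}

static unsigned move(unsigned Q, char c) {
    unsigned R = 0;
    const unsigned *t = (c == 'a') ? on_a : on_b;
    for (int s = 0; s < NSTATES; s++)
        if (Q & (1u << s)) R |= t[s];
    return R;
}

int nfa_accepts(const char *w) {
    unsigned Q = eps_closure(1u << 0);      /* Q = ε-closure({q0}) */
    for (; *w; w++)
        Q = eps_closure(move(Q, *w));
    return (Q & (1u << 10)) != 0;           /* Q ∩ F != ∅ ? */
}
```

Note that ε-closure({0}) comes out as {0, 1, 2, 4, 7}, matching the set A computed in the subset-construction example.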



Computing ε-closure(Q)
▪ ε-closure(Q): the set of states reachable from some state q in Q on
ε-transitions alone.
▪ move(Q, c): the set of states to which there is a transition on input
symbol c from some state q in Q



Computing ε-closure(Q) … Example
▪ Given the NFA below, find ε-closure(s)

i.e. all states that can be reached through ε-transitions only
From Regular Expressions to NFA
▪ For each kind of RE there is a corresponding NFA
• i.e. any regular expression can be converted to an NFA that
defines the same language.
▪ The algorithm is syntax-directed, in the sense that it
works recursively up the parse tree for the regular
expression.
▪ For each sub-expression the algorithm constructs an NFA
with a single accepting state.
▪ Algorithm: the McNaughton-Yamada-Thompson
algorithm to convert a regular expression to an NFA.
• INPUT: a regular expression r over alphabet Σ.
• OUTPUT: an NFA N accepting L(r).
From Regular Expressions to NFA…
▪ Method: Begin by parsing r into its constituent sub-expressions.
Then, using rules (1) and (2) below, we construct NFAs for each of
the basic symbols in r.
• Rule 1: For the expression ε, construct the NFA
• Rule 2: For any sub-expression a in Σ, construct the NFA

• Rule 3: NFA for the union of two regular expressions

• Ex: a|b



From Regular Expressions to NFA…
▪ Rule 4: NFA for r1r2

▪ Rule 5: NFA for r*

▪ Example: (a|b)*



From Regular Expressions to NFA…
▪ Example: constructing an NFA for the regular expression
r = (a|b)*abb
Step 1: construct a | b
Step 2: construct (a|b)*
Step 3: concatenate it with a,
then b, then b



Conversion of NFA to DFA
▪ Why?
• A DFA is difficult to construct directly from REs
• An NFA is difficult to represent in a computer program and
inefficient to simulate
• So: RE ➔ NFA ➔ DFA
▪ Conversion algorithm: subset construction
• The idea is that each DFA state corresponds to a set of NFA states.
• After reading input a1a2…an, the DFA is in a state that represents the
subset Q of the states of the NFA that are reachable from the start
state.
• INPUT: An NFA N
• OUTPUT: A DFA D accepting the same language as N.
• ALGORITHM: …
• The algorithm constructs a transition table Dtran for D.



Subset Construction Algorithm
▪ Each state of D is a set of NFA states, and we construct Dtran so D will simulate "in
parallel" all possible moves N can make on a given input string
▪ Algorithm: Note that q is a single state of N, while Q is a set of states of N.
Dstates := { ε-closure(q0) }   // q0 = the initial state of N; the new state is unmarked
While (there is an unmarked state Q in Dstates ) do
begin
mark Q;
for (each input symbol a) do
begin
U =  -closure ( move(Q, a) );
if (U is not in Dstates ) then
add U as an unmarked state to Dstates;
Dtran [Q, a] = U;
end
end
Subset Construction Algorithm…
Example NFA to DFA, RE= (a|b)*abb

▪ The start state A of the equivalent DFA is -closure(0),


• A = {0,1,2,4,7},
• Since these are exactly the states reachable from state
0 via a path all of whose edges have label .
• Note that a path can have zero edges, so state 0 is
reachable from itself by an  -labeled path.



Subset Construction Algorithm…
▪ The input alphabet is {a, b}. Thus, our first step is to
mark A and compute
Dtran[A, a] = ε-closure(move(A, a)) and
Dtran[A, b] = ε-closure(move(A, b)).
▪ Among the states 0, 1, 2, 4, and 7, only 2 and 7 have
transitions on a, to 3 and 8, respectively.
• Thus, move(A, a) = {3, 8}. Also, ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8},
so we call this set B, and
let Dtran[A, a] = B
▪ Next we compute Dtran[A, b]:
• among the states in A, only 4 has a transition on b, and it
goes to 5


Subset Construction Algorithm…
▪ C = Dtran[A, b] = Ɛ-closure({5}) = {1, 2, 4, 5, 6, 7}
▪ If we continue this process with the unmarked sets B and C, we
eventually reach a point where all the states of the DFA are
marked



NFA to DFA conversion: Example 2
NFA: states 0-3 over {a, b}; start state 0; accepting state 3;
transitions: 0 -a-> {0, 1}, 0 -b-> {0}, 1 -b-> {2}, 2 -b-> {3}

ε-closure(0) = {0}
move(0, a) = {0,1}           New DFA states:
move(0, b) = {0}             A = {0}
move({0,1}, a) = {0,1}       B = {0,1}
move({0,1}, b) = {0,2}       C = {0,2}
move({0,2}, a) = {0,1}       D = {0,3}
move({0,2}, b) = {0,3}
move({0,3}, a) = {0,1}       Resulting transitions:
move({0,3}, b) = {0}             a  b
                              A  B  A
                              B  B  C
                              C  B  D
                              D  B  A
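Subset construction over Example 2's NFA can be coded with bitmask state sets. The sketch below (all names illustrative) only collects the reachable DFA states, which is the core of the algorithm; recording Dtran entries would be a small extension.

```c
#include <assert.h>

/* NFA of Example 2: 0 -a-> {0,1}, 0 -b-> {0}, 1 -b-> {2}, 2 -b-> {3};
   a set of NFA states is a bitmask over states 0..3. */
static const unsigned ta[4] = { 1u<<0 | 1u<<1, 0, 0, 0 };   /* on 'a' */
static const unsigned tb[4] = { 1u<<0, 1u<<2, 1u<<3, 0 };   /* on 'b' */

static unsigned subset_move(unsigned Q, const unsigned t[4]) {
    unsigned R = 0;
    for (int s = 0; s < 4; s++)
        if (Q & (1u << s)) R |= t[s];
    return R;
}

/* Collect every distinct DFA state (set of NFA states) reachable from {0};
   returns how many there are, in discovery order. */
int subset_construct(unsigned dstates[], int max) {
    int n = 0, marked = 0;
    dstates[n++] = 1u << 0;                 /* start state A = {0} */
    while (marked < n) {                    /* while an unmarked state remains */
        unsigned Q = dstates[marked++];     /* mark Q */
        unsigned succ[2] = { subset_move(Q, ta), subset_move(Q, tb) };
        for (int i = 0; i < 2; i++) {
            int seen = 0;
            for (int j = 0; j < n; j++)
                if (dstates[j] == succ[i]) seen = 1;
            if (!seen && n < max) dstates[n++] = succ[i];
        }
    }
    return n;
}
```

Discovery order matches the slide: A = {0}, B = {0,1}, C = {0,2}, D = {0,3}.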
NFA to DFA conversion: Example 3
NFA: start state 0, with ε-transitions 0 -ε-> 1 and 0 -ε-> 3,
and transitions 1 -a-> 2, 3 -b-> 4

Due to the ε-transitions, we must compute ε-closure(S)
Example: ε-closure(0) = {0, 1, 3}
NFA to DFA conversion: Example 4

Dstates := { ε-closure(1) } = { {1,2} }
U := ε-closure(move({1,2}, a)) = {3,4,5}; add {3,4,5} to Dstates
U := ε-closure(move({1,2}, b)) = {}
ε-closure(move({3,4,5}, a)) = {5}
ε-closure(move({3,4,5}, b)) = {4,5}
ε-closure(move({4,5}, a)) = {5}
ε-closure(move({4,5}, b)) = {5}

Resulting transition table:
              a   b
A = {1,2}     B   --
B = {3,4,5}   D   C
C = {4,5}     D   D
D = {5}       --  --
NFA to DFA conversion: Exercise

Convert the ε-NFA shown below into a DFA, using subset
construction
Putting the Pieces Together

Regular expression R → [RE → NFA conversion] → NFA
NFA → [NFA → DFA conversion] → DFA
Input string w → [DFA simulation] → "yes" if w ∈ L(R), "no" if w ∉ L(R)


DFA state minimization
▪ Sometimes a DFA will have more states than necessary.
▪ For every DFA there is a unique smallest equivalent DFA (fewest
states possible).
▪ Some DFAs contain unreachable states that cannot be reached
from the start state.
▪ Other DFAs may contain dead states that cannot reach any
accepting state.
▪ A dead state is a nonaccepting state from which the machine can
never go on to reach an accepting state.
▪ It is clear that neither unreachable states nor dead states can
participate in scanning any valid token.
▪ We therefore eliminate all such states as part of our optimization
process.
DFA state minimization
▪ We optimize a DFA by merging together states we know to be
equivalent.
▪ For example, two accepting states that have no transitions at all
out of them are equivalent.
▪ Why? Because they behave exactly the same way: they accept the
string read so far, but will accept no additional characters.
▪ If two states, s1 and s2, are equivalent, then all transitions to
s2 can be replaced with transitions to s1.
▪ In effect, two states are merged together into one common state.
▪ How do we decide what states to merge together?
▪ We take a greedy approach and try the most optimistic merger of
states.
▪ By definition, accepting and non-accepting states are distinct, so
we initially try to create only two states: one representing the
merger of all accepting states and the other representing the
merger of all non-accepting states.
DFA state minimization
▪ This merger into only two states is almost certainly too optimistic.
▪ In particular, all the constituents of a merged state must agree on
the same transition for each possible character.
▪ That is, for character c all the merged states must have no
successor under c or they must all go to a single (possibly merged)
state.
▪ If all constituents of a merged state do not agree on the transition
to follow for some character, the merged state is split into two or
more smaller states that do agree.
▪ As an example, assume we start with the following automaton:
DFA state minimization
▪ Initially we have a merged nonaccepting state {1,2,3,5,6} and a
merged accepting state {4,7}.
▪ A merger is legal if and only if all constituent states agree on the
same successor state for all characters.
▪ For example, states 3 and 6 would go to an accepting state given
character c; states 1, 2, 5 would not, so a split must occur.
▪ We will add an error state sE to the original DFA that is the
successor state under any illegal character.
▪ (Thus reaching sE becomes equivalent to detecting an illegal
token.) sE is not a real state; rather it allows us to assume every
state has a successor under every character.
▪ sE is never merged with any real state. Algorithm Split , shown
below, splits merged states whose constituents do not agree on a
common successor state for all characters.
▪ When Split terminates, we know that states that remain merged
are equivalent in that they always agree on common successors.
DFA state minimization
▪ Algorithm:
Split(FASet StateSet) {
    repeat
        for (each merged state S in StateSet) {
            Let S correspond to { s1, ..., sn }
            for (each char c in Alphabet) {
                Let t1, ..., tn be the successor states to s1, ..., sn under c
                if (t1, ..., tn do not all belong to the same merged state) {
                    Split S into two or more new states such that si and sj
                    remain in the same merged state if and only if ti and tj
                    are in the same merged state
                }
            }
        }
    until no more splits are possible
}
DFA state minimization

▪ Returning to our example, we initially have states {1,2,3,5,6}
and {4,7}. Invoking Split, we first observe that states 3 and 6
have a common successor under c, and states 1, 2, and 5 have no
successor under c (equivalently, have the error state sE as a
successor).
▪ This forces a split, yielding {1,2,5}, {3,6} and {4,7}.
▪ Now, for character b, states 2 and 5 would go to the merged state
{3,6}, but state 1 would not, so another split occurs.
▪ We now have: {1}, {2,5}, {3,6} and {4,7}.
▪ At this point we are done, as all constituents of merged states
agree on the same successor for each input symbol.
▪ Once Split is executed, we are essentially done.
▪ Transitions between merged states are the same as the transitions
between states in the original DFA.
DFA state minimization
▪ Thus, if there was a transition between state si and sj under
character c, there is now a transition under c from the merged state
containing si to the merged state containing sj.
▪ The start state is that merged state containing the original start
state.
▪ Accepting states are those merged states containing accepting
states (recall that accepting and non-accepting states are never
merged).
▪ Returning to our example, the minimum state automaton we
obtain is:
DFA state minimization
▪ Example 2: minimize the states of the following DFA.

▪ Solution: start with all states {A, B, C, D, E}, then split repeatedly:

Nonaccepting states      Accepting states
{A, B, C, D}             {E}
{A, B, C}                {D}
{A, C}                   {B}

▪ Now no more splitting is possible.

DFA state minimization
▪ Example 2 cont…
▪ Previous transition table:

states | a | b
A      | B | C
B      | B | D
C      | B | C
D      | B | E
E      | B | C

▪ If we choose A as the representative for the group {A, C}, then we
obtain the reduced transition table:

states   | a | b
{A,C}= A | B | A
B        | B | D
D        | B | E
E        | B | A
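The splitting procedure can be sketched as iterated partition refinement on Example 2's table: states stay in the same group only while they agree, for every input, on the groups of their successors. This is a Moore-style reformulation of Split (names and layout are mine), not its literal code.

```c
#include <assert.h>

/* DFA of Example 2: states A..E = 0..4, accepting state E = 4. */
enum { A, B, C, D, E, NS };
static const int tr[NS][2] = {        /* [state][0 = 'a', 1 = 'b'] */
    {B, C}, {B, D}, {B, C}, {B, E}, {B, C}
};

/* Refine the accepting/nonaccepting split until stable; group[s] receives
   the merged-group number of state s.  Returns the number of groups. */
int minimize(int group[NS]) {
    int next[NS];
    for (int s = 0; s < NS; s++)
        group[s] = (s == E);          /* initial split: accepting vs not */
    int ngroups = 2;
    for (;;) {
        int newn = 0;
        for (int s = 0; s < NS; s++) {
            int g = -1;
            /* s joins an earlier state's new group iff they share the old
               group and their successors' groups match on both inputs */
            for (int t = 0; t < s; t++)
                if (group[t] == group[s] &&
                    group[tr[t][0]] == group[tr[s][0]] &&
                    group[tr[t][1]] == group[tr[s][1]]) { g = next[t]; break; }
            next[s] = (g >= 0) ? g : newn++;
        }
        if (newn == ngroups) break;   /* no split happened: done */
        ngroups = newn;
        for (int s = 0; s < NS; s++) group[s] = next[s];
    }
    return ngroups;
}
```

On this table the refinement stabilizes at four groups, with A and C merged, matching the reduced table above.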
Lexical Analyzer Generator - Lex

Lex source program lex.l → [Lex compiler] → lex.yy.c
lex.yy.c → [C compiler] → a.out
Input stream → [a.out] → sequence of tokens


Structure of Lex programs
declarations
%%
translation rules, each of the form: Pattern { Action }
%%
auxiliary functions


Structure of Lex programs
• A big difference from your previous coding experience:
• writing REs instead of the code itself
• writing actions associated with each RE

 Internal structure of Lex:

The final states of the DFA are associated with actions


Example
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER);}



Example cont…
▪ The procedure int installID() has access to the buffer
where the identifier lexeme has been located.
▪ In other words, it is a function to install the lexeme, whose
first character is pointed to by yytext and whose length is
yyleng, into the symbol table, and return a pointer to it.
▪ The symbol table is examined, and if the lexeme is found
there marked as a keyword, installID() returns 0.
▪ If the lexeme is found and is a program variable, installID()
returns a pointer to the symbol table entry.
▪ If the lexeme is not found in the symbol table, it is installed
as a variable and a pointer to the newly created entry is
returned.
▪ int installNum() { /* similar to installID, but puts numerical
constants into a separate table */ }
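A toy version of the installID() behaviour just described, with a linear-scan symbol table and the keywords preinstalled. The table layout, size limits, and helper names are mine, not Lex's internals; it returns an index in place of a pointer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical symbol table: keywords first, then program variables. */
#define MAXSYMS 64
static char symtab[MAXSYMS][32];
static int  nsyms;
static int  nkeywords;

void init_symtab(void) {
    static const char *kw[] = { "if", "then", "else" };
    nsyms = 0;
    for (int i = 0; i < 3; i++) strcpy(symtab[nsyms++], kw[i]);
    nkeywords = nsyms;                        /* keywords occupy the low indices */
}

/* Returns 0 if the lexeme is a keyword; otherwise returns the (positive)
   table index of the variable, installing it first if it is new. */
int installID(const char *lexeme) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return (i < nkeywords) ? 0 : i;   /* 0 marks a keyword */
    strcpy(symtab[nsyms], lexeme);            /* new variable: install it */
    return nsyms++;
}
```

Looking up the same variable twice returns the same index, which is what lets later phases treat the return value as a stable handle on the symbol-table entry.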
