0% found this document useful (0 votes)

20 views102 pages

2.1 - Lexical Analysis

Uploaded by

devildheke7

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views102 pages

2.1 - Lexical Analysis

Uploaded by

devildheke7

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

LEXICAL

ANALYSIS
Lexical Analysis

■ It is the initial part of reading and analyzing the

program text
■ The text is read and divided into tokens, each of
which corresponds to a symbol in programming
language, e.g. a variable name, keyword or number,
etc.
■ Lexical analyzer (also called lexer) will as its input
take a string of individual letters and divide this
string into tokens
The Role of Lexical Analyzer
The Role of Lexical Analyzer

■ A lexical analyzer doesn’t return a list of tokens at

once, it returns a token when the parser asks a token
from it
■ The parser requests the lexical analyzer for the next
token whenever it requires one using getnexttoken( )
■ On the receipt of the command, lexer scans the input
and processes until a token is matched
The Role of Lexical Analyzer

■ The function of a lexical analyzer is to read the input

stream representing the source program, one character
at a time and to translate it into valid tokens
■ Lexical analyzer may also perform other operations like
– removing redundant white spaces (i.e. blanks, tabs
and newlines)
– removing token separators (like semicolon)
– removal of comments
– providing line number to the parser for error reporting
Reasons for Separation of Lexical
Analysis and Parsing
■ Simplicity of design is the most important
consideration. The separation of lexical and syntactic
analysis often allows us to simplify at least one of these
tasks.
■ Compiler efﬁciency is improved. A separate lexical
analyzer allows us to apply specialized techniques that
serve only the lexical task, not the job of parsing.
■ Compiler portability is enhanced. Input-device-speciﬁc
peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns and Lexemes

■ Tokens
– A token is a single word of source code input
– A token is a pair consisting of a token name and
an optional attribute value
– Tokens are the separately identiﬁable blocks with
collective meaning
Tokens, Patterns and Lexemes

– When a string representing a program is broken

into sequence of substrings, such that each
substring represents a constant, identiﬁer,
operator, keyword, etc of the language, these
substrings are called the tokens of the language
– They are the building block of the programming
language
– Eg: if, else, identiﬁers
Tokens, Patterns and Lexemes

■ Lexemes
– Lexemes are the actual string matched as token
– They are the speciﬁc characters that make up of a
token
– For example, abc and 123
– A token can represent more than one lexeme. i.e.
token intnum can represent lexemes 123, 244,
4545, etc
Tokens, Patterns and Lexemes

■ Patterns
– Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits”
and “non-empty sequence of digits”
– Regular expressions are usually used to specify
patterns
– Eg: intnum token can be defined as [0-9][0-9]*
Attributes for Tokens
■ When a token represents more than one lexeme,
lexical analyzer must provide additional information
about the particular lexeme
■ This additional information is called as the attribute of
the token
■ Eg: If token id matched var1 and var2 both, then
lexical analyzer must be able to represent var1 and
var2 as different identifiers
■ For obtaining actual value, each token is associated
with attribute, generally pointer to the symbol table
Attributes for Tokens
■ Some attributes:
– <id, attr> where attr is pointer to the symbol table
– <assignop,_> no attribute is needed (if there is
only one assignment operator)
– <num, val> where val is the actual value of the
number
■ Eg: dest = source + 5
– Tokens: <id, pt for dest>, <assignop>, <num, 5>
■ Token type and its attribute uniquely identifies a
lexeme
Attributes for Tokens

■ Example: take statement, area = 3.1416 * r * r

1. getnexttoken( ) returns (id, attr) where attr is pointer
to area in symbol table
2. getnexttoken( ) returns (assignop) no attribute is
needed, if there is only one assignment operator
3. getnexttoken( ) returns (ﬂoatnum, 3.1416) where
3.1416 is the actual value of ﬂoatnum
etc
Lexical Errors

■ Though error at lexical analysis is normally not

common, there is possibility of errors
■ When error occurs, the lexical analyzer must not halt
the process
■ It can print the error message and continue
■ Error in this phase is found when there are no
matching strings found as given by the pattern
Lexical Errors

■ Some error recovery techniques

– Deletion of extraneous character
– Inserting missing character
– Replacing incorrect character by correct one
– Transposition of adjacent characters
■ Lexical error recovery is normally an expensive
process
■ Recovery eg: finding the number of transformations
that would make the correct tokens
Approaches to Implementing
Lexical Analyzer
1. Use lexical analyzer generator like Flex that produces
lexical analyzer from the given specification as
regular expression. The generator provides routine
for reading and buffering the input.
2. Write a lexical analyzer in general programming
language like C. We need to use the I/O facility of the
language for reading and buffering the input.
3. Use the assembly language to write the lexical
analyzer. Explicitly manage the reading of input.
Approaches to Implementing
Lexical Analyzer
■ These strategies are in increasing order of difficulty
and efficiency
■ Since we deal with characters in lexical analysis, it is
better to take some time during implementation to
get efficient result
Input Buffering

■ Technique used to speed up reading the source

program
■ There are many situations where we need to look at
least one (if not more) additional character ahead to
recognize lexemes in the input
■ For eg, int is a keyword in C but intnum is an
identiﬁer so when the scanner reads i, n, t, it has to
look for other characters to see whether it is just int
or some other word
Input Buffering

■ In this case, when the token is read the next time, the
scanner needs to move back to rescan the input
again for the characters that are not used for the
lexeme and this is time consuming
■ In C, single-character operators like -, =, or < could
also be the beginning of a two-character operator
like ->, ==, or <=
■ To reduce the overhead and efﬁciently move back
and forth, input buffering technique is used
Buffer Pairs (2N Buffering)

■ Specialized buffering techniques have been

developed to reduce the amount of overhead
required to process a single input character
Buffer Pairs

■ Each buffer is of the same size N, and N is usually the

size of a disk block, e.g., 4096 bytes
■ Using one system read command we can read N
characters into a buffer, rather than using one
system call per character
■ If fewer than N characters remain in the input ﬁle,
then a special character, represented by eof marks
the end of the source ﬁle
Buffer Pairs
■ Two pointers to the input are maintained:
– Pointer lexemeBegin, marks the beginning of the
current lexeme, whose extent we are attempting to
determine
– Pointer forward scans ahead until a pattern match
is found
■ Once the next lexeme is determined, forward is set to
the character at its right end
■ After the lexeme is recorded, lexemeBegin is set to
the character immediately after the lexeme just found
Buffer Pairs

■ In the above ﬁgure, forward has passed the end of

the next lexeme, and must be retracted one position
to its left
■ Advancing forward requires that we ﬁrst test
whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer
from the input, and move forward to the beginning of
the newly loaded buffer
Sentinels

■ If we use the 2N buffering scheme, we must check,

each time we advance forward, that we have not
moved off one of the buffers; if we do, then we must
also reload the other buffer
■ Thus, for each character read, we make two tests:
one for the end of the buffer, and one to determine
what character is read
Sentinels
■ We can combine the buffer-end test with the test for
the current character if we extend each buffer to hold a
sentinel character at the end
■ The sentinel is a special character that cannot be part
of the source program, and a natural choice is the
character eof
■ eof retains its use as a marker for the end of the entire
input
■ Any eof that appears other than at the end of a buffer
means that the input is at an end
Sentinels
Speciﬁcation of Tokens

■ Regular expression is the common way of specifying

the patterns for tokens
■ Some Deﬁnitions:
■ Alphabet:
– Set of symbols that generate language. For e.g.
{0-9} is an alphabet that is used to produce all the
nonnegative integer numbers
– {0-1} is an alphabet that is used to produce all the
binary strings
Speciﬁcation of Tokens

■ String:
– Finite sequence of characters from the alphabet
– Given the alphabet A, A2 = A.A is set of strings of
length 2, similarly An is set of strings of length n
– The length of the string w is denoted by |w| i.e.
number of characters (symbols) in w
– We also have A0 ={ε}, where ε is called empty
string
Speciﬁcation of Tokens

■ Kleene Closure:
– Kleene closure of an alphabet A denoted by A* is
set of all strings of any length (0 also) possible
from A
– Mathematically A* = A0 ∪ A1 ∪ A2 ∪ … …
– For any string, w over alphabet A, w ∈ A*
Specification of Tokens
Specification of Tokens
■ A language L over alphabet A is the set such that L ⊆ A*
■ The string s is called prefix of w, if the string s is
obtained by removing zero or more trailing characters
from w. If s is a proper prefix, then s ≠ w.
■ The string s is called suffix of w, if the string s is
obtained by deleting zero or more leading characters
from w. We say s is a proper suffix if s ≠ w.
■ The string s is called substring of w if we can obtain s by
deleting zero or more leading or trailing characters from
w. We say s is a proper substring if s ≠ w.
Specification of Tokens
■ Regular Operators:
■ The following operators are called regular operators
and the language formed called regular language.
– . → Concatenation operator, R.S = {rs | r ∈ R and s
∈S}

– * → Kleene star operator, A* = ⋃i≥0 Ai

– + / ∪ / | → Choice/union operator, R ∪ S = {t | t ∈
R or t ∈ S }
Speciﬁcation of Tokens

■ Regular Expression (RE):

■ We use regular expression to describe the tokens of
a programming language.
■ Basis Symbol
– ε is a regular expression denoting language { ε }
– a ∈ A is a regular expression denoting {a}
Speciﬁcation of Tokens

■ If r and s are regular expressions denoting languages

L1(r) and L2(s) respectively, then
– r + s is a regular expression denoting L1(r) ∪ L2(s)
– rs is a regular expression denoting L1(r) . L2(s)
– r* is a regular expression denoting (L1(r) )*
– (r) is a regular expression denoting L1(r)
Speciﬁcation of Tokens

■ Practice Questions

(a) RE that gives the binary strings that are divisible by 2

(b) RE that gives binary strings having at most two 1s
(c) RE that denotes the language of all strings that ends
with 00 (binary number multiple of 4)
(d) RE that denotes the set of all strings that describes
alternating 1s and 0s
Speciﬁcation of Tokens

■ Answers

(a) (1+0)*0
(b) 0*10*10* + 0*10* + 0*
(c) (1+0)*00
(d) (01)* + (10)*
Speciﬁcation of Tokens

■ Properties of RE
– r+s = s+r ( + is commutative)
– r+(s+t) = (r+s)+t ; r(st) = (rs)t (+ and . are associative)
– r(s+t) = (rs)+(rt); (r+s)t =(rt)+(st) (. distributes over +)
– εr = rε (ε is identity element)
– r* = (r+ε)* (relation between * and ε)
– r** = r* (* is idempotent)
Speciﬁcation of Tokens

■ Regular Definitions:
■ To write regular expression for some languages can
be difficult, because their regular expressions can be
quite complex
■ In those cases, we may use regular definitions
■ We can give names to regular expressions, and we
can use these names as symbols to define other
regular expressions
Specification of Tokens
Specification of Tokens
■ Identifiers in Pascal are defined as a string of letters
and digits beginning with a letter
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit ) *
■ If we try to write the regular expression representing
identifiers without using regular definitions, that
regular expression will be complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
Specification of Tokens
Recognition of Tokens

■ A recognizer for a language is a program that takes a

string w, and answers “YES” if w is a sentence of that
language, otherwise “NO”
■ The tokens that are specified using RE are
recognized by using transition diagram or finite
automata (FA)
■ Starting from the start state we follow the transition
defined
Recognition of Tokens

■ If the transition leads to the accepting state, then the

token is matched and hence the lexeme is returned,
otherwise other transition diagrams are tried out
until we process all transition diagrams or failure is
detected
■ Recognizer of tokens takes the language L and the
string s as input and tries to verify whether s ∈ L or
not
Recognition of Tokens

■ We concentrate on a class of recognizer called Finite

Automata (FA)
■ There are two types of Finite Automaton:
– Deterministic Finite Automaton (DFA)
– Non Deterministic Finite Automaton (NFA)
Recognition of Tokens
Recognition of Tokens

■ Why convert from NFA to DFA?

– Computer programs generally need to know all
possible transitions and states for a given state
machine. A non-deterministic finite automaton can
have a transition that goes to any number of states
for a given input and state. This is a problem for a
computer program because it needs precisely one
transition for a given input from a given state. The
process of converting NFA to DFA eliminates this
ambiguity and allows a program to be made.
Deterministic Finite Automata (DFA)
■ FA is deterministic, if there is exactly one transition
for each (state, input) pair
■ It is a fast recognizer but takes large space
■ DFA is a five tuple (S, ∑, q0, δ, F) where,
– S → finite set of states
– ∑ → finite set of input alphabets
– q0 → starting state
– δ → transition function i.e. δ : S × ∑ → S
– F → set of final states F ⊆ S
Deterministic Finite Automata

■ The following is the algorithm for simulating DFA for

recognizing given string
■ For a given string w, in DFA D, with start state q0, with
set of ﬁnal states F, the output is “YES”, if D accepts
w, otherwise “NO”
DFASim(D, q0) {
q = q0
c = getchar( )
while (c != eof) {
q = move(q, c) // this is δ function.
c = getchar( )
}
if (q is in F)
return “yes” // if D accepts w
else
return “no”
}
Deterministic Finite Automata

■ The ﬁgure shows DFA for the regular expression:

(a+b)*abb
Practice Questions

Q) Write the regular expression and draw its corresponding

DFA for the following:

(1) Language accepting strings containing exactly two 0s

over input alphabets ∑ = {0, 1}.
(2) Language accepting strings starting and ending with
different characters over input alphabets ∑ = {0, 1}.
Ans 1: 1*01*01*
Ans 2: (0(0+1)*1)+(1(1+0)*0)
Non-Deterministic Finite Automata
(NFA)
■ FA is non-deterministic if there can be more than one
transition (or none) for each (state, input) pair
■ It is a slow recognizer but takes less space
■ An NFA is a five tuple (S, ∑, q0, δ, F) where,
– S → finite set of states
– ∑ → finite set of input alphabets
– q0 → starting state
– δ → transition function i.e. δ : S × ∑ → 2|S|
– F → set of final states F ⊆ S
Non-Deterministic Finite Automata

■ The ﬁgure shows NFA for the regular expression:

(a+b)*abb
Practice Questions

(1) Design an NFA with ∑ = {0, 1} that accepts all strings

in which the third symbol from the right end is always 0.

(2) Design an NFA with ∑ = {0, 1} which accepts all

strings containing the substring 1110.
Ans 1:

Ans 2:
ε-NFA

■ ε-transitions are allowed in ε-NFAs

■ In other words, we can move from one state to
another without consuming any symbol
■ In case of ε-NFA the only difference is ∑ = ∑ ∪ {ε} and
hence δ: S × ∑ ∪ {ε} → 2|S|
ε-NFA

■ The ﬁgure shows a state machine with ε moves that

is equivalent to the regular expression: aa* +bb*
//SIMULATING NFA
S = ε-closure( {S0} ) // set of all states that can be
// accessed from S0 by ε- transitions
c = getchar( )
while(c != eof) {
S = ε-closure(move(S, c)) //set of all states that can be accessible
//from a state in S by a transition on c
c = getchar( ) }
if ( S ∩ F ≠ Φ) then
return “YES”
else
return “NO”
RE to NFA (Thompson’s Construction)

■ Thompson’s construction is a simple and systematic

method
■ It guarantees that the resulting NFA will have exactly
one ﬁnal state and one start state
■ Construction starts from simplest parts(alphabet
symbols)
■ To create an NFA for a complex regular expression,
NFAs of its subexpressions are combined
RE to NFA

■ Input → RE, r, over alphabet ∑

■ Output → ε-NFA accepting L(r)
■ Procedure → Process in bottom-up manner by
creating ε-NFA for each symbol in ∑ including ε. Then
recursively create for other operations as shown
below.
RE to NFA

(1) To recognize an empty string ε

(2) To recognize a symbol a in the alphabet ∑

RE to NFA
(3) If N(r1) and N(r2) are NFAs for regular expressions r1
and r2
(A) For regular expression r1 + r2
RE to NFA
(B) For regular expression r1 . r2
The start state of N(r1) becomes the start state of N(r1r2)
and ﬁnal state of N(r2) become ﬁnal state of N(r1r2)
RE to NFA
(C) For regular expression r*

Using rules 1 and 2, we construct NFAs for each basic

symbol in the expression; we combine these basic NFAs
using rule 3 to obtain an NFA for the entire expression
Prepared by Sherin Joshi
NFA to DFA (Subset Construction)

■ The subset construction algorithm converts an NFA

into a DFA
■ We use the following operations to keep the track of
sets of NFA
– ε-closure(s) → the set of NFA states reachable from
state s on ε- transition
– ε-closure(T) → the set of NFA states reachable
from state s in T on ε- transition
– move(T, a) → the set of NFA states to which there is
transition on input a from state s in T
NFA to DFA

■ The algorithm produces:

– Dstates is the set of states of the new DFA
consisting of sets of states of the NFA
– Dtran is the transition table of the new DFA
■ The start state of DFA is ε-closure(q0)
■ A set of Dstates is an accepting state of DFA if it is a
set of NFA states containing at least one accepting
state of NFA
//Algorithm
Put ε-closure(q0) as an unmarked state in Dstates
While there is an unmarked state T in Dstates do
Mark T
For each input symbol a ∈ ∑ do
U = ε-closure(move(T, a))
If U is not in Dstates then
Add U as an unmarked state to Dstates
End if
Dtran[T, a] = U
End do
End do
Prepared by Sherin Joshi
Prepared by Sherin Joshi
Practice Question

Q) Convert the following NFA to DFA:

Practice Question

Ans:
RE to DFA (directly)

■ Important States:
– The “important states” of an NFA are those
without a null transition; that is, if move({s},a) ≠ ø
for some a, then s is an important state
– In an optimal state machine, all states are
important states
– The subset construction algorithm uses only the
important states when it determines
ε-closure(move(T,a))
RE to DFA

■ Augmented Regular Expression:

■ The ε-NFA created from RE has exactly one
accepting state and it does not have any transition
i.e. it is not important state
■ We introduce an “augmented character” # from the
accepting state to make it an important state
■ The regular expression (r)# is called the augmented
regular expression of the original expression r
Procedure:
1. Augment the given regular expression by
concatenating it with special symbol # i.e r → (r)#
2. Create the syntax tree for this augmented regular
expression
▪ In this syntax tree, all alphabet symbols (plus #
and the empty string) in the augmented regular
expression will be on the leaves, and all inner
nodes will be the operators in that augmented
regular expression.
3. Then each alphabet symbol (plus #) will be
numbered (position numbers)
4. Traverse the tree to construct functions nullable,
ﬁrstpos, lastpos, and followpos
5. Finally construct the DFA from the followpo
Syntax Tree Construction:
■ (a|b)*a → (a|b)*a# [augmented regular expression]

Syntax tree of (a|b)*a#

• each symbol is numbered (positions)

• each symbol is at a leaf
• inner nodes are operators
followpos
■ We deﬁne the function followpos for the positions
(positions assigned to leaves)
■ followpos(i) → is the set of positions which can
follow the position i in the strings generated by the
augmented regular expression
■ For example, ( a | b)* a #
1 2 34
■ followpos(1) = {1, 2, 3}
■ followpos(2) = {1, 2, 3}
■ followpos(3) = {4}
■ followpos(4) = { }
ﬁrstpos, lastpos, nullable

■ To evaluate followpos, we need three functions to

define the nodes (not just for leaves) of the syntax
tree:
■ firstpos(n) → the set of the positions of the first
symbols of strings generated by the sub-expression
rooted by n
■ lastpos(n) → the set of the positions of the last
symbols of strings generated by the sub-expression
rooted by n
■ nullable(n) → true if the empty string is a member of
strings generated by the sub-expression rooted by n,
false otherwise
■ Rules for calculating nullable, firstpos & lastpos
Evaluating followpos
for each node n in the tree do
if n is a cat-node with left child c1 and right child c2 then
for each i in lastpos(c1) do
followpos(i) := followpos(i) ∪ firstpos(c2)
end do
else if n is a star-node
for each i in lastpos(n) do
followpos(i) := followpos(i) ∪ firstpos(n)
end do
end if
end do
Compute the followpos bottom-up for each node o
Evaluating followpos example:
red – firstpos
blue – lastpos

Then we can calculate followpos

followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = { }
After we calculate followpos,
we are ready to create DFA for
the regular expression
Algorithm for converting RE to DFA

1. Create the syntax tree of (r)#

2. Calculate the functions: nullable, ﬁrstpos, lastpos &
followpos
3. Put ﬁrstpos(root) into the states of DFA as an unmarked
state
4. while (there is an unmarked state S in the states of DFA) do
– mark S
– for each input symbol a do
• let s1,...,sn are positions in S and symbols in those
positions is a
• S’ ← followpos(s1) ∪ ... ∪ followpos(sn)
• move(S, a) ← S’
• if (S’ is not empty and not in the states of DFA)
– put S’ into the states of DFA as an unmarked state

The start state of DFA is ﬁrstpos(root)

The accepting states of DFA are all states containing
Prepared by Sherin Joshi
Prepared by Sherin Joshi
State Minimization in DFA

■ DFA minimization refers to the task of transforming a

given DFA into an equivalent DFA which has
minimum number of states
■ Two states p and q are called equivalent if for all
input strings w, δ(p, w) is an accepting state iff δ(q,
w) is an accepting state
■ Otherwise they are called distinguishable states
State Minimization in DFA

■ String w distinguishes state s from state t if, by

starting with DFA M in state s and feeding it input w,
we end up in an accepting state, but starting in state
t and feeding it with same input w, we end up in a non
accepting state, or vice-versa
■ The procedure ﬁnds the states that can be
distinguished by some input string
■ Each group of states that cannot be distinguished is
then merged into a single state
State Minimization in DFA

Suppose there is a DFA D < Q, Σ, q0, δ, F > which

recognizes a language L. Then the minimized DFA D
<Q’, Σ, q0, δ’, F’ > can be constructed for language L as:

Step 1: Divide Q (set of states) into two sets. One set

will contain all ﬁnal states and the other set will contain
all non-ﬁnal states. This partition is called P0.

Step 2: Initialize k = 1
State Minimization in DFA

Step 3: Find Pk by partitioning the different sets of Pk-1.

In each set of Pk-1, take all possible pair of states. If two
states of a set are distinguishable, split the states into
different sets in Pk.

Step 4: Stop when Pk = Pk-1 (No change in partition)

Step 5: All states of one set are merged into one. No. of
states in minimized DFA will be equal to no. of sets in
Pk.
State Minimization in DFA

■ In addition to the procedure, we also remove the

following states from the DFA:
– Unreachable State: A state that cannot be
reached through any transition from any other
state in the DFA
– Dead State: A non-ﬁnal state, that when an
automata reaches, cannot transit into any other
state
State Minimization in DFA

■ How to ﬁnd whether two states in partition Pk are

distinguishable?
– Two states ( qi, qj ) are distinguishable in partition
Pk if for any input symbol a, δ( qi, a ) and δ( qj, a )
are in different sets in partition Pk-1
State Minimization in DFA

■ Start state of the minimized DFA is the group

containing the start state of the original DFA
■ Accepting states of the minimized DFA are the
groups containing the accepting states of the
original DFA
Example
Consider the following DFA shown in figure.
1. P0 will have two sets of states. One set will contain q1, q2, q4 which
are final states of DFA and another set will contain remaining states.
So P0 = { { q1, q2, q4 }, { q0, q3, q5 } }.
2. To calculate P1, we will check whether sets of partition P0 can be
partitioned or not:
i) For set { q1, q2, q4 } :
δ ( q1, 0 ) = δ ( q2, 0 ) = q2 and δ ( q1, 1 ) = δ ( q2, 1 ) = q5, So q1 and q2
are not distinguishable.
Similarly, δ ( q1, 0 ) = δ ( q4, 0 ) = q2 and δ ( q1, 1 ) = δ ( q4, 1 ) = q5, So q1
and q4 are not distinguishable.
Since, q1 and q2 are not distinguishable and q1 and q4 are also not
distinguishable, So q2 and q4 are not distinguishable. So, { q1, q2, q4 }
set will not be partitioned in P1.
ii) For set { q0, q3, q5 } :
δ ( q0, 0 ) = q3 and δ ( q3, 0 ) = q0
δ ( q0, 1) = q1 and δ( q3, 1 ) = q4
Moves of q0 and q3 on input symbol 0 are q3 and q0 respectively
which are in same set in partition P0. Similarly, Moves of q0 and q3 on
input symbol 1 are q1 and q4 which are in same set in partition P0. So,
q0 and q3 are not distinguishable.
δ ( q0, 0 ) = q3 and δ ( q5, 0 ) = q5 and δ ( q0, 1 ) = q1 and δ ( q5, 1 ) = q5
Moves of q0 and q5 on input symbol 1 are q1 and q5 respectively
which are in different sets in partition P0. So, q0 and q5 are
distinguishable. So, set { q0, q3, q5 } will be partitioned into { q0, q3 }
and { q5 }. So,
P1 = { { q1, q2, q4 }, { q0, q3}, { q5 } }
To calculate P2, we will check whether sets of partition P1 can be
partitioned or not:
iii)For set { q1, q2, q4 } :
δ ( q1, 0 ) = δ ( q2, 0 ) = q2 and δ ( q1, 1 ) = δ ( q2, 1 ) = q5, So q1 and q2
are not distinguishable.
Similarly, δ ( q1, 0 ) = δ ( q4, 0 ) = q2 and δ ( q1, 1 ) = δ ( q4, 1 ) = q5, So q1
and q4 are not distinguishable.
Since, q1 and q2 are not distinguishable and q1 and q4 are also not
distinguishable, So q2 and q4 are not distinguishable. So, { q1, q2, q4 }
set will not be partitioned in P2.
iv)For set { q0, q3 } :
δ ( q0, 0 ) = q3 and δ ( q3, 0 ) = q0
δ ( q0, 1 ) = q1 and δ ( q3, 1 ) = q4
Moves of q0 and q3 on input symbol 0 are q3 and q0 respectively
which are in same set in partition P1. Similarly, Moves of q0 and q3 on
input symbol 1 are q1 and q4 which are in same set in partition P1. So,
q0 and q3 are not distinguishable.
v) For set { q5 }:
Since we have only one state in this set, it can’t be further
partitioned. So,
P2 = { { q1, q2, q4 }, { q0, q3 }, { q5 } }
Here, P1=P2. So, this is the final partition. Partition P2 means that q1,
q2 and q4 states are merged into one. Similarly, q0 and q3 are
merged into one. Minimized DFA corresponding to the given input
DFA is shown below:
Practice Question
Q) Minimize the following DFA:
Ans:

Compiler - Design - Module2-Print
No ratings yet
Compiler - Design - Module2-Print
16 pages
Parser Lexical Analysis
No ratings yet
Parser Lexical Analysis
6 pages
UNIT 2 Compiler Design
No ratings yet
UNIT 2 Compiler Design
23 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
68 pages
Lexical Analysis
No ratings yet
Lexical Analysis
5 pages
Lecture 04 05 PDF
No ratings yet
Lecture 04 05 PDF
8 pages
Chapter 2
No ratings yet
Chapter 2
67 pages
Unit 2 Lexical Analyzer
No ratings yet
Unit 2 Lexical Analyzer
30 pages
Compiler Lexical Analysis Guide
No ratings yet
Compiler Lexical Analysis Guide
26 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
58 pages
CD Aii Partb Ans
No ratings yet
CD Aii Partb Ans
8 pages
Unit 01 - PART 2
No ratings yet
Unit 01 - PART 2
25 pages
Lexical Analysis
No ratings yet
Lexical Analysis
45 pages
Comp Chap2
No ratings yet
Comp Chap2
36 pages
Compiler Construction: Chapter # 2 - Lexical Analysis Instructor: Ms. Raazia Sosan
No ratings yet
Compiler Construction: Chapter # 2 - Lexical Analysis Instructor: Ms. Raazia Sosan
53 pages
Lexical Analysis: Deterministic Finite Automata
No ratings yet
Lexical Analysis: Deterministic Finite Automata
37 pages
CD 1
No ratings yet
CD 1
92 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
14 pages
Compiler Design Basics
No ratings yet
Compiler Design Basics
14 pages
Compiler - 2
No ratings yet
Compiler - 2
15 pages
Lexical Analysis
No ratings yet
Lexical Analysis
6 pages
HW 31712
No ratings yet
HW 31712
22 pages
Unit2 Lexical Analyzer
No ratings yet
Unit2 Lexical Analyzer
6 pages
Lexical Analysis Overview
No ratings yet
Lexical Analysis Overview
17 pages
Lexical Analysis in Compilers
No ratings yet
Lexical Analysis in Compilers
5 pages
Ch2 - Lexical Analysis
No ratings yet
Ch2 - Lexical Analysis
71 pages
Unit 2
No ratings yet
Unit 2
61 pages
Chapter 2 Lexical Analysis
No ratings yet
Chapter 2 Lexical Analysis
14 pages
Lexical Analyzer
No ratings yet
Lexical Analyzer
56 pages
Lexical Analyzer
No ratings yet
Lexical Analyzer
24 pages
Role of Lexical Analyzer - Input Buffering
No ratings yet
Role of Lexical Analyzer - Input Buffering
11 pages
ch-2 Compiler Design
No ratings yet
ch-2 Compiler Design
9 pages
Ch2 - Lexical Analysis
No ratings yet
Ch2 - Lexical Analysis
71 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
11 pages
Chapter 2
No ratings yet
Chapter 2
6 pages
Lexical Analysis: Risul Islam Rasel
No ratings yet
Lexical Analysis: Risul Islam Rasel
148 pages
Chapter 2 - Lexical Analysis
No ratings yet
Chapter 2 - Lexical Analysis
10 pages
Lexical Analysis Overview by Atif Ishaq
100% (1)
Lexical Analysis Overview by Atif Ishaq
37 pages
Lecture 3
No ratings yet
Lecture 3
4 pages
Compiler Design for Students
No ratings yet
Compiler Design for Students
40 pages
21CS51 ATCD MODULE 2 - 2 Lexical Analyser Part2
No ratings yet
21CS51 ATCD MODULE 2 - 2 Lexical Analyser Part2
62 pages
CD UNIT-1
No ratings yet
CD UNIT-1
60 pages
Lexical Analysis and Token Recognition
No ratings yet
Lexical Analysis and Token Recognition
67 pages
21CS51 ATCD MODULE 2 - 2 Lexical Analyser Part1
No ratings yet
21CS51 ATCD MODULE 2 - 2 Lexical Analyser Part1
63 pages
Overview of Lexical Analyzers
No ratings yet
Overview of Lexical Analyzers
17 pages
Chapter 2 Lexical Analyser
No ratings yet
Chapter 2 Lexical Analyser
40 pages
Unit 2 Lexical Analysis - Part 1: Harshita Sharma
No ratings yet
Unit 2 Lexical Analysis - Part 1: Harshita Sharma
55 pages
Chapter 2 Lexical Analysis
No ratings yet
Chapter 2 Lexical Analysis
26 pages
Lecture 02
No ratings yet
Lecture 02
150 pages
2 1 Lexical Analysis
No ratings yet
2 1 Lexical Analysis
30 pages
CD - Module 2
No ratings yet
CD - Module 2
12 pages
02 Lexical Analysis
No ratings yet
02 Lexical Analysis
86 pages
Unit 2 Lexical Analysis
No ratings yet
Unit 2 Lexical Analysis
94 pages
Chapter 2
No ratings yet
Chapter 2
41 pages
Pcdunit2 Class
No ratings yet
Pcdunit2 Class
21 pages
Chapter 2 - Lexical Analyser
No ratings yet
Chapter 2 - Lexical Analyser
38 pages
Lexical Analysis
No ratings yet
Lexical Analysis
62 pages
Lexical Analysis
No ratings yet
Lexical Analysis
62 pages
Lexical Analysis
No ratings yet
Lexical Analysis
153 pages
Three-Pass Compiler Optimization Techniques
No ratings yet
Three-Pass Compiler Optimization Techniques
50 pages
PDDL: Knight Tour With Holes: Urukh-Ai
No ratings yet
PDDL: Knight Tour With Holes: Urukh-Ai
14 pages
BKS Unit II-S and L-Attributed SDDs
No ratings yet
BKS Unit II-S and L-Attributed SDDs
32 pages
OCaml Basics for CS Students
No ratings yet
OCaml Basics for CS Students
61 pages
String & Matrix
No ratings yet
String & Matrix
13 pages
Tutorial1 PDF
No ratings yet
Tutorial1 PDF
27 pages
E. W. Beth - Semantic Construction of Intuitionistic Logic (1956, North-Holland) - Libgen - Li
No ratings yet
E. W. Beth - Semantic Construction of Intuitionistic Logic (1956, North-Holland) - Libgen - Li
36 pages
Grammar and Language: Grammar: It Is System That Specifies
No ratings yet
Grammar and Language: Grammar: It Is System That Specifies
40 pages
Regular Expressions
No ratings yet
Regular Expressions
17 pages
OCaml: Effects Without Monads
100% (1)
OCaml: Effects Without Monads
26 pages
Unit 6 Computable Function
No ratings yet
Unit 6 Computable Function
8 pages
ACSL Contest 2: Prefix, Bit Strings, LISP
100% (1)
ACSL Contest 2: Prefix, Bit Strings, LISP
22 pages
Unit 2
No ratings yet
Unit 2
15 pages
CS264A ReviewNotes
No ratings yet
CS264A ReviewNotes
11 pages
Backus-Naur Form - Computer Science
100% (1)
Backus-Naur Form - Computer Science
18 pages
Kant's Logic in The Critique of Practical Reason
No ratings yet
Kant's Logic in The Critique of Practical Reason
157 pages
Prac 5
No ratings yet
Prac 5
6 pages
Knowledge Representation in AI: Logic & Reasoning
No ratings yet
Knowledge Representation in AI: Logic & Reasoning
10 pages
Compiler Design
No ratings yet
Compiler Design
2 pages
Btech-IT-semantic Nets (Compatibility Mode)
No ratings yet
Btech-IT-semantic Nets (Compatibility Mode)
33 pages
Quine On Deduction
No ratings yet
Quine On Deduction
11 pages
Java Enums and Type Wrappers Guide
No ratings yet
Java Enums and Type Wrappers Guide
9 pages
The Cambridge Quintet: Book Review by J. Lambek, Mcgill University
No ratings yet
The Cambridge Quintet: Book Review by J. Lambek, Mcgill University
2 pages
Applications of Propositional Logic To Cellular Automata
No ratings yet
Applications of Propositional Logic To Cellular Automata
25 pages
CS402 Short Notes: For More Visit
No ratings yet
CS402 Short Notes: For More Visit
39 pages
Artificial Intelligence
No ratings yet
Artificial Intelligence
17 pages
First Order Predicate Logic Overview
No ratings yet
First Order Predicate Logic Overview
30 pages
Logic for Computer Science Students
No ratings yet
Logic for Computer Science Students
30 pages
Compiler Construction Guide
No ratings yet
Compiler Construction Guide
20 pages
Computational Learning Theory Guide
No ratings yet
Computational Learning Theory Guide
43 pages

2.1 - Lexical Analysis

Uploaded by

2.1 - Lexical Analysis

Uploaded by

LEXICAL

■ It is the initial part of reading and analyzing the

■ A lexical analyzer doesn’t return a list of tokens at

■ The function of a lexical analyzer is to read the input

– When a string representing a program is broken

■ Example: take statement, area = 3.1416 * r * r

■ Though error at lexical analysis is normally not

■ Some error recovery techniques

■ Technique used to speed up reading the source

■ Specialized buffering techniques have been

■ Each buffer is of the same size N, and N is usually the

■ In the above ﬁgure, forward has passed the end of

■ If we use the 2N buffering scheme, we must check,

■ Regular expression is the common way of specifying

– * → Kleene star operator, A* = ⋃i≥0 Ai

■ Regular Expression (RE):

■ If r and s are regular expressions denoting languages

(a) RE that gives the binary strings that are divisible by 2

■ A recognizer for a language is a program that takes a

■ If the transition leads to the accepting state, then the

■ We concentrate on a class of recognizer called Finite

■ Why convert from NFA to DFA?

■ The following is the algorithm for simulating DFA for

■ The ﬁgure shows DFA for the regular expression:

Q) Write the regular expression and draw its corresponding

(1) Language accepting strings containing exactly two 0s

■ The ﬁgure shows NFA for the regular expression:

(1) Design an NFA with ∑ = {0, 1} that accepts all strings

(2) Design an NFA with ∑ = {0, 1} which accepts all

■ ε-transitions are allowed in ε-NFAs

■ The ﬁgure shows a state machine with ε moves that

■ Thompson’s construction is a simple and systematic

■ Input → RE, r, over alphabet ∑

(1) To recognize an empty string ε

(2) To recognize a symbol a in the alphabet ∑

Using rules 1 and 2, we construct NFAs for each basic

■ The subset construction algorithm converts an NFA

■ The algorithm produces:

Q) Convert the following NFA to DFA:

■ Augmented Regular Expression:

Syntax tree of (a|b)*a#

• each symbol is numbered (positions)

■ To evaluate followpos, we need three functions to

Then we can calculate followpos

1. Create the syntax tree of (r)#

The start state of DFA is ﬁrstpos(root)

■ DFA minimization refers to the task of transforming a

■ String w distinguishes state s from state t if, by

Suppose there is a DFA D < Q, Σ, q0, δ, F > which

Step 1: Divide Q (set of states) into two sets. One set

Step 3: Find Pk by partitioning the different sets of Pk-1.

Step 4: Stop when Pk = Pk-1 (No change in partition)

■ In addition to the procedure, we also remove the

■ How to ﬁnd whether two states in partition Pk are

■ Start state of the minimized DFA is the group

You might also like