
Unit-2

Lexical Analysis

2.1 OVERVIEW OF LEXICAL ANALYSIS


 To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
 Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by token recognizers, which are designed using transition diagrams and finite automata.

2.2 ROLE OF LEXICAL ANALYZER


The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Fig. 3.1: Role of Lexical analyzer

Upon receiving a ‘get next token’ command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.

2.3 TOKEN, LEXEME, PATTERN:

Token: A token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are:

1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants

Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Fig. 3.2: Example of Token, Lexeme and Pattern

2.4 LEXICAL ERRORS:


Lexical errors are the errors thrown by the lexer when it is unable to continue, i.e. when there is no way to recognise a lexeme as a valid token. Syntax errors, on the other hand, are thrown by the parser when a given sequence of already recognised valid tokens does not match any of the right-hand sides of the grammar rules. A simple panic-mode error handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.

Error-recovery actions are:


2.4.1 Delete one character from the remaining input.
2.4.2 Insert a missing character into the remaining input.
2.4.3 Replace a character by another character.
2.4.4 Transpose two adjacent characters.
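
As an illustration, the following Python sketch (our own, not from the text) enumerates the candidate strings reachable from an unrecognized lexeme by each of these four single-edit repairs; for instance, the misspelling "fi" is one transposition away from the keyword "if":

def candidate_repairs(lexeme: str, alphabet: str = "abcdefghijklmnopqrstuvwxyz"):
    """Yield every string reachable from `lexeme` by one edit operation."""
    # 1) Delete one character from the remaining input
    for i in range(len(lexeme)):
        yield lexeme[:i] + lexeme[i + 1:]
    # 2) Insert a missing character
    for i in range(len(lexeme) + 1):
        for ch in alphabet:
            yield lexeme[:i] + ch + lexeme[i:]
    # 3) Replace a character by another character
    for i in range(len(lexeme)):
        for ch in alphabet:
            yield lexeme[:i] + ch + lexeme[i + 1:]
    # 4) Transpose two adjacent characters
    for i in range(len(lexeme) - 1):
        yield lexeme[:i] + lexeme[i + 1] + lexeme[i] + lexeme[i + 2:]

print("if" in set(candidate_repairs("fi")))   # True: one transposition repairs it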

2.5 REGULAR EXPRESSIONS


A regular expression is a formula that describes a set of strings. The components of a regular expression are:
x        the character x
.        any character, usually except a newline
[xyz]    any of the characters x, y, z, …
R?       an R or nothing (i.e. optionally an R)
R*       zero or more occurrences of R
R+       one or more occurrences of R
R1R2     an R1 followed by an R2
R1|R2    either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular-expression notation we would write:
identifier = letter (letter | digit)*
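
As a quick illustrative check, the character-class form of this definition can be tried out with Python's re module (the pattern below is just a rendering of letter (letter | digit)*):

import re

# letter (letter | digit)*: a letter, then any run of letters or digits
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(identifier.fullmatch("count1")))   # True:  begins with a letter
print(bool(identifier.fullmatch("1count")))   # False: must begin with a letter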
Here are the rules that define the regular expressions over an alphabet Σ:
 ε is a regular expression denoting { ε }, that is, the language containing only the empty string.
 For each ‘a’ in Σ, ‘a’ is a regular expression denoting { a }, the language with only one string, consisting of the single symbol ‘a’.
 If R and S are regular expressions, then
(R) | (S) denotes L(R) ∪ L(S)
R.S denotes L(R).L(S)
R* denotes (L(R))*

2.6 REGULAR DEFINITIONS


For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings.
Example-1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | … | Z | a | b | … | z
digit  → 0 | 1 | 2 | … | 9
id     → letter (letter | digit)*

Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is “not equals”, because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions:
digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | == | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the “token” ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.

Lexeme       Token Name   Attribute Value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
==           relop        EQ
<>           relop        NE
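
The following Python sketch shows one way these regular definitions and the table above could be turned into a scanner that skips ws and returns (token, attribute) pairs to the parser. The names (TOKEN_SPEC, tokens, …) are our own, and GT/GE attributes are added for the > and >= operators from the relop definition:

import re

KEYWORDS = {"if", "then", "else"}
RELOP = {"<": "LT", "<=": "LE", "==": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),
    ("number", r"\d+(\.\d+)?(E[+-]?\d+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|>=|==|<>|<|>"),
]

def tokens(source: str):
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group()
                pos += len(lexeme)
                if name == "ws":
                    break                    # strip ws: nothing returned to parser
                if name == "id" and lexeme in KEYWORDS:
                    yield (lexeme, None)     # keywords are their own token names
                elif name == "relop":
                    yield ("relop", RELOP[lexeme])
                else:
                    yield (name, lexeme)     # attribute stands in for a table pointer
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}")

print(list(tokens("if x1 <= 42 then y")))
# [('if', None), ('id', 'x1'), ('relop', 'LE'), ('number', '42'), ('then', None), ('id', 'y')]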

2.7 TRANSITION DIAGRAM:


A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start” entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.
Fig. 3.3: Transition diagram of Relational operators

As an intermediate step in the construction of a LA, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

Fig. 3.4: Transition diagram of Identifier

The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.
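
For example, a hand-coded Python sketch of the relop diagram (our own rendering, not the book's code) might look like this, with one piece of code per state; the * states retract the forward pointer by returning a one-character lexeme:

def relop(source: str, pos: int):
    """Return ((token, attribute), position after the lexeme), or None."""
    c = source[pos] if pos < len(source) else ""
    nxt = source[pos + 1] if pos + 1 < len(source) else ""
    if c == "<":                              # start --'<'--> next state
        if nxt == "=":
            return ("relop", "LE"), pos + 2   # accept <=
        if nxt == ">":
            return ("relop", "NE"), pos + 2   # accept <>
        return ("relop", "LT"), pos + 1       # * state: retract, accept <
    if c == "=":
        if nxt == "=":
            return ("relop", "EQ"), pos + 2   # accept ==
        return None                           # a lone = is not a relop here
    if c == ">":
        if nxt == "=":
            return ("relop", "GE"), pos + 2   # accept >=
        return ("relop", "GT"), pos + 1       # * state: retract, accept >
    return None

print(relop("a<>b", 1))   # (('relop', 'NE'), 3)
print(relop("x<y", 1))    # (('relop', 'LT'), 2)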

2.8 FINITE AUTOMATON


 A recognizer for a language is a program that takes a string x and answers “yes” if x is a sentence of that language, and “no” otherwise.
 We call the recognizer of the tokens a finite automaton.
 A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
 This means that we may use a deterministic or a non-deterministic automaton as a lexical analyzer.
 Both deterministic and non-deterministic finite automata recognize regular sets.
 Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
 First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
2.9 Non-Deterministic Finite Automaton (NFA)
 A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S – a set of states
o Σ – a set of input symbols (alphabet)
o move – a transition function that maps state-symbol pairs to sets of states
o s0 – a start (initial) state
o F – a set of accepting (final) states
 ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
 An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
Example:
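As a concrete sketch, the following Python function simulates acceptance for an NFA without ε-edges, using an assumed representation: a dict from (state, symbol) pairs to sets of states (ε-edges are handled by the ε-closure function of section 2.12):

def accepts(nfa, start, finals, x):
    """Track the set of states reachable while reading x; accept if a final is reached."""
    current = {start}
    for symbol in x:
        current = set().union(*(nfa.get((s, symbol), set()) for s in current))
    return bool(current & finals)

# NFA for (a|b)*a: stay in state 0 on a or b, or guess that this a is the last symbol.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
print(accepts(nfa, 0, {1}, "ba"))   # True:  some path ends in state 1
print(accepts(nfa, 0, {1}, "ab"))   # False: input does not end with a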

2.10 Deterministic Finite Automaton (DFA)

 A Deterministic Finite Automaton (DFA) is a special form of an NFA, in which:

 no state has an ε-transition, and
 for each symbol a and state s, there is at most one edge labeled a leaving s; i.e. the transition function maps a state-symbol pair to a single state (not to a set of states).

Example:
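A DFA can therefore be simulated by following the single path the input determines; a minimal Python sketch (same dict representation as before, but each entry now maps to exactly one state):

def dfa_accepts(delta, start, finals, x):
    """Follow the unique transition for each input symbol."""
    state = start
    for symbol in x:
        if (state, symbol) not in delta:
            return False                    # no edge on this symbol: reject
        state = delta[(state, symbol)]
    return state in finals

# DFA for (a|b)*a: state 1 means "the last symbol read was an a".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
print(dfa_accepts(delta, 0, {1}, "abba"))   # True
print(dfa_accepts(delta, 0, {1}, "abb"))    # False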
2.11 Converting RE to NFA
 This is one way to convert a regular expression into an NFA.
 There are other (more efficient) ways to do the conversion.
 Thompson’s Construction is a simple and systematic method.
 It guarantees that the resulting NFA will have exactly one final state and one start state.
 Construction starts from the simplest parts (alphabet symbols).
 To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.
 To recognize an empty string ε:

 To recognize a symbol a in the alphabet Σ:

 For regular expression r1 | r2:

N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
 For regular expression r1 r2

Here, the final state of N(r1) is merged with the start state of N(r2), and the final state of N(r2) becomes the final state of N(r1r2).


 For regular expression r*

Example:
For the RE (a|b)*a, the NFA construction is sketched below.
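
The following Python sketch builds this NFA bottom-up in the spirit of Thompson's Construction. The representation is our own (an NFA as a (start, accept, edges) triple, None labelling ε-edges), and the concatenation case uses an ε-edge instead of physically merging the two states:

import itertools
_new_state = itertools.count().__next__

def symbol(a):                       # NFA recognizing the single symbol a
    s, f = _new_state(), _new_state()
    return (s, f, [(s, a, f)])

def union(n1, n2):                   # r1 | r2: new start/final, ε-edges in and out
    s, f = _new_state(), _new_state()
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    edges = e1 + e2 + [(s, None, s1), (s, None, s2), (f1, None, f), (f2, None, f)]
    return (s, f, edges)

def concat(n1, n2):                  # r1 r2: ε-edge joins final of N(r1) to start of N(r2)
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    return (s1, f2, e1 + e2 + [(f1, None, s2)])

def star(n):                         # r*: loop back, plus a bypass for zero occurrences
    s, f = _new_state(), _new_state()
    s1, f1, e1 = n
    edges = e1 + [(s, None, s1), (f1, None, f), (s, None, f), (f1, None, s1)]
    return (s, f, edges)

# (a|b)*a, as in the example above: exactly one start and one final state.
start, accept, edges = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(f"start={start}, accept={accept}, edges={len(edges)}")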

2.12 Converting NFA to DFA (Subset Construction)


We merge together NFA states by looking at them from the point of view of the input characters:
 From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
 If it is possible to have multiple transitions based on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
To perform this operation, let us define two functions:
 The ε-closure function takes a state and returns the set of states reachable from it based on (zero or more) ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
 The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
We can generalise both of these functions to apply to sets of states by taking the union of the application to individual states.

For example, if A, B and C are states, move({A,B,C}, a) = move(A, a) ∪ move(B, a) ∪ move(C, a).
The Subset Construction Algorithm is as follows:

put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is one unmarked S1 in DS) do
begin
mark S1
for each input symbol a do
begin
S2 ← ε-closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] ← S2
end
end

 a state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA

 the start state of the DFA is ε-closure({s0})
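
A direct Python rendering of this pseudocode might look like this (same assumed NFA representation as earlier: a dict from (state, symbol) to a set of states, with "" labelling ε-edges):

def eps_closure(nfa, states):
    """States reachable from `states` via zero or more ε-transitions."""
    stack, result = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), set()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result

def move(nfa, states, a):
    """States reachable from `states` by one transition on symbol a."""
    return set().union(*(nfa.get((s, a), set()) for s in states))

def subset_construction(nfa, s0, alphabet):
    start = frozenset(eps_closure(nfa, {s0}))
    ds, unmarked, transfunc = {start}, [start], {}
    while unmarked:                      # while there is an unmarked S1 in DS
        s1 = unmarked.pop()              # mark S1
        for a in alphabet:
            s2 = frozenset(eps_closure(nfa, move(nfa, s1, a)))
            if s2 and s2 not in ds:      # empty set means "no transition"
                ds.add(s2)
                unmarked.append(s2)      # add S2 as an unmarked state
            transfunc[(s1, a)] = s2
    return start, ds, transfunc

# Small NFA for a*b: 0 -ε-> 1, 1 -a-> 1, 1 -b-> 2 (state 2 is the NFA final state).
nfa = {(0, ""): {1}, (1, "a"): {1}, (1, "b"): {2}}
start, ds, delta = subset_construction(nfa, 0, "ab")
print(sorted(map(sorted, ds)))           # DFA states: [[0, 1], [1], [2]]
accepting = {S for S in ds if 2 in S}    # a DFA state accepts if it contains state 2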

2.13 Lexical Analyzer Generator

2.14 Lex specifications:


A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PI 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form:

p1 {action 1}
p2 {action 2}
p3 {action 3}
……
……
where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when pattern p matches a lexeme. In Lex, the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.
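
For concreteness, a minimal illustrative Lex specification in this three-part format might look as follows (the token codes and patterns here are our own sketch, not the book's sample program):

%{
/* declarations: illustrative token codes for the parser */
#define IF  256
#define ID  257
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
%%
{ws}    { /* strip white space; return nothing to the parser */ }
if      { return IF; }
{id}    { return ID; }
%%
/* auxiliary procedures needed by the actions would go here */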

Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.

2.15. INPUT BUFFERING

The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:

1. Buffer pairs
2. Sentinels

The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read, or at the symbol it is ready to read.

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
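
A minimal Python sketch of the buffer-pair idea with sentinels follows (the half size, names and sentinel character are illustrative; real scanners do this in C with pointer arithmetic). The forward pointer reloads the other half whenever it hits the sentinel at the end of the current half:

import io

N = 8                                     # half size; the text's example uses 100
SENTINEL = "\0"                           # assumed never to occur in real input

class TwoBuffers:
    def __init__(self, stream):
        self.stream = stream
        self.halves = [self._load(), ""]  # second half is loaded on demand
        self.cur, self.i = 0, 0           # current half, forward pointer within it

    def _load(self):
        """Read up to N characters and terminate the half with a sentinel."""
        return self.stream.read(N) + SENTINEL

    def next_char(self):
        c = self.halves[self.cur][self.i]
        if c != SENTINEL:
            self.i += 1
            return c
        if self.i < N:                    # sentinel before the half end: real EOF
            return ""
        other = 1 - self.cur              # sentinel at half end: reload other half
        self.halves[other] = self._load()
        self.cur, self.i = other, 0
        return self.next_char()

src = TwoBuffers(io.StringIO("DECLARE (ARG1, ARG2)"))
out = []
while (ch := src.next_char()):
    out.append(ch)
print("".join(out))                       # characters delivered one at a time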
