
Unit-2

Lexical Analysis

2.1 OVERVIEW OF LEXICAL ANALYSIS


 To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
 Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by token recognizers, which are designed using transition diagrams and finite automata.

2.2 ROLE OF LEXICAL ANALYZER


The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Fig. 3.1: Role of Lexical analyzer

Upon receiving a ‘get next token’ command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.

2.3 TOKEN, LEXEME, PATTERN:

Token: A token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are:

1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants

Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Fig. 3.2: Example of Token, Lexeme and Pattern

2.4 LEXICAL ERRORS:


Lexical errors are the errors thrown by the lexer when it is unable to continue, i.e. when there is no way to recognise a lexeme as a valid token. Syntax errors, on the other hand, are thrown by the parser when a given sequence of already recognised valid tokens does not match any of the right-hand sides of the grammar rules. A simple panic-mode error handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.

Error-recovery actions are:


2.4.1 Delete one character from the remaining input.
2.4.2 Insert a missing character into the remaining input.
2.4.3 Replace a character by another character.
2.4.4 Transpose two adjacent characters.
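
As an illustration, the following Python sketch (our own, not from the text) enumerates the candidate strings reachable from an unrecognized lexeme by each of these four single-edit repairs; for instance, the misspelling "fi" is one transposition away from the keyword "if":

def candidate_repairs(lexeme: str, alphabet: str = "abcdefghijklmnopqrstuvwxyz"):
    """Yield every string reachable from `lexeme` by one edit operation."""
    # 1) Delete one character from the remaining input
    for i in range(len(lexeme)):
        yield lexeme[:i] + lexeme[i + 1:]
    # 2) Insert a missing character
    for i in range(len(lexeme) + 1):
        for ch in alphabet:
            yield lexeme[:i] + ch + lexeme[i:]
    # 3) Replace a character by another character
    for i in range(len(lexeme)):
        for ch in alphabet:
            yield lexeme[:i] + ch + lexeme[i + 1:]
    # 4) Transpose two adjacent characters
    for i in range(len(lexeme) - 1):
        yield lexeme[:i] + lexeme[i + 1] + lexeme[i] + lexeme[i + 2:]

print("if" in set(candidate_repairs("fi")))   # True: one transposition repairs it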

2.5 REGULAR EXPRESSIONS


A regular expression is a formula that describes a set of strings. The components of a regular expression are:
x        the character x
.        any character, usually except a newline
[xyz]    any of the characters x, y, z, …
R?       an R or nothing (i.e. optionally an R)
R*       zero or more occurrences of R
R+       one or more occurrences of R
R1R2     an R1 followed by an R2
R1|R2    either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular-expression notation we would write:
identifier = letter (letter | digit)*
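
As a quick illustrative check, the character-class form of this definition can be tried out with Python's re module (the pattern below is just a rendering of letter (letter | digit)*):

import re

# letter (letter | digit)*: a letter, then any run of letters or digits
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(identifier.fullmatch("count1")))   # True:  begins with a letter
print(bool(identifier.fullmatch("1count")))   # False: must begin with a letter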
Here are the rules that define the regular expressions over an alphabet Σ:
 ε is a regular expression denoting { ε }, that is, the language containing only the empty string.
 For each ‘a’ in Σ, ‘a’ is a regular expression denoting { a }, the language with only one string, consisting of the single symbol ‘a’.
 If R and S are regular expressions, then
(R) | (S) denotes L(R) ∪ L(S)
R.S denotes L(R).L(S)
R* denotes (L(R))*

2.6 REGULAR DEFINITIONS


For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings.
Example-1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | … | Z | a | b | … | z
digit  → 0 | 1 | 2 | … | 9
id     → letter (letter | digit)*

Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is “not equals”, because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions:
digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | == | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the “token” ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.

Lexeme       Token Name   Attribute Value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
==           relop        EQ
<>           relop        NE
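
The following Python sketch shows one way these regular definitions and the table above could be turned into a scanner that skips ws and returns (token, attribute) pairs to the parser. The names (TOKEN_SPEC, tokens, …) are our own, and GT/GE attributes are added for the > and >= operators from the relop definition:

import re

KEYWORDS = {"if", "then", "else"}
RELOP = {"<": "LT", "<=": "LE", "==": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),
    ("number", r"\d+(\.\d+)?(E[+-]?\d+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|>=|==|<>|<|>"),
]

def tokens(source: str):
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group()
                pos += len(lexeme)
                if name == "ws":
                    break                    # strip ws: nothing returned to parser
                if name == "id" and lexeme in KEYWORDS:
                    yield (lexeme, None)     # keywords are their own token names
                elif name == "relop":
                    yield ("relop", RELOP[lexeme])
                else:
                    yield (name, lexeme)     # attribute stands in for a table pointer
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}")

print(list(tokens("if x1 <= 42 then y")))
# [('if', None), ('id', 'x1'), ('relop', 'LE'), ('number', '42'), ('then', None), ('id', 'y')]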

2.7 TRANSITION DIAGRAM:


A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start” entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.
Fig. 3.3: Transition diagram of Relational operators

As an intermediate step in the construction of a LA, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

Fig. 3.4: Transition diagram of Identifier

The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.
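
For example, a hand-coded Python sketch of the relop diagram (our own rendering, not the book's code) might look like this, with one piece of code per state; the * states retract the forward pointer by returning a one-character lexeme:

def relop(source: str, pos: int):
    """Return ((token, attribute), position after the lexeme), or None."""
    c = source[pos] if pos < len(source) else ""
    nxt = source[pos + 1] if pos + 1 < len(source) else ""
    if c == "<":                              # start --'<'--> next state
        if nxt == "=":
            return ("relop", "LE"), pos + 2   # accept <=
        if nxt == ">":
            return ("relop", "NE"), pos + 2   # accept <>
        return ("relop", "LT"), pos + 1       # * state: retract, accept <
    if c == "=":
        if nxt == "=":
            return ("relop", "EQ"), pos + 2   # accept ==
        return None                           # a lone = is not a relop here
    if c == ">":
        if nxt == "=":
            return ("relop", "GE"), pos + 2   # accept >=
        return ("relop", "GT"), pos + 1       # * state: retract, accept >
    return None

print(relop("a<>b", 1))   # (('relop', 'NE'), 3)
print(relop("x<y", 1))    # (('relop', 'LT'), 2)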

2.8 FINITE AUTOMATON


 A recognizer for a language is a program that takes a string x and answers “yes” if x is a sentence of that language, and “no” otherwise.
 We call the recognizer of the tokens a finite automaton.
 A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
 This means that we may use a deterministic or a non-deterministic automaton as a lexical analyzer.
 Both deterministic and non-deterministic finite automata recognize regular sets.
 Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
 First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
2.9 Non-Deterministic Finite Automaton (NFA)
 A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S – a set of states
o Σ – a set of input symbols (alphabet)
o move – a transition function that maps state-symbol pairs to sets of states
o s0 – a start (initial) state
o F – a set of accepting (final) states
 ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
 An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
Example:
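As a concrete sketch, the following Python function simulates acceptance for an NFA without ε-edges, using an assumed representation: a dict from (state, symbol) pairs to sets of states (ε-edges are handled by the ε-closure function of section 2.12):

def accepts(nfa, start, finals, x):
    """Track the set of states reachable while reading x; accept if a final is reached."""
    current = {start}
    for symbol in x:
        current = set().union(*(nfa.get((s, symbol), set()) for s in current))
    return bool(current & finals)

# NFA for (a|b)*a: stay in state 0 on a or b, or guess that this a is the last symbol.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
print(accepts(nfa, 0, {1}, "ba"))   # True:  some path ends in state 1
print(accepts(nfa, 0, {1}, "ab"))   # False: input does not end with a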

2.10 Deterministic Finite Automaton (DFA)

 A Deterministic Finite Automaton (DFA) is a special form of an NFA, in which:

 no state has an ε-transition, and
 for each symbol a and state s, there is at most one edge labeled a leaving s; i.e. the transition function maps a state-symbol pair to a single state (not to a set of states).

Example:
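A DFA can therefore be simulated by following the single path the input determines; a minimal Python sketch (same dict representation as before, but each entry now maps to exactly one state):

def dfa_accepts(delta, start, finals, x):
    """Follow the unique transition for each input symbol."""
    state = start
    for symbol in x:
        if (state, symbol) not in delta:
            return False                    # no edge on this symbol: reject
        state = delta[(state, symbol)]
    return state in finals

# DFA for (a|b)*a: state 1 means "the last symbol read was an a".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
print(dfa_accepts(delta, 0, {1}, "abba"))   # True
print(dfa_accepts(delta, 0, {1}, "abb"))    # False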
2.11 Converting RE to NFA
 This is one way to convert a regular expression into an NFA.
 There are other (more efficient) ways to do the conversion.
 Thompson’s Construction is a simple and systematic method.
 It guarantees that the resulting NFA will have exactly one final state and one start state.
 Construction starts from the simplest parts (alphabet symbols).
 To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.
 To recognize an empty string ε:

 To recognize a symbol a in the alphabet Σ:

 For regular expression r1 | r2:

N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
 For regular expression r1 r2

Here, the final state of N(r1) is merged with the start state of N(r2), and the final state of N(r2) becomes the final state of N(r1r2).


 For regular expression r*

Example:
For the RE (a|b)*a, the NFA construction is sketched below.
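
The following Python sketch builds this NFA bottom-up in the spirit of Thompson's Construction. The representation is our own (an NFA as a (start, accept, edges) triple, None labelling ε-edges), and the concatenation case uses an ε-edge instead of physically merging the two states:

import itertools
_new_state = itertools.count().__next__

def symbol(a):                       # NFA recognizing the single symbol a
    s, f = _new_state(), _new_state()
    return (s, f, [(s, a, f)])

def union(n1, n2):                   # r1 | r2: new start/final, ε-edges in and out
    s, f = _new_state(), _new_state()
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    edges = e1 + e2 + [(s, None, s1), (s, None, s2), (f1, None, f), (f2, None, f)]
    return (s, f, edges)

def concat(n1, n2):                  # r1 r2: ε-edge joins final of N(r1) to start of N(r2)
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    return (s1, f2, e1 + e2 + [(f1, None, s2)])

def star(n):                         # r*: loop back, plus a bypass for zero occurrences
    s, f = _new_state(), _new_state()
    s1, f1, e1 = n
    edges = e1 + [(s, None, s1), (f1, None, f), (s, None, f), (f1, None, s1)]
    return (s, f, edges)

# (a|b)*a, as in the example above: exactly one start and one final state.
start, accept, edges = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(f"start={start}, accept={accept}, edges={len(edges)}")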

2.12 Converting NFA to DFA (Subset Construction)


We merge together NFA states by looking at them from the point of view of the input characters:
 From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
 If it is possible to have multiple transitions based on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
To perform this operation, let us define two functions:
 The ε-closure function takes a state and returns the set of states reachable from it based on (zero or more) ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
 The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
We can generalise both of these functions to apply to sets of states by taking the union of the application to individual states.

For example, if A, B and C are states, move({A,B,C}, a) = move(A, a) ∪ move(B, a) ∪ move(C, a).
The Subset Construction Algorithm is as follows:

put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is one unmarked S1 in DS) do
begin
mark S1
for each input symbol a do
begin
S2 ← ε-closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] ← S2
end
end

 a state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA

 the start state of the DFA is ε-closure({s0})
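
A direct Python rendering of this pseudocode might look like this (same assumed NFA representation as earlier: a dict from (state, symbol) to a set of states, with "" labelling ε-edges):

def eps_closure(nfa, states):
    """States reachable from `states` via zero or more ε-transitions."""
    stack, result = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), set()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result

def move(nfa, states, a):
    """States reachable from `states` by one transition on symbol a."""
    return set().union(*(nfa.get((s, a), set()) for s in states))

def subset_construction(nfa, s0, alphabet):
    start = frozenset(eps_closure(nfa, {s0}))
    ds, unmarked, transfunc = {start}, [start], {}
    while unmarked:                      # while there is an unmarked S1 in DS
        s1 = unmarked.pop()              # mark S1
        for a in alphabet:
            s2 = frozenset(eps_closure(nfa, move(nfa, s1, a)))
            if s2 and s2 not in ds:      # empty set means "no transition"
                ds.add(s2)
                unmarked.append(s2)      # add S2 as an unmarked state
            transfunc[(s1, a)] = s2
    return start, ds, transfunc

# Small NFA for a*b: 0 -ε-> 1, 1 -a-> 1, 1 -b-> 2 (state 2 is the NFA final state).
nfa = {(0, ""): {1}, (1, "a"): {1}, (1, "b"): {2}}
start, ds, delta = subset_construction(nfa, 0, "ab")
print(sorted(map(sorted, ds)))           # DFA states: [[0, 1], [1], [2]]
accepting = {S for S in ds if 2 in S}    # a DFA state accepts if it contains state 2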

2.13 Lexical Analyzer Generator

2.14 Lex specifications:


A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PI 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form:

p1 {action 1}
p2 {action 2}
p3 {action 3}
……
……
where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when pattern p matches a lexeme. In Lex, the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.
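
For concreteness, a minimal illustrative Lex specification in this three-part format might look as follows (the token codes and patterns here are our own sketch, not the book's sample program):

%{
/* declarations: illustrative token codes for the parser */
#define IF  256
#define ID  257
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
%%
{ws}    { /* strip white space; return nothing to the parser */ }
if      { return IF; }
{id}    { return ID; }
%%
/* auxiliary procedures needed by the actions would go here */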

Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.

2.15. INPUT BUFFERING

The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:

1. Buffer pairs
2. Sentinels

The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read, or at the symbol it is ready to read.

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
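
A minimal Python sketch of the buffer-pair idea with sentinels follows (the half size, names and sentinel character are illustrative; real scanners do this in C with pointer arithmetic). The forward pointer reloads the other half whenever it hits the sentinel at the end of the current half:

import io

N = 8                                     # half size; the text's example uses 100
SENTINEL = "\0"                           # assumed never to occur in real input

class TwoBuffers:
    def __init__(self, stream):
        self.stream = stream
        self.halves = [self._load(), ""]  # second half is loaded on demand
        self.cur, self.i = 0, 0           # current half, forward pointer within it

    def _load(self):
        """Read up to N characters and terminate the half with a sentinel."""
        return self.stream.read(N) + SENTINEL

    def next_char(self):
        c = self.halves[self.cur][self.i]
        if c != SENTINEL:
            self.i += 1
            return c
        if self.i < N:                    # sentinel before the half end: real EOF
            return ""
        other = 1 - self.cur              # sentinel at half end: reload other half
        self.halves[other] = self._load()
        self.cur, self.i = other, 0
        return self.next_char()

src = TwoBuffers(io.StringIO("DECLARE (ARG1, ARG2)"))
out = []
while (ch := src.next_char()):
    out.append(ch)
print("".join(out))                       # characters delivered one at a time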
