Unit 2
Lexical Analyzer
Outline
Informal sketch of lexical analysis
Identifies tokens in input string
Issues in lexical analysis
Lookahead
Ambiguities
Specifying lexemes
Regular expressions
Examples of regular expressions
Lexical Analyzer
Functions
Grouping input characters into tokens
Stripping out comments and white space
Correlating error messages with the
source program
Issues (why separate lexical analysis
from parsing)
Simpler design
Compiler efficiency
Compiler portability (e.g. Linux to Win)
The role of the lexical
analyzer
Lexical analyzers are divided into two processes:
Scanning
Simple processing that needs no tokenization of the input:
deletion of comments, compaction of consecutive white-space
characters
Lexical analysis
Producing tokens from the scanner's output
Scanning
Perspective
Purpose
Transform a stream of symbols
Into a stream of tokens
Lexical Analyzer
Responsibilities
Lexical analyzer
[Scanner]
Scan input
Remove white space
Remove comments
Manufacture tokens
Generate lexical
errors
Pass token to
parser
The Role of a Lexical Analyzer
[Figure: the source program (read entirely into memory) is scanned by the
lexical analyzer, which reads characters and can put a character back; on each
"get next token" request from the parser it passes a token and attribute
value; both the lexical analyzer and the parser consult the symbol table
(e.g. for id entries).]
Lexical Analysis
What do we want to do? Example:
if (i == j)
    z = 0;
else
    z = 1;
The input is just a string of characters:
\t if (i == j) \n \t \t z = 0; \n \t else \n \t \t z = 1;
Goal: Partition input string into substrings
Where the substrings are tokens
Compiler Construction
What's a Token?
A syntactic category
In English:
noun, verb, adjective, ...
In a programming language:
Identifier, Integer, Keyword,
Whitespace, ...
What are Tokens For?
Classify program substrings
according to role
Output of lexical analysis is a stream
of tokens . . .which is input to the
parser
Parser relies on token distinctions
An identifier is treated differently than a
keyword
Tokens
Tokens correspond to sets of strings.
Identifier: strings of letters or digits,
starting with a letter
Integer: a non-empty string of digits
Keyword: else or if or begin or ...
Whitespace: a non-empty sequence of
blanks, newlines, and tabs
Typical Tokens in a PL
Symbols: +, -, *, /, =, <, >, ->, ...
Keywords: if, while, struct, float, int, ...
Integer and Real (floating point)
literals
123, 123.45
Char (string) literals
Identifiers
Comments
White space
Terminology
Token
A classification for a common set of strings
Examples: Identifier, Integer, Float, Assign, LeftParen,
RightParen,....
Pattern
The rules that characterize the set of strings for a token
A pattern is a description of the form that the lexemes of
a token may take.
Examples: [0-9]+
Lexeme
Actual sequence of characters that matches a pattern and
has a given Token class.
A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that
token.
Examples:
Identifier: Name, Data, x
Integer: 345, 2, 0, 629, ...
Lexical Errors
Error handling is very localized with respect to the input source
Example:
fi(a==f(x))
generates no lexical error in C
In what situations do errors occur?
Prefix of remaining input doesn't match any defined
token
Possible error recovery actions:
Deleting or Inserting Input Characters
Replacing or Transposing Characters
Or, skip over to next separator to ignore problem
Basic Scanning technique
Use 1 character of look-ahead
Obtain char with getc()
Do a case analysis
Based on lookahead char
Based on current lexeme
Outcome
If char can extend lexeme, all is well, go on.
If char cannot extend lexeme:
Figure out what the complete lexeme is and
return its token
Put the lookahead back into the symbol stream
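The case analysis above can be sketched in C. This is a minimal illustration, not a reference scanner: the names `next_token`, `TOK_NUM`, `TOK_ID`, and `TOK_EOF` are invented for the sketch. Reading from a string, the cursor simply stops at the first character that cannot extend the lexeme, which plays the role of "putting the lookahead back".

```c
#include <assert.h>
#include <ctype.h>

/* Token kinds for this sketch (hypothetical names). */
enum { TOK_NUM, TOK_ID, TOK_EOF };

/* Scan one token starting at *pos in src; advance *pos past the lexeme.
 * The character that cannot extend the lexeme is left unconsumed,
 * which is the "put the lookahead back" step. */
static int next_token(const char *src, int *pos) {
    while (isspace((unsigned char)src[*pos]))   /* skip white space */
        (*pos)++;
    char c = src[*pos];
    if (c == '\0') return TOK_EOF;
    if (isdigit((unsigned char)c)) {            /* Integer: digit+ */
        while (isdigit((unsigned char)src[*pos])) (*pos)++;
        return TOK_NUM;
    }
    if (isalpha((unsigned char)c)) {            /* Identifier: letter(letter|digit)* */
        while (isalnum((unsigned char)src[*pos])) (*pos)++;
        return TOK_ID;
    }
    (*pos)++;                                   /* unknown character: skip it */
    return TOK_EOF;
}
```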
Token Attribute
E = C1 ** 10
Token | Attribute
------+-----------------------------------
ID    | index to symbol-table entry for E
=     | -
ID    | index to symbol-table entry for C1
**    | -
NUM   | 10
Lexical Errors
It is hard for a lexical analyzer to tell,
without the aid of other components,
that there is a source-code error.
E.g., fi ( a == f(x)) ...
Is fi a misspelling of the keyword if, or an identifier?
Since fi is a valid lexeme for the token id, the lexical
analyzer returns it as an identifier.
Suppose a situation in which none of
the patterns for tokens matches a
prefix of the remaining input.
E.g. $%#if a>0 a+=1;
The simplest recovery strategy is
panic mode recovery.
Delete successive characters from the
remaining input until the lexical analyzer
can find a well-formed token.
This technique may occasionally
confuse the parser, but in an
interactive computing environment
it may be quite adequate.
Other possible error-recovery actions
Delete one extraneous character
from the remaining input.
Insert a missing character into the
remaining input.
Replace a character by another
character.
Transpose two adjacent characters.
Lexical Error and Recovery
Error detection
Error reporting
Error recovery
Delete the current character and
restart scanning at the next character
Delete the first character read by the
scanner and resume scanning at the
character following it.
How about runaway strings and
comments?
Specification of Tokens
Regular expressions are an important notation
for specifying lexeme patterns. While they
cannot express all possible patterns, they are
very effective in specifying those types of
patterns that we actually need for tokens.
Input Buffering
The LA reads the source program character by character from secondary storage. Reading the input character by character is
costly.
Therefore a block of data is first read into a buffer and then scanned by the LA.
Input buffering is used in order to move back and forth in the input stream efficiently.
There are various methods by which buffering may be done:
One Buffer Scheme
Two Buffer Scheme
One Buffer Scheme: If a lexeme crosses the boundary of the buffer, the buffer has to be refilled in order to scan the
rest of the lexeme, and the first part of the lexeme is overwritten in the process.
Buffer Pairs
Two buffers of the same size, say 4096, are alternately
reloaded.
Two pointers to the input are maintained:
Pointer lexeme_Begin marks the beginning of the
current lexeme.
The forward pointer scans ahead until a pattern
match is found.
The string of characters between the two pointers is the
current lexeme. Initially both pointers point to the
beginning of the lexeme.
The forward pointer (fp) then scans ahead until a match for
a pattern is found. When the end of the lexeme is
identified, fp is set to the character at its right end, and
the token and attribute corresponding to this
lexeme are returned.
After that, both pointers lexeme_begin (lb) and fp are set
to the character immediately past the lexeme, i.e. the
beginning of the next token.
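The two-pointer discipline can be shown in miniature. The sketch below (`current_lexeme` is an invented name) ignores buffer reloading and only illustrates how the text between lexeme_begin and the position reached by the forward scan is the current lexeme.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Extract the identifier lexeme delimited by lexeme_begin and forward.
 * Returns its length and copies it into out. Illustrative only: a real
 * scanner would also wrap the pointers across two buffer halves. */
static int current_lexeme(const char *buf, int lexeme_begin, char *out) {
    int forward = lexeme_begin;
    while (isalnum((unsigned char)buf[forward]))  /* forward scans ahead */
        forward++;
    int len = forward - lexeme_begin;             /* lexeme = text between pointers */
    memcpy(out, buf + lexeme_begin, len);
    out[len] = '\0';
    return len;
}
```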
Buffering
Input Buffering:
Some efficiency issues concerned with the buffering of input.
A two-buffer input scheme that is useful when lookahead on the input is necessary to
identify tokens.
Techniques for speeding up the lexical analyser, such as the use of sentinels to mark the
buffer end.
There are three general approaches to the implementation of a lexical analyser:
1. Use a lexical-analyser generator, such as the Lex compiler, to produce the lexical analyser from a
regular-expression-based specification. In this case, the generator provides routines for reading and
buffering the input.
2. Write the lexical analyser in a conventional systems-programming language, using the I/O
facilities of that language to read the input.
3. Write the lexical analyser in assembly language and explicitly manage the reading of input.
Buffer pairs:
Because a large amount of time can be consumed moving characters, specialized
buffering techniques have been developed to reduce the amount of overhead required to
process an input character.
The scheme to be discussed:
Consists of a buffer divided into two N-character halves.
N = number of characters in one disk block, e.g., 1024 or 4096.
Read N characters into each half of the buffer with one system read command.
If fewer than N characters remain in the input, then eof is read into the buffer
after the input characters.
Two pointers to the input buffer are maintained.
The string of characters between two pointers is the current lexeme.
Initially both pointers point to the first character of the next lexeme to be
found.
The forward pointer scans ahead until a match for a pattern is found.
Once the next lexeme is determined, the forward pointer is set to the character at its
right end.
If the forward pointer is about to move past the halfway mark, the right half is
filled with N new input characters.
If the forward pointer is about to move past the right end of the buffer, the left
half is filled with N new characters and the forward pointer wraps around to the
beginning of the buffer.
Disadvantage of this scheme:
This scheme works well most of the time, but the amount of lookahead is
limited.
This limited lookahead may make it impossible to recognize tokens in
situations where the distance that the forward pointer must travel is more than the
length of the buffer.
For example: DECLARE ( ARG1, ARG2, ..., ARGn ) in a PL/1 program:
we cannot determine whether DECLARE is a keyword or an array name until we see
the character that follows the right parenthesis.
Whenever we advance fp we must check whether
we have reached the end of one buffer or not as in
that case we have to refill the other buffer before
advancing fp. So we have to make three tests.
1. To check whether the first buffer is full or not.
2. To check whether the second buffer is full or not.
3. To check the end of the input.
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else if forward = eof then
    terminate scanning
else
    forward := forward + 1
Sentinels
We can reduce these three tests to one by
introducing a special character, called a
sentinel, at the end. Let us choose eof as
the sentinel at the end of each buffer.
[Buffer layout for the input E = M * C * * 2, with an eof sentinel at the end
of each half and a final eof marking the true end of input:
| E | = | M | * | eof | C | * | * | 2 | eof | ... | eof |]
Sentinels
In the previous scheme, each time we move the forward pointer we must check that
we have not moved off one half of the buffer. If we have, then we must reload
the other half.
Therefore the ends of the buffer halves require two tests for each advance of the
forward pointer.
We can reduce the two tests to one if we extend each buffer half to hold a
sentinel character at the end.
The sentinel is a special character that cannot be part of the source program
(the eof character is used as the sentinel).
With this scheme, most of the time the scanner performs only one test, to see
whether forward points to an eof.
Only when it reaches the end of a buffer half or the real end of input does it
perform more tests.
Since N input characters are encountered between eofs, the average number of
tests per input character is very close to 1.
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else  (* eof within a buffer marks the end of input *)
        terminate lexical analysis
end
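A C rendering of the sentinel scheme may help. Everything here is illustrative: the names (`reload`, `advance`), the half size N (unrealistically small so the sketch is easy to trace), and the choice of NUL as a stand-in for the eof sentinel are all assumptions of the sketch, which therefore requires that the source text contain no NUL bytes.

```c
#include <assert.h>
#include <string.h>

#define N 4            /* half-buffer size; tiny for illustration only */
#define SENTINEL '\0'  /* this sketch uses NUL as the eof sentinel */

static char buf[2 * (N + 1)];   /* two halves, each followed by a sentinel slot */
static const char *input;       /* remaining source text */
static int forward;             /* index of the forward pointer in buf */

/* Fill the half starting at `start` with up to N input characters,
 * then place the sentinel after them. */
static void reload(int start) {
    int n = strlen(input) < N ? (int)strlen(input) : N;
    memcpy(buf + start, input, n);
    input += n;
    buf[start + n] = SENTINEL;
}

/* Advance forward and return the character it now points at. Most characters
 * need only the single sentinel comparison; only on a sentinel do we perform
 * the extra end-of-half tests. */
static char advance(void) {
    forward++;
    if (buf[forward] == SENTINEL) {
        if (forward == N) {                 /* end of first half */
            reload(N + 1);
            forward = N + 1;
        } else if (forward == 2 * N + 1) {  /* end of second half */
            reload(0);
            forward = 0;
        }
        /* otherwise: sentinel inside a half = real end of input */
    }
    return buf[forward];
}
```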
A very long lexeme is still a problem in the two-buffer
scheme, e.g.
DECLARE(ARG1, ARG2, ..., ARGn)
(Similarly, in C++, where one function name may represent
several overloaded functions, such lookahead problems can
arise.)
3.3 Specification of Tokens
Regular expressions are an important notation for
specifying token patterns.
Study formal notations for regular expressions.
These expressions are used in lexical-analyzer generators.
Sec. 3.7 shows how to build the lexical analyzer by
converting regular expressions to automata.
1 Regular Definition of Tokens
Defined in regular expression
e.g. an identifier can be defined by the regular
grammar
id → letter (letter | digit)*
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
An identifier can also be expressed by the following
regular expression:
(A|B|...|Z|a|b|...|z)(A|B|...|Z|a|b|...|z|0|1|2|...|9)*
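As a quick check of the definition, the sketch below (`is_identifier` is an invented name) tests whether a whole string matches letter(letter|digit)* directly in C. Note that, matching the definition above, underscore is not a letter here.

```c
#include <assert.h>
#include <ctype.h>

/* Does s match letter(letter|digit)* ? A direct encoding of the
 * regular definition above. */
static int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0])) return 0;  /* must start with a letter */
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i])) return 0;
    return 1;
}
```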
Regular expressions are an
important notation for specifying
patterns. Each pattern matches a
set of strings, so regular
expressions will serve as names
for sets of strings.
2 Regular Expression & Regular
language
Regular Expression
A notation that allows us to define a pattern in
a high level language.
Regular language
Each regular expression r denotes a language
L(r) (the set of sentences relating to the regular
expression r)
Each token in a program can be expressed in a regular expression
3 The construction rules of regular
expressions over an alphabet Σ
1) ε is a regular expression that denotes
{ε}
ε is a regular expression;
{ε} is the related regular language
2) If a is a symbol in Σ, then a is a regular
expression that denotes {a}
a is a regular expression;
{a} is the related regular language
3) Suppose α and β are regular
expressions; then α|β, αβ, (α), α*, β* are
also regular expressions
L(α|β) = L(α) ∪ L(β)
L(αβ) = L(α)L(β)
L((α)) = L(α)
L(α*) = {ε} ∪ L(α) ∪ L(α)L(α) ∪ L(α)L(α)L(α) ∪ ...
4 Algebraic laws of regular
expressions
1) α|β = β|α
2) α|(β|γ) = (α|β)|γ    (αβ)γ = α(βγ)
3) α(β|γ) = αβ | αγ
(α|β)γ = αγ | βγ
4) εα = αε = α
5) (α*)* = α*
6) α* = αα* | ε
αα* = α*α
7) (α|β)* = (α*|β*)* = (α*β*)*
8) If ε ∉ L(α), then
X = αX | β
has the unique solution X = α*β, and
X = Xα | β
has the unique solution X = βα*
Notes: We assume that the precedence
of * is the highest, the precedence of |
is the lowest, concatenation lies in
between, and all are left associative
Example: unsigned numbers such as
5280, 39.37, 6.336E4, 1.894E-4
digit → 0 | 1 | ... | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E (+|-|ε) digits ) | ε
num → digits optional_fraction
optional_exponent
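The regular definition of num can be hand-translated into a straight-line C scan. This sketch (`is_num` is an invented name) checks digits, then the optional fraction, then the optional exponent, in that order:

```c
#include <assert.h>
#include <ctype.h>

/* Does s match digits (. digits)? (E (+|-)? digits)? -- the regular
 * definition of num above, encoded as a straight-line scan. */
static int is_num(const char *s) {
    int i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;   /* digits */
    while (isdigit((unsigned char)s[i])) i++;
    if (s[i] == '.') {                             /* optional_fraction */
        i++;
        if (!isdigit((unsigned char)s[i])) return 0;
        while (isdigit((unsigned char)s[i])) i++;
    }
    if (s[i] == 'E') {                             /* optional_exponent */
        i++;
        if (s[i] == '+' || s[i] == '-') i++;
        if (!isdigit((unsigned char)s[i])) return 0;
        while (isdigit((unsigned char)s[i])) i++;
    }
    return s[i] == '\0';                           /* whole string must match */
}
```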
5 Notational Short-hands
a) One or more instances
(r)+      e.g. digit+
b) Zero or one instance
r? is a shorthand for r|ε
e.g. optional_exponent can be written ( E (+|-)? digits )?
c) Character classes
[a-z] denotes a|b|c|...|z
similarly [A-Za-z], [A-Za-z0-9]
3.4 Recognition of Tokens
1 The task of recognition of tokens in a
lexical analyzer:
Isolate the lexeme for the next token
in the input buffer
Produce as output a pair consisting of
the appropriate token and attribute value, such as <id, pointer to table
entry>, using the translation table
given in the figure on the next page
Regular expression | Token | Attribute value
-------------------+-------+------------------------
if                 | if    | -
id                 | id    | pointer to table entry
<                  | relop | LT
2 Methods for recognition of tokens
Use Transition Diagram
3 Transition Diagram (stylized
flowchart)
Depicts the actions that take place
when a lexical analyzer is called by the
parser to get the next token
Fragment of the diagram for > and >= :
start → (0) --'>'--> (6)
(6) --'='--> (7)      return(relop, GE)
(6) --other--> (8)*   return(relop, GT)
Notes: Here we use * to indicate states on which input
retraction must take place
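The starred state can be made concrete with a tiny C sketch (`relop_after_gt` and the `retract` flag are invented for illustration): after seeing '>', we examine one character of lookahead; if it is not '=', that character belongs to the next token and must be retracted.

```c
#include <assert.h>

enum { GT, GE };  /* attribute values for relop in this sketch */

/* Simulate the diagram fragment above: after '>' has been read, look at one
 * more character; '=' accepts >= (GE), anything else accepts > (GT) and the
 * extra character must be pushed back (the starred state). */
static int relop_after_gt(char lookahead, int *retract) {
    if (lookahead == '=') { *retract = 0; return GE; }
    *retract = 1;            /* starred state: retract the lookahead */
    return GT;
}
```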
4 Implementing a Transition Diagram
Each state gets a segment of code
If there are edges leaving a state, then
its code reads a character and selects
an edge to follow, if possible
Use nextchar() to read next character
from the input buffer
while (1) {
    switch (state) {
    case 0: c = nextchar();
        if (c == ' ' || c == '\t' || c == '\n') {
            state = 0; lexeme_beginning++;
        }
        else if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else state = fail();
        break;
    case 9: c = nextchar();
        if (isletter(c)) state = 10;
        else state = fail();
        break;
    }
}
5 A generalized transition diagram:
Finite Automaton
Deterministic or non-deterministic FA
Non-deterministic means that more
than one transition out of a state may
be possible on the same input
symbol
6 The model of recognition of
tokens
[Figure: the input buffer (e.g. holding "d 2 ="), with the lexeme_beginning
pointer marking the start of the current lexeme, feeding an FA simulator.]
e.g. The FA simulator for identifiers is:
(1) --letter--> (2)    (2) --letter--> (2)    (2) --digit--> (2)
which represents the rule:
identifier = letter (letter | digit)*
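A table-driven simulator for this automaton might look as follows in C. The names `id_step` and `id_accepts` and the state numbering are choices made for the sketch: the slide's states 1 and 2 map onto states 0 (start) and 1 (accepting) here, with -1 as a dead state.

```c
#include <assert.h>
#include <ctype.h>

/* One step of the identifier automaton.
 * States: 0 = start, 1 = in identifier (accepting), -1 = dead. */
static int id_step(int state, char c) {
    /* character classes: 0 = letter, 1 = digit, 2 = other */
    int cls = isalpha((unsigned char)c) ? 0
            : isdigit((unsigned char)c) ? 1 : 2;
    static const int delta[2][3] = {
        /* letter digit other */
        {  1,    -1,   -1 },   /* state 0: start */
        {  1,     1,   -1 },   /* state 1: accepting */
    };
    return state < 0 ? -1 : delta[state][cls];
}

/* Run the simulator over the whole string; accept iff we end in state 1. */
static int id_accepts(const char *s) {
    int state = 0;
    for (int i = 0; s[i] != '\0'; i++)
        state = id_step(state, s[i]);
    return state == 1;
}
```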