Unit 2
LEXICAL ANALYSIS
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition
• Reviewing Finite Automata Concepts
• Non-Deterministic and Deterministic FA
• Conversion Process
• Regular Expressions to NFA
• NFA to DFA
• Relating NFAs/DFAs/Conversion to Lexical Analysis
NEED AND ROLE OF LEXICAL ANALYZER
1. Lexical analysis is the first phase of a compiler. It reads the input characters from left to
right, one character at a time, from the source program.
2. It generates a sequence of tokens, one for each lexeme. Each token is a logically cohesive unit
such as an identifier, keyword, operator, or punctuation mark.
3. It enters each lexeme into the symbol table and also reads from the symbol table.
The lexical analyzer acts as a subroutine that is called by the parser whenever it needs a new
token.
Why Separate Lexical and Syntactic Analysis?
There are several reasons for separating the analysis phase of compilation into lexical analysis
and parsing:
» It leads to a simpler parser design, since unnecessary tokens (such as white space and
comments) can be eliminated by the scanner.
» Compiler efficiency is improved. Lexical analysis is often the most time-consuming phase of
compilation, and specialized buffering techniques can speed it up.
» Compiler portability is enhanced, since the specialized symbols and characters (language-
and machine-specific) are isolated in this phase.
Input Buffering
Without buffering, the lexical analyzer would have to visit secondary memory for each character,
which takes a long time and is expensive. As a result, the input is buffered before being examined
by the lexical analyzer.
The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the
input scanned so far. Initially both pointers point to the first character of the input string.
The forward pointer moves ahead in search of the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme.
For example, when scanning "int", as soon as fp encounters a blank space the lexeme "int" is
identified. When fp encounters white space, it skips over it and moves ahead; then both the begin
pointer (bp) and the forward pointer (fp) are set to the start of the next token.
The input characters are thus read from secondary storage, but reading one character at a time
from secondary storage is costly; hence a buffering technique is used. A block of data is first
read into a buffer and then scanned by the lexical analyzer, as sketched below.
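The following is a minimal sketch of the two-pointer scan in C, assuming the whole input already
sits in a single in-memory buffer and that lexemes are separated by single blanks (the buffer
contents and variable names are illustrative):

    #include <stdio.h>

    /* Sketch: scan blank-separated lexemes using bp and fp. */
    int main(void) {
        char buffer[] = "int x = 10 ;";
        char *bp = buffer;               /* begin pointer: start of lexeme */
        char *fp = buffer;               /* forward pointer: scans ahead   */

        while (*bp != '\0') {
            while (*fp != ' ' && *fp != '\0')
                fp++;                    /* advance fp to end of lexeme    */
            printf("lexeme: %.*s\n", (int)(fp - bp), bp);
            while (*fp == ' ')
                fp++;                    /* skip white space               */
            bp = fp;                     /* both pointers at next token    */
        }
        return 0;
    }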
There are two methods used in this context: the one-buffer scheme and the two-buffer scheme. These
are explained as follows.
1. One-buffer scheme: In this scheme, only one buffer is used to store the input string. The
problem is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two-buffer scheme: To overcome the problem of the one-buffer scheme, this method uses two
buffers to store the input string. The two buffers are scanned alternately; when the end of the
current buffer is reached, the other buffer is filled.
The only remaining problem is that if the length of a lexeme exceeds the length of a buffer, the
input cannot be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp moves to the
right in search of the end of the lexeme. As soon as a blank character is recognized, the string
between bp and fp is identified as the corresponding token. To mark the boundary of the first
buffer, an end-of-buffer (eof) character is placed at its end; similarly, the end of the second
buffer is marked by an eof character at its end. When fp encounters the first eof, the end of the
first buffer is recognized and filling of the second buffer begins. In the same way, when the
second eof is reached, it indicates the end of the second buffer. The two buffers are filled
alternately until the end of the input program is reached, and the stream of tokens is identified.
The eof character introduced at the end of each buffer is called a sentinel and is used to
identify the end of the buffer, as sketched below.
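Below is a C sketch of the sentinel test, in the spirit of the classic textbook scheme. The buffer
size, the fill_buffer helper, and the choice of '\0' as the sentinel are assumptions for
illustration; a real scanner must also pre-fill the first buffer and handle a final partial block:

    #define BUF_SIZE 4096
    #define EOF_MARK '\0'   /* sentinel; assumes '\0' never occurs in source */

    /* Two buffer halves back to back; each half ends with a sentinel slot. */
    static char buf[2 * (BUF_SIZE + 1)];
    static char *forward = buf;

    void fill_buffer(int half);   /* assumed helper: reads a block into half 0 or 1 */

    /* Return the next input character, reloading a buffer half on sentinels. */
    char next_char(void) {
        char c = *forward++;
        if (c != EOF_MARK)
            return c;                               /* common case: one test  */
        if (forward == buf + BUF_SIZE + 1) {        /* end of first half      */
            fill_buffer(1);                         /* load second half       */
            return *forward++;
        }
        if (forward == buf + 2 * (BUF_SIZE + 1)) {  /* end of second half     */
            fill_buffer(0);                         /* reload first half      */
            forward = buf;
            return *forward++;
        }
        return EOF_MARK;  /* sentinel inside a half: real end of input */
    }

The point of the sentinel is that the common case costs a single comparison per character; the
more expensive boundary tests run only when a sentinel is actually seen.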
Lexical Analyzer in Perspective
NEED AND ROLE OF LEXICAL ANALYZER
• Read input characters from the source program
• Macro expansion
• Remove comments and white space (scanning)
• Group characters into lexemes
• Produce as output a sequence of tokens
• Interact with the symbol table
• Correlate error messages generated by the compiler with the source program
Basic Terminology
Token classes
• One token per keyword
• Tokens for the operators
• One token representing all identifiers
• Tokens representing constants (e.g. numbers)
• Tokens for punctuation symbols
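As an illustration, such token classes are often represented in a scanner as an enumerated type;
the names below are invented for this sketch, not taken from any particular compiler:

    /* Illustrative token classes for a small language. */
    enum token_class {
        TOK_IF, TOK_ELSE, TOK_WHILE,      /* one token per keyword         */
        TOK_PLUS, TOK_ASSIGN, TOK_LT,     /* tokens for the operators      */
        TOK_ID,                           /* one token for all identifiers */
        TOK_NUM,                          /* constants such as numbers     */
        TOK_LPAREN, TOK_RPAREN, TOK_SEMI  /* punctuation symbols           */
    };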
Example
Example of TOKENS
Example of NON TOKENS
Tasks of the Lexical Analyzer
» Stripping out unnecessary white space from the source code.
» Keeping track of line numbers while scanning newline characters; these line numbers are used by
the error handler to print error messages.
» Preprocessing of macros
Identify the tokens and lexemes:
1. x=x*(acc+123)
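One plausible tokenization (the token names are illustrative):

    Lexeme   Token
    x        identifier
    =        assignment operator
    x        identifier
    *        multiplication operator
    (        left parenthesis
    acc      identifier
    +        addition operator
    123      number (constant)
    )        right parenthesis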
Implementation: Transition Diagrams
• An intermediate step in constructing the lexical analyser.
• Patterns are converted into flowcharts called transition diagrams.
  – Nodes (circles) are called states.
  – Edges are directed from one state to another and labelled by symbols.
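As a sketch, the transition diagram for identifiers (a letter followed by letters or digits) can
be simulated directly in C; the function name and return convention are illustrative:

    #include <ctype.h>

    /* Simulate the identifier transition diagram:
     * state 0 --letter--> state 1 --letter/digit--> state 1 --other--> accept.
     * Returns the length of the identifier found at s, or 0 if none. */
    int match_identifier(const char *s) {
        int state = 0, i = 0;
        for (;;) {
            char c = s[i];
            switch (state) {
            case 0:                        /* start state */
                if (isalpha((unsigned char)c)) { state = 1; i++; }
                else return 0;             /* no identifier begins here */
                break;
            case 1:                        /* inside identifier */
                if (isalnum((unsigned char)c)) i++;
                else return i;             /* accept; retract on 'other' */
                break;
            }
        }
    }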
Handling Lexical Errors
1. It is hard for the lexical analyzer to tell, without the aid of other components, that there is
a source-code error.
-- If the string fi is encountered for the first time in a C program, the lexical analyzer cannot
tell whether fi is a misspelling of the keyword if or an undeclared identifier.
-- Probably the parser will be able to handle this case.
In what situations do errors occur?
The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix
of the remaining input.
Panic mode recovery: delete successive characters from the remaining input until a valid token is
found.
Other possible error-recovery transformations:
• Insert a missing character
• Delete a character
• Replace a character by another
• Transpose two adjacent characters
• These transformations implement minimum-distance error correction: the minimum number of
corrections needed to convert an invalid lexeme into a valid one.
• It is not generally used in practice because it is too costly to implement.
A sketch of panic-mode recovery follows.
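A minimal sketch of panic-mode recovery in C, assuming a hypothetical can_start_token predicate
supplied elsewhere:

    /* Assumed for this sketch: true if c can begin some token. */
    int can_start_token(char c);

    /* Panic mode: discard characters until one can start a token. */
    const char *panic_recover(const char *input) {
        while (*input != '\0' && !can_start_token(*input))
            input++;   /* delete successive characters */
        return input;  /* resume scanning here */
    }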
Tokens Specification
We need a formal way to specify patterns: regular expressions
• Alphabet: any finite set of symbols
• String over alphabet: finite sequence of symbols drawn from that alphabet
• Language: countable set of strings over some fixed alphabet
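For example, over the alphabet {0, 1}, 011 is a string of length 3, and the set of all binary
strings ending in 1 is a language over that alphabet.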
Tokens Specification
Kleene closure
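For reference, the standard definition: for a language L, the Kleene closure L* denotes zero or
more concatenations of strings from L, that is, L* = L^0 ∪ L^1 ∪ L^2 ∪ ..., where L^0 = {ε} and
L^i = L L^(i-1). The positive closure L+ is the same union starting from L^1, so L* = L+ ∪ {ε}.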
Tokens Specification
Operations on Languages
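For reference, the standard operations: union, L ∪ M = { s | s is in L or s is in M };
concatenation, LM = { st | s is in L and t is in M }; and the closures L* and L+ defined above.
As a quick example, if L = {a, b} and M = {c}, then LM = {ac, bc}.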
Tokens Specification
Rules for specifying Regular Expressions
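For reference, the standard inductive rules: ε is a regular expression denoting {ε}; for each
symbol a in the alphabet, a is a regular expression denoting {a}; and if r and s are regular
expressions denoting L(r) and L(s), then (r)|(s) denotes L(r) ∪ L(s), (r)(s) denotes L(r)L(s), and
(r)* denotes (L(r))*. By convention, * binds tightest, then concatenation, then |, so many
parentheses can be dropped.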
Tokens Specification
Regular Expressions
Tokens Specification
Regular Expressions example
Tokens Specification
Algebraic laws for regular expressions
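Among the standard laws: r|s = s|r and r|(s|t) = (r|s)|t (| is commutative and associative);
r(st) = (rs)t (concatenation is associative); r(s|t) = rs|rt and (s|t)r = sr|tr (concatenation
distributes over |); εr = rε = r (ε is the identity for concatenation); r* = (r|ε)*; and r** = r*.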
Tokens Specification
Regular Definition
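For reference, a regular definition is a sequence of definitions of the form d1 → r1, d2 → r2,
..., dn → rn, where each di is a new name and each ri is a regular expression over the alphabet
together with the previously defined names d1, ..., di−1. For example: letter → a | b | ... | z,
and then word → letter letter*.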
Tokens Specification
Describe the languages denoted by the following regular expression:
a(a|b)*a
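Answer: all strings of a's and b's, of length at least two, that begin and end with a (for
example aa, aba, abba).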
Tokens Specification
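For reference, the standard definitions: a prefix of a string s is obtained by deleting zero or
more symbols from the end of s; a suffix by deleting zero or more symbols from the beginning; a
substring by deleting a prefix and a suffix; and a subsequence by deleting zero or more symbols
that need not be consecutive. The proper prefixes, suffixes, substrings, and subsequences of s are
those that are neither ε nor s itself.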
Tokens Specification
In a string of length n, how many prefixes are there?
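Answer: n + 1, counting both ε and the string itself.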
Tokens Specification
In a string of length n, how many suffixes are there?
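Answer: n + 1, again counting both ε and the string itself.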
Tokens Specification
In a string of length n, how many proper suffixes are there?
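Answer: n − 1, under the convention that a proper suffix is neither ε nor the string itself.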
Tokens Specification
In a string of length n, how many proper substrings are there?
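Answer: counting substrings by position (equivalently, assuming all symbols are distinct), there
are n(n+1)/2 nonempty substrings; excluding the string itself leaves n(n+1)/2 − 1 proper
substrings.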
Tokens Specification
In a string of length n, how many proper subsequences are there?
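Answer: assuming all symbols are distinct, a string of length n has 2^n subsequences (one per
subset of positions); excluding ε and the string itself leaves 2^n − 2 proper subsequences.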
Tokens Specification
Write a regular definition for the following language:
All strings of lowercase letters that contain the five vowels in order.
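One common answer, writing cons for any consonant (every lowercase letter except a, e, i, o, u):
cons* a (a|cons)* e (e|cons)* i (i|cons)* o (o|cons)* u (u|cons)*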
Tokens Specification
Write a regular definition for the following language:
All strings of lowercase letters in which the letters are in ascending lexicographic order.
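One possible answer: a* b* c* ... z* (each letter of the alphabet, in order, repeated zero or more
times).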
Tokens Specification
Write a regular definition for the following language:
All strings of a's and b's that contain an even number of a's.
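One possible answer: b* (a b* a b*)* — the a's are consumed strictly in pairs.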
Tokens Specification
Write a regular definition for the following language:
All strings of a's and b's that contain an odd number of b's.
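One possible answer: a* b a* (b a* b a*)* — one b followed by zero or more pairs of b's.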
Tokens Specification
Write a regular definition for the following language:
All strings of a's and b's that contain at least two b's.
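One possible answer: (a|b)* b (a|b)* b (a|b)*.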
Tokens Specification
Write a regular definition for the following language:
All strings of a's and b's that contain at most two b's.
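One possible answer: a* (b|ε) a* (b|ε) a* — each optional b contributes at most one b.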
Tokens Specification
Write a regular definition for the following language:
All strings of a's and b's that contain just two or three b's.
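One possible answer: a* b a* b a* (b|ε) a* — two mandatory b's plus one optional b.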
Token Recognition
Tokens are recognized using transition diagrams.
Token Recognition
Regular Definition
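One standard set of regular definitions used for token recognition, following the common textbook
treatment of identifiers and unsigned numbers:

    letter → A | B | ... | Z | a | b | ... | z
    digit  → 0 | 1 | ... | 9
    id     → letter ( letter | digit )*
    digits → digit digit*
    number → digits ( . digits )? ( E ( + | - )? digits )?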
What Else Does the Lexical Analyzer Do?
Overall