Unit2

The document discusses lexical analysis as the first phase of a compiler, detailing its role in reading input characters, generating tokens, and interacting with the symbol table. It explains the importance of separating lexical analysis from parsing for efficiency and design simplicity, and describes input buffering techniques, token specification using regular expressions, and error handling strategies. Additionally, it outlines the tasks of a lexical analyzer, including stripping unnecessary characters, recognizing tokens, and managing errors during the analysis process.

LEXICAL ANALYSIS

Dr. Banee Bandana Das


Department of CSE

1
LEXICAL ANALYSIS
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition
• Reviewing Finite Automata Concepts
• Non-Deterministic and Deterministic FA
• Conversion Process
• Regular Expressions to NFA
• NFA to DFA
• Relating NFAs/DFAs and their Conversion to Lexical Analysis
2
NEED AND ROLE OF LEXICAL ANALYZER
1. Lexical analysis is the first phase of a compiler. It reads the input characters of the source
program from left to right, one character at a time.
2. It generates a sequence of tokens, one for each lexeme. Each token is a logically cohesive unit
such as an identifier, keyword, operator, or punctuation mark.
3. It enters lexemes into the symbol table and also reads from the symbol table. The lexical
analyzer acts as a subroutine that the parser calls whenever it needs a new token.
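The parser–scanner interaction described above can be sketched as follows. This is a minimal illustration, not any real compiler's interface: `next_token` is a hypothetical helper, and the token rules cover only a toy language.

```python
import re

# Minimal illustrative token rules (not a full language specification).
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[+\-*/=]"),
    ("SKIP", r"\s+"),          # white space produces no token
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def next_token(source, pos):
    """The subroutine the parser calls whenever it needs a new token.
    Returns (token_name, lexeme, new_pos)."""
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            return m.lastgroup, m.group(), pos
    return "EOF", "", pos

tokens = []
pos = 0
while True:
    name, lexeme, pos = next_token("count = 10", pos)
    if name == "EOF":
        break
    tokens.append((name, lexeme))
```

Each call consumes exactly one lexeme, so the parser stays in control of when scanning advances.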

3
Why Separate Lexical and Syntactic Analysis?
There are several reasons for separating the analysis phase of compiling into lexical analysis
and parsing:
» It leads to a simpler parser design, since the scanner eliminates tokens the parser does not
need.
» Compiler efficiency is improved. Lexical analysis is the most time-consuming phase of
compilation, and specialized buffering techniques speed it up.
» Compiler portability is enhanced, because the specialized symbols and characters (language-
and machine-specific) are isolated in this phase.

4
Input Buffering
If the lexical analyzer fetched one character at a time from secondary storage, each access would
be slow and costly. As a result, the input string is buffered before being examined by the lexical
analyzer.

The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the
input scanned.

Initially both pointers point to the first character of the input string, as shown below.
5
6
The forward pointer (fp) moves ahead in search of the end of the lexeme. As soon as a blank
space is encountered, it indicates the end of the lexeme.

In the example above, as soon as fp encounters a blank space, the lexeme “int” is identified.
fp then skips over the white space, and both the begin pointer (bp) and the forward pointer (fp)
are set to the start of the next token.

Input characters are thus read from secondary storage, but reading one character at a time in
this way is costly; hence a buffering technique is used. A block of data is first read into a
buffer and then scanned by the lexical analyzer.
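The bp/fp two-pointer scan described above can be sketched as follows. The pointer names follow the slides, and the blank-delimited lexeme rule is the simplification used in the example:

```python
def scan_lexemes(source):
    """Move fp ahead until a blank marks the end of a lexeme,
    then set both bp and fp to the start of the next token."""
    lexemes = []
    bp = fp = 0
    while fp < len(source):
        if source[fp] == " ":
            if fp > bp:                 # a lexeme lies between bp and fp
                lexemes.append(source[bp:fp])
            fp += 1                     # ignore the blank and move ahead
            bp = fp                     # both pointers set at next token
        else:
            fp += 1
    if fp > bp:                         # final lexeme at end of input
        lexemes.append(source[bp:fp])
    return lexemes
```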
7
There are two methods used in this context: the one-buffer scheme and the two-buffer scheme.
These are explained as follows.

1. One-buffer scheme: Only one buffer is used to store the input string. The problem with this
scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the
lexeme the buffer must be refilled, which overwrites the first part of the lexeme.

8
2. Two-buffer scheme: To overcome the problem of the one-buffer scheme, two buffers are used to
store the input string. The two buffers are scanned alternately: when the end of the current
buffer is reached, the other buffer is filled. The only remaining problem is that a lexeme longer
than the length of a buffer still cannot be scanned completely.

Initially both bp and fp point to the first character of the first buffer. Then fp moves to the
right in search of the end of the lexeme. As soon as a blank character is recognized, the string
between bp and fp is identified as the corresponding token. To identify the boundary of the first
buffer, an end-of-buffer (eof) character is placed at its end; the end of the second buffer is
marked in the same way. When fp encounters the first eof, the end of the first buffer is
recognized and filling of the second buffer begins. Likewise, the second eof indicates the end of
the second buffer. The two buffers are filled alternately until the end of the input program is
reached and the stream of tokens has been identified. The eof character introduced at the end of
each buffer is called a sentinel and is used to identify the end of the buffer.
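The sentinel logic above can be sketched as follows. The buffer size and the choice of `"\0"` as the sentinel are illustrative assumptions (and the input is assumed not to contain that character); a real scanner reads blocks from a file rather than slicing a string:

```python
EOF = "\0"       # sentinel placed at the end of each buffer
BUF_SIZE = 4     # illustrative buffer length

def refill(source, start):
    """Read the next block into a buffer and append the sentinel."""
    block = source[start:start + BUF_SIZE]
    return block + EOF, start + len(block)

def read_all(source):
    """Scan character by character; a sentinel means either
    'switch buffers' or 'end of input'."""
    out = []
    buf, consumed = refill(source, 0)
    fp = 0
    while True:
        ch = buf[fp]
        if ch == EOF:
            if consumed >= len(source):   # sentinel AND input exhausted
                break
            buf, consumed = refill(source, consumed)  # fill other buffer
            fp = 0
            continue
        out.append(ch)
        fp += 1
    return "".join(out)
```

The payoff of the sentinel is visible in the loop: each character needs only one test (is it EOF?) instead of a separate end-of-buffer bounds check on every step.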

9
10
Lexical Analyzer in Perspective

LEXICAL ANALYZER
• Scan input
• Remove white spaces, comments
• Identify tokens
• Create symbol table
• Insert tokens into symbol table
• Send tokens to parser

PARSER
• Perform syntax analysis
• Actions dictated by token order
• Update symbol table entries
• Create abstract representation of source
• Generate errors
• And more… (we’ll see later)

11
NEED AND ROLE OF LEXICAL ANALYZER
Read input characters from the source program
Expand macros
Remove comments and white spaces (scanning)
Group characters into lexemes
Produce as output a sequence of tokens
Interact with the symbol table
Correlate error messages generated by the compiler with the source program

12
Basic Terminology

What are Major Terms for Lexical Analysis?


Token: a pair consisting of
– a token name: an abstract symbol representing a lexical unit [affects parsing
decisions]
– an optional attribute value [influences translation after parsing]
Token names include keywords, operators, identifiers, constants, literal strings, and punctuation
symbols (such as commas and semicolons).

Pattern: a description of the form that the different lexemes of a token may take.
– For a keyword, the pattern is just the sequence of characters that forms the keyword.

Lexeme: the actual sequence of characters in the source program that matches a pattern and is
classified by a token.
E.g. the relational-operator token relop matches { <, <=, >, >=, ==, <> }
13
Basic Terminology

14
Basic Terminology

Token classes
• One token per keyword
• Tokens for the operators
• One token representing all identifiers
• Tokens representing constants (e.g. numbers)
• Tokens for punctuation symbols

15
Example

16
Example of TOKENS

17
Example of NON TOKENS

18
Tasks of the Lexical Analyzer

Separation of the input source code into tokens.

» Stripping out the unnecessary white spaces from the source code.

» Removing the comments from the source text.

» Keeping track of line numbers while scanning the new line characters. These
line numbers are used by the error handler to print the error messages.

» Preprocessing of macros

19
Identify the tokens and lexemes:

1. x=x*(acc+123)
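For `x=x*(acc+123)`, one possible answer, sketched with Python's `re` module (the token class names are illustrative): `x` and `acc` are identifier lexemes, `123` is a number, `=`, `*`, `+` are operators, and `(`, `)` are punctuation.

```python
import re

SPEC = [
    ("id",     r"[A-Za-z_]\w*"),   # x, acc
    ("number", r"\d+"),            # 123
    ("op",     r"[=*+]"),          # =, *, +
    ("punct",  r"[()]"),           # ( and )
]
pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

pairs = [(m.lastgroup, m.group()) for m in pattern.finditer("x=x*(acc+123)")]
```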

20
Implementation: Transition Diagrams
• An intermediate step in constructing the
lexical analyzer.
• Patterns are converted into flowcharts
called transition diagrams.
– Nodes (circles) are called states.
– Edges are directed from one state to
another and labelled by symbols.
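A transition diagram can be implemented directly as code, with one branch per state and one transition per labelled edge. The sketch below encodes the usual identifier diagram, letter (letter | digit)*; the state numbering is illustrative:

```python
def is_identifier(s):
    """Simulate the transition diagram for letter (letter | digit)*.
    State 0: start; state 1: seen a leading letter (accepting)."""
    state = 0
    for ch in s:
        if state == 0:
            if ch.isalpha() or ch == "_":
                state = 1              # edge labelled "letter"
            else:
                return False           # no edge for this symbol
        elif state == 1:
            if ch.isalnum() or ch == "_":
                state = 1              # edge labelled "letter or digit"
            else:
                return False
    return state == 1                  # accept only in state 1
```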

21
Handling Lexical Errors
1. It is hard for the lexical analyzer, without the aid of other components, to tell that there
is a source-code error.
-- If the statement fi is encountered for the first time in a C program, the lexical analyzer
cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.
-- The parser will probably be able to handle this case.

In what situations do errors occur?

The lexical analyzer is unable to proceed because none of the patterns for tokens matches a
prefix of the remaining input.

Panic mode recovery

• Delete successive characters from the remaining input until the analyzer can find a
well-formed token.
• May confuse the parser, creating a syntax error.
22
Handling Lexical Errors
Panic mode recovery: delete successive characters from the remaining input until the analyzer
can find a well-formed token. This may confuse the parser, creating a syntax error.

Other possible error-recovery actions:
• Insert a missing character
• Delete a character
• Replace a character by another
• Transpose two adjacent characters

In short: deleting or inserting input characters, and replacing or transposing characters.
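Panic-mode deletion can be sketched as follows; `TOKEN` is a toy pattern standing in for the real token set:

```python
import re

TOKEN = re.compile(r"[A-Za-z]+|\d+")   # toy token patterns

def panic_mode(source, pos):
    """Delete successive characters until some token pattern
    matches a prefix of the remaining input."""
    deleted = []
    while pos < len(source) and TOKEN.match(source, pos) is None:
        deleted.append(source[pos])    # character is skipped (deleted)
        pos += 1
    return pos, "".join(deleted)

pos, dropped = panic_mode("@#if x", 0)
```

Here the stray `@#` is discarded and scanning resumes at `if`, where a token pattern matches again.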
23
Handling Lexical Errors
Minimum distance error correction

• It is the strategy generally followed by the lexical analyzer to correct the errors in the
lexemes.
• It is nothing but the minimum number of corrections to be made to convert an invalid
lexeme to a valid lexeme.
• But it is not generally used in practice because it is too costly to implement.

24
Tokens Specification
We need a formal way to specify patterns: regular expressions.
• Alphabet: any finite set of symbols
• String over an alphabet: a finite sequence of symbols drawn from that alphabet
• Language: a countable set of strings over some fixed alphabet

• Scanners are specialized pattern-matching processors.

• Regular expressions (REs) are used to represent patterns of strings.
• A regular expression r is defined by the set of strings that match it.
• The language generated by a regular expression r is denoted L(r).
• The set of symbols of the language is called its alphabet and is denoted Σ.
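For instance, with Σ = {a, b}, the regular expression r = a|b denotes the language L(r) = {a, b}. Python's `re.fullmatch` can serve as a membership test for L(r):

```python
import re

r = re.compile(r"a|b")   # L(r) = {a, b} over the alphabet {a, b}

# Keep only the candidate strings that belong to L(r).
language_sample = [s for s in ["a", "b", "ab", ""] if r.fullmatch(s)]
```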

25
Tokens Specification
Kleene closure

26
Tokens Specification
Operations on Languages

27
Tokens Specification
Rules for specifying Regular Expressions

28
Tokens Specification
Regular Expressions

29
Tokens Specification
Regular Expressions example

30
Tokens Specification
Algebraic laws for regular expressions

31
Tokens Specification
Regular Expressions example

Which language is generated by:


• (a|b)(a|b)
• a*
• (a|b)*
• a|a*b

32
Tokens Specification
Regular Definition

33
Tokens Specification
Regular Definition

34
Tokens Specification
Regular Definition

35
Tokens Specification
Describe the language denoted by the following regular expression:

a(a|b)*a
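This denotes all strings of a's and b's that begin and end with a (and therefore have length at least two). A quick membership check:

```python
import re

r = re.compile(r"a(a|b)*a")   # starts with a, ends with a, length >= 2

ok = [s for s in ["aa", "aba", "abba", "a", "ab", "ba"] if r.fullmatch(s)]
```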

36
Tokens Specification

37
Tokens Specification
How many prefixes are there in a string of length n?

38
Tokens Specification
How many suffixes are there in a string of length n?

39
Tokens Specification
How many proper suffixes are there in a string of length n?

40
Tokens Specification
How many proper substrings are there in a string of length n?

41
Tokens Specification
How many proper subsequences are there in a string of length n?
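For these counting questions: a string of length n has n + 1 prefixes and n + 1 suffixes (counting ε and the string itself). Taking "proper" to exclude both ε and the string itself (the Dragon-book convention), there are n − 1 proper suffixes; and if all n symbols are distinct, n(n+1)/2 − 1 proper substrings and 2^n − 2 proper subsequences. A brute-force check on a small example:

```python
from itertools import combinations

def counts(s):
    """Count prefixes, suffixes, and proper suffixes/substrings/subsequences
    by enumerating them (distinct strings only)."""
    n = len(s)
    prefixes   = {s[:i] for i in range(n + 1)}
    suffixes   = {s[i:] for i in range(n + 1)}
    substrings = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)} | {""}
    subseqs    = {"".join(c) for k in range(n + 1) for c in combinations(s, k)}
    proper = lambda xs: xs - {"", s}     # drop the empty string and s itself
    return (len(prefixes), len(suffixes),
            len(proper(suffixes)), len(proper(substrings)), len(proper(subseqs)))
```

For "abc" (n = 3, all symbols distinct) this gives 4 prefixes, 4 suffixes, 2 proper suffixes, 5 proper substrings, and 6 proper subsequences, matching the formulas above.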

42
43
Tokens Specification
Write a regular definition for the following language:

All strings of lowercase letters that contain the five vowels in order.
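One common solution (reading "in order" as: all vowel occurrences respect the order a, e, i, o, u), with cons standing for any consonant, is cons* a (a|cons)* e (e|cons)* i (i|cons)* o (o|cons)* u (u|cons)*. Checked here with Python's `re` as a stand-in for the formal notation:

```python
import re

cons = r"[b-df-hj-np-tv-z]"   # any lowercase consonant
vowels_in_order = re.compile(
    rf"{cons}*a(a|{cons})*e(e|{cons})*i(i|{cons})*o(o|{cons})*u(u|{cons})*"
)

def check(s):
    return vowels_in_order.fullmatch(s) is not None
```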

44
Tokens Specification
Write a regular definition for the following language:

All strings of lowercase letters in which the letters are in ascending lexicographic order.
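Since the letters must be in nondecreasing order, a regular definition is simply a* b* c* … z*. Checked with `re`:

```python
import re

# a* b* c* ... z* : each letter may repeat, but only in alphabetical order.
ascending = re.compile("".join(c + "*" for c in "abcdefghijklmnopqrstuvwxyz"))

def check(s):
    return ascending.fullmatch(s) is not None
```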

45
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain an even number of a’s.
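One possible answer is b* (a b* a b*)*: any b's, then repeated groups each containing exactly two a's. Checked with `re`:

```python
import re

even_as = re.compile(r"b*(ab*ab*)*")   # a's always come in pairs

def check(s):
    return even_as.fullmatch(s) is not None
```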

46
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain an odd number of b’s.
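One possible answer is a* b a* (b a* b a*)*: one b, followed by repeated pairs of b's. Checked with `re`:

```python
import re

odd_bs = re.compile(r"a*ba*(ba*ba*)*")   # one b plus pairs of extra b's

def check(s):
    return odd_bs.fullmatch(s) is not None
```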

47
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain at least two b’s.
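One possible answer is a* b a* b (a|b)*: once two b's have been seen, anything may follow. Checked with `re`:

```python
import re

at_least_two_bs = re.compile(r"a*ba*b(a|b)*")

def check(s):
    return at_least_two_bs.fullmatch(s) is not None
```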

48
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain at most two b’s.
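One possible answer is a* (b|ε) a* (b|ε) a*, written in `re` syntax with `?` for the optional b's:

```python
import re

at_most_two_bs = re.compile(r"a*b?a*b?a*")   # zero, one, or two b's

def check(s):
    return at_most_two_bs.fullmatch(s) is not None
```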

49
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain exactly two or three b’s.
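One possible answer is a* b a* b a* (b a*)?: two mandatory b's plus an optional third. Checked with `re`:

```python
import re

two_or_three_bs = re.compile(r"a*ba*ba*(ba*)?")

def check(s):
    return two_or_three_bs.fullmatch(s) is not None
```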

50
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


1. Recognition of Identifiers.
2. Recognition of Delimiters.
3. Recognition of Relational Operators.
4. Recognition of Keywords.
5. Recognition of numbers.

51
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


1.Recognition of Identifiers.

52
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


2. Recognition of Delimiters.

53
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


3. Recognition of Relational Operators.

54
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


4. Recognition of Keywords.

55
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


5. Recognition of numbers.

56
Token Recognition
Regular Definition

57
Token Recognition

58
What Else Does Lexical Analyzer Do?

Scan away blanks, new lines, and tabs.


Can we define tokens for these?
blank → blank
tab → tab
newline → newline
delim → blank | tab | newline
ws → delim+
In these cases no token is returned to the parser.
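The ws → delim+ rule above can be checked directly: when the scanner matches ws, it simply advances past it without returning a token to the parser.

```python
import re

delim = r"[ \t\n]"             # delim -> blank | tab | newline
ws = re.compile(delim + "+")   # ws -> delim+

def skip_ws(source, pos):
    """Advance past white space; no token is returned to the parser."""
    m = ws.match(source, pos)
    return m.end() if m else pos
```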

59
Overall

60