Unit2

The document discusses lexical analysis as the first phase of a compiler, detailing its role in reading input characters, generating tokens, and interacting with the symbol table. It explains the importance of separating lexical analysis from parsing for efficiency and design simplicity, and describes input buffering techniques, token specification using regular expressions, and error handling strategies. Additionally, it outlines the tasks of a lexical analyzer, including stripping unnecessary characters, recognizing tokens, and managing errors during the analysis process.

LEXICAL ANALYSIS

Dr. Banee Bandana Das


Department of CSE

1
LEXICAL ANALYSIS
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition
• Reviewing Finite Automata Concepts
• Non-Deterministic and Deterministic FA
• Conversion Process
• Regular Expressions to NFA
• NFA to DFA
• Relating NFAs/DFAs and their Conversion to Lexical Analysis
2
NEED AND ROLE OF LEXICAL ANALYZER
1. Lexical analysis is the first phase of a compiler. It reads the input characters of the source
program from left to right, one character at a time.
2. It generates a sequence of tokens, one for each lexeme. Each token is a logically cohesive unit
such as an identifier, keyword, operator, or punctuation mark.
3. It enters lexemes into the symbol table and also reads from the symbol table. The lexical
analyzer acts as a subroutine that the parser calls whenever it needs a new token.
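The parser–scanner interaction described above can be sketched as follows. This is a minimal illustration, not any real compiler's interface: `next_token` is a hypothetical helper, and the token rules cover only a toy language.

```python
import re

# Minimal illustrative token rules (not a full language specification).
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[+\-*/=]"),
    ("SKIP", r"\s+"),          # white space produces no token
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def next_token(source, pos):
    """The subroutine the parser calls whenever it needs a new token.
    Returns (token_name, lexeme, new_pos)."""
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            return m.lastgroup, m.group(), pos
    return "EOF", "", pos

tokens = []
pos = 0
while True:
    name, lexeme, pos = next_token("count = 10", pos)
    if name == "EOF":
        break
    tokens.append((name, lexeme))
```

Each call consumes exactly one lexeme, so the parser stays in control of when scanning advances.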

3
Why Separate Lexical and Syntactic Analysis?
There are several reasons for separating the analysis phase of compiling into lexical analysis
and parsing:
» It leads to a simpler parser design, since the scanner eliminates tokens the parser does not
need.
» Compiler efficiency is improved. Lexical analysis is the most time-consuming phase of
compilation, and specialized buffering techniques speed it up.
» Compiler portability is enhanced, because the specialized symbols and characters (language-
and machine-specific) are isolated in this phase.

4
Input Buffering
If the lexical analyzer fetched one character at a time from secondary storage, each access would
be slow and costly. As a result, the input string is buffered before being examined by the lexical
analyzer.

The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the
input scanned.

Initially both pointers point to the first character of the input string, as shown below.
5
6
The forward pointer (fp) moves ahead in search of the end of the lexeme. As soon as a blank
space is encountered, it indicates the end of the lexeme.

In the example above, as soon as fp encounters a blank space, the lexeme “int” is identified.
fp then skips over the white space, and both the begin pointer (bp) and the forward pointer (fp)
are set to the start of the next token.

Input characters are thus read from secondary storage, but reading one character at a time in
this way is costly; hence a buffering technique is used. A block of data is first read into a
buffer and then scanned by the lexical analyzer.
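The bp/fp two-pointer scan described above can be sketched as follows. The pointer names follow the slides, and the blank-delimited lexeme rule is the simplification used in the example:

```python
def scan_lexemes(source):
    """Move fp ahead until a blank marks the end of a lexeme,
    then set both bp and fp to the start of the next token."""
    lexemes = []
    bp = fp = 0
    while fp < len(source):
        if source[fp] == " ":
            if fp > bp:                 # a lexeme lies between bp and fp
                lexemes.append(source[bp:fp])
            fp += 1                     # ignore the blank and move ahead
            bp = fp                     # both pointers set at next token
        else:
            fp += 1
    if fp > bp:                         # final lexeme at end of input
        lexemes.append(source[bp:fp])
    return lexemes
```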
7
There are two methods used in this context: the one-buffer scheme and the two-buffer scheme.
These are explained as follows.

1. One-buffer scheme: Only one buffer is used to store the input string. The problem with this
scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the
lexeme the buffer must be refilled, which overwrites the first part of the lexeme.

8
2. Two-buffer scheme: To overcome the problem of the one-buffer scheme, two buffers are used to
store the input string. The two buffers are scanned alternately: when the end of the current
buffer is reached, the other buffer is filled. The only remaining problem is that a lexeme longer
than the length of a buffer still cannot be scanned completely.

Initially both bp and fp point to the first character of the first buffer. Then fp moves to the
right in search of the end of the lexeme. As soon as a blank character is recognized, the string
between bp and fp is identified as the corresponding token. To identify the boundary of the first
buffer, an end-of-buffer (eof) character is placed at its end; the end of the second buffer is
marked in the same way. When fp encounters the first eof, the end of the first buffer is
recognized and filling of the second buffer begins. Likewise, the second eof indicates the end of
the second buffer. The two buffers are filled alternately until the end of the input program is
reached and the stream of tokens has been identified. The eof character introduced at the end of
each buffer is called a sentinel and is used to identify the end of the buffer.
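The sentinel logic above can be sketched as follows. The buffer size and the choice of `"\0"` as the sentinel are illustrative assumptions (and the input is assumed not to contain that character); a real scanner reads blocks from a file rather than slicing a string:

```python
EOF = "\0"       # sentinel placed at the end of each buffer
BUF_SIZE = 4     # illustrative buffer length

def refill(source, start):
    """Read the next block into a buffer and append the sentinel."""
    block = source[start:start + BUF_SIZE]
    return block + EOF, start + len(block)

def read_all(source):
    """Scan character by character; a sentinel means either
    'switch buffers' or 'end of input'."""
    out = []
    buf, consumed = refill(source, 0)
    fp = 0
    while True:
        ch = buf[fp]
        if ch == EOF:
            if consumed >= len(source):   # sentinel AND input exhausted
                break
            buf, consumed = refill(source, consumed)  # fill other buffer
            fp = 0
            continue
        out.append(ch)
        fp += 1
    return "".join(out)
```

The payoff of the sentinel is visible in the loop: each character needs only one test (is it EOF?) instead of a separate end-of-buffer bounds check on every step.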

9
10
Lexical Analyzer in Perspective

LEXICAL ANALYZER
• Scan input
• Remove white spaces, comments
• Identify tokens
• Create symbol table
• Insert tokens into symbol table
• Send tokens to parser

PARSER
• Perform syntax analysis
• Actions dictated by token order
• Update symbol table entries
• Create abstract representation of source
• Generate errors
• And more… (we’ll see later)

11
NEED AND ROLE OF LEXICAL ANALYZER
Read input characters from the source program
Expand macros
Remove comments and white spaces (scanning)
Group characters into lexemes
Produce as output a sequence of tokens
Interact with the symbol table
Correlate error messages generated by the compiler with the source program

12
Basic Terminology

What are Major Terms for Lexical Analysis?


Token: a pair consisting of
– a token name: an abstract symbol representing a lexical unit [affects parsing
decisions]
– an optional attribute value [influences translation after parsing]
Token names include keywords, operators, identifiers, constants, literal strings, and punctuation
symbols (such as commas and semicolons).

Pattern: a description of the form that the different lexemes of a token may take.
– For a keyword, the pattern is just the sequence of characters that forms the keyword.

Lexeme: the actual sequence of characters in the source program that matches a pattern and is
classified by a token.
E.g. the relational-operator token relop matches { <, <=, >, >=, ==, <> }
13
Basic Terminology

14
Basic Terminology

Token classes
• One token per keyword
• Tokens for the operators
• One token representing all identifiers
• Tokens representing constants (e.g. numbers)
• Tokens for punctuation symbols

15
Example

16
Example of TOKENS

17
Example of NON TOKENS

18
Tasks of the Lexical Analyzer

Separation of the input source code into tokens.

» Stripping out the unnecessary white spaces from the source code.

» Removing the comments from the source text.

» Keeping track of line numbers while scanning the new line characters. These
line numbers are used by the error handler to print the error messages.

» Preprocessing of macros

19
Identify the tokens and lexemes:

1. x=x*(acc+123)
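For `x=x*(acc+123)`, one possible answer, sketched with Python's `re` module (the token class names are illustrative): `x` and `acc` are identifier lexemes, `123` is a number, `=`, `*`, `+` are operators, and `(`, `)` are punctuation.

```python
import re

SPEC = [
    ("id",     r"[A-Za-z_]\w*"),   # x, acc
    ("number", r"\d+"),            # 123
    ("op",     r"[=*+]"),          # =, *, +
    ("punct",  r"[()]"),           # ( and )
]
pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

pairs = [(m.lastgroup, m.group()) for m in pattern.finditer("x=x*(acc+123)")]
```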

20
Implementation: Transition Diagrams
• An intermediate step in constructing the
lexical analyzer.
• Patterns are converted into flowcharts
called transition diagrams.
– Nodes (circles) are called states.
– Edges are directed from one state to
another and labelled by symbols.
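A transition diagram can be implemented directly as code, with one branch per state and one transition per labelled edge. The sketch below encodes the usual identifier diagram, letter (letter | digit)*; the state numbering is illustrative:

```python
def is_identifier(s):
    """Simulate the transition diagram for letter (letter | digit)*.
    State 0: start; state 1: seen a leading letter (accepting)."""
    state = 0
    for ch in s:
        if state == 0:
            if ch.isalpha() or ch == "_":
                state = 1              # edge labelled "letter"
            else:
                return False           # no edge for this symbol
        elif state == 1:
            if ch.isalnum() or ch == "_":
                state = 1              # edge labelled "letter or digit"
            else:
                return False
    return state == 1                  # accept only in state 1
```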

21
Handling Lexical Errors
1. It is hard for the lexical analyzer, without the aid of other components, to tell that there
is a source-code error.
-- If the statement fi is encountered for the first time in a C program, the lexical analyzer
cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.
-- The parser will probably be able to handle this case.

In what situations do errors occur?

The lexical analyzer is unable to proceed because none of the patterns for tokens matches a
prefix of the remaining input.

Panic mode recovery

• Delete successive characters from the remaining input until the analyzer can find a
well-formed token.
• May confuse the parser, creating a syntax error.
22
Handling Lexical Errors
Panic mode recovery: delete successive characters from the remaining input until the analyzer
can find a well-formed token. This may confuse the parser, creating a syntax error.

Other possible error-recovery actions:
• Insert a missing character
• Delete a character
• Replace a character by another
• Transpose two adjacent characters

In short: deleting or inserting input characters, and replacing or transposing characters.
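Panic-mode deletion can be sketched as follows; `TOKEN` is a toy pattern standing in for the real token set:

```python
import re

TOKEN = re.compile(r"[A-Za-z]+|\d+")   # toy token patterns

def panic_mode(source, pos):
    """Delete successive characters until some token pattern
    matches a prefix of the remaining input."""
    deleted = []
    while pos < len(source) and TOKEN.match(source, pos) is None:
        deleted.append(source[pos])    # character is skipped (deleted)
        pos += 1
    return pos, "".join(deleted)

pos, dropped = panic_mode("@#if x", 0)
```

Here the stray `@#` is discarded and scanning resumes at `if`, where a token pattern matches again.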
23
Handling Lexical Errors
Minimum distance error correction

• It is the strategy generally followed by the lexical analyzer to correct the errors in the
lexemes.
• It is nothing but the minimum number of corrections to be made to convert an invalid
lexeme to a valid lexeme.
• But it is not generally used in practice because it is too costly to implement.

24
Tokens Specification
We need a formal way to specify patterns: regular expressions.
• Alphabet: any finite set of symbols
• String over an alphabet: a finite sequence of symbols drawn from that alphabet
• Language: a countable set of strings over some fixed alphabet

• Scanners are specialized pattern-matching processors.

• Regular expressions (REs) are used to represent patterns of strings.
• A regular expression r is defined by the set of strings that match it.
• The language generated by a regular expression r is denoted L(r).
• The set of symbols of the language is called its alphabet and is denoted Σ.
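For instance, with Σ = {a, b}, the regular expression r = a|b denotes the language L(r) = {a, b}. Python's `re.fullmatch` can serve as a membership test for L(r):

```python
import re

r = re.compile(r"a|b")   # L(r) = {a, b} over the alphabet {a, b}

# Keep only the candidate strings that belong to L(r).
language_sample = [s for s in ["a", "b", "ab", ""] if r.fullmatch(s)]
```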

25
Tokens Specification
Kleene closure

26
Tokens Specification
Operations on Languages

27
Tokens Specification
Rules for specifying Regular Expressions

28
Tokens Specification
Regular Expressions

29
Tokens Specification
Regular Expressions example

30
Tokens Specification
Algebraic laws for regular expressions

31
Tokens Specification
Regular Expressions example

Which language is generated by:


• (a|b)(a|b)
• a*
• (a|b)*
• a|a*b

32
Tokens Specification
Regular Definition

33
Tokens Specification
Regular Definition

34
Tokens Specification
Regular Definition

35
Tokens Specification
Describe the language denoted by the following regular expression:

a(a|b)*a
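This denotes all strings of a's and b's that begin and end with a (and therefore have length at least two). A quick membership check:

```python
import re

r = re.compile(r"a(a|b)*a")   # starts with a, ends with a, length >= 2

ok = [s for s in ["aa", "aba", "abba", "a", "ab", "ba"] if r.fullmatch(s)]
```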

36
Tokens Specification

37
Tokens Specification
How many prefixes are there in a string of length n?

38
Tokens Specification
How many suffixes are there in a string of length n?

39
Tokens Specification
How many proper suffixes are there in a string of length n?

40
Tokens Specification
How many proper substrings are there in a string of length n?

41
Tokens Specification
How many proper subsequences are there in a string of length n?
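For these counting questions: a string of length n has n + 1 prefixes and n + 1 suffixes (counting ε and the string itself). Taking "proper" to exclude both ε and the string itself (the Dragon-book convention), there are n − 1 proper suffixes; and if all n symbols are distinct, n(n+1)/2 − 1 proper substrings and 2^n − 2 proper subsequences. A brute-force check on a small example:

```python
from itertools import combinations

def counts(s):
    """Count prefixes, suffixes, and proper suffixes/substrings/subsequences
    by enumerating them (distinct strings only)."""
    n = len(s)
    prefixes   = {s[:i] for i in range(n + 1)}
    suffixes   = {s[i:] for i in range(n + 1)}
    substrings = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)} | {""}
    subseqs    = {"".join(c) for k in range(n + 1) for c in combinations(s, k)}
    proper = lambda xs: xs - {"", s}     # drop the empty string and s itself
    return (len(prefixes), len(suffixes),
            len(proper(suffixes)), len(proper(substrings)), len(proper(subseqs)))
```

For "abc" (n = 3, all symbols distinct) this gives 4 prefixes, 4 suffixes, 2 proper suffixes, 5 proper substrings, and 6 proper subsequences, matching the formulas above.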

42
43
Tokens Specification
Write a regular definition for the following language:

All strings of lowercase letters that contain the five vowels in order.
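One common solution (reading "in order" as: all vowel occurrences respect the order a, e, i, o, u), with cons standing for any consonant, is cons* a (a|cons)* e (e|cons)* i (i|cons)* o (o|cons)* u (u|cons)*. Checked here with Python's `re` as a stand-in for the formal notation:

```python
import re

cons = r"[b-df-hj-np-tv-z]"   # any lowercase consonant
vowels_in_order = re.compile(
    rf"{cons}*a(a|{cons})*e(e|{cons})*i(i|{cons})*o(o|{cons})*u(u|{cons})*"
)

def check(s):
    return vowels_in_order.fullmatch(s) is not None
```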

44
Tokens Specification
Write a regular definition for the following language:

All strings of lowercase letters in which the letters are in ascending lexicographic order.
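Since the letters must be in nondecreasing order, a regular definition is simply a* b* c* … z*. Checked with `re`:

```python
import re

# a* b* c* ... z* : each letter may repeat, but only in alphabetical order.
ascending = re.compile("".join(c + "*" for c in "abcdefghijklmnopqrstuvwxyz"))

def check(s):
    return ascending.fullmatch(s) is not None
```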

45
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain an even number of a’s.
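One possible answer is b* (a b* a b*)*: any b's, then repeated groups each containing exactly two a's. Checked with `re`:

```python
import re

even_as = re.compile(r"b*(ab*ab*)*")   # a's always come in pairs

def check(s):
    return even_as.fullmatch(s) is not None
```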

46
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain an odd number of b’s.
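One possible answer is a* b a* (b a* b a*)*: one b, followed by repeated pairs of b's. Checked with `re`:

```python
import re

odd_bs = re.compile(r"a*ba*(ba*ba*)*")   # one b plus pairs of extra b's

def check(s):
    return odd_bs.fullmatch(s) is not None
```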

47
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain at least two b’s.
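One possible answer is a* b a* b (a|b)*: once two b's have been seen, anything may follow. Checked with `re`:

```python
import re

at_least_two_bs = re.compile(r"a*ba*b(a|b)*")

def check(s):
    return at_least_two_bs.fullmatch(s) is not None
```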

48
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain at most two b’s.
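One possible answer is a* (b|ε) a* (b|ε) a*, written in `re` syntax with `?` for the optional b's:

```python
import re

at_most_two_bs = re.compile(r"a*b?a*b?a*")   # zero, one, or two b's

def check(s):
    return at_most_two_bs.fullmatch(s) is not None
```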

49
Tokens Specification
Write a regular definition for the following language:

All strings of a’s and b’s that contain exactly two or three b’s.
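One possible answer is a* b a* b a* (b a*)?: two mandatory b's plus an optional third. Checked with `re`:

```python
import re

two_or_three_bs = re.compile(r"a*ba*ba*(ba*)?")

def check(s):
    return two_or_three_bs.fullmatch(s) is not None
```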

50
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


1. Recognition of Identifiers.
2. Recognition of Delimiters.
3. Recognition of Relational Operators.
4. Recognition of Keywords.
5. Recognition of numbers.

51
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


1.Recognition of Identifiers.

52
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


2. Recognition of Delimiters.

53
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


3. Recognition of Relational Operators.

54
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


4. Recognition of Keywords.

55
Token Recognition
Tokens are recognized with the transition diagram.

We need to recognize the following tokens:


5. Recognition of numbers.

56
Token Recognition
Regular Definition

57
Token Recognition

58
What Else Does Lexical Analyzer Do?

Scan away blanks, new lines, and tabs.


Can we define tokens for these?
blank → blank
tab → tab
newline → newline
delim → blank | tab | newline
ws → delim+
In these cases no token is returned to the parser.
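The ws → delim+ rule above can be checked directly: when the scanner matches ws, it simply advances past it without returning a token to the parser.

```python
import re

delim = r"[ \t\n]"             # delim -> blank | tab | newline
ws = re.compile(delim + "+")   # ws -> delim+

def skip_ws(source, pos):
    """Advance past white space; no token is returned to the parser."""
    m = ws.match(source, pos)
    return m.end() if m else pos
```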

59
Overall

60