Lexical Analysis

The document discusses the role of a lexical analyzer in compiling source code. It begins by explaining that a lexical analyzer identifies words (tokens) by converting character streams into token streams. It then discusses how the lexical analyzer interacts with the parser and some of its secondary tasks. The rest of the document covers various issues in lexical analysis like simplicity, efficiency, and portability. It also defines key terminology used in lexical analysis like tokens, lexemes, and patterns.

Lexical Analysis

Lecture 02
Role of the Lexical Analyzer

• Identify the words: Lexical Analysis
  – Converts a stream of characters (the input program) into a stream of tokens.
  – Also called Scanning or Tokenizing.
• Identify the sentences: Parsing
  – Derive the structure of sentences: construct parse trees from a stream of tokens.

Interaction of Lexical Analyzer with Parser

          Next_char()            Next_token()
  Input ------------> Scanner -------------> Parser
        (character)            (token)
                      \           /
                      Symbol Table

• Often a subroutine of the parser

• Secondary tasks of the lexical analyzer:
  – Strip out comments and whitespace from the source
  – Correlate error messages with the source program
  – Preprocessing may be performed as lexical analysis takes place
Issues in lexical analysis

• Simplicity/Modularity: Conventions about “words" are often different from conventions about “sentences".

• Efficiency: The word-identification problem has a much more efficient solution than the sentence-identification problem.

• Portability: Character set, special characters, device features.
Terminology

• Token: Name given to a family of words.
  – e.g., tok_integer_constant
• Lexeme: Actual sequence of characters representing a word.
  – e.g., 32894
• Pattern: Notation used to identify the set of lexemes represented by a token.
  – e.g., digit followed by zero or more digits

Token                  Sample Lexemes        Pattern
tok_while              while                 while
tok_integer_constant   32894, -1093, 0       digit followed by zero or more digits
tok_relation           <, <=, =, !=, >, >=   < or <= or = or != or >= or >
tok_identifier         buffer_size, D2       letter followed by letters or digits
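The patterns in this table can be written as regular expressions. A minimal sketch in Python (the regex strings are illustrative assumptions; treating `_` as a letter is also an assumption, made so the sample lexeme buffer_size matches; the token names come from the table above):

```python
import re

# Hypothetical regex equivalents of the patterns in the table above.
TOKEN_PATTERNS = {
    "tok_while": r"while",
    "tok_integer_constant": r"-?\d+",            # optional sign, then digits
    "tok_relation": r"<=|>=|!=|<|>|=",           # longer alternatives first
    "tok_identifier": r"[A-Za-z_][A-Za-z0-9_]*", # assumes '_' counts as a letter
}

def matches(token, lexeme):
    """True if the lexeme belongs to the token's pattern."""
    return re.fullmatch(TOKEN_PATTERNS[token], lexeme) is not None
```

Note the ordering inside tok_relation: the two-character alternatives come first so that `<=` is not matched as `<` followed by a stray `=`.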
Token Stream

• Tokens are the terminal symbols in the grammar for the source language
• Keywords, operators, identifiers, constants, literal strings, punctuation symbols, etc. are treated as tokens

• Source:
if ( x == -3.1415 ) /* test x */ then ...
• Token Stream:
< IF >
< LPAREN >
< ID, “x” >
< EQUALS >
< NUM, -3.14150000 >
< RPAREN >
< THEN >
...
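A tiny scanner sketch that reproduces this token stream (the regex set is an illustrative assumption, not a full language definition; unmatched characters are silently skipped by `finditer` in this sketch):

```python
import re

# One alternation of named groups; WS and COMMENT are matched but dropped,
# which is exactly the "strip comments and whitespace" secondary task.
TOKEN_SPEC = [
    ("WS",      r"\s+"),
    ("COMMENT", r"/\*.*?\*/"),
    ("IF",      r"if\b"),
    ("THEN",    r"then\b"),
    ("NUM",     r"-?\d+(?:\.\d+)?"),
    ("EQUALS",  r"=="),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("ID",      r"[A-Za-z_]\w*"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

def tokenize(src):
    """Convert a character stream into a list of (token, lexeme) pairs."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src)
            if m.lastgroup not in ("WS", "COMMENT")]
```

Keyword patterns such as IF are listed before ID so that `if` is reported as a keyword, not an identifier; the trailing `\b` prevents `iffy` from matching IF.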
Token Attributes
• More than one lexeme may match a pattern
  – We need an attribute to distinguish them
  – e.g., “tok_relation” matches “< or <= or = or != or >= or >”
• < tok_integer_constant, 1415 >
  – i.e., <token type, token attribute (if available)>
  – The lexical analyzer collects information about tokens as well as their attributes
• Attributes influence the translation of tokens
• A token usually has only a single attribute
  – A pointer to the symbol-table entry
  – Other attributes (e.g., line number, lexeme) can be stored in the symbol table
• Example: E = M * C ** 2
    <tok_identifier, pointer to symbol-table entry for E>
    <tok_assign, >
    <tok_identifier, pointer to symbol-table entry for M>
    ......
Lexical Error

• Few errors can be caught by the lexical analyzer
  – Most errors tend to be “typos”
  – Not noticed by the programmer:
      return 1.23;
      retunn 1,23;
  – ...still results in a sequence of legal tokens:
      <ID, “retunn”> <INT, 1> <COMMA> <INT, 23> <SEMICOLON>
  – No lexical error, but problems during parsing!
– Another example: fi (a == f(x))
• Errors caught by lexer:
– EOF within a String / missing ”
– Invalid ASCII character in file
– String / ID exceeds maximum length
– etc...
Recovery from lexical errors

• Panic mode recovery
  – Delete successive characters from the input until the lexical analyzer can find a well-formed token
  – “…….day = 30 ^^^ month; …….”
  – May confuse the parser
    • The parser will detect syntax errors and get straightened out (hopefully!)
• Other possible error-recovery actions:
  – Deleting an extra character
  – Inserting a missing character
  – Replacing an incorrect character by a correct character
  – Swapping two adjacent characters
• Attempt to repair the input using single error transformations
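The four single error transformations above can be sketched directly: generate every string within one edit of the offending lexeme and keep those that are valid. The keyword list here is a stand-in assumption for illustration.

```python
# Sketch of single-error-transformation repair: try every edit at
# distance one (delete, insert, replace, swap) and see whether the
# result is a known keyword. KEYWORDS is a hypothetical stand-in.
KEYWORDS = {"if", "then", "else", "while", "return"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def repairs(word):
    """All keywords reachable from `word` by one transformation."""
    candidates = set()
    for i in range(len(word)):
        candidates.add(word[:i] + word[i+1:])            # delete a character
        for c in ALPHABET:
            candidates.add(word[:i] + c + word[i+1:])    # replace a character
    for i in range(len(word) + 1):
        for c in ALPHABET:
            candidates.add(word[:i] + c + word[i:])      # insert a character
    for i in range(len(word) - 1):
        candidates.add(word[:i] + word[i+1] + word[i] + word[i+2:])  # swap adjacent
    return candidates & KEYWORDS
```

Applied to the earlier examples: `fi` repairs to `if` by one swap, and `retunn` repairs to `return` by one replacement.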
Implementing a lexical analyzer

Options, from harder to implement (and faster) to easier to implement (and slower):
  1. Assembly language
  2. System-programming language
  3. Lexical-analyzer generator
Managing Input Buffers

• Option 1: Read one character from the OS at a time.
• Option 2: Read N characters per system call
  – e.g., N = 4096
  – Manage input buffers in the lexer
  – More efficient
• Often, we need to look ahead:

    .. E = M * C * * 2 ..
       ^               ^
       lexeme_beginning  forward

• But, due to lookahead, we need to push back the lookahead characters
  – Need a specialized buffer-management technique to improve efficiency
Buffer Pairs

• A token could overlap / span buffer boundaries
  – e.g., the lexeme 12.46 split across two buffers as “.. .. 1 2” | “. 4 6 .. ..”
• Solution: use a pair of buffers of N characters each

      N characters             N characters
    [ .. .. E = M * C ]    [ * * 2 .. .. .. .. ]
           ^                       ^
           lexeme_beginning        forward

• Deficiency: every advance of forward must test for a buffer boundary:

    if forward at end of buffer #1 then
        reload buffer #2;
        forward = forward + 1;
    else if forward at end of buffer #2 then
        reload buffer #1;
        move forward to the beginning of buffer #1;
    else
        forward = forward + 1;
Sentinels

• Technique: Use “sentinels” to reduce testing
• Choose some character that occurs rarely in most inputs, e.g. ‘\0’

      N characters              N characters
    [ .. .. E = M * \0 ]    [ C * * 2 .. \0 ]
           ^                    ^
           lexeme_beginning     forward

    forward++;
    if *forward == ‘\0’ then
        if forward at end of buffer #1 then
            read next N bytes into buffer #2;
            forward = address of first char of buffer #2;
        else if forward at end of buffer #2 then
            read next N bytes into buffer #1;
            forward = address of first char of buffer #1;
        else
            // do nothing; a real \0 occurs in the input
        end if
    end if
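The sentinel scheme above can be simulated in Python to see the control flow; a sketch only (the class name, buffer size, and use of a string stream are all assumptions of this simulation, not part of the slides):

```python
import io

class DoubleBuffer:
    """Simulation of the sentinel buffer pair: each buffer holds up to
    N characters of input followed by a '\\0' sentinel in its last slot."""

    def __init__(self, stream, n=8):
        self.stream, self.n = stream, n
        self.bufs = [self._fill(), ""]   # buffer #2 is filled on demand
        self.cur, self.i = 0, 0          # current buffer and `forward` index

    def _fill(self):
        # Read the next N characters and append the sentinel.
        return self.stream.read(self.n) + "\0"

    def next_char(self):
        buf = self.bufs[self.cur]
        ch = buf[self.i]
        self.i += 1
        if ch != "\0":
            return ch
        if self.i < len(buf):
            return ch                    # a real '\0' occurs in the input
        if len(buf) == self.n + 1:       # sentinel at end of a *full* buffer
            self.cur = 1 - self.cur      # switch to and reload the other buffer
            self.bufs[self.cur] = self._fill()
            self.i = 0
            return self.next_char()
        return ""                        # short buffer: genuine end of input
```

The common case (any character other than the sentinel) costs exactly one test, which is the whole point of the technique.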
Terminology
• Alphabet (Σ): AKA character class
  – A set of symbols (“characters”)
  – Examples: Σ = { 1, 0 } : binary alphabet
              Σ = { 1, 2, 3, 4, 5, 6 } : alphabet of dice outcomes
• String: AKA sentence or word
  – A sequence of symbols, finite in length
  – Example: abbadc; the length of s is written |s|
• Empty string (ε)
  – It is a string
  – |ε| = 0
• Language
  – A set of strings over some fixed alphabet
  – Each string is finite in length, but the set may have an infinite number of elements
  – Examples: L1 = { a, baa, bccb }
              L2 = { }
              L3 = { ε }        (note the difference between L2 and L3)
              L4 = { ε, ab, abab, ababab, abababab, ... }
Terminology
• Prefix ...of string s
– String obtained by removing zero or more trailing symbols
– s = hello
– Prefixes: ε, h, he, hel, hell, hello
• Suffix ...of string s
– String obtained by deleting zero or more of the leading symbols
– s = hello
– Suffixes: hello, ello, llo, lo, o, ε
• Substring ...of string s
  – String obtained by deleting a prefix and a suffix
  – s = hello
  – Substrings: ε, ell, hel, llo, hello, ...
• Proper prefix / suffix / substring ...of s
  – A string s1 that is respectively a prefix, suffix, or substring of s such that s1 ≠ s and s1 ≠ ε
• Subsequence ...of string s
  – String formed by deleting zero or more not necessarily contiguous symbols
  – s = hello
  – Subsequences: hll, eo, hlo, etc.
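The string operations above are easy to write out exactly; a small sketch (helper names are arbitrary):

```python
# Prefixes, suffixes, and substrings of a string, per the definitions above.
def prefixes(s):
    """All strings obtained by removing zero or more trailing symbols."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """All strings obtained by removing zero or more leading symbols."""
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    """Every substring is obtained by deleting a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
```

Note that both ε (take i = 0 or i = len(s)) and s itself are always included; the "proper" variants exclude exactly those two.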
Terminology

• Language
– A set of strings
– L = { ... }
– M = { ... }
• Union of two languages
– L ∪ M = { s | s is in L or is in M }
– Example:
– L = { a, ab }
– M = { c, dd }
– L ∪ M = { a, ab, c, dd }
• Concatenation of two languages
– L M = { st | s is in L and t is in M }
– Example:
– L = { a, ab }
– M = { c, dd }
– L M = { ac, add, abc, abdd }
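These two operations map directly onto Python set operations; a sketch using the example sets above:

```python
# Union and concatenation of languages as plain Python set operations.
def union(L, M):
    """L ∪ M = { s | s is in L or s is in M }"""
    return L | M

def concat(L, M):
    """LM = { st | s is in L and t is in M }"""
    return {s + t for s in L for t in M}
```

Concatenation pairs every string of L with every string of M, so |LM| is at most |L|·|M| (less if different pairs produce the same string).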
Kleene closure

• Let: L = { a, bc }
• Example: L^0 = { ε }
           L^1 = L = { a, bc }
           L^2 = LL = { aa, abc, bca, bcbc }
           L^3 = LLL = { aaa, aabc, abca, abcbc, bcaa, bcabc, bcbca, bcbcbc }
           ...etc...
           L^N = L^(N-1) L = L L^(N-1)

Positive closure

• The “Positive Closure” of a language:
           L+ = ∪ (i = 1 to ∞) L^i = L^1 ∪ L^2 ∪ L^3 ∪ ...
  – Note: ε is not included UNLESS it is in L to start with
• Example:
  – L+ = { a, bc, aa, abc, bca, bcbc, aaa, aabc, abca, abcbc, ... }
         (elements drawn from L^1, L^2, L^3, ...)
Example

Let: L = { a, b, c, ..., z }
D = { 0, 1, 2, ..., 9 }

D+ = “the set of strings with one or more digits”

L ∪ D = “the set of alphanumeric characters”
      = { a, b, c, ..., z, 0, 1, 2, ..., 9 }

( L ∪ D )* = “sequences of zero or more letters and digits”

L ( ( L ∪ D )* ) = “the set of strings that start with a letter, followed by zero or more letters and digits”
Definition: Regular Expressions

• (Over alphabet ∑)

• ε is a regular expression.

• If a is a symbol (i.e., if a ∈ Σ), then a is a regular expression.

• If R and S are regular expressions, then R|S is a regular expression.

• If R and S are regular expressions, then RS is a regular expression.

• If R is a regular expression, then R* is a regular expression.

• If R is a regular expression, then (R) is a regular expression.

Regular Expressions and Language

• (Over alphabet ∑)
• And, given a regular expression R, what is L(R) ?
• ε is a regular expression.
– L(ε) = { ε }
• If a is a symbol (i.e., if a ∈ Σ), then a is a regular expression.
– L(a) = { a }
• If R and S are regular expressions, then R|S is a regular
expression.
– L(R|S) = L(R) ∪ L(S)
• If R and S are regular expressions, then RS is a regular
expression.
– L(RS) = L(R) L(S)
• If R is a regular expression, then R* is a regular expression.
– L(R*) = (L(R))*
• If R is a regular expression, then (R) is a regular expression.
How to “Parse” Regular Expressions
• Precedence:
– * has highest precedence.
– Concatenation has middle precedence.
– | has lowest precedence.
– Use parentheses to override these rules.

• Examples:
– a b* = a (b*)
• If you want (a b)* you must use parentheses.
– a | b c = a | (b c)
• If you want (a | b) c you must use parentheses.

• Concatenation and | are associative.
  – (a b) c = a (b c) = a b c
– (a | b) | c = a | (b | c) = a | b | c
• Example:
– b d | e f * | g a = (b d) | (e (f *)) | (g a)
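These precedence rules match the conventions of Python's `re` module, so they can be checked mechanically; a small sketch (the helper name is arbitrary):

```python
import re

# Python's `re` follows the precedence described above:
# * binds tightest, then concatenation, then |.
def full(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None
```

For example, `ab*` accepts abbb but not abab, while `(ab)*` accepts abab; and `a|bc` accepts bc but not ac, while `(a|b)c` accepts ac.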
Regular Language
• Definition: “Regular Language” (or “Regular Set”)
• ... A language that can be described by a regular expression.

• Any finite language (i.e., a finite set of strings) is a regular language.
• Regular languages are (usually) infinite.
• Regular languages are, in some sense, simple languages.
• Regular Languages ⊂ Context-Free Languages

• Examples:
– a | b | cab {a, b, cab}
– b* {ε, b, bb, bbb, ...}
– a | b* {a, ε, b, bb, bbb, ...}
– (a | b)* {ε, a, b, aa, ab, ba, bb, aaa, ...}
“Set of all strings of a’s and b’s, including ε.”
Equality vs Equivalence

• Are these regular expressions equal?
    R = a a* (b | c)
    S = a* a (c | b)
  ... No!
• Yet, they describe the same language:
    L(R) = L(S)
• “Equivalence” of regular expressions
If L(R) = L(S) then we say R ≅ S
“R is equivalent to S”
• From now on, we’ll just say R = S to mean R ≅ S
Algebraic laws of regular expressions
Regular Definition

• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the following form:

    d1 → r1
    d2 → r2
    ……..
    dn → rn

  where
  – Each di is a new symbol such that di ∉ Σ and di ≠ dj for all j < i
  – Each ri is a regular expression over Σ ∪ { d1, d2, …, di-1 }
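A regular definition can be built up the same way in code, with each name expanding to a regex over the alphabet plus the previously defined names; a sketch using the familiar letter/digit/id example (the specific definitions are an illustrative assumption):

```python
import re

# A regular definition built step by step: each right-hand side may use
# Σ plus the names defined before it.
defs = {}
defs["letter"] = r"[A-Za-z]"
defs["digit"] = r"[0-9]"
# id → letter ( letter | digit )*
defs["id"] = f"{defs['letter']}(?:{defs['letter']}|{defs['digit']})*"

def is_id(s):
    """True if s matches the `id` regular definition."""
    return re.fullmatch(defs["id"], s) is not None
```

Because each ri only refers to earlier names, the substitutions can always be expanded out, which is why regular definitions add convenience but no expressive power beyond regular expressions.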
Regular Definition
Additional Notation / Shorthand
Nonregular sets
Problem: How to describe tokens?
Solution: Regular Expressions

Problem: How to recognize tokens?

Approaches:
1. Hand-coded routines
2. Finite State Automata
3. Scanner Generators (Java: JLex, C: Lex)

Scanner Generators
Input: Sequence of regular definitions
Output: A lexer (e.g., a program in Java or “C”)

Approach:
– Read in regular expressions
– Convert into a Finite State Automaton (FSA)
– Optimize the FSA
– Represent the FSA with tables / arrays
– Generate a table-driven lexer (Combine “canned” code with tables.)
