Lecture 3 (30-1-23)

The document discusses lexical analysis in compilers. Lexical analysis scans the input program to identify valid tokens, removing comments and extra whitespace along the way. It breaks the input into tokens using patterns and passes them to the parser. Regular expressions are used to specify valid tokens, and finite state automata are used to recognize them.

Uploaded by

Tahsk


Lexical Analysis

• Scan the input program to identify valid words; remove comments and extra white space.
• How do we specify the valid words of a language?
  • Regular expressions.
• How do we check whether a sequence of characters matches the valid words of a language?
  • Finite automata.

• token (also called word) -> a set of strings defining an atomic element with a defined meaning
• pattern -> a rule describing a set of strings (specified using a regular expression)
• lexeme -> a sequence of characters that matches some pattern
• symbol -> the recognized token
  • At the first occurrence of the symbol, an entry is made in the symbol table
  • Additional information (attributes) about the symbol may be added by the parser

Examples

Token         Pattern                  Sample lexeme
while         while                    while
relation_op   = | != | < | >           <
integer       (0-9)*                   42
string        characters between “ ”   “hello”
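
The token/pattern/lexeme table above can be turned into a small scanner. The following is a minimal sketch, assuming Python's standard `re` module. The names `TOKEN_SPEC`, `MASTER` and `tokenize` are hypothetical, the token names are upper-cased for use as group names, and the integer pattern is written as one or more digits because `(0-9)*` as written would also match the empty string. Straight double quotes stand in for the quotation marks in the table.

```python
import re

# (token name, pattern) pairs, tried in this order.
# Names, order and the whitespace rule are assumptions of this sketch.
TOKEN_SPEC = [
    ("WHILE",       r"while\b"),
    ("RELATION_OP", r"!=|=|<|>"),
    ("INTEGER",     r"[0-9]+"),
    ("STRING",      r'"[^"]*"'),
    ("SKIP",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs; raise on an illegal character."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":          # white space is dropped, not passed on
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize('while 42 != "hello"'))
```

Each pair in the result is a token together with the lexeme that matched its pattern.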

Tokens

• Keywords, operators, identifiers (names), constants, literal strings, and punctuation symbols such as parentheses, brackets, commas, semicolons, and colons
• A unique integer representing the token is passed by the lexical analyzer (LA) to the parser
• Attributes for tokens (apart from the integer representing the token)
  • identifier: the lexeme of the token, or a pointer into the symbol table where the lexeme is stored by the LA
  • intnum: the value of the integer (similarly for floatnum, etc.)
  • string: the string itself
• The exact set of attributes depends on the compiler designer
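
One way to picture the integer code plus attributes is a small record type with a symbol table behind it. This is a sketch under assumptions: the names `Token`, `SymbolTable`, `intern` and `make_token`, and the numeric codes, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Integer codes passed to the parser; the particular values are arbitrary.
IDENTIFIER, INTNUM, STRING = 1, 2, 3


@dataclass
class Token:
    code: int        # unique integer representing the token
    attribute: Any   # symbol-table index, numeric value, or the string itself


class SymbolTable:
    def __init__(self):
        self.lexemes = []   # index -> lexeme
        self.index = {}     # lexeme -> index

    def intern(self, lexeme):
        """Enter the lexeme on its first occurrence; return its index either way."""
        if lexeme not in self.index:
            self.index[lexeme] = len(self.lexemes)
            self.lexemes.append(lexeme)
        return self.index[lexeme]


def make_token(kind, lexeme, table):
    """Attach the attribute that the slide lists for each kind of token."""
    if kind == IDENTIFIER:
        return Token(IDENTIFIER, table.intern(lexeme))   # pointer into the table
    if kind == INTNUM:
        return Token(INTNUM, int(lexeme))                # value of the integer
    if kind == STRING:
        return Token(STRING, lexeme)                     # the string itself
    raise ValueError(f"unknown token kind {kind}")


table = SymbolTable()
print(make_token(IDENTIFIER, "count", table))
print(make_token(IDENTIFIER, "count", table))   # same index: entry made only once
print(make_token(INTNUM, "42", table))
```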

Challenges in lexical analysis
• Certain languages do not have any reserved words, e.g., while, do, if, else, etc., are reserved in ’C’, but not in PL/1

Example of using do loop in FORTRAN
• In FORTRAN, some keywords are context-dependent
  • In the statement DO 10 I = 10.86
    • DO10I is an identifier, and DO is not a keyword
  • But in the statement DO 10 I = 10, 86
    • DO is a keyword
  • Such features require substantial lookahead for resolution
    • In the example above, we cannot be sure that DO is a keyword until we see the comma (after 10)
• Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but not so in ’C’
• Lexical analysis cannot catch any significant errors, only simple ones such as illegal symbols
  • In such cases, lexical analysis skips characters in the input until a well-formed token is found
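
The lookahead needed for the DO example can be sketched as follows. This is a deliberately simplified illustration, not how a real FORTRAN front end works: the function name `classify_do_statement` is hypothetical, blanks are simply deleted, and the only test is whether a comma appears outside parentheses after the `=` sign. String constants and other FORTRAN details are ignored.

```python
def classify_do_statement(stmt):
    """Guess whether a statement starting with DO is a loop or an assignment.

    Simplifying assumption: a top-level comma after '=' means a DO loop.
    """
    compact = stmt.replace(" ", "").upper()   # blanks are not significant
    if not compact.startswith("DO"):
        return "other"
    eq = compact.find("=")
    if eq == -1:
        return "other"
    depth = 0
    for ch in compact[eq + 1:]:               # look ahead past the '='
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "do-loop"                  # DO is a keyword
    return "assignment"                       # e.g. DO10I is an identifier


print(classify_do_statement("DO 10 I = 10.86"))
print(classify_do_statement("DO 10 I = 10, 86"))
```

The scanner has to read the whole right-hand side before it can decide how to split the first few characters, which is why this kind of feature is costly.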

Languages
• Symbol: An abstract entity, not defined
  • Examples: letters {a,b,c,…,z} and digits {0,1,…,9}
• String: A finite sequence of symbols
  • abcb, caba are strings over the symbols {a,b,c}
  • |w| is the length of the string w, i.e., the number of symbols in it
  • ∊ is the empty string and is of length 0
• Alphabet: A finite set of symbols (e.g., {a,b,c,…,z}, {0,1,…,9})
• Language: A set of strings of symbols from some alphabet
  • φ (empty language) and {∊} (set with empty string) are languages
  • The set of palindromes over {0,1} is an infinite language
  • The set of strings {01, 10, 111} over {0,1} is a finite language
• If Σ is an alphabet, Σ∗ is the set of all strings over Σ
• We need a ‘finite representation’ (encoded by a finite string) for a language
  • Regular language (or type-3) is represented by a regular expression
  • Context-free language (or type-2) is represented by a context-free grammar
  • Context-sensitive language (or type-1) is represented by a context-sensitive grammar
  • Type-0 language is represented by a type-0 grammar
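
These definitions map directly onto ordinary strings and sets. The sketch below is illustrative only: the helper names `strings_up_to` and `is_palindrome` are hypothetical, and because Σ∗ is infinite the code lists only its strings up to a chosen length.

```python
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield all strings over `alphabet` of length <= max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=n):
            yield "".join(symbols)


def is_palindrome(w):
    return w == w[::-1]


sigma = {"0", "1"}                                # an alphabet
print(len("abcb"), len(""))                       # |abcb| and |∊|
print(list(strings_up_to(sigma, 2)))              # start of Σ∗; '' is ∊
print([w for w in strings_up_to(sigma, 3) if is_palindrome(w)])

finite_language = {"01", "10", "111"}             # a finite language over {0,1}
print(all(set(w) <= sigma for w in finite_language))
```

Raising `max_len` keeps producing new palindromes, which matches the claim that the set of palindromes over {0,1} is infinite.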

Regular Expressions (REs)
• Let Σ be an alphabet. The REs over Σ and the languages they denote (or generate) are defined as below
  • φ (the empty language/set, which does not contain even the empty string) is an RE. L(φ) = φ
  • ∊ (empty string) is an RE. L(∊) = {∊}
  • For each a ∈ Σ, a is an RE. L(a) = {a}
    • E.g., if Σ = {1,2,3,4}, then each of 1, 2, 3, 4 is an RE, with L(1) = {1}, L(2) = {2}, …
  • If r and s are REs (not symbols) denoting the languages R and S, respectively
    • Concatenation: (rs) is an RE, L(rs) = R.S = {xy | x ∈ R ∧ y ∈ S}
    • Union: (r + s) (or (r|s)) is an RE, L(r + s) = R ∪ S
    • Kleene closure/closure: (r∗) is an RE, L(r∗) = R∗ = ⋃_{i ≥ 0} R^i, where R^0 = {∊} and R^i = R.R^(i−1) for i ≥ 1
      (L∗ is called the Kleene closure or closure of L)
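
The three operations can be tried out on small finite languages represented as Python sets of strings. This is a sketch with hypothetical function names (`union`, `concat`, `kleene_up_to`). Since the Kleene closure is infinite for any language containing a non-empty string, the last function computes only the members of R∗ up to a given length.

```python
def union(R, S):
    """L(r + s) = R ∪ S."""
    return R | S


def concat(R, S):
    """L(rs) = R.S = {xy | x in R and y in S}."""
    return {x + y for x in R for y in S}


def kleene_up_to(R, max_len):
    """Members of R* of length <= max_len (R* itself is usually infinite)."""
    result = {""}            # R^0 = {∊}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from R.
        new = {x + y for x in frontier for y in R if len(x + y) <= max_len}
        new -= result
        result |= new
        frontier = new
    return result


print(sorted(concat({"a", "ab"}, {"b", ""})))
print(sorted(kleene_up_to({"1", "10"}, 3)))
print(kleene_up_to(set(), 5))   # closure of the empty language is {∊}
```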

Examples of Regular Expressions
• Given L = set of all strings of 0’s and 1’s, what is the RE?
  • r = (0 + 1)∗ or (0|1)∗
  • How do we generate the string 101?
    • (0 + 1)∗ ⇒ (0 + 1)(0 + 1)(0 + 1) ⇒ 101
• Given L = set of all strings of 0’s and 1’s with at least two consecutive 0’s, what is the RE?
  • r = (0 + 1)∗00(0 + 1)∗
• Given L = {w ∈ {0, 1}∗ | w has two or three occurrences of 1, the first and second of which are not consecutive}, what is the RE?
  • r = 0∗10∗010∗(10∗ + ∊)
• Given r = (1 + 10)∗, what is the language?
  • L = set of all strings of 0’s and 1’s beginning with 1 and not having two consecutive 0’s (the empty string ∊ is also in L)
• Given r = (0 + 1)∗011, what is the language?
  • L = set of all strings of 0’s and 1’s ending in 011

Examples of Regular Expressions
• Given r = c∗(a + bc∗)∗, what is the language?
  • L = set of all strings over {a,b,c} that do not have the substring ac
• Given L = {w | w ∈ {a, b}∗ ∧ w ends with a}, what is the RE?
  • r = (a + b)∗a
• Given L = {if, then, else, while, do, begin, end}, what is the RE?
  • r = if + then + else + while + do + begin + end
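
The RE examples can be checked mechanically by comparing each expression against a direct description of its language on every short string. The sketch below assumes Python's `re` syntax, where union is written `|` instead of `+` and `(x + ∊)` becomes `(x)?`. The names `CASES`, `words`, `ones_ok` and `agrees` are hypothetical, and agreement on strings up to a fixed length is evidence, not a proof.

```python
import re
from itertools import product


def words(alphabet, max_len):
    """All strings over `alphabet` of length <= max_len."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)


def ones_ok(w):
    """Two or three 1's, the first and second of which are not consecutive."""
    pos = [i for i, ch in enumerate(w) if ch == "1"]
    return len(pos) in (2, 3) and pos[1] - pos[0] > 1


# (alphabet, RE in Python syntax, membership test written directly)
CASES = [
    ("01",  r"(0|1)*00(0|1)*",  lambda w: "00" in w),
    ("01",  r"0*10*010*(10*)?", ones_ok),
    ("01",  r"(1|10)*",         lambda w: w == "" or (w[0] == "1" and "00" not in w)),
    ("01",  r"(0|1)*011",       lambda w: w.endswith("011")),
    ("abc", r"c*(a|bc*)*",      lambda w: "ac" not in w),
    ("ab",  r"(a|b)*a",         lambda w: w.endswith("a")),
]
KEYWORDS = r"if|then|else|while|do|begin|end"


def agrees(alphabet, pattern, predicate, max_len=7):
    """True if the RE and the predicate accept exactly the same short strings."""
    rx = re.compile(pattern)
    return all((rx.fullmatch(w) is not None) == predicate(w)
               for w in words(alphabet, max_len))


for alphabet, pattern, predicate in CASES:
    print(pattern, agrees(alphabet, pattern, predicate))
print(re.fullmatch(KEYWORDS, "while") is not None)
```

Note that the test for `(1|10)*` has to admit the empty string, since zero repetitions are allowed.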

Automata
• Automata are machines (abstract machines) that accept languages
  • Finite State Automata accept regular languages (RLs), corresponding to REs
  • Pushdown Automata accept context-free languages (CFLs), corresponding to context-free grammars (CFGs)
  • Linear Bounded Automata accept context-sensitive languages (CSLs), corresponding to context-sensitive grammars (CSGs)
  • Turing Machines accept type-0 languages, corresponding to type-0 grammars
• Applications of Automata
  • Switching circuit design
  • Lexical analyzer in a compiler
  • String processing (grep, awk), etc.
  • State charts used in object-oriented design
  • Modelling control applications, e.g., elevator operation
  • Parsers of all types
  • Compilers

Finite State Automaton (FSA)
• An FSA is an acceptor or recognizer of regular languages
• An FSA is a 5-tuple, (Q, Σ, δ, q0, F), where
  • Q is a finite set of states
  • Σ is the input alphabet
  • δ is the transition function, δ : Q × Σ → Q
    • That is, δ(q, a) is a state for each state q and input symbol a
  • q0 is the start state
  • F is the set of final or accepting states
• In one move from some state q, an FSA reads an input symbol, changes the state based on δ, and gets ready to read the next input symbol
• An FSA accepts its input string if, starting from q0, it consumes the entire input string and reaches a final state
• If the last state reached is not a final state, then the input string is rejected
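
The 5-tuple and the acceptance rule translate almost line for line into code. In this sketch the class name `FSA` and its field names are hypothetical. The example machine is an assumption constructed here for the earlier RE (0 + 1)∗011 and does not come from the slides. Rejecting any symbol outside Σ is likewise a choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: set       # Q
    alphabet: set     # Σ
    delta: dict       # δ as a mapping (state, symbol) -> state
    start: str        # q0
    finals: set       # F

    def accepts(self, w):
        state = self.start
        for symbol in w:                      # one move per input symbol
            if symbol not in self.alphabet:
                return False                  # sketch choice: reject foreign symbols
            state = self.delta[(state, symbol)]
        return state in self.finals           # accept only if we end in F


# Hypothetical machine for strings over {0,1} ending in 011.
# q1: last symbol seen is 0; q2: last two are 01; q3: last three are 011.
ends_in_011 = FSA(
    states={"q0", "q1", "q2", "q3"},
    alphabet={"0", "1"},
    delta={
        ("q0", "0"): "q1", ("q0", "1"): "q0",
        ("q1", "0"): "q1", ("q1", "1"): "q2",
        ("q2", "0"): "q1", ("q2", "1"): "q3",
        ("q3", "0"): "q1", ("q3", "1"): "q0",
    },
    start="q0",
    finals={"q3"},
)

for w in ["011", "1011", "0110", ""]:
    print(repr(w), ends_in_011.accepts(w))
```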

FSA example
• Q = {q0, q1, q2, q3} -> finite set of states
• Σ = {a, b, c} -> the input alphabet
• q0 is the start state and F = {q0, q2} (F -> set of final or accepting states)
• The transition function δ is defined by a table (not reproduced in this text)
• Language accepted by the FSA?
  • L is the set of all strings beginning with an ’a’ and ending with a ’c’ (∊ is also accepted)
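
The slide's transition table is not available in this text, so the table below is only a guess: one δ that is consistent with the stated Q, Σ, start state, F and accepted language. The role given to each state (q1 for "started with a, not currently ending in c", q2 for "currently ending in c", q3 as a dead state) is an assumption of this sketch, and the original table may differ.

```python
# Hypothetical transition table; NOT taken from the slide.
DELTA = {
    "q0": {"a": "q1", "b": "q3", "c": "q3"},   # must begin with 'a'
    "q1": {"a": "q1", "b": "q1", "c": "q2"},   # began with 'a', last symbol not 'c'
    "q2": {"a": "q1", "b": "q1", "c": "q2"},   # began with 'a', last symbol is 'c'
    "q3": {"a": "q3", "b": "q3", "c": "q3"},   # dead state: can never accept
}
START = "q0"
FINALS = {"q0", "q2"}


def accepts(w):
    """Run the automaton on w (a string over {a, b, c})."""
    state = START
    for symbol in w:
        state = DELTA[state][symbol]
    return state in FINALS


for w in ["", "ac", "abcbc", "a", "bc", "acb"]:
    print(repr(w), accepts(w))
```

Because q0 is both the start state and a final state, the empty string is accepted without reading any input.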
