0% found this document useful (0 votes)
26 views35 pages

Lexical Analysis I: Compiler Construction

This document provides an overview of lexical analysis for a compiler construction course. It discusses the role of the lexical analyzer in removing whitespace and comments, expanding macros, tokenizing the input, and interacting with the symbol table. It defines key terms like tokens, patterns, and lexemes. It also covers designing the lexical analyzer using techniques like regular expressions, transition diagrams, and buffering input. It provides examples of token classes and discussing error handling. Finally, it outlines the implementation of a lexical analyzer using tools like flex.

Uploaded by

Wakuma Lachisa
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
0% found this document useful (0 votes)
26 views35 pages

Lexical Analysis I: Compiler Construction

This document provides an overview of lexical analysis for a compiler construction course. It discusses the role of the lexical analyzer in removing whitespace and comments, expanding macros, tokenizing the input, and interacting with the symbol table. It defines key terms like tokens, patterns, and lexemes. It also covers designing the lexical analyzer using techniques like regular expressions, transition diagrams, and buffering input. It provides examples of token classes and discussing error handling. Finally, it outlines the implementation of a lexical analyzer using tools like flex.

Uploaded by

Wakuma Lachisa
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 35

CSCI-GA.

2130-001
Compiler Construction
Lecture 4:
Lexical Analysis I

Mohamed Zahran (aka Z)


mzahran@cs.nyu.edu
Role of the Lexical Analyzer
Remove comments and white spaces (aka
scanning)
Macros expansion
Read input characters from the source
program
Group them into lexemes
Produce as output a sequence of tokens
Interact with the symbol table
Correlate error messages generated by
the compiler with the source program
Scanner-Parser Interaction
Why Separating
Lexical and Syntactic?
Simplicity of design
Improved compiler efficiency
allows us to use specialized technique for
lexer, not suitable for parser
Higher portability
Input-device-specific peculiarities
restricted to lexer
Some Definitions
Token: a pair consisting of
Token name: abstract symbol representing
lexical unit [affects parsing decision]
Optional attribute value [influences
translations after parsing]
Pattern: a description of the form that
different lexemes take
Lexeme: sequence of characters in
source program matching a pattern
Pattern

Token classes
One token per keyword
Tokens for the operators
One token representing all identifiers
Tokens representing constants (e.g. numbers)
Tokens for punctuation symbols
Example
Dealing With Errors
Lexical analyzer unable to proceed: no
pattern matches
Panic mode recovery: delete successive
characters from remaining input until
token found
Insert missing character
Delete a character
Replace character by another
Transpose two adjacent characters
Example

What tokens will be generated from the above C++ program?


Buffering Issue
Lexical analyzer may need to look at
least a character ahead to make a token
decision.
Buffering: to reduce overhead required
to process a single character
Tokens Specification
We need a formal way to specify
patterns: regular expressions
Alphabet: any finite set of symbols
String over alphabet: finite sequence of
symbols drawn from that alphabet
Language: countable set of strings over
some fixed alphabet
Operations

Zero or one instance: r? is equivalent to r|

Character class: a|b|c||z can be replaced by [a-z]


a|c|d|h can be replaced by [acdh]
Operations
Trailing context: exp1/exp2

Match exp1 only if followed by exp2


exp2 is NOT consumed and remained
to be returned in subsequent tokens
Only one / is permitted per pattern
Example: a/b matches a in string ab but will not
match anything in a or ac
Examples
Which language is generated by:
(a|b)(a|b)
a*
(a|b)*
a|a*b
Example
Presenting number that can be integer
with option floating point and
exponential parts?
Example
Presenting number that can be integer
with option floating point and
exponential parts?
Lets analyze some possible solutions
Examples
Write regular definition of all strings of
lowercase letters in which the letters
are in ascending order
Tokens Recognition
Implementation:
Transition Diagrams
Intermediate step in constructing
lexical analyzer
Convert patterns into flowcharts called
transition diagrams
nodes or circles: called states
Edges: directed from state to another,
labeled by symbols
Initial state Accepting or final state

Actions associated with


final state
Means retract the forward
pointer
Examine symbol table for the lexeme found
Returns whatever token name is there

Places ID in symbol table if not there.


Returns a pointer to symbol table entry
Reserved Words and Identifiers
Install reserved words in symbol table
initially
OR
Create transition diagram for each
keyword, then prioritize the tokens so
that keywords have higher preference
Implementation of Transition
Diagram
Using All Transition Diagrams:
The Big Picture
Arrange for the transition diagrams for
each token to be tried sequentially
Run transition diagrams in parallel
Combine all transition diagrams into one
The First Part of the Project
The First Part of the Project

declarations
%%
translation rules
%%
auxiliary functions
declarations

translation rules

auxiliary functions
Anything between these
2 marks is copied as it is
in lex.yy.c
braces means the pattern
is defined somewhere

Actions
pattern
Lecture of Today
Sections 3.1 to 3.5
First part of the project assigned

You might also like