Lexical Analysis
Contents
Lexical Analysis
Role of the Lexical Analyzer
Tokens, Patterns & Lexemes
Attributes for Tokens
Lexical Analysis
Typical tasks performed by the lexical analyzer
Remove white space and comments
Encode constants as tokens (e.g., 45, "Hello", true/false)
int x = 42;
char* str = "Hello";
Recognize keywords (if, int, while, return, for, etc.)
Recognize identifiers (e.g., x in int x = 10; and myVar in float myVar = 20.5;) and
store identifier names in a global symbol table (a sketch combining these tasks follows)
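A minimal sketch of these tasks for a tiny C-like language, assuming the whole source is available as an in-memory string; the keyword list, buffer sizes, and the names scan and is_keyword are illustrative, not from any real compiler:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "if", "int", "while", "return", "for" };

static int is_keyword(const char *s) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

static void scan(const char *p) {
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }      /* remove whitespace */
        if (p[0] == '/' && p[1] == '/') {                        /* remove // comments */
            while (*p && *p != '\n') p++;
            continue;
        }
        if (isalpha((unsigned char)*p) || *p == '_') {           /* keyword or identifier */
            char lexeme[64]; int n = 0;
            while ((isalnum((unsigned char)*p) || *p == '_') && n < 63)
                lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("%s: %s\n", is_keyword(lexeme) ? "KEYWORD" : "ID", lexeme);
            continue;
        }
        if (isdigit((unsigned char)*p)) {                        /* integer constant */
            char lexeme[64]; int n = 0;
            while (isdigit((unsigned char)*p) && n < 63) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("NUMBER: %s\n", lexeme);
            continue;
        }
        printf("PUNCT/OP: %c\n", *p++);                          /* operators and punctuation */
    }
}

int main(void) {
    scan("int x = 42; // the answer\n");   /* KEYWORD int, ID x, =, NUMBER 42, ; */
    return 0;
}

Running it on the line above prints the keyword, identifier, operator, constant, and punctuation tokens while the comment and whitespace are discarded.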
Lexical Error
A character sequence that cannot be scanned into any valid
token is a lexical error.
A lexical-phase error can be:
A spelling error
Exceeding the maximum length of an identifier
The appearance of an illegal character
Note:
Lexical-phase errors are detected during the lexical analysis phase of
compilation, not during the execution of the program.
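For example, in a C-like source the character @ does not begin any valid token, so the scanner would report an illegal-character lexical error on an input line such as this illustrative one:

int total = 10 @ 5;   /* '@' matches no token pattern: lexical error */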
Role of Lexical Analyzer
Role of Lexical Analyzer..
Sometimes, lexical analyzers are divided into a cascade of two
processes:
Scanning consists of simple processes that do not require tokenization
of the input, such as deletion of comments and compaction of
consecutive whitespace characters into one.
Lexical analysis proper is the more complex portion, which produces the
sequence of tokens from the output of the scanner. (A sketch of the scanning
pre-pass follows.)
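A minimal sketch of such a scanning pre-pass, assuming the source is an in-memory string with C-style /* ... */ comments; the function name prescan and the buffer handling are illustrative:

#include <ctype.h>
#include <stdio.h>

/* Copy src into dst, deleting comments and compacting each run of
   whitespace (and comments) into a single blank. */
static void prescan(const char *src, char *dst) {
    while (*src) {
        if (isspace((unsigned char)*src) || (src[0] == '/' && src[1] == '*')) {
            while (*src) {
                if (isspace((unsigned char)*src)) {
                    src++;
                } else if (src[0] == '/' && src[1] == '*') {   /* delete comment */
                    src += 2;
                    while (*src && !(src[0] == '*' && src[1] == '/')) src++;
                    if (*src) src += 2;
                } else {
                    break;
                }
            }
            *dst++ = ' ';                                      /* one blank replaces the run */
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

int main(void) {
    char out[128];
    prescan("int   x /* counter */ = 0;", out);
    printf("%s\n", out);   /* prints: int x = 0; */
    return 0;
}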
Lexical Analysis vs Parsing
There are several reasons why the analysis portion is normally
separated into lexical analysis and parsing.
Parsing is the process of analyzing the grammatical structure of the
source code according to the rules of a formal grammar (usually a
context-free grammar).
Separating lexical and syntactic analysis often allows us to simplify at
least one of these tasks: the lexical analyzer focuses on identifying
tokens, while the parser focuses on grammatical structure. The separation
also improves compiler efficiency, since specialized techniques can be
applied to each task.
Compiler portability is enhanced because the lexical analyzer can
handle device-specific input peculiarities, isolating them from the main
parsing logic, making the compiler easier to adapt to different systems.
Tokens, Patterns & Lexemes
A token is a sequence of characters that can be treated as a unit in
the grammar of the programming language.
Examples: identifiers, keywords, operators, special symbols, constants
Non-tokens: comments, tabs, blanks, and newlines
Tokens, Patterns & Lexemes
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword.
Example:
For a keyword to be recognized as a valid token, the pattern is simply the sequence of
characters that make up the keyword.
For an identifier to be recognized as a valid token, the pattern is the predefined rule that it
must start with a letter, followed by letters or digits (see the sketch below).
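A minimal sketch of checking this identifier pattern, assuming ASCII input; the function name matches_identifier is illustrative:

#include <ctype.h>
#include <stdio.h>

/* Return 1 if s matches the identifier pattern described above:
   a letter followed by any number of letters or digits. */
int matches_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0])) return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i])) return 0;
    return 1;
}

int main(void) {
    printf("%d %d %d\n",
           matches_identifier("myVar"),   /* 1: letter then letters/digits */
           matches_identifier("x2"),      /* 1 */
           matches_identifier("2x"));     /* 0: starts with a digit */
    return 0;
}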
A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
Example:
main is a lexeme of the token type identifier
( ) { } are lexemes of the token type punctuation
Tokens, Patterns & Lexemes..
In many programming languages, the following classes cover most or all of the
tokens: one token per keyword; tokens for the operators; one token representing all
identifiers; one or more tokens for constants such as numbers and literal strings; and
one token for each punctuation symbol (parentheses, commas, semicolons, etc.).
Attributes for Tokens
When a lexeme is encountered in a source program, the lexical analyzer often needs
to keep track of other occurrences of the same lexeme, for example, whether this
lexeme has been seen before or not. Such information is carried as an attribute value
associated with the token; for identifiers it is typically a pointer to the symbol-table entry.
Attributes for Tokens..
Ex. The token names and associated attribute values for the
Fortran statement E = M * C ** 2 are as follows:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
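One way to represent such <token name, attribute value> pairs is sketched below; the enum, struct, and symbol-table indices are illustrative assumptions, not from the slides:

#include <stdio.h>

enum token_name { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER };

struct token {
    enum token_name name;
    union {
        int symtab_index;   /* for ID: index of the symbol-table entry */
        int int_value;      /* for NUMBER: the integer value */
    } attr;
};

int main(void) {
    /* Token stream for E = M * C ** 2, assuming E, M, C occupy
       symbol-table entries 0, 1, 2. */
    struct token stream[] = {
        { ID,        { .symtab_index = 0 } },   /* E                 */
        { ASSIGN_OP, { 0 } },                   /* =  (no attribute) */
        { ID,        { .symtab_index = 1 } },   /* M                 */
        { MULT_OP,   { 0 } },                   /* *  (no attribute) */
        { ID,        { .symtab_index = 2 } },   /* C                 */
        { EXP_OP,    { 0 } },                   /* ** (no attribute) */
        { NUMBER,    { .int_value = 2 } }       /* 2                 */
    };
    printf("%zu tokens\n", sizeof stream / sizeof stream[0]);
    return 0;
}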