LEXICAL ANALYSIS
• Lexical analysis is the first phase of the compiler; the component that performs it is also known as a scanner.
• It converts the high-level input program into a sequence of tokens.
• Main task: to read input characters and group them into “tokens.”
• Secondary tasks:
• Skip comments and whitespace;
• Correlate error messages with the source program (e.g., the line number of an error).
• Lexical analysis can be implemented with deterministic finite automata.
• The output is a sequence of tokens that is sent to the parser for syntax analysis.
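As a minimal sketch of the finite-automaton idea, the identifier pattern “a letter followed by letters or digits” can be recognized by a two-state, table-driven DFA. The function and state names here are illustrative, not from any particular compiler:

```c
#include <ctype.h>

/* Character classes: 0 = letter, 1 = digit, 2 = anything else. */
static int char_class(char c) {
    if (isalpha((unsigned char)c)) return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Table-driven DFA for the pattern: letter (letter | digit)*
   State 0 is the start state, state 1 is accepting, -1 rejects. */
int matches_identifier(const char *s) {
    static const int trans[2][3] = {
        { 1, -1, -1 },   /* start: only a letter moves to state 1 */
        { 1,  1, -1 },   /* in identifier: letters and digits stay */
    };
    int state = 0;
    for (; *s != '\0'; s++) {
        state = trans[state][char_class(*s)];
        if (state < 0) return 0;   /* dead state: reject immediately */
    }
    return state == 1;             /* accept only in the accepting state */
}
```

Each row of the table is a DFA state and each column a character class, so adding token types amounts to adding rows and columns rather than rewriting control flow.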
What is a Token?
• A lexical token is a sequence of characters that can be treated
as a unit in the grammar of the programming languages.
Example of tokens:
• Type tokens (id, number, real, ...)
• Punctuation tokens (IF, void, return, ...)
• Alphabetic tokens (keywords)
• Keywords; examples: for, while, if, etc.
• Identifiers; examples: variable names, function names, etc.
• Operators; examples: '+', '++', '-', etc.
• Separators; examples: ',', ';', etc.
Lexical Analysis: Terminology
• token: a name for a set of input strings with related structure.
Example: “identifier,” “integer constant”
• pattern: a rule describing the set of strings associated with a token.
Example: “a letter followed by zero or more letters, digits, or underscores.”
• lexeme: the actual input string that matches a pattern.
Example: count
Examples
Input: count = 123
Tokens:
identifier : Rule: “letter followed by letters or digits (max 8 characters)”
Lexeme: count
assg_op : Rule: =
Lexeme: =
integer_const : Rule: “digit followed by digits”
Lexeme: 123
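The rules above can be sketched as a small scanning function. The token names follow the example (identifier, assg_op, integer_const); the fixed-size lexeme buffer and the treatment of any other character as an operator are simplifications for illustration:

```c
#include <ctype.h>
#include <string.h>

typedef enum { IDENTIFIER, ASSIGN_OP, INT_CONST, END } TokType;

/* Scans one token starting at *src, copies its lexeme into `lexeme`,
   and advances *src past it. Returns the token's type. */
TokType next_token(const char **src, char *lexeme) {
    const char *p = *src;
    while (isspace((unsigned char)*p)) p++;          /* skip whitespace */
    if (*p == '\0') { *src = p; lexeme[0] = '\0'; return END; }
    const char *start = p;
    TokType t;
    if (isalpha((unsigned char)*p)) {                /* letter (letter|digit)* */
        while (isalnum((unsigned char)*p)) p++;
        t = IDENTIFIER;
    } else if (isdigit((unsigned char)*p)) {         /* digit digit* */
        while (isdigit((unsigned char)*p)) p++;
        t = INT_CONST;
    } else {                                         /* single-char operator, e.g. '=' */
        p++;
        t = ASSIGN_OP;
    }
    memcpy(lexeme, start, (size_t)(p - start));
    lexeme[p - start] = '\0';
    *src = p;
    return t;
}
```

Calling it repeatedly on the input "count = 123" yields the three token/lexeme pairs listed above, then END.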
Attributes for Tokens
• If more than one lexeme can match the pattern for a token, the scanner
must indicate the actual lexeme that matched.
• This information is given using an attribute associated with the token.
Example: The program statement
count = 123
yields the following token-attribute pairs:
identifier, pointer to the string “count”
assg_op, (no attribute needed)
integer_const, the integer value 123
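One common way to represent such token-attribute pairs is a struct whose attribute field is a union, since different token types carry different kinds of attributes. The type names and union layout here are illustrative choices, not a fixed convention:

```c
#include <stdlib.h>

typedef enum { TOK_IDENTIFIER, TOK_ASSIGN, TOK_INT_CONST } TokenType;

/* A token plus its attribute: identifiers carry a pointer to the
   lexeme, integer constants carry the converted numeric value. */
typedef struct {
    TokenType type;
    union {
        const char *name;   /* identifier: pointer to the lexeme string */
        int value;          /* integer constant: the converted value */
    } attr;
} Token;

/* Builds the pair <integer_const, value> from a lexeme such as "123". */
Token make_int_const(const char *lexeme) {
    Token t;
    t.type = TOK_INT_CONST;
    t.attr.value = atoi(lexeme);   /* convert the lexeme to its value */
    return t;
}
```

The union keeps every token the same size while still letting each token type carry the attribute it needs.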
• The tokens of a language are specified using regular expressions.
• Tokens are recognized by finite automata.
Structure of a Scanner Automaton
Example of Non-Tokens:
• Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.
How Does a Lexical Analyzer Work?
1. Input preprocessing: This stage cleans up the input text and prepares it for lexical analysis. It may include removing comments, whitespace, and other non-essential characters.
2. Tokenization: This is the process of breaking the input text into a sequence of tokens, usually by matching the characters of the input against a set of patterns or regular expressions that define the different token types.
3. Token classification: In this stage, the lexer determines the type of each token. For example, in a programming language the lexer might classify keywords, identifiers, operators, and punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid according to the rules of the programming language. For example, it might check that a variable name is a valid identifier or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the lexical analysis process, typically a list of tokens. This list can then be passed to the next stage of compilation or interpretation.
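The classification step (stage 3) can be sketched as a lookup: an alphabetic lexeme is a keyword if it appears in the language's keyword table, otherwise it is an ordinary identifier. The table below is a small illustrative subset of C's keywords, not a complete list:

```c
#include <string.h>

/* A few C keywords, for illustration only. */
static const char *keywords[] = { "if", "else", "for", "while", "int", "return" };

/* Returns 1 if the lexeme is a keyword, 0 if it is an identifier. */
int is_keyword(const char *lexeme) {
    size_t n = sizeof keywords / sizeof keywords[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}
```

Real scanners often use a hash table or a perfect-hash function for this lookup instead of a linear scan, but the classification logic is the same.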
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '='
'10' ';' 'return' '0' ';' '}'
Exercise 1: Count the number of tokens:
int main()
{
    int a = 10, b = 20;
    printf("sum is:%d", a + b);
    return 0;
}
Answer: Total number of tokens: 27.