Understanding Lexical Analysis in Compilers

Lexical analysis is the first phase of a compiler: it converts the high-level input program into a sequence of tokens, while also handling comments and correlating error messages with the source. It uses deterministic finite automata to recognize tokens defined by regular expressions, and proceeds through stages such as input preprocessing, tokenization, classification, validation, and output generation. Tokens include identifiers, keywords, operators, and punctuation, each associated with specific attributes and patterns.


LEXICAL ANALYSIS

• Lexical Analysis is the first phase of the compiler; it is also known as the scanner.

• It converts the high-level input program into a sequence of tokens.

• Main task: to read input characters and group them into “tokens.”

• Secondary tasks:
  • Skip comments and whitespace;
  • Correlate error messages with the source program (e.g., the line number of an error).

• Lexical analysis can be implemented with a deterministic finite automaton (DFA).

• The output is a sequence of tokens, which is sent to the parser for syntax analysis.

What is a Token?

• A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.

Examples of tokens:

• Type tokens (id, number, real, . . . )

• Punctuation tokens (IF, void, return, . . . )

• Alphabetic tokens (keywords)

• Keywords; examples: for, while, if, etc.

• Identifiers; examples: variable names, function names, etc.

• Operators; examples: '+', '++', '-', etc.

• Separators; examples: ',', ';', etc.
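One way to picture these categories in code: the sketch below groups them as a C enum. The names are illustrative assumptions for this example, not taken from any particular compiler.

/* Illustrative grouping of the token categories above; all names are assumptions. */
typedef enum {
    TOK_KEYWORD,     /* for, while, if, ...              */
    TOK_IDENTIFIER,  /* variable names, function names   */
    TOK_OPERATOR,    /* '+', '++', '-', ...              */
    TOK_SEPARATOR    /* ',', ';', ...                    */
} TokenCategory;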


Lexical Analysis: Terminology
• token: a name for a set of input strings with related structure.
Example: “identifier,” “integer constant”
• pattern: a rule describing the set of strings associated with a token.
Example: “a letter followed by zero or more letters, digits, or underscores.”
• lexeme: the actual input string that matches a pattern.
Example: count

Examples
Input: count = 123
Tokens:
identifier : Rule: “a letter followed by letters or digits (max. 8 characters)”
  Lexeme: count
assg_op : Rule: “=”
  Lexeme: =
integer_const : Rule: “a digit followed by zero or more digits”
  Lexeme: 123
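As a small illustration, here is a C sketch that checks the identifier rule above. The function name is_identifier and the 8-character cap are assumptions taken from this example, not a general rule.

#include <ctype.h>

/* Checks the example rule: a letter followed by letters or digits,
   at most 8 characters in total. Returns 1 if s matches. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))
        return 0;                    /* must start with a letter */
    int len = 1;
    for (const char *p = s + 1; *p; p++, len++) {
        if (!isalnum((unsigned char)*p))
            return 0;                /* only letters or digits allowed */
    }
    return len <= 8;                 /* example rule caps length at 8 */
}

With this, is_identifier("count") returns 1, while is_identifier("123abc") and is_identifier("verylongname") return 0.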

Attributes for Tokens
• If more than one lexeme can match the pattern for a token, the scanner
must indicate the actual lexeme that matched.
• This information is given using an attribute associated with the token.
Example: The program statement
count = 123
yields the following token-attribute pairs:
identifier, pointer to the string “count”
assg_op, (no attribute needed)
integer_const, the integer value 123
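One plausible way to represent such token-attribute pairs is a tagged struct. This C sketch mirrors the three pairs above; TokenType, Token, and the field names are illustrative assumptions.

/* Illustrative token-attribute representation; names are assumptions. */
typedef enum { TK_IDENTIFIER, TK_ASSG_OP, TK_INTEGER_CONST } TokenType;

typedef struct {
    TokenType type;
    union {
        const char *name;  /* TK_IDENTIFIER: pointer to the lexeme, e.g. "count" */
        int value;         /* TK_INTEGER_CONST: the integer value, e.g. 123      */
    } attr;                /* TK_ASSG_OP needs no attribute                      */
} Token;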

• The tokens of a language are specified using regular expressions.

• Tokens are recognized by finite automata.
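To make the connection concrete, here is a minimal table-driven DFA sketch in C for the integer_const pattern above (a digit followed by zero or more digits). The table layout and the function name are assumptions for illustration.

#include <ctype.h>

/* DFA for integer_const = digit digit*.
   States: 0 = start, 1 = accepting (digits seen), 2 = dead. */
int match_integer_const(const char *s) {
    static const int next[3][2] = {
        /*            digit  other */
        /* start  */ { 1,     2 },
        /* digits */ { 1,     2 },
        /* dead   */ { 2,     2 },
    };
    int state = 0;
    for (; *s; s++)
        state = next[state][isdigit((unsigned char)*s) ? 0 : 1];
    return state == 1;   /* accept iff we end in the "digits seen" state */
}

A scanner generator such as lex/flex builds this kind of transition table automatically from the token's regular expression.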


Structure of a Scanner Automaton

(The original slide shows a diagram of the scanner automaton here; the figure is not reproduced.)
Examples of Non-Tokens:

• Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.
How Does the Lexical Analyzer Work?

1. Input preprocessing: This stage involves cleaning up the input text and preparing it for lexical analysis. This may include removing comments, whitespace, and other non-essential characters from the input text.

2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is usually done by matching the characters in the input text against a set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token. For example, in a programming language, the lexer might classify keywords, identifiers, operators, and punctuation symbols as separate token types.

4. Token validation: In this stage, the lexer checks that each token is valid according to the rules of the programming language. For example, it might check that a variable name is a valid identifier, or that an operator has the correct syntax.

5. Output generation: In this final stage, the lexer generates the output of the lexical analysis process, which is typically a list of tokens. This list can then be passed to the next stage of compilation or interpretation.
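Putting these stages together, here is a minimal C sketch of a tokenizer for a toy language. The Token type, the token kinds, and next_token are assumptions for illustration, not a definitive implementation (token validation is omitted for brevity).

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Toy tokenizer sketch; all names here are illustrative. */
typedef enum { TK_ID, TK_NUM, TK_OP, TK_SEP, TK_EOF } Kind;

typedef struct {
    Kind kind;
    char lexeme[32];
} Token;

static const char *src;   /* cursor into the source text */

static Token next_token(void) {
    Token t = { TK_EOF, "" };
    while (isspace((unsigned char)*src)) src++;      /* stage 1: skip non-tokens */
    if (*src == '\0') return t;

    int n = 0;
    if (isalpha((unsigned char)*src)) {              /* stages 2-3: match and classify */
        while (isalnum((unsigned char)*src) && n < 31) t.lexeme[n++] = *src++;
        t.kind = TK_ID;
    } else if (isdigit((unsigned char)*src)) {
        while (isdigit((unsigned char)*src) && n < 31) t.lexeme[n++] = *src++;
        t.kind = TK_NUM;
    } else if (strchr("+-*/=", *src)) {
        t.lexeme[n++] = *src++;
        t.kind = TK_OP;
    } else {                                         /* ',', ';', '(', ')', '{', '}', ... */
        t.lexeme[n++] = *src++;
        t.kind = TK_SEP;
    }
    t.lexeme[n] = '\0';
    return t;                                        /* stage 5: emit the token */
}

int main(void) {
    src = "count = 123;";
    for (Token t = next_token(); t.kind != TK_EOF; t = next_token())
        printf("kind %d, lexeme '%s'\n", t.kind, t.lexeme);
    return 0;
}

On the input count = 123; this prints one line per token (identifier, operator, number, separator), which is exactly the token stream handed to the parser.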
Example:

int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '='
'10' ';' 'return' '0' ';' '}'
Exercise 1: Count the number of tokens:

int main()
{
    int a = 10, b = 20;
    printf("sum is:%d", a + b);
    return 0;
}

Answer: 27 tokens in total. Counting them in order: 'int' 'main' '(' ')' '{' 'int' 'a' '=' '10' ',' 'b' '=' '20' ';' 'printf' '(' '"sum is:%d"' ',' 'a' '+' 'b' ')' ';' 'return' '0' ';' '}'. Note that the string literal "sum is:%d" counts as a single token, and whitespace is not counted.
