Document 4: Compiler Design – Lexical Analysis
Notes
1. Introduction to Lexical Analysis
First phase of the compiler.
Converts source code into tokens.
Removes whitespace and comments.
Passes the token stream to syntax analysis.
2. Role of Lexical Analyzer
Reads input characters.
Groups them into lexemes.
Produces tokens (token name + attribute value).
Reports lexical errors.
3. Tokens, Lexemes, and Patterns
Token: Category of lexemes (e.g., keyword, identifier).
Lexeme: Actual string in source (e.g., int, main).
Pattern: Rule that defines a token (e.g., [a-zA-Z][a-zA-Z0-9]* for identifiers).
4. Example
Source Code:
int sum = a + b;
Tokens:
1. Keyword: int
2. Identifier: sum
3. Operator: =
4. Identifier: a
5. Operator: +
6. Identifier: b
7. Delimiter: ;
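To make the grouping of characters into lexemes concrete, here is a minimal hand-written scanner in C for the statement above (an illustrative sketch, not production code; the printed category names simply mirror the token list).

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Print one token per lexeme in the input string. */
static void scan(const char *p) {
    while (*p) {
        if (isspace((unsigned char)*p)) {        /* skip whitespace */
            p++;
        } else if (isalpha((unsigned char)*p)) { /* identifier or keyword */
            char buf[64];
            int n = 0;
            while (isalnum((unsigned char)*p) && n < 63)
                buf[n++] = *p++;
            buf[n] = '\0';
            if (strcmp(buf, "int") == 0)
                printf("Keyword: %s\n", buf);
            else
                printf("Identifier: %s\n", buf);
        } else if (*p == '=' || *p == '+') {     /* operators */
            printf("Operator: %c\n", *p++);
        } else if (*p == ';') {                  /* delimiter */
            printf("Delimiter: %c\n", *p++);
        } else {                                 /* anything else is a lexical error */
            fprintf(stderr, "Lexical error: unexpected '%c'\n", *p++);
        }
    }
}

int main(void) {
    scan("int sum = a + b;");  /* prints the seven tokens listed above */
    return 0;
}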
5. Finite Automata in Lexical Analysis
Regular expressions are used to describe token patterns.
An NFA (nondeterministic finite automaton) is converted to a DFA (deterministic finite automaton) for efficient scanning.
The resulting DFA is used to implement the lexical analyzer.
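As a small illustration, the identifier pattern [a-zA-Z][a-zA-Z0-9]* compiles into a two-state DFA. The C sketch below hard-codes its transition function (the state numbering is an assumption made for this example):

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = in identifier (accepting), -1 = reject. */
static int step(int state, char c) {
    switch (state) {
    case 0:  return isalpha((unsigned char)c) ? 1 : -1; /* first char: letter */
    case 1:  return isalnum((unsigned char)c) ? 1 : -1; /* rest: letter/digit */
    default: return -1;
    }
}

/* Returns 1 if the whole string matches [a-zA-Z][a-zA-Z0-9]*. */
static int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        state = step(state, *s);
        if (state < 0) return 0;   /* dead state: no valid transition */
    }
    return state == 1;             /* accept only in state 1 */
}

int main(void) {
    printf("%d\n", is_identifier("sum"));    /* 1 */
    printf("%d\n", is_identifier("2fast"));  /* 0: starts with a digit */
    return 0;
}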
6. Regular Expressions Examples
Identifier: [a-zA-Z][a-zA-Z0-9]*
Integer Constant: [0-9]+
Whitespace: [ \t\n]+
7. Lexical Errors
Unrecognized symbols.
Unterminated strings/comments.
Illegal identifiers (e.g., 9abc, which starts with a digit).
Error Handling:
Panic-mode recovery: skip characters until a valid token can be formed (see the sketch below).
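A sketch of what panic-mode recovery can look like in C (the set of characters that may start a token is an assumption made for this toy language):

#include <ctype.h>
#include <stdio.h>

/* Returns 1 if c can begin a valid token in our toy language. */
static int can_start_token(char c) {
    return isalnum((unsigned char)c) || c == '=' || c == '+' || c == ';';
}

/* Panic-mode recovery: report the bad character, then skip input
 * until a character that can begin a valid token (or whitespace)
 * is found. */
static const char *recover(const char *p) {
    fprintf(stderr, "Lexical error: unexpected '%c'\n", *p);
    p++;
    while (*p && !can_start_token(*p) && !isspace((unsigned char)*p))
        p++;
    return p;   /* scanning resumes here */
}

int main(void) {
    const char *resume = recover("@#sum");
    printf("resumed at: %s\n", resume);   /* prints "resumed at: sum" */
    return 0;
}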
8. Symbol Table
Stores identifiers with attributes (type, scope, memory location).
Used by later phases of the compiler.
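A minimal symbol-table sketch in C (a fixed-size linear list; real compilers usually prefer hash tables, and the attribute fields shown are illustrative assumptions):

#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 128

struct symbol {
    char name[32];   /* identifier lexeme */
    char type[16];   /* e.g., "int" */
    int  scope;      /* nesting level */
};

static struct symbol table[MAX_SYMBOLS];
static int count = 0;

/* Returns the index of name, or -1 if absent. */
static int lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* Inserts name if not already present; returns its index. */
static int insert(const char *name, const char *type, int scope) {
    int i = lookup(name);
    if (i >= 0) return i;                    /* already recorded */
    if (count >= MAX_SYMBOLS) return -1;     /* table full */
    snprintf(table[count].name, sizeof table[count].name, "%s", name);
    snprintf(table[count].type, sizeof table[count].type, "%s", type);
    table[count].scope = scope;
    return count++;
}

int main(void) {
    insert("sum", "int", 0);
    printf("sum at index %d\n", lookup("sum"));   /* sum at index 0 */
    return 0;
}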
9. Lex Tool
Lex (or Flex) is a tool to automatically generate lexical analyzers.
Input: a set of token definitions (regular expressions with associated actions).
Output: a C program that scans tokens.
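For instance, a minimal Flex specification covering the tokens used in this document might look like the following (an illustrative sketch; the rule set and messages are assumptions). Compiling with flex scanner.l followed by cc lex.yy.c -o scanner yields a working scanner.

%option noyywrap
%{
#include <stdio.h>
%}

%%
"int"                 { printf("Keyword: %s\n", yytext); }
[a-zA-Z][a-zA-Z0-9]*  { printf("Identifier: %s\n", yytext); }
[0-9]+                { printf("Integer: %s\n", yytext); }
[=+]                  { printf("Operator: %s\n", yytext); }
";"                   { printf("Delimiter: %s\n", yytext); }
[ \t\n]+              ;  /* skip whitespace */
.                     { fprintf(stderr, "Lexical error: '%s'\n", yytext); }
%%

int main(void) { yylex(); return 0; }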
10. Conclusion
Lexical analysis is the foundation phase of compiler design. It simplifies input by converting
raw code into meaningful tokens, which allows further phases like syntax and semantic
analysis to work effectively.