Compiler Design - Complete Study Notes 📚
1. Phases of Compiler (WITH MEMORY TRICK! 🧠)
Memory Trick: "Lexy Sarah Sees Incredible Code Optimization"
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Object Code Generation
Detailed Explanation:
Phase 1: Lexical Analysis (Scanner)
What it does: Converts source code into tokens
Input: Character stream
Output: Token stream
Example:
int x = 10;
Tokens: <INT>, <IDENTIFIER,"x">, <ASSIGN>, <NUMBER,"10">, <SEMICOLON>
Phase 2: Syntax Analysis (Parser)
What it does: Checks grammatical structure
Input: Token stream
Output: Parse tree/Abstract Syntax Tree
Example: Verifies int x = 10; follows grammar rules
Phase 3: Semantic Analysis
What it does: Type checking, scope resolution
Input: Parse tree
Output: Annotated parse tree
Example: Ensures you can't do int x = "hello";
Phase 4: Intermediate Code Generation
What it does: Creates platform-independent code
Input: Annotated parse tree
Output: Three-address code
Example: t1 = 10; x = t1;
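For a slightly larger statement, each instruction still has at most one operator, with compiler-generated temporaries (t1, t2 here are illustrative names):
a = b + c * d;   becomes:
t1 = c * d
t2 = b + t1
a  = t2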
Phase 5: Code Optimization
What it does: Improves code efficiency
Input: Intermediate code
Output: Optimized intermediate code
Example: Removes dead code, constant folding
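A small before/after sketch of what this phase could do (assuming y is never used later in the program):
Before optimization:
t1 = 4 * 2        (constant folding: 4 * 2 becomes 8)
x  = t1           (constant propagation: x can be set to 8 directly)
t2 = x + 0        (algebraic simplification: x + 0 is just x)
y  = t2           (y is never used afterwards: dead code)
After optimization:
x = 8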
Phase 6: Code Generation
What it does: Generates target machine code
Input: Optimized intermediate code
Output: Assembly/Machine code
2. Compiler vs Interpreter
Aspect | Compiler | Interpreter
Translation | Entire program at once | Line by line
Execution | After compilation finishes | During translation
Speed | Faster execution | Slower execution
Memory | More memory needed (object code is stored) | Less memory
Error Detection | Reports errors together after analyzing the whole program | Stops at the first error
Examples | C, C++ | Python, JavaScript
Note: Java is a hybrid: javac compiles source to bytecode, which the JVM then interprets or JIT-compiles.
Memory Trick: Compiler = Complete translation first, Interpreter = Immediate execution
3. Tokens, Patterns, and Lexemes
Easy Definition:
Lexeme: The actual string in source code
Token: Category/type of the lexeme
Pattern: Rule that describes the token
Example:
int count = 42;
Lexeme | Token | Pattern
int | KEYWORD | the reserved word int
count | IDENTIFIER | [a-zA-Z][a-zA-Z0-9]*
= | ASSIGN | =
42 | NUMBER | [0-9]+
; | SEMICOLON | ;
Memory Trick: Lexeme = Literal string, Token = Type, Pattern = the rule that describes it (usually a regular expression)
4. LEX - Lexical Analyzer Generator
LEX Structure:
%{
/* C declarations */
%}
/* LEX definitions */
%%
/* LEX rules */
%%
/* User functions */
Example LEX Program:
%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("NUMBER: %s\n", yytext); }
[a-zA-Z]+ { printf("IDENTIFIER: %s\n", yytext); }
"+" { printf("PLUS\n"); }
"=" { printf("ASSIGN\n"); }
[ \t\n] { /* ignore whitespace */ }
%%
int main() {
yylex();
return 0;
}
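On a typical Unix system, the program above can be built and run roughly like this (the file name scanner.l is assumed; with flex the support library is -lfl, with classic lex it is -ll, and that library also supplies the yywrap function this program relies on):
lex scanner.l          # or: flex scanner.l  (either way, lex.yy.c is generated)
cc lex.yy.c -lfl       # produces a.out
./a.out                # type some input, end with Ctrl-D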
5. Input Buffering Techniques
Why Buffer?
Reading character by character is expensive
Need lookahead for token recognition
Techniques:
1. Single Buffer
Problem: What if token spans buffer boundary?
2. Double Buffer (Better!)
Buffer 1: |----token----|
Buffer 2: |--continues--|
3. Sentinels
Use a special sentinel (EOF) character so only one test per character is needed, instead of a separate end-of-buffer check
Trick: Place EOF at end of each buffer half
Memory Trick: "Double Sentinel Buffers Work Better"
6. Context-Free Grammar (CFG)
Definition:
A CFG is a 4-tuple: G = (V, T, P, S)
V: Variables (Non-terminals)
T: Terminals
P: Productions
S: Start symbol
Example:
E → E + T | T
T → T * F | F
F → (E) | id
This generates expressions like: id + id * id
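For example, a leftmost derivation of id + id * id (each step expands the leftmost non-terminal):
E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + T * F ⇒ id + F * F ⇒ id + id * F ⇒ id + id * id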
Memory Trick: "Very Talented Programmers Start here"
7. Top-Down vs Bottom-Up Parsing
Top-Down Parsing
Strategy: Start from start symbol, derive input
Types: Recursive Descent, LL(1) (a minimal recursive-descent sketch in C appears at the end of this section)
Think: Root to leaves (like reading a book)
Bottom-Up Parsing
Strategy: Start from input, reduce to start symbol
Types: LR(0), SLR(1), LALR(1), LR(1)
Think: Leaves to root (like building a pyramid)
Memory Trick:
Top-Down = Tree from Top
Bottom-Up = Build from Base
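To make the top-down strategy concrete, here is a minimal recursive-descent parser in C for the expression grammar of Section 6, with left recursion removed (E → T E', E' → + T E' | ε, and similarly for T); the loops stand in for E' and T', tokens are single characters, and 'i' stands for the id token (all names are illustrative):
#include <stdio.h>
#include <stdlib.h>

static const char *input;              /* current position in the token string */

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char c) { if (*input == c) input++; else error(); }

static void E(void);
static void T(void);
static void F(void);

/* E -> T E'      E' -> + T E' | epsilon   (left recursion removed) */
static void E(void) { T(); while (*input == '+') { match('+'); T(); } }

/* T -> F T'      T' -> * F T' | epsilon */
static void T(void) { F(); while (*input == '*') { match('*'); F(); } }

/* F -> ( E ) | id        here 'i' plays the role of the id token */
static void F(void) {
    if (*input == '(') { match('('); E(); match(')'); }
    else if (*input == 'i') { match('i'); }
    else error();
}

int main(void) {
    input = "i+i*i";                   /* stands for: id + id * id */
    E();
    if (*input == '\0') printf("accepted\n"); else error();
    return 0;
}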
8. Syntax-Directed Translation
S-Attributed Definitions
Rule: Use only Synthesized attributes
Direction: Information flows UP the parse tree
Example: Expression evaluation
E → E1 + T { E.val = E1.val + T.val }
T → F { T.val = F.val }
F → num { F.val = num.lexval }
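For instance, on the input 3 + 4 the values are synthesized from the leaves upward:
F.val = 3  →  T.val = 3  →  E1.val = 3
F.val = 4  →  T.val = 4
E.val = E1.val + T.val = 3 + 4 = 7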
L-Attributed Definitions
Rule: Use synthesized + Limited inherited attributes
Direction: Information flows UP and Left-to-right
Restriction: An inherited attribute may depend only on the parent's attributes and on siblings to its left
Memory Trick:
S = Synthesized = Simple (only up)
L = Limited = Left-to-right allowed
9. Types of Grammar (COMPLETE KNOWLEDGE)
Chomsky Hierarchy (Remember: "Uncles Can Count Regularly")
Type 0: Unrestricted Grammar
Production: α → β (α is any string containing at least one non-terminal; β is any string, possibly shorter or empty)
Automaton: Turing Machine
Example: aA → a (the right side may even be shorter than the left, unlike Type 1)
Type 1: Context-Sensitive Grammar
Production: αAβ → αγβ (|γ| ≥ 1)
Automaton: Linear Bounded Automaton
Rule: Left side ≤ Right side length
Type 2: Context-Free Grammar
Production: A → α (single non-terminal on left)
Automaton: Pushdown Automaton
Most important for compilers!
Type 3: Regular Grammar
Production: A → aB or A → a
Automaton: Finite Automaton
Used in lexical analysis
Grammar Properties:
Ambiguous Grammar
Multiple parse trees for same string
Example: E → E + E | E * E | id
String "id + id * id" has 2 parse trees
Left Recursion
Direct: A → Aα
Indirect: A → Bα, B → Aβ
Problem: Infinite loop in top-down parsing
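The standard fix rewrites A → Aα | β as A → β A', A' → α A' | ε. Applied to the expression grammar from Section 6:
E → E + T | T   becomes   E → T E',  E' → + T E' | ε
T → T * F | F   becomes   T → F T',  T' → * F T' | ε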
Left Factoring
Problem: A → αβ | αγ
Solution: A → αA', A' → β | γ
10. Type Checking
Static Type Checking
When: Compile time
Advantage: Catches errors early, faster execution
Languages: C, C++, Java
Example: int x = "hello"; ← Error caught at compile time
Dynamic Type Checking
When: Runtime
Advantage: More flexible
Languages: Python, JavaScript
Example: Variable can change type during execution
Memory Trick:
Static = Strictly checked at Start
Dynamic = Determined During execution
11. Polymorphic Functions
Definition:
Functions that work with multiple types
Types:
1. Parametric Polymorphism (Generics)
static <T> void swap(T[] arr, int i, int j) {
    T tmp = arr[i];      // works with any reference type T
    arr[i] = arr[j];
    arr[j] = tmp;
}
2. Ad-hoc Polymorphism (Overloading)
int add(int a, int b) { return a + b; }
double add(double a, double b) { return a + b; }
3. Subtype Polymorphism (Inheritance)
Animal a = new Dog(); // Dog inherits from Animal
Memory Trick: "Polymorphic = People Adding Similar functions"
12. Storage Allocation Strategies
1. Static Allocation
When: Compile time
Where: Global variables, static variables
Lifetime: Entire program execution
2. Stack Allocation
When: Function calls
Where: Local variables, parameters
Lifetime: Function scope
Structure: LIFO (Last In, First Out)
3. Heap Allocation
When: Runtime (dynamic)
Where: malloc(), new operator
Lifetime: Until explicitly freed
Management: Garbage collection or manual
Memory Trick: "Static Stays, Stack Shrinks, Heap Hangs around"
13. Goals of Code Generation
Primary Goals:
1. Correctness: Generated code must be semantically equivalent
2. Efficiency: Optimize for speed and space
3. Target Independence: Easy to retarget
Specific Goals:
Register Allocation: Minimize memory access
Instruction Selection: Choose best instructions
Instruction Scheduling: Avoid pipeline stalls
Memory Trick: "Correct Efficient Targeted code"
14. DAG (Directed Acyclic Graph) Representation
Purpose:
Represent basic blocks efficiently
Identify common subexpressions
Enable optimizations
Example:
a = b + c
d = b + c
e = d + a
DAG shows b + c computed once, used twice!
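One way to list the DAG nodes for this block (node names n1, n2 are illustrative):
n1: + (b, c)      attached identifiers: a, d   (the expression is built only once)
n2: + (n1, n1)    attached identifier:  e      (e = d + a, and both d and a refer to n1)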
Benefits:
Dead Code Elimination: Unused computations
Common Subexpression: Avoid recomputation
Constant Folding: Compute constants at compile time
15. Back Patching in Code Generation
Problem:
Jump addresses unknown during code generation
Forward references need to be "patched"
Solution - Back Patching:
1. Generate jump with placeholder address
2. Keep list of locations to patch
3. When target known, patch all locations
Example:
if (condition) goto L1
stmt1
goto L2
L1: stmt2
L2: next_stmt
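With assumed instruction numbers, the emitted code might look like this, where "_" marks an unfilled target kept on a patch list:
Before patching:
100: if condition goto _
101: stmt1
102: goto _
103: stmt2
104: next_stmt
After patching (targets of 100 and 102 filled in once L1 and L2 are known):
100: if condition goto 103
101: stmt1
102: goto 104
103: stmt2
104: next_stmt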
Memory Trick: "BackPatch = Blank first, Patch later"
16. Code Optimization
Types:
Machine Independent Optimizations:
Constant Folding: 3 + 4 → 7
Constant Propagation: Replace variables with constants
Dead Code Elimination: Remove unreachable code
Common Subexpression Elimination
Machine Dependent Optimizations:
Register Allocation
Instruction Scheduling
Peephole Optimization
Levels:
Local: Within basic block
Global: Within function
Interprocedural: Across functions
17. Peephole Optimization
Definition:
Optimize small "window" of instructions (usually 3-5)
Types:
1. Redundant Instruction Elimination
Before: MOV R1, R2
MOV R2, R1
After: MOV R1, R2
2. Constant Folding
Before: MOV R1, #3
ADD R1, #4
After: MOV R1, #7
3. Strength Reduction
Before: MUL R1, #2
After: SHL R1, #1 (shift left = multiply by 2)
4. Algebraic Simplification
Before: ADD R1, #0
After: (remove instruction)
Memory Trick: "Peep through small hole, optimize locally, efficiently"
18. Data Flow Analysis
Purpose:
Determine how data flows through program
Enable optimizations
Safety analysis
Key Concepts:
Reaching Definitions
Which definitions reach which uses?
Use: Dead code elimination
Live Variable Analysis
Which variables are "live" (will be used later)?
Use: Register allocation
Available Expressions
Which expressions are available (already computed)?
Use: Common subexpression elimination
Direction:
Forward: Information flows with program execution
Backward: Information flows against program execution
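A tiny live-variable example (assuming nothing outside these three lines uses a, b, or c):
1: a = 1          a is live after this line (it is used at line 2)
2: b = a + 2      a is dead after this line (no later use); b is live
3: c = b * b      last use of b, so b is dead afterwards; if c is never used later, line 3 itself is dead code
Liveness flows backward from each use toward earlier lines, which is why live-variable analysis is a backward analysis.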
19. Cross Compiler
Definition:
Compiler that runs on one machine but generates code for another
Example:
Host: x86 PC
Target: ARM processor (mobile phones)
Use Case: Embedded systems development
Why Needed?
Target machine too small for compiler
Development convenience
Different architectures
Memory Trick: "Cross Compiler Compiles Crosswise"
20. Properties of Optimizing Compilers
Essential Properties:
1. Correctness Preservation
Optimized code must behave identically to original
Most Important Property!
2. Efficiency Improvement
Time: Faster execution
Space: Smaller code size
Energy: Lower power consumption
3. Compile-Time Efficiency
Optimization shouldn't take too long
Trade-off between compile time and runtime benefit
4. Debugging Support
Maintain correlation between source and optimized code
Support for debugging optimized code
5. Predictability
Similar code patterns should be optimized similarly
Developers can reason about performance
Memory Trick: "Correct Efficient Compilation Debugs Predictably"
Quick Review Mnemonics 🎯
1. Compiler Phases: "Lexy Sarah Sees Incredible Code Optimization"
2. Grammar Types: "Uncles Can Count Regularly"
3. Storage Types: "Static Stays, Stack Shrinks, Heap Hangs around"
4. Parsing Types: "Top-Down from Top, Bottom-Up from Base"
5. Attributes: "S = Simple UP, L = Limited left-right"
Exam Tips 💡
1. Draw diagrams for phases, parse trees, DAGs
2. Give examples for every concept
3. Compare and contrast (compiler vs interpreter, static vs dynamic)
4. Remember the "why" behind each technique
5. Practice code examples for LEX, grammar, optimizations
Good Luck! You've got this! 🚀