CD Notes
Compiler
A compiler is a software tool that translates an entire program written in a high-level programming language
into machine language, generating an executable file.
Characteristics
Performs the translation once and produces a separate executable file.
Compilation involves multiple phases: Lexical Analysis, Syntax Analysis, Semantic Analysis, Intermediate Code
Generation, Optimization, and Code Generation.
Errors (if any) are reported after scanning the entire program.
Advantages
Fast execution of the final executable.
Good optimization results in efficient machine code.
Disadvantages
Initial compilation of large programs can be slow.
Errors become visible only after the whole program is compiled, which can make debugging harder for beginners.
Translator
A translator is a general term for any software that converts code from one programming language to
another. It may include:
Compilers
Interpreters
Assemblers
Types of Translators
Compiler: Translates a high-level language to machine code in one go
Interpreter: Translates and executes code line by line
Assembler: Converts assembly language to machine code
A translator ensures that code written in one language can be understood and executed by the system, and provides abstraction between human logic and hardware execution.
Interpreter
An interpreter is a translator that executes a high-level program line by line, converting it into machine code
at runtime.
Characteristics
Translates and executes simultaneously.
No separate executable file is created.
Stops execution when the first error is encountered.
Advantages
Easier to debug: errors are shown immediately.
Useful for educational purposes and scripting languages.
Disadvantages
Slower execution since it processes code line by line.
Code must be re-interpreted every time it runs.
Comparative Summary
Feature | Compiler | Interpreter
Translation | Entire program at once | Line by line during execution
Execution time | Faster (after compilation) | Slower
Error handling | Reported after the full scan | Reported as soon as an error is encountered
Output | Standalone executable | No separate output file
Examples | C, C++, Java (to JVM bytecode) | Python, MATLAB, JavaScript
Phases of a Compiler
The compilation process is divided into multiple phases, each responsible for a specific part of the translation process. These phases transform source code into target code (machine code), systematically analyzing and optimizing the program. (Lexical analysis and syntax analysis, the first two phases, are reviewed in detail in later sections; the remaining phases are summarized below.)
3. Semantic Analysis
Input: Syntax tree.
Output: Annotated syntax tree or intermediate representation.
Functions:
Checks semantic correctness (e.g., type checking, undeclared variables).
Uses symbol table and type expressions.
Detects semantic errors not caught by the parser.
4. Intermediate Code Generation
Input: Annotated syntax tree.
Output: Intermediate code (e.g., Three-Address Code, Polish Notation).
Purpose:
Platform-independent representation of code.
Easier for further optimization and target code generation.
Example:
a = b + c * d
=>
t1 = c * d
t2 = b + t1
a = t2
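As an illustration of how such code can be produced, here is a minimal Python sketch that walks a tuple-based expression tree and emits three-address code; the AST shape and the helper names (new_temp, emit_tac) are assumptions made for this example, not a real compiler API.

counter = 0

def new_temp():
    # return a fresh temporary name: t1, t2, ...
    global counter
    counter += 1
    return f"t{counter}"

def emit_tac(node):
    # node is a variable name (str) or a tuple (op, left, right)
    if isinstance(node, str):
        return node
    op, left, right = node
    l = emit_tac(left)
    r = emit_tac(right)
    t = new_temp()
    print(f"{t} = {l} {op} {r}")
    return t

# a = b + c * d
print(f"a = {emit_tac(('+', 'b', ('*', 'c', 'd')))}")

Running this prints the same three instructions shown above.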
5. Code Optimization
Input: Intermediate code.
Output: Optimized intermediate code.
Purpose: Improve performance without changing semantics.
Types:
Peephole optimization
Loop optimization
Constant folding
Dead code elimination
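As a sketch of one of these, a constant-folding pass over three-address instructions might look like the following Python; the (dest, left, op, right) tuple format is an assumption made for this example.

import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(instructions):
    # instructions: list of (dest, left, op, right) three-address tuples
    folded = []
    for dest, left, op, right in instructions:
        if left.isdigit() and right.isdigit():
            # both operands are compile-time constants: evaluate now
            folded.append((dest, str(OPS[op](int(left), int(right))), None, None))
        else:
            folded.append((dest, left, op, right))
    return folded

print(fold_constants([("t1", "4", "*", "2"), ("t2", "a", "+", "t1")]))
# t1 = 4 * 2 is folded to t1 = 8; t2 = a + t1 is left unchanged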
6. Code Generation
Input: Optimized intermediate code.
Output: Target code (Assembly or Machine Code).
Functions:
Map variables to registers or memory locations.
Convert intermediate representation to instructions of target machine.
Focuses on:
Instruction selection
Register allocation
Efficient addressing
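A naive code-generation sketch (Python, with made-up LOAD/ADD/MUL/STORE mnemonics for a hypothetical register machine) shows how one three-address instruction might be mapped to target instructions:

def gen_code(dest, left, op, right):
    # translate one three-address instruction into pseudo-assembly
    mnemonic = {"+": "ADD", "*": "MUL"}[op]
    return [f"LOAD R1, {left}",
            f"{mnemonic} R1, {right}",
            f"STORE R1, {dest}"]

for line in gen_code("t1", "c", "*", "d"):
    print(line)
# LOAD R1, c / MUL R1, d / STORE R1, t1

A real code generator would also perform register allocation rather than reusing R1 for everything.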
8. Error Handling
Detects, reports, and recovers from errors during compilation.
Types of errors:
Lexical errors: Invalid characters
Syntax errors: Grammar violations
Semantic errors: Type mismatches
Goal: Continue compilation where possible (error recovery).
Overall Compilation Workflow
Source Code
↓
Lexical Analysis → Tokens
↓
Syntax Analysis → Parse Tree
↓
Semantic Analysis → Annotated Tree
↓
Intermediate Code Generation → IR Code
↓
Code Optimization → Optimized IR
↓
Code Generation → Machine Code
Bootstrapping
Bootstrapping is the process of writing a compiler (or assembler) in the source programming language that it is intended to compile, i.e., developing a compiler using the very language that the compiler is supposed to translate.
Bootstrapping Process (Step-by-step)
Step 1:
Write compiler C1 for language L in an existing language A (e.g., Assembly)
↓
Step 2:
Write compiler C2 for L in L itself, and compile C2 using C1 → produces an executable of C2
↓
Step 3:
Use C2 to compile future versions of the compiler and other L programs
Importance of Bootstrapping
Enables self-improvement of a compiler.
Facilitates writing a compiler in the same high-level language, making it:
More portable
Easier to maintain
Essential for language evolution and standardization (e.g., C, Java)
Real-World Examples
The first C compiler was written in Assembly. Later versions of the C compiler were written in C and compiled
using the earlier version — a classic example of bootstrapping.
Lisp, Pascal, and Go compilers have also used bootstrapping techniques.
Self-hosting compiler: A compiler that is able to compile its own source code.
Cross-compiler: A compiler that runs on one platform but generates code for another.
Bootstrapping is the technique of writing a compiler in the same programming language that it is
intended to compile. It involves using simpler or external tools to compile the initial version and then
progressively using improved versions to compile themselves.
Review of Finite Automata in Lexical Analysis
Finite Automata (FA) play a central role in the implementation of lexical analyzers. They are used to recognize
the regular languages that define tokens in programming languages (e.g., identifiers, keywords, numbers,
operators).
DFA Structure
A DFA is a 5-tuple (Q, Σ, δ, q₀, F), where:
Q: Set of states
Σ: Input alphabet
δ: Transition function (Q × Σ → Q)
q₀: Initial state
F: Set of accepting (final) states
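As a sketch, a DFA for identifiers (a letter followed by letters or digits) can be encoded directly from its 5-tuple; the state encoding below is an assumption made for this Python example.

def is_identifier(s):
    # Q = {0, 1}, q0 = 0, F = {1}
    state = 0
    for ch in s:
        if state == 0 and ch.isalpha():
            state = 1          # first character must be a letter
        elif state == 1 and ch.isalnum():
            state = 1          # subsequent characters: letters or digits
        else:
            return False       # no transition defined: reject
    return state == 1

print(is_identifier("count1"))   # True
print(is_identifier("2num"))     # False: cannot start with a digit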
The lexical analyzer simplifies the task of the parser by tokenizing the input, filtering out whitespaces and
comments, and reporting lexical errors.
Finite Automata is the underlying theory behind lexical analyzers. Lexical rules expressed using regular
expressions are converted into finite automata, particularly DFA, which scan the source program to recognize
tokens efficiently.
The input to the lexical analyzer is a stream of characters from the source code, and it outputs a sequence of
tokens to the parser.
Recognition of Tokens
What is a Token?
A token is a sequence of characters that represents a logical unit in the source program.
Each token has:
Token Name (e.g., identifier, number, keyword, operator)
Attribute Value (optional): e.g., the actual name or value like x, 42
Token Structure
Example:
int x = 42;
Token stream:
[int, keyword] [x, identifier] [=, operator] [42, number] [;, separator]
Components Involved
Input Buffer: Holds the source code being scanned.
Lexeme: Actual string in the source code matching the token pattern.
Lookahead Pointer: Marks the end of the current token being scanned.
Matching Example
For input:
while (count < 10)
the token stream is:
[while, keyword], [(, separator], [count, identifier], [<, operator], [10, number], [), separator]
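A minimal tokenizer along these lines can be sketched with regular expressions; the token categories and patterns below are assumptions chosen to match this example.

import re

TOKEN_SPEC = [
    ("keyword",    r"\b(while|int|if)\b"),
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[<>=+\-*/]"),
    ("separator",  r"[();{}]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(code):
    # yield (lexeme, token-name) pairs, skipping whitespace
    for m in MASTER.finditer(code):
        if m.lastgroup != "skip":
            yield (m.group(), m.lastgroup)

print(list(tokenize("while (count < 10)")))
# [('while', 'keyword'), ('(', 'separator'), ('count', 'identifier'),
#  ('<', 'operator'), ('10', 'number'), (')', 'separator')]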
Error Handling in Lexical Analysis
Purpose
To detect invalid sequences of characters in the source program and report lexical errors.
Types of Lexical Errors
Invalid characters
Example: # in C (unless used in a preprocessor directive).
Unterminated strings or comments
Example: "Hello (missing closing quote).
Unrecognized lexemes
Example: @count (if @ is not valid).
Example: the lexical analyzer detects that 2num is an invalid lexeme, since variable names cannot start with a digit, and generates a lexical error.
The recognition of tokens is performed by the lexical analyzer using finite automata, guided by regular
expressions. It follows the maximal munch rule and converts input into tokens.
Lexical error handling involves detecting invalid sequences and applying recovery strategies such as panic
mode, insertion/deletion, and error tokens to continue scanning while minimizing disruptions in parsing.
Review of Context-Free Grammar (CFG)
What is a Grammar?
A grammar is a formal way of specifying the syntactic structure of a programming language.
A Context-Free Grammar (CFG) is a set of production rules used to generate all possible strings in a given
context-free language.
CFG Definition
A CFG is a 4-tuple (V, T, P, S), where:
V: Set of variables (non-terminals)
T: Set of terminals (tokens)
P: Set of production rules (A → α, where A ∈ V and α ∈ (V ∪ T)*)
S: Start symbol (S ∈ V)
Example of CFG
For arithmetic expressions:
E → E + E | E * E | (E) | id
E is a non-terminal
+, *, (, ) and id are terminals
This grammar can generate strings like: id + id * id
Derivation
There are two types:
Leftmost Derivation: Expand the leftmost non-terminal first.
Rightmost Derivation: Expand the rightmost non-terminal first.
Ambiguity in Grammars
What is Ambiguity?
A grammar is ambiguous if a string can have more than one parse tree (or more than one leftmost/rightmost
derivation).
Removing Ambiguity
To remove ambiguity, modify grammar to reflect correct operator precedence and associativity.
Example (unambiguous):
E → E + T | T
T → T * F | F
F → (E) | id
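With this grammar, the leftmost derivation of id + id * id is unique:
E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + T * F ⇒ id + F * F ⇒ id + id * F ⇒ id + id * id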
Introduction to Parsing
What is Parsing?
Parsing is the process of analyzing a sequence of tokens to determine its grammatical structure according to a
CFG.
Purpose of Parsing
Validate syntax
Construct parse trees or syntax trees
Bridge between lexical analysis and semantic analysis
Types of Parsers
Top-Down Parser: Builds the parse tree from the root (start symbol) down to the leaves
Bottom-Up Parser: Builds the parse tree from the leaves (input tokens) up to the root
Parsing Techniques
Top-Down: LL(1) Parser
Bottom-Up: LR(0), SLR(1), LALR(1), Canonical LR(1)
A Context-Free Grammar (CFG) defines syntactic rules of a language using production rules.
A grammar is ambiguous if it allows multiple parse trees for the same string, which must be avoided in
compiler design. Parsing is the process of analyzing token sequences using CFG to produce a parse tree,
validating syntax and structure.
Top-Down Parsing
Top-Down Parsing is a parsing technique where the parser starts at the start symbol and tries to derive the
input string by applying the production rules in a top-to-bottom fashion.
It builds the parse tree from the root (start symbol) and moves toward the leaves (input tokens).
Key Characteristics of Top-Down Parsing
Recursive: Each non-terminal in the grammar might lead to multiple recursive calls.
Predictive: For each non-terminal, the parser predicts which production rule to apply.
LL Grammars
An LL Grammar is a context-free grammar (CFG) that can be parsed using an LL parser (Top-Down Parser) with
one token of lookahead.
L refers to Left-to-right scanning of the input string.
L refers to Leftmost derivation (expanding the leftmost non-terminal first).
1 indicates that the parser uses one token of lookahead to decide which production rule to apply.
Properties of LL Grammar
Unambiguous: LL grammars are unambiguous by nature, making them suitable for efficient parsing.
No Left Recursion: LL grammars cannot have left recursion, because left recursion causes infinite recursion in top-down parsing.
Example of Left Recursion (Invalid for LL Grammar):
A → Aα | β
After eliminating the left recursion:
A → βA'
A' → αA' | ε
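For instance, applying this transformation to the left-recursive expression rule E → E + T | T gives:
E → T E'
E' → + T E' | ε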
For example, with a production S → bB, suppose the parser matches b on input b a but then finds lookahead a where no rule applies. The parser detects this mismatch.
Error Handling (Panic Mode):
The parser discards the erroneous a and resumes parsing from the next valid token.
Top-Down Parsing begins from the start symbol and tries to derive the input by applying production rules. An
LL(1) Parser is a top-down parser that uses one token of lookahead to predict the appropriate rule.
LL Grammars are unambiguous and require the grammar to be free from left recursion.
Error Handling in LL parsers involves detecting predictive parsing errors and using strategies like panic mode
or phrase-level recovery to continue parsing the input.
Recursive Descent Parsing
Recursive Descent Parsing is a top-down parsing technique where each non-terminal in the grammar has a
corresponding recursive function in the parser. These functions attempt to match the input string based on
the production rules.
The parser tries to expand the input by applying the appropriate production rules, and for each non-terminal
in the grammar, there is a recursive procedure to handle it.
Example grammar:
E → T + E | T
T → id
Parsing id + id proceeds as:
E → T + E
T → id (match 'id')
(match '+')
E → T → id (match 'id')
First sets:
First(E) = {id}
First(T) = {id}
Since both alternatives of E begin with id, a predictive parser cannot choose between E → T + E and E → T with one token of lookahead; a recursive descent parser can instead parse T first and then check whether the next token is +.
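A recursive descent parser for this grammar can be sketched in Python as follows (token handling is simplified to a list of token strings; all names are illustrative):

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        if self.peek() == expected:
            self.pos += 1
        else:
            raise SyntaxError(f"expected {expected}, found {self.peek()}")

    def E(self):
        # E -> T + E | T : parse T, then take '+ E' only if '+' is next
        self.T()
        if self.peek() == "+":
            self.match("+")
            self.E()

    def T(self):
        self.match("id")       # T -> id

p = Parser(["id", "+", "id"])
p.E()
print("accepted" if p.peek() is None else "rejected")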
A → Aα | β
This causes infinite recursion in a recursive descent parser, and it also leads to failure in predictive parsers because the parser can keep expanding A without ever consuming input.
The left recursion is eliminated by rewriting:
A → βA'
A' → αA' | ε
Recursive Descent Parsing is a simple, top-down technique where each non-terminal corresponds to a
recursive function. It may require backtracking and cannot handle left recursion.
Predictive Parsers are efficient top-down parsers that use a lookahead token and a parsing table to predict the
correct production rule. They work on LL(1) grammars and do not require backtracking, making them faster
than recursive descent parsers.
Left recursion must be eliminated from the grammar to make it compatible with predictive parsing, ensuring
the parser operates correctly.
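As a sketch of a table-driven predictive parser, the following Python parses the left-recursion-free grammar E → T E', E' → + T E' | ε, T → id; the hand-written parsing table is an assumption made for this tiny grammar.

TABLE = {
    ("E",  "id"): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", "$"):  [],                  # E' -> epsilon
    ("T",  "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T"}

def ll1_parse(tokens):
    stack = ["$", "E"]                 # start symbol on top of the end marker
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            rhs = TABLE.get((top, tokens[i]))
            if rhs is None:
                return False           # no table entry: syntax error
            stack.extend(reversed(rhs))
        elif top == tokens[i]:
            i += 1                     # terminal matched: consume input
        else:
            return False
    return i == len(tokens)

print(ll1_parse(["id", "+", "id"]))    # True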
Bottom-Up Parsing
Bottom-Up Parsing is a type of syntax analysis where the parser starts from the input tokens (the leaves of the parse tree) and works towards the start symbol (the root of the parse tree). This approach is also called shift-reduce parsing.
In bottom-up parsing, the parser tries to reduce the input string to the start symbol by applying production
rules in reverse. It starts with the input string and tries to reduce portions of the string to non-terminals.
Bottom-Up Parsing Process
Initial State: The stack starts empty, and the entire input string is placed into the input buffer.
Shift Operation: The parser shifts tokens from the input buffer onto the stack.
Reduce Operation: After shifting a token, the parser checks if the top portion of the stack matches the right-hand side of any production. If so, it reduces that portion to the left-hand side of the rule.
Repeat: The parser alternates between shift and reduce until the input string is reduced to the start symbol.
Accept: If the entire input string is reduced to the start symbol and no more tokens remain in the input buffer,
the parser accepts the input.
Error: If the parser cannot reduce the stack or match the input token, it reports an error.
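This loop can be sketched in Python; the greedy reduction strategy below only works for very simple grammars (real LR parsers consult a parse table instead), and the grammar E → E + T | T, T → id is chosen purely for illustration.

RULES = [("E", ["E", "+", "T"]), ("T", ["id"]), ("E", ["T"])]

def shift_reduce(tokens):
    stack, pos = [], 0
    while True:
        for lhs, rhs in RULES:                  # try to reduce first
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]
                break
        else:
            if pos < len(tokens):
                stack.append(tokens[pos])       # otherwise shift
                pos += 1
            else:
                return stack == ["E"]           # accept iff only the start symbol remains

print(shift_reduce(["id", "+", "id"]))          # True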
Bottom-Up Parsing starts from the input tokens and reduces them to the start symbol. It works by applying shift and reduce operations, with LR parsing being the most powerful type of bottom-up parsing. An LR parser uses a parsing table with action and goto entries to decide what to do based on the current state and input token.
LR, SLR, LALR, and Canonical LR are different types of bottom-up parsers that vary in complexity, efficiency,
and power.
Bottom-up parsing is efficient for large grammars and can handle complex constructs, but it requires more
memory and is harder to implement than top-down parsing.
Shift-Reduce Parsing
Shift-Reduce Parsing is a bottom-up parsing technique used to parse context-free grammars (CFGs). It is a type
of bottom-up parsing that works by shifting input tokens onto a stack and reducing the stack's topmost
symbols to non-terminals according to the production rules.
Shift: Move the next input token onto the stack.
Reduce: Replace a portion of the stack that matches the right-hand side of a production rule with the left-hand side non-terminal.
The parser continues alternating between shifting and reducing until it has processed all input and reduced
the stack to the start symbol of the grammar, at which point the input is accepted.
Key Operations in Shift-Reduce Parsing
Shift: The parser moves the next token from the input buffer to the top of the stack.
The stack grows as more tokens are shifted onto it.
Reduce: The parser checks the top portion of the stack to see if it matches the right-hand side of any
production rule.
If it matches, the parser reduces that portion of the stack to the left-hand side of the production rule.
Accept: If the entire input string is reduced to the start symbol and the input buffer is empty, the parser
accepts the input as syntactically correct.
Error: If no valid shift or reduce action is possible, the parser reports a syntax error.
Example of Shift-Reduce Parsing
E → E + T | T
T → id
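For the input id + id, one possible shift-reduce trace (with $ marking the stack bottom and end of input) is:

Stack | Input | Action
$ | id + id $ | shift
$ id | + id $ | reduce T → id
$ T | + id $ | reduce E → T
$ E | + id $ | shift
$ E + | id $ | shift
$ E + id | $ | reduce T → id
$ E + T | $ | reduce E → E + T
$ E | $ | accept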
Shift-Reduce Conflicts
A key challenge in shift-reduce parsing is resolving conflicts that occur when the parser cannot decide whether
to shift or reduce.
Shift-Reduce Conflict:
Occurs when the parser cannot decide whether to shift a token onto the stack or reduce a portion of the
stack. This conflict typically arises when the grammar is ambiguous or when it does not follow the required
restrictions for shift-reduce parsing.
Example:
E → E + E
E → id
With E + E on the stack and another + in the input, the parser cannot tell whether to reduce E + E → E or to shift the next +; both choices are possible because the grammar is ambiguous.
Reduce-Reduce Conflict:
Occurs when there are multiple applicable reduce actions.
This conflict arises when the parser cannot determine which of several production rules to apply for reducing
the stack's top portion.
Example:
A → c
B → c
If c is on top of the stack, the parser cannot decide whether to reduce it to A or to B.
Resolving Conflicts
To resolve conflicts in shift-reduce parsing, there are a few strategies:
Disambiguating Grammar: Modify the grammar to make it unambiguous, ensuring that there is a clear
decision between shifting and reducing at each step.
Using Lookahead: The parser can use a lookahead token to decide which operation to apply, particularly in LR
parsers.
Precedence and Associativity Rules: For operators (e.g., + and *), define precedence (which operator has
higher priority) and associativity (whether operators are left- or right-associative) to resolve shift-reduce
conflicts.
Example: * has higher precedence than +, so with E + E on the stack and * as the next input token, the parser shifts * rather than reducing the +.
LR Parsers
LR Parsers are a type of bottom-up parsers that read the input from left to right and generate a rightmost derivation in reverse. LR parsers are among the most powerful types of parsers because they can handle a wide range of context-free grammars (CFGs), including many complex grammars that simpler parsers cannot; note, however, that no LR parser can handle an ambiguous grammar.
Efficient: LR parsers use lookahead and parse tables to guide the parsing process, making them efficient in
handling complex grammars.
Comprehensive: LR parsers can handle a large set of grammars that cannot be handled by recursive descent
parsers.
Types of LR Parsers
Canonical LR Parser:
The most general form of LR parsing.
It uses a large parsing table with many states.
Suitable for parsing a wide range of grammars but memory intensive.
(SLR and LALR parsers, discussed below, are simplified variants.)
Actions of an LR Parser:
Shift: The parser pushes the next input symbol onto the stack and moves to a new state.
Reduce: The parser reduces the top of the stack according to a production rule.
Accept: The parser successfully finishes when the start symbol is reached and the input is exhausted.
Error: The parser reports an error when no valid shift or reduce action is possible.
Construction of SLR Parser
An SLR (Simple LR) Parser is a type of LR parser that uses a simplified approach to constructing the parsing
table compared to the canonical LR parser. SLR parsers are less powerful than canonical LR parsers but are
more space-efficient and easier to construct.
SLR parsers are built from LR(0) items and use FOLLOW sets to decide when to reduce, which keeps the action and goto tables simple.
A → •B (nothing of the right-hand side has been recognized yet)
A → B• (the entire right-hand side has been recognized)
The dot (•) marks how much of the rule has been matched so far; the parser processes the grammar by moving the dot forward.
Create the States of the Parser:
For the grammar:
S → A A
A → a | b
the start state is the closure of the augmented start item:
S' → • S
S → • A A
A → • a
A → • b
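Applying the closure and goto operations to this grammar yields the complete collection of LR(0) item sets (a worked sketch):

I0: S' → •S, S → •AA, A → •a, A → •b
goto(I0, S) = I1: S' → S•
goto(I0, A) = I2: S → A•A, A → •a, A → •b
goto(I0, a) = I3: A → a•
goto(I0, b) = I4: A → b•
goto(I2, A) = I5: S → AA•
goto(I2, a) = I3, goto(I2, b) = I4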
The goto table is used when a reduction occurs and the parser needs to know which state to enter next.
Steps to Construct Canonical LR Parsing Table:
Construct the augmented grammar: Add a new start symbol (e.g., S' → S).
Compute the LR(0) items: Create the initial item and apply the closure operation to form the set of all items.
Generate the states: For each item set, apply the goto operation with all terminals and non-terminals.
Define the Action Table: For each state and terminal, determine whether the action is a shift, reduce, or
accept.
Define the Goto Table: For each state and non-terminal, compute the next state after a reduction.
Example: consider the grammar
S → A B
A → a
B → b
Augmented Grammar:
S' → S
S → A B
A → a
B → b
LR(0) Items:
S' → • S
S → • A B
A → • a
B → • b
State 1 (start state) contains the items:
S' → • S
S → • A B
A → • a (added by the closure operation)
The parser begins by shifting symbols and reducing based on the above items, using the goto operation to
transition between states.
Action Table:
State 1: If the lookahead is a, shift.
State 2: If the lookahead is b, shift.
State 3: If the lookahead is the end symbol ($), reduce using S → A B.
Goto Table:
State 1, A: Go to state 2.
State 2, B: Go to state 3.
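Using these schematic table entries, a parse of the input a b proceeds as follows (state numbers for the shifted terminals are omitted for brevity):

Stack | Input | Action
1 | a b $ | shift a
1 a | b $ | reduce A → a, then goto(1, A) = 2
1 A 2 | b $ | shift b
1 A 2 b | $ | reduce B → b, then goto(2, B) = 3
1 A 2 B 3 | $ | reduce S → A B, accept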
For the same grammar:
S → A B
A → a
B → b
the canonical LR parsing table would generate states based on individual LR(1) item sets, while the LALR parser merges states that have identical cores and differ only in their lookahead symbols.
Canonical LR vs. LALR Parsers
Feature | Canonical LR Parser | LALR Parser
Number of states | Larger (more states) | Smaller (state merging reduces table size)
Parsing power | Handles all LR(1) grammars | Handles most, but not all, LR(1) grammars
Memory usage | Higher (more states and transitions) | Lower (compact tables due to state merging)
Implementation complexity | More complex due to larger tables | Easier to implement than canonical LR
Conflict handling | No conflicts for LR(1) grammars | May introduce conflicts due to state merging
E → E + E
E → E * E
E → (E)
E → id
For the string id + id * id, this grammar allows two different derivations, grouping the expression either as (id + id) * id or as id + (id * id).
This means the grammar is ambiguous, as it generates two different parse trees for the same input.
Inefficiency: Handling ambiguity can lead to an inefficient parser that may need to explore multiple parsing
paths or backtrack, resulting in increased computation time and memory usage.
Semantics Confusion: Ambiguity can create problems not just in parsing, but in semantic analysis. Different
interpretations of the same syntax could lead to conflicting semantic results, making it difficult to generate
correct code or analyze the program.
How to Handle Ambiguous Grammars in Parsing
There are several strategies used to handle or resolve ambiguity during parsing:
Precedence and Associativity Rules:
Precedence determines which operators should be applied first. For example, multiplication (*) has higher precedence than addition (+).
Associativity defines the direction in which operators are applied when they have the same precedence (left-to-right or right-to-left).
For example, in the ambiguous grammar mentioned above, you can introduce precedence rules to resolve
whether + or * should take precedence.
E → E + E (low precedence)
E → E * E (high precedence)
E → ( E ) (parentheses)
E → id (identifier)
By introducing precedence and associativity, you can guide the parser to choose the correct derivation.
Transform the Grammar: One approach is to rewrite the grammar to remove the ambiguity. This may involve
breaking down production rules or introducing new non-terminals to eliminate multiple derivations for a given
string.
For example, the ambiguous expression grammar can be rewritten as follows:
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
In this transformed grammar, + and * are handled in separate non-terminal rules, making the grammar
unambiguous.
Use a Parser with Backtracking: Some parsing strategies (e.g., recursive descent parsers) can use backtracking
to explore all possible derivations when ambiguity arises. However, this approach is inefficient for large
grammars and can lead to performance issues.
Earley Parser and GLR Parser: These parsers are capable of handling ambiguous grammars by parsing all
possible parse trees. However, they are more complex to implement and may still require additional
mechanisms for resolving ambiguity.
E → E + E
E → E * E
E → (E)
E → id
For the expression id + id * id, this grammar is ambiguous because we can derive two different parse
trees:
(id + id) * id
id + (id * id)
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
Now, the parse tree for id + id * id is unique, and the grammar is unambiguous.
Parse tree:
          E
        / | \
       E  +  T
       |    /|\
       T   T * F
       |   |   |
       F   F   id
       |   |
       id  id
This tree represents the correct interpretation where id * id is evaluated first, then added to id.
The idea behind operator precedence parsing is to use a stack-based parser that applies a set of precedence
rules to determine when to shift (move a symbol onto the stack) and when to reduce (apply a production
rule).
Initialize the Stack: The stack starts with only the end marker $, and the input expression is processed one token at a time.
Shift Operation: If the current token is an operand (e.g., an identifier or number), or the topmost terminal on the stack has lower precedence than (<·) the incoming operator, shift the token onto the stack.
Reduce Operation: When the topmost terminal on the stack has higher precedence than (·>) the incoming token, reduce: pop the handle from the top of the stack and replace it with a non-terminal.
End Condition: Parsing continues until the entire input is processed and the stack has been reduced to a single expression.
id + id * id
We will parse this expression using the following operator precedence rules:
* has higher precedence than +.
+ and * are left-associative (i.e., operators of the same precedence are evaluated from left to right).
Precedence relations (topmost stack terminal vs. next input symbol):

     | id | +  | *  | $
id   | —  | ·> | ·> | ·>
+    | <· | ·> | <· | ·>
*    | <· | ·> | ·> | ·>
$    | <· | <· | <· | accept

(<· means shift; ·> means reduce; id never appears directly next to another id in a valid expression.)
Now we parse the expression using the relations above, always comparing the topmost terminal on the stack with the next input token:

Stack: $ | Input: id + id * id $ | $ <· id, shift
Stack: $ id | Input: + id * id $ | id ·> +, reduce id → E
Stack: $ E | Input: + id * id $ | $ <· +, shift
Stack: $ E + | Input: id * id $ | + <· id, shift
Stack: $ E + id | Input: * id $ | id ·> *, reduce id → E
Stack: $ E + E | Input: * id $ | + <· *, shift
Stack: $ E + E * | Input: id $ | * <· id, shift
Stack: $ E + E * id | Input: $ | id ·> $, reduce id → E
Stack: $ E + E * E | Input: $ | * ·> $, reduce E * E → E
Stack: $ E + E | Input: $ | + ·> $, reduce E + E → E
Stack: $ E | Input: $ | accept

Because * has higher precedence than +, the id * id portion is reduced before the addition.
E → E + E
E → E * E
E → (E)
E → id
For operator precedence parsing, this grammar is modified to handle operator precedence explicitly.
We can define precedence rules for + and * as:
* has higher precedence than +.
Parentheses () are used to explicitly group operations and have the highest precedence.
Grammar Modifications:
For addition and multiplication:
E → E + T
E → T
T → T * F
T → F
F → id
F → (E)
Precedence rules:
+ is left-associative and has lower precedence than *.
* is left-associative and has higher precedence than +.
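These rules can be implemented compactly with precedence climbing, a common way of realizing operator precedence parsing. The Python sketch below builds a parenthesized string instead of a parse tree; all names and the token format are illustrative assumptions.

PREC = {"+": 1, "*": 2}

def parse_expr(tokens, pos=0, min_prec=1):
    # tokens is a list such as ["id", "+", "id", "*", "id"]
    assert tokens[pos] == "id", "expected an operand"
    left = "id"
    pos += 1
    while pos < len(tokens) and tokens[pos] in PREC and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # left-associative: the right operand is parsed at one level higher
        right, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        left = f"({left} {op} {right})"
    return left, pos

tree, _ = parse_expr(["id", "+", "id", "*", "id"])
print(tree)    # (id + (id * id)): * binds tighter than +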
Advantages:
Simple to implement: Operator precedence parsing is relatively easy to implement for simple arithmetic
expressions and provides efficient handling of operator precedence.
No need for backtracking: The algorithm is deterministic, meaning it does not require backtracking like
recursive descent parsers.
Efficient: Operator precedence parsers work in linear time (O(n)), making them fast for parsing arithmetic
expressions.
Disadvantages:
Limited grammar coverage: Operator precedence parsing can only handle grammars whose constructs can be expressed with precedence relations. It cannot handle more complex grammars, such as those requiring context-sensitive rules.
Complexity with parentheses: Handling parentheses correctly requires special rules, and the grammar must be
modified to account for them explicitly.
Error handling: It is not as robust as other parsers (e.g., LL or LR parsers) in handling errors or unexpected
input.