The document provides an overview of compiler design, detailing the roles of compilers, interpreters, and translators, along with their characteristics, advantages, and disadvantages. It outlines the phases of compilation, including lexical analysis, syntax analysis, semantic analysis, and code generation, while also discussing error handling and the importance of finite automata in lexical analysis. Additionally, it covers concepts such as bootstrapping, context-free grammar, and parsing techniques, emphasizing the need for unambiguous grammar in compiler design.
Compiler Design Notes

Compiler
A compiler is a software tool that translates an entire program written in a high-level programming language
into machine language, generating an executable file.
Characteristics
Performs the translation once and produces a separate executable file.
Compilation involves multiple phases: Lexical Analysis, Syntax Analysis, Semantic Analysis, Intermediate Code
Generation, Optimization, and Code Generation.
Errors (if any) are reported after scanning the entire program.
Advantages
Fast execution of the final executable.
Good optimization results in efficient machine code.
Disadvantages
Slow to compile large programs initially.
Errors are only visible after compilation, making debugging harder for beginners.

Translator
A translator is a general term for any software that converts code from one programming language to
another. It may include:
Compilers
Interpreters
Assemblers
Types of Translators
Type         Description
Compiler     Translates high-level language to machine code in one go
Interpreter  Translates and executes line by line
Assembler    Converts assembly language to machine code
Ensures code written in one language can be understood and executed by the system.
Provides abstraction between human logic and hardware execution.

Interpreter
An interpreter is a translator that executes a high-level program line by line, converting it into machine code
at runtime.
Characteristics
Translates and executes simultaneously.
No separate executable file is created.
Stops execution when the first error is encountered.
Advantages
Easier to debug: errors are shown immediately.
Useful for educational purposes and scripting languages.
Disadvantages
Slower execution since it processes code line by line.
Code must be re-interpreted every time it runs.
Comparative Summary
Feature          Compiler                         Interpreter
Translation      Entire program at once           Line by line during execution
Execution Time   Faster (after compilation)       Slower
Error Handling   After full scan                  As soon as an error is encountered
Output           Standalone executable            No separate output
Examples         C, C++, Java (JVM for bytecode)  Python, MATLAB, JavaScript

Phases of a Compiler
The compilation process is divided into multiple phases, each responsible for a specific part of the translation
process. These phases transform source code into target code (machine code), systematically analyzing and
optimizing the program.

Major Phases of a Compiler


The phases are typically grouped into two parts:
Analysis Phase: Breaks the source code into constituent parts.
Synthesis Phase: Assembles the intermediate representation into target code.

1. Lexical Analysis (Scanner)


Input: High-level source program (as a sequence of characters).
Output: Sequence of tokens (identifier, keywords, operators, etc.).
Functions:
Removes whitespace and comments.
Groups characters into lexemes.
Detects lexical errors (e.g., invalid characters).
Tool used: Lex

2. Syntax Analysis (Parser)


Input: Tokens from the lexical analyzer.
Output: Parse Tree or Syntax Tree.
Functions:
Checks for grammatical structure of the source code.
Detects syntax errors.
Based on Context-Free Grammar (CFG).
Parser types: LL Parser, LR Parser, SLR, LALR.

3. Semantic Analysis
Input: Syntax tree.
Output: Annotated syntax tree or intermediate representation.
Functions:
Checks semantic correctness (e.g., type checking, undeclared variables).
Uses symbol table and type expressions.
Detects semantic errors not caught by the parser.
4. Intermediate Code Generation
Input: Annotated syntax tree.
Output: Intermediate code (e.g., Three-Address Code, Polish Notation).
Purpose:
Platform-independent representation of code.
Easier for further optimization and target code generation.
Example:
a = b + c * d
=>
t1 = c * d
t2 = b + t1
a = t2
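The translation above can be sketched in code. The following is a minimal, hypothetical illustration (the Node class and gen helper are assumptions, not part of any real compiler) that walks an expression tree for a = b + c * d and emits three-address code:

```python
# Minimal sketch of three-address code generation from an
# expression tree. Node and gen() are illustrative names.

class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

temp_count = 0
code = []

def gen(node):
    """Return the name holding node's value, emitting TAC as needed."""
    global temp_count
    if node.left is None:          # leaf: a variable name
        return node.op
    l = gen(node.left)
    r = gen(node.right)
    temp_count += 1
    t = f"t{temp_count}"
    code.append(f"{t} = {l} {node.op} {r}")
    return t

# a = b + c * d  corresponds to the tree +(b, *(c, d))
tree = Node('+', Node('b'), Node('*', Node('c'), Node('d')))
code.append(f"a = {gen(tree)}")
print('\n'.join(code))
# t1 = c * d
# t2 = b + t1
# a = t2
```

Because gen() recurses into the deepest subexpression first, c * d receives the first temporary, matching the order shown above.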

5. Code Optimization
Input: Intermediate code.
Output: Optimized intermediate code.
Purpose: Improve performance without changing semantics.
Types:
Peephole optimization
Loop optimization
Constant folding
Dead code elimination

6. Code Generation
Input: Optimized intermediate code.
Output: Target code (Assembly or Machine Code).
Functions:
Map variables to registers or memory locations.
Convert intermediate representation to instructions of target machine.
Focuses on:
Instruction selection
Register allocation
Efficient addressing

7. Symbol Table Management


A data structure that stores information about identifiers (variables, functions, classes).
Used in multiple phases for:
Type checking
Scope resolution
Memory management

8. Error Handling
Detects, reports, and recovers from errors during compilation.
Types of errors:
Lexical errors: Invalid characters
Syntax errors: Grammar violations
Semantic errors: Type mismatches
Goal: Continue compilation where possible (error recovery).
Overall Compilation Workflow
Source Code

Lexical Analysis → Tokens

Syntax Analysis → Parse Tree

Semantic Analysis → Annotated Tree

Intermediate Code Generation → IR Code

Code Optimization → Optimized IR

Code Generation → Machine Code

Bootstrapping
Bootstrapping is the process of writing a compiler (or assembler) in the source programming language that it
is intended to compile. In simple terms, bootstrapping refers to the method of developing a compiler using
the language that the compiler is supposed to translate.
Bootstrapping Process (Step-by-step)
Step 1:
Write compiler C1 for language L in an existing language A (e.g., Assembly)

Step 2:
Write compiler C2 for L in L itself; compile C2 using C1 → Produces executable of C2

Step 3:
Use C2 to compile future versions of the compiler and other L programs

Importance of Bootstrapping
Enables self-improvement of a compiler.
Facilitates writing a compiler in the same high-level language, making it:
More portable, Easier to maintain, Essential for language evolution and standardization (e.g., C, Java).

Real-World Examples
The first C compiler was written in Assembly. Later versions of the C compiler were written in C and compiled
using the earlier version — a classic example of bootstrapping.
Lisp, Pascal, and Go compilers have also used bootstrapping techniques.
Self-hosting compiler: A compiler that is able to compile its own source code.
Cross-compiler: A compiler that runs on one platform but generates code for another.

Bootstrapping is the technique of writing a compiler in the same programming language that it is
intended to compile. It involves using simpler or external tools to compile the initial version and then
progressively using improved versions to compile themselves.
Review of Finite Automata in Lexical Analysis
Finite Automata (FA) play a central role in the implementation of lexical analyzers. They are used to recognize
the regular languages that define tokens in programming languages (e.g., identifiers, keywords, numbers,
operators).

Finite Automata: Definition


A Finite Automaton is a mathematical model used to recognize patterns within input strings. It is a state
machine that accepts or rejects strings based on its structure.
There are two types:
Type                                      Description
DFA (Deterministic Finite Automaton)      Exactly one transition for each symbol from a state
NFA (Nondeterministic Finite Automaton)   May have multiple transitions (including ε-transitions) from a state

Role in Lexical Analysis


Lexical rules (like token patterns) are specified using regular expressions.
These regular expressions are converted into Finite Automata.
The automaton is then used to scan and match lexemes in the source code.

Working of Lexical Analyzer using FA


Token patterns (e.g., [a-zA-Z][a-zA-Z0-9]* for identifiers) are described as regular expressions.
Regular expressions are converted into NFA.
NFA is converted into DFA (using subset construction).
DFA is implemented in code to scan the input and identify the tokens.

DFA Structure
A DFA is a 5-tuple (Q, Σ, δ, q₀, F), where:

Q: Set of states
Σ: Input alphabet
δ: Transition function (Q × Σ → Q)
q₀: Initial state
F: Set of accepting (final) states
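The 5-tuple can be written down directly as data. The sketch below is an illustration (it assumes a simplified identifier pattern of lowercase letters and digits) of a DFA that accepts identifiers:

```python
# A DFA encoded as its 5-tuple (Q, Sigma, delta, q0, F), accepting
# identifiers of the simplified form [a-z][a-z0-9]*.

import string

Q     = {"start", "ident", "dead"}
SIGMA = set(string.ascii_lowercase + string.digits)

def delta(state, ch):
    if state == "start":
        return "ident" if ch in string.ascii_lowercase else "dead"
    if state == "ident":
        return "ident" if ch in SIGMA else "dead"
    return "dead"                  # dead state traps invalid input

q0 = "start"
F  = {"ident"}

def accepts(s):
    state = q0
    for ch in s:
        state = delta(state, ch)
    return state in F

print(accepts("count"))   # True
print(accepts("2num"))    # False: cannot start with a digit
```

The explicit dead state makes the transition function total, as the definition δ: Q × Σ → Q requires.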

Advantages of DFA in Lexical Analysis


Fast: Requires only one pass over the input.
Deterministic: No backtracking or ambiguity in transitions.
Efficient in hardware or software implementation.
Input to Lexical Analyzer
Primary Input
Source program written in a high-level programming language (e.g., C, Java).
Supplied as a stream of characters from the source code file.

Secondary Inputs (Support Structures)


Symbol Table
Stores information about identifiers (variables, function names).
Used to track scope and declaration.
Keyword Table
Stores reserved words of the programming language (e.g., if, while, return).
Helps distinguish between identifiers and keywords.
Finite Automata / Token Descriptions
Lexical analyzer refers to these definitions to identify valid tokens.

Output of Lexical Analyzer


Sequence of tokens (with type and value), e.g.

[id, x], [assign, =], [num, 5], [semi, ;]

Interaction with Parser


+-------------------+
|  Source Program   |
+-------------------+
          |
          v
+-------------------+
| Lexical Analyzer  |
+-------------------+
          |
          v
+-------------------+
| Tokens to Parser  |
+-------------------+

The lexical analyzer simplifies the task of the parser by tokenizing the input, filtering out whitespaces and
comments, and reporting lexical errors.

Finite Automata is the underlying theory behind lexical analyzers. Lexical rules expressed using regular
expressions are converted into finite automata, particularly DFA, which scan the source program to recognize
tokens efficiently.

The input to the lexical analyzer is a stream of characters from the source code, and it outputs a sequence of
tokens to the parser.
Recognition of Tokens
What is a Token?
A token is a sequence of characters that represents a logical unit in the source program.
Each token has:
Token Name (e.g., identifier, number, keyword, operator)
Attribute Value (optional): e.g., the actual name or value like x, 42

Token Structure

Example:

int x = 42;

Token stream:

[int, keyword] [x, identifier] [=, operator] [42, number] [;, separator]

Token Recognition Process


Lexical patterns are defined using regular expressions.
These patterns are converted to finite automata (DFA).
The DFA scans input one character at a time:
Follows transitions until a final state is reached.
Accepts the longest valid match (called maximal munch rule).
Upon reaching a final state, the corresponding token is emitted.

Components Involved
Input Buffer: Holds the source code being scanned.
Lexeme: Actual string in the source code matching the token pattern.
Lookahead Pointer: Marks the end of the current token being scanned.

Matching Example
For input:

while (count < 10)

The lexical analyzer identifies tokens as:

[while, keyword], [(, separator], [count, identifier], [<, operator], [10, number], [), separator]
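A tokenizer in this style can be sketched with regular expressions. The pattern names and the tiny pattern set below are assumptions for illustration; ordering keyword before identifier, and multi-character operators before single-character ones, emulates the maximal munch rule for these patterns:

```python
# Regex-based tokenizer sketch illustrating maximal munch: at each
# position, the longest matching pattern wins.

import re

TOKEN_SPEC = [
    ("keyword",    r"\b(?:while|if|int|return)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number",     r"\d+"),
    ("operator",   r"<=|>=|==|<|>|="),
    ("separator",  r"[();{}]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "skip":          # drop whitespace
            tokens.append((m.group(), m.lastgroup))
    return tokens

print(tokenize("while (count < 10)"))
# [('while', 'keyword'), ('(', 'separator'), ('count', 'identifier'),
#  ('<', 'operator'), ('10', 'number'), (')', 'separator')]
```

The \b word boundaries ensure that an input like whilex matches the longer identifier pattern instead of the keyword prefix, preserving longest-match behavior.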
Error Handling in Lexical Analysis
Purpose
To detect invalid sequences of characters in the source program and report lexical errors.
Types of Lexical Errors
Invalid characters
Example: # in C (unless used in preprocessor directive).
Unterminated strings or comments
Example: "Hello (missing closing quote).
Unrecognized lexemes
Example: @count (if @ is not valid).

Error Recovery Strategies


Strategy                           Description
Panic Mode                         Skip characters until a delimiter or whitespace is found. Simple, but may skip valid code.
Delete/Insert/Replace Characters   Try correcting common typos by guessing the intended token.
Use Default Token                  Return a generic or error token to keep parsing.
Flag and Continue                  Report the error, skip the bad lexeme, and move on.

Lexical Analyzer's Role in Error Handling


It should:
Report the line and position of the error.
Provide meaningful messages (e.g., "Invalid character '@' at line 3").
Continue scanning if possible to identify more errors.

Example of Error Handling


Source code:

int 2num = 10;

Lexical analyzer detects: 2num is invalid as variable names cannot start with digits.

Generates error:

Lexical Error: Invalid identifier '2num' at line 1
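A simplified check for this class of error might look like the following sketch (the helper name and the digit-prefix heuristic are assumptions, not a real lexer):

```python
# Sketch of lexical error reporting: scan candidate lexemes and flag
# ones that look like identifiers but start with a digit.

import re

def check_lexemes(line_no, line):
    errors = []
    for m in re.finditer(r"[A-Za-z0-9_]+", line):
        lexeme = m.group()
        # Starts with a digit but is not a pure number: invalid identifier.
        if lexeme[0].isdigit() and not lexeme.isdigit():
            errors.append(
                f"Lexical Error: Invalid identifier '{lexeme}' at line {line_no}")
    return errors

print(check_lexemes(1, "int 2num = 10;"))
# ["Lexical Error: Invalid identifier '2num' at line 1"]
```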

The recognition of tokens is performed by the lexical analyzer using finite automata, guided by regular
expressions. It follows the maximal munch rule and converts input into tokens.
Lexical error handling involves detecting invalid sequences and applying recovery strategies such as panic
mode, insertion/deletion, and error tokens to continue scanning while minimizing disruptions in parsing.
Review of Context-Free Grammar (CFG)
What is a Grammar?
A grammar is a formal way of specifying the syntactic structure of a programming language.
A Context-Free Grammar (CFG) is a set of production rules used to generate all possible strings in a given
context-free language.

CFG Definition
A CFG is a 4-tuple (V, T, P, S), where:
V: Set of variables (non-terminals)
T: Set of terminals (tokens)
P: Set of production rules (A → α, where A ∈ V and α ∈ (V ∪ T)*)
S: Start symbol (S ∈ V)

Example of CFG
For arithmetic expressions:

E → E + E | E * E | (E) | id

E is a non-terminal
+, *, (, ) and id are terminals
This grammar can generate strings like: id + id * id

Derivation
There are two types:
Leftmost Derivation: Expand the leftmost non-terminal first.
Rightmost Derivation: Expand the rightmost non-terminal first.

Parse Tree (Derivation Tree)


A tree representation showing how a string is derived from the start symbol using production rules.

Ambiguity in Grammars
What is Ambiguity?
A grammar is ambiguous if a string can have more than one parse tree (or more than one leftmost/rightmost
derivation).

Example of Ambiguous Grammar


E → E + E | E * E | (E) | id

For input id + id * id, we have:


Parse Tree 1:
id + (id * id) → multiplication first (correct precedence)
Parse Tree 2:
(id + id) * id → addition first (incorrect precedence)
→ Hence, grammar is ambiguous.
Why Ambiguity is a Problem?
Compilers need unique interpretation of syntax.
Ambiguity causes confusion in:
Parsing
Code generation
Semantic analysis

Removing Ambiguity
To remove ambiguity, modify grammar to reflect correct operator precedence and associativity.
Example (unambiguous):

E → E + T | T
T → T * F | F
F → (E) | id

Here, * has higher precedence than +


Left-associative operations are modeled explicitly

Introduction to Parsing
What is Parsing?
Parsing is the process of analyzing a sequence of tokens to determine its grammatical structure according to a
CFG.
Purpose of Parsing
Validate syntax
Construct parse trees or syntax trees
Bridge between lexical analysis and semantic analysis

Types of Parsers
Parser Type Description
Top-Down Parser Builds parse tree from the root (start symbol) to leaves
Bottom-Up Parser Builds parse tree from leaves (input tokens) to root

Parsing Techniques
Technique Example
Top-Down LL(1) Parser
Bottom-Up LR(0), SLR(1), LALR(1), Canonical LR(1)

Parse Tree vs Syntax Tree


Feature           Parse Tree   Syntax Tree
Shows all rules   Yes          No (only essential structure)
More detailed     Yes          No
Used in parsing   Yes          Yes, especially in semantic analysis

A Context-Free Grammar (CFG) defines syntactic rules of a language using production rules.
A grammar is ambiguous if it allows multiple parse trees for the same string, which must be avoided in
compiler design. Parsing is the process of analyzing token sequences using CFG to produce a parse tree,
validating syntax and structure.
Top-Down Parsing
Top-Down Parsing is a parsing technique where the parser starts at the start symbol and tries to derive the
input string by applying the production rules in a top-to-bottom fashion.
It builds the parse tree from the root (start symbol) and moves toward the leaves (input tokens).
Key Characteristics of Top-Down Parsing
Recursive: Each non-terminal in the grammar might lead to multiple recursive calls.
Predictive: For each non-terminal, the parser predicts which production rule to apply.

Types of Top-Down Parsers


Recursive Descent Parser:
A hand-written parser for a grammar.
Each non-terminal has a corresponding recursive procedure.
Backtracking is used if a wrong path is taken.
Not efficient for grammars that require backtracking.
LL(1) Parser:
A non-recursive parser using a lookahead of one token.
A predictive parser that does not require backtracking.
Works with LL grammars.

LL Grammars
An LL Grammar is a context-free grammar (CFG) that can be parsed using an LL parser (Top-Down Parser) with
one token of lookahead.
L refers to Left-to-right scanning of the input string.
L refers to Leftmost derivation (expanding the leftmost non-terminal first).
1 indicates that the parser uses one token of lookahead to decide which production rule to apply.

Properties of LL Grammar
Unambiguous: LL grammars are unambiguous by nature, making them suitable for efficient parsing.
No Left Recursion: LL grammars cannot have left recursion. Left recursion causes infinite recursion in top-down parsing.
Example of Left Recursion (Invalid for LL Grammar):

A → Aα | β

The left recursion is removed by re-writing it:

A → βA'
A' → αA' | ε
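This rewriting can be automated. The sketch below (the grammar representation and function name are assumptions for illustration) removes immediate left recursion of the form A → Aα | β:

```python
# Remove immediate left recursion A → Aα | β, producing
# A → βA' and A' → αA' | ε. A grammar is a dict mapping a
# non-terminal to a list of right-hand sides (tuples of symbols);
# the empty tuple () denotes ε.

def eliminate_immediate_left_recursion(grammar):
    new_grammar = {}
    for A, productions in grammar.items():
        recursive = [p[1:] for p in productions if p and p[0] == A]   # the α parts
        others    = [p for p in productions if not p or p[0] != A]    # the β parts
        if not recursive:
            new_grammar[A] = productions
            continue
        A2 = A + "'"
        new_grammar[A]  = [beta + (A2,) for beta in others]
        new_grammar[A2] = [alpha + (A2,) for alpha in recursive] + [()]
    return new_grammar

g = {"A": [("A", "a"), ("b",)]}          # A → Aα | β with α = a, β = b
print(eliminate_immediate_left_recursion(g))
# {'A': [('b', "A'")], "A'": [('a', "A'"), ()]}
```

Note this handles only immediate left recursion; indirect left recursion (A → Bα, B → Aβ) needs the more general substitution-based algorithm first.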

LL(1) Parsing Table


For an LL(1) parser, we construct a parsing table that guides the selection of production rules based on the
current non-terminal and the lookahead token.
Rows represent non-terminals.
Columns represent terminals (including the end-of-input symbol "$").
Cell Entries represent the production rule to apply based on the current non-terminal and lookahead terminal.
LL(1) Parsing Table Example:

Non-Terminal   a        b       $
S              S → aA   -       -
A              -        A → b   A → ε
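The table above can drive a small non-recursive parser. The following is a minimal sketch (not a production parser) that parses with an explicit stack and the (non-terminal, lookahead) table for S → aA, A → b | ε:

```python
# Table-driven LL(1) parser for S → aA, A → b | ε.
# "$" marks end of input; an empty right-hand side denotes ε.

TABLE = {
    ("S", "a"): ["a", "A"],
    ("A", "b"): ["b"],
    ("A", "$"): [],          # A → ε
}
NONTERMINALS = {"S", "A"}

def parse(tokens):
    stack = ["$", "S"]                 # start symbol on top
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top == look:                # terminal (or $) matches: consume
            i += 1
        elif top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False           # empty table cell: predictive error
            stack.extend(reversed(rhs))  # push RHS, leftmost symbol on top
        else:
            return False               # terminal mismatch
    return i == len(tokens)

print(parse(["a", "b"]))   # True
print(parse(["a"]))        # True  (A → ε)
print(parse(["b"]))        # False (no entry for (S, b))
```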
Error Handling in LL Parser
Purpose of Error Handling in LL Parser
Error handling in a parser is critical to detect syntax errors early and to provide meaningful feedback to the
user. In an LL(1) parser, errors may arise when the lookahead token does not match any valid production for
the current non-terminal.
Types of Errors in LL Parsing
Predictive Parsing Error:
The lookahead token does not match any production rule for the current non-terminal.
Occurs when the parsing table does not have an entry for the current combination of non-terminal and
lookahead token.
Mismatch Error:
The input token does not match the expected terminal.
This typically occurs in cases of incorrect grammar or ill-formed source code.

Error Recovery in LL Parser


To ensure the parser can recover from errors and continue parsing the rest of the input, LL parsers implement
various error recovery strategies:
Panic Mode:
Skip tokens until a synchronization token (like ; or }) is found.
Resumes parsing from a valid position after skipping over incorrect input.
Phrase-Level Recovery:
Try to insert or delete tokens to make the input match an expected pattern.
Backtracking (limited to recursive descent):
If a wrong path is taken, backtrack to a previous point and try another production rule.
However, LL(1) parsers avoid backtracking, so predictive errors are caught early.

Example of Error Handling in LL Parser


S → aA | bB
A → a | ε
B → b

For input b a, the parser applies S → bB on lookahead b and matches b; it then expects b (for B → b) but finds a, so a mismatch is detected.
Error Handling (Panic Mode):
The parser reports the error, discards the erroneous a, and resumes parsing from the next valid token.

LL Parsing Error Detection Process


Identify Mismatch: Compare the lookahead token with the expected tokens in the parsing table.
Handle Mismatch: If no valid entry exists in the table, apply error recovery strategies:
Panic Mode (skip tokens)
Phrase-Level Recovery (insert/delete tokens)
Continue Parsing: After error recovery, continue parsing the remaining input.

Top-Down Parsing begins from the start symbol and tries to derive the input by applying production rules. An
LL(1) Parser is a top-down parser that uses one token of lookahead to predict the appropriate rule.
LL Grammars are unambiguous and require the grammar to be free from left recursion.
Error Handling in LL parsers involves detecting predictive parsing errors and using strategies like panic mode
or phrase-level recovery to continue parsing the input.
Recursive Descent Parsing
Recursive Descent Parsing is a top-down parsing technique where each non-terminal in the grammar has a
corresponding recursive function in the parser. These functions attempt to match the input string based on
the production rules.
The parser tries to expand the input by applying the appropriate production rules, and for each non-terminal
in the grammar, there is a recursive procedure to handle it.

Characteristics of Recursive Descent Parsing


It works by applying leftmost derivation of the grammar.
It is a recursive process, where each non-terminal invokes its own corresponding function.
It uses backtracking to try different productions if a mismatch occurs.
Typically simple to implement and suitable for small grammars.

Example of Recursive Descent Parsing


Consider a grammar:

E → T + E | T
T → id

For input id + id, the recursive descent parser works as follows:


Start with the start symbol E.
For E, check if the input matches the production E → T + E or E → T.
If E → T + E is chosen, recursively call the function for T and E.

E→T+E

T → id // Match 'id'
+ // Match '+'
E → T // Match 'id'
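The procedure above can be sketched as a hand-written parser. For this particular grammar one token of lookahead suffices, so the sketch below avoids backtracking by checking for + after parsing T (the class and method names are illustrative):

```python
# Recursive descent parser sketch for E → T + E | T, T → id.
# Each non-terminal becomes one recursive procedure.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def parse_E(self):           # E → T + E | T
        self.parse_T()
        if self.peek() == "+":   # decide with one token of lookahead
            self.match("+")
            self.parse_E()

    def parse_T(self):           # T → id
        self.match("id")

    def parse(self):
        self.parse_E()
        if self.peek() is not None:
            raise SyntaxError("trailing input")
        return True

print(Parser(["id", "+", "id"]).parse())   # True
```

For grammars where the productions of a non-terminal share a common prefix and cannot be decided by one lookahead token, a backtracking version must save and restore self.pos before trying each alternative.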

Advantages of Recursive Descent Parsing


Simple to implement: Each non-terminal corresponds to a function.
Readable and maintainable: The structure mirrors the grammar.

Disadvantages of Recursive Descent Parsing


Backtracking: Can be inefficient for ambiguous grammars, requiring backtracking when a rule doesn't match.
Left recursion: Recursive descent parsers cannot handle left-recursive grammars directly, as this causes
infinite recursion.
Predictive Parsers
A Predictive Parser is a top-down parser that does not require backtracking. It uses a single lookahead token
and a parsing table to predict which production to apply based on the current non-terminal and the next input
token.

Key Features of Predictive Parsers


No backtracking: Predictive parsers do not need to backtrack, making them efficient.
LL(1): Most predictive parsers are LL(1), meaning they predict using one token of lookahead.
Parse Table: The parser relies on a parsing table to decide which production to use.

Working of Predictive Parsing


First Set: For a non-terminal A, the First set contains all the terminals that appear as the first symbol in strings
derived from A.
Follow Set: For a non-terminal A, the Follow set contains all the terminals that can appear immediately after A
in the derivations of the grammar.
LL(1) Parsing Table: The table has rows corresponding to non-terminals and columns corresponding to
terminals. The entries indicate which production rule to apply.
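First sets can be computed by a standard fixed-point iteration. The sketch below (the grammar encoding and the use of the empty string for ε are assumptions for illustration) computes First sets for the earlier grammar S → aA, A → b | ε:

```python
# Fixed-point computation of First sets. A grammar maps each
# non-terminal to a list of right-hand sides (lists of symbols);
# "" stands for ε inside a First set.

def first_sets(grammar):
    first = {A: set() for A in grammar}

    def first_of(symbol):
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:                      # iterate until no set grows
        changed = False
        for A, productions in grammar.items():
            for rhs in productions:
                nullable_prefix = True
                for sym in rhs:
                    f = first_of(sym)
                    before = len(first[A])
                    first[A] |= f - {""}
                    if len(first[A]) != before:
                        changed = True
                    if "" not in f:     # sym cannot derive ε: stop here
                        nullable_prefix = False
                        break
                if nullable_prefix and "" not in first[A]:
                    first[A].add("")    # whole RHS can derive ε
                    changed = True
    return first

g = {"S": [["a", "A"]], "A": [["b"], []]}   # S → aA, A → b | ε
for nt, f in first_sets(g).items():
    print(nt, sorted(f))
# S ['a']
# A ['', 'b']
```

The Follow computation is a similar fixed-point loop seeded with $ in Follow of the start symbol.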

Predictive Parsing Table Example


Consider the following grammar:

E → T + E | T
T → id

First sets:

First(E) = {id}
First(T) = {id}

Both productions for E begin with id, so this grammar is not LL(1) as written; it must first be left-factored:

E → T E'
E' → + T E' | ε

Now First(E') = {+, ε}, and the LL(1) Parsing Table is:

Non-Terminal   id         +              $
E              E → T E'   -              -
E'             -          E' → + T E'    E' → ε
T              T → id     -              -

Predictive Parsing Process


Start with the start symbol and the input string.
Use the lookahead token to decide which rule to apply.
Apply the rule and continue parsing recursively.
If a match is found, consume the token and move to the next one.

Advantages of Predictive Parsers


Efficient: No backtracking and uses a single token of lookahead.
Non-recursive: This is a major improvement over recursive descent parsing for complex grammars.
Handles LL(1) grammars efficiently.

Disadvantages of Predictive Parsers


Requires LL(1) grammar: Predictive parsers work only on LL(1) grammars, which limits their applicability.
Complex table construction: Building the parsing table can be complex for large grammars.
Handling Left Recursion in Predictive Parsers
Left Recursion and Its Problem
Left recursion in a grammar occurs when a non-terminal can directly or indirectly derive itself at the beginning
of its production.
Example of Left Recursion:

A → Aα | β

This causes infinite recursion in a recursive descent parser, and it also leads to failure in predictive parsers
because the parser can keep expanding A without ever consuming input.

Eliminating Left Recursion


To make the grammar suitable for predictive parsing, left recursion must be eliminated.
Left recursion is eliminated by introducing a new non-terminal and refactoring the productions.

For example:

A → Aα | β

Can be rewritten as:

A → βA'
A' → αA' | ε

Now, the grammar is right-recursive, which is suitable for predictive parsing.

Recursive Descent Parsing is a simple, top-down technique where each non-terminal corresponds to a
recursive function. It may require backtracking and cannot handle left recursion.
Predictive Parsers are efficient top-down parsers that use a lookahead token and a parsing table to predict the
correct production rule. They work on LL(1) grammars and do not require backtracking, making them faster
than recursive descent parsers.
Left recursion must be eliminated from the grammar to make it compatible with predictive parsing, ensuring
the parser operates correctly.
Bottom-Up Parsing
Bottom-Up Parsing is a type of syntax analysis where the parser starts from the input tokens (the leaves of the parse tree) and works towards the start symbol (the root of the parse tree). This approach is also called shift-reduce parsing.
In bottom-up parsing, the parser tries to reduce the input string to the start symbol by applying production
rules in reverse. It starts with the input string and tries to reduce portions of the string to non-terminals.
Bottom-Up Parsing Process
Initial State: The stack starts empty, and the entire input string is placed into the input buffer.
Shift Operation: The parser shifts tokens from the input buffer onto the stack.
Reduce Operation: After shifting a token, the parser checks if the top portion of the stack matches the right-hand side of any production. If so, it reduces that portion to the left-hand side of the rule.
Repeat: The parser alternates between shift and reduce until the input string is reduced to the start symbol.
Accept: If the entire input string is reduced to the start symbol and no more tokens remain in the input buffer,
the parser accepts the input.
Error: If the parser cannot reduce the stack or match the input token, it reports an error.

Types of Bottom-Up Parsers


LR Parsing (Left to Right, Rightmost Derivation):
LR parsers are the most powerful class of parsers for deterministic context-free grammars.
They parse the input from left to right and generate a rightmost derivation.
LR parsers work in bottom-up fashion by shifting tokens onto a stack and reducing them based on the
production rules.
SLR (Simple LR) Parsing:
An optimized version of LR parsing.
Uses a simpler LR parsing table with fewer states, making it more efficient, but can be less powerful for certain
grammars.
LALR (Look-Ahead LR) Parsing:
Similar to SLR, but allows a larger lookahead to improve parsing capabilities.
LALR parsers are commonly used in practical applications (e.g., Yacc).
Canonical LR Parsing:
The most general form of LR parsing with a more comprehensive state set, allowing for more complex
grammars. It uses a large parsing table, which makes it more powerful but less efficient compared to SLR and
LALR parsers.
LR Parsing Table
The LR parsing table consists of:
Action Table: Determines what action the parser should take (shift, reduce, accept, or error) based on the
current state and lookahead token.
Goto Table: Determines the next state based on the current state and the non-terminal being processed.
Action Table Example:

State   id   +    *    $       ...
0       S1
1            S2   S3   Accept
2            R1        R1
3            R2        R2
S1: Shift the token id onto the stack and go to state 1.
S2: Shift the token + and go to state 2.
R1: Reduce using the first rule (e.g., E → id).
Accept: The parser successfully reduces the input string to the start symbol.
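A table-driven LR parse can be sketched as follows. The ACTION and GOTO tables below were built by hand via the SLR construction for the grammar E → E + T | T, T → id; they are an assumption for illustration, not the table shown above:

```python
# Table-driven LR parse sketch for
#   1: E → E + T    2: E → T    3: T → id
# Each rule maps to its left-hand side and right-hand-side length.

RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 1)}

ACTION = {                       # (state, lookahead) → action
    (0, "id"): ("s", 3),
    (1, "+"):  ("s", 4),  (1, "$"): ("acc", 0),
    (2, "+"):  ("r", 2),  (2, "$"): ("r", 2),
    (3, "+"):  ("r", 3),  (3, "$"): ("r", 3),
    (4, "id"): ("s", 3),
    (5, "+"):  ("r", 1),  (5, "$"): ("r", 1),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 5}

def lr_parse(tokens):
    stack = [0]                      # stack of states
    tokens = tokens + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False             # blank table cell: syntax error
        kind, arg = act
        if kind == "s":              # shift: push next state
            stack.append(arg)
            i += 1
        elif kind == "r":            # reduce: pop |RHS| states, then goto
            lhs, length = RULES[arg]
            del stack[len(stack) - length:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                        # accept
            return True

print(lr_parse(["id", "+", "id"]))   # True
print(lr_parse(["id", "+"]))         # False
```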
Example of Bottom-Up Parsing
E → E + T | T
T → id

For input id + id, the parsing proceeds as follows:

Shift: Place id on the stack.
Stack: id

Reduce: Apply T → id.
Stack: T

Reduce: Apply E → T.
Stack: E

Shift: Place + on the stack.
Stack: E +

Shift: Place the second id on the stack.
Stack: E + id

Reduce: Apply T → id.
Stack: E + T

Reduce: Apply E → E + T.
Stack: E

Accept: The parser successfully reduces the input string to E.

Advantages of Bottom-Up Parsing


Powerful: Bottom-up parsing can handle a wide variety of context-free grammars, including many complex
constructs like ambiguous or left-recursive grammars.
Efficient for large grammars: Can handle grammars with a larger set of production rules efficiently.

Disadvantages of Bottom-Up Parsing


Complex to implement: Compared to top-down parsers, bottom-up parsers require constructing LR parsing
tables, which can be difficult.
Large memory usage: Canonical LR parsing, in particular, can require a large number of states and tables,
making it less space-efficient.

Bottom-Up Parsing starts from the input tokens and reduces them to the start symbol. It works by applying
shift and reduce operations, with LR parsing being the most powerful type of bottom-up parsing. It uses an LR
parsing table with action and goto entries to decide what action to take based on the current state and input
token

LR, SLR, LALR, and Canonical LR are different types of bottom-up parsers that vary in complexity, efficiency,
and power.
Bottom-up parsing is efficient for large grammars and can handle complex constructs, but it requires more
memory and is harder to implement than top-down parsing.
Shift-Reduce Parsing
Shift-Reduce Parsing is a bottom-up parsing technique used to parse context-free grammars (CFGs). It is a type
of bottom-up parsing that works by shifting input tokens onto a stack and reducing the stack's topmost
symbols to non-terminals according to the production rules.
Shift: Move the next input token onto the stack.
Reduce: Replace a portion of the stack that matches the right-hand side of a production rule with the left-hand side non-terminal.
The parser continues alternating between shifting and reducing until it has processed all input and reduced
the stack to the start symbol of the grammar, at which point the input is accepted.
Key Operations in Shift-Reduce Parsing
Shift: The parser moves the next token from the input buffer to the top of the stack.
The stack grows as more tokens are shifted onto it.
Reduce: The parser checks the top portion of the stack to see if it matches the right-hand side of any
production rule.
If it matches, the parser reduces that portion of the stack to the left-hand side of the production rule.
Accept: If the entire input string is reduced to the start symbol and the input buffer is empty, the parser
accepts the input as syntactically correct.
Error: If no reduction rule can be applied and no more tokens remain in the input, the parser reports a syntax
error.
Example of Shift-Reduce Parsing
E→E+T|T
T → id

For input id + id, the shift-reduce parser proceeds as follows:


Shift the first id onto the stack.
Stack: id
Input: + id

Reduce using the production T → id.


Stack: T
Input: + id

Reduce using the production E → T.


Stack: E
Input: + id

Shift the + token onto the stack.


Stack: E +
Input: id

Shift the second id onto the stack.


Stack: E + id
Input: (empty)

Reduce using the production T → id.


Stack: E + T
Input: (empty)

Reduce using the production E → E + T.


Stack: E
Input: (empty)
Accept: The input has been reduced to the start symbol E, and the input is successfully parsed.
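A shift-reduce recognizer for this grammar can be sketched in a few lines of Python. This is an intentionally naive version: real parsers use an LR table to decide when to reduce, whereas here the decision is hard-wired by trying productions in a fixed order (listing E → E + T before E → T is what makes the greedy strategy work for this particular grammar).

```python
# Naive shift-reduce recognizer for the grammar:
#   E -> E + T | T
#   T -> id
# Productions are tried in this order; putting E -> E + T before
# E -> T is what makes the greedy strategy work for this grammar.
PRODUCTIONS = [
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["id"]),
]

def shift_reduce(tokens):
    stack = []
    pos = 0
    while True:
        # Reduce greedily while some production's right-hand side
        # matches the top of the stack.
        reduced = True
        while reduced:
            reduced = False
            for head, body in PRODUCTIONS:
                if stack[-len(body):] == body:
                    del stack[-len(body):]   # pop the handle
                    stack.append(head)       # push the left-hand side
                    print("reduce:", stack)
                    reduced = True
                    break
        if pos < len(tokens):
            stack.append(tokens[pos])        # shift the next token
            pos += 1
            print("shift: ", stack)
        elif stack == ["E"]:
            return True                      # accept
        else:
            return False                     # error

print(shift_reduce(["id", "+", "id"]))
```

Running it prints the same shift/reduce trace as the example above and accepts the input.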
Types of Shift-Reduce Parsers
Simple Shift-Reduce Parser:
A basic parser that uses a single stack and applies shift and reduce operations based on the input tokens and
stack contents.
It might require backtracking for some grammars, especially when ambiguous rules are present.
LR Parser:
An advanced type of shift-reduce parser.
It uses shift-reduce operations combined with a lookahead mechanism to make decisions.
It relies on an LR parsing table to guide the parser on when to shift and when to reduce.
There are variations of LR parsers like SLR, LALR, and Canonical LR.

Shift-Reduce Conflicts
A key challenge in shift-reduce parsing is resolving conflicts that occur when the parser cannot decide whether
to shift or reduce.

Shift-Reduce Conflict:
Occurs when the parser cannot decide whether to shift a token onto the stack or reduce a portion of the
stack. This conflict typically arises when the grammar is ambiguous or when it does not follow the required
restrictions for shift-reduce parsing.
Example:

E→E+E
E → id

For input id + id + id, there is a conflict:


Once the stack holds E + E and the lookahead is the second +, the parser can either shift the + or reduce
E + E to E; both choices lead to valid parses because the grammar is ambiguous.

Reduce-Reduce Conflict:
Occurs when there are multiple applicable reduce actions.
This conflict arises when the parser cannot determine which of several production rules to apply for reducing
the stack's top portion.
Example:

A→B
A→C
B → id
C → id

With id on top of the stack, the parser can reduce using either B → id or C → id, and it cannot decide which production to apply.

Resolving Conflicts
To resolve conflicts in shift-reduce parsing, there are a few strategies:
Disambiguating Grammar: Modify the grammar to make it unambiguous, ensuring that there is a clear
decision between shifting and reducing at each step.
Using Lookahead: The parser can use a lookahead token to decide which operation to apply, particularly in LR
parsers.
Precedence and Associativity Rules: For operators (e.g., + and *), define precedence (which operator has
higher priority) and associativity (whether operators are left- or right-associative) to resolve shift-reduce
conflicts.
Example: * has higher precedence than +, so when encountering both + and *, the parser prefers to shift *
before reducing the +.
LR Parsers
LR Parsers are a type of bottom-up parser that reads the input from left to right and produces a rightmost
derivation in reverse. LR parsers are among the most powerful deterministic parsers because they can handle a
wide range of context-free grammars (CFGs), including many complex grammars that simpler parsers cannot;
like all deterministic parsers, however, they cannot parse ambiguous grammars without extra disambiguation rules.

L stands for Left to Right (the direction of reading the input).

R stands for Rightmost derivation (the direction of applying productions).

Key Features of LR Parsers


Deterministic: LR parsers can make decisions based on the current input symbol and the current state of the
parser, using one token of lookahead.

Efficient: LR parsers use lookahead and parse tables to guide the parsing process, making them efficient in
handling complex grammars.

Comprehensive: LR parsers can handle a large set of grammars that cannot be handled by recursive descent
parsers.

Types of LR Parsers

Canonical LR Parser:
The most general form of LR parsing.
It uses a large parsing table with many states.
Suitable for parsing a wide range of grammars but memory intensive.

SLR (Simple LR) Parser:


A simplified version of the canonical LR parser.
It uses a smaller parsing table than the canonical LR parser.
It is more efficient than the canonical LR parser but less powerful.

LALR (Look-Ahead LR) Parser:


A compromise between canonical LR and SLR parsers.
It reduces the number of states compared to canonical LR while maintaining more power than SLR parsers.
Commonly used in practical applications like Yacc.

Basic Operations of LR Parsing


Shift: The parser shifts the next token from the input to the stack.

Reduce: The parser reduces the top of the stack according to the production rule.

Accept: The parser successfully finishes if the start symbol is at the top of the stack, and the input is
exhausted.

Error: The parser encounters an error when no valid shift or reduce actions are possible.
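These four operations form a simple driver loop over two tables. The sketch below hard-codes the SLR(1) tables for the grammar E → E + T | T, T → id (the state numbering is my own, chosen for illustration) and runs the standard LR algorithm over them:

```python
# Table-driven LR parser driver for the grammar:
#   0: E -> E + T
#   1: E -> T
#   2: T -> id
# ACTION and GOTO are the SLR(1) tables for this grammar;
# state numbers are arbitrary labels.
PRODS = [("E", 3), ("E", 1), ("T", 1)]  # (head, length of right-hand side)

ACTION = {
    (0, "id"): ("shift", 3),
    (1, "+"): ("shift", 4), (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 1), (2, "$"): ("reduce", 1),
    (3, "+"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, "id"): ("shift", 3),
    (5, "+"): ("reduce", 0), (5, "$"): ("reduce", 0),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 5}

def lr_parse(tokens):
    states = [0]
    tokens = tokens + ["$"]
    i = 0
    while True:
        act = ACTION.get((states[-1], tokens[i]))
        if act is None:
            return False                    # error: no valid action
        op, arg = act
        if op == "shift":
            states.append(arg)
            i += 1
        elif op == "reduce":
            head, length = PRODS[arg]
            del states[-length:]            # pop one state per RHS symbol
            states.append(GOTO[(states[-1], head)])
        else:
            return True                     # accept

print(lr_parse(["id", "+", "id"]))  # True
```

The same driver works unchanged for SLR, LALR, and canonical LR parsers; only the contents of the ACTION and GOTO tables differ.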
Construction of SLR Parser
An SLR (Simple LR) Parser is a type of LR parser that uses a simplified approach to constructing the parsing
table compared to the canonical LR parser. SLR parsers are less powerful than canonical LR parsers but are
more space-efficient and easier to construct.
SLR parsers are built from the canonical collection of LR(0) items; they use FOLLOW sets to decide when to
reduce, which gives a simpler construction for the action and goto tables than full LR(1).

SLR Parsing Table


An SLR parser uses two primary components:
Action Table: This table contains the actions that the parser should take based on the current state and the
lookahead symbol.
Shift (S): The parser shifts the lookahead symbol onto the stack.
Reduce (R): The parser reduces the top of the stack based on a production rule.
Accept: The input has been successfully parsed.
Error: There is no valid action for the current state and symbol.
Goto Table: This table is used to transition to a new state based on the current state and a non-terminal that
has been reduced.
This is used when a reduction is made, and the parser needs to know where to go next.

Steps to Construct an SLR Parser


Construct the Canonical Collection of LR(0) Items:
For a given grammar, we start by constructing a set of LR(0) items for each production rule.
An LR(0) item represents a production rule with a dot (•) indicating how much of the production has been
parsed.
Example of LR(0) items:

A → •B
A → B•

The dot indicates the current position in the rule: everything to its left has already been recognized, and everything to its right is still expected.
Create the States of the Parser:

Each set of LR(0) items forms a state of the parser.


Starting from the initial state (which contains the initial item), the parser applies closure and goto operations
to build the entire set of states.

Define the Action Table:


For each state, decide for each lookahead symbol whether to shift, reduce, or accept, based on the items in the state:
If the state contains an item A → α • a β, where a is a terminal equal to the lookahead, the parser shifts.
If the state contains a completed item A → α • and the lookahead is in FOLLOW(A), the parser reduces by A → α.
If the state contains the item S' → S • and the lookahead is the end-of-input symbol $, the parser accepts.

Define the Goto Table:


For each state and non-terminal, compute the goto based on the LR(0) items and the closure operation.
Conflict Resolution:
The SLR parser uses follow sets to handle reduce actions. If multiple reductions are possible, the parser
chooses the reduction based on the follow set of the non-terminal.
If conflicts arise, the grammar is not SLR(1), and an error occurs.
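Since reduce actions hinge on FOLLOW sets, it helps to see how they are computed. The sketch below (my own simplification) computes FOLLOW sets for a grammar with no ε-productions, which lets FIRST of a symbol be read directly from the leading symbols of its production bodies; the grammar used is S → A A, A → a | b:

```python
# FOLLOW-set computation for a grammar with no epsilon-productions.
# Grammar: S -> A A, A -> a | b
GRAMMAR = {"S": [["A", "A"]], "A": [["a"], ["b"]]}
START = "S"

def first(sym):
    # FIRST of a terminal is itself; for a nonterminal, the leading
    # symbols of its bodies. This shortcut is only valid when there are
    # no epsilon rules and those leading symbols are terminals, as here.
    if sym not in GRAMMAR:
        return {sym}
    return {body[0] for body in GRAMMAR[sym]}

def follow_sets():
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add("$")  # end-of-input follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in GRAMMAR:
                        continue  # only nonterminals have FOLLOW sets
                    # symbol after sym contributes FIRST; at the end of
                    # a body, FOLLOW(head) contributes instead
                    new = first(body[i + 1]) if i + 1 < len(body) else follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

print({nt: sorted(f) for nt, f in follow_sets().items()})
```

For this grammar, FOLLOW(S) = {$} and FOLLOW(A) = {a, b, $}: an A can be followed by the second A (which starts with a or b) or by the end of input.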
Example of Constructing an SLR Parser
Consider the following simple grammar:

S → AA
A→a|b

Step 1: Construct LR(0) Items: Start with the augmented grammar:

S' → • S
S → • AA
A→•a
A→•b

Apply the closure operation to find the initial set of items.

Step 2: Create the States


From the initial set of items, generate the states of the parser. For each state, compute the goto for each
terminal and non-terminal symbol.

Step 3: Define the Action Table


Based on the states and lookahead symbols, define the shift, reduce, and accept actions.

Step 4: Define the Goto Table


For each state and non-terminal, define the goto table to transition to new states when a reduction occurs.
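Steps 1 and 2 can be mechanized directly. The sketch below computes the LR(0) closure of the initial item and a goto transition for the example grammar S → A A, A → a | b (the item representation and function names are my own):

```python
# LR(0) closure and goto for the augmented grammar:
#   S' -> S,  S -> A A,  A -> a | b
# An item is a triple (head, body, dot_position).
GRAMMAR = {"S'": [("S",)], "S": [("A", "A")], "A": [("a",), ("b",)]}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:
                # dot sits before a nonterminal: add all its productions
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    # advance the dot over `symbol`, then close the result
    moved = {(h, b, d + 1) for h, b, d in items
             if d < len(b) and b[d] == symbol}
    return closure(moved)

I0 = closure({("S'", ("S",), 0)})
for item in sorted(I0):
    print(item)
```

I0 contains the four items S' → • S, S → • A A, A → • a, and A → • b; goto(I0, "A") yields the state with S → A • A plus the two reintroduced A-items.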

Advantages of SLR Parsers


Simple to Construct: The construction of SLR parsing tables is simpler than other LR parsers, like Canonical LR.
Space Efficient: The SLR parser uses smaller parsing tables compared to the Canonical LR parser, making it
more space-efficient.
Efficient for Simple Grammars: SLR parsers work well for grammars that are not too complex and do not have
many shift-reduce conflicts.

Disadvantages of SLR Parsers


Limited Grammar Class: SLR parsers can only handle a subset of grammars that Canonical LR parsers can
handle. Some grammars may cause shift-reduce conflicts that cannot be resolved with an SLR parser.
Less Powerful: Compared to Canonical LR parsers and LALR parsers, SLR parsers are less powerful and may not
handle all types of context-free grammars.
Canonical LR Parsing
Canonical LR Parsing is the most general type of LR parser. It uses a complete set of parsing states, providing a
more powerful and flexible method for parsing context-free grammars (CFGs).
Canonical means that the parser considers every possible state of the grammar to make decisions.
LR(1) refers to Left-to-right scanning of input and Rightmost derivation in reverse, with 1 token of lookahead.

Construction of Canonical LR Parsing Tables


A Canonical LR parser uses two primary tables to guide parsing:
Action Table: Specifies the actions to take based on the current state and the lookahead token.

Shift (S): Move the next token onto the stack.


Reduce (R): Apply a production rule to reduce the top part of the stack.
Accept: Successfully parsed the input.
Error: The parser cannot proceed.
Goto Table: Specifies the next state to transition to after reducing a non-terminal.

This table is used when a reduction occurs and the parser needs to know where to go next.
Steps to Construct Canonical LR Parsing Table:

Construct the augmented grammar: Add a new start symbol (e.g., S' → S).
Compute the LR(0) items: Create the initial item and apply the closure operation to form the set of all items.

Generate the states: For each item set, apply the goto operation with all terminals and non-terminals.

Define the Action Table: For each state and terminal, determine whether the action is a shift, reduce, or
accept.
Define the Goto Table: For each state and non-terminal, compute the next state after a reduction.

Example of Canonical LR Parsing Table:

Consider the following grammar:

S→AB
A→a
B→b

Augmented Grammar:

S' → S
S→AB
A→a
B→b

LR(0) Items:

S' → • S
S→•AB
A→•a
B→•b
State I0 (the start state) contains the items:

S' → • S
S → • A B
A → • a

From I0: shifting a leads to a state containing A → a •; a goto on A leads to a state containing S → A • B and
B → • b; shifting b from there gives B → b •; and a goto on B gives S → A B •. The state reached by a goto on
S contains S' → S • and accepts on $.
Action Table (sketch):
In I0, on lookahead a: shift.
In the state containing S → A • B, on lookahead b: shift.
In the state containing A → a •, on lookahead b: reduce by A → a.
In the state containing B → b •, on lookahead $: reduce by B → b.
In the state containing S → A B •, on lookahead $: reduce by S → A B.
Goto Table:
From I0, on A: go to the state containing S → A • B.
From the state containing S → A • B, on B: go to the state containing S → A B •.

Advantages of Canonical LR Parsers


Handles all LR(1) grammars: Canonical LR parsers can handle all grammars that are LR(1), which is the largest
class of grammars for LR parsing.
Unambiguous: Since every decision is based on the current state and the lookahead symbol, there are no
ambiguities.
Disadvantages of Canonical LR Parsers
Large parsing tables: The number of states and transitions can grow quickly, making the parser memory-
intensive and slower to construct.
Difficult to implement: The construction of canonical LR parsing tables is complex due to the large number of
states and transitions.

LALR Parsing


LALR (Look-Ahead LR) Parsing is a variation of Canonical LR Parsing that attempts to reduce the size of the
parsing tables while maintaining the power of the LR(1) parser.
LALR is almost identical to Canonical LR but uses state merging to combine similar states in the parsing table.
LALR parsers can handle a larger set of grammars than SLR parsers but are more memory-efficient than
Canonical LR parsers.
LALR Parsing Table Construction
The LALR parsing table is constructed by merging states from the canonical LR table that have the same core
but possibly different lookahead symbols. These states are merged to create a more compact table.
Steps to Construct an LALR Parsing Table:
Construct the Canonical LR Parsing Table: First, construct the canonical LR parsing table as usual.
Merge States: Identify states that have the same core (i.e., the same set of LR(0) items without the lookahead
symbols). Merge these states into a single state.
Adjust the Action and Goto Tables: After merging the states, update the action and goto tables accordingly.
This may require resolving conflicts between the merged states.
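Step 2, the state merging, can be illustrated in isolation. In the sketch below (a simplified representation of my own), an LR(1) state is a pair of its core (a frozenset of LR(0) items) and a mapping from each item to its lookahead set; states with equal cores are merged by uniting their lookaheads:

```python
# Merge LR(1) states that share the same core (same LR(0) items),
# uniting their lookahead sets -- the key step of LALR construction.
# A state is a pair (core, lookaheads): `core` is a frozenset of
# LR(0) items, `lookaheads` maps each item to its lookahead set.
def merge_by_core(states):
    merged = {}
    for core, lookaheads in states:
        target = merged.setdefault(core, {item: set() for item in core})
        for item, las in lookaheads.items():
            target[item] |= las  # unite lookaheads item by item
    return merged

# Two hypothetical LR(1) states sharing the core "A -> a ."
core = frozenset({("A", ("a",), 1)})
s1 = (core, {("A", ("a",), 1): {"+"}})
s2 = (core, {("A", ("a",), 1): {"$"}})

merged = merge_by_core([s1, s2])
print(merged)  # one state; its item carries the united lookaheads + and $
```

After merging, the two canonical LR states collapse into one LALR state whose item A → a • carries both lookaheads; a reduce-reduce conflict would appear only if two different completed items in the merged state ended up sharing a lookahead.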
Example of LALR Parsing Table
Consider the same grammar:

S→AB
A→a
B→b

The canonical LR parsing table would generate states based on individual item sets, while the LALR parser
would merge states that have identical cores and may differ only in their lookahead symbols.
Canonical LR vs. LALR Parsers
Feature                    Canonical LR Parser                    LALR Parser
Number of States           Larger (more states)                   Smaller (state merging reduces table size)
Parsing Power              Handles all LR(1) grammars             Handles most, but not all, LR(1) grammars
Memory Usage               Higher (more states and transitions)   Lower (compact tables due to state merging)
Implementation Complexity  More complex (larger tables)           Easier to implement than Canonical LR
Conflict Handling          No conflicts for LR(1) grammars        Merging can introduce reduce-reduce conflicts

Parsing with Ambiguous Grammar


What is Ambiguous Grammar?
A grammar is said to be ambiguous if there exists at least one string in the language generated by the
grammar that can be derived in more than one way. In other words, an ambiguous grammar allows multiple
leftmost derivations, rightmost derivations, or parse trees for the same input string.
Ambiguity can cause issues during parsing because it makes it unclear how the parser should proceed when
multiple interpretations are possible.

Example of Ambiguous Grammar:


Consider the following grammar for arithmetic expressions:

E→E+E
E→E*E
E→(E)
E → id

For the string id + id * id, this grammar allows two different derivations:

(id + id) * id (first +, then *)


id + (id * id) (first *, then +)

This means the grammar is ambiguous, as it generates two different parse trees for the same input.

Impact of Ambiguous Grammar on Parsing


Multiple Parse Trees: The parser may encounter situations where it has to choose between different parse
trees. This can cause confusion in compiler construction since the meaning of the program (syntax and
semantics) depends on the structure of the parse tree.

Inefficiency: Handling ambiguity can lead to an inefficient parser that may need to explore multiple parsing
paths or backtrack, resulting in increased computation time and memory usage.

Semantics Confusion: Ambiguity can create problems not just in parsing, but in semantic analysis. Different
interpretations of the same syntax could lead to conflicting semantic results, making it difficult to generate
correct code or analyze the program.
How to Handle Ambiguous Grammars in Parsing
There are several strategies used to handle or resolve ambiguity during parsing:
Precedence and Associativity Rules:
Precedence determines which operators should be applied first. For example, multiplication (*) has higher
precedence than addition (+).
Associativity defines the direction in which operators are applied when they have the same precedence (left-
to-right or right-to-left).

For example, in the ambiguous grammar mentioned above, you can introduce precedence rules to resolve
whether + or * should take precedence.

Example of resolving ambiguity using precedence:

E → E + E (low precedence)
E → E * E (high precedence)
E → ( E ) (parentheses)
E → id (identifier)

By introducing precedence and associativity, you can guide the parser to choose the correct derivation.
Transform the Grammar: One approach is to rewrite the grammar to remove the ambiguity. This may involve
breaking down production rules or introducing new non-terminals to eliminate multiple derivations for a given
string.
For example, the ambiguous expression grammar can be rewritten as follows:

E→E+T
E→T
T→T*F
T→F
F→(E)
F → id

In this transformed grammar, + and * are handled in separate non-terminal rules, making the grammar
unambiguous.

Use a Parser with Backtracking: Some parsing strategies (e.g., recursive descent parsers) can use backtracking
to explore all possible derivations when ambiguity arises. However, this approach is inefficient for large
grammars and can lead to performance issues.

Earley Parser and GLR Parser: These parsers are capable of handling ambiguous grammars by parsing all
possible parse trees. However, they are more complex to implement and may still require additional
mechanisms for resolving ambiguity.

Example of Resolving Ambiguity in Arithmetic Expressions


Consider the following original grammar:

E→E+E
E→E*E
E→(E)
E → id
For the expression id + id * id, this grammar is ambiguous because we can derive two different parse
trees:

(id + id) * id
id + (id * id)

To resolve this ambiguity, we can introduce precedence rules:


Multiplication has higher precedence than addition:

E→E+T
T→T*F
T→F
F→(E)
F → id

Now, the parse tree for id + id * id is unique, and the grammar is unambiguous.
Parse tree:

          E
        / | \
       E  +  T
       |    /|\
       T   T * F
       |   |   |
       F   F   id
       |   |
       id  id

This tree represents the correct interpretation where id * id is evaluated first, then added to id.
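The unambiguous grammar translates directly into a small parser. The sketch below uses the standard iterative form of the left-recursive rules (E → T ('+' T)*, T → F ('*' F)*), which preserves left associativity while allowing top-down parsing, and builds the tree as nested tuples:

```python
# Parser for the unambiguous expression grammar:
#   E -> E + T | T      (written iteratively as: E -> T ('+' T)*)
#   T -> T * F | F      (written iteratively as: T -> F ('*' F)*)
#   F -> ( E ) | id
def parse(tokens):
    toks = tokens + ["$"]
    pos = [0]

    def peek():
        return toks[pos[0]]

    def eat(expected):
        assert peek() == expected, f"expected {expected}, got {peek()}"
        pos[0] += 1
        return expected

    def parse_F():
        if peek() == "(":
            eat("(")
            node = parse_E()
            eat(")")
            return node
        return eat("id")

    def parse_T():
        node = parse_F()
        while peek() == "*":        # * binds tighter: handled lower down
            eat("*")
            node = ("*", node, parse_F())
        return node

    def parse_E():
        node = parse_T()
        while peek() == "+":
            eat("+")
            node = ("+", node, parse_T())
        return node

    tree = parse_E()
    assert peek() == "$", "trailing input"
    return tree

# id + id * id now parses uniquely: * binds tighter than +
print(parse(["id", "+", "id", "*", "id"]))  # ('+', 'id', ('*', 'id', 'id'))
```

Because the grammar is unambiguous, every input has exactly one tree; the multiplication ends up nested inside the addition, matching the parse tree above.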

Ambiguity in Context-Free Grammars (CFGs)


It is important to note that ambiguity in grammars is a property of the grammar itself, not the parser. A
context-free grammar (CFG) is ambiguous if there exists at least one string in the language it generates that
has multiple valid parse trees.
Ambiguity is a serious issue in parsers because it can cause:
Multiple interpretations: Different derivations can lead to different meanings of the same input.
Difficulty in optimizing code: Ambiguity makes it harder to generate efficient machine code or intermediate
representations because different interpretations might affect code generation.

Ambiguity in LL and LR Parsers


LL Parsers: An LL parser (Left-to-right scan, Leftmost derivation) cannot handle ambiguous grammars: no
ambiguous grammar is LL(k) for any amount of lookahead, because the parser must commit to a single
production at each step.
LR Parsers: An LR parser (Left-to-right scan, Rightmost derivation in reverse) handles a larger class of
grammars, but no ambiguous grammar is LR(k) either; ambiguity shows up as conflicts in the parsing table.
Parser generators typically resolve such conflicts with precedence and associativity declarations rather than
backtracking.
Operator Precedence Parsing
Operator Precedence Parsing is a type of bottom-up parsing used to handle expressions involving operators
like +, -, *, /, etc. It uses a precedence table to decide the order in which operators should be evaluated in the
input expression. This approach is especially useful for expressions with mixed operators, where operator
precedence and associativity determine the correct parse tree.

The idea behind operator precedence parsing is to use a stack-based parser that applies a set of precedence
rules to determine when to shift (move a symbol onto the stack) and when to reduce (apply a production
rule).

How Operator Precedence Parsing Works


Operator precedence parsing relies on the concept of precedence relations between operators in the
grammar. The parser builds a stack of operands and operators, gradually applying reductions according to the
precedence rules until it reaches the final result.
Precedence Relation Table:
<: The operator on the left has lower precedence than the operator on the right.
>: The operator on the left has higher precedence than the operator on the right.
=: The operators on both sides have equal precedence.

Steps in Operator Precedence Parsing


Build the Precedence Table: The precedence table defines the precedence and associativity of operators. This
table helps in deciding whether to shift or reduce based on the current token and the top of the stack.

Initialize the Stack: The stack starts empty, and the input expression is processed one token at a time.

Shift Operation: If the current token is an operand (e.g., an identifier or number) or if no precedence conflict
exists between the operator and the top of the stack, shift the token onto the stack.

Reduce Operation: When an operator has lower precedence than the operator on the top of the stack, or
when the stack contains a valid production for reduction, apply reduce to collapse a valid portion of the
expression into a non-terminal.

End Condition: The parsing continues until the entire input expression is processed, and the final reduced
form is produced.

Example of Operator Precedence Parsing


Consider the following expression:

id + id * id

We will parse this expression using the following operator precedence rules:
* has higher precedence than +.
+ and * are left-associative (i.e., operators of the same precedence are evaluated from left to right).
Precedence Relation Table (row = topmost terminal on the stack, column = input symbol):

         +     *     id    $
+        >     <     <     >
*        >     >     <     >
id       >     >           >
$        <     <     <

A blank entry means a syntax error; when both symbols are $, the input is accepted.
Now, we will parse the expression using the relation table. At each step, the parser compares the topmost
terminal on the stack (non-terminals are ignored for this comparison) with the current input symbol.
Initial Stack: $ with the input id + id * id $:

Step 1: $ < id, so shift id.


Stack: $ id

Step 2: id > +, so reduce: the id is a complete operand and is reduced to E.
Stack: $ E

Step 3: $ < +, so shift +.


Stack: $ E +

Step 4: + < id, so shift id.
Stack: $ E + id

Step 5: id > *, so reduce id to E.


Stack: $ E + E

Step 6: + < *, so shift * (this is where * takes precedence over +).
Stack: $ E + E *

Step 7: * < id, so shift id; then id > $, so reduce id to E.
Stack: $ E + E * E

Step 8: * > $, so reduce E * E to E.
Stack: $ E + E

Step 9: + > $, so reduce E + E to E.
Stack: $ E

Accept: the topmost terminal is $ and the input is exhausted, so the expression is parsed with id * id
grouped first.

Precedence Parsing Algorithm


The precedence parser uses the following algorithm:
Initialize:
Start with an empty stack and the input string.
Repeat:
Shift the next token onto the stack if it is an operand or operator that can be shifted.
Reduce the top of the stack using the precedence rules if applicable.
If the stack contains a complete valid expression, accept the input as valid.
Termination:
The process terminates when all tokens are shifted and reduced, and the final result is obtained.
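The algorithm above, together with the relation table, is enough for a working sketch. The code below hard-codes the precedence relations for +, * and id (an encoding of my own; 'E' stands for any reduced operand and is invisible to the relation lookup). It is a recognition sketch that trusts reasonably well-formed input rather than a full validator:

```python
# Operator-precedence parser sketch for expressions over +, * and id.
# PREC[(a, b)] relates the topmost terminal on the stack (a) to the
# current input symbol (b); nonterminals ('E') are skipped over.
PREC = {
    ("+", "+"): ">", ("+", "*"): "<", ("+", "id"): "<", ("+", "$"): ">",
    ("*", "+"): ">", ("*", "*"): ">", ("*", "id"): "<", ("*", "$"): ">",
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("$", "+"): "<", ("$", "*"): "<", ("$", "id"): "<",
}

def op_prec_parse(tokens):
    stack = ["$"]
    inp = tokens + ["$"]
    i = 0

    def top_terminal():
        for sym in reversed(stack):
            if sym != "E":
                return sym

    while True:
        a, b = top_terminal(), inp[i]
        if a == "$" and b == "$":
            return True                       # accept
        rel = PREC.get((a, b))
        if rel is None:
            return False                      # error: no relation defined
        if rel in ("<", "="):
            stack.append(b)                   # shift
            i += 1
        else:                                 # '>' : reduce a handle
            while True:
                popped = stack.pop()
                if popped != "E" and PREC.get((top_terminal(), popped)) == "<":
                    # left end of the handle found; an operand directly
                    # below the popped terminal also belongs to it
                    if stack[-1] == "E":
                        stack.pop()
                    break
            stack.append("E")                 # the handle reduces to E

print(op_prec_parse(["id", "+", "id", "*", "id"]))  # True
```

Running it on id + id * id reproduces the step-by-step trace above: the two id * id symbols are reduced to E before the + is, because + < * forces the * onto the stack first.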

Example Grammar for Precedence Parsing

Consider the following arithmetic expression grammar:

E→E+E
E→E*E
E→(E)
E → id

For operator precedence parsing, this grammar is modified to handle operator precedence explicitly.
We can define precedence rules for + and * as:
* has higher precedence than +.
Parentheses () are used to explicitly group operations and have the highest precedence.
Grammar Modifications:
For addition and multiplication:

E→E+T
E→T
T→T*F
T→F
F → id
F→(E)

Precedence rules:
+ is left-associative and has lower precedence than *.
* is left-associative and has higher precedence than +.

Advantages and Disadvantages of Operator Precedence Parsing

Advantages:
Simple to implement: Operator precedence parsing is relatively easy to implement for simple arithmetic
expressions and provides efficient handling of operator precedence.
No need for backtracking: The algorithm is deterministic, meaning it does not require backtracking like
recursive descent parsers.
Efficient: Operator precedence parsers work in linear time (O(n)), making them fast for parsing arithmetic
expressions.

Disadvantages:
Limited grammar coverage: Operator precedence parsing can only handle grammars that can be expressed
with precedence relations. It cannot handle more complex grammars, such as those requiring context-
sensitive rules.
Complexity with parentheses: Handling parentheses correctly requires special rules, and the grammar must be
modified to account for them explicitly.
Error handling: It is not as robust as other parsers (e.g., LL or LR parsers) in handling errors or unexpected
input.

You might also like