CS3501 – COMPILER DESIGN
S.P Audline Beena, AP / SMIT
UNIT II SYNTAX ANALYSIS
Role of Parser – Grammars – Context-Free Grammars – Writing a Grammar – Top-Down Parsing: General Strategies
– Recursive Descent Parser – Predictive Parser – LL(1) Parser – Shift-Reduce Parser – LR Parser – LR(0) Items –
Construction of SLR Parsing Table – Introduction to LALR Parser – Error Handling and Recovery in Syntax
Analyzer – YACC Tool – Design of a Syntax Analyzer for a Sample Language
ROLE OF PARSER
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that the
string can be generated by the grammar for the source language.
It reports any syntax errors in the program. It also recovers from commonly occurring errors so that it
can continue processing its input.
Role of the Parser:
Parser builds the parse tree.
Parser verifies the structure generated by the tokens based on the grammar (performs context free syntax
analysis).
Parser helps to construct intermediate code.
Parser produces appropriate error messages.
Parser performs error recovery.
Issues:
Parser cannot detect errors such as:
Variable re-declaration
Use of a variable before initialization
Data type mismatch for an operation
GRAMMARS
Constructs that begin with keywords like while or int are relatively easy to parse, because the keyword
guides the choice of the grammar production that must be applied to match the input. The grammar for
expressions presents more of a challenge, because of the associativity and precedence of operators.
In the following expression grammar, E represents expressions consisting of terms separated by + signs,
T represents terms consisting of factors separated by * signs, and F represents factors that can be either
parenthesized expressions or identifiers:
E → E + T | T
T → T * F | F
F → (E) | id
Error Handling:
Programs can contain errors at many different levels.
For example:
Lexical, such as misspelling an identifier, keyword, or operator.
Syntactic, such as misplaced semicolons, extra or missing braces, or a case statement without an
enclosing switch statement.
Semantic, such as type mismatches between operators and operands.
Logical, such as an assignment operator = instead of the comparison operator ==, or an infinitely
recursive call.
Functions of error handler:
It should report the presence of errors clearly and accurately.
It should recover from each error quickly enough to be able to detect subsequent errors.
It should not significantly slow down the processing of correct programs.
Error-Recovery Strategies:
Once an error is detected, the parser should recover from the error. The following recovery strategies are used:
Panic-mode: On discovering an error, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such
as a semicolon or }, whose role in the source program is clear and unambiguous.
Phrase-level: When an error is identified, a parser may perform local correction on the remaining input;
that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or
insert a missing semicolon. Phrase-level replacement has been used in several error-repairing compilers,
as it can correct any input string. Its major drawback is the difficulty it has in coping with situations in
which the actual error has occurred before the point of detection.
Error-productions: The common errors that might be encountered are anticipated, and the grammar for
the language at hand is augmented with productions that generate the erroneous constructs.
Global-correction: Ideally, a compiler should make as few changes as possible in processing an incorrect
input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost
correction. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for
a related string y, such that the number of insertions, deletions, and changes of tokens required to
transform x into y is as small as possible. These methods are in general too costly to implement in terms
of time and space.
CONTEXT-FREE GRAMMARS
Grammars are used to systematically describe the syntax of programming language constructs like
expressions and statements. Using a syntactic variable stmt to denote statements and a variable expr to
denote expressions, the production
stmt → if ( expr ) stmt else stmt
specifies the structure of this form of conditional statement.
The Formal Definition of a Context-Free Grammar
A context-free grammar G is defined by the 4-tuple G = (V, T, P, S), where
1. V is a finite set of non-terminals (variables).
2. T is a finite set of terminals.
3. P is a finite set of productions, which specify the manner in which the terminals and non-terminals
can be combined to form strings.
4. S is the start symbol (S ∈ V).
Notational Conventions
These symbols are terminals:
Lowercase letters early in the alphabet, such as a, b, c.
Operator symbols such as +, *,- , / and so on.
Punctuation symbols such as parentheses, comma, and so on.
The digits 0, 1, ..., 9.
Boldface strings such as id or if, each of which represents a single terminal symbol.
These symbols are non-terminals:
Uppercase letters early in the alphabet, such as A, B, C.
The letter S, which, when it appears, is usually the start symbol.
2126 – Sri Muthukumaran Institute of Technology
CS3501 – COMPILER DESIGN S.P Audline Beena, AP / SMIT
Lowercase, italic names such as expr or stmt.
Example:
Start symbol: E
Terminals: +, -, *, /, (, ), id
Non-terminals: E, T, F
Derivations:
Derivation is a process that generates a valid string with the help of the grammar by replacing the
non-terminals on the left with the string on the right side of the production.
Example: Consider the following grammar for arithmetic expressions:
E → E+E | E*E | ( E ) | - E | id
To generate the valid string -(id+id) from the grammar, the steps are:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Types of derivations:
The two types of derivation are:
Leftmost derivation: In a leftmost derivation, the leftmost non-terminal in each sentential form is always
chosen first for replacement.
Rightmost derivation: In a rightmost derivation, the rightmost non-terminal in each sentential form is
always chosen first for replacement. Rightmost derivations are sometimes called canonical derivations.
Example: Given grammar G: E → E+E | E*E | ( E ) | - E | id
Sentence to be derived: -(id+id)
Leftmost derivation: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Rightmost derivation: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
Parse Trees and Derivations:
A parse tree is a graphical representation of a derivation that filters out the order in which productions
are applied to replace nonterminals.
Each interior node of a parse tree represents the application of a production. The interior node is labeled
with the nonterminal A in the head of the production; the children of the node are labeled, from left to
right, by the symbols in the body of the production by which this A was replaced during the derivation.
Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Equivalently,
a grammar G is ambiguous if it has more than one leftmost derivation (or more than one rightmost
derivation) for at least one string.
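The ambiguity of the grammar E → E+E | E*E | id can be demonstrated mechanically. The sketch below (a hypothetical helper, not from the text) enumerates every parse tree for the sentence id + id * id and finds exactly two:

```python
def parses(toks):
    """Return all parse trees for toks under E -> E+E | E*E | id.
    A tree is either 'id' or (operator, left_subtree, right_subtree)."""
    trees = []
    if toks == ("id",):
        trees.append("id")
    for i in range(1, len(toks) - 1):            # try every operator as the root
        if toks[i] in ("+", "*"):
            for left in parses(toks[:i]):
                for right in parses(toks[i + 1:]):
                    trees.append((toks[i], left, right))
    return trees

trees = parses(("id", "+", "id", "*", "id"))
# Two distinct trees: one groups id*id under the +, the other groups id+id under the *.
```

The two trees assign different structures (and hence potentially different meanings) to the same sentence, which is exactly why the grammar is ambiguous.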
WRITING A GRAMMAR
Grammars are capable of describing most, but not all, of the syntax of programming languages. For
instance, the requirement that identifiers be declared before they are used cannot be described by a
context-free grammar.
Therefore, the sequences of tokens accepted by a parser form a superset of the programming language;
subsequent phases of the compiler must analyze the output of the parser to ensure compliance with rules
that are not checked by the parser.
Eliminating Ambiguity
Ambiguity in a grammar that produces more than one parse tree for a leftmost or rightmost derivation
can be eliminated by rewriting the grammar. Consider the following "dangling else" grammar:
stmt → if expr then stmt | if expr then stmt else stmt | other
Here "other" stands for any other statement. According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has a single, unambiguous parse tree.
But the grammar is ambiguous, since the string
if E1 then if E2 then S1 else S2
has the following two parse trees for a leftmost derivation:
Elimination of Left Recursion:
A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some
string α. Top-down parsing methods cannot handle left-recursive grammars, so left recursion must be
eliminated first.
A simple rule for eliminating immediate left recursion of the form A → Aα | β:
We may replace it with
A → βA'
A' → αA' | ε
Example: Consider the following left-recursive grammar for arithmetic expressions:
E → E + T | T
T → T * F | F
F → (E) | id
After eliminating immediate left recursion, the grammar becomes:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → (E) | id
Left factoring
Left factoring is a process of factoring out the common prefixes of two or more production alternates
for the same nonterminal.
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive
or top-down parsing.
Consider following grammar:
stmt -> if expr then stmt else stmt | if expr then stmt
On seeing the input token if, it is not clear to the parser which production to use. So we perform left
factoring, where the general form is: A → αβ1 | αβ2 is replaced by
A → αA'
A' → β1 | β2
For the grammar above, left factoring gives:
stmt → if expr then stmt stmt'
stmt' → else stmt | ε
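A one-pass sketch of this transformation follows; the list-of-symbols encoding of production bodies and the primed name for the new nonterminal are illustrative assumptions:

```python
from collections import defaultdict

def left_factor(head, alts):
    """One round of left factoring: group alternatives by their first symbol
    and pull the longest common prefix out of each group of two or more."""
    groups = defaultdict(list)
    for alt in alts:
        groups[alt[0]].append(alt)
    prods = {head: []}
    for group in groups.values():
        if len(group) == 1:
            prods[head].append(group[0])
            continue
        k = 0                                    # longest common prefix length
        while all(len(a) > k and a[k] == group[0][k] for a in group):
            k += 1
        new = head + "'"
        prods[head].append(group[0][:k] + [new])
        prods[new] = [a[k:] for a in group]      # [] plays the role of epsilon
    return prods

g = left_factor("stmt", [["if", "expr", "then", "stmt", "else", "stmt"],
                         ["if", "expr", "then", "stmt"]])
# stmt -> if expr then stmt stmt'   and   stmt' -> else stmt | epsilon
```

This one-pass sketch assumes at most one group needs factoring per nonterminal; a full implementation would repeat the transformation until no common prefixes remain.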
TOP DOWN PARSING
Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting
from the root and creating the nodes of the parse tree in preorder (depth-first).
Top down parsing can be viewed as finding a leftmost derivation for an input string.
Parsers are generally distinguished by whether they work top-down (start with the grammar's start
symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that
form the leaves of the parse tree and build the tree from the bottom).
Top-down parsers include recursive-descent and LL parsers, while the most common forms of bottom-up
parsers are LR parsers.
RECURSIVE DESCENT PARSER
These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which
production to apply for its nonterminal.
Terminals in the body of the production are matched to the input at the appropriate time, while
non-terminals in the body result in calls to their procedures.
General recursive-descent may require backtracking; that is, it may require repeated scans over the input.
However, backtracking is rarely needed to parse programming language constructs.
Example: Consider the grammar S → cAd, A → ab | a, and the input string w = cad.
Step 1:
Initially, create a tree with a single node labeled S. An input pointer points to 'c', the first symbol of w.
Expand the tree with the production of S.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second symbol
of w ‘a’ and consider the next leaf ‘A’.
Step 3:
The second symbol 'a' of w also matches the second leaf of the tree, so advance the input pointer to the
third symbol of w, 'd'. But the third leaf of the tree is b, which does not match the input symbol d.
Hence, discard the chosen production and reset the pointer to the second position. This is called backtracking.
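The three steps can be sketched directly in code, assuming the textbook grammar S → cAd, A → ab | a inferred from the steps above; each procedure for A yields every way its alternatives can match, so the caller can backtrack:

```python
def parse_A(s, i):
    """A -> ab | a: yield the end position for every alternative that matches,
    so the caller can backtrack from one alternative to the next."""
    if s[i:i + 2] == "ab":
        yield i + 2                      # first try A -> ab
    if s[i:i + 1] == "a":
        yield i + 1                      # on failure, fall back to A -> a

def parse_S(s, i):
    """S -> c A d: match 'c', then A (trying its alternatives), then 'd'."""
    if s[i:i + 1] == "c":
        for j in parse_A(s, i + 1):
            if s[j:j + 1] == "d":
                return j + 1
    return None

def accepts(s):
    return parse_S(s, 0) == len(s)

# accepts("cad") succeeds only after backtracking from A -> ab to A -> a.
```

On "cad", the first alternative A → ab fails at 'd', so the generator falls back to A → a and the parse succeeds, mirroring Steps 1-3.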
FIRST( ) & FOLLOW( ):
FIRST(α) is defined as the set of terminals that begin strings derived from α.
If α ⇒* ε, then ε is also in FIRST(α).
In predictive parsing, when we have A → α | β, if FIRST(α) and FIRST(β) are disjoint sets then we can
select the appropriate A-production by looking at the next input symbol.
FOLLOW(A), for any nonterminal A, is the set of terminals a that can appear immediately after A in some
sentential form.
Example: For the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id:
FIRST(F) = FIRST(T) = FIRST(E) = {(, id}. To see why, note that the two productions for F have bodies
that start with these two terminal symbols, id and the left parenthesis. T has only one production, and its
body starts with F. Since F does not derive ε, FIRST(T) must be the same as FIRST(F). The same
argument covers FIRST(E).
FIRST(E') = {+, ε }. The reason is that one of the two productions for E' has a body that begins with
terminal +, and the other's body is ε. Whenever a nonterminal derives ε, we place ε in FIRST for that
nonterminal.
FIRST(T') = {*, ε}. The reasoning is analogous to that for FIRST(E').
FOLLOW(E) = FOLLOW(E') = { ), $}. Since E is the start symbol, FOLLOW(E) must contain $. The
production body (E) explains why the right parenthesis is in FOLLOW(E) . For E' , note that this
nonterminal appears only at the ends of bodies of E-productions. Thus, FOLLOW(E') must be the same
as FOLLOW(E).
FOLLOW(T) = FOLLOW(T') = {+, ), $}. Notice that T appears in bodies only followed by E'. Thus,
everything except ε that is in FIRST(E') must be in FOLLOW(T); that explains the symbol +. Since E'
may derive ε, everything in FOLLOW(E'), namely ) and $, is also in FOLLOW(T).
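The sets above can be computed by a standard fixed-point iteration. The sketch below uses the string "eps" to stand for ε (an encoding chosen for illustration) and reproduces the sets just derived:

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],           # [] is the epsilon body
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NT = set(GRAMMAR)

def first_of(symbols, first):
    """FIRST of a string of grammar symbols."""
    out = set()
    for X in symbols:
        if X not in NT:                     # terminal: it begins the string
            out.add(X)
            return out
        out |= first[X] - {"eps"}
        if "eps" not in first[X]:
            return out
    out.add("eps")                          # every symbol can derive epsilon
    return out

def first_sets():
    first = {A: set() for A in NT}
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                f = first_of(alt, first)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first

def follow_sets(first):
    follow = {A: set() for A in NT}
    follow["E"].add("$")                    # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                for i, X in enumerate(alt):
                    if X not in NT:
                        continue
                    f = first_of(alt[i + 1:], first)
                    add = f - {"eps"}
                    if "eps" in f:          # what follows A also follows X
                        add |= follow[A]
                    if not add <= follow[X]:
                        follow[X] |= add
                        changed = True
    return follow

FIRST = first_sets()
FOLLOW = follow_sets(FIRST)
```

Running this reproduces FIRST(E') = {+, ε}, FIRST(T') = {*, ε}, FOLLOW(E) = {), $}, and FOLLOW(T) = {+, ), $}, matching the hand derivation above.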
PREDICTIVE PARSER – LL(1) PARSER
Predictive parsing is a special case of recursive descent parsing where no backtracking is required.
The key problem of predictive parsing is to determine the production to be applied for a non-terminal in
case of alternatives.
This parser looks up the production in parsing table.
The table-driven predictive parser has an input buffer, a stack, a parsing table, and an output stream.
Transition Diagrams for Predictive Parsers:
Transition diagrams are useful for visualizing predictive parsers.
For example, the transition diagrams for nonterminals E and E' of expression grammar
To construct the transition diagram from a grammar, first eliminate left recursion and then left factor the
grammar.
Transition diagrams for predictive parsers differ from those for lexical analyzers.
Parsers have one diagram for each nonterminal.
The labels of edges can be tokens or nonterminals.
A transition on a token (terminal) means that we take that transition if that token is the next input symbol.
Predictive Parsing Program:
The predictive parser has an input, a stack, a parsing Table and an output.
Input: Contains the string to be parsed, followed by right end marker $.
Stack: Contains a sequence of grammar symbols, preceded by $, the bottom of stack marker. Initially the stack
contains the start symbol of the grammar preceded by $.
Parsing Table: It is a two dimensional array M[A,a], where A is a non-terminal and a is a terminal or $.
Output: Gives the output whether the string is valid or not.
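The parsing program described above can be sketched as follows. The table M is written out by hand from the FIRST/FOLLOW sets of the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id; error entries are simply absent keys:

```python
# LL(1) parsing table M[A, a] for the expression grammar.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Table-driven predictive parse; returns True iff tokens derive from E."""
    stack = ["$", "E"]                       # bottom marker, then start symbol
    tokens = tokens + ["$"]                  # right end marker on the input
    i = 0
    while stack:
        top = stack.pop()
        if top == "$" and tokens[i] == "$":
            return True                      # stack and input both exhausted
        if top in NONTERMINALS:
            body = TABLE.get((top, tokens[i]))
            if body is None:
                return False                 # error entry in M[A, a]
            stack.extend(reversed(body))     # push body right-to-left
        elif top == tokens[i]:
            i += 1                           # match terminal, advance input
        else:
            return False                     # terminal mismatch
    return False
```

For instance, parse(["id", "+", "id", "*", "id"]) accepts, while parse(["id", "+"]) rejects when M[T, $] turns out to be an error entry.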
INTRODUCTION TO LALR PARSER
The LALR (LookAhead-LR) technique is often used in practice, because the tables obtained by it are
considerably smaller than the canonical LR tables.
The SLR and LALR tables for a grammar always have the same number of states, whereas the canonical
LR (CLR) table would typically have many more states.
For example, for a language like Pascal, the SLR and LALR tables would have hundreds of states, but the
CLR table would have several thousand.
Algorithm: LALR table construction.
INPUT: An augmented grammar G'.
OUTPUT: The LALR parsing-table functions ACTION and GOTO for G'.
METHOD:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.
2. For each core present among the sets of LR(1) items, find all sets having that core, and replace these
sets by their union.
3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. If there is a parsing-action conflict, the
algorithm fails to produce a parser, and the grammar is said not to be LALR(1).
4. The GOTO table is constructed as follows. If J is the union of one or more sets of LR(1) items, that is, J = I1
U I2 U . . .U Ik , then the cores of GOTO(I1, X) , GOTO(I2, X) , . . . , GOTO(Ik, X) are the same, since I1,I2,
. . ,Ik all have the same core. Let K be the union of all sets of items having the same core as GOTO(I1, X). Then
GOTO(J,X) = K.
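Step 2 of the algorithm (merging item sets with a common core) can be sketched concretely. The two item sets below are hand-written states from the canonical LR(1) collection for the grammar S → CC, C → cC | d (numbered I3 and I6 following the standard textbook example); the tuple encoding of items is an assumption made for illustration:

```python
# An LR(1) item is (production, dot_position, lookahead);
# a parser state is a frozenset of items.
I3 = frozenset({("C -> cC", 1, "c"), ("C -> cC", 1, "d"),
                ("C -> cC", 0, "c"), ("C -> cC", 0, "d"),
                ("C -> d", 0, "c"), ("C -> d", 0, "d")})
I6 = frozenset({("C -> cC", 1, "$"),
                ("C -> cC", 0, "$"), ("C -> d", 0, "$")})

def core(state):
    """The core of a state: its items with the lookaheads stripped."""
    return frozenset((prod, dot) for prod, dot, _ in state)

def merge_by_core(states):
    """Union LR(1) states that share a core, yielding the LALR states."""
    merged = {}
    for st in states:
        merged.setdefault(core(st), set()).update(st)
    return [frozenset(s) for s in merged.values()]

lalr = merge_by_core([I3, I6])
# I3 and I6 have the same core, so they collapse into a single LALR state
# whose lookahead sets are the unions {c, d, $}.
```

This core-merging is exactly why the LALR table has no more states than the SLR table for the same grammar.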
Let us consider the grammar
S → CC
C → cC | d
whose sets of LR(1) items are:
Error recovery in LR parsing:
An LR parser will detect an error when it consults the parsing action table and finds an error entry. An
LR parser will announce an error as soon as there is no valid continuation for the portion of the input
thus far scanned.
A canonical LR parser will not make even a single reduction before announcing an error. SLR and
LALR parsers may make several reductions before announcing an error, but they will never shift an
erroneous input symbol onto the stack.
Panic-mode error recovery:
We scan down the stack until a state s with a goto on a particular nonterminal A is found. Zero or more
input symbols are then discarded until a symbol a is found that can legitimately follow A.
The parser then stacks the state GOTO (s, A) and resumes normal parsing.
Phrase-level recovery:
It is implemented by examining each error entry in the LR parsing table and deciding on the basis of
language usage the most likely programmer error that would give rise to that error.
An appropriate recovery procedure can then be constructed; presumably the top of the stack and/or first
input symbols would be modified in a way deemed appropriate for each error entry.
ERROR HANDLING AND RECOVERY IN SYNTAX ANALYZER
The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction
Panic mode recovery:
On discovering an error, the parser discards input symbols one at a time until a synchronizing token is
found. The synchronizing tokens are usually delimiters, such as semicolon or end.
It has the advantage of simplicity and does not go into an infinite loop. When multiple errors in the same
statement are rare, this method is quite useful.
Phrase level recovery:
On discovering an error, the parser performs local correction on the remaining input that allows it to
continue.
Example: Insert a missing semicolon or delete an extraneous semicolon etc.
Error productions:
The parser is constructed using augmented grammar with error productions. If an error production is
used by the parser, appropriate error diagnostics can be generated to indicate the erroneous constructs
recognized by the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a parse tree for
a string y, such that the number of insertions, deletions and changes of tokens is as small as possible.
However, these methods are in general too costly in terms of time and space.
YACC
YACC stands for Yet Another Compiler-Compiler.
Yacc is a computer program for the Unix operating system.
It is an LALR parser generator: it generates a parser (the part of a compiler that tries to make syntactic
sense of the source code) based on a grammar written in a notation similar to BNF.
Yacc itself used to be available as the default parser generator on most Unix systems.
The input to Yacc is a grammar with snippets of C code (called "actions") attached to its rules. Its output
is a shift-reduce parser in C that executes the C snippets associated with each rule as soon as the rule is
recognized. Typical actions involve the construction of parse trees.
DESIGN OF A SYNTAX ANALYZER FOR A SAMPLE LANGUAGE
YACC (Yet Another Compiler-Compiler):
Automatically generates a parser (an LALR parser) for a context-free grammar.
Allows syntax-directed translation by writing grammar productions and semantic actions.
LALR(1) is more powerful than LL(1).
Works with lex: YACC calls yylex to get the next token.
YACC and lex must agree on the values for each token.
Like lex, YACC predates C++, so workarounds are needed for some constructs when using C++.
yyparse() returns 0 if the program is grammatically correct, and non-zero otherwise. YACC automatically
builds a parser for the grammar (an LALR parser).