Compiler course
Chapter 4
Syntax Analysis
Outline
Role of parser
Context free grammars
Top down parsing
Bottom up parsing
Parser generators
The role of parser
Lexical token Rest of
Source Parse tree Intermediate
program Analyze Parser Front representation
r getNext End
Token
Checks whether the
tokens are syntactically
Symbol
true.
table Checks code is according
to rules or not.
The Parser
A parsing is a process of deriving string from the given
grammar. (CFG)
It check the proper grammar of the token generated by
lexical analyzer (Scanner).
Generate the parse tree (Derivation etc.)
A parser implements a C-F grammar as a recognizer of strings
The role of the parser in a compiler is two-fold:
1. To check syntax (= string recognizer)
And to report syntax errors accurately
2. To invoke semantic actions
For static semantics checking, e.g. type checking of
expressions, functions, etc.
For syntax-directed translation of the source code to an
intermediate representation
4
Uses of grammars
E -> E + T | T s->T+S|T2
T -> T * F | F T->id*T|id|(s)
F -> (E) | id w= id*id*id
Error handling
Common programming errors
◦ Lexical errors
◦ Syntactic errors
◦ Semantic errors
Error handler goals
◦ Report the presence of errors clearly and
accurately
◦ Recover from each error quickly enough to detect
subsequent errors
◦ Add minimal overhead to the processing of
correct programs
Error-recover strategies
Error recovery strategies in syntax analysis aim to
enable a parser to continue processing input even after
encountering a syntax error, rather than halting
immediately. This allows for the detection of multiple
errors and provides more comprehensive feedback to
the user.
Panic mode recovery
◦ Discard input symbol one at a time until one of
designated set of synchronization tokens (semicolons,
braces, keywords like end) is found
◦ Resume parsing from that point.
◦ Easy to implement and prevents infinite loops.
Phrase level recovery
◦ When an error is detected, the parser tries to replace,
insert, or delete a small number of tokens to transform
the erroneous phrase into a valid one, allowing parsing
to continue.
◦ More effective than panic mode.
Error-recover strategies
Error productions
◦ Augmenting the grammar with special "error
productions" that explicitly describe common syntactic
errors. When the parser encounters an input sequence
matching an error production, it can recognize the error
and potentially provide specific, helpful error messages
or even attempt a predefined correction.
Global correction
◦ Choosing minimal sequence of changes to obtain a
globally least-cost correction by using complex
algorithms.
Statement Mode Recovery:
◦ Similar to phrase-level recovery, but focused on
recovering at the statement level. When an error occurs
within a statement, the parser attempts to make local
corrections (e.g., inserting missing semicolons,
correcting misplaced commas) to allow the parsing of
the rest of the statement to proceed.
Context free grammars
(Components)
Terminals
◦ They are the concrete "words" that the lexical analyzer
(scanner) feeds to the parser.
◦ The parser cannot break down terminals any further.
◦ They appear as leaf nodes in a parse tree.
◦ They are the actual tokens in the input stream, such as
keywords, operators, identifiers, and literals.
Non terminals
◦ They represent the abstract syntactic categories
like expression, statement, or variable.
◦ They are often represented by uppercase letters in grammar
examples, such as E for expression or S for statement.
Context free grammars
(Components)
Start symbol
Productions
◦ These are formal rules of the grammar that specify
how terminals and non-terminals can be combined
to form strings.
◦ Each production consists of a non-terminal on the
left side and an arrow (->) pointing to a sequence of
terminals and/or non-terminals on the right side.
Context free grammars
expression -> expression + term
expression -> expression – term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
In this example, Expression, Term, and Factor are non-terminals,
while +, *, (, ), and Number are terminals.
Each line represents a production rule, defining how an
Expression can be formed from Terms, how Terms can
be formed from Factors, and so on.
Derivations
Productions are treated as rewriting rules to
generate a string
Rightmost and leftmost derivations
◦ E -> E + E | E * E | -E | (E) | id
◦ Derivations for –(id+id)
E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Parse trees
A parse tree is a graphical
representation of a derivation that
filters out the order in which
productions are applied to replace
non-terminals.
Each interior node of a parse tree
represents the application of a
production.
-(id+id)
E => -E => -(E) => -(E+E) => -
(id+E)=>-(id+id)
Ambiguity
For some strings
there exist more
than one parse tree
Or more than one
leftmost derivation
Or more than one
rightmost derivation
Example: id+id*id
Elimination of ambiguity
Here “other” stands for any other statement. According to
this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
if E1 then if E2 then S1 else S2
Elimination of ambiguity
(cont.)
Idea:
◦ A statement appearing between a then and an
else must be matched
Top Down Parsing
Introduction
A Top-down parser tries to create a parse tree from
the root towards the leafs scanning input from left to
right
It can be also viewed as finding a leftmost derivation
for an input string
Example: id+id*id
E -> TE’ E E E E E E
lm lm lm lm lm
E’ -> +TE’ | Ɛ T E’ T E’ T E’ T E’ T E’
T -> FT’
T’ -> *FT’ | Ɛ F T’ F T’ F T’ F T’ + T E’
F -> (E) | id id id Ɛ id Ɛ
Recursive descent parsing
Consists of a set of procedures, one for each
nonterminal
Execution begins with the procedure for start
symbol
A typical procedure for a non-terminal
void A() {
choose an A-production, A->X1X2..Xk
for (i=1 to k) {
if (Xi is a nonterminal
call procedure Xi();
else if (Xi equals the current input symbol a)
advance the input to the next symbol;
else /* an error has occurred */
}
}
Recursive descent parsing
(cont)
General recursive descent may require
backtracking
The previous code needs to be modified to
allow backtracking
In general form it cant choose an A-production
easily.
So we need to try all alternatives
If one failed the input pointer needs to be reset
and another alternative should be tried
Recursive descent parsers cant be used for left-
recursive grammars
ExamplePointer moves toward next leaf
and next terminal in input
S->cAd Step:2 Buffer c a d
A->ab | a S
Input: cad c A d
Step:1 Buffer c a d a b Pointer will
backtrack in
S
parse tree and
Step:3 Buffer c a d input and take
c A d
S next alternate
for production
At start pointer will be at and not
c A d
left leaf node of parse tree. consider
a b current
Example
Step:4 Buffer c a d Step:5 Buffer c a d
S
S Pointer in
parse tree
c A d
c A d and input
matches
a
a terminal so
move next.
First and Follow
Provide aid to both top down and bottom-up approaches.
During top-down parsing, FIRST and FOLLOW allow us to
choose which production to apply, based on the next
input symbol.
First(A): contains all terminals present in first place
of every string derived by A.
If α=>ɛ then ɛ is also in First(α)
A=>cy
Computing First
To compute First(X) for all grammar symbols
X, apply following rules until no more
terminals
* or ɛ can be added to any First set:
1. If X is a terminal then First(X) = {X}.
2. If X is a non terminal and X->Y1Y2…Yk is a
production for some k>=1, then place a in First(X)
if for some i a is in First(Yi) and ɛ is in all of
First(Y1),…,First(Yi-1) that is Y1…Yi-1 => ɛ. if ɛ is
in First(Yj)* for j=1,…,k then add ɛ to First(X).
3. If X-> ɛ is a production then add ɛ to First(X)
Example!
Computing follow
To compute First(A) for all nonterminals A,
apply following rules until nothing can be
added to any follow set:
1. Place $ in Follow(S) where S is the start symbol
2. If there is a production A-> αBβ then everything
in First(β) except ɛ is in Follow(B).
3. If there is a production A->B or a production
A->αBβ where First(β) contains ɛ, then
everything in Follow(A) is in Follow(B)
Example!
LL(1)
*
Home Work
Bottom-up Parsing
in order to reach the start symbol.
The image given below depicts the bottom-up parsers available.
Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing.
These steps are known as shift-step and reduce-step.
Shift step: The shift step refers to the advancement of the input
pointer to the next input symbol, which is called the shifted
symbol. This symbol is pushed onto the stack. The shifted symbol
is treated as a single node of the parse tree.
Reduce step : When the parser finds a complete grammar rule
(RHS) and replaces it to (LHS), it is known as reduce-step. This
occurs when the top of the stack contains a handle. To reduce, a
POP function is performed on the stack which pops off the handle
and replaces it with LHS non-terminal symbol.
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser.
It uses a wide class of context-free grammar which makes it the
most efficient syntax analysis technique. LR parsers are also
known as LR(k) parsers, where L stands for left-to-right scanning
of the input stream; R stands for the construction of right-most
derivation in reverse, and k denotes the number of lookahead
symbols to make decisions.
LALR(1) – Look-Ahead LR Parser:
1. LALR Parser is lookahead LR parser.
2. It can handle large classes of grammar.
3. Size of CLR parsing table is quite large as compared to
other parsing table.
4. LALR reduces the size of this table produces by CLR.
5. LALR works similar to CLR. The only difference is , it
combines the similar states of CLR parsing table into one
single state.
Error Handling in Compiler Design
The tasks of the Error Handling process
are to detect each error, report it to the
user, and then make some recover strategy
and implement them to handle error. During
this whole process processing time of
program should not be slow. An Error is the
blank entries in the symbol table.
Types or Sources of Error –
There are two types of error: run-time and compile-
time error:
A run-time error: is an error which takes place
during the execution of a program, and usually
happens because of adverse system parameters
or invalid input data. The lack of sufficient
memory to run an application or a memory
conflict with another program and logical error
are example of this. Logic errors, occur when
executed code does not produce the expected
result. Logic errors are best handled by
meticulous program debugging.
Types or Sources of Error –
Compile-time errors: rises at compile
time, before execution of the program.
Syntax error or missing file reference that
prevents the program from successfully
compiling is the example of this.
Classification of Compile-time
error –
Lexical : This includes misspellings of
identifiers, keywords or operators
Syntactical : missing semicolon or
unbalanced parenthesis
Semantical : incompatible value
assignment or type mismatches between
operator and operand
Logical : code not reachable, infinite loop.
Finding error or reporting an error
–
Viable-prefix is the property of a parser which
allows early detection of syntax errors.
Goal: detection of an error as soon as possible
without further consuming unnecessary input
How: detect an error as soon as the prefix of the
input does not match a prefix of any string in the
language.
Example: for(;), this will report an error as for
have two semicolons inside braces.
Error Recovery
basic requirement for the compiler is to simply stop
and issue a message, and cease compilation. There
are some common recovery methods that are
follows.
Error Recovery
Panic mode recovery: This is the easiest way of
error-recovery and also, it prevents the parser from
developing infinite loops while recovering error. The
parser discards the input symbol one at a time until
one of the designated (like end, semicolon) set of
synchronizing tokens (are typically the statement or
expression terminators) is found. This is adequate
when the presence of multiple errors in same
statement is rare. Example: Consider the erroneous
expression- (1 + + 2) + 3. Panic-mode recovery: Skip
ahead to next integer and then continue. Bison: use
the special terminal error to describe how much input
to skip.
E->int|E+E|(E)|error int|(error)
Error Recovery
Phase level recovery: Perform local
correction on the input to repair the error.
But error correction is difficult in this
strategy.
Error productions: Some common errors
are known to the compiler designers that
may occur in the code. Augmented
grammars can also be used, as productions
that generate erroneous constructs when
these errors are encountered. Example:
write 5x instead of 5*x
Error Recovery
Global correction: Its aim is to make as
few changes as possible while converting an
incorrect input string to a valid string. This
strategy is costly to implement.
Error detection and Recovery in Compiler
In this phase of compilation, all possible
errors made by the user are detected and
reported to the user in form of error
messages. This process of locating errors
and reporting it to user is called Error
Handling process.
Functions of Error handler
Detection
Reporting
Recovery
Classification of Errors
Lexical phase errors
These errors are detected during the lexical
analysis phase. Typical lexical errors are
Exceeding length of identifier or numeric
constants.
Appearance of illegal characters
Unmatched string
Example 1 : printf("Geeksforgeeks");$
This is a lexical error since an illegal character $
appears at the end of statement.
Example 2 : This is a comment */
This is an lexical error since end of comment is
present but beginning is not present.
Errorrecovery:
Panic Mode Recovery
In this method, successive characters from
the input are removed one at a time until a
designated set of synchronizing tokens is
found. Synchronizing tokens are delimiters
such as; or }
Advantage is that it is easy to implement
and guarantees not to go to infinite loop
Disadvantage is that a considerable amount
of input is skipped without checking it for
additional errors
Syntactic phase errors
These errors are detected during syntax analysis phase.
Typical syntax errors are
Errors in structure
Missing operator
Misspelled keywords
Unbalanced parenthesis
Example : swicth(ch)
{
.......
.......
}
The keyword switch is incorrectly written as
swicth. Hence, “Unidentified
keyword/identifier” error occurs.
Errorrecovery:
Panic Mode Recovery
◦ In this method, successive characters from
input are removed one at a time until a
designated set of synchronizing tokens is found.
Synchronizing tokens are deli-meters such as ;
or }
◦ Advantage is that its easy to implement and
guarantees not to go to infinite loop
◦ Disadvantage is that a considerable amount of
input is skipped without checking it for
additional errors
Errorrecovery:
Statement Mode recovery
◦ In this method, when a parser encounters an
error, it performs necessary correction on
remaining input so that the rest of input
statement allow the parser to parse ahead.
◦ The correction can be deletion of extra
semicolons, replacing comma by semicolon or
inserting missing semicolon.
◦ While performing correction, atmost care should
be taken for not going in infinite loop.
◦ Disadvantage is that it finds difficult to handle
situations where actual error occured before
point of detection.
Errorrecovery:
Error production
◦ If user has knowledge of common errors that
can be encountered then, these errors can be
incorporated by augmenting the grammar with
error productions that generate erroneous
constructs.
◦ If this is used then, during parsing appropriate
error messages can be generated and parsing
can be continued.
◦ Disadvantage is that its difficult to maintain.
Errorrecovery:
Global Correction
◦ The parser examines the whole program and
tries to find out the closest match for it which is
error free.
◦ The closest match program has less number of
insertions, deletions and changes of tokens to
recover from erroneous input.
◦ Due to high time and space complexity, this
method is not implemented practically.
Semantic errors
These errors are detected during semantic analysis phase.
Typical semantic errors are
Incompatible type of operands
Undeclared variables
Not matching of actual arguments with formal one
Example : int a[10], b;
.......
.......
a = b;
It generates a semantic error because of an
incompatible type of a and b.
Error recovery
If error “Undeclared Identifier” is encountered then, to
recover from this a symbol table entry for corresponding
identifier is made.
If data types of two operands are incompatible then,
automatic type conversion is done by the compiler.