Chapter 4 Syntax Analysis
Outline
y Role of parser y Context free grammars y Top down parsing y Bottom up parsing y Parser generators
The role of parser
Source Lexical program Analyzer token
Parser
getNext Token
Parse tree
Rest of Intermediate Front End representation
Symbol table
Uses of grammars
E -> E + T | T T -> T * F | F F -> (E) | id
E -> TE E -> +TE | T -> FT T -> *FT | F -> (E) | id
Error handling
y Common programming errors y Lexical errors y Syntactic errors y Semantic errors y Lexical errors y Error handler goals y Report the presence of errors clearly and accurately y Recover from each error quickly enough to detect subsequent errors y Add minimal overhead to the processing of correct progrms
Error-recover strategies
y Panic mode recovery y Discard input symbol one at a time until one of designated set of synchronization tokens is found y Phrase level recovery y Replacing a prefix of remaining input by some string that allows the parser to continue y Error productions y Augment the grammar with productions that generate the erroneous constructs y Global correction y Choosing minimal sequence of changes to obtain a globally least-cost correction
Context free grammars
y Terminals y Nonterminals y Start symbol y productions
expression -> expression + term expression -> expression term expression -> term term -> term * factor term -> term / factor term -> factor factor -> (expression) factor -> id
Derivations
y Productions are treated as rewriting rules to generate a
string y Rightmost and leftmost derivations
y E -> E + E | E * E | -E | (E) | id y Derivations for (id+id)
y
E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Parse trees
y -(id+id)
y E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Ambiguity
y For some strings there exist more than one parse tree y Or more than one leftmost derivation y Or more than one rightmost derivation y Example: id+id*id
Elimination of ambiguity
Elimination of ambiguity (cont.)
y Idea: y A statement appearing between a then and an else must be matched
Elimination of left recursion
y y y
y
y
A grammar is left recursive if it has a non-terminal A + such that there is a derivation A=> A Top down parsing methods cant handle left-recursive grammars A simple rule for direct left recursion elimination:
For a rule like:
A -> A A -> A -> | A A |
y
y y
We may replace it with
Left recursion elimination (cont.)
y There are cases like following y S -> Aa | b y A -> Ac | Sd | y Left recursion elimination algorithm: y Arrange the nonterminals in some order A1,A2, ,An. y For (each i from 1 to n) {
y
For (each j from 1 to i-1) { y Replace each production of the form Ai-> Aj by the production Ai -> 1 | 2 | | k where Aj-> 1 | 2 | | k are all current Aj productions y } y Eliminate left recursion among the Ai-productions }
Left factoring
y Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive or top-down parsing. y Consider following grammar: y Stmt -> if expr then stmt else stmt y | if expr then stmt y On seeing input if it is not clear for the parser which production to use y We can easily perform left factoring: y If we have A-> 1 | 2 then we replace it with
y y
A -> A ->
A 1 |
Left factoring (cont.)
y Algorithm y For each non-terminal A, find the longest prefix common to two or more of its alternatives. If <> , then replace all of A-productions A-> 1 | 2 | | n| by
y y
A -> A ->
A| 1 | 2 |
y Example: y S -> I E t S | i E t S e S | a y E -> b
Introduction
y A Top-down parser tries to create a parse tree from the
root towards the leafs scanning input from left to right y It can be also viewed as finding a leftmost derivation for an input string y Example: id+id*id
E -> TE E -> +TE | T -> FT T -> *FT | F -> (E) | id
E
lm
E T E
lm
E T E T
lm
E T E T
lm
E T E T
lm
E T E T + T E
F id
F id
F id
Recursive descent parsing
y Consists of a set of procedures, one for each
nonterminal y Execution begins with the procedure for start symbol y A typical procedure for a non-terminal
void A() { choose an A-production, A->X1X2..Xk for (i=1 to k) { if (Xi is a nonterminal call procedure Xi(); else if (Xi equals the current input symbol a) advance the input to the next symbol; else /* an error has occurred */ } }
Recursive descent parsing (cont)
y General recursive descent may require backtracking y The previous code needs to be modified to allow
backtracking y In general form it cant choose an A-production easily. y So we need to try all alternatives y If one failed the input pointer needs to be reset and another alternative should be tried y Recursive descent parsers cant be used for leftrecursive grammars
Example
S->cAd A->ab | a Input: cad
S c A d c a
S A b d c
S A a d
First and Follow
y First() is set of terminals that begins strings derived from * y If => then is also in First( ) y In predictive parsing when we have A-> | , if First( ) and First( ) are disjoint sets then we can select appropriate A-production by looking at the next input y Follow(A), for any nonterminal A, is set of terminals a that can appear immediately after A in some sentential form * y If we have S => Aa for some and then a is in Follow(A) y If A can be the rightmost symbol in some sentential form, then $ is in Follow(A)
Computing First
y To compute First(X) for all grammar symbols X, apply
* following rules until no more terminals or added to any First set:
can be
If X is a terminal then First(X) = {X}. 2. If X is a nonterminal and X->Y1Y2 Yk is a production for some k>=1, then place a in First(X) if for some i a is in First(Yi) and is in all of First(Y1), ,First(Yi-1) that * is Y1 Yi-1 => . if is in First(Yj) for j=1, ,k then add to First(X). 3. If X-> is a production then add to First(X)
1.
y Example!
Computing follow
y To compute First(A) for all nonterminals A, apply
following rules until nothing can be added to any follow set:
Place $ in Follow(S) where S is the start symbol 2. If there is a production A-> B then everything in First( ) except is in Follow(B). 3. If there is a production A->B or a production A-> B where First( ) contains , then everything in Follow(A) is in Follow(B)
1.
y Example!
LL(1) Grammars
y Predictive parsers are those recursive descent parsers needing no
backtracking y Grammars for which we can create predictive parsers are called LL(1)
y The first L means scanning input from left to right y The second L means leftmost derivation y And 1 stands for using one input symbol for lookahead
y A grammar G is LL(1) if and only if whenever A->
y For no terminal a do
| are two distinct productions of G, the following conditions hold:
and both derive strings beginning with a y At most one of or can derive empty string * y If => then does not derive any string beginning with a terminal in Follow(A).
Construction of predictive parsing table
y For each production A->
in grammar do the
following:
For each terminal a in First( ) add A-> in M[A,a] 2. If is in First( ), then for each terminal b in Follow(A) add A-> to M[A,b]. If is in First( ) and $ is in Follow(A), add A-> to M[A,$] as well
1.
y If after performing the above, there is no production
in M[A,a] then set M[A,a] to error
Example
E -> TE E -> +TE | T -> FT T -> *FT | F -> (E) | id
Non terminal
First F T E E T id +
{(,id} {(,id} {(,id} {+, } {*, }
Follow
{+, *, ), $} {+, ), $} {), $} {), $} {+, ), $}
Input Symbol ( *
E -> TE
)
E ->
$
E ->
E E T T F
E -> TE E -> +TE T -> FT T -> F -> id T -> *FT
T -> FT T -> F -> (E) T ->
Another example
S -> iEtSS | a S -> eS | E -> b
Non terminal
a
S -> a
Input Symbol i e
S -> iEtSS S -> S -> eS
$
S ->
S S E
E -> b
Non-recursive predicting parsing
a + b $
stack
X Y Z $
Predictive parsing program
output
Parsing Table M
Predictive parsing algorithm
Set ip point to the first symbol of w; Set X to the top stack symbol; While (X<>$) { /* stack is not empty */ if (X is a) pop the stack and advance ip; else if (X is a terminal) error(); else if (M[X,a] is an error entry) error(); else if (M[X,a] = X->Y1Y2..Yk) { output the production X->Y1Y2..Yk; pop the stack; push Yk, ,Y2,Y1 on to the stack with Y1 on top; } set X to the top stack symbol; }
Example
y id+id*id$
Matched Stack E$ Input id+id*id$ Action
Error recovery in predictive parsing
y Panic mode y Place all symbols in Follow(A) into synchronization set for nonterminal A: skip tokens until an element of Follow(A) is seen and pop A from stack. y Add to the synchronization set of lower level construct the symbols that begin higher level constructs y Add symbols in First(A) to the synchronization set of nonterminal A y If a nonterminal can generate the empty string then the production deriving can be used as a default y If a terminal on top of the stack cannot be matched, pop the terminal, issue a message saying that the terminal was insterted
Example
Non terminal E E T T F
id E -> TE
+ E -> +TE
Input Symbol ( ) * E -> TE synch E -> T -> FT synch T -> *FT synch T ->
$ synch E -> synch T -> synch
T -> FT
synch T ->
F -> id
synch
F -> (E) synch
Stack
E$ E$ TE$ FTE$ idTE$ TE$ *FTE$ FTE$ TE$
Input
)id*+id$ id*+id$ id*+id$ id*+id$ id*+id$ *+id$ *+id$ +id$ +id$
Action
Error, Skip ) id is in First(E)
Error, M[F,+]=synch F has been poped
Introduction
y Constructs parse tree for an input string beginning at
the leaves (the bottom) and working towards the root (the top) y Example: id*id
E -> E + T | T T -> T * F | F F -> (E) | id
id*id F * id id T * id F id T*F F id id F T*F F id id E F T*F F id id
Shift-reduce parser
y The general idea is to shift some symbols of input to
the stack until a reduction can be applied y At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of the production y The key decisions during bottom-up parsing are about when to reduce and about what production to apply y A reduction is a reverse of a step in a derivation y The goal of a bottom-up parser is to construct a derivation in reverse:
y E=>T=>T*F=>T*id=>F*id=>id*id
Handle pruning
y A Handle is a substring that matches the body of a
production and whose reduction represents one step along the reverse of a rightmost derivation
Right sentential form id*id F*id T*id T*F Handle id F id T*F Reducing production F->id T->F F->id E->T*F
Shift reduce parsing
y A stack is used to hold grammar symbols y Handle always appear on top of the stack y Initial configuration: Stack Input $ w$ y Acceptance configuration Stack Input $S $
Shift reduce parsing (cont.)
y Basic operations: y Shift y Reduce y Accept y Error y Example: id*id
Stack $ $id $F $T $T* $T*id $T*F $T $E Input id*id$ *id$ *id$ *id$ id$ $ $ $ $ Action shift reduce by F->id reduce by T->F shift shift reduce by F->id reduce by T->T*F reduce by E->T accept
Handle will appear on top of the stack
S A B y Stack $ $ $ B By Input yz$ yz$ z$ z Stack $ $ Bxy S B x A y Input xyz$ z$ z
Conflicts during shit reduce parsing
y Two kind of conflicts y Shift/reduce conflict y Reduce/reduce conflict y Example:
Stack if expr then stmt
Input else $
Reduce/reduce conflict
stmt -> id(parameter_list) stmt -> expr:=expr parameter_list->parameter_list, parameter parameter_list->parameter parameter->id expr->id(expr_list) expr->id expr_list->expr_list, expr Stack expr_list->expr id(id
Input ,id) $
LR Parsing
y The most prevalent type of bottom-up parsers y LR(k), mostly interested on parsers with k<=1 y Why LR parsers? y Table driven y Can be constructed to recognize all programming language constructs y Most general non-backtracking shift-reduce parsing method y Can detect a syntactic error as soon as it is possible to do so y Class of grammars for which we can construct LR parsers are superset of those which we can construct LL parsers
States of an LR parser
y States represent set of items y An LR(0) item of G is a production of G with the dot at
some position of the body:
y For A->XYZ we have following items
y y y y
A->.XYZ A->X.YZ A->XY.Z A->XYZ.
y In a state having A->.XYZ we hope to see a string
derivable from XYZ next on the input. y What about A->X.YZ?
Constructing canonical LR(0) item sets
y Augmented grammar:
y G with addition of a production: S ->S
y Closure of item sets:
y If I is a set of items, closure(I) is a set of items constructed from I by
the following rules: y Add every item in I to closure(I) y If A-> .B is in closure(I) and B-> is a production then add the item B->. to clsoure(I). I0=closure({[E ->.E]} y Example:
E->E E -> E + T | T T -> T * F | F F -> (E) | id
E ->.E E->.E+T E->.T T->.T*F T->.F F->.(E) F->.id
Constructing canonical LR(0) item sets (cont.)
y Goto (I,X) where I is an item set and X is a grammar symbol is closure of set of all items [A-> X. ] where [A-> .X ] is in I I1 y Example E ->E. E E->E.+T
I0=closure({[E ->.E]} E ->.E E->.E+T E->.T T->.T*F T->.F F->.(E) F->.id
T (
I2 E ->T. T->T.*F
I4 F->(.E) E->.E+T E->.T T->.T*F T->.F F->.(E) F->.id
Closure algorithm
SetOfItems CLOSURE(I) { J=I; repeat for (each item A-> .B in J)
for (each prodcution B-> of G) if (B->. is not in J) add B->. to J;
until no more items are added to J on one round; return J;
GOTO algorithm
SetOfItems GOTO(I,X) { J=empty; if (A-> .X
return J; }
is in I) add CLOSURE(A-> X. ) to J;
Canonical LR(0) items
Void items(G ) { C= CLOSURE({[S ->.S]}); repeat for (each set of items I in C) for (each grammar symbol X) if (GOTO(I,X) is not empty and not in C) add GOTO(I,X) to C; until no new set of items are added to C on a round; }
Example
I0=closure({[E ->.E]} E ->.E E->.E+T E->.T T->.T*F T->.F F->.(E) F->.id
acc $ E T id (
I1 E ->E. E->E.+T I2 E ->T. T->T.*F I5 F->id.
I4 F->(.E) E->.E+T E->.T T->.T*F T->.F F->.(E) F->.id
E->E E -> E + T | T T -> T * F | F F -> (E) | id
+ * id
I6 E->E+.T T->.T*F T->.F F->.(E) F->.id I7 T->T*.F F->.(E) F->.id
I9
E->E+T. T->T.*F I10 T->T*F.
+ E
I8 E->E.+T F->(E.)
I11 F->(E).
I3 T>F.
Use of LR(0) automaton
y Example: id*id
Line Stack (1) (2) (3) (4) (5) (6) (7) (8) (9) 0 05 03 02 027 0275 02710 02 01 Symbols $ $id $F $T $T* $T*id $T*F $T $E Input id*id$ *id$ *id$ *id$ id$ $ $ $ $ Action Shift to 5 Reduce by F->id Reduce by T->F Shift to 7 Shift to 5 Reduce by F->id Reduce by T->T*F Reduce by E->T accept
LR-Parsing model
INPUT
a1 ai an $
Sm Sm-1 $
LR Parsing Program
Output
ACTION
GOTO
LR parsing algorithm
let a be the first symbol of w$; while(1) { /*repeat forever */ let s be the state on top of the stack; if (ACTION[s,a] = shift t) { push t onto the stack; let a be the next input symbol; } else if (ACTION[s,a] = reduce A-> ) { pop | | symbols of the stack; let state t now be on top of the stack; push GOTO[t,A] onto the stack; output the production A-> ; } else if (ACTION[s,a]=accept) break; /* parsing is done */ else call error-recovery routine; }
Example
STATE id 0 1 2 3 4 5 6 7 8 9 10 11 S5 S5 S6 R1 R3 R5 S7 R3 R5 S5 R 6 R 6 S4 S4 S11 R1 R3 R5 R1 R3 R5 S5 S6 R2 R 4 S7 R7 S4 R6 R6 9 3 10 R2 R4 + * ACTON ( S4 Acc R2 R4 8 2 3 ) $ E 1 GOTO T 2 F 3
(0) E->E (1) E -> E + T (2) E-> T (3) T -> T * F (4) T-> F (5) F -> (E) (6) F->id
Line (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) Stac k 0 05 03 02 027 0275 02710 02 01 016 0165 0163 0169 01 id F T T* T*id T*F T E E+ E+id E+F E+T` E Symbol s
id*id+id?
Input id*id+id$ *id+id$ *id+id$ *id+id$ id+id$ +id$ +id$ +id$ +id$ id$ $ $ $ $ Action Shift to 5 Reduce by F->id Reduce by T->F Shift to 7 Shift to 5 Reduce by F->id Reduce by T>T*F Reduce by E->T Shift Shift Reduce by F->id Reduce by T->F Reduce by E>E+T accept
Constructing SLR parsing table
y Method y Construct C={I0,I1, , In}, the collection of LR(0) items for G y State i is constructed from state Ii:
y y y
If [A-> .a ] is in Ii and Goto(Ii,a)=Ij, then set ACTION[i,a] to shift j If [A-> .] is in Ii, then set ACTION[i,a] to reduce A-> for all a in follow(A) If {S ->.S] is in Ii, then set ACTION[I,$] to Accept
y If any conflicts appears then we say that the grammar is not
SLR(1). y If GOTO(Ii,A) = Ij then GOTO[i,A]=j y All entries not defined by above rules are made error y The initial state of the parser is the one constructed from the set of items containing [S ->.S]
Example grammar which is not S -> L=R | R SLR(1)
L -> *R | id R -> L
I0 S->.S S -> .L=R S->.R L -> .*R | L->.id R ->. L I1 S->S. I2 S ->L.=R R ->L. I3 S ->R. I4 L->*.R R->.L L->.*R L->.id Action = 2
Shift 6 Reduce R->L
I5 L -> id. I6 S->L=.R R->.L L->.*R L->.id
I7 L -> *R. I8 R -> L. I9 S -> L=R.
More powerful LR parsers
y Canonical-LR or just LR method y Use lookahead symbols for items: LR(1) items y Results in a large collection of items y LALR: lookaheads are introduced in LR(0) items
Canonical LR(1) items
y In LR(1) items each item is in the form: [A-> . ,a] y An LR(1) item [A-> . ,a] is valid for a viable prefix if * there is a derivation S=> Aw=> w, where
rm
= y Either a is the first symbol of w, or w is and a is $
y
y Example: y S->BB y B->aB|b
* S=>aaBab=>aaaBab rm Item [B->a.B,a] is valid for =aaa and w=ab
Constructing LR(1) sets of items
SetOfItems Closure(I) { repeat for (each item [A-> .B ,a] in I) for (each production B-> in G ) for (each terminal b in First( a)) add [B->. , b] to set I; until no more items are added to I; return I; } SetOfItems Goto(I,X) { initialize J to be the empty set; for (each item [A-> .X ,a] in I) add item [A-> X. ,a] to set J; return closure(J); } void items(G ){ initialize C to Closure({[S ->.S,$]}); repeat for (each set of items I in C) for (each grammar symbol X) if (Goto(I,X) is not empty and not in C) add Goto(I,X) to C; until no new sets of items are added to C; }
Example
S ->S S->CC C->cC C->d
Canonical LR(1) parsing table
y Method y Construct C={I0,I1, , In}, the collection of LR(1) items for G y State i is constructed from state Ii:
y y y
If [A-> .a , b] is in Ii and Goto(Ii,a)=Ij, then set ACTION[i,a] to shift j If [A-> ., a] is in Ii, then set ACTION[i,a] to reduce A-> If {S ->.S,$] is in Ii, then set ACTION[I,$] to Accept
y If any conflicts appears then we say that the grammar is not
LR(1). y If GOTO(Ii,A) = Ij then GOTO[i,A]=j y All entries not defined by above rules are made error y The initial state of the parser is the one constructed from the set of items containing [S ->.S,$]
Example
S ->S S->CC C->cC C->d
LALR Parsing Table
y For the previous example we had:
I4 C->d. , c/d
I47 C->d. , c/d/$
I7 C->d. , $
y State merges cant produce Shift-Reduce conflicts.
Why? y But it may produce reduce-reduce conflict
Example of RR conflict in state merging
S ->S S -> aAd | bBd | aBe | bAe A -> c B -> c
An easy but space-consuming LALR table construction
y Method: 1. Construct C={I0,I1, ,In} the collection of LR(1) items. 2. For each core among the set of LR(1) items, find all sets having that core, and replace these sets by their union. 3. Let C ={J0,J1, ,Jm} be the resulting sets. The parsing actions for state i, is constructed from Ji as before. If there is a conflict grammar is not LALR(1). 4. If J is the union of one or more sets of LR(1) items, that is J = I1 UI2 IIk then the cores of Goto(I1,X), , Goto(Ik,X) are the same and is a state like K, then we set Goto(J,X) =k. y This method is not efficient, a more efficient one is discussed in the book
Compaction of LR parsing table
y Many rows of action tables are identical y Store those rows separately and have pointers to them from different states y Make lists of (terminal-symbol, action) for each state y Implement Goto table by having a link list for each nonterinal in the form (current state, next state)
Using ambiguous grammars
STATE ACTON id 0 1 2 3 4 5 S3 S3 S4 R1 R2 R3 S5 S5 R2 R3 R1 R2 R3 R1 R2 R3 S3 R4 S3 S4 S5 S2 R4 S2 S2 R4 R4 7 8 + * ( S2 Acc 6 ) $
E->E+E E->E*E E->(E) E->id
I0: E->.E E->.E+E E->.E*E E->.(E) E->.id I3: E->.id I1: E->E. E->E.+E E->E.*E I2: E->(.E) E->.E+E E->.E*E E->.(E) E->.id I5: E->E*.E E->(.E) E->.E+E E->.E*E E->.(E) E->.id
GO TO E 1
6 7 8 9
I4: E->E+.E E->.E+E E->.E*E E->.(E) E->.id
I6: E->(E.) E->E.+E E->E.*E I8: E->E*E. E->E.+E E->E.*E
I7: E->E+E. E->E.+E E->E.*E I9: E->(E).
Readings
y Chapter 4 of the book