Syntax Analysis Part I
EECS 483 Lecture 4
University of Michigan
Monday, September 17, 2006
Announcements/Reading
Reading - Ch 4
4.1 4.3 for today
Simon does not have office hours on Fri
Just Tue/Thu, 1:30-3:30
My mistake on the website
-2-
Where is Syntax Analysis Performed?
if (b == 0) a = b;
Lexical Analysis or Scanner
if
==
Syntax Analysis or Parsing
if
==
b
abstract syntax tree
or parse tree
=
0
a
-3-
Parsing Analogy
Syntax analysis for natural languages
Recognize whether a sentence is grammatically correct
Identify the function of each word
sentence
subject
verb
indirect object
gave
him
I gave him the book
-4-
object
noun phrase
article
noun
the
book
Syntax Analysis Overview
Goal Determine if the input token stream
satisfies the syntax of the program
What do we need to do this?
An expressive way to describe the syntax
A mechanism that determines if the input
token stream satisfies the syntax description
For lexical analysis
Regular expressions describe tokens
Finite automata = mechanisms to generate
tokens from input stream
-5-
Just Use Regular Expressions?
REs can expressively describe tokens
Easy to implement via DFAs
So just use them to describe the syntax of a
programming language??
NO! They dont have enough power to express any
non-trivial syntax
Example Nested constructs (blocks, expressions,
statements) Detect balanced braces:
{ { { { {
{{} {} {{} { }}}
- We need unbounded counting!
- FSAs cannot count except in a strictly
modulo fashion
-6-
...
Context-Free Grammars
Consist of 4 components:
Terminal symbols = token or
Non-terminal symbols = syntactic variables
Start symbol S = special non-terminal
Productions of the form LHSRHS
SaSa
ST
TbTb
T
LHS = single non-terminal
RHS = string of terminals and non-terminals
Specify how non-terminals may be expanded
Language generated by a grammar is the set of
strings of terminals derived from the start symbol
by repeatedly applying the productions
L(G) = language generated by grammar G
-7-
CFG - Example
Grammar for balanced-parentheses language
S(S)S
S
? Why is the final S required?
1 non-terminal: S
2 terminals: (, )
Start symbol: S
2 productions
If grammar accepts a string, there is a derivation of
that string using the productions
(())
S = (S) = ((S) S) = (() ) = (())
-8-
More on CFGs
Shorthand notation vertical bar for
multiple productions
SaSa|T
TbTb|
CFGs powerful enough to expression the syntax
in most programming languages
Derivation = successive application of
productions starting from S
Acceptance? = Determine if there is a derivation
for an input token stream
-9-
A Parser
Context free
grammar, G
Parser
Token stream, s
(from lexer)
Yes, if s in L(G)
No, otherwise
Error messages
Syntax analyzers (parsers) = CFG acceptors which also
output the corresponding derivation when the token stream
is accepted
Various kinds: LL(k), LR(k), SLR, LALR
- 10 -
RE is a Subset of CFG
Can inductively build a grammar for each RE
a
R1 R2
R1 | R2
R1*
S
Sa
S S1 S2
S S1 | S2
S S1 S |
Where
G1 = grammar for R1, with start symbol S1
G2 = grammar for R2, with start symbol S2
- 11 -
Grammar for Sum Expression
Grammar
SE+S|E
E number | (S)
Expanded
SE+S
SE
E number
E (S)
4 productions
2 non-terminals (S,E)
4 terminals: (, ), +, number
start symbol: S
- 12 -
Constructing a Derivation
Start from S (the start symbol)
Use productions to derive a sequence of
tokens
For arbitrary strings , , and for a
production: A
A single step of the derivation is
A
(substitute for A)
Example
SE+S
(S + E) + E (E + S + E) + E
- 13 -
Class Problem
SE+S|E
E number | (S)
Derive: (1 + 2 + (3 + 4)) + 5
- 14 -
Parse Tree
S
E
( S )
E + S
1 E + S
2
Parse tree = tree representation of the
derivation
Leaves of the tree are terminals
Internal nodes are non-terminals
No information about the order of
the derivation steps
( S )
E + S
3
E
4
- 15 -
Parse Tree vs Abstract Syntax Tree
S
E
Parse tree also called concrete syntax
S
( S )
E + S
+
+
1 E + S
2
+
2
+
3
( S )
E + S
3
E
AST discards (abstracts) unneeded
information more compact format
- 16 -
Derivation Order
Can choose to apply productions in any order,
select non-terminal and substitute RHS of
production
Two standard orders: left and right-most
Leftmost derivation
In the string, find the leftmost non-terminal and apply
a production to it
E+S1+S
Rightmost derivation
Same, but find rightmost non-terminal
E+SE+E+S
- 17 -
Leftmost/Rightmost Derivation Examples
SE+S|E
E number | (S)
Leftmost
derive: (1 + 2 + (3 + 4)) + 5
S E + S (S)+S (E+S) + S (1+S)+S (1+E+S)+S
(1+2+S)+S (1+2+E)+S (1+2+(S))+S (1+2+(E+S))+S
(1+2+(3+S))+S (1+2+(3+E))+S (1+2+(3+4))+S
(1+2+(3+4))+E (1+2+(3+4))+5
Now, rightmost derive the same input string
S E+S E+E E+5 (S)+5 (E+S)+5 (E+E+S)+5
(E+E+E)+5 (E+E+(S))+5 (E+E+(E+S))+5
(E+E+(E+E))+5 (E+E+(E+4))+5 (E+E+(3+4))+5
(E+2+(3+4))+5 (1+2+(3+4))+5
Result: Same parse tree: same productions chosen, but in diff order
- 18 -
Class Problem
SE+S|E
E number | (S) | -S
Do the rightmost derivation of : 1 + (2 + -(3 + 4)) + 5
- 19 -
Ambiguous Grammars
In the sum expression grammar, leftmost
and rightmost derivations produced
identical parse trees
+ operator associates to the right in parse
tree regardless of derivation order
+
+
(1+2+(3+4))+5
+
2
+
3
- 20 -
An Ambiguous Grammar
+ associates to the right because of the
right-recursive production: S E + S
Consider another grammar
S S + S | S * S | number
Ambiguous grammar = different
derivations produce different parse trees
More specifically, G is ambiguous if there are
2 distinct leftmost (rightmost) derivations for
some sentence
- 21 -
Ambiguous Grammar - Example
S S + S | S * S | number
Consider the expression: 1 + 2 * 3
Derivation 2: S S*S
S+S*S 1+S*S 1+2*S
1+2*3
Derivation 1: S S+S
1+S 1+S*S 1+2*S
1+2*3
2 leftmost derivations
*
+
1
*
2
1
But, obviously not equal!
- 22 -
3
2
Impact of Ambiguity
Different parse trees correspond to
different evaluations!
Thus, program meaning is not defined!!
*
+
1
*
2
3
2
=9
=7
- 23 -
Can We Get Rid of Ambiguity?
Ambiguity is a function of the grammar, not the
language!
A context-free language L is inherently ambiguous
if all grammars for L are ambiguous
Every deterministic CFL has an unambiguous
grammar
So, no deterministic CFL is inherently ambiguous
No inherently ambiguous programming languages have
been invented
To construct a useful parser, must devise an
unambiguous grammar
- 24 -
Eliminating Ambiguity
Often can eliminate ambiguity by adding
nonterminals and allowing recursion only
S
on right or left
SS+T|T
T T * num | num
T non-terminal enforces precedence
Left-recursion; left associativity
- 25 -
S + T
T
T * 3
A Closer Look at Eliminating Ambiguity
Precedence enforced by
Introduce distinct non-terminals for each
precedence level
Operators for a given precedence level are
specified as RHS for the production
Higher precedence operators are accessed by
referencing the next-higher precedence nonterminal
- 26 -
Associativity
An operator is either left, right or non
associative
Left:
Right:
Non:
a + b + c = (a + b) + c
a ^ b ^ c = a ^ (b ^ c)
a < b < c is illegal (thus undefined)
Position of the recursion relative to the
operator dictates the associativity
Left (right) recursion left (right) associativity
Non: Dont be recursive, simply reference next
higher precedence non-terminal on both sides of
operator
- 27 -
Class Problem
S S + S | S S | S * S | S / S | (S) | -S | S ^ S | num
Enforce the standard arithmetic precedence rules and remove
all ambiguity from the above grammar
Precedence (high to low)
(), unary
^
*, /
+, Associativity
^ = right
rest are left
- 28 -