Intro to Compiler
Introduction to Parsing
Chapter 4
1
Languages and Automata
• Formal languages are very important in CS
– Especially in programming languages
• Regular languages
– The weakest formal languages widely used
– Many applications
• We will also study context-free languages
2
Beyond Regular Languages
• Many languages are not regular
• Strings of balanced parentheses are not regular:
) | i 0
( i i
3
What Can Regular Languages Express?
A B
0 0
4
What Can Regular Languages Express?
• Languages requiring counting modulo a fixed
integer
• Intuition: A finite automaton that runs long
enough must repeat states
• Finite automaton can’t remember # of times it
has visited a particular state
5
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program
(But some parsers never produce a parse tree . . .)
6
Example
• Cool
if x =y then 1 else 2 fi
• Parser input
IF ID =ID THEN INT ELSE INT FI
• Parser output
IF-THEN-ELSE
= INT INT
ID ID
7
Comparison with Lexical Analysis
Phase Input Output
Lexer String of String of tokens
characters
Parser String of tokens Parse tree
(may be implicit)
8
The Role of the Parser
• Not all strings of tokens are programs . . .
• . . . parser must distinguish between valid and
invalid strings of tokens
• We need
– A language for describing valid strings of tokens
– A method for distinguishing valid from invalid
strings of tokens
9
Chomsky Hierarchy
1 Unrestricted A
2 Context-Sensitive | LHS | | RHS |
3 Context-Free |LHS | = 1
4 Regular |RHS| = 1 or 2 ,
A a | aB, or
A a | Ba
10
Context-Free Grammars
• Programming language constructs have
recursive structure
• An STMT is
if EXPR then STMT else STMT
while EXPR do STMT end
…
• Context-free grammars are a natural notation
for this recursive structure
11
CFGs (Cont.)
• A CFG consists of
– A set of terminals T
– A set of non-terminals N
– A start symbol S (a non-terminal)
– A set of productions
12
Notational Conventions
• In these lecture notes
– Non-terminals are written upper-case
– Terminals are written lower-case
– The start symbol is the left-hand side of the first
production
13
Example of CFGs
Simple arithmetic expressions:
E E E
| E+E
| E
| id
14
The Language of a CFG
Read productions as replacement rules:
X Y1…Yn
Means X can be replaced by Y1…Yn
15
Key Idea
1. Begin with a string consisting of the start
symbol “S”
2. Replace any non-terminal X in the string by a
the right-hand side of some production
X Y1…Yn
3. Repeat (2) until there are no non-terminals in
the string
16
The Language of a CFG (Cont.)
More formally, write
if there is a production
17
The Language of a CFG (Cont.)
Write
if
in 0 or more steps
18
The Language of a CFG
Let G be a context-free grammar with start symbol S.
Then the language L(G) of G is:
The sentential forms SF(G) of G is:
X1 … Xn S X1 … Xn and every Xi is a terminal or non-terminal
Therefore: L(G) SF(G)
19
Terminals
• Terminals are so-called because there are no
rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language
20
Examples
L(G) is the language of CFG G
Strings of balanced parentheses ( )
i i
| i 0
Two grammars:
S (S) S (S)
OR
S |
21
Arithmetic Example
Simple arithmetic expressions:
E E+E | E E | (E) | id
Some elements of the language:
id id + id
(id) id id
(id) id id (id)
22
Notes
The idea of a CFG is a big step. But:
• Membership in a language is “yes” or “no”; also
need parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFG’s (e.g., bison)
23
Derivations and Parse Trees
A derivation is a sequence of productions
S …… ………
A derivation can be drawn as a tree
– Start symbol is the tree’s root
– For a production X Y1…Yn add children Y1…Yn to
node X
24
Derivation Example
• Grammar
E E+E | E E | (E) | id
• String
id id + id
25
Derivation Example (Cont.)
E
E
E+E
E + E
E E+E
id E + E E * E id
id id + E
id id
id id + id
26
Derivation in Detail (1)
27
Derivation in Detail (2)
E + E
E
E+E
28
Derivation in Detail (3)
E E + E
E+E
E * E
E E+E
29
Derivation in Detail (4)
E
E + E
E+E
E E+E E * E
id E + E
id
30
Derivation in Detail (5)
E
E
E+E E + E
E E+E
E * E
id E + E
id id + E id id
31
Derivation in Detail (6)
E
E
E+E
E + E
E E+E
id E + E E * E id
id id + E
id id
id id + id
32
Notes on Derivations
• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes
• An in-order traversal of the leaves is the
original input
• The parse tree shows the association of
operations, the input string does not
33
Left-most and Right-most Derivations
• The example was a left-
most derivation E
– At each step, replaced the
left-most non-terminal E+E
• There is an equivalent
E+id
notion of a right-most E E + id
derivation
E id + id
id id + id
34
Right-most Derivation in Detail (1)
35
Right-most Derivation in Detail (2)
E + E
E
E+E
36
Right-most Derivation in Detail (3)
E E + E
E+E
id
E+id
37
Right-most Derivation in Detail (4)
E
E + E
E+E
E+id E * E id
E E + id
38
Right-most Derivation in Detail (5)
E
E
E+E E + E
E+id
E * E id
E E + id
E id + id id
39
Right-most Derivation in Detail (6)
E
E
E+E
E + E
E+id
E E + id E * E id
E id + id
id id
id id + id
40
Derivations and Parse Trees
• Note that right-most and left-most
derivations have the same parse tree
• The difference is the order in which branches
are added
41
Summary of Derivations
• We are not just interested in whether
s L(G)
– We need a parse tree for s
• A derivation defines a parse tree
– But one parse tree may have many derivations
• Left-most and right-most derivations are
important in parser implementation
42
Ambiguity
• Grammar E E + E | E * E | ( E ) | id
• String id * id + id
43
Ambiguity (Cont.)
This string has two parse trees
E E
E + E E * E
E E id id E + E
*
id id id id
44
Ambiguity (Cont.)
• A grammar is ambiguous if it has more than
one parse tree for some string
– Equivalently, there is more than one right-most or
left-most derivation for some string
• Ambiguity is BAD
– Leaves meaning of some programs ill-defined
45
Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar
unambiguously
EE+T|T
TT*F|F
F ( E ) | id
• Enforces precedence of * over +
46
Ambiguity in Arithmetic Expressions
• Recall the grammar
E E + E | E * E | ( E ) | id
• The string id * id + id has two parse trees:
E E
E + E E * E
E * E id id E + E
id id id id
47
Ambiguity in Arithmetic Expressions
• Recall the grammar
E E + E | E * E | ( E ) | id
• The string id * id + id has two parse trees:
E E
E + E E * E
E * E id id E + E
id id id id
48
Ambiguity: The Dangling Else
• Consider the grammar
S if E then S
| if E then S else S
| OTHER
• This grammar is also ambiguous
49
The Dangling Else: Example
• The expression
if E1 then if E2 then S else S
has two parse trees
S S
if E1 then S else S if E1 then S
if E2 then S if E2 then S else S
• Typically we want the second form
50
The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar
S MIF /* all then are matched */
| UIF /* some then is unmatched */
MIF if E then MIF else MIF
| OTHER
UIF if E then S
| if E then MIF else UIF
• Describes the same set of strings
51
The Dangling Else: Example Revisited
• The expression if E1 then if E2 then S else S
S S
UIF UIF
if E1 then MIF elseUIF if E1 then S
MIF
if E2 then S
if E2 then MIF else MIF
• Not valid because the • A valid parse tree
then expression is not (for a UIF)
a MIF
Ambiguity
• No general techniques for handling ambiguity
• Impossible to convert automatically an
ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the
grammar
– Sometimes allows more natural definitions
– We need disambiguation mechanisms
53