G52Cmp Compilers: Syntax Analysis
Syntax Analysis
Lecture 5
Learning objectives
Syntax Analysis
[Figure: phases of a compiler: Source Program → Lexical Analyser → Syntax Analyser → Semantic Analyser → Intermediate Code Generator → Code Optimiser → Code Generator → Target Program, with the Symbol Table Manager and the Error Handler interacting with every phase.]
Grouping of Tokens
Where is the Syntax Analysis?
if (b == 0) a = b;
Lexical analysis produces the token stream: if ( b == 0 ) a = b ;
Syntax analysis groups the tokens into an abstract syntax tree (or a parse tree, without token names):
[Figure: tree with if at the root; its children are the comparison == (with leaves b and 0) and the assignment = (with leaves a and b). // inorder traversal of each operator subtree recovers the token order]
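As a minimal sketch (illustrative Python, not from the lecture), the tree above can be represented with nested tuples; an inorder traversal of an operator node then recovers the original token order:

# Illustrative only: each operator node is (operator, left, right); leaves are token strings.
cond   = ("==", "b", "0")        # condition subtree of the if
assign = ("=",  "a", "b")        # assignment subtree of the if
ast    = ("if", cond, assign)    # root of the abstract syntax tree

def inorder(node):
    # Visit the left subtree, then the operator, then the right subtree.
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return inorder(left) + [op] + inorder(right)

print(inorder(cond))    # ['b', '==', '0']
print(inorder(assign))  # ['a', '=', 'b']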
The Role of the Parser
[Figure: the source program enters the Lexical Analyzer, which hands a token to the Parser on each "get next token" request; the Parser builds the syntax tree for the Semantic Analyzer, which produces the intermediate representation; all phases consult the symbol table.]
Parsing Analogy (for English language)
• Syntax analysis for natural languages
• Recognize whether a sentence is grammatically correct
• Identify the function of each word
“I gave him the book”
[Figure: parse tree for the sentence “I gave him the book”; for example, “the book” is parsed as article + noun within the sentence.]
Parsing Overview
• Goal – determine if the input token stream satisfies
the syntax of the program
• What do we need to do this?
– An expressive way to describe the syntax
– A mechanism that determines if the input token
stream satisfies the syntax description
• For lexical analysis
– Regular expressions describe tokens
– Finite automata provide the mechanism to recognize tokens
in the input stream
Use Regular Expressions
• REs can expressively describe tokens
– Easy to implement via DFAs
• So just use them to describe the syntax of a
programming language ???
– NO! They are not powerful enough to express non-trivial
syntax. Example: nested constructs (blocks, expressions,
statements), such as balanced braces:
{ { { { {
{{} {} {{} { }}}
...
} } } } }
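As an illustration (a small Python sketch, not from the slides): recognizing balanced braces needs a counter that can grow without bound, which a finite automaton with a fixed set of states cannot provide.

def braces_balanced(s):
    depth = 0
    for c in s:
        if c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
            if depth < 0:        # a '}' appeared with no matching '{'
                return False
    return depth == 0            # every '{' must have been closed

print(braces_balanced("{{} {} {{} { }}}"))   # True
print(braces_balanced("{{}"))                # False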
Context-Free Grammars
• Consist of 4 components/properties:
– Terminal symbols = tokens
– Nonterminal symbols = syntactic variables
– Start symbol S = a special non-terminal
– Productions of the form LHS → RHS
• LHS = single non-terminal
• RHS = string of terminals and non-terminals
• Specify how non-terminals may be expanded
Example production rules (grammar):
S → aSa
S → T
T → bTb
T → ε
• Language generated by a grammar is the set of strings of
terminals derived from the start symbol by repeatedly
applying the productions
– L(G) = language generated by grammar G
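For the example grammar above, L(G) = { a^n b^2m a^n : n, m ≥ 0 }. A minimal membership test for this particular language (illustrative Python, not from the slides):

def in_language(s):
    # Strings derived from S -> aSa | T, T -> bTb | epsilon have the shape a^n b^2m a^n.
    i, j = 0, len(s) - 1
    while i < j and s[i] == 'a' and s[j] == 'a':   # peel matching outer a's (uses of S -> aSa)
        i += 1
        j -= 1
    middle = s[i:j + 1]
    # The remainder must come from T: an even number of b's (T -> bTb, finally T -> epsilon).
    return all(c == 'b' for c in middle) and len(middle) % 2 == 0

print(in_language("abba"))   # True:  S => aSa => aTa => abTba => abba
print(in_language("ab"))     # False: not derivable from S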
Context-Free Grammars continued
Notational Conventions
To avoid always having to state that “these are the
terminals”, “these are the nonterminals”, and so on, we
shall employ the following notational conventions with
regard to grammars throughout the remainder of this
subject.
Terminals: id, +, -, *, /, (, )
Nonterminals: expr, op
Productions:
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → + | - | * | /
The start symbol: expr
Notational Conventions continued
1. These symbols are terminals:
i) Lower-case letters early in the alphabet such as a, b, c.
ii) Operator symbols such as +, -, etc.
iii) Punctuation symbols such as parentheses, comma, etc.
iv) The digits 0, 1, . . . , 9.
v) Boldface strings such as id or if.
Notational Conventions continued
Notational Conventions continued
Derivations
E ⇒ - E ⇒ - ( E ) ⇒ - ( id )
We write E ⇒* - ( id ) (derives in zero or more steps) and E ⇒+ - ( id ) (derives in one or more steps).
Example
• Grammar for balanced-parentheses language
– S → ( S ) S
– S → ε // Why is the final S required?
• 1 non-terminal: S
• 2 terminals: “(”, “)”
• Start symbol: S
• 2 productions
• If the grammar accepts a string, there is a derivation of that
string using the productions. For example:
– S ⇒ (S)S ⇒ (S) ⇒ ((S)S) ⇒ (()S) ⇒ (())
– S ⇒ (S)S ⇒ ((S)S)S ⇒ ((S)S)(S)S ⇒* (()) ()
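The two productions translate directly into a recursive-descent recognizer for this language (a minimal Python sketch, not the lecture's code; the production is chosen by looking at the next character):

def parse(s):
    pos = 0
    def S():
        nonlocal pos
        if pos < len(s) and s[pos] == '(':      # S -> ( S ) S
            pos += 1
            S()
            if pos >= len(s) or s[pos] != ')':
                raise SyntaxError("expected ')'")
            pos += 1
            S()
        # otherwise: S -> epsilon, consume nothing
    try:
        S()
    except SyntaxError:
        return False
    return pos == len(s)                        # accept only if all input was consumed

print(parse("(())"))     # True
print(parse("(())()"))   # True
print(parse("(()"))      # False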
More on CFGs
• Shorthand notation – vertical bar for multiple
productions
S → aSa | T
T → bTb | ε
• CFGs are powerful enough to express the syntax of
most programming languages
• Derivation = successive application of productions
starting from S
• Acceptance? = Determine if there is a derivation for
an input token stream
RE is a Subset of CFG
Can inductively build a grammar for each RE
ε        S → ε
a        S → a
R1 R2    S → S1 S2
R1 | R2  S → S1 | S2
R1*      S → S1 S | ε
Note:
G1 = grammar for RE1, with start symbol S1
G2 = grammar for RE2, with start symbol S2
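To make the construction concrete, here is a small Python sketch (illustrative only; the tuple encoding of REs is my own assumption) that builds productions inductively, following the table above:

def re_to_cfg(regex):
    """Return (start_nonterminal, productions) for an RE given as a tuple tree:
    ('sym', 'a'), ('concat', r1, r2), ('alt', r1, r2) or ('star', r1)."""
    prods = []
    count = 0
    def fresh():
        nonlocal count
        count += 1
        return f"S{count}"
    def build(r):
        S = fresh()
        if r[0] == 'sym':                          # a       =>  S -> a
            prods.append((S, [r[1]]))
        elif r[0] == 'concat':                     # R1 R2   =>  S -> S1 S2
            prods.append((S, [build(r[1]), build(r[2])]))
        elif r[0] == 'alt':                        # R1 | R2 =>  S -> S1 | S2
            prods.append((S, [build(r[1])]))
            prods.append((S, [build(r[2])]))
        elif r[0] == 'star':                       # R1*     =>  S -> S1 S | epsilon
            S1 = build(r[1])
            prods.append((S, [S1, S]))
            prods.append((S, []))                  # the empty list stands for epsilon
        return S
    return build(regex), prods

# Grammar for the regular expression (a|b)*
start, productions = re_to_cfg(('star', ('alt', ('sym', 'a'), ('sym', 'b'))))
print(start, productions)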
Context-Free Languages
A Parser
[Figure: the Parser takes a context-free grammar G and the token stream s (from the lexer); it answers "Yes" if s is in L(G) and "No" otherwise, and reports error messages.]
Left- & Right-most Derivations
Each derivation step needs to choose
a nonterminal to rewrite
a production to apply
Consider the following grammar:
E → E A E | ( E ) | - E | id
A → + | - | * | /
A leftmost derivation always chooses the
leftmost nonterminal to rewrite
E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E A E ) ⇒lm - ( E + E )
⇒lm - ( id + E ) ⇒lm - ( id + id )
A rightmost derivation always chooses the rightmost nonterminal to rewrite.
// leftmost derivation
E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )
// rightmost derivation
E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )
[Figure: both derivations build the same parse tree for - ( id + id ): E at the root with children -, (, E, ); the inner E expands to E + E, and both of those E's expand to id.]
Parse Tree vs Abstract Syntax Tree
S → E + S | E
E → number | ( S )
Derive: (1 + 2 + (3 + 4)) + 5
• Parse tree is a tree representation of the derivation
• Leaves of the tree are terminals
• Internal nodes are non-terminals
• NO information about the order of the derivation steps
• AST (abstract syntax tree) discards unneeded information; more compact format; inorder traversal recovers the expression
[Figure: the parse tree for (1 + 2 + (3 + 4)) + 5, and the corresponding AST: a + node with 5 as its right operand and the subtree for 1 + (2 + (3 + 4)) as its left operand.]
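To show how such a grammar can drive a parser that builds the AST directly, a minimal recursive-descent sketch (illustrative Python, not the lecture's code; tokens are passed in as a ready-made list):

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def parse_S():                       # S -> E + S | E
        nonlocal pos
        left = parse_E()
        if peek() == '+':
            pos += 1
            return ('+', left, parse_S())
        return left
    def parse_E():                       # E -> number | ( S )
        nonlocal pos
        if peek() == '(':
            pos += 1
            inner = parse_S()            # parentheses leave no node in the AST
            assert peek() == ')', "expected ')'"
            pos += 1
            return inner
        tok = peek()
        assert tok is not None and tok.isdigit(), "expected a number"
        pos += 1
        return int(tok)
    ast = parse_S()
    assert pos == len(tokens), "trailing input"
    return ast

# AST for (1 + 2 + (3 + 4)) + 5
print(parse(['(', '1', '+', '2', '+', '(', '3', '+', '4', ')', ')', '+', '5']))
# ('+', ('+', 1, ('+', 2, ('+', 3, 4))), 5)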
Derivation Order
• Can choose to apply productions in any order: select a
non-terminal and substitute the RHS of one of its
productions
• Two standard orders: left and right-most
• Consider the following grammar:
S → E + S | E
E → number | ( S )
• Leftmost derivation
– In the string, find the leftmost non-terminal and apply a
production to it
– E + S ⇒ 1 + S
• Rightmost derivation
– Same, but find rightmost non-terminal
– E + S ⇒ E + E + S
Leftmost & Rightmost Derivation
Grammar: S → E + S | E    E → number | ( S )
Leftmost derivation of (1 + 2 + (3 + 4)) + 5:
S ⇒ E+S ⇒ (S)+S ⇒ (E+S)+S ⇒ (1+S)+S ⇒ (1+E+S)+S ⇒ (1+2+S)+S ⇒ (1+2+E)+S ⇒ (1+2+(S))+S
⇒ (1+2+(E+S))+S ⇒ (1+2+(3+S))+S ⇒ (1+2+(3+E))+S ⇒ (1+2+(3+4))+S ⇒ (1+2+(3+4))+E ⇒ (1+2+(3+4))+5
Rightmost derivation of (1 + 2 + (3 + 4)) + 5:
S ⇒ E+S ⇒ E+E ⇒ E+5 ⇒ (S)+5 ⇒ (E+S)+5 ⇒ (E+E+S)+5 ⇒ (E+E+E)+5 ⇒ (E+E+(S))+5
⇒ (E+E+(E+S))+5 ⇒ (E+E+(E+E))+5 ⇒ (E+E+(E+4))+5 ⇒ (E+E+(3+4))+5 ⇒ (E+2+(3+4))+5 ⇒ (1+2+(3+4))+5
Result: Same parse tree, same productions chosen, but in different order
Ambiguous Grammar
* A grammar is ambiguous if it produces more than
one parse tree for some sentence
E → E + E | E * E | id
Derive: id + id * id
Two different derivations (and hence two different parse trees):
E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
[Figure: the two corresponding parse trees: one with + at the root, grouping the input as id + (id * id); one with * at the root, grouping it as (id + id) * id.]
Resolving Ambiguity
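One standard way to remove this + / * ambiguity (a common textbook rewriting, shown here as an illustration) is to encode operator precedence and associativity in the grammar itself:
E → E + T | T
T → T * F | F
F → id
With these productions, id + id * id has exactly one parse tree, in which * binds more tightly than + and both operators associate to the left.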
Example
The dangling-else grammar
stmt → if expr then stmt
| if expr then stmt else stmt
| other
Which parse tree for if E1 then if E2 then S1 else S2 ?
[Figure: the two candidate parse trees. In one, the else belongs to the inner statement: if E1 then (if E2 then S1 else S2). In the other, it belongs to the outer statement: if E1 then (if E2 then S1) else S2.]
Incorporating Disambiguating Rules
into Grammars
* Rule: match each “else” with the closest previous unmatched
“then”
* Remove undesired state transitions in the pushdown automaton
Which parse tree for if E1 then if E2 then S1 else S2 ?
[Figure: with this rule, the else attaches to the closest previous unmatched then, so the chosen parse tree is if E1 then (if E2 then S1 else S2).]
Grammar Rewriting
Consider the ambiguous grammar below:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
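A standard rewriting (the usual textbook solution, offered here as an illustration) splits statements into those whose then parts are all matched and those with an open then:
stmt → matched | open
matched → if expr then matched else matched | other
open → if expr then stmt | if expr then matched else open
With this grammar, each else can only attach to the closest previous unmatched then, so the dangling-else ambiguity disappears.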
The cat sees a rat. The rat sees me. I like a cat.
The cat sees the rat. The rat like me. I see the rat.
I sees a rat.
I like the rat.
// Shhh… I don’t like rat.
Bottom-up parsing
The parse tree “grows” from the bottom (leaves) up to the top (root).
[Figure: an English sentence parse tree built bottom-up: the Noun leaves are grouped into the Subject and the Object, which are then combined into the Sentence at the root.]
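As a tiny illustration (my own example using the earlier expression grammar S → E + S | E, E → number | ( S ), not taken from the slides), a bottom-up parse reduces the input to the start symbol, tracing a rightmost derivation in reverse:
1 + 2   reduce 1 to E            ⇒ E + 2
E + 2   reduce 2 to E            ⇒ E + E
E + E   reduce the second E to S ⇒ E + S
E + S   reduce E + S to S        ⇒ S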