G52Cmp Compilers: Syntax Analysis

The document discusses syntax analysis in compilers. It describes the process of checking the syntax of a program using context free grammars. Context free grammars can expressively describe the syntax of programming languages using terminals, nonterminals, productions and a start symbol. Parsing is the process of determining if an input token stream satisfies the grammar using techniques like top-down and bottom-up parsing. Ambiguities in grammars need to be resolved for accurate parsing.

G52CMP COMPILERS

Syntax Analysis
Lecture 5

1
Learning objectives

• To describe the process of checking the syntax of a program
• To describe the representation of the syntax of a programming language using a context-free grammar
• To understand the acceptance of a sentence for a grammar using derivation
• To understand the problem of ambiguity in grammars and how to resolve it
• To describe the top-down and bottom-up parsing processes in accepting sentences for a given grammar

2
Syntax Analysis in the compiler pipeline:

Source Program
  → Lexical Analyser
  → Syntax Analyser
  → Semantic Analyser
  → Intermediate Code Generator
  → Code Optimiser
  → Code Generator
  → Target Program

The Symbol Table Manager and the Error Handler interact with every phase.
3
Grouping of Tokens

4
Where is the Syntax Analysis?
if (b == 0) a = b;

        | Lexical Analysis or Scanner
        v
if  (  b  ==  0  )  a  =  b  ;

        | Syntax Analysis or Parsing
        v
        if                      abstract syntax tree or parse tree
       /  \                     (without token names)
     ==    =
    /  \  /  \
   b   0 a    b                 // inorder traversal
5
The Role of the Parser

source program → Lexical Analyzer → token → Parser → syntax tree → Semantic Analyzer → intermediate representation

(the Parser repeatedly requests "get next token" from the Lexical Analyzer; both consult the symbol table)
6
Parsing Analogy (for English language)
• Syntax analysis for natural languages
• Recognize whether a sentence is grammatically correct
• Identify the function of each word

"I gave him the book"

sentence
├─ subject: I
├─ verb: gave
├─ indirect object: him
└─ object: noun phrase
   ├─ article: the
   └─ noun: book
7
Parsing Overview
• Goal – determine if the input token stream satisfies
the syntax of the program
• What do we need to do this?
– An expressive way to describe the syntax
– A mechanism that determines if the input token
stream satisfies the syntax description
• For lexical analysis
– Regular expressions describe tokens
– Finite automata have the mechanisms to generate
tokens from input stream

8
Use Regular Expressions
• REs can expressively describe tokens
– Easy to implement via DFAs
• So just use them to describe the syntax of a
programming language ???
– NO! – They don’t have enough power to express any
non-trivial syntax. Example: Nested constructs (blocks,
expressions, statements) such as to detect balanced
braces:
{ { { { {
{{} {} {{} { }}}
...
} } } } }

9
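To make the limitation concrete: matching balanced braces needs unbounded counting, which no finite automaton (and hence no regular expression) can do, while a counter or recursion handles it directly. A minimal sketch (my own code, not from the slides):

```python
# Balanced-brace check: requires counting nesting depth without bound,
# which is exactly what a DFA's finite memory cannot provide.
def balanced(s):
    depth = 0
    for c in s:
        if c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
            if depth < 0:        # a '}' with no matching '{'
                return False
    return depth == 0            # every '{' was closed

print(balanced("{{} {} {{} { }}}"))  # True
print(balanced("}{"))                # False
```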
Context-Free Grammars
// production rules or grammar
S → aSa
S → T
T → bTb
T → ε

• Consist of 4 components/properties:
  – Terminal symbols = tokens or ε
  – Nonterminal symbols = syntactic variables
  – Start symbol S = special nonterminal
  – Productions of the form LHS → RHS
    • LHS = single nonterminal
    • RHS = string of terminals and nonterminals
    • Specify how nonterminals may be expanded
• Language generated by a grammar is the set of strings of terminals derived from the start symbol by repeatedly applying the productions
  – L(G) = language generated by grammar G

10
Context-Free Grammars continued

• A set of terminals: basic symbols from which sentences are formed
• A set of nonterminals: syntactic categories denoting sets of sentences
• A set of productions: rules specifying how the terminals and nonterminals can be combined to form sentences
• The start symbol: a distinguished nonterminal denoting the language

11
Notational Conventions
To avoid always having to state that "these are the terminals", "these are the nonterminals", and so on, we shall employ the following notational conventions with regard to grammars throughout the remainder of this subject.

Terminals: id, +, -, *, /, (, )
Nonterminals: expr, op
Productions:
  expr → expr op expr
  expr → ( expr )
  expr → - expr
  expr → id
  op → + | - | * | /
The start symbol: expr
12
Notational Conventions continued
1. These symbols are terminals:
   i)   Lower-case letters early in the alphabet, such as a, b, c.
   ii)  Operator symbols such as +, -, etc.
   iii) Punctuation symbols such as parentheses, comma, etc.
   iv)  The digits 0, 1, ..., 9.
   v)   Boldface strings such as id or if.

2. These symbols are nonterminals:
   i)   Upper-case letters early in the alphabet, such as A, B, C.
   ii)  The letter S, which, when it appears, is usually the start symbol.
   iii) Lower-case italic names such as expr or stmt.

13
Notational Conventions continued

3. Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either nonterminals (normally) or terminals.
4. Lower-case letters late in the alphabet, chiefly u, v, ..., z, represent strings of terminals.
5. Lower-case Greek letters, α, β, γ, ..., represent strings of grammar symbols. Thus, a generic production could be written as A → α, indicating that there is a single nonterminal A on the left of the arrow (the left side of the production) and a string of grammar symbols α to the right of the arrow (the right side of the production).

14
Notational Conventions continued

6. If A → α1, A → α2, ..., A → αk are all productions with A on the left (we call them A-productions), we may write A → α1 | α2 | ... | αk. We call α1, α2, ..., αk the alternatives for A.
7. The left side of the first production is the start symbol.
   E → E A E | ( E ) | - E | id
   A → + | - | * | /
15
Derivations

A derivation step is an application of a production as a rewriting rule:
  E ⇒ - E
A sequence of derivation steps
  E ⇒ - E ⇒ - ( E ) ⇒ - ( id )
is called a derivation of "- ( id )" from E.
The symbol ⇒* denotes: derives in zero or more steps.
The symbol ⇒+ denotes: derives in one or more steps.
  E ⇒* - ( id )        E ⇒+ - ( id )

1. α ⇒* α for any string α                  // reflexive relation
2. If α ⇒* β and β ⇒ γ, then α ⇒* γ        // transitive relation
16
example
• Grammar for balanced-parentheses language
  – S → ( S ) S
  – S → ε        // Why is the final S required?
• 1 nonterminal: S
• 2 terminals: "(", ")"
• Start symbol: S
• 2 productions
• If the grammar accepts a string, there is a derivation of that string using the productions. For example:
  – S ⇒ ( S ) S ⇒ ( S ) ε ⇒ ( ( S ) S ) ε ⇒ ( ( ) ε ) ε ⇒ ( ( ) )
  – S ⇒ ( S ) S ⇒ ( ( S ) S ) S ⇒ ( ( S ) S ) ( S ) S ⇒* ( ( ) ) ( )

17
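The two productions of the balanced-parentheses grammar map directly onto a recursive-descent recognizer. A sketch (my own code and names, not from the slides):

```python
# Recursive-descent recognizer following the grammar literally:
#   S -> ( S ) S | ε
def parse_S(s, i):
    """Consume one S starting at index i; return the index just after it."""
    if i < len(s) and s[i] == '(':       # alternative  S -> ( S ) S
        i = parse_S(s, i + 1)            # inner S
        if i >= len(s) or s[i] != ')':
            raise SyntaxError("expected ')'")
        return parse_S(s, i + 1)         # trailing S: allows ()() sequences
    return i                             # alternative  S -> ε

def balanced_parens(s):
    try:
        return parse_S(s, 0) == len(s)
    except SyntaxError:
        return False

print(balanced_parens("(())()"))  # True
print(balanced_parens("(()"))     # False
```

The trailing S in `( S ) S` is what lets one balanced pair be followed by further pairs; without it the grammar could not derive strings like `()()`.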
More on CFGs
• Shorthand notation – vertical bar for multiple productions
  S → aSa | T
  T → bTb | ε
• CFGs are powerful enough to express the syntax of most programming languages
• Derivation = successive application of productions starting from S
• Acceptance? = Determine if there is a derivation for an input token stream

18
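The shorthand grammar above (whose language is the set of strings aⁿb²ᵐaⁿ) can also be checked by a small recursive recognizer. A sketch with my own names, not from the slides:

```python
# Recognizer for  S -> a S a | T,  T -> b T b | ε.
# It checks a span s[i:j] against a nonterminal.  Peeling matching outer
# a's greedily is safe here because T generates only b's, so no
# backtracking between the two S-alternatives is needed.

def match_S(s, i, j):
    if j - i >= 2 and s[i] == 'a' and s[j - 1] == 'a':   # S -> a S a
        return match_S(s, i + 1, j - 1)
    return match_T(s, i, j)                              # S -> T

def match_T(s, i, j):
    if i == j:                                           # T -> ε
        return True
    if j - i >= 2 and s[i] == 'b' and s[j - 1] == 'b':   # T -> b T b
        return match_T(s, i + 1, j - 1)
    return False

def in_language(s):
    return match_S(s, 0, len(s))

print(in_language("abbbba"))  # True
print(in_language("aba"))     # False  (odd number of b's)
```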
RE is a Subset of CFG
Can inductively build a grammar for each RE:

  ε          S → ε
  a          S → a
  R1 R2      S → S1 S2
  R1 | R2    S → S1 | S2
  R1*        S → S1 S | ε

Note:
  G1 = grammar for RE1, with start symbol S1
  G2 = grammar for RE2, with start symbol S2

19
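As a worked example (my own, not on the slide), applying these rules bottom-up to the RE (a|b)* gives:

  S1 → a             // grammar for a
  S2 → b             // grammar for b
  S3 → S1 | S2       // grammar for a|b
  S4 → S3 S4 | ε     // grammar for (a|b)*, with start symbol S4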
Context-Free Languages

A context-free language L(G) is the language defined by


a context-free grammar G
A string of terminals  is in L(G) if and only if S + ,
 is called a sentence of G
If S * , where  may contain nonterminals, then we
call  a sentential form of G
E  - E  - ( E )  - ( id )
G1 is equivalent to G2 if L(G1) = L(G2)

20
A Parser
A parser takes a context-free grammar G and a token stream s (from the lexer) and answers:
  Yes, if s is in L(G)
  No, otherwise (plus error messages)

Syntax analyzers (parsers) = CFG acceptors which also output the corresponding derivation when the token stream is accepted.

Various kinds of parsing: LL(k), LR(k), SLR, LALR

21
Left- & Right-most Derivations
Each derivation step needs to choose
  • a nonterminal to rewrite
  • a production to apply
Consider the following grammar:
  E → E A E | ( E ) | - E | id
  A → + | - | * | /
A leftmost derivation always chooses the leftmost nonterminal to rewrite:
  E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E A E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )
A rightmost derivation always chooses the rightmost nonterminal to rewrite:
  E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E A E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )
22
Parse Trees
• A parse tree is a graphical representation for a derivation that filters out the order of choosing nonterminals for rewriting
• Many derivations may correspond to the same parse tree, but every parse tree has associated with it a unique leftmost and a unique rightmost derivation

// leftmost derivation
E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )

// rightmost derivation
E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )

Parse tree for - ( id + id ):
      E
     / \
    -   E
      / | \
     (  E  )
      / | \
     E  +  E
     |     |
     id    id
23
Parse Tree vs Abstract Syntax Tree
S SE+S | E +
E + S E  number | (S)
+ 5
Derive: (1 + 2 + (3 + 4)) + 5
( S ) E 1 +
• Parse tree is a tree
2 +
E + S 5 representation of the derivation
• Leaves of the tree are terminals
3 4
1 E + S • Internal nodes are non-
AST (abstract
terminals
syntax tree) discards
2 E • NO information about the order
unneeded
of the derivation steps
( S ) information, more
compact format,
E + S inorder traversal

3 E

4 24
Derivation Order
• Can choose to apply productions in any order: select a non-terminal and substitute the RHS of a production
• Two standard orders: leftmost and rightmost
• Consider the following grammar:
  S → E + S | E
  E → number | ( S )
• Leftmost derivation
  – In the string, find the leftmost non-terminal and apply a production to it
  – E + S ⇒ 1 + S
• Rightmost derivation
  – Same, but find the rightmost non-terminal
  – E + S ⇒ E + E + S

25
Leftmost & Rightmost Derivation
Grammar:  S → E + S | E ;  E → number | ( S )
Derive: (1 + 2 + (3 + 4)) + 5

Leftmost derivation:          Rightmost derivation:
S ⇒ E+S                       S ⇒ E+S
  ⇒ (S)+S                       ⇒ E+E
  ⇒ (E+S)+S                     ⇒ E+5
  ⇒ (1+S)+S                     ⇒ (S)+5
  ⇒ (1+E+S)+S                   ⇒ (E+S)+5
  ⇒ (1+2+S)+S                   ⇒ (E+E+S)+5
  ⇒ (1+2+E)+S                   ⇒ (E+E+E)+5
  ⇒ (1+2+(S))+S                 ⇒ (E+E+(S))+5
  ⇒ (1+2+(E+S))+S               ⇒ (E+E+(E+S))+5
  ⇒ (1+2+(3+S))+S               ⇒ (E+E+(E+E))+5
  ⇒ (1+2+(3+E))+S               ⇒ (E+E+(E+4))+5
  ⇒ (1+2+(3+4))+S               ⇒ (E+E+(3+4))+5
  ⇒ (1+2+(3+4))+E               ⇒ (E+2+(3+4))+5
  ⇒ (1+2+(3+4))+5               ⇒ (1+2+(3+4))+5

Result: Same parse tree, same productions chosen, but in different order
26
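The grammar above is easy to parse top-down. A minimal recursive-descent sketch (my own code and names, not from the slides) that builds a nested-tuple AST:

```python
# Recursive-descent parser for:
#   S -> E + S | E
#   E -> number | ( S )

def tokenize(text):
    """Split the input into number tokens and single-character punctuation."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(int(text[i:j]))
            i = j
        else:
            tokens.append(c)   # '(', ')', '+'
            i += 1
    return tokens

def parse_S(tokens, i):
    """S -> E + S | E.  Returns (subtree, index just after the S)."""
    left, i = parse_E(tokens, i)
    if i < len(tokens) and tokens[i] == '+':
        right, i = parse_S(tokens, i + 1)   # right recursion: right-assoc AST
        return ('+', left, right), i
    return left, i

def parse_E(tokens, i):
    """E -> number | ( S )."""
    if tokens[i] == '(':
        inner, i = parse_S(tokens, i + 1)
        if tokens[i] != ')':
            raise SyntaxError("expected ')'")
        return inner, i + 1
    return tokens[i], i + 1                  # a number token

def parse(text):
    tokens = tokenize(text)
    ast, i = parse_S(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return ast

print(parse("(1 + 2 + (3 + 4)) + 5"))
# ('+', ('+', 1, ('+', 2, ('+', 3, 4))), 5)
```

Note how the right recursion in S → E + S produces a right-leaning AST, matching the AST drawn on the previous slide.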
Ambiguous Grammar
• A grammar is ambiguous if it produces more than one parse tree for some sentence

Grammar:  E → E * E | E + E | id          Derive: id + id * id

Derivation 1:            Derivation 2:
E ⇒ E + E                E ⇒ E * E
  ⇒ id + E                 ⇒ E + E * E
  ⇒ id + E * E             ⇒ id + E * E
  ⇒ id + id * E            ⇒ id + id * E
  ⇒ id + id * id           ⇒ id + id * id

Parse tree 1:            Parse tree 2:
    E                        E
  / | \                    / | \
 E  +  E                  E  *  E
 |   / | \              / | \   |
 id E  *  E            E  +  E  id
    |     |            |     |
    id    id           id    id
27
Resolving Ambiguity

• Use disambiguating rules to throw away undesirable parse trees
• Rewrite grammars by incorporating disambiguating rules into grammars

28
Example
The dangling-else grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | other

Which parse tree for  if E1 then if E2 then S1 else S2 ?

Tree 1 – the else attached to the outer if:
  stmt: if E1 then ( if E2 then S1 ) else S2
Tree 2 – the else attached to the inner if:
  stmt: if E1 then ( if E2 then S1 else S2 )
29
Incorporating Disambiguating Rules
into Grammars
• Rule: match each "else" with the closest previous unmatched "then"
• Remove undesired state transitions in the pushdown automaton

For  if E1 then if E2 then S1 else S2 , the rule selects the tree in which the else belongs to the inner if:
  stmt: if E1 then ( if E2 then S1 else S2 )
30
Grammar Rewriting
Consider the ambiguous grammar below:
stmt  if expr then stmt
| if expr then stmt else stmt
| other

Incorporating Disambiguating Rules into a Grammar:


stmt  m_stmt
Try to derive by yourself:
| unm_stmt if E then if E then S else S

m_stmt  if expr then m_stmt else m_stmt


| other

unm_stmt  if expr then stmt


| if expr then m_stmt else unm_stmt 31
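One possible derivation with the rewritten grammar (my own worked answer; note that the else now necessarily attaches to the inner if):

  stmt ⇒ unm_stmt
       ⇒ if expr then stmt
       ⇒ if expr then m_stmt
       ⇒ if expr then if expr then m_stmt else m_stmt
       ⇒* if E then if E then S else S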
Different kinds of Parsing Algorithms
• Two big groups of algorithms can be distinguished:
– bottom up strategies
– top down strategies
• Analogy: parsing of “Micro-English” grammar below:
Sentence ::= Subject Verb Object .
Subject  ::= I | A Noun | The Noun
Object   ::= me | a Noun | the Noun
Noun     ::= cat | bat | rat
Verb     ::= like | is | see | sees

The cat sees a rat. The rat sees me. I like a cat.
The cat sees the rat. The rat like me. I see the rat.
I sees a rat.
I like the rat.
// Shhh… I don’t like rat.
32
Bottom up parsing
The parse tree “grows” from the bottom (leaves) up to the top (root), here for the input “The cat sees a rat.”

Sentence
├─ Subject
│  ├─ The
│  └─ Noun: cat
├─ Verb: sees
├─ Object
│  ├─ a
│  └─ Noun: rat
└─ .


33
Top-down parsing
The parse tree is constructed starting at the top (root) down to the bottom (leaves), here for the same input “The cat sees a rat.”

Sentence
├─ Subject
│  ├─ The
│  └─ Noun: cat
├─ Verb: sees
├─ Object
│  ├─ a
│  └─ Noun: rat
└─ .


34
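The top-down strategy can be sketched directly from the Micro-English grammar: expand each nonterminal’s alternatives left to right, backing up when an alternative fails. My own code and names, not from the slides:

```python
# Top-down (recursive-descent with backtracking) recognition of the
# Micro-English grammar from the slides.

GRAMMAR = {
    "Sentence": [["Subject", "Verb", "Object", "."]],
    "Subject":  [["I"], ["A", "Noun"], ["The", "Noun"]],
    "Object":   [["me"], ["a", "Noun"], ["the", "Noun"]],
    "Noun":     [["cat"], ["bat"], ["rat"]],
    "Verb":     [["like"], ["is"], ["see"], ["sees"]],
}

def match(symbol, tokens, i):
    """Try to derive a prefix of tokens[i:] from symbol; return the new
    position on success, or None on failure."""
    if symbol not in GRAMMAR:                     # terminal symbol
        return i + 1 if i < len(tokens) and tokens[i] == symbol else None
    for production in GRAMMAR[symbol]:            # try alternatives in order
        j = i
        for sym in production:
            j = match(sym, tokens, j)
            if j is None:
                break                             # this alternative failed
        else:
            return j                              # whole RHS matched
    return None                                   # no alternative worked

def accepts(sentence):
    tokens = sentence.replace(".", " .").split()
    return match("Sentence", tokens, 0) == len(tokens)

print(accepts("The cat sees a rat."))  # True
print(accepts("me sees a rat."))       # False
```

As the slide’s examples hint, the grammar happily accepts agreement errors like “The rat like me.” — syntax analysis checks structure, not meaning.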
Quick Review
• Syntactic analysis
– Lexical analysis
• Group letters into words (or group characters into
tokens)
• Use regular expressions and deterministic FSMs
– Grammar transformations before parsing, coming soon
• Left-recursion removal
• Left-factoring
• Substitution
– Parsing = structural analysis of program
• Group words into sentences, paragraphs, and
documents (or tokens into expressions, commands, and
programs)
• Top-down Parsing
• Bottom-up Parsing
35
END

36
