03 Compiler Design Lecture - Syntax Analysis
03 Compiler Design Lecture - Syntax Analysis
Syntax Analysis
Introduction
First task is to break up the text into meaningful words newval=oldval+12
called tokens.
id = id + num Source
Code
Token Stream Lexical Analysis (High
Level)
identifiers
The order of the tokens is not
important at this stage. Example:
Symbol Table Token Lexeme
12 + old val = newval
Will also be accepted. Id newval
Id oldval
Lexical Analyzer’s purpose is
Num 12
simply to extract the token.
There should not be any combination which can not pass as token. e.g., 12oldval
Syntax
After verifying that there is no lexical error, it is time to check for the order
of the tokens
id = id + num
Observe that the actual lexemes are not used here. Syntax Analysis phase
is not interested to know if it is
Oldval = newval + 12 or newval = oldval + 12
Just like Lexical Analysis was not interested in the order of token
Syntax
But the compiler process should not forget the lexemes. They will be used
later.
id = id + num
Symbol Table
Token Lexeme
Id oldval
Id newval
Tokens will carry the pointer to the symbol Num 12
table entry with them.
Syntax
Okay, now, how to check if the syntax is correct or not
id = id + num
Syntax Analysis Token Stream
A syntactic category can be made of other syntactic categories and finally, tokens.
Syntactic categories are designated as Non Terminals.
Recall that a non terminal can be derived into any combination of terminals and non
terminals, but eventually, it should be all tokens.
Syntax
The entire source program listing can be considered as a syntactic category , i.e., non
terminal, say P
A statement (whatever type it may be) can also be considered as another syntactic
category , i.e., non terminal, say S
P S;
S S;S
S id := id + id * number ;
Syntax
Let’s take another string myval = newval* 10
It will be converted to token stream id = id * num
id = id * num
Syntax Analysis Token Stream
S id = id + num S id = id * num
S id = id * num
Then the above combination will be
considered valid.
Syntax
Let’s take another string myval = newval* 10
It will be converted to token stream id = id * num
id = id * num
Syntax Analysis Token Stream
S id = id + num S id = id * num
S id = id * num
Then the above combination will be
considered valid.
Syntax
id = id + num ; id = id * num ;
newval=oldval+12;
Source myval=newval*10;
Token Lexical Code
Syntax Analysis
Stream Analysis (High
Level)
S id = id + num
S id = id * num Symbol Table Token Lexeme
Id Newval
Id oldval
So, the stream will be converted to S;S; Num 12
We can also check later if S;S; is valid or not.
Id Myval
It will be valid, if there is a production P S;S; Num 10
But combinations like S+S or S*S will not be valid
Syntax
So, any combination of tokens that can be reduced, meaning, that exists on
the right hand side of a production is valid.
Id = id – id
Id = id * id
Id = id + id – id
Id = id + id – num
Id = id * id – id
We have to have a limited set of rules using which we can generate all
combinations.
This is the malt that laid in the house that Jack built
This is the rat that ate the malt that laid in the house that Jack built
This is the cat that killed the rat that ate the malt that laid in the house
that Jack built
This is the dog that chased the cat that killed the rat that ate the malt that
laid in the house that Jack built
Syntax
There are limited types of tokens but the combination is infinite
E is a non terminal. It has to stay on LHS of at least one production. It can also
stay on the RHS of some productions.
E E + E E + id id + id
E E + E E + E * E E + E * id E + id * id id + id * id
E E + E E + E - E E + E - id E + id - id id + id – id
E E * E E * E - E E * E - id E * id - id id * id – id
E E * E E * E - E E * E – E / E id * E – E / E id * id – E / E
id * id – id / E id * id – id / id
(the non terminal being derived in each step has been highlighted)
Defining a grammar:
Start symbol is a non terminal from which the chain of derivations will start.
There can be only one.
N: Non terminal
α, β, γ : strings of terminals and non terminals
Definition: Given a context-free grammar G with start symbol S, terminal symbols T and
productions P, the language L(G) that G generates is defined to be the set of strings of
terminal symbols that can be obtained by derivation from S using the productions P, i.e.,
the set
1. T aTc
2. T aTc Rightmost
3. T R
4. R RbR
5. R ε
6. R RbR Leftmost
7. R RbR
8. R ε
9. R ε
10. R ε
In this derivation, we have applied derivation steps sometimes to the leftmost non
terminal, sometimes to the rightmost and sometimes to a non terminal that was neither.
Derivation- Parsing
The Syntax Analysis phase checks the structure of the source code statements. This is
.called Parsing
1. Trying to generate the statement from the start symbol and applying production
rules. This is called top down parsing.
T aabbbcc
2. Taking the string and applying productions in reverse to arrive at the start symbol.
This is called bottom up parsing
aabbbcc T
Derivation
However, since derivation steps are local, the order does not matter. So,
we might as well decide to always rewrite the leftmost non terminal.
a T c
Read the leaves from left to right
The leaves of the tree are terminals which, when read from left to right, form the
derived string. ε is ignored.
.
Derivation - Trees
Order of derivation does not matter: only choice of rule
E –> E + E
EE*E
E num
(i) Associativity
Note: Each of + and * can be both right associative and left associative, but for
convenience, they are made left associative. (parser has to follow any one rule)
(i) Precedence
a+ b * c will be treated as a + (b * c)
Ambiguity Detection
Ambiguity exists in the grammar is there exists a string which can result in two
distinct parse trees.
N NαN
E E op E
E num
(num is a numeric literal)
Rewriting ambiguous grammar
Expression Grammar
Rewrite as follows:
E E op E’
E E’
E’ num
Op + | - | * | /
Isolate the rightmost non terminal first, push it to a sub tree
E E’ op E E num op E
E E’ E num
E’ num
e.g., a<b
This is easily extendible where the cases where several operators with the same
precedence and associativity interact
EE+E‘
EE–E‘
E E ‘
Enum
“+” and “-” are both left associative hence left recursive grammar is required.
Rewriting ambiguous grammar
But if we mix left recursive with right recursive, it will be ambiguous again
EE+E‘
EE‘^E‘
E E ‘
Enum
EE+E‘
EE‘^E‘
E E ‘
Enum
E E + E2
EE – E2
E E2
E2E2*E3
E2E2/E3
E2E3
E3num
Other sources of ambiguity
Example:
It might mean
Or
if P then (if Q then S1) else S2
The grammar is
stmt <matched>
stmt <unmatched>
matched if <exp> then <matched> else <matched>
matched <id> :=<exp>
unmatched if <exp> then <matched> > else <unmatched>
unmatched if <exp> then <matched>
Other sources of ambiguity
For statements with unmatched if and else