Compiler Design
Compiler Design
1. Introduction to compilers:-
A compiler is a program that reads a program written in one language (source
language (or) high level language) and translates it into an equivalent program in another
language.(target language (or) low level language)
Compiler:- It converts the high level language into an equivalent low level language program.
PHASES OF COMPILER
Source Program
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Symbol Table Manager Error Handler
Code Optimizer
Code Generator
Target Program
Analysis Phase:-
The analysis phase breaks up the source program into constituent pieces. The analysis phase
of a compiler performs,
1. Lexical analysis
2. Syntax Analysis
3. Semantic Analysis
The lexical analysis phase reads the characters in the program and groups them into
tokens that are sequence of characters having a collective meaning.
Such as an Identifier, a Keyboard, a Punctuation, character or a multi character operator like ++.
“ The character sequence forming a token is called lexeme”
For Eg. Pos = init + rate * 60
Lexeme Token Attribute value
rate ID Pointer to symbol table
+ ADD
60 num 60
init ID Pointer to symbol table
Syntax analysis processes the string of descriptors (tokens), synthesized by the lexical
analyzer, to determine the syntactic structure of an input statement. This process is known as
parsing.
ie, Output of the parsing step is a representation of the syntactic structure of a statement.
Example:-
pos = init + rate * 60
=
pos +
init *
rate 60
3. Semantic Analysis:-
The semantic analysis phase checks the source program for semantic errors.
Processing performed by the semantic analysis step can classified into
a. Processing of declarative statements
b. Processing of executable statements
During semantic processing of declarative statements items of information are added to the
lexical tables.
Example:- (symbol table or lexical table)
real a, b;
id a real length ……
id b real length …..
Synthesis Phase:-
2 Principles of Compiler Design
1. Intermediate code generation
2. Code optimization
3. Code Generator
Example:-
pos = init + rate * 60
pos = init + rate * int to real (60)
id1 +
id2 *
id3 60
2. Code Optimization:-
The code optimization phase attempts to improve the intermediate code, so that faster
running machine code will result.
3. Code Generation:-
The final phase of the compiler is the generation of the target code or machine code or
assembly code.
Memory locations are selected for each of the variables used by the program. Then intermediate
instructions are translated into a sequence of machine instructions that perform the same task.
Example:-
MOV F id3, R2
MUL F #60.0, R2
MOV F id2, R1
ADD F R2, R1
MOV F R1,id1
Lexical Analyzer
Syntax Analyzer
id1 +
id2 *
id3 60
Semantic Analyzer
id1 +
id2 *
Code Optimizer
Code Generator
MOV F id3, R2
MUL F #60.0, R2
MOV F id2, R1
ADD F R2, R1
MOV F R1,id1
Error Handler:-
Each phase can encounter errors.
• The lexical phase can detect errors where the characters remaining in the input do not
form any token of the language.
• The syntax analysis phase can detect errors where the token stream violates the structure
rules of the language.
• During semantic analysis, the compiler tries to construct a right syntactic structure, but no
meaning to the operation involved.
• The intermediate code generator may detect an operator whose operands have in
compatible types.
• The code optimizer, doing control flow analysis may detect that certain statements can
never be reached.
• The code generator may find a compiler created constant that is too large to fit in a word
of the target machines.
Source Token
Program Lexical Analyzer Parser
Get next Token
Symbol Table
After receiving a “get next token” command from the parser, the lexical analyzer
reads input characters until it can identify a next token.
Token:-
Token is a sequence of characters that can be treated as a single logical entity. Typical
tokens are,
(a) Identifiers
(b) Keywords
(c) Operators
(d) Special symbols
(e) Constants
Pattern:-
A set of strings in the input for which the same token is produced as output, this set of
strings is called pattern.
Lexeme:-
A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Finite Automata
Definition:-
A recognizer for a language is a program that takes as input a string x and answers
“yes” if x is a sentence of the language and “no” otherwise.
A better way to covert a regular expression to a recognizer is to construct a
generalized transition diagram from the expression. This diagram is called finite automation.
Example:-
The transition graph for an NFA that recognizes the language (a/b)* a
start a
0 1
Input Symbol
State
a b
0 0,1 0
1 - -
PROBLEM:-
r5
r4 *
( r3 )
r1 / r2
a b
a
2 3
ε ε
NFA for r3 = r1/r2 start 6
1
ε ε
b
4 5
NFA for r4, that is (r3) is the same as that for r3.
r9 r10
r7 r8 b
r5 r6 b
r4 * a
( r3 )
r1 / r2
a b
a
2 3
ε ε
NFA for r3 = r1/r2 start 6
1
ε ε
b
4 5
NFA for r4, that is (r3) is the same as that for r3.
a
2 3 ε
start ε ε ε ε a 8
0 1 6
ε ε 7
b
4 5
NFA for r8 = b
start b
8
9
a
3 ε
2
start ε ε ε 7 a 8
b 9 b 1
0
0 1 6
ε ε
b
4 5
ε
Since A is the start state and state C is the only accepting state then, the transition table is,
Input symbol
State
a b
A B C
B B C
C B C
a a b
Start a b
A B C
a
3 ε
2
start ε ε ε 7 a 8
b 9 b 1
0
0 1 6
ε ε
b
4 5
ε
Since A is the start state and state E is the only accepting state then, the transition table is,
Input symbol
State
a b
A B C
B B D
C B C
D B E
E B C
C
b
b a
a
start A a c b
B D E
a
a
MINIMIZATION OF STATES
Problem 1: Construct a minimum state DFA for a regular expression (a/b)* abb
Solution:-
1. The NFA of (a/b)*abb is
a
3 ε
2
start ε ε ε 7 a 8
b 9 b 1
0
0 1 6
ε ε
b
4 5
ε
2. Construct a DFA:
Since A is the start state and state E is the only accepting state then, the transition table is,
Input symbol
State
a b
A B C
B B D
C B C
D B E
E B C
Let Π = ABCDE
The initial partition Π consists of two groups.
Π1 = ABCD ( that is the non – accepting states)
Π2 = E ( that is the accepting state)
AB
a a
A B B B
b b
A C B D
AC
a a
A B C B
b b
A C C C
b b
A C D E
On input “a” each of these states has a transition to B, so they could all remain in one group as
far as input a is concerned.
On input “b” A,B,C go to members of the group Π1 (ABCD) while D goes to Π2 (E) . Thus Π1
group is split into two new groups.
Π1 = ABC Π2 = D , Π3 = E
So, (ABC) (D) (E)
AB
a a
A B B B
b b
A C B D
Here B goes to Π2. Thus Π1 group is again split into two new groups. The new groups are,
Π1 = AC Π2 = B , Π3 = D, Π4 = E
So, (AC) (B) (D) (E)
Here we cannot split any of the groups consisting of the single state. The only possibility is try to
split only (AC)
For AC
a a
A B C B
b b
A C C C
But A and C go the same state B on input a, and they go to the same state C on input b.
Hence after this,
(AC) (B) (D) (E)
Here we choose A as the representative for the group AC.
Thus A is the start state and state E is the only accepting state.
b
b
a
start A a B b D b E
a
a
____________________________________________________________________________
__
Notational Conventions:-
PARSER:
A parser for grammar G is a program that takes a string W as input and produces either a
parse tree for W, if W is a sentence of G or an error message indicating that W is not a sentence
of G as output.
1. Bottom up Parser:-
The bottom up parser build parse trees from bottom (leaves) to the top (root). The input
to the parser is being scanned from left to right, one symbol at a time. This is also called as “Shift
Reduce Parsing” because it consisting of shifting input symbols onto a stack until the right side
of a production appears on top of the stack.
Here let us “reduce” a string w to the start symbol of a grammar. At each step a string
matching the right side of a production is replaced by the symbol on the left.
S aAcBe
A Ab/b
Bd
and the string abbcde.
We want to reduce the string to S. We scan abbcde, looking for substrings that match the right
side of some production. The substrings b and d qualify.
So, S abbcde
aAbcde (A b)
We now that Ab,b and d each match the right side of some production.
Suppose this time we choose to replace the substring Ab by A, in the left side of the production.
A Ab
We now obtain,
aAcde (A Ab)
Then replacing d by B
aAcBe (B d)
Now we can replace the entire string by S.
W = abbcde position production
abbcde 2 AAb/b (that is, Ab)
aAbcde 2 AAb
aAcde 4 Bd
aAcBe SaAcBe
Thus we will be reached the starting symbol S.
Each replacement of the right side of a production by the left side in the process above is
called a reduction.
In the above example abbcde is a right sentential form whose handle is,
Ab at position 2
AAb at position 2
Bd at position 4.
id + * $
id - > > >
+ < > < >
* < > > >
$ < < < -
+ - * / id ( ) $
+ > > < < < < < > >
- > > < < < < < > >
* > > > > < < < > >
/ > > > > < < < > >
> > > > < < < > >
id > > > > > - - > >
( < < < < < < < = -
) > > > > > - - > >
$ < < < < < < < - -
Derivations:-
The central idea is that a production is treated as a rewriting rule in which the non-
terminal in the left side is replaced by the string on the right side of the production.
For Ex, consider the following grammar for arithmetic expression,
EE+E | E*E | (E) | -E |id
That is we can replace a single E by –E. we describe this action by writing
E => -E , which is read “E derives –E”
E(E) tells us that we could also replace by (E).
So, E*E => (E) * E or E*E => E* (E)
We can take a single E and repeatedly apply production in any order to obtain sequence of
replacements.
E => -E
E => -(E)
E => -(id)
We call such sequence of replacements is called derivation.
Suppose α A β => α γ β then
A γ is a production and α and β are arbitrary strings of grammar symbols.
If α1=> α2 ……. => αn we say α1 derives αn
Example:- EE+E | E*E | (E) | -E |id. The string –(id+id) is a sentence of above grammar.
E => -E
=> -(E)
=> -(E+E)
=> - (id+E)
=> -(id+id)
The above derivation is called left most derivation and it can be re written as,
E => -E
=> -(E)
=> -(E+E)
=> - (id+E)
=> -(id+id)
E => -E
=> -(E)
=> -(E+E)
=> - (id+E)
=> -(id+id)
- E
E => -(E) E
- E
( E )
E => -(E+E) E
- E
( E )
E + E
E => -(id+E) E
- E
( E )
E + E
id
E => -(id+E) E
- E
( E )
E + E
id id
c A d
a b
We now have a match for the second input symbol. Advance the input pointer to d,
w = cad
We now consider the third input symbol, and the next leaf labeled b. Since b does not match d,
we report failure and go back to A to see whether there is another alternative for A.
In going back to A we must reset the input pointer to position 2.
W = cad
c A d
a
The leaf a matches the second symbol of w and the leaf d matches the third symbol. Now we
produced a parse tree for w = cad using the grammar ScAd and Aab/a.
This is successful completion.
25 Principles of Compiler Design
Difficulties of Top – Down Parsing (or) Disadvantages of Top - Down Parsing
1. Left Recursion:-
A grammar G is said to be left recursion if it has a non terminal A such that there is a
derivation , A + Aα for some α
This grammar can cause a top-down parser go into an infinite loop.
Elimination of left Recursion:-
Consider the left recursive pair of production
A Aα/ β ,
Where β does not begin with A.
This left recursion can be eliminated by replacing this pair of production with,
A βA´
A´ α A´ / ε
Parse tree of original Grammar:-
AA α/ β
A
A α
A α
β
Parse tree for new grammar to eliminate left recursion:-
β A´
α A´
α A´
a+b $
STACK
X
Y OUTPUT
Parsing
Program
Z Table
$
$E´ $ T´ε
$ $ E´ ε
Ex.No:2:- Give the Predictive parsing table for the following Grammar,
SiEtSS´/a
S´eS/ε
Eb
Solution:-
Elimination of Left Recursion
The above grammar has no left Recursion. So we move to First and Follow.
First(S) = {i, a}
First(S´) = {e, ε}
First(E) = { b}
30 Principles of Compiler Design
Follow(S) = First(S´) = {e, ε} + Follow(S´)
= { e, $}
Ex.No:3:- Give the Predictive parsing table for the following Grammar,
SCC
CcC/d
Solution:-
First(S) = First(C) = {c,d}
Follow(S) = { $ }
Follow( C ) = First ( C ) = { c, d, $}
So the predictive parsing table is
T
c d $
NT
SC SC
S
C C
C CcC Cd
T
a b e i t $
NT
SiCtSS
S Sa
´
S
S´ ´eS S´ε
S´ε
C Cb
LR Parsing Algorithm:
LR Parsers consists of an input, an output, a stack, a driver program and a parsing table
that has two functions.
1. ACTION 2. GOTO
The driver program is same for all LR Parsers. Only the parsing table changes from one
parser to another parser. The parsing program reads character from an input buffer one at a time.
The program uses a STACK to store a string of the form S0X1S1X2S2……XmSm, where Sm is on
top. Each Si is a symbol called STATE and each Xi is a grammar symbol.
a1 … ai … an $ Input
STACK
Sm
Xm
Xm-1
….
S0
ACTION GOTO
E´.E
E.E+T
E.T
T .T*F I0
T .F
F .(E)
F .id
GO TO(I0, E )
E´E. I1
EE.+T
GO TO (I0, T )
ET. I2
T T.*F
GO TO(I0, F )
F(.E)
E.E+T
E.T
T.T*F I4
T.F
F.(E)
F.id
GO TO(I0, id )
Fid . I5
GO TO(I1, + )
EE+.T
T.T*F
T.F I6
F.(E)
F.id
GO TO(I2, * )
TT*.F
F.(E) I7
F.id
GO TO(I4, E )
F(E.) I8
E.E+T
GO TO(I4, T )
ET. I2
TT.*F
GO TO(I4, F )
TF. I3
GO TO(I4, ( )
F(.E)
E.E+T
E.T
T.T*F I4
T.F
F.(E)
F.id
GO TO(I6, T )
EE+T. I9
TT.*F
GO TO(I6, F )
TF. I3
GO TO(I6, ( )
F(.E)
E.E+T
E.T
T.T*F I4
T.F
F.(E)
F.id
GO TO(I6, id )
Fid . I5
GO TO(I7, F )
TT*F. I10
GO TO(I7, ( )
F(.E)
E.E+T
E.T
T.T*F I4
T.F
F.(E)
F.id
GO TO(I7, id )
Fid . I5
GO TO(I8, ) )
F(E) . I11
GO TO(I8, + )
EE+.T
T.T*F
T.F I6
F.(E)
F.id
TF. (I3 )
ACTION(3,FOLLOW(T)) = (3,+),(3,) ), (3,$) r4
Fid. (I5 )
ACTION(5,FOLLOW(F)) = (5,*), (5,+ ), (5,) ), (5,$) r6
EE+T. (I9 )
ACTION(9,FOLLOW(E)) = (9,* ), (9,$) r1
TT*F. (I10 )
ACTION(10,FOLLOW(T)) = (10,+), (10,) ), (10,$) r3
F(E). (I11 )
ACTION(11,FOLLOW(F)) = (11,*), (11,+ ), (11,) ), (11,$) r5
ACTION GOTO
State
+ * ( ) id $ E T F
0 S4 S5 1 2 3
1 S6 acc
2 S7 r2 r2
3 r4 r4 r4
4 S4 S5 8 2 3
5 r6 r6 r6 r6
6 S4 S5 9 3
7 S4 S5 10
8 S6 S11
9 S7 r1 r1
10 r3 r3 r3
11 r5 r5 r5 r5
Solution:-
(i) Elimination left Recursion
SCC
CcC / d
(ii) Finding FIRST and FOLLOW:-
FIRST(S)= FIRST(C) = { c , d }
FOLLOW(S) = { $ }
FOLLOW(C) = FIRST(C) = { c,d , $ }
(iii) Numbering the Grammar:-
1. SCC
2. CcC
3. Cd
Augmented Grammar
S´S
SCC I´
CcC
Cd
Closure ( I´)
S´ .S,$
S .CC,$ I0
.
C cC,c/d
.
C d,c/d
GOTO (I0,S)
S´ S .,$ I1
GOTO (I0,C)
SC.C, $
C.cC, $ I2
C.d, $
GOTO (I0,c)
Cc.C, c/d
C. cC, c/d
GOTO (I2,C)
S CC .,$ I5
GOTO (I2,c)
Cc.C, $
C. cC, $ I6
C.d, $
GOTO (I2,d)
Cd ., $ I7
GOTO (I3,C)
C cC .,c/d I8
GOTO (I3,c)
Cc.C, c/d
C. cC, c/d I3
C.d, c/d
GOTO (I3,d)
Cd ., c/d I4
GOTO (I6,C)
C cC .,$ I9
GOTO (I6,c)
Cc.C, $
C. cC, $ I6
C.d, $
GOTO (I6,d)
Cd ., $ I7
Reduce:-
Cd., c/d (I4 )
ACTION(4,c/d) = (4,c ), (4,d) r3
SCC. , $ (I5 )
ACTION(5,$) = (5,$) r1
CcC., $ (I8 )
ACTION(9,$) = (9,$) r2
A compiler while translating a source program into a functionally equivalent object code
representation may first generate an intermediate representation.
Advantages of generating intermediate representation
1. Ease of conversion from the source program to the intermediate code
2. Ease with which subsequent processing can be performed from the intermediate code
INTERMEDIATE LANGUAGES:
There are three kinds of Intermediate representation. They are,
1. Syntax Trees
2. Postfix Notation
3. Three address code
1. Syntax Tree:-
A syntax tree depicts the natural hierarchical structure of a source program. A DAG
(Direct Acyclic Graph) gives the same information but in a more compact way because common
sub expressions are identified.
A syntax tree and dag for the assignment statement a:= b* -c + b* -c
assign
a +
Syntax Tree
* *
b uminus b uminus
c c
assign
a +
DAG
b uminus
op arg1 arg2
(0) uminus c
(1) * b (0)
(2) uminus c
(3) * b (2)
(4) + (1) (3)
(5) assign a (4)
Indirect Triples:-
Listing pointers to triples rather than listing the triples themselves are called indirect
triples.
Eg. Indirect Triple Representation of a := b * -c + b * -c
Basic Blocks:
A block of code means a block of intermediate code with no jumps in except at the
beginning and no jumps out except at the end.
A basic block is a sequence of consecutive statements in which flow of control enters at
the beginning and leaves at the end without halt or possibility of branching except at the end.
A list of three address statements performing the computation of above program is, (for a
machine with four bytes per word)
Block 1.
1.PROD:=0
2. I:=1
Block 2.
3 t1:=4*I
4 t2:=A[t1]
5 t3:=4*I
6 t4:=B[t3]
7 t5:=t2*t4
8 t6:=PROD+t5
9 PROD:=t6
10 t7:=I+1
11 I:=t7
12. If I<=20 GOTO (3)