Compiler Design and Construction Note
(CSC 352)
By
Bhupendra Singh Saud
for
B. Sc. Computer Science & Information Technology
Unit 1:
1.1 Introduction to compiling: Compilers, Analysis of source program, the phases of
compiler, compiler-construction tools. 4 hrs
1.2 A Simple One-Pass Compiler: Syntax Definition, Syntax directed translation,
Parsing, Translator for simple expression, Symbol Table, Abstract Stack Machines.
5 hrs
Unit 2:
2.1 Lexical Analysis: The role of the lexical analyzer, Input buffering, Specification of
tokens, Recognition of tokens, Finite Automata, Conversion of a regular expression to an
NFA, NFA to DFA, State minimization in DFA, Flex/Lex introduction.
8 Hrs
2.2 Syntax Analysis: The role of the parser, Context-free grammars, Writing a grammar,
Top-down parsing, Bottom-up parsing, error recovery mechanisms, LL grammars,
Bottom-up parsing: handles, shift-reduce parsing, LR parsers (SLR, LALR, LR), LR/LALR
grammars, parser generators. 10 Hrs
Unit 3:
3.1 Syntax Directed Translation: Syntax-directed definition, Syntax tree and its
construction, Evaluation of S-attributed definitions, L-attributed, Top-down translation,
Recursive evaluators. 5 Hrs
3.2 Type Checking: Type systems, Specification of a simple type checker, Type
conversions, equivalence of type expressions, Type checking with Yacc/Bison. 3 Hrs
Unit 4:
4.2 Code Generation and Optimization: Issues in the design of a code generator, the target
machine, Run-time storage management, Basic blocks and flow graphs, next-use
information, a simple code generator, Peephole optimization, generating code from
DAGs. 6 Hrs
Prerequisites
* Introduction to Automata and Formal Languages
* Introduction to Analysis of Algorithms and Data Structures
* Working knowledge of C/C++
* Introduction to the Principles of Programming Languages, Operating Systems &
Computer Architecture is a plus
Resources
Text Book: Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles,
Techniques, and Tools, Addison-Wesley, 1986
What is a Compiler?
A compiler is a translator program that takes as its input a program
written in one particular programming language (the source language) and produces as
output an equivalent program in another language (the object or target language).
Phases of a Compiler
During the compilation process, a program passes through various steps or phases.
Compilation also involves the symbol table and the error handler. A compiler has two
major parts: Analysis and Synthesis.
Analysis part
In the analysis part, an intermediate representation is created from the given source
program. This part is also called the front end of the compiler. It mainly consists of the
following four phases:
Lexical Analysis
Syntax Analysis
Semantic Analysis and
Intermediate code generation
Synthesis part
In the synthesis part, the equivalent target program is created from the intermediate
representation produced by the analysis part. This part is also called the back end
of the compiler. It mainly consists of the following two phases:
Code Optimization and
Final Code Generation
Fig: Phases of a compiler (lexical analyzer → syntax analyzer → intermediate code
generator → code optimizer → code generator)
Example:
while (i > 0)
    i = i - 2;
Token        Description
while        while keyword
(            left parenthesis
i            identifier
>            greater-than symbol
0            integer constant
)            right parenthesis
i            identifier
=            assignment operator
i            identifier
-            minus operator
2            integer constant
;            semicolon
3. Semantic Analyzer
The next phase of the compiler is the semantic analyzer, and it plays a very
important role: it checks the expression against the semantic rules of the source
language. The output of the previous phase, i.e. a syntactically correct expression, is
the input to the semantic analyzer.
Example: A = b + c * d / f
Solution: the semantic analyzer annotates the syntax tree of the expression with types,
inserting type conversions where needed.
Code Optimization
Optimization is the process of transforming a piece of code to make it more efficient
(either in terms of time or space) without changing its output or side effects. The
process of removing unnecessary parts of the code is known as code optimization.
Code optimization decreases the time and space requirements of the program.
Common optimizations include:
Detection of redundant function calls
Detection of loop invariants
Common sub-expression elimination
Dead code detection and elimination
For example, the earlier statement A = b + c * d / f might be compiled to:
MOVE c, R1
MULT d, R1
DIV f, R1
ADD b, R1
MOVE R1, A
Symbol Tables
Symbol tables are data structures that are used by compilers to hold information about
source-program constructs. The information is collected incrementally by the analysis
phase of a compiler and used by the synthesis phases to generate the target code.
Entries in the symbol table contain information about an identifier such as its type, its
position in storage, and any other relevant information. Symbol tables typically need to
support multiple declarations of the same identifier within a program.
The lexical analyzer can create a symbol table entry and return a token, say id, to the
parser along with a pointer to the lexeme. The parser can then decide whether to use a
previously created symbol table entry or create a new one for the identifier.
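As a minimal sketch (not a full compiler implementation), a symbol table entry and a
linear lookup might be written in C as follows; the field names and table size are
illustrative assumptions:

#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 256

/* One entry per identifier: name, type, and relative storage offset. */
struct symbol {
    char name[32];
    char type[16];   /* e.g. "int", "real" */
    int  offset;     /* position in the data area */
};

static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Return the index of name in the table, or -1 if absent. */
int lookup(const char *name) {
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* Insert a new entry and return its index. */
int insert(const char *name, const char *type, int offset) {
    strcpy(table[n_symbols].name, name);
    strcpy(table[n_symbols].type, type);
    table[n_symbols].offset = offset;
    return n_symbols++;
}

Real compilers typically use a hash table and scope chains instead of a flat array, but
the interface (lookup/insert) stays the same.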
Error Handling
* In the lexical analysis phase, errors can occur due to misspelled tokens, unrecognized
characters, etc. These are mostly typing errors.
* In syntax analysis phase, errors can occur due to the syntactic violation of the
language.
* In the intermediate code generation phase, errors can occur due to incompatible
operand types for an operator.
* In code optimization phase, errors can occur during the control flow analysis due to
some unreachable statements.
* In the code generation phase, errors can occur due to incompatibility with the
computer architecture during the generation of machine code. For example, a
constant created by the compiler may be too large to fit in the word of the target
machine.
* In symbol table, errors can occur during the bookkeeping routine, due to the
multiple declaration of an identifier with ambiguous attributes.
Lexical Analysis
Lexical analysis is the first phase of a compiler; the lexical analyzer acts as an
interface between the source program and the rest of the phases of the compiler. It
reads the input characters of the source program, groups them into lexemes, and
produces a token for each lexeme. The tokens are then sent to the parser for syntax
analysis. Normally a lexical analyzer doesn't return a list of tokens all at once; it
returns one token at a time, whenever the parser asks for the next one.
Example:
newval := oldval + 12
tokens:  newval    identifier
         :=        assignment operator
         oldval    identifier
         +         add operator
         12        a number
Put information about identifiers into the symbol table.
Regular expressions are used to describe tokens (lexical constructs).
A (Deterministic) Finite State Automaton (DFA) can be used in the implementation of a
lexical analyzer.
Patterns are the rules that describe whether a given lexeme belongs to a token or not.
Regular expressions are widely used to specify patterns.
Input Buffering
* Reading character by character from secondary storage is a slow and time-consuming
process. It is often necessary to look ahead several characters beyond the lexeme for a
pattern before a match can be announced.
* One technique is to read characters from the source program and, if the pattern is not
matched, push the lookahead characters back to the source program.
* This technique is time consuming.
* Buffering techniques eliminate this problem and increase efficiency.
Many times, a scanner has to look ahead several characters from the current character in
order to recognize the token.
For example, int is a keyword in C, while the term inp may be a variable name. When the
character 'i' is encountered, the scanner cannot decide whether it is a keyword or a
variable name until it reads two more characters.
In order to efficiently move back and forth in the input stream, input buffering is used.
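A minimal sketch of the classic buffer-pair scheme in C (the buffer size and names are
illustrative assumptions; a production scanner would also use sentinel characters and
handle end-of-file):

#include <stdio.h>

#define BUF_SIZE 4096

/* Two halves of one buffer; when the forward pointer crosses the end of
   one half, the other half is refilled from the input file. */
static char buffer[2 * BUF_SIZE];
static int lexeme_begin = 0;  /* start of the current lexeme */
static int forward = 0;       /* scans ahead to find the end of the lexeme */

/* Advance the lookahead pointer, reloading a half-buffer when needed. */
int next_char(FILE *src) {
    if (forward == BUF_SIZE) {              /* end of first half: refill second */
        fread(buffer + BUF_SIZE, 1, BUF_SIZE, src);
    } else if (forward == 2 * BUF_SIZE) {   /* end of second half: wrap around */
        fread(buffer, 1, BUF_SIZE, src);
        forward = 0;
    }
    return buffer[forward++];
}

Because the two halves alternate, the scanner can back up the forward pointer to
lexeme_begin without re-reading from disk, as long as a lexeme does not span more than
one buffer.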
Recognition of tokens
To recognize tokens, a lexical analyzer performs the following steps:
a. The lexical analyzer stores the input in an input buffer.
b. Tokens are read from the input buffer and regular expressions are built for the
corresponding tokens.
c. From these regular expressions a finite automaton is built. Usually an NFA is built.
d. For each state of the NFA, a function is designed, and each input along the
transition edges corresponds to an input parameter of these functions.
e. The set of such functions ultimately creates the lexical analyzer program.
Regular Expressions
Regular expressions are the algebraic expressions that are used to describe tokens of a
programming language.
Examples
Given the alphabet A = {0, 1}
1. 1(1+0)*0 denotes the language of all strings that begin with a '1' and end with a '0'.
2. (1+0)*00 denotes the language of all strings that end with 00 (binary numbers that
are multiples of 4).
3. (01)* + (10)* denotes the set of all strings of alternating 1s and 0s.
4. (0* 1 0* 1 0* 1 0*) denotes the strings having exactly three 1's.
Regular Definitions
To write regular expression for some languages can be difficult, because their regular
expressions can be quite complex. In those cases, we may use regular definitions.
The regular definition is a sequence of definitions of the form,
d1 → r1
d2 → r2
…………….
dn → rn
Where di is a distinct name and ri is a regular expression over symbols in Σ∪ {d1, d2...
di-1}
Where, Σ = Basic symbol and
{d1, d2... di-1} = previously defined names.
Write regular definitions for specifying an integer array declaration in language like C
Soln: letter → A | B | C |………| Z | a | b | c |………| z
underscore → '_'
digit → 0 | 1 |…………….| 9
id → (letter | underscore) (letter | underscore | digit)*
array → id ( '[' digit+ ']' )+
ε-NFA
An NFA in which a transition may be made without consuming any input symbol is called an
ε-NFA. We use ε-NFAs here because regular expressions are easily convertible to ε-NFAs.
Using rules 1 and 2 we construct NFAs for each basic symbol in the expression; we then
combine these basic NFAs using rule 3 to obtain the NFA for the entire expression.
Exercise
Convert the following regular expression first into NFA and then into DFA
1. 0+ (1+0)*00
2. zero → 0; one → 1; bit → zero + one; bits → bit*
3. aa*+bb*
4. (a+b)*abb
Conversion steps:
1. Augment the given regular expression by concatenating it with the special symbol #,
i.e. r → (r)#
2. Create the syntax tree for this augmented regular expression
In this syntax tree, all alphabet symbols (plus # and the empty string) in the
augmented regular expression will be on the leaves, and all inner nodes will be
the operators in that augmented regular expression.
3. Then each alphabet symbol (plus #) will be numbered (position numbers)
4. Traverse the tree to construct functions nullable, firstpos, lastpos, and followpos
5. Finally construct the DFA from the followpos
After we calculate follow positions, we are ready to create DFA for the regular
expression.
Now,
S1 = firstpos(root) = {1, 2}
mark S1
Start state: S1
Accepting states: {S3}
Fig: DFA for the above RE
Example 2:
Flex: An introduction
Flex is a tool for generating scanners. A scanner is a program which recognizes lexical
patterns in text. The flex program reads the given input files, or its standard input if no
file names are given, for a description of a scanner to generate. The description is in the
form of pairs of regular expressions and C code, called rules. flex generates as output a
C source file, 'lex.yy.c' by default, which defines a routine yylex(). This file can be
compiled and linked with the flex runtime library to produce an executable. When the
executable is run, it analyzes its input for occurrences of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
Flex specification:
A flex specification consists of three parts:
Regular definitions, C declarations in %{ %}
%%
Translation rules
%%
User-defined auxiliary procedures
Practice
• Get familiar with FLEX
1. Try sample*.lex
2. Command Sequence:
flex sample*.lex
gcc lex.yy.c -lfl
./a.out
Flex Example1
Example2
/*
* Description: Count the number of characters and the number of lines
* from standard input
* Usage:
(1) $ flex sample2.lex
* (2) $ gcc lex.yy.c -lfl
* (3) $ ./a.out
* stdin> whatever you like
* stdin> Ctrl-D
* Questions: Is it ok if we do not indent the first line?
* What will happen if we remove the second rule?
*/
int num_lines = 0, num_chars = 0;
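The rules and user-code sections of this scanner were omitted above; following the
standard character/line-counting example from the flex manual, they would be:

%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main()
{
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
    return 0;
}

Note that the declaration of num_lines and num_chars must be indented (or wrapped in
%{ %}) so that flex copies it verbatim into the generated C file, which answers the
first question in the comment above.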
=====================================================================
Syntax Analysis
A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given source
program.
Syntax analyzer is also called the parser. Its job is to analyze the source program based on the
definition of its syntax. It works in lock-step with the lexical analyzer and is responsible for
creating a parse-tree of the source code.
Ex: newval := oldval + 12
Fig: parse tree for the statement
Programming languages usually have recursive structures that can be defined by a context-free
grammar (CFG).
Right-most derivation:
If we always choose the right-most non-terminal in each derivation step, this derivation is called
as right-most derivation.
Eg:
Parse Trees
A parse tree is a graphical representation of a derivation in a CFG with the following
properties:
* The root is labeled by the start symbol.
* Each leaf is labeled by a terminal or by ε.
* Each interior node is labeled by a non-terminal.
* If an interior node labeled A has children X1, X2, …, Xn, then A → X1 X2 … Xn is a
production of the grammar.
Ambiguity of a grammar:
A grammar G is said to be ambiguous if there is a string w ∈ L(G) for which we can
construct more than one parse tree rooted at the start symbol.
Eg: let us consider a CFG:
E → E + E | E ∗ E | E − E | id
The parse trees for the string id + id ∗ id are as follows:
E ⇒ E + E ⇒ E + E ∗ E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id
Parsing
Given a stream of input tokens, parsing is the process of reducing them to the start
symbol of the grammar. Parsing can be either top-down or bottom-up.
Top-down parsing involves generating the string starting from the start symbol by
repeatedly applying production rules.
Bottom-up parsing involves repeatedly rewriting the input string until it is reduced to
the start symbol of the grammar.
Top-Down Parsing
The parse tree is created top to bottom.
Top-down parser
– Recursive-Descent Parsing
– Predictive Parsing
Recursive-Descent Parsing
• Backtracking is needed (If a choice of a production rule does not work, we backtrack to
try other alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient
It tries to find the left-most derivation
Example: Consider the grammar,
S → aBc
B → bc | b
Method: let input w = abc. Initially create the tree of the single node S. The leftmost
leaf a matches the first symbol of w, so advance the pointer to b and consider the next
leaf B. Then expand B using the first choice bc. There is a match for b and c, and we
advance to the leaf c of S, but there is no remaining input to match, so we report
failure, backtrack, and use the production B → b to parse B, which produces a match.
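A minimal recursive-descent sketch of this grammar in C, with explicit backtracking over
the alternatives of B (the function and variable names are illustrative assumptions):

#include <stdio.h>

static const char *input = "abc";
static int pos;

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

/* B -> bc | b : parse using the alternative selected by alt. */
static int B(int alt) {
    if (alt == 0) return match('b') && match('c');
    return match('b');
}

/* S -> aBc : if one choice for B fails, backtrack and try the next. */
static int S(void) {
    for (int alt = 0; alt < 2; alt++) {
        pos = 0;                       /* backtrack: restart the scan */
        if (match('a') && B(alt) && match('c') && input[pos] == '\0')
            return 1;
    }
    return 0;
}

int main(void) {
    printf("%s\n", S() ? "accepted" : "rejected");
    return 0;
}

On the input abc the first choice B → bc fails (no c remains for S), the parser
backtracks, and the second choice B → b succeeds, exactly as in the trace above.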
Left Recursion
A grammar is left-recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α.
Immediate left recursion of the form A → Aα | β can be eliminated by rewriting the
productions as:
A → βA'
A' → αA' | ε
In general, the productions A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn are replaced by:
A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε
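For example, applying this to the left-recursive expression grammar E → E + T | T gives:
E → T E'
E' → + T E' | ε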
Non-Immediate Left-Recursion
By just eliminating the immediate left-recursion, we may not get a grammar which is not left-
recursive.
Predictive Parsing
A predictive parser tries to predict which production to use, eliminating the chance of
backtracking and infinite looping.
When rewriting a non-terminal in a derivation step, a predictive parser can uniquely
choose a production rule by just looking at the current symbol in the input string.
Two variants:
– Recursive (recursive-descent parsing)
– Non-recursive (table-driven parsing)
To compute the LL(1) parsing table, we first need to compute the FIRST and FOLLOW
functions.
Compute FIRST
FIRST(α) is a set of the terminal symbols which occur as first symbols in strings derived from α
where α is any string of grammar symbols.
If α derives to ∈ , then ∈ is also in FIRST (α).
Compute FIRST algorithm:
1. If X is a terminal, then FIRST(X) = {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X → Y1 Y2 … Yk is a production, then add FIRST(Y1) − {ε} to FIRST(X); if ε is in
FIRST(Y1), also add FIRST(Y2) − {ε}, and so on. If ε is in FIRST(Yi) for all i, then add
ε to FIRST(X).
Compute FOLLOW:
FOLLOW (A) is the set of the terminals which occur immediately after (follow) the non-
terminal A in the strings derived from the starting symbol.
Algorithm
1. Place $ in FOLLOW(S), where S is the start symbol.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε,
then everything in FOLLOW(A) is in FOLLOW(B).
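For the usual expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε,
F → (E) | id (the grammar the table below appears to be built for), these sets work out
to:
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}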
         +            *            (          )          id         $
E                                E → TE'                E → TE'
E'    E' → +TE'                             E' → ε                 E' → ε
T                                T → FT'                T → FT'
T'    T' → ε      T' → *FT'                 T' → ε                 T' → ε
F                                F → (E)                F → id
Exercise:
Q. For the grammar,
S → [C]S | ε
C → {A}C | ε
A → A( ) | ε    Construct the predictive top-down parsing table (LL(1) parsing table).
Conflict in LL (1):
When a single symbol allows several choices of production for non-terminal N then we
say that there is a conflict on that symbol for that non-terminal.
Example: Show that given grammar is not LL(1).
S → aA │ bAc
A → c │ ε
Solution:
FIRST(S) = {a, b}    FIRST(A) = {c, ε}
FIRST(aA) = {a}    FIRST(bAc) = {b}    FIRST(c) = {c}    FIRST(ε) = {ε}
FOLLOW(S)={ $ }
FOLLOW(A)={$, c }
       a         b          c               $
S    S → aA    S → bAc
A                        A → c, A → ε    A → ε
Since the cell [A, c] contains two productions, the grammar is not LL(1).
Bottom-Up Parsing
Bottom-up parsing attempts to construct a parse tree for an input string starting from
the leaves (the bottom) and working up towards the root (the top). The most general form
is LR(k) parsing.
Reduction:
The process of replacing a substring by a non-terminal in bottom-up parsing is called reduction.
It is a reverse process of production.
Eg: for the production S → aA,
replacing aA by S is a reduction.
Shift-Reduce Parsing
The process of reducing the given input string into the starting symbol is called shift-reduce
parsing.
A string is reduced, step by step, to the starting symbol.
Example:
Handle
A substring that matches the right side of a production, and whose reduction represents
one step of a rightmost derivation in reverse, is called a handle.
If the grammar is unambiguous, then every right-sentential form of the grammar has
exactly one handle.
Example 1: Let's take a grammar
Parser Actions:
1. Shift: The next input symbol is shifted onto the top of the stack.
2. Reduce: Replace the handle on the top of the stack by the non-terminal.
3. Accept: Successful completion of parsing.
4. Error: The parser discovers a syntax error and calls an error recovery routine.
Example 1: Use the following grammar
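As an illustration (assuming the simple grammar E → E + T | T, T → id, which is not
necessarily the grammar originally intended here), a shift-reduce parse of the input
id + id proceeds as follows:

Stack        Input         Action
$            id + id $     shift
$ id         + id $        reduce by T → id
$ T          + id $        reduce by E → T
$ E          + id $        shift
$ E +        id $          shift
$ E + id     $             reduce by T → id
$ E + T      $             reduce by E → E + T
$ E          $             accept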
shift/reduce conflict:
Here, the parser is not able to decide whether to shift or to reduce.
Example:
A → ab | abcd
If the stack contains $ab and
the input buffer contains cd$, the parser cannot decide whether to reduce $ab to $A or
to shift two more symbols before reducing.
reduce/reduce conflict:
Here, the parser cannot decide which sentential form to use for reduction.
For example
A → bc
B → abc
If the stack contains $abc, the parser cannot decide whether to reduce it to $aA or to
$B.
LR Parsers
– LR parsing is the most general non-backtracking, efficient and most powerful form of
shift-reduce parsing.
– LL(1) grammars ⊂ LR(1) grammars
– An LR parser can detect a syntactic error as soon as possible on a left-to-right scan
of the input.
Augmented grammar:
If G is a grammar with start symbol S, then the augmented grammar G' of G is G with a
new start symbol S' and the production S' → S.
Eg: the grammar,
E→E+T|T
T→T*F|F
F → ( E ) | id
Its augmented grammar is;
E' → E
E→E+T|T
T→T*F|F
F → ( E ) | id
Definition:
If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items
constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production, then add the item B → .γ to
closure(I), if it is not already there. Apply this rule until no more new items can be
added.
Homework:
1. Construct the LR(0) set of items for the following grammar (which produces simple
postfix expressions).
X→SS+|SS*|a
Don't forget to augment the grammar.
2. Draw the DFA for this item set.
Step 1: Construct the canonical LR(0) collection for the grammar C → AB, A → a, B → a
(augmented with C' → C) as,
State I0: closure(C' → .C)
    C' → .C
    C → .AB
    A → .a
State I1: closure(goto(I0, C))
    C' → C.
State I2: closure(goto(I0, A))
    C → A.B
    B → .a
Step 2: Construct the SLR parsing table, which contains both the action and goto parts.
Example: For the expression grammar with productions numbered as:
1. E→E+T
2. E→T
3. T→T*F
4. T→F
5. F→(E)
6. F → id
Solution:
State ACTION GOTO
id + * ( ) $ E T F
0 s5 s4 1 2 3
1 s6 acc
2 r2 s7 r2 r2
3 r4 r4 r4 r4
4 s5 s4 8 2 3
5 r6 r6 r6 r6
6 s5 s4 9 3
7 s5 s4 10
8 s6 s11
9 r1 s7 r1 r1
10 r3 r3 r3 r3
11 r5 r5 r5 r5
[2]. Construct the SLR parsing table for the following grammar
S‘ → S
S → aABe
A → Abc
A→b
B→d
LR(1) Grammars
SLR is simple but can handle only a small class of grammars.
LR(1) parsing uses lookahead to avoid unnecessary conflicts in the parsing table.
LR(1) item = LR(0) item + lookahead
Algorithm:
1. Construct the canonical collection of sets of LR(1) items for G'.
C = {I0... In}
2. Create the parsing action table as follows
• If [A → α.aβ, b] is in Ii and goto(Ii, a) = Ij, then action[i, a] = shift j.
• If [A → α., a] is in Ii, then action[i, a] = reduce A → α, where A ≠ S'.
• If [S' → S., $] is in Ii, then action[i, $] = accept.
• If any conflicting actions are generated by these rules, the grammar is not LR(1).
Example 1: Consider the augmented grammar S' → S, S → CC, C → cC | d. Among its LR(1)
item sets:
I2: S → C.C, $
    C → .cC, $
    C → .d, $
…………………………………….up to I9
Example 2:
Construct LR(1) parsing table for the augmented grammar,
1. S‘ → S
2. S → L = R
3. S → R
4. L → * R
5. L → id
6. R → L
Step 1: At first find the canonical collection of LR(1) items of the given augmented grammar as,
State I0: closure(S' → .S, $)
    S' → .S, $
    S → .L = R, $
    S → .R, $
    L → .* R, =/$
    L → .id, =/$
    R → .L, $
State I1: closure(goto(I0, S))
    S' → S., $
State I2: closure(goto(I0, L))
    S → L. = R, $
    R → L., $
State I3: closure(goto(I0, R))
    S → R., $
State I4: closure(goto(I0, *))
    L → * .R, =/$
    R → .L, =/$
    L → .* R, =/$
    L → .id, =/$
Kernel items: the initial item S' → .S together with all items whose dots are not at the
left end; for example, L → id. , = is a kernel item.
Non-kernel items:
The items of a grammar which have their dots at the left end (other than the initial
item) are non-kernel items.
Exercise:
Take any grammar, find the canonical collection of LR(0) items, and finally list the
kernel and non-kernel items.
Bison Specification
• A bison specification consists of four parts:
%{
C declarations
%}
Bison declarations
%%
Grammar rules
%%
Additional C codes
Bison Declaration
Tokens that are single characters can be used directly within productions, e.g. '+',
'-', '*'.
Named tokens must be declared first in the declaration part using
%token TOKEN_NAME (upper-case letters by convention)
e.g. %token INTEGER IDENTIFIER
%token NUM 100
– %left, %right or %nonassoc can be used instead for %token to specify the precedence &
associativity (precedence declaration). All the tokens declared in a single precedence declaration
have equal precedence and nest together according to their associativity.
– %union declares the collection of possible data types for semantic values
– %type <type> non-terminal declares the type of the semantic values of a non-terminal
– %start non-terminal specifies the grammar start symbol (by default the start symbol is
the left-hand side of the first grammar rule)
Grammar Rules
In order for Bison to parse a grammar, it must be described by a Context-Free Grammar
that is LALR (1).
A non-terminal in the formal grammar is represented in Bison input as an identifier, like
an identifier in C. By convention, it is in lower case, such as expr, declaration.
A Bison grammar rule has the following general form:
o RESULT: COMPONENTS...;
where, RESULT is the non-terminal symbol that this rule describes and COMPONENTS
are various terminal and non-terminal symbols that are put together by this rule.
For example, exp: exp '+' exp; says that two groupings of type 'exp', with a '+' token
in between, can be combined into a larger grouping of type 'exp'.
Multiple rules for the same RESULT can be written separately or can be joined with the
vertical-bar character '|' as follows:
RESULT: RULE1-COMPONENTS...
| RULE2-COMPONENTS...
............................................;
If COMPONENTS in a rule is empty, it means that RESULT can match the empty string.
For example, here is how to define a comma-separated sequence of zero or more ‗exp‘
groupings:
expseq: /* empty */
      | expseq1
      ;
expseq1: exp
      | expseq1 ',' exp
      ;
It is customary to write a comment '/* empty */' in each rule with no components.
In bison, the default data type for all semantic values is int, i.e. the parser stack is
implemented as an integer array. It can be overridden by redefining the macro YYSTYPE
with a line of the form
#define YYSTYPE double in the C declarations section of the bison grammar file.
To use multiple types on the parser stack, a "union" has to be declared that enumerates
all possible types used in the grammar.
Example:
%union{
double val;
char *str;
}
This says that there are two alternative types: double and char*.
Tokens are given specific types by a declaration of the form:
%token <val> NUM
If yylex() is written in flex, then bison should first be called with the -d option: bison -d
grammar.y
This creates a file grammar.tab.h that contains #defines of all the %token declarations
in the grammar. The sequence of invocation is hence:
grammar. The sequence of invocation is hence:
bison -d grammar.y
flex grammar.flex
gcc -o grammar grammar.tab.c lex.yy.c –lfl
Practice
• Get familiar with Bison: Write a desk calculator which performs '+' and '*' on unsigned integers
1. Create a Directory: "mkdir calc"
2. Save the five files (calc.lex, calc.y, Makefile, main.cc, and heading.h) to directory
"calc"
3. Command Sequence: "make"; "./calc"
Programming Example
/* Mini Calculator */
/* calc.lex */
%{
#include "heading.h"
#include "tok.h"
int yyerror(char *s);
int yylineno = 1;
%}
digit      [0-9]
int_const  {digit}+
%%
{int_const}  { yylval.int_val = atoi(yytext); return INTEGER_LITERAL; }
"+"          { yylval.op_val = new std::string(yytext); return PLUS; }
"*"          { yylval.op_val = new std::string(yytext); return MULT; }
[ \t]*       { /* skip blanks and tabs */ }
[\n]         { yylineno++; }
%%
---------------------------------------------------------------------------------------------------------------------
/* Mini Calculator */
/* calc.y */
%{
#include "heading.h"
int yyerror(char *s);
int yylex(void);
%}
%union{
int int_val;
string* op_val;
}
%start input
%token <int_val> INTEGER_LITERAL
%type <int_val> exp
%left PLUS
%left MULT
%%
input: /* empty */
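	/* The remaining rules were left incomplete in these notes; a minimal
	   completion consistent with the tokens above might be: */
	| input exp	{ cout << "result: " << $2 << endl; }
	;

exp:	INTEGER_LITERAL	{ $$ = $1; }
	| exp PLUS exp	{ $$ = $1 + $3; }
	| exp MULT exp	{ $$ = $1 * $3; }
	;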
%%
int yyerror(string s)
{
extern int yylineno; // defined and maintained in lex.c
extern char *yytext; // defined and maintained in lex.c
cerr << "ERROR: " << s << " at symbol \"" << yytext;
cerr << "\" on line " << yylineno << endl;
exit(1);
}
Syntax-directed definition:
Syntax-Directed Definitions are high level specifications for translations. They hide many
implementation details and free the user from having to explicitly specify the order in which
translation takes place.
A syntax-directed definition is a generalization of a context-free grammar in which each
grammar symbol is associated with a set of attributes. This set of attributes for a
grammar symbol is partitioned into two subsets: the synthesized and inherited attributes
of that grammar symbol.
In brief,
A syntax-directed definition is a grammar together with semantic rules associated with the
productions. These rules are used to compute attribute values.
Mathematically,
each grammar production A → α has associated with it a set of semantic rules of the form
b = f(c1, c2, …, ck), where f is a function and b and c1, c2, …, ck are attributes of A
and of the grammar symbols of the production.
The attributes of a node that are derived from its children nodes are called synthesized attributes.
Terminals do not have inherited attributes. A non-terminal 'A' can have both inherited
and synthesized attributes. The difference lies in how they are computed by the rules
associated with a production at a node N of the parse tree.
Example:
Consider the left-recursive grammar T → T * F | F, F → num. After eliminating left
recursion it becomes T → F T', T' → * F T' | ε, F → num, with the following semantic
rules:

Production       Semantic Rules                    Type
T → F T'         T'.lval = F.val                   Inherited
                 T.val = T'.tval                   Synthesized
T' → * F T'1     T'1.lval = T'.lval * F.val        Inherited
                 T'.tval = T'1.tval                Synthesized
T' → ε           T'.tval = T'.lval                 Synthesized
F → num          F.val = num.lexval                Synthesized
Example 2:
S-Attributed Definitions
A syntax-directed definition that uses synthesized attributes exclusively is called an S-
attributed definition (or S attributed grammar).
A parse tree of an S-attributed definition is annotated by evaluating the semantic rules for
the attribute at each node in bottom-up manner.
Yacc/Bison only support S-attributed definitions.
L-Attributed Definitions
A syntax-directed definition whose attributes can always be evaluated in a single
left-to-right (depth-first) traversal of the parse tree is called an L-attributed
definition.
L-attributed definitions can be evaluated using a depth-first evaluation order.
Mathematically,
A syntax-directed definition is L-attributed if each inherited attribute of Xj,
1 ≤ j ≤ n, on the right side of A → X1 X2 … Xn depends only on:
1. the attributes of the symbols X1, X2, …, Xj-1 to the left of Xj, and
2. the inherited attributes of A.
Every S-attributed definition is L-attributed; the restrictions only apply to the inherited attributes
(not to synthesized attributes).
Example:
------------------------------------------------------------------------------------------------------------------
Q:- Define the syntax directed definitions with an example. How definitions are different from
translation schemas?
-------------------------------------------------------------------------------------------------------------------
In a syntax-directed definition each grammar symbol is associated with semantic rules,
while in a translation scheme we use semantic actions, embedded at specific positions
within the productions, instead of semantic rules.
Example: A simple translation scheme that converts infix expressions into corresponding
postfix expressions.
The semantic actions for each production are shown in the scheme below.
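A standard scheme of this kind (following the usual textbook formulation) is:

E → T R
R → + T { print('+') } R
R → − T { print('−') } R
R → ε
T → num { print(num.val) }

For the input 9 − 5 + 2 this prints 9 5 − 2 +, the postfix form: each number is printed
as soon as it is parsed, and each operator is printed only after both of its operands.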
Evaluating attributes
Example: evaluation of the string XYY
Fig: annotated parse tree showing the inherited attributes (R.in = 16, 8, 5, …) and the
synthesized attribute E.syn = 16 for an example expression
Type Checking
A compiler must check that the source program follows both the syntactic and semantic
conventions of the source language. Type checking is the process of checking the data
types of the operands and operators in a program.
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of type, and the rules for assigning types to
language constructs.
Type expressions:
The type of a language construct is denoted by a type expression.
A type expression can be:
* A basic type
• a primitive data type such as integer, real, char, boolean, …
• type-error: signals an error during type checking
• void: no type
* A type name
• a name can be used to denote a type expression.
* A type constructor applies to other type expressions.
• arrays: If T is a type expression, then array(I, T) is a type expression, where I
denotes an index range. Ex: array(0..99, int)
• products: If T1 and T2 are type expressions, then their Cartesian product
T1 × T2 is a type expression. Ex: int × int
• pointers: If T is a type expression, then pointer(T) is a type expression. Ex:
pointer(int)
• functions: We may treat functions in a programming language as
mappings from a domain type D to a range type R. So the type of a
function can be denoted by the type expression D → R, where D and R are type
expressions. Ex: int → int represents the type of a function which takes an
int value as a parameter, and whose return type is also int.
* Informal type system rules, for example: "if both operands of addition are of type
integer, then the result is of type integer"
* A type checker implements a type system
E → id { E.type = lookup(id.entry) }
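A sketch of further rules for a simple type checker, written in the same style (these
follow the usual textbook formulation and are illustrative rather than a complete
specification):

E → num          { E.type = integer }
E → E1 + E2      { E.type = (E1.type == integer and E2.type == integer)
                             ? integer : type-error }
E → E1 [ E2 ]    { E.type = (E2.type == integer and E1.type == array(s, t))
                             ? t : type-error }
E → E1 ^         { E.type = (E1.type == pointer(t)) ? t : type-error }

Each rule synthesizes the type of an expression from the types of its sub-expressions,
producing type-error as soon as any operand has an unexpected type.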
………………………………………………………………………………………………………………..
Unit -4
Intermediate Code Generation
The front end translates the source program into an intermediate representation from
which the back end generates the target code. Intermediate codes are machine-independent
codes, but they are close to machine instructions.
Intermediate Representations
There are three kinds of intermediate representations:
1. Graphical representations (e.g. Syntax tree or Dag)
2. Postfix notation: operations on values stored on operand stack (similar to JVM byte code)
3. Three-address code: (e.g. triples and quads) Sequence of statement of the form x = y op z
Syntax tree:
A syntax tree is a graphical representation of the given source program; it can be seen
as a condensed variant of the parse tree. A tree in which each leaf represents an
operand and each interior node represents an operator is called a syntax tree.
Fig: syntax tree for an expression over the operands a, b, c, d
A DAG for an expression identifies the common sub expressions in the expression. It is similar
to syntax tree, only difference is that a node in a DAG representing a common sub expression
has more than one parent, but in syntax tree the common sub expression would be represented
as a duplicate sub tree.
Fig: DAG for the expression a + a * (b - c) + (b - c) * d; the common sub-expression
(b - c) has two parents
Postfix notation
In postfix (reverse Polish) notation, each operator appears after its operands; for
example, (a + b) * c is written a b + *.
Three-address code
Three-address code uses at most three addresses per instruction: two for the operands
and one for the result. Each instruction in three-address code can be described as a
4-tuple: (operator, operand1, operand2, result).
A quadruple (Three address code) is of the form:
x = y op z
Where x, y and z are names, constants or compiler-generated temporaries and op is any
operator.
We use the term "three-address code" because each statement usually contains three
addresses (two for the operands, one for the result). Thus a source-language expression
like x + y * z might be translated into the sequence
t1 = y * z
t2 = x + t1
Example 1: Three address code for the expression (B + A) * (Y - (B + A)):
t1 = B + A
t2 = Y - t1
t3 = t1 * t2
Example 2: Three address code for expression:
i = 2 * n + k
while (i) do
    i = i - k
t1 = 2
t2 = t1 * n
t3 = t2 + k
i = t3
L1: if i = 0 goto L2
t4 = i - k
i = t4
goto L1
L2: ………………..
2. Boolean Expressions
Boolean expressions are used to compute logical values. They are typically used as
conditional expressions in statements that alter the flow of control, such as if-then,
if-then-else or while-do statements.
Control-Flow Translation of Boolean Expressions
Production               Semantic Rules

B → B1 || B2             B1.true = B.true
                         B1.false = newlabel()
                         B2.true = B.true
                         B2.false = B.false
                         B.code = B1.code || label(B1.false) || B2.code

B → B1 && B2             B1.true = newlabel()
                         B1.false = B.false
                         B2.true = B.true
                         B2.false = B.false
                         B.code = B1.code || label(B1.true) || B2.code

B → ! B1                 B1.true = B.false
                         B1.false = B.true
                         B.code = B1.code

S → if ( B ) S1          B.true = newlabel()
                         B.false = S.next
                         S1.next = S.next
                         S.code = B.code || label(B.true) || S1.code

S → if ( B ) S1 else S2  B.true = newlabel()
                         B.false = newlabel()
                         S1.next = S.next
                         S2.next = S.next
                         S.code = B.code || label(B.true) || S1.code
                                  || gen(goto S.next) || label(B.false) || S2.code

S → while ( B ) S1       begin = newlabel()
                         B.true = newlabel()
                         B.false = S.next
                         S1.next = begin
                         S.code = label(begin) || B.code || label(B.true) || S1.code
                                  || gen(goto begin)
Fig: code layout for if-then
3. Switch Statements
Example: Convert the following switch statement into three address code:
switch (i + j)
{
case 1: x = y + z;
…
}
4. Addressing Array Elements
For a one-dimensional array, the address of A[i] is base + (i − low) * w, which can be
rewritten as i * w + C, where C = baseA − low * w can be computed at compile time, i.e.
A[i] = i * w + C
Example: Let A be a 10 X 20 array, there are 4 bytes per word, assume low1=low2=1.
Solution: Let X=A[Y, Z]
Now using formula for two dimensional array as,
((i1 * n2) + i2) * w + baseA - ((low1 * n2) + low2) * w
= ((Y * 20) + Z) * 4 + baseA - ((1 * 20) + 1) * 4
= ((Y * 20) + Z) * 4 + baseA – 84
We can convert the above expression into three-address code as below:
T1 = Y * 20
T1 = T1 + Z
T2 = T1 * 4
T3 = baseA - 84
X = T3[T2]
5. Declarative statements
We can explain the declarative statement by using the following example,
S→D {offset=0}
D→ id : T {enter-to-symbol-table(id.name, T.type, offset);
(offset= offset+T.width)}
T→ integer {T.type=Integer; T.width=4}
T→ real {T.type=real; T.width=8}
T→array [num] of T1 {T.type=array (num.val, T1.type)
T.width=num.val * T1.width}
T→ ↑T1 { T.type=pointer(T1.type); T.width=4}
* Initially, offset is set to zero. As each new name is seen, that name is entered in
the symbol table with the offset equal to the current value of offset, and offset is
incremented by the width of the data object denoted by that name.
* The procedure enter-to-symbol-table (name, type, offset) creates a symbol table entry for
name, gives it type and relative address offset in its data area.
* Integers have width 4 and reals have width 8. The width of an array is obtained by
multiplying the width of each element by the number of elements in the array.
* The width of each pointer is assumed to be 4.
6. Procedure Calls
The procedure is such an important and frequently used programming construct that it is
imperative for a compiler to generate good code for procedure calls and returns.
Consider a grammar for a simple procedure call statement:
S → call id ( Elist )
Elist → Elist , E
Elist → E
When a procedure call occurs, space must be allocated for the activation record of the
called procedure. The arguments of the called procedure must be evaluated and made
available to the called procedure in a known place. The return address is usually the
location of the instruction that follows the call in the calling procedure. Finally, a
jump to the beginning of the code for the called procedure must be generated.
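For example, a call p(x + y, z) is commonly translated into three-address code of the
following general form (the exact instruction names vary between textbooks):

t1 = x + y
param t1
param z
call p, 2

Each param instruction makes one evaluated argument available to the callee, and the
call instruction records the number of parameters and transfers control to p.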
………………………………………………………………………………………………………………..
Unit -4 [Chapter-2]
Code Generation and optimization
The process of transforming intermediate code plus tables into final machine (or
assembly) code is known as code generation. The process of eliminating unnecessary and
inefficient code, such as dead code, code duplication, etc., from the intermediate
representation of the source code is known as code optimization. Code generation and
optimization form the back end of the compiler.
Source Program → Front end → Intermediate code → Code Optimizer → Code Generator →
Target Program
The advantage of producing target code in absolute machine form is that it can be placed
directly at a fixed memory location and then executed immediately. The benefit of such
target code is that small programs can be compiled and run quickly.
4. Instruction Selection
Example: For the three-address code sequence
t := a * b
t := t + a
t := t / d
two possible target-code sequences are:
Sequence 1:
MOV a, R1
MUL b, R1
ADD a, R1
DIV d, R1
MOV R1, t
Sequence 2:
MOV a, R0
MOV R0, R1
MUL b, R1
ADD R0, R1
DIV d, R1
MOV R1, t
Choosing between such alternatives (e.g. adding from memory versus keeping a in a
register) is the instruction-selection problem.
6. Evaluation Order
When instructions are independent, their evaluation order can be changed. Sometimes
better code results if the three-address code is reordered.
Example: for the expression a + b - (c + d) * e, the sequence
t1 := a + b
t2 := c + d
t3 := e * t2
t4 := t1 - t3
generates the code:
MOV a, R0
ADD b, R0
MOV R0, t1
MOV c, R1
ADD d, R1
MOV e, R0
MUL R1, R0
MOV t1, R1
SUB R0, R1
MOV R1, t4
After reordering the statements as
t2 := c + d
t3 := e * t2
t1 := a + b
t4 := t1 - t3
the code becomes two instructions shorter, because t1 no longer has to be spilled to
memory and reloaded:
MOV c, R0
ADD d, R0
MOV e, R1
MUL R0, R1
MOV a, R0
ADD b, R0
SUB R1, R0
MOV R0, t4
Flow Graphs
A flow graph is a graphical depiction of a sequence of instructions with control-flow
edges. A flow graph can be defined at the intermediate code level or target code level.
The nodes of a flow graph are the basic blocks, and a directed edge connects a block to
each block that can immediately follow it in some execution.
Simply put, when the flow of control between the basic blocks of a given sequence of
instructions is made explicit, the resulting graph of blocks is known as a flow graph.
Example: The code
MOV 1, R0
ADD n, R0
MOV 2, R1
MUL R0, R1
JMP L2
L1: MUL 2, R0
SUB 1, R1
L2: MUL 3, R1
JMPNZ R1, L1
is partitioned into three basic blocks:
B1: MOV 1, R0; ADD n, R0; MOV 2, R1; MUL R0, R1; JMP L2
B2: L1: MUL 2, R0; SUB 1, R1
B3: L2: MUL 3, R1; JMPNZ R1, L1
with edges B1 → B3, B2 → B3, and B3 → B2 (plus B3 → exit on fall-through).
Example: the basic block
b = 0
t1 = a + b
t2 = c * t1
a = t2
is equivalent to the block
a = c * a
b = 0
Transformations on Basic Blocks
The process of optimizing a given sequence of code, to improve speed or reduce code
size, by transforming its basic blocks while preserving the meaning of the code is known
as transformation on basic blocks. There are mainly two types of code transformations:
Global transformations and
Local transformations
Global transformations are performed across basic blocks, whereas local transformations
are performed only within a single basic block. A local transformation is safe if the
transformed basic block is guaranteed to be equivalent to its original form. There are
two classes of local transformations:
1. Structure preserving transformations and
Common sub-expression elimination
Dead code elimination
Renaming of temporary variables
Interchange of two independent statements
2. Algebraic transformations
Dead code elimination (assuming b is not used later):
Before:                  After:
if a == 0 goto L1        if a == 0 goto L1
b = x + y                L1: P = q + r
L1: P = q + r
Interchange of statements
Independent statements can be reordered without affecting the value of the block, to
make optimal use of registers.
Before:              After:
t1 = b + c           t1 = b + c
t2 = a - t1          t3 = t1 * d
t3 = t1 * d          t2 = a - t1
d = t2 + t3          d = t2 + t3
Algebraic Transformations
Change arithmetic operations to transform blocks into algebraically equivalent forms.
Here we replace expensive expressions by cheaper ones.
t1 = a - a      becomes    t1 = 0
t2 = b + t1     becomes    t2 = b
t3 = t2 ** 2    becomes    t3 = t2 * t2
Code generator
The code generator converts the optimized intermediate representation of the code to the
final code, which is normally machine dependent.
Example: d = (a - b) + (a - c)
Three address code:        Final code:
t1 = a - b                 MOV a, R0
t2 = a - c                 SUB b, R0
t3 = t1 + t2               MOV a, R1
d = t3                     SUB c, R1
                           ADD R1, R0
                           MOV R0, d
Register Descriptors
A register descriptor keeps track of what is currently stored in a register at a particular point in
the code, e.g. a local variable, argument, global variable, etc.
MOV a, R0    "R0 contains a"
Example 1: At first convert the following expression into three address code sequence
then show the register as well as address descriptor contents.
Statement: d = (a - b) + (a – c) + (a – c)
Solution: The three address code sequence of above statement is:
t1=a –b
t2= a – c
t3=t1 + t2
d= t3+t2
Statement     Code generated    Register descriptor    Address descriptor
t1 = a - b    MOV a, R0         R0 contains t1         t1 in R0
              SUB b, R0
t2 = a - c    MOV a, R1         R0 contains t1         t1 in R0
              SUB c, R1         R1 contains t2         t2 in R1
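The remaining statements continue in the same way (a plausible completion, assuming the
code generator reuses R0 once t1 is no longer live):

t3 = t1 + t2  ADD R1, R0        R0 contains t3         t3 in R0
                                R1 contains t2         t2 in R1
d = t3 + t2   ADD R1, R0        R0 contains d          d in R0
              MOV R0, d                                d in R0 and memory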
Example 2: At first convert the following expression into three address code sequence
then show the register as well as address descriptor contents.
Statement: X = (a / b + c) – d * e
Solution: The three address code sequence of above statement is:
t1=a /b
t2= t1 + c
t3=d * e
X= t2 – t3
X = t2 - t3   SUB R1, R0        R0 contains X          X is in R0 and in
              MOV R0, X         R1 contains t3         memory
Code optimization
[Q]. Explain various optimization approaches (Peephole optimizations) with
suitable example for each of the approach.
Eliminating jumps over jumps: a jump whose target is an unconditional jump can be
rewritten to jump directly to the final target.
Before:                  After:
if a == 0 goto L1        if a == 0 goto L2
b = x + y                b = x + y
L1: goto L2              L1: goto L2
………                      ………
L2: P = q + r            L2: P = q + r
(If L1 is no longer referenced, the statement L1: goto L2 can then be removed as well.)
Strength reduction: replace expensive operations by cheaper ones.
a = x ^ 2    becomes    a = x * x
b = y / 8    becomes    b = y >> 3
Loop optimization
If a complex expression is used as the condition of a loop, it can be reduced into
smaller sub-expressions computed outside the loop, with only the final result used in
the loop's condition check. Such an approach is called loop optimization.
Before:                          After:
while (i <= limit * 2 - 10)      t1 = limit * 2
                                 t2 = t1 - 10
                                 while (i <= t2)
Loop invariant
Example 1: On the following piece of code,
max = 4099;
x=max*10;
while (i!=x*5)
{
b[i]=i * find(x);
c=max;
while (x<0)
max - -;
d = x * i;
}
Identify different kinds of optimizations possible and describe them. Rewrite the code after
making optimizations.
Solution:
max = 4099;
x=max*10;
while (i!=x*5) Loop optimization
{
b[i]=i * find(x); Redundant expression
c=max;
while (x<0)
max - -; dead code
d=x * i;
}
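A plausible rewrite after these optimizations (assuming find() has no side effects and
its value does not change inside the loop) is:

max = 4099;
x = max * 10;            /* constant: x = 40990 */
t1 = x * 5;              /* loop-invariant condition value, hoisted */
t2 = find(x);            /* invariant (redundant) call, computed once */
while (i != t1)
{
    b[i] = i * t2;
    c = max;
    /* inner "while (x < 0) max--;" removed: x > 0, so it is dead code */
    d = x * i;
}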