UNIT I
Overview of compilation
Syllabus:
▪ Introduction
▪ Phases of compilation
▪ Lexical Analysis
▪ Regular grammar and regular expression for common programming
language features
▪ Pass and phases of translation
▪ Interpretation
▪ LEX tool Introduction
Introduction
Translator: A program that converts source code written in one language (the
source language) into equivalent code in another language (the target language)
[Diagram: source language -> TRANSLATOR -> target language]
Types of Translators:
1. Compilers
2. Interpreters
3. Assemblers
INTRODUCTION
Compiler: A translator that reads a program written in a high-level source
language and converts it as a whole into equivalent machine (object) code,
which is executed later
Interpreter: A translator that reads the source program statement by statement,
translating and executing each one immediately, without producing an object
program
Assembler: A translator that converts an assembly language program into
machine code
Why Compilers?
Compilers let programmers work in a high-level language while still obtaining
efficient machine code for a particular machine
Properties of a Compiler
▪ Bug-free
▪ Correct machine code
▪ Run fast
▪ Portable
▪ Produce error messages
▪ Work well with debuggers
▪ Consistent optimization
Overview of Compilation
THE PHASES OF A COMPILER / STRUCTURE OF A COMPILER
Lexical Analyzer:
• The first phase of a compiler is called lexical analysis or scanning or linear
analysis.
• It reads the characters in the source program and groups them into a stream of
tokens.
• Each token represents a logically cohesive sequence of characters (e.g., an
identifier, a keyword, a punctuation character, or a multi-character operator).
• The character sequence forming a token is called the lexeme for the token.
Ex: position := initial + rate * 60
After lexical analysis:
id1 := id2 + id3 * 60
Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing or hierarchical
analysis.
• The hierarchical tree structure generated in this phase is called parse tree or
syntax tree.
Syntax Analysis
Parse tree: It is a rooted tree that represents the syntactic structure of a
string according to some Context Free Grammar
Syntax tree: It is a compressed representation of the parse tree in which
the operators appear as the interior nodes, and the operands of an
operator are the children of the node for that operator
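For the running example position := initial + rate * 60 (after lexical
analysis, id1 := id2 + id3 * 60), the syntax tree looks like:

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   60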
Semantic Analysis
• Checks the source program for semantic errors.
• It gathers type information for the subsequent code generation phase.
• It uses the hierarchical structure determined by the syntax analysis phase
• An important component of semantic analysis is type checking
Intermediate Code Generation
Translates abstract syntax trees into intermediate code.
• It should have two important properties:
1. It should be easy to produce
2. It should be easy to translate into the target program
• It can have variety of forms: One form is three-address code:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
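Three-address code is commonly stored as a list of quadruples. A minimal
sketch in C; the field names (op, arg1, arg2, result) are illustrative, not
prescribed by the text:

/* A minimal quadruple record for three-address code;
   field names are illustrative. */
struct quad {
    char op[12];     /* operator, e.g. "*", "+", "inttoreal" */
    char arg1[16];   /* first operand, e.g. "id3"            */
    char arg2[16];   /* second operand, empty for unary ops  */
    char result[16]; /* destination temporary, e.g. "temp2"  */
};

/* temp2 := id3 * temp1 stored as a quadruple: */
struct quad q = { "*", "id3", "temp1", "temp2" };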
Code Optimization
• Attempts to improve the intermediate code so that better target code will result
• It is in the form of shorter sequence
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation
• Produces the target language in a specific architecture
• The target program is normally a relocatable object file containing the machine
code
MOVF id3,R2
MULF #60.0,R2
MOVF id2,R1
ADDF R2,R1
MOVF R1,id1
Symbol Table and Error Handler
Symbol Table Management:
A symbol table is a data structure containing a record for each
identifier, with fields for the attributes of the identifier
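A minimal sketch in C of such a record and a lookup routine; the fields and
the helper name lookup are illustrative, and a real compiler would add scope,
line numbers and a hashed index:

#include <string.h>

#define NSYMS 256   /* table capacity; bounds check omitted for brevity */

struct symbol {
    char name[32];  /* the identifier's lexeme                 */
    char type[16];  /* attribute: e.g. "int" or "real"         */
    int  offset;    /* attribute: storage offset, filled later */
};

static struct symbol symtab[NSYMS];
static int nsyms = 0;

/* Return the index of name in the table, inserting it if absent. */
int lookup(const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return i;
    strcpy(symtab[nsyms].name, name);
    return nsyms++;
}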
Error Detection and Handling:
Errors can be detected during almost every phase of compilation
Static errors must be reported by the compiler
Each phase of a compiler will need a different kind of error handling
An error handler must contain different operations, each appropriate
for a specific phase and situation
LEXICAL ANALYSIS
▪ The process of compilation starts with the first phase called lexical analysis
▪ Its task is to read the input characters and produce a sequence of tokens as
output
“Token”, “Pattern”, “Lexeme”
Token: A sequence of characters having a collective meaning
Ex: keywords, operators, identifiers, constants, literal strings and
punctuation symbols such as parentheses, commas and semicolons
Pattern: A rule for describing a token
Ex: An integer is a pattern for token number
Lexeme: A sequence of characters in the source program that is matched
by the pattern for a token
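For example, in the statement count = count + 10; the lexemes count, =, +,
10 and ; are matched by the patterns for the tokens id, assign_op, add_op,
number and semicolon respectively (these token names are illustrative).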
LEXICAL ANALYSIS
Examples of tokens: if (keyword), rate (an identifier, token id), 60 (a
number) and <= (a relational operator, token relop)
LEXICAL ANALYSIS
Role of Lexical Analyzer:
The scanner or a lexical analyzer is responsible for:
1. Removing white space and comments
2. Identifying
- constants
- identifiers
- keywords
3. Generating tokens of the form <token-name, attribute-value>
4. Reporting about errors
LEXICAL ANALYSIS
Reasons for Separating Lexical Analysis from Syntax Analysis:
❖ Improves efficiency
❖ Compiler design made easy
❖ Reduces compilation time
❖ Improves portability
LEXICAL ANALYSIS
Need for Lexical Analysis:
▪ Simplify the overall design of the compiler
▪ Easier to specify the structure of tokens
▪ More specialized and more efficient recognizer for tokens
▪ Simplify the design of the syntax analyzer
▪ Keeps track of line numbers, producing an output listing if necessary
▪ Removes white space and comments
LEXICAL ANALYSIS
Attributes for tokens:
The lexical analyzer provides additional information to distinguish between
similar patterns that match a lexeme
▪ The lexical analyzer collects the attributes of tokens as the additional
information
▪ Usually a token has a single attribute: a pointer to the symbol table entry
in which the information about the token is kept
Ex: E = M * C ** 2 is returned as the token stream
<id, pointer to symbol-table entry for E> <assign_op>
<id, pointer to entry for M> <mult_op>
<id, pointer to entry for C> <exp_op> <number, integer value 2>
LEXICAL ANALYSIS
Lexical Errors and Their Handlers:
A lexical error is a sequence of characters that does not match the pattern of
any token.
Lexical errors:
1. A misspelled identifier
2. An identifier longer than the defined maximum length
Reasons:
1. An extra character beyond the specified length
2. An invalid character
3. A missing character
4. Swapped characters or misplaced characters
LEXICAL ANALYSIS
Lexical Errors and Their Handlers:
Error Handlers:
▪ If a numeric constant read by the lexical analyzer exceeds the permitted
length, the extra digits can be truncated and the rest retained
▪ If an extra character appears outside a comment section, that character can
be deleted
▪ If an illegal or invalid character appears, it can likewise be deleted
▪ If the analyzer cannot repair an error and is unable to proceed, it should
not get stuck at the error; instead it can delete characters until it finds a
valid token (panic mode recovery)
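A minimal sketch of panic mode recovery in C; the helper starts_valid_token()
is hypothetical and would really be derived from the token patterns:

#include <ctype.h>
#include <stdio.h>

/* Hypothetical test: could c begin a valid token? */
static int starts_valid_token(int c)
{
    return isalnum(c) || c == '_' || c == '(' || c == ')' ||
           c == '+' || c == '-' || c == '*' || c == '/' || c == ';';
}

/* Panic mode: delete characters until one that can start
   a valid token (or end of file) is found. */
void panic_recover(FILE *src)
{
    int c;
    while ((c = fgetc(src)) != EOF) {
        if (starts_valid_token(c)) {
            ungetc(c, src);  /* hand it back to the scanner */
            break;
        }
        /* otherwise the offending character is simply deleted */
    }
}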
LEXICAL ANALYSIS
Input Buffering:
▪ A block of data is first read into a buffer and then scanned by the lexical
analyzer, which must look ahead to decide where each lexeme ends
▪ It uses two pointers to read tokens:
begin_ptr (lexeme-beginning pointer, bp)
forward_ptr (search pointer, fp)
▪ The forward_ptr moves ahead to search for the end of the lexeme
Initial positions of pointers bp and fp
LEXICAL ANALYSIS
Advancing the pointers to the next lexeme
LEXICAL ANALYSIS
One Buffer Scheme:
Sentinel: a special character (such as EOF) that cannot occur within the
source program, placed at the end of the buffer to mark its end
Two Buffer Scheme:
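A sketch in C of how the sentinel test drives the two-buffer scheme; N,
SENTINEL and the function names are illustrative, and a real scanner would
also guard against lexemes longer than a buffer half:

#include <stdio.h>

#define N 4096          /* size of each buffer half              */
#define SENTINEL '\0'   /* assumed never to occur in source text */

static char buf[2 * (N + 1)];  /* two halves, each ending in a sentinel */
static char *forward;

static void reload(FILE *src, char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;        /* sentinel marks end of valid data */
}

void init_buffers(FILE *src) { reload(src, buf); forward = buf; }

/* Return the next character, switching halves when the sentinel
   at the end of a half is reached. */
int next_char(FILE *src)
{
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (forward == buf + N + 1) {           /* end of first half  */
        reload(src, buf + N + 1);
        forward = buf + N + 1;
        return next_char(src);
    }
    if (forward == buf + 2 * (N + 1)) {     /* end of second half */
        reload(src, buf);
        forward = buf;
        return next_char(src);
    }
    return EOF;   /* sentinel inside a half: real end of input */
}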
LEXICAL ANALYSIS
Specification of tokens:
• In compilation theory, regular expressions are used to formalize the
specification of tokens
• A regular expression is an important notation for specifying patterns.
• Each pattern matches a set of strings, so regular expressions serve as
names for sets of strings.
• Programming language tokens can be described by regular languages.
Recognition of Tokens
Regular Grammar and Regular Expression for
common Programming Language Features
A regular expression is a sequence of characters that defines a search
pattern:
• ε is a regular expression, and L(ε) = {ε}
• If a is a symbol in ∑, then a is a regular expression with L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Regular Grammar and Regular Expression for
common Programming Language Features
Regular definitions:
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_(letter_| digit)*
Regular Grammar and Regular Expression for
common Programming Language Features
Notational shorthands:
• Zero or more instances:(r)*
• One or more instances: (r)+
• Zero or one instance: (r)? denotes L(r) ∪ {ε}, e.g. 0?
• Character classes: [A-Za-z][A-Za-z0-9]*
• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_(letter_|digit)*
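As a quick way to experiment with the id pattern, a small C program using the
POSIX <regex.h> API; the pattern string simply mirrors the definition above:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* id -> letter_ (letter_ | digit)*, anchored to the whole string */
    const char *pat = "^[A-Za-z_][A-Za-z_0-9]*$";
    const char *tests[] = { "rate", "_tmp1", "60", "rate*60" };

    if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    for (int i = 0; i < 4; i++)
        printf("%-8s %s an id\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "is" : "is not");
    regfree(&re);
    return 0;
}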
Regular Expressions
1. Write a RE for the language containing the strings of length
two over ∑={a,b}
2. Write a RE for a language containing strings which end with
“pqq” over ∑={p,q}
3. Write a RE for the language accepting all combinations of p’s
except the null string over ∑={p}
4. Write a RE for the language containing all the strings with any
number of p’s and q’s
5. Write a RE for the language containing all strings having any
number of p’s and q’s except the null string
Regular Expressions
6. Write a RE for the language accepting all the strings which are
ending with aa over the set ∑={a,b}
7. Write a RE for the language accepting the strings which are
starting with b and ending with a over the set ∑={a,b}
8. Write a RE to denote the language L over ∑*, where ∑={p,q,r}, in
which any string will be any number of p’s followed by any
number of q’s followed by any number of r’s
9. Write a RE to denote a language L over ∑*, where ∑={p,q}, such
that 3rd character from right end of the string is always p
10. Write a RE for the language which consists of exactly two q’s
over the set ∑={p,q}
Regular Grammar and Regular Expression for common
Programming Language Features
Recognition of tokens:
digit -> [0-9]
digits -> digit+
real -> digits . digits
letter_ -> [A-Za-z_]
id -> letter_ (letter_ | digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
delimiter -> blank | newline | tab
ws -> delimiter+
Regular Grammar and Regular Expression for
common Programming Language Features
Transition Diagrams:
Transition diagram for relop:
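A hand-coded C sketch of the relop transition diagram; the token codes and
the use of getchar()/ungetc() for reading and retracting are illustrative:

#include <stdio.h>

enum relop_tok { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

/* Hand-coded version of the relop transition diagram:
   recognizes  <  <=  <>  =  >  >=  on standard input. */
int relop(void)
{
    int c = getchar();
    switch (c) {
    case '<':
        c = getchar();
        if (c == '=') return LE;
        if (c == '>') return NE;
        ungetc(c, stdin);   /* retract: lookahead not part of token */
        return LT;
    case '=':
        return EQ;
    case '>':
        c = getchar();
        if (c == '=') return GE;
        ungetc(c, stdin);
        return GT;
    default:
        ungetc(c, stdin);
        return NOT_RELOP;
    }
}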
Regular Grammar and Regular Expression for
common Programming Language Features
Transition diagram for reserved words and identifiers
Regular Grammar and Regular Expression for
common Programming Language Features
Transition diagram for whitespace
NFA vs. DFA
• NFAs and DFAs recognize the same set of languages (the regular languages)
• DFAs are easier to implement: there are no choices to consider
• For a given language the NFA can be simpler than the DFA
• A DFA can be exponentially larger than the corresponding NFA
Regular Expressions to Finite Automata
For each kind of regular expression, define an NFA (Thompson's construction)
Notation: NFA for regular expression A
Regular Expressions to Finite Automata
Example of RegExp -> NFA conversion
NFA -> DFA Example
Table Implementation of a DFA
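A table-driven DFA in C for the classic pattern (a|b)*abb, sketching how the
transition table is indexed by the current state and the input symbol:

#include <stdio.h>

/* Transition table for a DFA recognizing (a|b)*abb.
   Rows are states 0..3 (3 accepting); column 0 is input 'a',
   column 1 is input 'b'. */
static const int dfa_move[4][2] = {
    { 1, 0 },   /* state 0                     */
    { 1, 2 },   /* state 1: seen a             */
    { 1, 3 },   /* state 2: seen ab            */
    { 1, 0 },   /* state 3: seen abb (accept)  */
};

int accepts(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;                     /* symbol outside the alphabet */
        state = dfa_move[state][*s - 'a'];
    }
    return state == 3;
}

int main(void)
{
    printf("%d %d\n", accepts("aabb"), accepts("abab"));  /* prints 1 0 */
    return 0;
}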
Pass and Phases of Translation
Pass: A pass is one complete scan of a program, which includes reading
an input source program and converting the source program into
machine understandable form
Logical Structure of Pass
Pass and Phases of Translation
• A one-pass compiler is a compiler that passes through the parts of each
compilation unit only once, immediately translating each part into its final
machine code
Pass and Phases of Translation
• A two-pass compiler is a language processor that goes through the program
to be translated twice
• Front end: maps legal code into an Intermediate Representation (IR)
• Back end: maps the IR onto the target machine
Pass and Phases of Translation
▪ A multi-pass compiler (also called a wide compiler) processes the source
code of a program several times.
▪ In the first pass, the compiler reads the source program, scans it, extracts
the tokens and stores the result in an output file.
▪ In the second pass, it builds the syntax tree and performs the syntactic
analysis.
▪ In the third pass, it checks whether the tree follows the rules of the
language.
▪ Further passes follow until the target output is produced.
Pass and Phases of Translation
Single pass compiler | Multi pass compiler
1. Scans the entire source code only once | 1. Scans the source code several times
2. Less execution time | 2. More execution time
3. Debugging of the translated code is difficult | 3. Debugging of the translated code is easy
4. Memory requirement is more | 4. Memory requirement is less
5. Generated code is less efficient | 5. Generated code is more efficient
6. Backtracking is not allowed | 6. Backtracking is allowed
7. Narrow compiler | 7. Wide compiler
8. Pascal and C | 8. Java
Pass and Phases of Translation
Advantages of passes:
1. In a single-pass compiler, the program is read only once
2. The execution time for a two-pass compiler is very short
3. Pascal and C use single-pass compilers
Disadvantages:
1. A single-pass compiler requires more memory
2. Code generated by a single-pass compiler is less efficient
3. Designing a pass-based compiler is more complex
Pass and Phases of Translation
Phase: A compiler is divided into a number of parts or segments, and each
segment is known as a phase.
A general compiler has 6 phases
Advantages:
1. The compilation process is divided into a number of steps
2. The efficiency of the generated code is higher
Disadvantages:
1. The memory required is relatively higher than for a pass-based compiler
2. If any phase of the compiler enters an infinite loop, the compiler gets
stuck, which may result in a system crash
Pass and Phases of Translation
Pass | Phase
1. Various phases are logically grouped together to form a pass | 1. The compilation process is done in six phases
2. Single pass or multiple passes | 2. Analysis and synthesis phases
3. The compilation model is viewed as a front-end and back-end model | 3. Lexical, syntax and semantic analysis phases are machine independent; intermediate code generation, code optimization and code generation are machine dependent
Interpretation
▪ The data and the source program are input to the interpreter
▪ Produces the results by performing the operations of the source program on its
data
▪ BASIC, SNOBOL, LISP, Java are translated using interpreters
▪ The process of interpretation:
Lexical Analysis -> Syntax Analysis -> Semantic Analysis -> Direct Execution
Interpretation
Advantages:
1. Fast response to changes in the source program
2. Interpreters are easy to write and do not require large memory
Disadvantages:
1. Interpretation is time-consuming
2. Slower than a compiler
Comparison Between Interpreter & Compiler
Interpreter | Compiler
1. The source program is analysed every time it is run | 1. The program is analysed only once
2. Less efficient | 2. More efficient
3. No object code produced | 3. Object code produced
4. Simpler; gives an improved debugging environment | 4. Complex; requires a large amount of memory
5. A translator that produces results directly when the source program and data are given to it as input | 5. A translator that takes only the source program as input and converts it into object code
Cross Compiler
▪ The compiler which runs on one machine and produces target code
for another machine
▪ Platform independency is achieved
▪ T diagram is used to represent it.
Cross Compiler
[T-diagram: an L-to-N compiler written in S, compiled by an S-to-M compiler
written in M, yields an L-to-N compiler that runs on machine M]
Bootstrapping
• A process in which a simple language is used to translate a more complicated
program; that program in turn can handle an even more complicated program, and
so on.
• Bootstrapping is the process of writing a compiler in the very source
language that it is intended to compile.
Bootstrapping
[T-diagram: an X-to-Z compiler written in Y, compiled by a Y-to-M compiler
written in M, yields an X-to-Z compiler written in M]
Bootstrapping
[T-diagrams: an L-to-N compiler written in L, compiled by an existing L-to-M
compiler written in M, yields an L-to-N compiler in M code; compiling the same
L-to-N source with that new compiler then yields an L-to-N compiler in N code]
Bootstrapping
Bootstrapping a compiler
Compiler Construction Tools
1. Scanner Generator:
– Generate lexical analyzers
– Specifications given to these generators are in the form of RE’s
2. Parser Generators:
– Produce syntax analyzers
– Specifications given to these generators are in the form of CFG’s
3. Syntax-directed Translation Engines:
– parse tree is scanned completely to generate an intermediate code
4. Automatic code Generator:
– converts each rule of intermediate language into equivalent machine
language
5. Data Flow Engines: perform good code optimization
6. Compiler construction Toolkits: Provides an integrated set of routines for
constructing various phases of compiler
Lexical Analyzer Generator - Lex
Creating a Lexical Analyzer with LEX
Lexical Analyzer Generator - Lex
Structure of Lex programs:
Lexical Analyzer Generator - Lex
Declaration section
1. variables
2. manifest constants
– an identifier that represents a constant
Ex: #define MAX 10
3. regular definitions
– A sequence of regular expressions
n1 -> RE1
n2 -> RE2
…
nm -> REm
Lexical Analyzer Generator - Lex
Translation rules section
▪ set of statements, defines the action to be performed for each RE given.
RE1 {action1}
RE2 {action2}
…
REm {actionm}
Auxiliary Procedures
▪ Procedures are used by the actions in translation rules section.
▪ Auxiliary procedures can be compiled separately and loaded with the
scanner
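Putting the three sections together, a minimal Lex skeleton (the rule shown
is only a placeholder):

%{
/* declaration section: C declarations and manifest constants */
#include <stdio.h>
%}
digit [0-9]
%%
{digit}+ { printf("number: %s\n", yytext); }
.|\n { ; /* translation rules: ignore everything else */ }
%%
int main(void) { yylex(); return 0; } /* auxiliary procedures */
int yywrap(void) { return 1; }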
Design of Lexical Analyzer Generator
Pattern Matching Based on NFA’s
Example
NFA for a,abb,a*b+
Example
Combined NFA
Example
Sequence of states entered in processing input aaba
DFA for Lexical Analyzers
Transition Table for DFA
Lex programs to convert lower case to upper case and upper case to lower
case:
%{
/* lower case to upper case */
%}
%%
[a-z] { printf("%c", yytext[0] - 32); }
%%

%{
/* upper case to lower case */
%}
%%
[A-Z] { printf("%c", yytext[0] + 32); }
%%
count no. of words in a given sentence
%{
int c = 0;
%}
%%
[^ \t\n]+ { c++; /* each maximal run of non-blank characters is a word */ }
.|\n { ; }
%%
main()
{
yylex();
printf("no of words in a given sentence = %d\n", c);
}
count no. of vowels & consonants
%{
int c = 0, v = 0;
%}
%%
[aeiouAEIOU] { v++; }
[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z] { c++; }
%%
main()
{
yylex();
printf("no of consonants in a given sentence = %d\nno of vowels in a given sentence = %d\n", c, v);
}
identify keywords, numbers
%{
/* identify keywords, identifiers and numbers */
%}
letter [a-zA-Z][a-zA-Z0-9]*
digit [0-9]+
%%
int|float|do|char|else|while|for|if { printf("%s is a keyword\n", yytext); }
{letter} { printf("%s is an identifier\n", yytext); }
{digit} { printf("%s is a number\n", yytext); }
%%
main(int argc, char **argv)
{
if (argc > 1)
yyin = fopen(argv[1], "r");
else
yyin = stdin;
yylex();
printf("\n");
}
int yywrap()
{
return 1;
}