Unit I

This document provides an overview of compilation, detailing the phases involved such as lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. It explains the roles of different types of translators, including compilers, interpreters, and assemblers, while emphasizing the importance of properties like bug-free code and efficient execution. Additionally, it covers the use of regular expressions and grammars for token specification, as well as the differences between single and multi-pass compilers.

UNIT I

Overview of compilation
Syllabus:
▪ Introduction
▪ Phases of compilation
▪ Lexical Analysis
▪ Regular grammar and regular expression for common programming
language features
▪ Pass and phases of translation
▪ Interpretation
▪ LEX tool
Introduction
Translator: A program that converts source code into object code

Source language → TRANSLATOR → Target language

Types of Translators:
1. Compilers
2. Interpreters
3. Assemblers
INTRODUCTION

Compiler: A translator that reads an entire program written in a high-level
source language and translates it into an equivalent program in a target
language (typically machine code) before execution.
Interpreter: A translator that analyses and executes the source program
statement by statement, without producing a separate object program.
Assembler: A translator that converts assembly language programs into
machine code.
Why Compilers?
Properties of Compiler
▪ Bug-free
▪ Correct machine code
▪ Run fast
▪ Portable
▪ Produce error messages
▪ Work well with debuggers
▪ Consistent optimization
Overview of Compilation
THE PHASES OF COMPILER / STRUCTURE OF
COMPILER
Lexical Analyzer:
• The first phase of a compiler is called lexical analysis or scanning or linear
analysis.
• It reads the characters in the source program and groups them into a stream of
tokens.
• Each token represents a logically cohesive sequence of characters(identifier,
keyword, a punctuation character or a multi-character operator).
• The character sequence forming a token is called the lexeme for the token.
Ex: position := initial + rate * 60
After lexical analysis:
id1 := id2 + id3 * 60
Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing or hierarchical
analysis.
• The hierarchical tree structure generated in this phase is called parse tree or
syntax tree.
Syntax Analysis
Parse tree: It is a rooted tree that represents the syntactic structure of a
string according to some Context Free Grammar
Syntax tree: It is compressed representation of the parse tree in which
the operators appear as the interior nodes, and the operands of an
operator are the children of the node for that operator
Semantic Analysis
• Checks the source program for semantic errors.
• It gathers type information for the subsequent code generation phase.
• It uses the hierarchy structure determined by the syntax analysis phase
• An important component of semantic analysis is type checking
Intermediate Code Generation
Translate from abstract syntax trees to intermediate codes.
• It should have two important properties:
1. It should be easy to produce
2. It should be easy to translate into the target program
• It can have a variety of forms: one form is three-address code:

temp1:=inttoreal (60)
temp2:=id3*temp1
temp3:=id2+temp2
id1:=temp3
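As a sketch of how a code generator might walk an abstract syntax tree and emit these temporaries, here is a small Python illustration (the nested-tuple AST encoding and naming are assumptions for illustration, not the compiler's actual structures):

```python
# Emit three-address code for the running example id1 := id2 + id3 * 60
# from a nested-tuple AST of the form ('+', left, right).

temp_count = 0
code = []

def new_temp():
    global temp_count
    temp_count += 1
    return f"temp{temp_count}"

def gen(node):
    """Return the 'address' holding node's value, emitting code as needed."""
    if isinstance(node, str):          # identifier or constant: its own address
        return node
    op, left, right = node
    l, r = gen(left), gen(right)       # generate operands first
    t = new_temp()
    code.append(f"{t} := {l} {op} {r}")
    return t

ast = ('+', 'id2', ('*', 'id3', '60'))   # id2 + id3 * 60
code.append(f"id1 := {gen(ast)}")
for line in code:
    print(line)
```

Because the multiplication is nested deeper, its temporary is emitted first, matching the order of the three-address statements above.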
Code Optimization
• Attempts to improve the intermediate code so that better target code will result
• The result is a shorter sequence of statements

temp1:=id3*60.0
id1:=id2+temp1
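The improvement above can be sketched as a toy constant-folding pass in Python (a hypothetical illustration; a real optimizer would also propagate copies to reach the exact two-statement form shown):

```python
# Fold inttoreal(60) into the constant 60.0 at compile time, then substitute
# it into later statements. Naive textual substitution is for illustration
# only; a real pass would work on structured operands.

def fold(code):
    out = []
    consts = {}                                  # temp name -> folded constant
    for dest, expr in code:
        if expr.startswith("inttoreal("):
            n = expr[len("inttoreal("):-1]
            consts[dest] = f"{float(n)}"         # temp1 -> "60.0"
            continue                             # the statement disappears
        for t, v in consts.items():
            expr = expr.replace(t, v)
        out.append((dest, expr))
    return out

code = [("temp1", "inttoreal(60)"),
        ("temp2", "id3*temp1"),
        ("temp3", "id2+temp2"),
        ("id1",   "temp3")]
for dest, expr in fold(code):
    print(f"{dest} := {expr}")
```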
Code Generation
• Produces the target language in a specific architecture
• The target program is normally a relocatable object file containing the machine
code

MOVF id3,R2
MULF #60.0,R2
MOVF id2,R1
ADDF R2,R1
MOVF R1,id1
Symbol Table and Error Handler
Symbol Table Management:
A symbol table is a data structure containing a record for each
identifier, with fields for the attributes of the identifier
Error Detection and Handling:
Errors can be detected during almost every phase of compilation
The static errors must be reported by a compiler
Each phase of a compiler will need a different kind of error handling
An error handler must contain different operations, each appropriate
for a specific phase and situation
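A symbol table of this kind can be sketched as a simple map from identifiers to attribute records; the Python below is illustrative, and the field names are made up:

```python
# One record per identifier, with fields for its attributes.
# Field names (type, first_line) are illustrative assumptions.

symtab = {}

def install(name, **attrs):
    """Insert an identifier if it is new; return its record either way."""
    if name not in symtab:
        symtab[name] = {"name": name, **attrs}
    return symtab[name]

install("position", type="real", first_line=1)
install("rate", type="real", first_line=1)
# A later phase looking the name up finds the same record:
assert install("position")["type"] == "real"
```

Every phase of the compiler reads from and writes to this shared structure, which is why it sits alongside all six phases in the diagram.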
LEXICAL ANALYSIS
▪ The process of compilation starts with the first phase called lexical analysis
▪ Its task is to read the input characters and produce a sequence of tokens as
output
“Token”, “Pattern”, “Lexeme”
Token: A sequence of characters having a collective meaning
Ex: keywords, operators, identifiers, constants, literal strings and
punctuation symbols-parenthesis, commas and semicolon
Pattern: A rule for describing a token
Ex: An integer is a pattern for token number
Lexeme: A sequence of characters in the source program that is matched
by the pattern for a token
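The token/pattern/lexeme relationship can be sketched in Python with the `re` module: each pattern is a regular expression, each match is a lexeme, and the pair (token-name, lexeme) is the token. Pattern names here are illustrative:

```python
import re

# Each (name, regex) pair is a token pattern; matches are lexemes.
PATTERNS = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS))

def tokenize(src):
    """Return the stream of (token-name, lexeme) pairs; spaces are skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(src)]

print(tokenize("position := initial + rate * 60"))
```

Here "position" is a lexeme matched by the ID pattern, producing the token ("ID", "position").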
LEXICAL ANALYSIS
Examples of tokens:
Tokens
Specification of Tokens
LEXICAL ANALYSIS
Role of Lexical Analyzer:
LEXICAL ANALYSIS
Role of Lexical Analyzer:
The scanner or a lexical analyzer is responsible for:
1. Removing white space and comments
2. Identifying
- constants
- identifiers
- keywords
3. Generating tokens of the form <token-name, attribute-value>
4. Reporting about errors
LEXICAL ANALYSIS
Reasons for Separating Lexical Analysis from Syntax Analysis:

❖ Improves efficiency
❖ Compiler design made easy
❖ Reduces compilation time
❖ Improves portability
LEXICAL ANALYSIS
Need for Lexical Analysis:

▪ Simplify the overall design of the compiler


▪ Easier to specify the structure of tokens
▪ More specialized and more efficient recognizer for tokens
▪ Simplify the design of the syntax analyzer
▪ Keeps track of line numbers, producing an output listing if necessary
▪ Removing white spaces and deleting comments
LEXICAL ANALYSIS
Attributes for tokens:
The lexical analyzer provides additional information to distinguish between
the different lexemes that match the same pattern
▪ The lexical analyzer collects the attributes of tokens as the additional
information
▪ A token has a single attribute i.e. a pointer to the symbol table entry
in which the information about the token is kept
Ex: E=M*C**2
LEXICAL ANALYSIS
Lexical Errors and Their Handlers:
Lexical error is a sequence of characters that does not match the pattern of any
token.

Lexical errors:
1. A misspelled identifier
2. An identifier longer than the defined maximum length
Reasons:
1. An extra character than specified length
2. An invalid character
3. A missing character
4. Swapped characters or misplaced characters
LEXICAL ANALYSIS
Lexical Errors and Their Handlers:
Error Handlers:
▪ If a numeric constant exceeds the maximum length, the extra digits can be
truncated and the rest retained
▪ In the case of comments, if an extra character appears outside the comment
section, the character can be deleted
▪ If an illegal or invalid character appears, it can be deleted
▪ If it is not able to repair an error and is unable to proceed further, it should not get
stuck to the error rather it can delete the characters till it gets a valid token (panic
mode recovery)
LEXICAL ANALYSIS
Input Buffering:
▪ Decides whether a pattern forms a valid token or not
▪ A block of data is first read into a buffer and then scanned by lexical analyzer
▪ It uses two pointers to read tokens:
begin_ptr (lexeme-beginning)
forward_ptr (search pointer)
▪ The forward_ptr moves ahead to search for end of lexeme

Initial positions of pointers bp and fp


LEXICAL ANALYSIS

Updation of pointers for the next lexeme


LEXICAL ANALYSIS
One Buffer Scheme:

Sentinel: an EOF character used to identify the end of the buffer

Two Buffer Scheme:
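The two-pointer scheme above can be sketched in Python (the function name and the word-only lexemes are assumptions for illustration):

```python
# begin_ptr marks the start of the lexeme; forward_ptr advances until the
# lexeme ends. An EOF sentinel marks the end of the buffer, so each character
# needs only one test to detect both a delimiter and end-of-buffer.

SENTINEL = "\0"

def scan_word(buffer, begin_ptr):
    """Return (lexeme, new_begin_ptr) for the next run of letters."""
    forward_ptr = begin_ptr
    while buffer[forward_ptr].isalpha():
        forward_ptr += 1                 # forward_ptr searches for lexeme end
    lexeme = buffer[begin_ptr:forward_ptr]
    # skip delimiters; both pointers then restart at the next lexeme
    while buffer[forward_ptr] != SENTINEL and not buffer[forward_ptr].isalpha():
        forward_ptr += 1
    return lexeme, forward_ptr

buf = "int x" + SENTINEL
lexeme, bp = scan_word(buf, 0)
print(lexeme)        # int
print(scan_word(buf, bp)[0])   # x
```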


LEXICAL ANALYSIS
Specification of tokens:
• In theory of compilation regular expressions are used to formalize the
specification of tokens
• Regular expression is an important notation for specifying patterns.
• Each pattern matches a set of strings, so regular expressions serve as
names for a set of strings.
• Programming language tokens can be described by regular
languages.
Recognition of Tokens
Regular Grammar and Regular Expression for
common Programming Language Features
Regular expressions: a sequence of characters that define a search
pattern
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Regular Grammar and Regular Expression for
common Programming Language Features
Regular definitions:
d1 -> r1
d2 -> r2

dn -> rn

• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_(letter_| digit)*
Regular Grammar and Regular Expression for
common Programming Language Features
Notational shorthands:
• Zero or more instances:(r)*
• One or more instances: (r)+
• Zero or one instance: (r)?, denoting L(r) ∪ {Ɛ}   Ex: 0?
• Character classes: [A-Z|a-z][A-Z|a-z|0-9]*

• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_(letter_|digit)*
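These regular definitions can be tried directly with Python's `re` module; the character-class form of `id` is exactly letter_(letter_|digit)*:

```python
import re

# letter_ -> [A-Za-z_], digit -> [0-9], id -> letter_(letter_|digit)*
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

assert ID.fullmatch("rate")
assert ID.fullmatch("_tmp2")
assert not ID.fullmatch("2tmp")    # an identifier may not start with a digit
```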
Regular Expressions
1. Write a RE for the language containing the strings of length
two over ∑={a,b}
2. Write a RE for a language containing strings which end with
“pqq” over ∑={p,q}
3. Write a RE for the language accepting all combinations of p’s
except the null string over ∑={p}
4. Write a RE for the language containing all the strings with any
number of p’s and q’s
5. Write a RE for the language containing all strings having any
number of p’s and q’s except the null string
Regular Expressions
6. Write a RE for the language accepting all the strings which are
ending with aa over the set ∑={a,b}
7. Write a RE for the language accepting the strings which are
starting with b and ending with a over the set ∑={a,b}
8. Write a RE denote the language L over ∑*, where ∑={p,q,r} in
which any string will be any number of p’s followed by any
number of q’s followed by any number of r’s
9. Write a RE denote a language L over ∑* where ∑={p,q} such
that 3rd character from right end of the string is always p
10. Write a RE for the language which consists of exactly two q’s
over the set ∑={p,q}
Regular Grammar and Regular Expression for common
Programming Language Features
Recognition of tokens:
digit -> [0-9]
Digits -> digit+
Real -> Digits . Digits
letter_ -> [A-Za-z_]
id -> letter_ (letter_|digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
• We also need to handle whitespaces:
delimiter -> blank | newline | tab
ws -> delimiter+   i.e. ws = (blank | newline | tab)+
Regular Grammar and Regular Expression for
common Programming Language Features
Transition Diagrams:
Transition diagram for relop:
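Since the diagram itself is not reproduced here, the relop transition diagram can be sketched as explicit decision states in Python (token names are illustrative):

```python
# From the start state: '<' may continue to '<=' or '<>', '>' to '>=',
# and '=' is accepted immediately. A lone '<' or '>' corresponds to the
# diagram's "other -> retract and accept" edge.

def scan_relop(s, i=0):
    """Return (token-name, lexeme) for a relational operator at s[i]."""
    c = s[i] if i < len(s) else ""
    if c == "<":
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if nxt == "=":
            return ("LE", "<=")
        if nxt == ">":
            return ("NE", "<>")
        return ("LT", "<")        # retract: the next char is not part of relop
    if c == ">":
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if nxt == "=":
            return ("GE", ">=")
        return ("GT", ">")
    if c == "=":
        return ("EQ", "=")
    return (None, "")

print(scan_relop("<=")[0])   # LE
```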
Regular Grammar and Regular Expression for
common Programming Language Features
Transition diagram for reserved words and identifiers
Regular Grammar and Regular Expression for
common Programming Language Features
Transition diagram for whitespace
NFA vs. DFA
• NFAs and DFAs recognize the same set of languages (regular
languages)
• DFAs are easier to implement: there are no choices to consider
• For a given language the NFA can be simpler than the DFA
• A DFA can be exponentially larger than the equivalent NFA
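The subset idea behind the NFA-to-DFA construction can be sketched by simulating an NFA while tracking the *set* of current states; the NFA below, for (a|b)*abb, is an illustrative example:

```python
# NFA for (a|b)*abb: state 0 loops on a and b; on 'a' it may also guess
# that the final "abb" has begun (move to 1); 3 is the accepting state.

NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = {0}, {3}

def run(s):
    """Simulate the NFA on s by tracking the set of reachable states."""
    states = set(START)
    for c in s:
        states = set().union(*(NFA.get((q, c), set()) for q in states))
    return bool(states & ACCEPT)

assert run("aabb")
assert not run("abab")
```

Each state *set* reached here corresponds to one state of the equivalent DFA, which is exactly why the DFA can blow up exponentially in the worst case.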


Regular Expressions to Finite Automata
Regular Expressions to Finite Automata
For each kind of regular expression, define an NFA
Notation: NFA for regular expression A
Regular Expressions to Finite Automata
Example of RegExp -> NFA conversion
NFA -> DFA Example
Table Implementation of a DFA
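A table implementation of a DFA can be sketched like this in Python, here for the identifier pattern letter_(letter_|digit)*; the state numbering and table layout are illustrative:

```python
# Map each input character to a character class, then drive the DFA
# purely by table lookup: state -> {input class -> next state}.

def char_class(c):
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

TABLE = {
    0: {"letter": 1},                 # start state: must see a letter first
    1: {"letter": 1, "digit": 1},     # accepting: letters/digits may follow
}
ACCEPTING = {1}

def accepts(s):
    state = 0
    for c in s:
        state = TABLE.get(state, {}).get(char_class(c))
        if state is None:             # missing table entry = dead state
            return False
    return state in ACCEPTING

assert accepts("rate60")
assert not accepts("60rate")
```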
Pass and Phases of Translation
Pass: A pass is one complete scan of a program, which includes reading
an input source program and converting the source program into
machine understandable form

Logical Structure of Pass


Pass and Phases of Translation

• A one-pass compiler passes through each part of a compilation unit
only once, immediately translating each part into its final machine code.
Pass and Phases of Translation

• A two-pass compiler is a language processor that goes through the
program to be translated twice.
• Front end: maps legal code into an Intermediate Representation (IR).
• Back end: maps the IR onto the target machine.
Pass and Phases of Translation

▪ A multi-pass (wide) compiler processes the source code of a program
several times.
▪ In the first pass, the compiler reads the source program, scans it,
extracts the tokens and stores the result in an output file.
▪ In the second pass, it builds the syntax tree and performs syntactic
analysis.
▪ In the third pass, it checks whether the tree follows the rules of the
language.
▪ Passes continue until the target output is produced.
Pass and Phases of Translation

Single pass compiler                        | Multi pass compiler
1. Scans the entire source code only once   | 1. Scans the source code several times
2. Less execution time                      | 2. More execution time
3. Debugging the translated code is         | 3. Debugging the translated code is
   difficult                                |    easy
4. Memory requirement is more               | 4. Memory requirement is less
5. Generated code is less efficient         | 5. Generated code is more efficient
6. Backtracking is not allowed              | 6. Backtracking is allowed
7. Narrow compiler                          | 7. Wide compiler
8. Pascal and C                             | 8. Java
Pass and Phases of Translation
Advantages of pass:
1. In a single pass compiler, the program is read only once
2. The execution time of a single pass compiler is very short
3. Pascal and C use single pass compilers
Disadvantages:
1. For a single pass compiler the memory required is more
2. Code generated by a single pass compiler is not efficient
3. Complexity in designing a pass based compiler is more
Pass and Phases of Translation
Phase: A compiler is divided into number of parts or segments and each
segment is known as a phase.
A general compiler has 6 phases
Advantages:
1. The compilation process is divided into a number of steps
2. The efficiency of the code generated is more
Disadvantages:
1. Memory required is relatively high than a pass based compiler
2. If any phase of a compiler enters into an infinite loop then the compiler
gets stuck which may result into a system crash
Pass and Phases of Translation

Pass                                        | Phase
1. Various phases are logically grouped     | 1. The compilation process is done
   together to form a pass                  |    in six phases
2. Single pass or multiple pass             | 2. Analysis and synthesis phases
3. The compilation model is viewed as a     | 3. Lexical, syntax and semantic
   front end and back end model             |    analysis phases are machine
                                            |    independent; intermediate code
                                            |    generation, code optimization and
                                            |    code generation are machine
                                            |    dependent
Interpretation
▪ The data and the source program are input to the interpreter
▪ Produces the results by performing the operations of the source program on its
data

▪ BASIC, SNOBOL, LISP, Java are translated using interpreters


▪ The process of interpretation:
Lexical Analysis → Syntax Analysis → Semantic Analysis → Direct execution
Interpretation
Advantages:
1. Fast response of changes in source program
2. Easy to write and does not require a large amount of memory
Disadvantages:
1. Time consuming method
2. Slower than a compiler
Comparison Between Interpreter &
Compiler
Interpreter                                 | Compiler
1. The source program is analysed every     | 1. The program is analysed only once
   time it runs                             |
2. Less efficient                           | 2. More efficient
3. No object code produced                  | 3. Object code produced
4. Simpler and gives an improved            | 4. Complex and requires a large
   debugging environment                    |    amount of memory
5. Takes the source program and its data    | 5. Takes only the source program as
   as input and produces results directly   |    input and converts it into
                                            |    object code
Cross Compiler
▪ The compiler which runs on one machine and produces target code
for another machine
▪ Platform independency is achieved
▪ T diagram is used to represent it.
Cross Compiler

T-diagram: a compiler from L to N written in S (L S N), run through a
compiler from S to M (S M M), yields a compiler from L to N that runs on
machine M (L M N).
Bootstrapping
• A process in which a simple language is used to translate a more
complicated program; that program in turn can handle a still more
complicated program, and so on.
• Bootstrapping is the process of writing a compiler in the source
program language that it intends to compile the programs in the
source language
Bootstrapping

T-diagram: a compiler from X to Z written in Y (X Y Z), run through a
compiler from Y to M (Y M M), yields a compiler from X to Z that runs on
machine M (X M Z).
Bootstrapping

Step 1: a compiler from L to N written in L (L L N), run through an
existing compiler from L to M (L M M), yields a compiler from L to N that
runs on machine M (L M N).
Step 2: the same L L N compiler, run through the new L M N compiler,
yields a compiler from L to N expressed in N (L N N).
Bootstrapping

Bootstrapping a compiler
Compiler Construction Tools
1. Scanner Generator:
– Generate lexical analyzers
– Specifications given to these generators are in the form of RE’s
2. Parser Generators:
– Produce syntax analyser
– Specifications given to these generators are in the form of CFG’s
3. Syntax-directed Translation Engines:
– parse tree is scanned completely to generate an intermediate code
4. Automatic code Generator:
– converts each rule of intermediate language into equivalent machine
language
5. Data Flow Engines: perform good code optimization
6. Compiler construction Toolkits: Provides an integrated set of routines for
constructing various phases of compiler
Lexical Analyzer Generator - Lex

Creating a Lexical Analyzer with LEX


Lexical Analyzer Generator - Lex
Lexical Analyzer Generator - Lex
Structure of Lex programs:
Lexical Analyzer Generator - Lex

Declaration section
1. variables
2. manifest constants
– an identifier that represents a constant
Ex: #define MAX 10
3. regular definitions
– A sequence of regular expressions
n1 -> RE1
n2 -> RE2
...
nm -> REm
Lexical Analyzer Generator - Lex
Translation rules section
▪ set of statements, defines the action to be performed for each RE given.
RE1 {action1}
RE2 {action2}
...
REm {actionm}
Auxiliary Procedures
▪ Procedures are used by the actions in translation rules section.
▪ Auxiliary procedures can be compiled separately and loaded with the
scanner
Design of Lexical Analyzer Generator
Pattern Matching Based on NFA’s
Example
NFA for a,abb,a*b+
Example
Combined NFA
Example

Sequence of states entered in processing input aaba


DFA for Lexical Analyzers

Transition Table for DFA


Lex Program
Lex Program

Lex programs to convert lower case to upper case, and upper case to
lower case:
%{
%}
%%
[a-z] { printf("%c", yytext[0]-32); }
%%

%{
%}
%%
[A-Z] { printf("%c", yytext[0]+32); }
%%
count no of words in a given sentence
%{
int c = 0;
%}
%%
[ \t\n]+ {c++;}
%%
main()
{
yylex();
printf("no of words in the given sentence = %d", c);
}
count no. of vowels & consonants
%{
int c = 0, v = 0;
%}
%%
[aeiouAEIOU] {v++;}
[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z] {c++;}
%%
main()
{
yylex();
printf("no of consonants in the given sentence = %d \nno of vowels in the given sentence = %d", c, v);
}
identify keywords, numbers
%{
/* identify keywords, identifiers and numbers */
%}

letter [a-zA-Z][a-zA-Z0-9]*
digit [0-9]+

%%
int|float|do|char|else|while|for|if {printf("%s is a keyword", yytext);}
{letter} {printf("%s is an identifier", yytext);}
{digit} {printf("%s is a number", yytext);}
%%

main(int argc, char **argv)
{
if (argc > 1)
    yyin = fopen(argv[1], "r");
else
    yyin = stdin;
yylex();
printf("\n");
}

int yywrap()
{
return 0;
}
