
Compiler Design

CSE 353

Unit 1 syllabus

Course Outcomes
CO1: Explain the concepts and different phases of compilation, with compile-time error handling.
CO2: Represent language tokens using regular expressions, context-free grammar and finite automata, and design a lexical analyzer for a language.
CO3: Compare top-down with bottom-up parsers, and develop an appropriate parser to produce a parse-tree representation of the input.
CO4: Design syntax-directed translation schemes for a given context-free grammar.
CO5: Generate intermediate code for statements in a high-level language; benefits and limitations of automatic memory management.
CO6: Apply optimization techniques to intermediate code and generate machine code for a high-level-language program.
Introduction to Language Processing System

[Diagram: HLL source → Pre-processor → Pure HLL → Compiler → Assembly language → Assembler → Relocatable machine code → Loader/Linker → running program]
• Pre-Processor – The pre-processor handles all the #include directives by including the named files (file inclusion) and all the #define directives using macro expansion. It performs file inclusion, augmentation, macro processing, etc.

• Assembly Language – It is neither binary nor high-level. It is an intermediate form that combines machine instructions with other useful data needed for execution.

• Assembler – Every platform (hardware + OS) has its own assembler; assemblers are not universal. An assembler translates assembly language into machine code, and its output is called an object file.

• Interpreter – An interpreter converts a high-level language into low-level machine language, just like a compiler, but the two differ in how they read the input. A compiler scans the entire program and translates it as a whole into machine code, whereas an interpreter translates and executes the program one statement at a time. Interpreted programs are therefore usually slower than compiled ones.
• Relocatable Machine Code – Code that can be loaded at any point in memory and run. The addresses within the program are arranged so that the code tolerates being moved.

• Loader/Linker – The linker combines a number of object files into a single executable file. The loader then converts the relocatable code into absolute code, loads it into memory, and runs it, resulting in a running program or an error message (or sometimes both).
Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors to the programmer.

Structure of Compiler
Executing a program written in an HLL basically consists of two parts: the source program must first be compiled (translated) into an object program; the resulting object program is then loaded into memory and executed.

Execution process of source program in Compiler


LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation.

There are two parts of compilation:

a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)

The compilation process is partitioned into a number of sub-processes called 'phases'.


Structure of Compiler

[Diagram: source program → Lexical Analyzer → stream of tokens → Syntax Analyzer → parse tree → Semantic Analyzer → refined parse tree → Intermediate Code Generator → three-address code → Code Optimizer → optimized code → Code Generator → target code]
Lexical Analyzer –

• It is the first phase of the compiler.
• It gets input from the source program and produces tokens as output.
• It reads the characters one by one, from left to right, and forms tokens.
• Token: a logically cohesive sequence of characters. In a + b = 20, the items a, b, +, =, 20 are all separate tokens. The group of characters forming a token is called the lexeme.
• The lexical analyzer not only generates tokens but also classifies them into categories such as keywords, operators, identifiers, special symbols, etc. Example: it enters the lexeme into the symbol table if it is not already there.

Syntax Analyzer –

• It is the second phase of the compiler. It is also known as the parser.
• It gets the token stream as input from the lexical analyzer and generates a syntax tree as output.
• Syntax tree: a tree in which the interior nodes are operators and the leaf nodes are operands.
• Example: for a = b + c * 2, the syntax tree is sketched below.
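A text sketch of this syntax tree (with * binding tighter than +):

         =
        / \
       a   +
          / \
         b   *
            / \
           c   2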
Semantic Analyzer –
• It is the third phase of the compiler.
• It gets the parse tree from syntax analysis as input and checks whether the tree is semantically consistent, i.e., whether a syntactically correct construct actually has meaning.
• It performs type checking and inserts type conversions where needed, e.g., converting an integer operand to a real in a mixed-type expression.

Intermediate Code Generator –

• It is the fourth phase of the compiler.
• It gets input from semantic analysis and converts it into intermediate code such as three-address code.
• The three-address code consists of a sequence of instructions, each of which has at most three operands.
Example, for a = b + c * 2:
t1 = c * 2
t2 = b + t1
a = t2
CODE OPTIMIZATION:
• It is the fifth phase of the compiler.
• It gets the intermediate code as input and produces optimized intermediate code as output.
• This phase reduces redundant code and attempts to improve the intermediate code so that faster-running machine code will result.
• Code optimization does not affect the result of the program.
• To improve the generated code, optimization involves:
- detection and removal of dead code (unreachable code)
- calculation of constants in expressions and terms
- collapsing of repeated (common) subexpressions into a temporary
- loop unrolling
- moving code outside the loop
- removal of unwanted temporary variables
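As a hedged illustration (not from the slides), constant calculation plus dead-code removal might transform the code on the left into the code on the right:

t1 = 2 * 8           t1 = 16
t2 = b + t1          t2 = b + t1
a  = t2              a  = t2
t3 = a + 0           (t3 removed: its value is never used)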

CODE GENERATION:
• It is the final phase of the compiler.
• It gets input from the code optimization phase and produces the target code or object code as result.
• Intermediate instructions are translated into a sequence of machine instructions that perform the same task.
• Code generation involves:
- allocation of registers and memory
- generation of correct references
- generation of correct data types
- generation of missing code
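As a hedged sketch, assuming a simple two-address register machine (the mnemonics are illustrative, not any particular instruction set), the intermediate code for a = b + c * 2 might become:

LOAD  R1, c     ; R1 <- c
MUL   R1, #2    ; R1 <- R1 * 2   (t1)
ADD   R1, b     ; R1 <- R1 + b   (t2)
STORE R1, a     ; a  <- R1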
SYMBOL TABLE MANAGEMENT:
• The symbol table is used to store all the information about identifiers used in the program.
• It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
• It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
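A minimal C sketch of such a record with a lookup and an insert (the field set and the linear search are illustrative; real compilers usually use a hash table for speed):

#include <string.h>

struct symbol {
    char name[32];    /* identifier lexeme            */
    char type[16];    /* attribute: data type         */
    int  scope;       /* attribute: scope level       */
    int  offset;      /* attribute: storage location  */
};

static struct symbol table[1000];
static int n_symbols = 0;

/* Find the record for an identifier, or return NULL if absent. */
struct symbol *lookup(const char *name) {
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Enter the lexeme into the table if it is not already there. */
struct symbol *insert(const char *name) {
    struct symbol *s = lookup(name);
    if (s == NULL) {
        s = &table[n_symbols++];
        strncpy(s->name, name, sizeof s->name - 1);
    }
    return s;
}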

ERROR HANDLING:
• Each phase can encounter errors. After detecting an error, a phase must handle it so that compilation can proceed.
• In lexical analysis, errors occur in the separation of tokens.
• In syntax analysis, errors occur during construction of the syntax tree.
• In semantic analysis, errors occur when the compiler detects constructs that have the right syntactic structure but no meaning, e.g., during type checking.
• In code optimization, errors occur when the result is affected by the optimization. In code generation, errors occur when code is missing, etc.
Lexical Analyzer
Lexical analysis is the first phase of the compiler, also known as the scanner. It converts the high-level input program into a sequence of tokens.
• Lexical analysis can be implemented with a deterministic finite automaton.
• Its task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
• Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER, NUMBER, COMMA).
The process of forming tokens from an input stream of characters is called tokenization.
Consider this expression in the C programming language: sum = 3 + 2;

Lexeme   Token Type
sum      Identifier
=        Assignment operator
3        Number
+        Addition operator
2        Number
;        End of statement

LEXEME: A collection or group of characters forming a token is called a lexeme.

PATTERN: A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
How Lexical Analyzer functions

1. Tokenization, i.e., dividing the program into valid tokens.


2. Remove white space characters.
3. Remove comments.

The function of lexical analysis is to form the tokens and separate them out from the program or statement.

How do we separate them from the program?

We design regular expressions to represent the tokens, and design an NFA/DFA to act as the token recognizer.

Transition diagram for an identifier:

state 0 (start) --letter--> state 1
state 1 --letter/digit--> state 1 (loop)
state 1 --delimiter--> state 2 (accept)
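A minimal C sketch of this recognizer, with isalpha/isalnum standing in for "letter" and "letter/digit", and end-of-string playing the role of the delimiter:

#include <ctype.h>

/* Accepts strings of the form: letter (letter | digit)*  */
int is_identifier(const char *s) {
    int state = 0;                              /* state 0: start          */
    for (; *s; s++) {
        if (state == 0 && isalpha((unsigned char)*s))
            state = 1;                          /* state 1: in identifier  */
        else if (state == 1 && isalnum((unsigned char)*s))
            state = 1;                          /* loop on letter/digit    */
        else
            return 0;                           /* no transition: reject   */
    }
    return state == 1;                          /* delimiter -> state 2    */
}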


Consider the following program and find the number of tokens:

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
Solution

'int' 'main' '(' ')' '{' 'int' 'a'

',' 'b' ';' 'a' '=' '10'

';' 'return' '0' ';' '}'

Total: 18 tokens (the comment and white space are discarded).

Question 2:
int max(x, y)
int x, y;
/* find max of x and y */
{
return(x > y ? x : y);
}
Solution:
int max ( x , y )

int x , y ; { return

( x > y ? x :

y ) ; }

Total: 25 tokens.

Question 3:

Printf("%d, hi", &x);
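Solution (note that a string literal forms a single token):

Printf ( "%d, hi" , & x ) ;

Total: 8 tokens.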


Passes

A pass refers to the number of times the compiler goes through the source
code.

Two types of passes in a compiler:

A single-pass compiler goes through the program only once. In other words, the single-pass compiler allows the source code to pass through each compilation unit only once, immediately translating each code section into its final machine code.

A multi-pass compiler goes through the source code several times. In other words, it allows the source code to pass through each compilation unit several times. Each pass takes the result of the previous pass as input and creates intermediate outputs; therefore, the code improves in each pass. The final code is generated after the final pass.

The main difference between phases and passes of a compiler is that phases are the steps in the compilation process, while passes are the number of times the compiler traverses the source code.
Bootstrapping and Cross Compiler
Bootstrapping is a process in which a simple language is used to translate a more complicated program, which in turn may handle a far more complicated program. This complicated program can further handle an even more complicated program, and so on.

A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is running. For example, a compiler that runs on a Windows 7 PC but generates code that runs on an Android smartphone is a cross compiler.
Continued..

▪ Suppose we want to write a cross compiler for a new language X.
▪ The implementation language of this compiler is, say, Y, and the target code being generated is in language Z. That is, we create XYZ.
▪ Now if the existing compiler for Y runs on machine M and generates code for M, then it is denoted as YMM.
▪ Now if we run XYZ using YMM, then we get a compiler XMZ.
▪ That means a compiler for source language X that runs on machine M and generates target code in language Z.
Difference between Native Compiler and Cross Compiler

Native Compiler:
- Translates a program for the same hardware/platform/machine on which it is running.
- Used to build programs for the same system/machine and OS on which it is installed.
- Dependent on the system/machine and OS.
- Can generate an executable file, e.g., .exe.
- Turbo C or GCC is a native compiler.

Cross Compiler:
- Translates a program for a different hardware/platform/machine than the one on which it is running.
- Used to build programs for other systems/machines, such as AVR/ARM.
- Independent of the system/machine and OS.
- Can generate raw code, e.g., .hex.
- Keil is a cross compiler.


Deterministic Finite Automata
• DFA stands for deterministic finite automaton. Deterministic refers to the uniqueness of the computation.
• A finite automaton is called deterministic if the machine reads the input string one symbol at a time and has exactly one move defined for each state and input symbol.
• In a DFA, there is only one path for a specific input from the current state to the next state.
• A DFA does not accept the null move, i.e., it cannot change state without an input character.
• A DFA can contain multiple final states. DFAs are used in lexical analysis in a compiler.
• A DFA is a collection of 5 tuples:
• Q: finite set of states
• ∑: finite set of input symbols
• q0: initial state
• F: set of final states
• δ: transition function

The transition function can be defined as:

• δ: Q × ∑ → Q

• Design an FA with ∑ = {0, 1} that accepts those strings which start with 1 and end with 0, e.g.:
• 10100 (accepted)
• 111 (rejected)
• 00110 (rejected)
• Design an FA with ∑ = {0, 1} that accepts the set of all strings with three consecutive 0's, e.g.:
• 100001 (accepted)
• 10001 (accepted)
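A hedged C sketch of the first automaton (state names are illustrative: S = start; A = started with 1, last symbol 1; B = started with 1, last symbol 0, accepting; D = dead state after a leading 0):

#include <stdio.h>

int accepts(const char *s) {
    enum { S, A, B, D } state = S;
    for (; *s; s++) {
        switch (state) {
        case S: state = (*s == '1') ? A : D; break;  /* must start with 1 */
        case A: state = (*s == '0') ? B : A; break;
        case B: state = (*s == '0') ? B : A; break;
        case D: break;                               /* trap state        */
        }
    }
    return state == B;                               /* must end with 0   */
}

int main(void) {
    printf("%d %d %d\n", accepts("10100"), accepts("111"), accepts("00110"));
    /* prints 1 0 0: only 10100 is accepted */
    return 0;
}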
Deterministic and Nondeterministic
Automata
• Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
• Nondeterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a
given state
– Can have ε-moves
• Finite automata have finite memory
– Need only to encode the current state

Execution of Finite Automata
• A DFA can take only one path through the
state graph
– Completely determined by input

• NFAs can choose


– Whether to make ε-moves
– Which of multiple transitions for a single input to
take

Acceptance of NFAs
• An NFA can get into multiple states
[NFA diagram omitted: a small NFA over {0, 1}]

• Input: 1 0 1

• Rule: an NFA accepts if it can get into a final state

NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of
languages (regular languages)

• DFAs are easier to implement


– There are no choices to consider

NFA vs. DFA (2)
• For a given language the NFA can be simpler than
the DFA
[Diagrams omitted: an NFA for the language, and the corresponding DFA over {0, 1}, which needs more states]

• DFA can be exponentially larger than NFA

Regular Expressions to Finite Automata
• High-level sketch
Lexical specification → regular expressions → NFA → DFA

Regular Expressions to NFA (1)
• For each kind of rexp, define an NFA
– Notation: NFA for rexp A

• For ε: a start state with a single ε-transition to an accepting state.

• For input a: a start state with a single a-transition to an accepting state.

Regular Expressions to NFA (2)
• For AB: the accepting state of the NFA for A gets an ε-transition into the start state of the NFA for B.

• For A | B: a new start state has ε-transitions into the NFAs for both A and B, and the accepting states of both have ε-transitions into a new common accepting state.

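A minimal C sketch of these two rules, assuming each NFA state carries one symbol (or ε) and at most two outgoing edges — a common representation for this construction; all names here are illustrative:

#include <stdlib.h>

enum { EPS = -1 };                 /* marks an ε / no-symbol state */

struct state {
    int c;                         /* input symbol, or EPS         */
    struct state *out, *out1;      /* up to two outgoing edges     */
};

struct frag { struct state *start, *accept; };

static struct state *new_state(int c, struct state *out, struct state *out1) {
    struct state *s = malloc(sizeof *s);
    s->c = c; s->out = out; s->out1 = out1;
    return s;
}

/* NFA for a single input symbol a:  start --a--> accept */
struct frag frag_sym(int a) {
    struct state *acc = new_state(EPS, NULL, NULL);
    return (struct frag){ new_state(a, acc, NULL), acc };
}

/* AB: A's accepting state gets an ε-edge into B's start. */
struct frag frag_concat(struct frag a, struct frag b) {
    a.accept->out = b.start;
    return (struct frag){ a.start, b.accept };
}

/* A | B: new ε start into both; both accepts ε into a new accept. */
struct frag frag_alt(struct frag a, struct frag b) {
    struct state *acc = new_state(EPS, NULL, NULL);
    a.accept->out = acc;
    b.accept->out = acc;
    return (struct frag){ new_state(EPS, a.start, b.start), acc };
}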
Example of RegExp -> NFA conversion
• Consider the regular expression
(1 | 0)*1
• The NFA (states A–J, written as a transition list):
A ε→ B, A ε→ H; B ε→ C, B ε→ D; C 1→ E; D 0→ F; E ε→ G, F ε→ G; G ε→ A; H ε→ I; I 1→ J (J is the accepting state)

NFA to DFA. The Trick
• Simulate the NFA
• Each state of resulting DFA
= a non-empty subset of states of the NFA
• Start state
= the set of NFA states reachable through ε-moves from
NFA start state
• Add a transition S --a--> S' to the DFA iff
– S' is the set of NFA states reachable from the states in S after seeing the input a, considering ε-moves as well
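A compact C sketch of this trick, representing each set of NFA states as a bitmask; it assumes at most 32 NFA states, and the eps and delta tables are illustrative placeholders to be filled in from the NFA:

#include <stdint.h>

#define NSTATES 10                   /* e.g., NFA states A..J        */

uint32_t eps[NSTATES];               /* eps[s]: ε-successors of s    */
uint32_t delta[NSTATES][2];          /* delta[s][a]: successors on a */

/* Close a set of NFA states under ε-moves. */
uint32_t eps_closure(uint32_t set) {
    uint32_t prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s))
                set |= eps[s];
    } while (set != prev);
    return set;
}

/* One DFA transition: S' = ε-closure(union of delta[s][a] for s in S). */
uint32_t dfa_step(uint32_t S, int a) {
    uint32_t next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (S & (1u << s))
            next |= delta[s][a];
    return eps_closure(next);
}

The DFA start state is then eps_closure(1u << nfa_start).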

NFA -> DFA Example
The NFA for (1 | 0)*1 from the previous slide, with states A–J. Subset construction yields a DFA with three states:

start state {A,B,C,D,H,I}: on 0 → {F,G,A,B,C,D,H,I}; on 1 → {E,J,G,A,B,C,D,H,I} (accepting, as it contains J)
{F,G,A,B,C,D,H,I}: on 0 → itself; on 1 → {E,J,G,A,B,C,D,H,I}
{E,J,G,A,B,C,D,H,I}: on 1 → itself; on 0 → {F,G,A,B,C,D,H,I}

Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program.
It has the following issues:

•Lookahead

•Ambiguities
Lookahead
• Lookahead is required to decide when one token ends and the next token begins. Simple examples with lookahead issues are i vs. if, and = vs. ==. Therefore, a way to describe the lexemes of each token is required.
• A way is also needed to resolve ambiguities:
• Is if two variables i and f, or the keyword if?
• Is == two equal signs = =, or one operator ==?
• Hence, the amount of lookahead to consider, and a way to describe the lexemes of each token, are both needed.
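A tiny C sketch of the = vs. == decision using one character of lookahead (the token codes are illustrative):

#include <stdio.h>

enum { ASSIGN, EQ };

/* Called after reading '='; peeks one character ahead. */
int scan_after_equals(FILE *in) {
    int c = getc(in);
    if (c == '=')
        return EQ;        /* consumed "==" */
    ungetc(c, in);        /* the lookahead is not part of this token: push it back */
    return ASSIGN;        /* plain "=" */
}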
Ambiguities

The lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. When more than one expression can match the current input, lex chooses as follows:

•The longest match is preferred.

• Among rules which matched the same number of characters, the rule given
first is preferred.
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error. Important facts about lexical errors:

▪ Lexical errors are not very common, but they should be managed by the scanner.

▪ Misspellings of identifiers, operators, or keywords are considered lexical errors.

▪ Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
Error Recovery in Lexical Analyzer

The most common error recovery techniques:

▶ Remove one character from the remaining input.

▶ In panic mode, successive characters are ignored until we reach a well-formed token (a sketch follows this list).

▶ Insert a missing character into the remaining input.

▶ Replace a character with another character.

▶ Transpose two adjacent characters.
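A hedged C sketch of panic-mode recovery (the set of characters assumed to start a token is illustrative):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Discard characters until one that can start a well-formed token. */
void panic_skip(FILE *in) {
    int c;
    while ((c = getc(in)) != EOF) {
        if (isalnum(c) || strchr("+-*/=<>;(){},", c)) {
            ungetc(c, in);         /* stop: c may begin the next token */
            return;
        }
        /* otherwise keep ignoring successive characters */
    }
}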


Lexical Analyzer vs. Parser

Lexical Analyzer                  Parser
Scans the input program           Performs syntax analysis
Identifies tokens                 Creates an abstract representation of the code
Inserts tokens into the           Updates symbol table entries
symbol table
Generates lexical errors          Generates a parse tree of the source code
