Phases of Compiler
Efficiency: A lexer may do the simple parts of the work faster than the more general parser can. Furthermore, the size of a system that is split in two may be smaller than a combined system. While it is usually not terribly difficult to write a lexer by hand, a handwritten lexer may be complex and difficult to maintain. Hence, lexers are normally constructed by lexer generators, which transform human-readable specifications of tokens and white-space into efficient programs. For lexical analysis, such specifications are traditionally written using regular expressions, and the generated lexers belong to a class of extremely simple programs called finite automata.
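To make this concrete, here is a minimal hand-coded sketch in Python of such a finite automaton, one that accepts identifiers of the form letter (letter | digit)*; the function name and state encoding are illustrative assumptions, not the output of any particular generator.

# A two-state deterministic finite automaton for identifiers:
# state 0 is the start state, state 1 is the accepting state.
def is_identifier(s: str) -> bool:
    state = 0
    for ch in s:
        if state == 0 and ch.isalpha():
            state = 1                  # first character must be a letter
        elif state == 1 and (ch.isalpha() or ch.isdigit()):
            state = 1                  # later characters: letters or digits
        else:
            return False               # no transition defined: reject
    return state == 1                  # accept only if we ended in state 1

print(is_identifier("rate"), is_identifier("60rate"))  # True False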
A practical parsing method must, moreover, be deterministic: at each step the parser must be able to pick the right rewrite from the input it has read so far. Methods that build the parse tree from the leaves upward toward the root are called bottom-up parsing methods.
The advantage of using an intermediate language is most obvious if many languages are to be compiled to many machines. If translation is done directly, the number of compilers equals the product of the number of languages and the number of machines. If a common intermediate language is used, one front-end (i.e., a compiler to intermediate code) is needed for every language and one back-end for each machine, making the total equal to the sum of the number of languages and the number of machines. For example, five languages and four machines would require 5 × 4 = 20 direct compilers, but only 5 front-ends plus 4 back-ends.
If an interpreter for the intermediate language is written in a language for which there already
exist compilers on the target machines, the interpreter can be compiled on each of these. This
way, there is no need to write a separate back-end for each machine.
The advantages of this approach are:
1. No actual back-end needs to be written for each new machine.
2. A compiled program can be distributed in a single intermediate form for all machines, as opposed to shipping separate binaries for each machine.
3. The intermediate form may be more compact than machine code. This saves space both in distribution and on the machine that executes the programs (though the latter is somewhat offset by requiring the interpreter to be kept in memory during execution).
String:
position = initial + rate * 60
Lexical Analysis:
The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexer produces a token.
position → <id,1>
= → <=>
initial → <id,2>
+ → <+>
rate → <id,3>
* → <*>
60 → <60>

The statement is thus represented by the token sequence <id,1> <=> <id,2> <+> <id,3> <*> <60>, where each <id,n> refers to entry n in the symbol table.
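As a rough illustration, here is a minimal tokenizer sketch in Python for this example; the token patterns, tuple shapes, and symbol-table layout are illustrative assumptions, not the specification of any particular lexer.

import re

# Illustrative token patterns; a real lexer would cover the whole language.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[=+\-*/]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    symtab = {}                          # lexeme -> symbol-table index
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                     # white-space produces no token
        if kind == "ID":
            idx = symtab.setdefault(lexeme, len(symtab) + 1)
            tokens.append(("id", idx))   # <id, symbol-table entry>
        elif kind == "NUM":
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((lexeme,))     # operators carry no attribute
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
print(tokens)  # [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('num', 60)]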
Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation. An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example, many programming
language definitions require an array index to be an integer; the compiler must report an error if a
floating-point number is used to index an array.
The language specification may permit some type conversions called coercions. For example, a
binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point
numbers. If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number. Suppose that position, initial, and rate
have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an
integer. The type checker in the semantic analyzer discovers that the operator * is applied to a
floating-point number rate and an integer 60. In this case, the integer may be converted into a
floating-point number. Notice that the output of the semantic analyzer then contains an extra node for the operator inttofloat, which explicitly converts its integer argument into a floating-point number.
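As a rough sketch of how a type checker might discover and insert this coercion, the following Python walks a syntax tree encoded as nested tuples; the node encoding and type names are illustrative assumptions.

# Tree nodes: interior (op, left, right); leaves ("id", name, type)
# or ("num", value, type), where type is the string "int" or "float".
def typecheck(node):
    """Return (typed_node, node_type), inserting coercions where needed."""
    if node[0] in ("id", "num"):
        return node, node[2]
    op, left, right = node
    left, lt = typecheck(left)
    right, rt = typecheck(right)
    if lt == rt:
        return (op, left, right), lt
    # Mixed int/float operands: coerce the integer side to float.
    if lt == "int":
        left = ("inttofloat", left)
    else:
        right = ("inttofloat", right)
    return (op, left, right), "float"

tree = ("+", ("id", "initial", "float"),
             ("*", ("id", "rate", "float"), ("num", 60, "int")))
typed, t = typecheck(tree)
print(typed)
# ('+', ('id', 'initial', 'float'),
#  ('*', ('id', 'rate', 'float'), ('inttofloat', ('num', 60, 'int'))))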
Intermediate Code Generation:
In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can have a variety of forms. One such form is the syntax tree, commonly used during syntax and semantic analysis. After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce, and it should be easy to translate into the target machine.
A second form is three-address code, which consists of a sequence of assembly-like instructions with at most three operands per instruction. For our example, the output of the intermediate code generator is the following three-address code sequence:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
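A minimal sketch, in Python, of how such a sequence might be produced from the annotated syntax tree; the tuple encoding and temporary-naming scheme are illustrative assumptions, and the leaves carry source names rather than the id1/id2/id3 symbol-table references used above.

from itertools import count

# The syntax tree after semantic analysis (inttofloat already inserted);
# leaves are ("id", name) or ("num", value) in this illustrative encoding.
tree = ("+", ("id", "initial"),
             ("*", ("id", "rate"), ("inttofloat", ("num", 60))))

def gen(node, code, temps):
    """Emit three-address instructions into `code`; return the operand name."""
    if node[0] == "id":
        return node[1]
    if node[0] == "num":
        return str(node[1])
    if node[0] == "inttofloat":
        arg = gen(node[1], code, temps)
        t = f"t{next(temps)}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen(left, code, temps)
    r = gen(right, code, temps)
    t = f"t{next(temps)}"
    code.append(f"{t} = {l} {op} {r}")
    return t

code = []
result = gen(tree, code, count(1))
code.append(f"position = {result}")
print("\n".join(code))
# t1 = inttofloat(60)
# t2 = rate * t1
# t3 = initial + t2
# position = t3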
Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate code so
that better target code will result. Usually better means faster, but other objectives may be
desired, such as shorter code or target code that consumes less power. For example, a straightforward algorithm generates the intermediate code as in the previous phase, using an instruction for each operator in the tree representation that comes from the semantic analyzer.
A simple intermediate code generation algorithm followed by code optimization is a reasonable
way to generate good target code. The optimizer can deduce that the conversion of 60 from
integer to floating point can be done once and for all at compile time, so the inttofloat operation
can be eliminated by replacing the integer 60 with the floating-point number 60.0. Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform the intermediate code produced by the previous phase into the shorter sequence below:
t1 = id3 * 60.0
id1 = id2 + t1
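A minimal sketch, in Python, of these two improvements (folding the compile-time conversion and merging the single-use copy) over a tuple encoding of the three-address code; the encoding is an illustrative assumption, temporaries are not renumbered (so the output reads t2 where the text above shows t1), and source names stand in for the id1/id2/id3 references.

# Three-address code as tuples: ("inttofloat", dest, src),
# ("binop", op, dest, left, right), ("assign", dest, src).
tac = [
    ("inttofloat", "t1", "60"),
    ("binop", "*", "t2", "rate", "t1"),
    ("binop", "+", "t3", "initial", "t2"),
    ("assign", "position", "t3"),
]

def fold_and_propagate(code):
    """Fold inttofloat of integer constants and merge single-use copies."""
    consts = {}                              # temp -> folded float constant
    out = []
    for instr in code:
        kind = instr[0]
        if kind == "inttofloat" and instr[2].lstrip("-").isdigit():
            consts[instr[1]] = instr[2] + ".0"   # done once at compile time
            continue                             # the conversion disappears
        if kind == "binop":
            _, op, dest, l, r = instr
            out.append(("binop", op, dest, consts.get(l, l), consts.get(r, r)))
        elif kind == "assign" and out and out[-1][0] == "binop" and out[-1][2] == instr[2]:
            # The previous result is used only by this copy (true in this
            # straight-line example), so write directly to the final target.
            _, op, _, l, r = out.pop()
            out.append(("binop", op, instr[1], l, r))
        else:
            out.append(instr)
    return out

for instr in fold_and_propagate(tac):
    if instr[0] == "binop":
        _, op, dest, l, r = instr
        print(f"{dest} = {l} {op} {r}")
    else:
        print(f"{instr[1]} = {instr[2]}")
# t2 = rate * 60.0
# position = initial + t2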
Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language. If the target language is machine code, registers or memory locations
are selected for each of the variables used by the program. Then, the intermediate instructions are
translated into sequences of machine instructions that perform the same task. A crucial aspect of
code generation is the judicious assignment of registers to hold variables.
For example, using registers R1 and R2, the intermediate code produced by the previous phase might get translated into the following machine code:
LDF R2, id3
MULF R2, #60.0
LDF R1, id2
ADDF R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction tells us that it deals with floating-point numbers. The first instruction loads the contents of address id3 into register R2, and the second multiplies it by the floating-point constant 60.0; the # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement that we started with.
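As a rough sketch of this last step, the following Python translates the optimized three-address code into instructions of the style shown above, with a deliberately naive register allocator (no spilling); the tuple encoding and allocation order are illustrative assumptions chosen to mirror the example, and source names again stand in for id1/id2/id3.

# Optimized three-address code as (dest, left, op, right) tuples.
tac = [("t1", "rate", "*", "60.0"),
       ("position", "initial", "+", "t1")]

def is_const(x):
    return x.replace(".", "", 1).lstrip("-").isdigit()

def codegen(tac):
    regs = {}                           # value name -> register holding it
    free = iter(["R2", "R1"])           # order chosen to mirror the example
    asm = []

    def load(operand):
        if is_const(operand):
            return f"#{operand}"        # immediate constant
        if operand not in regs:
            r = next(free)
            asm.append(f"LDF {r}, {operand}")
            regs[operand] = r
        return regs[operand]

    for dest, l, op, r in tac:
        rl = load(l)
        asm.append(f"{'MULF' if op == '*' else 'ADDF'} {rl}, {load(r)}")
        regs[dest] = rl                 # result replaces the left operand
        if not dest.startswith("t"):    # program variable: store it back
            asm.append(f"STF {dest}, {rl}")
    return asm

print("\n".join(codegen(tac)))
# LDF R2, rate
# MULF R2, #60.0
# LDF R1, initial
# ADDF R1, R2
# STF position, R1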
The compiler writer, like any software developer, can profitably use modern software
development environments containing tools such as language editors, debuggers, version
managers, profilers, test harnesses, and so on. In addition to these general software-development
tools, other more specialized tools have been created to help implement various phases of a
compiler.
These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler.
Some commonly used compiler-construction tools include:
1. Parser generators that automatically produce syntax analyzers from a grammatical description
of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.