Unit 1 Notes
A compiler is a program that can read a program in one language — the source language
— and translate it into an equivalent program in another language — the target language.
An important role of the compiler is to report any errors in the source program that it detects
during the translation process. If the target program is an executable machine-language program,
it can then be called by the user to process inputs and produce outputs.
The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.
Java language processors combine compilation and interpretation. A Java source program
may first be compiled into an intermediate form called bytecodes. The bytecodes are then
interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one
machine can be interpreted on another machine, perhaps across a network. In order to achieve
faster processing of inputs to outputs, some Java compilers, called just-in-time compilers,
translate the bytecodes into machine language immediately before they run the intermediate
program to process the input.
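As a concrete illustration (the file and class names are only placeholders), with the standard JDK tools:
javac Greeting.java    # compiles the source into bytecodes stored in Greeting.class
java Greeting          # the JVM interprets the bytecodes, JIT-compiling frequently executed parts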
A source program may be divided into modules stored in separate files. The task of
collecting the source program is sometimes entrusted to a separate program, called a
preprocessor. The preprocessor may also expand shorthands, called macros, into source
language statements.
The modified source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier to produce as
output and is easier to debug. The assembly language is then processed by a program called an
assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine. The linker resolves external memory addresses, where the code in one file may
refer to a location in another file. The loader then puts together all of the executable object files
into memory for execution.
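On a typical Unix-like system these stages can be seen by driving the toolchain one step at a time; the file names below are illustrative, and a single cc prog.c command normally runs all of the stages for you:
cc -E prog.c -o prog.i    # preprocess: expand macros and includes
cc -S prog.i -o prog.s    # compile: produce assembly language
as prog.s -o prog.o       # assemble: produce relocatable machine code
cc prog.o -o prog         # link with libraries; the loader maps prog into memory when it runs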
A compiler maps a source program into a semantically equivalent target program. There
are two parts to this mapping: analysis and synthesis.
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate representation
of the source program. If the analysis part detects that the source program is either syntactically
ill formed or semantically unsound, then it must provide informative messages, so the user can
take corrective action. The analysis part also collects information about the source program and
stores it in a data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table. The analysis part is often called the front
end of the compiler; the synthesis part is the back end.
The symbol table, which stores information about the entire source program, is used by all
phases of the compiler. Some compilers have a machine-independent optimization phase
between the front end and the back end. The purpose of this optimization phase is to perform
transformations on the intermediate representation, so that the back end can produce a better
target program than it would have otherwise produced from an unoptimized intermediate
representation.
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer
reads the stream of characters making up the source program and groups the characters into
meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a
token of the form <token-name, attribute-value>.
In the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for this
token. Information from the symbol-table entry is needed for semantic analysis and code
generation.
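A minimal sketch in C of how such a token might be represented; the type and field names here are illustrative, not part of the notes:
enum TokenName { ID, ASSIGN, PLUS, TIMES, NUMBER };

struct Token {
    enum TokenName name;   /* abstract symbol used during syntax analysis */
    int attribute;         /* e.g., index of this lexeme's symbol-table entry */
};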
Consider, for example, the assignment statement position = initial + rate * 60. The characters in
this assignment could be grouped into the following lexemes and mapped into the following tokens
passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract
symbol standing for identifier and 1 points to the symbol-table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its name and
type.
2. The assignment symbol = is a lexeme that is mapped into the token <=>.
3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table
entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table
entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
Blanks separating the lexemes would be discarded by the lexical analyzer. After lexical analysis,
the assignment statement is therefore represented by the token sequence
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Syntax Analysis
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the
children of the node represent the arguments of the operation.
The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right
child. The node <id, 3> represents the identifier rate. The node labeled * makes it explicit that
we must first multiply the value of rate by 60. The node labeled + indicates that we must add
the result of this multiplication to the value of initial. The root of the tree, labeled =,
indicates that we must store the result of this addition into the location for the identifier
position.
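A minimal sketch of a syntax-tree node in C, assuming binary operators only; the names are illustrative:
struct Node {
    char op;                    /* '=', '+', '*' for interior nodes, 0 for leaves */
    int symtab_index;           /* used when the leaf is an identifier such as <id, 3> */
    int value;                  /* used when the leaf is a constant such as 60 */
    struct Node *left, *right;  /* children; NULL for leaves */
};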
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition. It also gathers
type information and saves it in either the syntax tree or the symbol table, for subsequent use
during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that
each operator has matching operands. For example, many programming language definitions
require an array index to be an integer; the compiler must report an error if a floating-point
number is used to index an array.
The language specification may permit some type conversions called coercions. For
example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of
floating-point numbers. If the operator is applied to a floating-point number and an integer, the
compiler may convert or coerce the integer into a floating-point number.
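In C, for example, the usual arithmetic conversions perform exactly this coercion (variable names are illustrative):
#include <stdio.h>
int main(void) {
    int minutes = 60;
    float rate = 2.5f;
    float position = rate * minutes;  /* the int value 60 is coerced to 60.0f before the multiplication */
    printf("%f\n", position);
    return 0;
}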
In the process of translating a source program into target code, a compiler may construct one or
more intermediate representations, which can have a variety of forms. Syntax trees are a form of
intermediate representation; they are commonly used during syntax and semantic analysis. After
syntax and semantic analysis of the source program, many compilers generate an explicit
low-level or machine-like intermediate representation, which we can think of as a program for an
abstract machine. A common such representation is three-address code, which consists of a
sequence of assembly-like instructions with three operands per instruction, each operand acting
like a register. Three-address code has several notable properties (an example follows the list below):
● Each three-address assignment instruction has at most one operator on the right
side.
● The compiler must generate a temporary name to hold the value computed by a
three-address instruction.
● Some "three-address instructions" have fewer than three operands.
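As an illustration (a sketch following the usual textbook treatment, assuming position, initial, and rate hold floating-point values), the assignment position = initial + rate * 60 might become:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Here t1, t2, and t3 are compiler-generated temporary names, and each instruction has at most one operator on its right side.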
Code Optimization
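The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result, for example code that is faster or shorter. Continuing the sketch above, an optimizer might notice that 60 can be converted to 60.0 once and for all at compile time and that t3 is used only once, shortening the sequence to:
t1 = id3 * 60.0
id1 = id2 + t1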
Code Generation
The code generator takes as input an intermediate representation of the source program
and maps it into the target language. If the target language is machine code, registers or memory
locations are selected for each of the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine instructions that perform the same task.
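A sketch of what the target code for the running example might look like, assuming a simple hypothetical register machine (R1 and R2 are registers, the F suffix marks floating-point instructions, # marks an immediate constant, and text after ; is a comment):
LDF  R2, id3          ; load rate into register R2
MULF R2, R2, #60.0    ; multiply by the constant 60.0
LDF  R1, id2          ; load initial into register R1
ADDF R1, R1, R2       ; add the product
STF  id1, R1          ; store the result into position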
Symbol-Table Management
An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name. These attributes may
provide information about the storage allocated for a name, its type, its scope (where in the
program its value may be used), and in the case of procedure names, such things as the number
and types of its arguments, the method of passing each argument (for example, by value or by
reference), and the type returned.
The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name. The data structure should be designed to allow the compiler
to find the record for each name quickly and to store or retrieve data from that record quickly.
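A minimal sketch in C of such a structure, assuming a chained hash table keyed by the identifier's name; all names and sizes are illustrative:
#include <stdlib.h>
#include <string.h>

struct Symbol {
    char *name;           /* the identifier's lexeme */
    int type;             /* a code for its type */
    int scope;            /* nesting depth where it is declared */
    struct Symbol *next;  /* next record in the same hash bucket */
};

#define BUCKETS 211
static struct Symbol *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = 31 * h + (unsigned char)*s++;
    return h % BUCKETS;
}

struct Symbol *lookup(const char *name) {            /* find a record quickly */
    for (struct Symbol *p = table[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

struct Symbol *insert(const char *name, int type, int scope) {   /* add a record */
    unsigned h = hash(name);
    struct Symbol *p = malloc(sizeof *p);
    p->name = malloc(strlen(name) + 1);
    strcpy(p->name, name);
    p->type = type;
    p->scope = scope;
    p->next = table[h];
    table[h] = p;
    return p;
}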
Activities from several phases may be grouped together into a pass that reads an input file and
writes an output file. A compiler pass refers to one complete traversal of the program (or of its
current intermediate representation) by the compiler.
Keeping the front end and the back end separate, linked by an intermediate representation, also makes it easier:
● To design a compiler for a different programming language for the same machine (a new front end can reuse an existing back end)
● To design a compiler for the same programming language for different machines/systems (one front end can be combined with several back ends)
COMPILER-CONSTRUCTION TOOLS
The compiler writer, like any software developer, can profitably use modern software
development environments containing tools such as language editors, debuggers, version
managers, profilers, test harnesses, and so on. In addition to these general software-development
tools, other more specialized tools have been created to help implement various phases of a
compiler.
These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description
of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.
The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce as output a sequence of tokens for each lexeme
in the source program. The stream of tokens is sent to the parser for syntax analysis. It is
common for the lexical analyzer to interact with the symbol table as well. When the lexical
analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the
symbol table. In some cases, information regarding the kind of identifier may be read from the
symbol table by the lexical analyzer to assist it in determining the proper token it must pass to
the parser.
Commonly, the interaction is implemented by having the parser call the lexical analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read characters
from its input until it can identify the next lexeme and produce for it the next token, which it
returns to the parser.
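A minimal sketch in C of this interaction; getNextToken is the call named above, while the Token type and the EOF_TOKEN value are illustrative assumptions:
struct Token { int name; int attribute; };
extern struct Token getNextToken(void);   /* supplied by the lexical analyzer */
#define EOF_TOKEN 0

void parse(void) {
    struct Token look = getNextToken();   /* prime the lookahead */
    while (look.name != EOF_TOKEN) {
        /* ... make parsing decisions based on look.name ... */
        look = getNextToken();            /* ask the lexer for the next token */
    }
}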
Since the lexical analyzer is the part of the compiler that reads the source text, it may
perform certain other tasks besides identification of lexemes. One such task is stripping out
comments and whitespace (blank, newline, tab, and perhaps other characters that are used to
separate tokens in the input). Another task is correlating error messages generated by the
compiler with the source program. For instance, the lexical analyzer may keep track of the
number of newline characters seen, so it can associate a line number with each error message. In
some compilers, the lexical analyzer makes a copy of the source program with the error
messages inserted at the appropriate positions. If the source program uses a macro-preprocessor,
the expansion of macros may also be performed by the lexical analyzer.
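A minimal sketch in C of the line-counting idea; the function name is illustrative, and it assumes the lexer obtains all of its characters through this one routine:
#include <stdio.h>

int lineno = 1;               /* current source line, attached to error messages */

int next_char(void) {
    int c = getchar();
    if (c == '\n') lineno++;  /* keep track of newlines seen so far */
    return c;
}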
Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input,
such as deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.
There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design: separating lexical analysis from syntax analysis allows each phase to be
simpler than it would otherwise be.
2. Compiler efficiency is improved: a separate lexical analyzer allows specialized techniques, such
as buffering the input, that speed up the reading of characters.
3. Compiler portability is enhanced: input-device-specific peculiarities can be restricted to the
lexical analyzer.
Tokens, Patterns, and Lexemes
● A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword,
or a sequence of input characters denoting an identifier. The token names are the input
symbols that the parser processes.
● A pattern is a description of the form that the lexemes of a token may take. In the case of
a keyword as a token, the pattern is just the sequence of characters that form the keyword.
For identifiers and some other tokens, the pattern is a more complex structure that is
matched by many strings.
● A lexeme is a sequence of characters in the source program that matches the pattern for a
token and is identified by the lexical analyzer as an instance of that token.
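A few typical examples, following common C-like conventions, make the distinction concrete:
Token      Informal pattern                               Sample lexemes
if         the characters i, f                            if
relation   < or > or <= or >= or == or !=                 <=, !=
id         letter followed by letters and digits          pi, score, D2
number     any numeric constant                           3.14159, 0, 6.02e23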
When more than one lexeme can match a pattern, the lexical analyzer must provide the
subsequent compiler phases additional information about the particular lexeme that matched. For
example, the pattern for token number matches both 0 and 1, but it is extremely important for the
code generator to know which lexeme was found in the source program. Thus, in many cases the
lexical analyzer returns to the parser not only a token name, but an attribute value that describes
the lexeme represented by the token; the token name influences parsing decisions, while the
attribute value influences translation of tokens after the parse.
We shall assume that tokens have at most one associated attribute, although this attribute
may have a structure that combines several pieces of information. The most important example is
the token id, where we need to associate with the token a great deal of information. Normally,
information about an identifier — e.g., its lexeme, its type, and the location at which it is first
found (in case an error message about that identifier must be issued) — is kept in the symbol
table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry
for that identifier.
LEXICAL ERRORS
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a
source-code error. For instance, if the string fi is encountered for the first time in a C program in
the context:
fi ( a == f(x) ) ...
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the
token id to the parser and let some other phase of the compiler (probably the parser in this case)
handle an error due to transposition of the letters.
However, suppose a situation arises in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches any prefix of the remaining input. The simplest
recovery strategy is "panic mode" recovery. We delete successive characters from the remaining
input, until the lexical analyzer can find a well-formed token at the beginning of what input is
left. This recovery technique may confuse the parser, but in an interactive computing
environment it may be quite adequate.
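A minimal sketch in C of panic-mode deletion; both the buffer handling and the matches_a_token helper are hypothetical names, not part of the notes:
extern int matches_a_token(const char *input, int pos);  /* does some token pattern match a prefix here? */

int panic_mode(const char *input, int pos) {
    while (input[pos] != '\0' && !matches_a_token(input, pos))
        pos++;             /* delete successive characters from the remaining input */
    return pos;            /* scanning resumes at the first position where a token can start */
}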
Other possible error-recovery actions are to delete one character from the remaining input, to
insert a missing character into the remaining input, to replace a character by another character, or
to transpose two adjacent characters. Transformations like these may be tried in an attempt to repair the input. The simplest
such strategy is to see whether a prefix of the remaining input can be transformed into a valid
lexeme by a single transformation. This strategy makes sense, since in practice most lexical
errors involve a single character. A more general correction strategy is to find the smallest
number of transformations needed to convert the source program into one that consists only of
valid lexemes, but this approach is considered too expensive in practice to be worth the effort.
SPECIFICATION OF TOKENS
An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits,
and punctuation. The set {0,1} is the binary alphabet. ASCII is an important example of an
alphabet; it is used in many software systems. Unicode, which includes approximately 100,000
characters from alphabets around the world, is another important example of an alphabet. A
string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language
theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a
string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is
a string of length six. The empty string, denoted ε, is the string of length zero. A language is any
countable set of strings over some fixed alphabet. Abstract languages like ∅, the empty set, or
{ε}, the set containing only the empty string, are languages under this definition.
Operations on Languages:
● Union: L ∪ M = { s | s is in L or s is in M }
● Concatenation: LM = { st | s is in L and t is in M }
● (Kleene) closure: L*, the set of all strings obtained by concatenating zero or more strings from L; it always contains the empty string ε
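For example, if L is the set of letters {A, ..., Z, a, ..., z} and D is the set of digits {0, ..., 9}, then:
● L ∪ D is the set of letters and digits (62 strings of length one)
● LD is the set of 520 strings of length two, each a letter followed by a digit
● L* is the set of all strings of letters, including the empty string ε
● L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter, i.e., the usual form of identifiers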
Regular Expressions:
Regular expressions are a notation for describing all the languages that can be built by applying
the language operations above to the symbols of some alphabet.
Identifiers:
Identifiers are a typical example: an identifier is a letter followed by zero or more letters or
digits, a set of strings that regular expressions describe conveniently.
The regular expressions are built recursively out of smaller regular expressions, using the rules
described below. Each regular expression r denotes a language L(r), which is also defined
recursively from the languages denoted by r's subexpressions.
A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s. For
instance, (a|b) = (b|a).
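A few examples over the alphabet {a, b}:
● a|b denotes the language {a, b}
● (a|b)(a|b) denotes {aa, ab, ba, bb}
● a* denotes the set of all strings of zero or more a's: {ε, a, aa, aaa, ...}
● (a|b)* denotes the set of all strings over {a, b}, including ε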
Regular Definitions:
For notational convenience, we may give names to certain regular expressions and use those
names in subsequent expressions. A regular definition is a sequence of definitions of the form
d1 → r1, d2 → r2, ..., dn → rn, where each di is a new symbol not in the alphabet and each ri is
a regular expression over the alphabet augmented with the previously defined names d1, ..., di-1.
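As a standard illustration (not taken from the notes above), unsigned numbers such as 5280, 0.01234, and 6.336E4 can be described by the regular definition:
digit  → 0 | 1 | ... | 9
digits → digit digit*
number → digits (. digits)? (E (+|-)? digits)?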
Lex is a tool that allows one to specify a lexical analyzer by specifying regular
expressions to describe patterns for tokens. A more recent implementation is Flex. The input notation
for the Lex tool is referred to as the Lex language and the tool itself is the Lex compiler. The Lex
compiler transforms the input patterns into a transition diagram and generates code, in a file
called lex.yy.c that simulates this transition diagram.
Use of Lex:
An input file, which we call lex.l, is written in the Lex language and describes the lexical
analyzer to be generated. The Lex compiler transforms lex.l into a C program, in a file that is
always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as
always. The C-compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.
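A typical build on a Unix-like system looks like the following; the library flag is -ll for classic lex and -lfl for flex, and it can be omitted when, as in the examples below, main() and yywrap() are written in the Lex file itself:
lex lex.l             # or: flex lex.l; either way the output file is lex.yy.c
cc lex.yy.c -ll       # compile the generated scanner (the library supplies default routines)
./a.out < input.txt   # run the lexical analyzer on some input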
Each pattern is a regular expression, which may use the regular definitions of the
declaration section. The actions are fragments of code, typically written in C, although many
variants of Lex using other languages have been created. The third section holds whatever
additional functions are used in the actions. Alternatively, these functions can be compiled
separately and loaded with the lexical analyzer.
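Schematically, the layout just described is:
declarations            (regular definitions, C code inside %{ ... %})
%%
translation rules       (each of the form:  pattern  { action })
%%
auxiliary functions     (C code used by the actions)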
The lexical analyzer created by Lex behaves in concert with the parser as follows. When called
by the parser, the lexical analyzer begins reading its remaining input, one character at a time,
until it finds the longest prefix of the input that matches one of the patterns Pi. It then executes
the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi
describes whitespace or comments), then the lexical analyzer proceeds to find additional
lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer
returns a single value, the token name, to the parser, but uses the shared integer variable yylval
to pass additional information about the lexeme found, if needed.
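For instance, a rule for integer constants might pass the constant's value to the parser through yylval; here NUMBER is a token code assumed to be defined by the parser generator (e.g., in y.tab.h), and atoi requires <stdlib.h> in the declarations section:
[0-9]+    { yylval = atoi(yytext); return NUMBER; }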
Example: a Lex program that classifies integers as positive or negative and keeps running counts of each.
%{
#include<stdio.h>
int pos = 0, neg = 0;   /* running counts of positive and negative numbers */
%}
%%
[0-9]+ {pos++;printf("%s is a positive number",yytext);printf(" %d %d",pos,neg);}
^[-][0-9]+ {neg++;printf("%s is a negative number",yytext);printf(" %d %d",pos,neg);}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
//Recognize Keyword
%{
#include<stdio.h>
%}
%%
auto|double|int|struct|break|else|long|switch|case|enum|register|typedef|char|extern|return|union|continue|for|signed|void|do|if|static|while|default|goto|sizeof|volatile|const|float|short {printf("This is a Keyword");}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
//Recognize Relational Operators
%{
#include<stdio.h>
%}
%%
"<"|">"|"<="|">="|"<>"|"=" {printf("This is an operator");}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
//Recognize Identifiers
%{
#include<stdio.h>
%}
%%
^[a-zA-Z_][a-zA-Z0-9_]* {printf("Valid Identifier");}
^[^a-zA-Z_\n] {printf("Invalid Identifier");}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}