
UNIT I INTRODUCTION TO COMPILERS & LEXICAL ANALYSIS

Introduction – Language Processors – The Structure of a Compiler – Compiler-Construction
Tools – Lexical Analysis – The Role of the Lexical Analyzer – Tokens, Patterns, and Lexemes –
Specification of Tokens – The Lexical-Analyzer Generator Lex – Finite Automata –
Construction of an NFA from a Regular Expression – Conversion of an NFA to a DFA –
Minimizing the Number of States of a DFA.
INTRODUCTION:
Programming languages are notations for describing computations to people and to
machines. The world as we know it depends on programming languages, because all the software
running on all the computers was written in some programming language. But, before a program
can be run, it first must be translated into a form in which it can be executed by a computer. The
software systems that do this translation are called compilers.
LANGUAGE PROCESSORS:

A compiler is a program that can read a program in one language — the source language
— and translate it into an equivalent program in another language — the target language.

An important role of the compiler is to report any errors in the source program that it detects
during the translation process. If the target program is an executable machine-language program,
it can then be called by the user to process inputs and produce outputs.

An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in the
source program on inputs supplied by the user.

The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.

​ Java language processors combine compilation and interpretation. A Java source program
may first be compiled into an intermediate form called bytecodes. The bytecodes are then
interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one
machine can be interpreted on another machine, perhaps across a network. In order to achieve
faster processing of inputs to outputs, some Java compilers, called just-in-time compilers,
translate the bytecodes into machine language immediately before they run the intermediate
program to process the input.

A SIMPLE LANGUAGE PROCESSING SYSTEM:

A source program may be divided into modules stored in separate files. The task of
collecting the source program is sometimes entrusted to a separate program, called a
preprocessor. The preprocessor may also expand shorthands, called macros, into source
language statements.
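
For instance, the tiny C fragment below (purely illustrative) shows a macro being expanded by the preprocessor before the compiler proper ever sees the code:

/* Illustrative only: the preprocessor rewrites SQUARE(5) as ((5) * (5)). */
#include <stdio.h>
#define SQUARE(x) ((x) * (x))   /* a shorthand (macro) */

int main(void)
{
    printf("%d\n", SQUARE(5));  /* after preprocessing: printf("%d\n", ((5) * (5))); */
    return 0;
}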

The modified source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier to produce as
output and is easier to debug. The assembly language is then processed by a program called an
assembler that produces relocatable machine code as its output.

Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine. The linker resolves external memory addresses, where the code in one file may
refer to a location in another file. The loader then puts together all of the executable object files
into memory for execution.

THE PHASES OF A COMPILER:



​ A compiler maps a source program into a semantically equivalent target program. There
are two parts to this mapping: analysis and synthesis.

The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate representation
of the source program. If the analysis part detects that the source program is either syntactically
ill formed or semantically unsound, then it must provide informative messages, so the user can
take corrective action. The analysis part also collects information about the source program and
stores it in a data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.

The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table. The analysis part is often called the front
end of the compiler; the synthesis part is the back end.

The symbol table, which stores information about the entire source program, is used by all
phases of the compiler. Some compilers have a machine-independent optimization phase
between the front end and the back end. The purpose of this optimization phase is to perform
transformations on the intermediate representation, so that the back end can produce a better
target program than it would have otherwise produced from an unoptimized intermediate
representation.

Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer
reads the stream of characters making up the source program and groups the characters into
meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a
token of the form

<token-name, attribute-value>
In the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for this
token. Information from the symbol-table entry is needed for semantic analysis and code
generation.

For example, suppose a source program contains the assignment statement

position = initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract
symbol standing for identifier and 1 points to the symbol-table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its name and
type.

2. The assignment symbol = is a lexeme that is mapped into the token <=>.

3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table
entry for initial.

4. + is a lexeme that is mapped into the token <+>.

5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table
entry for rate.

6. * is a lexeme that is mapped into the token <*>.



7. 60 is a lexeme that is mapped into the token <60>.

Blanks separating the lexemes would be discarded by the lexical analyzer. After lexical analysis,
the assignment statement is represented by the token sequence

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
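
A token can be pictured as a small record pairing a token name with an attribute value. The minimal C sketch below (the names are illustrative, not a prescribed layout) stores the token stream shown above:

/* A minimal sketch of tokens as <token-name, attribute-value> pairs. */
#include <stdio.h>

enum TokenName { ID, ASSIGN, PLUS, TIMES, NUMBER };

struct Token {
    enum TokenName name;   /* abstract symbol used by the parser */
    int attribute;         /* symbol-table index for ID, value for NUMBER */
};

int main(void)
{
    /* token stream for: position = initial + rate * 60 */
    struct Token stream[] = {
        {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0}, {ID, 3}, {TIMES, 0}, {NUMBER, 60}
    };
    printf("%d tokens\n", (int)(sizeof stream / sizeof stream[0]));
    return 0;
}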

Syntax Analysis

The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the
children of the node represent the arguments of the operation.

After Syntax Analysis,

The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right
child. The node <id, 3> represents the identifier rate. The node labeled * makes it explicit that
we must first multiply the value of rate by 60. The node labeled + indicates that we must add
the result of this multiplication to the value of initial. The root of the tree, labeled =, indicates
that we must store the result of this addition into the location for the identifier position.
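
The shape of this tree can be captured with a small node structure. The C sketch below (the node layout and helper function are illustrative, not a prescribed representation) builds the tree just described:

/* A minimal syntax-tree node and the tree for: position = initial + rate * 60 */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    const char *label;      /* operator or token name, e.g. "=", "+", "id" */
    int attr;               /* symbol-table index for "id", value for "num" */
    struct Node *left, *right;
} Node;

static Node *mk(const char *label, int attr, Node *left, Node *right)
{
    Node *n = malloc(sizeof *n);
    n->label = label; n->attr = attr; n->left = left; n->right = right;
    return n;
}

int main(void)
{
    Node *tree =
        mk("=", 0,
           mk("id", 1, NULL, NULL),                      /* position */
           mk("+", 0,
              mk("id", 2, NULL, NULL),                   /* initial */
              mk("*", 0,
                 mk("id", 3, NULL, NULL),                /* rate */
                 mk("num", 60, NULL, NULL))));           /* 60 */
    printf("root: %s, left child: %s\n", tree->label, tree->left->label);
    return 0;
}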

Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition. It also gathers
type information and saves it in either the syntax tree or the symbol table, for subsequent use
during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that
each operator has matching operands. For example, many programming language definitions
require an array index to be an integer; the compiler must report an error if a floating-point
number is used to index an array.

The language specification may permit some type conversions called coercions. For
example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of
floating-point numbers. If the operator is applied to a floating-point number and an integer, the
compiler may convert or coerce the integer into a floating-point number.
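
As a concrete illustration, the short C program below depends on exactly this kind of coercion: the integer literal 60 is implicitly converted to a floating-point value before the multiplication (the variable names simply reuse the running example):

/* Illustrative only: 60 is coerced from int to float before the multiply. */
#include <stdio.h>

int main(void)
{
    float initial = 10.0f, rate = 2.5f;
    float position = initial + rate * 60;   /* 60 is treated as 60.0 */
    printf("%f\n", position);
    return 0;
}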

After semantic analysis, the syntax tree is the same except that the integer 60 is coerced: an
inttofloat node is inserted around it, so that the multiplication is performed on two
floating-point values.

Intermediate Code Generation

In the process of translating a source program into target code, a compiler may construct one or
more intermediate representations, which can have a variety of forms. Syntax trees are a form of
intermediate representation; they are commonly used during syntax and semantic analysis. After
syntax and semantic analysis of the source program, many compilers generate an explicit
low-level or machine-like intermediate representation, which we can think of as a program for an
abstract machine.

This intermediate representation should have two important properties:

●​ it should be easy to produce


●​ it should be easy to translate into the target machine.

One such intermediate form is three-address code, which consists of a sequence of
assembly-like instructions with three operands per instruction. Each operand can act like a
register. The output of the intermediate code generator consists of the three-address code
sequence

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Properties of Three Address code:

●​ Each three-address assignment instruction has at most one operator on the right
side.
●​ The compiler must generate a temporary name to hold the value computed by a
three-address instruction.
●​ Some "three-address instructions" have fewer than three operands.

Code Optimization

The machine-independent code-optimization phase attempts to improve the intermediate
code so that better target code will result. Usually better means faster, but other objectives may
be desired, such as shorter code, or target code that consumes less power.

A simple intermediate code generation algorithm followed by code optimization is a reasonable
way to generate good target code. The optimizer can deduce that the conversion of 60 from
integer to floating point can be done once and for all at compile time, so the inttofloat operation
can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is
used only once to transmit its value to id1, so the optimizer can transform the code into the
shorter sequence

t1 = id3 * 60.0
id1 = id2 + t1

Code Generation

The code generator takes as input an intermediate representation of the source program
and maps it into the target language. If the target language is machine code, registers or memory
locations are selected for each of the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine instructions that perform the same task.

Symbol-Table Management

An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name. These attributes may
provide information about the storage allocated for a name, its type, its scope (where in the
program its value may be used), and in the case of procedure names, such things as the number
and types of its arguments, the method of passing each argument (for example, by value or by
reference), and the type returned.

The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name. The data structure should be designed to allow the compiler
to find the record for each name quickly and to store or retrieve data from that record quickly.
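
A minimal sketch of such a record in C might look as follows (the field sizes and the linear search are purely illustrative; real compilers usually use a hash table so that lookups are fast):

/* A toy symbol table: one record per name, found or inserted by lookup(). */
#include <stdio.h>
#include <string.h>

struct SymbolEntry {
    char name[32];    /* the identifier's lexeme */
    char type[16];    /* e.g. "int", "float" (filled in by later phases) */
    int  scope;       /* where in the program the name may be used */
};

static struct SymbolEntry table[100];
static int table_size = 0;

/* Return the index of name, inserting a new record if it is not yet present. */
static int lookup(const char *name)
{
    for (int i = 0; i < table_size; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[table_size].name, name, sizeof table[table_size].name - 1);
    return table_size++;
}

int main(void)
{
    int a = lookup("position");
    int b = lookup("rate");
    int c = lookup("position");      /* same record as the first call */
    printf("%d %d %d\n", a, b, c);   /* prints: 0 1 0 */
    return 0;
}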

THE GROUPING OF PHASES INTO PASSES

Activities from several phases may be grouped together into a pass that reads an input file and
writes an output file. A compiler pass is one complete traversal of the program (or of an
intermediate representation of it) by the compiler.

Single Pass Compiler:



Two Pass Compiler:

Multi Pass Compiler:

● To design a compiler for a different programming language for the same machine, the front end can be replaced and the back end reused
● To design a compiler for the same programming language for different machines/systems, the back end can be replaced and the front end reused

COMPILER-CONSTRUCTION TOOLS

The compiler writer, like any software developer, can profitably use modern software
development environments containing tools such as language editors, debuggers, version
managers, profilers, test harnesses, and so on. In addition to these general software-development
tools, other more specialized tools have been created to help implement various phases of a
compiler.

These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include

1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.

2. Scanner generators that produce lexical analyzers from a regular-expression description of
the tokens of a language.

3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.

4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.

5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.

THE ROLE OF THE LEXICAL ANALYZER

The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce as output a sequence of tokens for each lexeme
in the source program. The stream of tokens is sent to the parser for syntax analysis. It is
common for the lexical analyzer to interact with the symbol table as well. When the lexical
analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the
symbol table. In some cases, information regarding the kind of identifier may be read from the
symbol table by the lexical analyzer to assist it in determining the proper token it must pass to
the parser.

Commonly, the interaction is implemented by having the parser call the lexical analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read characters
from its input until it can identify the next lexeme and produce for it the next token, which it
returns to the parser.
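
The loop below is a self-contained C sketch of that interaction; the type and function names (Token, getNextToken, the token codes) are illustrative stand-ins rather than a real compiler's API:

/* Sketch: a "parser" repeatedly asks the lexical analyzer for the next token. */
#include <stdio.h>

typedef struct { int name; int attribute; } Token;
enum { TOK_EOF, TOK_ID, TOK_NUMBER };

/* Stand-in lexical analyzer: hands out a canned token stream, then TOK_EOF. */
static Token getNextToken(void)
{
    static Token stream[] = { {TOK_ID, 1}, {TOK_NUMBER, 60}, {TOK_EOF, 0} };
    static int next = 0;
    return stream[next++];
}

int main(void)   /* stand-in parser driving the lexical analyzer */
{
    for (Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken())
        printf("token %d (attribute %d)\n", t.name, t.attribute);
    return 0;
}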

Since the lexical analyzer is the part of the compiler that reads the source text, it may
perform certain other tasks besides identification of lexemes. One such task is stripping out
comments and whitespace (blank, newline, tab, and perhaps other characters that are used to
separate tokens in the input). Another task is correlating error messages generated by the
compiler with the source program. For instance, the lexical analyzer may keep track of the
number of newline characters seen, so it can associate a line number with each error message. In
some compilers, the lexical analyzer makes a copy of the source program with the error
messages inserted at the appropriate positions. If the source program uses a macro-preprocessor,
the expansion of macros may also be performed by the lexical analyzer.

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input,
such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.

Lexical Analysis Versus Parsing

There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design

2. Compiler efficiency is improved.

3. Compiler portability is enhanced.

Tokens, Patterns, and Lexemes:

●​ A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword,
or a sequence of input characters denoting an identifier. The token names are the input
symbols that the parser processes.
●​ A pattern is a description of the form that the lexemes of a token may take. In the case of
a keyword as a token, the pattern is just the sequence of characters that form the keyword.
For identifiers and some other tokens, the pattern is a more complex structure that is
matched by many strings.
●​ A lexeme is a sequence of characters in the source program that matches the pattern for a
token and is identified by the lexical analyzer as an instance of that token, as illustrated below.
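
For instance (this statement is not one of the numbered examples that follow), in the C statement x = a + b * 2; the lexical analyzer finds eight lexemes (x, =, a, +, b, *, 2, and ;), which map to the tokens <id>, <=>, <id>, <+>, <id>, <*>, <number>, and <;>, so the statement contains 8 tokens.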

Eg 1. Find the number of tokens in the following C statement.

Eg 2

Number of tokens = 12

Eg 3.

Number of tokens = 26

Eg 4.

Number of tokens = 58

ATTRIBUTES FOR TOKENS

When more than one lexeme can match a pattern, the lexical analyzer must provide the
subsequent compiler phases additional information about the particular lexeme that matched. For
example, the pattern for token number matches both 0 and 1, but it is extremely important for the
code generator to know which lexeme was found in the source program. Thus, in many cases the
lexical analyzer returns to the parser not only a token name, but an attribute value that describes
the lexeme represented by the token; the token name influences parsing decisions, while the
attribute value influences translation of tokens after the parse.

We shall assume that tokens have at most one associated attribute, although this attribute
may have a structure that combines several pieces of information. The most important example is
the token id, where we need to associate with the token a great deal of information. Normally,
information about an identifier — e.g., its lexeme, its type, and the location at which it is first
found (in case an error message about that identifier must be issued) — is kept in the symbol
table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry
for that identifier.

LEXICAL ERRORS

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a
source-code error. For instance, if the string fi is encountered for the first time in a C program in
the context:

fi ( a == f(x) ) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer
must return the token id to the parser and let some other phase of the compiler — probably the
parser in this case — handle an error due to transposition of the letters.

However, suppose a situation arises in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches any prefix of the remaining input. The simplest
recovery strategy is "panic mode" recovery. We delete successive characters from the remaining
input, until the lexical analyzer can find a well-formed token at the beginning of what input is
left. This recovery technique may confuse the parser, but in an interactive computing
environment it may be quite adequate.
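
The self-contained C sketch below illustrates the idea (the token set and the messages are invented for the example): the scanner recognizes only identifiers and numbers, and any character that cannot begin a lexeme is simply deleted so that scanning can resume:

/* Panic-mode sketch: skip characters until a lexeme can start again. */
#include <stdio.h>
#include <ctype.h>

static void scan(const char *p)
{
    while (*p) {
        if (isalpha((unsigned char)*p)) {            /* identifier lexeme */
            printf("<id> ");
            while (isalnum((unsigned char)*p)) p++;
        } else if (isdigit((unsigned char)*p)) {     /* number lexeme */
            printf("<number> ");
            while (isdigit((unsigned char)*p)) p++;
        } else if (isspace((unsigned char)*p)) {
            p++;                                     /* whitespace: discard silently */
        } else {
            fprintf(stderr, "panic: deleting '%c'\n", *p);
            p++;                                     /* delete one character and retry */
        }
    }
    printf("\n");
}

int main(void)
{
    scan("rate @# 60");   /* '@' and '#' are deleted; <id> and <number> survive */
    return 0;
}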

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.

Transformations like these may be tried in an attempt to repair the input. The simplest
such strategy is to see whether a prefix of the remaining input can be transformed into a valid
lexeme by a single transformation. This strategy makes sense, since in practice most lexical
errors involve a single character. A more general correction strategy is to find the smallest
number of transformations needed to convert the source program into one that consists only of
valid lexemes, but this approach is considered too expensive in practice to be worth the effort.

SPECIFICATION OF TOKENS

Regular expressions are an important notation for specifying lexeme patterns.

Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits,
and punctuation. The set {0,1} is the binary alphabet. ASCII is an important example of an
alphabet; it is used in many software systems. Unicode, which includes approximately 100,000
characters from alphabets around the world, is another important example of an alphabet. A
string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language
theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a
string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is
a string of length six. The empty string, denoted ε, is the string of length zero. A language is any
countable set of strings over some fixed alphabet. Abstract languages like ∅, the empty set, and
{ε}, the set containing only the empty string, are also languages under this definition.

Operations on Languages:

In lexical analysis, the most important operations on languages are the following (defined just after the list):

●​ union,
●​ concatenation
●​ closure
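
For two languages L and M, these operations can be defined as:

Union:          L ∪ M = { s | s is in L or s is in M }
Concatenation:  LM = { st | s is in L and t is in M }
Kleene closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ..., where L^0 = {ε} and L^i = L^(i-1)L
                (the positive closure L+ is the same union starting from L^1)

For example, if L = {a, b} and D = {0, 1}, then L ∪ D = {a, b, 0, 1}, LD = {a0, a1, b0, b1},
and L* is the set of all strings over a and b, including the empty string ε.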

Regular Expressions:

​ Regular expressions are a way for describing all the languages that can be built from the
operators of languages applied to the symbols of some alphabet.

Identifiers:

The regular expressions are built recursively out of smaller regular expressions, using the rules
described below. Each regular expression r denotes a language L(r), which is also defined
recursively from the languages denoted by r's subexpressions.

A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s. For
instance, (a|b) = (b|a).

Regular Definitions:
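
Names can be given to regular expressions and then reused as if they were symbols. A typical regular definition for identifiers (written here as an illustration) is:

letter → A | B | ... | Z | a | b | ... | z | _
digit  → 0 | 1 | ... | 9
id     → letter ( letter | digit )*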

Extensions of Regular Expressions:
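
Three extensions are widely used; each can be rewritten in terms of the basic operators:

● One or more instances: r+ denotes one or more repetitions of r, so r+ = rr* and r* = r+ | ε
● Zero or one instance: r? denotes an optional r, i.e., r? = r | ε
● Character classes: [abc] is shorthand for a | b | c, and [a-z] for a | b | ... | z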



THE LEXICAL-ANALYZER GENERATOR LEX:

Lex is a tool that allows one to specify a lexical analyzer by specifying regular
expressions to describe patterns for tokens. A more recent implementation is Flex. The input notation
for the Lex tool is referred to as the Lex language and the tool itself is the Lex compiler. The Lex
compiler transforms the input patterns into a transition diagram and generates code, in a file
called lex.yy.c that simulates this transition diagram.

Use of Lex:

An input file, which we call lex.l, is written in the Lex language and describes the lexical
analyzer to be generated. The Lex compiler transforms lex.l to a C program, in a file that is
always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as
always. The C-compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.

a.out is used as a subroutine of the parser. It is a C function that returns an integer, which is a
code for one of the possible token names. The attribute value, whether it be another numeric
code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is
shared between the lexical analyzer and parser, thereby making it simple to return both the name
and an attribute value of a token.

Structure of Lex Programs:

A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions
The declarations section includes declarations of variables, manifest constants (identifiers
declared to stand for a constant, e.g., the name of a token), and regular definitions. The
translation rules each have the form

Pattern   { Action }

Each pattern is a regular expression, which may use the regular definitions of the
declaration section. The actions are fragments of code, typically written in C, although many
variants of Lex using other languages have been created. The third section holds whatever
additional functions are used in the actions. Alternatively, these functions can be compiled
separately and loaded with the lexical analyzer.

The lexical analyzer created by Lex behaves in concert with the parser as follows. When called
by the parser, the lexical analyzer begins reading its remaining input, one character at a time,
until it finds the longest prefix of the input that matches one of the patterns Pi. It then executes
the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi
describes whitespace or comments), then the lexical analyzer proceeds to find additional
lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer
returns a single value, the token name, to the parser, but uses the shared, integer variable yylval
to pass additional information about the lexeme found, if needed.

Example:

/*lex program to count number of words*/


%{
#include<stdio.h>
int i=0;
%}
%%
[a-zA-Z0-9]+ {i++;}
"\n" {printf("%d\n", i); i = 0;}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
//count positive and negative numbers
%{
#include<stdio.h>
int pos=0,neg=0;
%}

%%
[0-9]+ {pos++;printf("%s is a positive number",yytext);printf("%d%d",pos,neg);}
"-"[0-9]+ {neg++;printf("%s is a negative number",yytext);printf("%d%d",pos,neg);}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
//Recognize Keyword
%{
#include<stdio.h>
%}
%%
auto|double|int|struct|break|else|long|switch|case|enum|register|typedef|char|extern|return|union|continue|for|signed|void|do|if|static|while|default|goto|sizeof|volatile|const|float|short {printf("This is Keyword");}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
//Recognize Relational Operators
%{
#include<stdio.h>
%}
%%
"<"|">"|"<="|">="|"<>"|"=" {printf("This is an operator");}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();

return 0;
}
//Recognize Identifiers
%{
#include<stdio.h>
%}
%%
^[a-zA-Z_][a-zA-Z0-9_]* {printf("Valid Identifier");}
^[^a-zA-Z_] {printf("Invalid Identifier");}
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
return 0;
}
