Language processing | PDF | Compiler | Computer Program
0% found this document useful (0 votes)
2 views

Language processing

The document outlines the fundamentals of language processing, which involves analyzing a source program and synthesizing a target program through a two-phase model. It details the analysis phase, which includes lexical, syntax, and semantic analysis, and the synthesis phase, which focuses on memory allocation and code generation. Additionally, it discusses the multi-pass organization of language processors, the use of intermediate representations, and the operations performed on symbol tables.

Uploaded by

ravindramhaske38
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Language processing

The document outlines the fundamentals of language processing, which involves analyzing a source program and synthesizing a target program through a two-phase model. It details the analysis phase, which includes lexical, syntax, and semantic analysis, and the synthesis phase, which focuses on memory allocation and code generation. Additionally, it discusses the multi-pass organization of language processors, the use of intermediate representations, and the operations performed on symbol tables.

Uploaded by

ravindramhaske38
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 59

Language processing

fundamentals
Language processing
fundamentals
• Language processing =
– Analysis of the source program +
Synthesis of the target program

– This definition motivates a generic two-phase


model of language processing activities – the
source program analyzed during the analysis
phase and target program is synthesized
during the synthesis phase.
1. Analysis Phase
• This model partially applies to interpreters
in that an interpreter does not have a
synthesis phase.

• Figure below shows a schematic of a


language processor that uses this model.
The specification consist of three components

1. Lexical Rules: govern formation of valid lexical


units in the source language, such as operators,
identifiers and constants.

2. Syntax rules: Govern formation of valid statements


in the source language.

3. Semantic rules: Associate meanings with valid


statements of the source language.
• Example: percent_profit:=(profit*100)/cost_price;

• Lexical analysis identifies :=,*and / as operators,100 as


constant and remaining string as identifiers.

• Syntax analysis identifies the statement as an


assignment statement with percent_profit as the varible
on the left size and other as the expression on the right
side.

• Semantic analysis determine meaning of statement to be


assignment of
profit x 100
cost_price
to percent_profit
2. Synthesis phase
Is concerned with construction of target language
statements which would implement the meaning of a
source statement.

it consists of two main tasks as follows:

a. Creation of Data structure in the target


program(memory allocation)

b. Generation of target code( code generation)


example
• A language processor generates the following target
program for the source statement.

• This program is in the assembly language of a computer,


which we call the target machine of the language
processor.

• The DW statements in the program reserve words in


memory for holding data.

• DC statement reserves a word in memory and stores the


constant 100 in it.

• The statement MOVER and MOVEM move a value from


a memory location to a CPU register and vice versa.
Multi pass organization of LP
• We have impression that language processing
can be performed on a statement by statement
basis

• That is , analysis of a statement in the source


program can be immediately followed by
synthesis of target statements.

• But statement –by- statement may not feasible


due to two reasons.
1) A source program may contain forward reference
– A forward reference of a program entity is reference to the
entity in some statement of the program that occurs
before the statement containing the definition or
declaration of the entity.

2) may require more memory than is available for its


operation
• Forward reference causes problem in statement-by-
statement processing because

– The language processor would not know the attributes of the


entity when it encounters the forward reference
– Consequently, it would not be able to generate the correct code
for it
– Percent_profit = (profit* 100) / cost_price;
– ….
– Long profit;
– Reference to profit in the assignment statement constitutes a
forward reference because the statement occurs before
statement long profit;
• If program is processed on a statement-by-
statement basis, the language processor would not
know the type of variable profit while processing the
assignment statement, hence it would not be able to
generate the correct code.

These problems are addressed by using a multi-


pass organization of language processors
Multi-pass
• The language processor splits the task of
language processing into sequence of stages

• Problem of forward reference is solved by


analysis of program in one stage and generating
target program in a later stage.

• So that now in our example language processor


would have deduced type of profit before
generating code.
Language processor pass
is the processing of every statement in a source
program, or in its equivalent representation, to
perform a language processing function

• Pass I : perform analysis of the source program and


note the relevant information

• Pass II : Perform synthesis of the target program


Ex: Information regarding type of profit noted in Pass-1.
Information used in Pass-2 to perform code generation.
Intermediate Representation
• In pass I – it would analyze the source program
to find the type of each variable.

• In pass II – it would again analyze the source


program to generate target code by using type
information noted in pass I.

• This duplication can be avoided by using an


intermediate representation of the source
program
Intermediate Representation (IR)
• An IR reflects the effect of some but not all,
analysis and synthesis tasks performed during
language processing.

Source Front End Back End Target


Program Program

Intermediate
Representation
(IR)
• Second pass now would read the intermediate
representation generated by previous pass, instead of
reading source program

• ‘but not all’ simply differentiate target program and


intermediate representation.

• To be effective, an intermediate representation should


have following properties:
1. Ease of use
2. Processing Efficiency
3. Memory Efficiency
• Ease of use : should be easy to construct and
analyze the intermediate representation

• Processing Efficiency : Efficient algorithm should be


available for accessing the data structures used in
intermediate representation

• Memory Efficiency: The intermediate


representation should be compact so that it does
not occupy much memory
Front end and back end
• The first pass is concerned exclusively with
source language issues, it is called front end of
language processor.

• The second pass is concerned with synthesis of


the target program, it is called back end of a
language processor.
• Front end and back end need not coexist in
memory. So it will reduce the memory
requirements
• The code for the front end is first loaded in
memory, LP analyzes the source program,
generates the intermediate
representation , and writes it to disk
• Now the code of front end is removed from
the memory and code of the back end is
loaded.
• LP reads the intermediate representation
and synthesizes the target program
:= :=

+ a, real +
a, real

b, real i, int b, real i*, real

(a) (b)
:=

a, real temp, real

(c)

Steps in semantic analysis of an assignment statement


Toy Compiler
• Front End
Performs lexical, syntax and semantic
analysis of SP.
Analysis involves
1. Determine validity of source stmt.
2. Determine the content of source stmt.
3. Construct a suitable representation.
Output of Front End
1. Tables of information
2. Intermediate code (IC)

IC is a sequence of IC units, represents meaning


of one action in SP

Example: i: integer;
a,b: real;
a := b+i;
Symbol Table:
No. Symbol Type Length Address
1 i int
2 a real
3 b real
4 i* real
5 temp real

Intermediate code
1. Convert (id, #1) to real, giving (id, #4)
2. Add (id, #4) to (id, #3) giving (id, #5)
3. Store (id, #5) in (Id, #2)
Lexical Analysis (Scanning)
• Identifies lexical units in a source statement.
• Classifies units into different classes e.g id’s,
constants, reserved id’s etc and enters them into
different tables
• Token contains
– Class code and number in class

Code #no e.g. Id #10


Example: Statement a := b + i;
a := b + i ;

Id #2 Op #5 Id #3 Op #3 Id #1 Op #10
Syntax Analysis (Parsing)
• Processes tokens for a statement to determine its
grammatical structure and build intermediate code that
represents structure.
• A tree is chosen for the intermediate code because it can
represent the hierarchical structure.
• Determines the statement class such as assignment
statement, if stmt etc.
e.g.:- a , b : real; and a = b +i ;

real :=

a b a +

b i
Semantic Analysis
• Determines the meaning of the SP

• Results in addition of info such as type, length


etc.

• Determines the meaning of the subtree in IC and


adds info to the IC tree.
Stmt a := b +i; proceeds as

1. Type is added to the IC tree

2. Rules of assignment indicate expression on


RHS should be evaluated first.

3. Rules of addition indicate i should be converted


before addition
i. Convert i to real giving i*;
ii. Add i* to b giving temp.
iii. Store temp in a.
Intermediate code after analysis phase
No. Symbol Type Length Address
1 i int
2 a real
3 b real
4 i* real
5 temp real

1. Convert (Id,#1) to real, giving (Id,#4)


2. Add (Id, #4) to (Id, #3), giving (Id, #5)
3. Store (Id, #5) in (Id, #2)
Source Program

Lexical
errors Scanning

tokens
Symbol table
Syntax Constant table
errors Parsing
Other table
Parse tree

Semantic Semantic analysis


errors

IC

IR
Figure: Front end of a Toy Compiler
Back End
1. Memory Allocation
Calculated from its type, length.
No. Symbol Type Length Address
1 i int 2000
2 a real 2001
3 b real 2002
• Back end computes the memory requirement of
a variable from its type, length found in the
symbol table, and allocates memory to it.

• The address of allocated memory area is


entered in the symbol table.

• Temporary results i* and temp are both


computed and used in the assignment statement
a = b + i;
Symbol table after memory
allocation
Code generation

• Two key decisions involved in generating


good quality target code.

• 1. What instructions should be used for


each of the actions in the intermediate
code?

• 2. What CPU registers should be used for


evaluating expressions?
1. Convert (Id,#1) to real, giving (Id,#4)
2. Add (Id, #4) to (Id, #3), giving (Id, #5)
3. Store (Id, #5) in (Id, #2)
(IC Code)

If back end decides to hold the values of i* and temp in the


CPU register AREG, it would generate assembly code

CONV_R AREG, I a=b+i


ADD_R AREG, B
MOVEM AREG, A
(Target Code)
IR

IC
Memory allocation

Symbol table

Code generation

Target
program

Figure: Back End of the toy compiler


Symbol Tables
• Identifier used in source program called symbol

• Names of variables, functions and procedures are


symbols

• A Language processor uses the symbol table to


maintain information about attributes of symbols
used in a source program
Four kinds of operations
are performed on symbol table by LP

1. Add a symbol and its attributes : Make a new entry in the


symbol table

2. Locate a symbol’s entry : Find a symbol’s entry in the


symbol table

3. Delete a symbol’s entry : Remove the symbol’s


information from the table

4. Access a symbol’s entry : Access the entry and set,


modify or copy its attribute information.
Add and Locate
• Add and Locate operations take a symbol as a
parameter e.g. ADD(s1)
• The Add operation indicates an error if an entry
already exists for the symbol. Otherwise , it makes
a new entry

• The Locate operation returns a pointer to the


symbol’s entry in the table if one exists,
• Otherwise, it returns a null pointer
• int *p; p = Locate(s1);
Delete and Access
• The delete and access operations may either
take a symbol or the pointer to a symbol’s entry
returned by the locate operation as parameters.
Delete (s1) or Delete (p), access(a)

• Thus, the add and locate operations necessarily


involve a search in the symbol table, whereas
the delete and access operation may involve a
search
Design goals of symbol table
• Table’s organization should facilitate efficient
search

• And table should be compact ; it should not


occupy much memory

• Not possible to meet both these goals


simultaneously because of the principle
called time-space-trade-off
Data Structures for Language
Processing
Data structure used in Language processor can be
classified on the basis of following criteria:

1. Nature of DS: Whether linear or nonlinear ds.

2. Purpose of DS: whether search or allocation ds.

3. Lifetime of DS: whether used during language


processing or during target program execution.
Linear Data Structure
• Entries in the symbol table occupy
adjoining areas of memory.
• This property used to facilitate search

Non Linear Data Structure


• Entries in the symbol table do not occupy
contiguous areas of memory.
• The entries are searched and accessed
using pointers
• Use of linear – single contiguous area of
memory that can accommodate all entries
• This requirement forces designer to allocate
memory for the largest number of entries the
symbol table may contain

• However, it is wasteful because some part of the


allocate memory may not be used.

• Nonlinear avoids the problem because it allows


memory to be allocated in entry- by – entry
manner
Diagram
• Figure (a) Memory allocation for four linear data
structures named A-D used in a program

• Parts of data structure shown shaded with


dotted lines are not in current use

• These parts may remain unused throughout


execution of program;

• However memory allocated to them, cannot be


used for other data structures.
• Figure (b) shows allocation to four non linear
data structures named E – H.

• The program created different parts of these


data structures at different times, so these parts
do not occupy contiguous area of memory
• Two memory area to ds E, three memory area to
ds F and so on.
• No parts of the allocated memory areas are
currently unused.
Symbol table entry formats
• Entry in ST comprised of fields that
accommodate the attributes of one symbol
• One of these field is symbol field
• The field which forms basis for search is
called key field
• Hence symbol field is the key field

• An entry may be declared as record


• A programming language may specify
large number of attributes of symbols.
• These attributes applicable to specific
class is smaller
• Table below shows four class
– Variable – 3 (no.of attributes)
– Procedure - 2
– Function - 4
– Label - 1
– Total number of attributes is 8; see the diagra
Entry formats for
accommodating attribute
• Fixed length entries
– Each entry in ST has fields for all attributes in
the Programming language
• Variable-length entries
– Field only for the attributes specified for
symbols of its class
• Hybrid entries
– It has a fixed-length part and a variable-length
part
• Here in (a) each entry has 8 attribute fields
• When class=label, all attribute fields
except the field for storing its statement
number are redundant.(useless)

• (b) shows variable length entry format for


a symbol that is used as a label.
• It has only one attribute field
Fixed length formats
• When fixed entry formats are used all
entries in ST has identical format
• So it will use linear data structures like
array
• It will enables the use of an efficient
search procedure
• This organization makes inefficient use of
memory since many entries may contain
redundant fields
Variable length formats
• Use of variable length format leads to
compact organization in which memory
wastage does not occur
• Search method would have to know the
length of an entry
• So each entry would have to include a
length field
• Format for such an entry
Hybrid format
• Combines search efficiency of fixed length
and memory efficiency of variable length
• Each entry is split into two halves , fixed
part and variable part
• Pointer field is added to fixed part
• It points to variable part of an entry
• Fixed part of all entries are organized into
linear data structures
• As it has pointer to a variable part,
• Variable part does not need to be located
through search
• Hence it could be put into a linear or non
linear data structure
End of the Chapter..!!
Thank you.!!

You might also like