PUNE VIDYARTHI GRIHA’s
COLLEGE OF ENGINEERING, NASHIK.
• “INTRODUCTION OF COMPILER AND
L1EXICAL ANALYSIS ”
PREPARED BY :
PROF. ANAND N. GHARU
ASSISTANT PROFESSOR
COMPUTER DEPARTMENT
3/17/2019 SUBJECT – COMPILER
PROF. ANAND(BE
GHARU COMPUTER SPPU-2019)
CONTENTS
• COMPILER
• INTERPRETER
• ANALYSIS SYNTHESIS MODEL
• LANGUAGE PROCESSINGSYSTEM
• COMPILER PROCESSINBRIEF
3/17/2019 PROF. ANAND GHARU
COMPILERS
• “Compilation”
– Translation of a program written in a source
language into a semantically equivalent program
written in a target language
Input
Source Target
Compiler
Program Program
Error messages Output 3
3/17/2019 PROF. ANAND GHARU
INTERPRTERS
• “Interpretation”
– Performing the operations implied by the source
program
Source
Program
Interpreter Output
Input
Error messages
3/17/2019 PROF. ANAND GHARU
ANALYSIS – SYNTHESIS MODEL
• There are two parts to compilation:
Analysis determines the operations
implied by the source program which
are recorded in a tree structure
Synthesis takes the tree structure
and translates the operations therein
into the target program
5
3/17/2019 PROF. ANAND GHARU
ANALYSIS
Breaks up the source If the analysis part The analysis part also
program into detects that the source collects information
constituent pieces and program is either about the source
imposes a syntactically ill formed program and stores it
grammatical structure or semantically in a data structure
on them. It then uses unsound, then it must called a symbol table,
this structure to create provide informative which is passed along
an intermediate messages, so the user with the intermediate
representation of the can take corrective representation to the
source program. action. synthesis part.
3/17/2019 PROF. ANAND GHARU
SYNTHESIS
• The synthesis part constructs the desired target program from
the intermediate representation and the information in the
symbol table
ANALYSIS • Front end of compiler
• Back end of compiler
SYNTHESIS
3/17/2019 PROF. ANAND GHARU
COMPILERS ROLE
• An essential function of a compiler –
Record the variable names used in the source
program and collect information about various
attributes of each name.
• These attributes may provide information about the storage
allocated for a name , its type and its scope , procedure
names ,number and types of its arguments, the method of
passing each argument and the type returned
3/17/2019 PROF. ANAND GHARU
ISSUES IN COMPILATION
Hierarchy of operations need to be maintained to determine
correct order of expression evaluation
Maintain data type integrity with automatic type conversions
Handle user defined data types.
Develop appropriate storage mappings
3/17/2019 PROF. ANAND GHARU
ISSUES IN COMPILATION
Resolve occurrence of each variable name in a program i.e
construct separate symbol tables for different namespaces.
Handle different control structures.
Perform optimization
10
3/17/2019 PROF. ANAND GHARU
ISSUES IN COMPILATION
Single Pass Multi Pass
Speed better worse
(potentially) better
Memory better for
for small programs
large programs
Modularity worse better
Flexibility worse better
“Global” optimization impossible possible
Source Language single pass compilers are not possible
for many programming languages
11
3/17/2019 PROF. ANAND GHARU
COMPILER PASSES
A pass is a complete traversal of the source program, or a
complete traversal of some internal representation of the
source program.
A pass can correspond to a “phase” but it does not have to!
Sometimes a single “pass” corresponds to several phases that
are interleaved in time.
What and how many passes a compiler does over the source
program is an important designdecision.
12
3/17/2019 PROF. ANAND GHARU
SINGLE PASS COMPILER
A single pass compiler makes a single pass over the source text,
parsing, analyzing and generating code all at once.
Dependency diagram of a typical Single Pass Compiler:
Compiler Driver
calls
Syntactic Analyzer
calls calls
Contextual Analyzer Code Generator
13
3/17/2019 PROF. ANAND GHARU
MULTI PASS COMPILER
A multi pass compiler makes several passes over the program. The
output of a preceding phase is stored in a data structure and used by
subsequent phases.
Dependency diagram of a typical Multi Pass Compiler:
Compiler Driver
calls calls
calls
Syntactic Analyzer Contextual Analyzer Code Generator
input output input output input output
Source Text AST Decorated AST Object Code
14
3/17/2019 PROF. ANAND GHARU
SYMBOL TABLE MEANS
Symbol tables are data structures that are used by
compilers to hold information about source-program
constructs.
A symbol table is a necessary component because
Declaration of identifiers appears once in a program
Use of identifiers may appear in many places of the
program text
15
3/17/2019 PROF. ANAND GHARU
IN FORMATION PROVIDED BY
SYMBOL TABLE
Given an Identifier which name is it?
What information is to be associated with a name?
How do we access thisinformation?
16
3/17/2019 PROF. ANAND GHARU
SYMBOL TABLE NAMES
Variable and labels
Parameter
Constant
NAME Record
RecordField
Procedure
Array and files
17
3/17/2019 PROF. ANAND GHARU
WHO CREATES SYMBOL
TABLE ?
Identifi ers and attributes are entered by the analysisphases
when processing a definition (declaration) of an identifier
In simple languages with only global variables and implicit
declarations:
The scanner can enter an identifier into a symbol table
if it is not already there
In block-structured languages with scopes and explicit
declarations:
The parser and/or semantic analyzer enter identifiers
and corresponding attributes
18
3/17/2019 PROF. ANAND GHARU
USE OF SYMBOL TABLE
• Symbol table information is used by the analysis and
synthesis phases
• To verify that used identifiers have been defined
(declared)
• To verify that expressions and assignments are
semantically correct – type checking
• To generate intermediate or target code
19
3/17/2019 PROF. ANAND GHARU
MEMORY MANAGEMENT
What has a compiler to do with memory
management?
• compiler uses heap -allocated data structures
• modern languages have automatic data
(de)allocation
• garbage collection part of runtime support system
• compiler usually assists in identifying pointers
3/17/2019 PROF. ANAND GHARU
GARBAGE COLLECTION
• Some systems require user to call free when
finished with memory
– C/ C++
– reason for destructors in C++
• Other systems detect unused memory and
reclaim it
– Garbage Collection
– this is what Java does
3/17/2019 PROF. ANAND GHARU
GARBAGE COLLECTION
• Basic idea
– keep track of what memory is referenced and
when it is no longer accessible, reclaim the
memory
• Example
– linked list
3/17/2019 PROF. ANAND GHARU
EXAMPLE
Obj1 Obj2 Obj3
head
next next next
tail
• Assume programmer does the following
– obj1.next = obj2.next;
Obj1 Obj2 Obj3
head
next next next
tail
3/17/2019 PROF. ANAND GHARU
EXAMPLE
• Now there is no way for programmer to
reference obj2
– it’s garbage
• In system without garbage collection this is
called a memory leak
– location can’t be used but can’t be reallocated
– waste of memory and can eventually crash a
program
• In system with garbage collection this
chunk will be found and reclaimed
3/17/2019 PROF. ANAND GHARU
MARK AND SWEEP
• Basic idea
– go through all memory and mark every chunk
that is referenced
– make a second pass through memory and
remove all chunks not marked
OS 0 1 2 3
p2 = 650 p2 = 360
0 100 350 450 600
•Mark chunks 0, 1, and 3 as marked
•Place chunk 2 on the free list (turn it into a hole)
3/17/2019 PROF. ANAND GHARU
MARK AND SWEEP ISSUES
• Have to be able to identify all references
– this is difficult in some languages
– similar to compaction
• Requires jumping all over memory
– terrible for performance
• cache hits
• virtual memory
• Have to stop everything else to do
• Search time proportional to non-garbage
– may require lots of work for little reward
3/17/2019 PROF. ANAND GHARU
REFERENCE COUNTING
• Basic idea
– give each chunk a special field that is the
number of references to chunk
– whenever a new reference is made, increment
field by 1
– whenever a reference is removed, decrement
field by 1
– when reference count goes to zero, collect
chunk
• Requires compiler support
3/17/2019 PROF. ANAND GHARU
REFERENCE COUNTING
• Example
– everything in italics is added by compiler
p 1 0
Object p = new Object;
p.count++;
Object q = new Object; q
q.count++;
p.count--;
if(p.count == 0) 1 2
collect p
p = q;
p.count++;
3/17/2019 PROF. ANAND GHARU
REFERENCE COUNTING
• Above example does not check for NULL
reference
Object p = new Object
p.count++;
p.count--;
p = NULL;
if(p != NULL)
p.count++;
3/17/2019 PROF. ANAND GHARU
REFERENCE COUNTING
ISSUES
• What about pointers inside 0 referenced
page?
0 1
– both of these are garbage
– before reclaiming a chunk, must go
through all references in the chunk
• decrement the chunk they reference by 1
3/17/2019 PROF. ANAND GHARU
TOOLS USING
ANALYSIS – SYNTHESIS MODEL
Editors (syntax highlighting)
Pretty printers (e.g. Doxygen)
Static checkers (e.g. Lint and Splint)
Interpreters
31
3/17/2019 PROF. ANAND GHARU
TOOLS USING
ANALYSIS – SYNTHESIS MODEL
Text formatters (e.g. TeX and LaTeX)
Silicon compilers (e.g. VHDL)
Query interpreters/ compilers (Databases)
32
3/17/2019 PROF. ANAND GHARU
PREPROCESSORS, COMPILERS,
ASSEMBLERS, AND LINKERS
Skeletal Source Program
Preprocessor
Try for example:
gcc myprog.c Compiler
Assembler
Target Assembly Program Libraries and
Linker Relocatable
Relocatable Object Code Object Files
Absolute Machine Code
33
3/17/2019 PROF. ANAND GHARU
PHASES OF COMPILER
Skeletal Source Program
Preprocessor
Compiler
Assembler
Linker
34
3/17/2019 PROF. ANAND GHARU
THE PHASES OF A COMPILER
Phases Output Sample
Programmer (sourcecode producer) Source string A=B+C;
Scanner (performs lexical analysis) Token string ‘A’,‘=’,‘B’, ‘+’,‘C’, ‘;’
Andsymbol table with names
Parser (performs syntax analysis based Parse tree or abstract syntax tree ;
on the grammar of the programming |
language) =
/\
A +
/\
BC
Semantic analyzer (type checking, etc) Annotated parse tree or abstract
syntax tree
Intermediate code generator Three-address code, quads, orRTL int2fp B t1
+ t1 C t2
:= t2 A
Optimizer Three-address code, quads, orRTL int2fp B t1
+ t1 #2.3 A
Code generator Assembly code MOVF #2.3,r1
ADDF2 r1,r2
MOVF r2,A
Peephole optimizer Assembly code ADDF2 #2.3,r2 34
MOVF r2,A
3/17/2019 PROF. ANAND GHARU
THE GROUPING OF PHASES
• Front end: analysis (machine • A collection of phases is done
independent) only once (single pass) or
• Back end: synthesis (machine multiple times (multi pass)
dependent) • Single pass: usually requires
everything to be defined before
being used in source program
• Multi pass: compiler may have
to keep entire program
representation in memory
Compiler front Compiler
and back ends passes
3/17/2019 PROF. ANAND GHARU
COMPILER CONSTRUCTION
TOOLS
Software development tools are available
to implement one or more compiler
phases
• Scanner generators
• Parser generators
• Syntax-directed translation engines
• Automatic code generators
• Data-flow engines
3/17/2019 PROF. ANAND GHARU
BLOCK SCHEMATIC OF
LEXICAL ANALYZER
token
source lexical
program analyzer parser
get next
token
symbol
table
3/17/2019 PROF. ANAND GHARU
LEXICAL ANALYZER
PERSPECTIVE
LEXICAL ANALYZER PARSER
• Scan input
• Remove WS, NL, … Identify Perform Syntax Analysis
Actions Dictated by Token Order
Tokens Create Symbol Table Insert Update Symbol Table Entries
Tokens into ST Generate Errors Create Abstract Rep. of Source
Generate Errors
• Send Tokens to Parser
38
3/17/2019 PROF. ANAND GHARU
SEPERATION OF LEXICAL ANALYSIS
FROM SYNTAX ANALYSIS
• Separation of Lexical Analysis From Parsing
Presents a Simpler Conceptual Model
– From a Software Engineering Perspective Division
Emphasizes
• High Cohesion and Low Coupling
• Implies Well Specified Parallel Implementation
• Separation Increases Compiler Efficiency (I/O
Techniques to Enhance Lexical Analysis)
• Separation Promotes Portability.
– This is critical today, when platforms (OSsand
Hardware) are numerous and varied!
– Emergence of Platform Independence - Java 39
3/17/2019 PROF. ANAND GHARU
BASIC TERMINOLOGIES OF
LEXICAL ANALYSIS
Major Terms for Lexical Analysis?
TOKEN
A classification for a common set of strings
Examples Include <Identifier>, <number>, etc.
PATTERN
The rules which characterize the set of strings for a
token
Recall File and OS Wildcards ([A-Z]*.*)
LEXEME
Actual sequence of characters that matches pattern and
is classified by a token
Identifiers: x, count, name, etc…
40
3/17/2019 PROF. ANAND GHARU
INTRODUCING BASIC
TERMINOLOGY
Token Sample Lexemes Informal Description of Pattern
const const const
if if if
relation <, <=, =, < >, >, >= < or <= or = or < > or >= or >
id pi, count, D2 letter followed by letters anddigits
num 3.1416, 0, 6.02E23 any numeric constant
literal “core dumped” any characters between “and “except “
Actual values are critical. Info is :
Classifies
Pattern 1. Stored in symbol table
2.Returned to parser
41
3/17/2019 PROF. ANAND GHARU
I/O - KEY FOR SUCCESSFUL
LEXICAL ANALYSIS
Character-at-a-time I/O
Block / Buffered I/O
Block/Buffered I/O
Utilize Block of memory
Stage data from source to buffer block at a time
Maintain two blocks - Why (Recall OS)?
Asynchronous I/O - for 1 block
While Lexical Analysis on 2nd block
Block 1 Block 2
When ptr... Still Process
done, issue token in 2nd 42
I/O block
3/17/2019 PROF. ANAND GHARU
Algorithm
Buffered I/O with Sentinels
Current token
E = M * eof C * * 2 eof eof
lexeme forward (scans
beginning ahead to find
forward : = forward + 1 ; pattern match)
if forward is at eof then begin
if forward at end of first half then begin
reload second half ; Block I/O
forward : = forward + 1
end
else if forward at end of second half then begin
Block I/O
reload first half ;
move forward to biginning of first half
end
else / * eof within buffer signifying end of input * /
terminate lexical analysis end 43
2nd eof no more input !
3/17/2019 PROF. ANAND GHARU
HANDLING LEXICAL
ERRORS
• Error Handling is very localized, with Respect to
Input Source
• For example: whil ( x := 0 ) do
generates no lexical errors in PASCAL
• In what Situations do Errors Occur?
– Prefix of remaining input doesn’t match any defined
token
• Possible error recovery actions:
– Deleting or Inserting Input Characters
– Replacing or Transposing Characters
• Or, skip over to next separator to “ignore”
problem
44
3/17/2019 PROF. ANAND GHARU
3/17/2019 PROF. ANAND GHARU
Tool that helps to take set of descriptions
of possible tokens and produce Croutine
The set of descriptions is called lex
specification
The token description are known as
regular expressions
3/17/2019 PROF. ANAND GHARU
AUTOMATIC CONSTRUCTION OF
L EXICAL ANALYZER ……LEX
• Lex is a tool for creating lexical analyzers.
• Lexical analyzers tokenize input streams.
• Tokens are the terminals of a language.
• Regular expressions define tokens .
3/17/2019 PROF. ANAND GHARU
A AUTOMATIC CONSTRUCTION OF
L EXICAL ANALYZER ……LEX
lex
source lex.yy.c
Lex compiler
program
lex.l
C
C
lex.yy.c a.out
compiler
compile
r
input a.out sequence
stream
a.out of tokens
3/17/2019 PROF. ANAND GHARU
LEX SPECIFICATION
Lex Program Structure:
declarations
%%
translation rules
%%
auxiliary procedures
Name the file e.g. test.lex
Then, “lex test.lex” produces the file
“lex.yy.c” (a C-program)
3/17/2019 PROF. ANAND GHARU
LEX SPECIFICATION
sn
o
i %{
C declarat
/* definitions of all constants
LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ... */
%}
......
declarations
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
......
%%
if { return(IF);}
Rules
then { return(THEN);}
{id} { yylval = install_id(); return(ID); }
......
%%
Auxiliary
install_id()
{ /* procedure to install the lexeme to the ST */
3/17/2019 PROF. ANAND GHARU
AUTOMATIC CONSTRUCTION OF
L EXICAL ANALYZER ……LEX
•To run lex on a source file, use the command: lex source.l
•This produces the file lex.yy.c which is the C source for the
lexical analyzer.
• To compile this, use: cc -o prog -O lex.yy.c -ll
3/17/2019 PROF. ANAND GHARU
EXAMPLE OF LEX
SPECIFICATION
%{ Contains
the matching
#include <stdio.h> lexeme
Translation%}
rules %%
[0-9]+ { printf(“%s\n”, yytext); }
.|\n { } Invokes
%% the lexical
main() analyzer
{ yylex(); lex spec.l
} gcc lex.yy.c -ll
./a.out < spec.l
3/17/2019 PROF. ANAND GHARU
EXAMPLE OF LEX
SPECIFICATION
%{
#include <stdio.h>
Translation int ch = 0, wd = 0, nl = 0; Regular
rules %} definition
delim [ \t]+
%%
\n { ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim} { ch+=yyleng; wd++; }
. { ch++; }
%%
main()
{ yylex();
printf("%8d%8d%8d\n", nl, wd, ch);
3/17/2019 PROF. ANAND GHARU
EXAMPLE OF LEX
SPECIFICATION
%{
#include <stdio.h>
Translation %} Regular
rules digit [0-9] definition
letter [A-Za-z]
id
{letter}({letter}|{digit})*
%%
{digit}+ { printf(“number: %s\n”,
yytext); }
{id} { printf(“ident: %s\n”,
yytext); }
. { printf(“other: %s\n”,
yytext); }
3/17/2019 PROF. ANAND GHARU
REGULAR
EXPRESSIONS
• [xyz] match one character x, y, or z
(use \ to escape -) [^xyz]match any
character except x, y, and z
• [a-z] match one of a to z
• r* closure (match zero or more occurrences)
• r+ positive closure (match one or more
occurrences)
• r? optional (match zero or one occurrence)
• r
3/17/2019 1r2 match r1 then r2 (concatenation)
PROF. ANAND GHARU
EXAMPLE OF LEX
PROGRAM
int num_lines = 0, num_chars = 0;
%%
\n {++num_lines; ++num_chars;}
. {++num_chars;}
%%
main( argc, argv )
int argc; char **argv;
{
++argv, --argc; /* skip over program name
3/17/2019 PROF. ANAND GHARU
EXAMPLE OF LEX
PROGRAM
*/
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else yyin = stdin;
yylex();
printf( "# of lines = %d, # of chars =
%d\n",
num_lines, num_chars ); }
3/17/2019 PROF. ANAND GHARU
EXAMPLE OF LEX
PROGRAM
%{ #include <stdio.h> %}
WS [ \t\n]*
%%
[0123456789]+ printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]* printf("WORD\n");
{WS} /* do nothing */
. printf(“UNKNOWN\n“);
%%
3/17/2019 PROF. ANAND GHARU
EXAMPLE OF LEX
PROGRAM
main( argc, argv )
int argc; char **argv;
{ ++argv, --argc;
if ( argc > 0 ) yyin = fopen( argv[0], "r“);
else yyin = stdin;
yylex(); }
3/17/2019 PROF. ANAND GHARU
•THANK YOU!!!!!!!!!!
My Blog : anandgharu.wordpress.com
PROF. ANAND GHARU
3/17/2019