CD Unit 1
Lexical Analysis
• Language Processors
• Structure of a Compiler
• Lexical Analysis
• The Role of the Lexical Analyzer
• Bootstrapping
• Input Buffering
• Specification of Tokens
• Recognition of Tokens
• Lexical Analyzer Generator – LEX
• Finite Automata
• Regular Expressions and Finite Automata
• Design of a Lexical Analyzer Generator
Compiler
• A compiler is software which converts a program written in a high-level language (HLL) into a low-level language (LLL).
Language Processors
Steps:
1. The source code is first passed through the preprocessor
2. The preprocessor output (pure HLL code) is given to the compiler
3. The compiler converts this code into assembly language
4. The assembler converts the assembly code into machine code
* The assembler translates LLL (assembly) into machine code
* The linker links all parts of the program together for execution
* The loader loads them all into memory, and then the program is executed
Preprocessor
• It processes directives such as
#include
#define
• In the place of these directives it inserts the complete source code (or macro expansion) they refer to
Structure of a Compiler
• We can also call this the phases of a compiler, as a compiler operates in
phases.
• A phase is a logically interrelated operation that takes the source program in one
representation and produces output in another representation.
• Two major phases of compilation:
a) Analysis Phase – machine independent but language dependent
b) Synthesis Phase – machine dependent but language independent
• The different phases come in between the conversion of HLL to LLL.
• All these phases are managed with the help of the symbol table manager;
all symbol-related information is stored there.
If any error occurs in between these phases, the error handler is going to
handle that error.
1. Lexical analysis: Lexical analysis is also called scanning. It is the phase of compilation
in which the complete source code is scanned and the source program is broken up
into a group of strings called tokens.
Ex: Consider an assignment statement
position := initial + rate * 60
In the lexical analysis phase the above statement is broken up into a series of
tokens as follows:
lexeme      token
position    id
:=          assignment symbol
initial     id
+           plus sign
rate        id
*           multiplication sign
60          numerical literal
The statement after lexical analysis is id1 := id2 + id3 * 60,
where id1, id2, id3 are the tokens for position, initial and rate.
Lexical Analysis
• The source code enters phase 1.
• The LA reads the stream of characters making up the source program and groups
the characters into meaningful sequences called lexemes.
• The LA represents these lexemes in the form of tokens.
2. Syntax analysis: Syntax analysis is also called parsing.
• In this phase, the tokens generated by the lexical analyzer are
grouped together to form a hierarchical structure.
• The hierarchical structure generated in this phase is called a
parse tree.
It can be represented by a syntax tree, which is a
compressed representation of the parse tree.
Syntax Analysis
• Also called parsing.
• In this phase, expressions, statements and declarations are identified by
using the result of the LA.
• So the SA takes the tokens produced by the LA as input and generates a
parse tree/syntax tree.
3. Semantic Analysis
• It checks whether the parse tree constructed follows the rules of the
language or not.
• Semantic analysis checks the source program for semantic
errors. It performs type checking.
• It changes one data type to another where needed.
Example: id1 := id2 + id3 * 60
If all the identifiers are real, then we have to convert the
number 60 from int to real.
4. Intermediate code generator:
• The intermediate code is a kind of code which is easy to generate,
and this code can be easily converted to target code. (or)
• The intermediate code generator converts the syntax
tree into an intermediate form from which target code is later produced.
• The intermediate code should be easily translated to target code,
and intermediate code generation should not take much time.
• It comes in a variety of forms such as three-address code,
quadruples, triples and postfix notation.
Ex: The three-address code for the statement
position := initial + rate * 60, i.e. id1 := id2 + id3 * 60, is given
as
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
5. Code optimizer:
The code optimizer phase improves the intermediate
code. It reduces the code by removing repeated
or unwanted instructions from the intermediate
code.
The code for id1 := id2 + id3 * 60 can be optimized as
temp1 := id3 * 60.0
id1 := id2 + temp1
Specification of Tokens
• In order to specify tokens we use Regular Expressions.
• REs generate regular languages.
• A language is a set of strings formed over an alphabet.
• We can specify tokens using:
1. Alphabets, strings, language
2. Operations on language
3. Regular Expression
4. Regular Definition
Alphabet
• It is a set of symbols denoted by Σ (sigma)
String
• A string is a finite sequence of symbols drawn from the alphabet.
• We can derive any number of strings from an alphabet.
a) Length of a string:
The length is the total number of characters/symbols present in the string.
Let s = 1100.
• The length of the string is denoted by |s| – the name of the string
within the mod symbols.
|s| = 4
b) Empty String:
• If the length of the string is 0, then it is the empty string.
• Denoted by epsilon 𝜖
|𝜖| = 0
c) Prefix of a String:
A prefix of a string s is obtained by removing zero or more symbols from
the end of s (e.g., ban is a prefix of banana).
Languages: A language is a set of strings generated from
the alphabet.
Ex: The set of strings consisting of an equal number of a’s and b’s:
L = {ε, ab, aabb, aaabbb, …}
Operations on languages:
1. Union
2. Concatenation
3. Kleene star/closure (a set of strings which
includes the empty string)
Regular Expression:
• A regular expression can be defined over an alphabet as follows.
• Let ∑ be an alphabet. The regular expressions over ∑ and the sets they denote are
defined as follows:
1) ∅ is a regular expression and denotes the empty set.
2) ε is a regular expression and denotes the set {ε}, i.e., a set containing the empty string.
3) If ‘a’ is a symbol in ∑, then a is a regular expression that denotes {a}, i.e., a set
containing the string a.
4) If r and s are two regular expressions denoting the languages L(r) and L(s), then:
a) the union of two regular expressions is a regular expression;
b) the concatenation of two regular expressions is a regular expression;
c) the Kleene closure of any regular expression is a regular expression.
• These three are known as the properties or components of regular expressions.
• Ex: 0(0+1)*1 is a regular expression denoting the set of all strings over {0,1} starting
with 0 and ending with 1.
Regular Definition: A regular definition is a sequence of definitions of
the form:
d1 -> r1 (d1 produces r1)
d2 -> r2
where d1, d2, …, di are symbols not part of the
alphabet and r1, r2, … are regular expressions.
Ex: Regular definition for an identifier in the Pascal language:
letter -> a|b|…|z|A|B|…|Z
digit -> 0|1|…|9
id -> letter(letter|digit)*
Recognition of Tokens
• We recognize tokens with the help of transition diagrams:
1) Recognition of identifiers (names of variables)
2) Recognition of delimiters (blank space/tab space/newline character)
3) Recognition of relational operators (>, >=, <, <=, ==, !=)
4) Recognition of keywords such as if, else, for
5) Recognition of numbers (int/floating-point numbers)
Lexical Analyzer Generator-LEX
• There is a wide range of tools for the construction of lexical
analyzers.
• The majority of these tools are based on regular expressions.
• One of the traditional tools of this kind is LEX.
• LEX is a tool that is used to generate a lexical analyzer
(or) scanner. The tool is often referred to as the LEX compiler, and its
input specification as the LEX language.
1. The lex.l program is converted into a C program, lex.yy.c, by the LEX compiler.
2. The C compiler compiles lex.yy.c and produces an object program,
a.out.
3. a.out is the LA: it transforms
an input stream into a sequence of tokens.
Design of a Lexical Analyzer Generator
Declaration section:
This section is used to declare variables, manifest constants and regular definitions.
A manifest constant is an identifier that is used to represent a constant.
Ex: In C, manifest constants are declared as #define MAX 10
A regular definition is a sequence of regular expressions where each
regular expression is given a name:
d1 -> r1
d2 -> r2
d3 -> r3
…………
dn -> rn
Translation rules section: The translation rules are a set of statements
which define the action to be performed for each regular expression
given.
Below is the form of the statements in the translation rules section:
r1 {action1}
r2 {action2}
. .
. .
rn {actionn}
Auxiliary procedures: These procedures are used by the actions in the
translation rules. These procedures can be compiled separately and loaded
with the lexical analyzer.
Lex Specification
Declaration section
%%
Translation rules section
%%
Auxiliary procedures section
%{
int vowels = 0, consonants = 0;
%}
%%
[aeiouAEIOU]    { vowels++; }
[a-zA-Z]        { consonants++; }
.|\n            { }
%%
int main()
{
    printf("enter the string: ");
    yylex();
    printf("no of vowels=%d\nno of consonants=%d\n", vowels, consonants);
    return 0;
}
int yywrap()
{
    return 1;
}