Unit 1

Lexical Analysis
• Language Processors
• Structure of a Compiler
• Lexical Analysis
• The Role of the Lexical Analyzer
• Bootstrapping
• Input Buffering
• Specification of Tokens
• Recognition of Tokens
• Lexical Analyzer Generator – LEX
• Finite Automata
• Regular Expressions and Finite Automata
• Design of a Lexical Analyzer Generator
Compiler
• A compiler is software that converts a program written in a high-level language (HLL) into a low-level language (LLL).
Language Processors
Steps:
1. The source code is first given to the preprocessor
2. The preprocessor is connected to the compiler and gives it pure HLL code
3. With the help of the compiler, the preprocessed code is converted into assembly code and given to the assembler
4. The assembler converts the code into machine code
* The assembler translates LLL to machine code
* The linker is used to link all parts of the program together for execution
* The loader loads all of them into memory, and then the program is executed
Preprocessor
• It processes all the preprocessor directives, such as
#include
#define
• In place of these directives it inserts the complete source code they refer to
Structure of a Compiler
• We can also call this the phases of a compiler, as a compiler operates in phases.
• A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation.
• Two major phases of compilation:
a) Analysis Phase – machine independent but language dependent
b) Synthesis Phase – machine dependent but language independent
• Several phases lie in between the conversion of HLL to LLL.
• All these 6 phases are managed with the help of the symbol table manager; all symbol-related information is stored there.
• If any error occurs in between these phases, the error handler handles it.
1. Lexical analysis: Lexical analysis is also called scanning. It is the phase of compilation in which the complete source code is scanned and the source program is broken up into groups of strings called tokens.
Ex: consider an assignment statement
position := initial + rate * 60
In the lexical analysis phase, the above statement is broken up into a series of tokens as follows:
lexeme      token
position    id
:=          assignment symbol
initial     id
+           plus sign
rate        id
*           multiplication sign
60          numerical literal
The statement after lexical analysis is id1 := id2 + id3 * 60,
where id1, id2, id3 are tokens for position, initial, and rate.
Lexical Analysis
• The source code enters phase 1.
• The LA reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• The LA represents these lexemes in the form of tokens.
2. Syntax analysis: Syntax analysis is also called parsing.
• In this phase, the tokens generated by the lexical analyzer are grouped together to form a hierarchical structure.
• The hierarchical structure generated in this phase is called a parse tree. It can be represented by a syntax tree, which is a compressed representation of the parse tree.
Syntax Analysis
• Also called parsing
• In this phase, expressions, statements, and declarations are identified using the result of the LA
• So the SA takes the tokens produced by the LA as input and generates a parse tree/syntax tree
3. Semantic Analysis
• It checks whether the parse tree constructed follows the rules of the language or not.
• Semantic analysis checks the source program for semantic errors. It performs type checking.
• It changes one data type to another where required.
Example: id1 := id2 + id3 * 60
If all the identifiers are real, then we have to convert the number 60 from int to real.
4. Intermediate code generator:
• The intermediate code is a kind of code that is easy to generate, and this code can be easily converted to target code; (or)
• The intermediate code generator converts the syntax tree into an intermediate code, which is later translated into target code.
• The intermediate code should be easily translated into target code, and intermediate code generation should not take much time.
• It comes in a variety of forms such as three-address code, quadruples, triples, and postfix notation.
Ex: The three-address code for the statement
position := initial + rate * 60, or id1 := id2 + id3 * 60, is given as

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
5. Code optimizer:
The code optimizer phase improves the intermediate code. It reduces the code by removing repeated or unwanted instructions from the intermediate code.
The code id1 := id2 + id3 * 60 can be optimized as

temp1 := id3 * 60.0
id1 := id2 + temp1

Code Optimization
• Used to improve the intermediate code so that the output runs faster and takes less space
• It removes unnecessary code lines and rearranges the sequence of statements in order to speed up the program without wasting resources
6.Code generator: The code generator phase converts the
intermediate code into a target code.
Ex: Assembly code for id1:=id2+id3*60
MOV id3,R2
MUL #60.0,R2
MOV id2,R1
ADD R2,R1
MOV R1,id1
Symbol Table: The data structure used to record this information is called the symbol table. This data structure allows us to find the record for each identifier quickly and to store or retrieve data quickly.
Error handler: The error handler is invoked when a fault in the source program is detected. It warns the programmer by issuing a message and adjusts the information being passed from phase to phase so that compilation can proceed.
Role of Lexical Analyzer
• The LA is the first phase of the compiler
• Also called linear analysis or scanning
• Both the LA and the parser depend on the symbol table
• The symbol table contains all the identifier names along with their types
Functions of the Lexical Analyzer:
• Produces a stream of tokens
• Eliminates blank spaces and comments
• Keeps track of line numbers
• Generates the symbol table, which stores information about the identifiers and constants encountered in the input
Token: A token is a sequence of characters having a collective meaning. In most programming languages, keywords, operators, identifiers, constants, and literal strings are treated as tokens.
Pattern: A pattern is a set of rules that describes the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Ex: if(a<b)
lexeme    token               pattern
if        keyword             if
(         left parenthesis    (
a         identifier          letter followed by letters or digits
<         relational op       < or <= or > or >= or ==
b         identifier          letter followed by letters or digits
)         right parenthesis   )
Lexical Errors:
 A lexical analyzer reads the input string and generates tokens.
 It may encounter the following types of errors:
i) A misspelled identifier
ii) An identifier longer than the specified or defined length
These errors can be caused by the following:
 An invalid character
 A missing character
 A misplaced character
Example:
main()
{
printf("hello)
}
This code contains 2 errors:
i) The closing double quote is missing on the right side of the string "hello"
ii) The semicolon is missing in the printf() statement
Bootstrapping
• Mainly useful in order to design a compiler.
• Generally, any compiler is characterized by 3 languages:
1. Source language (S) – HLL
2. Target language (T) – LLL
3. Implementation language – the language the compiler itself is written in; it converts S to T and acts as an intermediary between them
• The process of compilation is represented in a T-diagram.
• Bootstrapping is a process in which a simple language is used to translate a more complicated program; this complicated program can in turn handle an even more complicated program.
• Suppose the compiler is written in language A.
• As the CPU understands only machine language, we first need to develop a small subset of language A.
• So we design a compiler for language A: this compiler accepts the A language as input and produces machine code M as output.
• The resulting compiler is then in machine-understandable code.
Input Buffering
Buffer: contains data that is stored for a short amount of time, typically in the computer's memory (RAM).
• The purpose of a buffer is to hold data right before it is used.
• If something needs to be processed, then before going straight to the processor it first gets into the buffer, and from there it is passed to the processor. This process is known as buffering.
• In a compiler, the LA uses input buffering to check the stream of characters in the source code.
eof – end of file
• The eof marker is also known as a sentinel.
• The buffer has 2 pointers, which are used to identify each lexeme one by one:
1. lexemeBegin
2. forward
Procedure
1. lexemeBegin (lB) points to the first element.
2. forward (f) also points to the same element at the beginning, so we get to know that E is the first character.
3. lB stays on E while f moves to the next input symbol, =.
4. The LA checks the symbol table to see whether E is present or not:
 if it is present, what its attribute is
 if it is not, the LA registers E in the symbol table for further use
Specification of Tokens
• In order to specify tokens we use regular expressions (RE).
• An RE generates a regular language.
• A language is a set of strings, where each string is built from the symbols of an alphabet.
• We can specify tokens using:
1. Alphabets, strings, languages
2. Operations on languages
3. Regular expressions
4. Regular definitions
Alphabet
• A set of symbols, denoted by Σ (sigma)
String
• A finite sequence of symbols generated from an alphabet
• We can derive any number of strings from an alphabet
a) Length of a string:
The length is the total number of characters/symbols present in the string.
Let s = 1100
• The length of a string is denoted by |s| – the name of the string between the bar symbols
|s| = 4
b) Empty string:
• If the length of a string is 0, then it is the empty string
• Denoted by epsilon ε
|ε| = 0
c) Prefix of a string:
A prefix of a string is any sequence of leading symbols of it; for s = 1100 the prefixes are ε, 1, 11, 110, and 1100.
Languages: A language is a set of strings generated from the alphabet.
Ex: The set of strings consisting of an equal number of a's and b's:
L = {ε, ab, aabb, aaabbb, …}
Operations on languages:
1. Union
2. Concatenation
3. Kleene star/closure (a set of strings which includes the empty string)
Regular Expression:
• A regular expression can be defined over an alphabet as follows.
• Let Σ be an alphabet. The regular expressions over Σ and the sets they denote are defined as follows:
1) ∅ is a regular expression and denotes the empty set.
2) ε is a regular expression and denotes the set {ε}, i.e., a set containing the empty string.
3) If 'a' is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., a set containing the string a.
4) If r and s are two regular expressions denoting the languages L(r) and L(s), then:
a) The union of two regular expressions is a regular expression.
b) The concatenation of two regular expressions is a regular expression.
c) The Kleene closure of any regular expression is a regular expression.
• These three are known as the properties or components of regular expressions.
• Ex: 0(0+1)*1 is a regular expression denoting the set of all strings over {0,1} starting with 0 and ending with 1.
Regular Definition: A regular definition is a sequence of definitions of the form:
d1 -> r1 (read as: d1 produces r1)
d2 -> r2
where d1, d2, …, dn are symbols that are not part of the alphabet, and r1, r2, … are regular expressions.
EX: Regular definition for identifiers in the Pascal language:
letter -> a|b|…|z|A|B|…|Z
digit -> 0|1|…|9
id -> letter(letter|digit)*
Recognition of Tokens
• We recognize tokens with the help of transition diagrams:
1) Recognition of identifiers (names of variables)
2) Recognition of delimiters (blank space/tab space/newline characters)
3) Recognition of relational operators (>, >=, <, <=, ==, !=)
4) Recognition of keywords such as if, else, for
5) Recognition of numbers (integer/floating-point numbers)
Lexical Analyzer Generator – LEX
• There is a wide range of tools for the construction of lexical analyzers.
• The majority of these tools are based on regular expressions.
• One traditional tool of this kind is LEX.
• LEX is a tool that is used to generate a lexical analyzer (or scanner). The tool is often referred to as the LEX compiler, and its input specification as the LEX language.
1. The lex.l program is converted into a C program (lex.yy.c) with the help of the LEX compiler.
2. The C compiler compiles the lex.yy.c program and produces an object program a.out.
3. The first 2 steps are performed by LEX and the C compiler; a.out is the LA that transforms an input stream into a sequence of tokens.
Design of a Lexical Analyzer Generator
Declaration section:
This section is used to declare variables, manifest constants, and regular definitions. A manifest constant is an identifier that is used to represent a constant.
EX: In C, manifest constants are declared as #define MAX 10
A regular definition is a sequence of regular expressions where each regular expression is given a name.
d1->r1
d2->r2
d3->r3
…………
…………
dn->rn
Translation rules section: The translation rules are a set of statements that define the action to be performed for each regular expression given.
Below is the form of the statements in the translation rules section:
r1 {method1}
r2 {method2}
.  .
.  .
rn {methodn}
Auxiliary procedures: These procedures are used by the methods in the translation rules. They can be compiled separately and loaded with the lexical analyzer.
Lex Specification

Structure of a LEX program:

Declaration section

%%

Translation rules section

%%

Auxiliary procedures section
EXAMPLE 1: Write a LEX program to count the number of vowels and consonants in a given input.
%{
#include<stdio.h>
int vowels=0;
int cons=0;   /* "const" is a C keyword, so the counter is named cons */
%}

%%

[aeiouAEIOU]   {vowels++;}
[a-zA-Z]       {cons++;}

%%

int main()
{
printf("enter the string ");
yylex();
printf("no of vowels=%d\nno of consonants=%d\n", vowels, cons);
return 0;
}
int yywrap()
{
return 1;
}
