0% found this document useful (0 votes)
16 views

Module 1

The document discusses the different phases of a compiler: 1) The analysis phase which includes lexical analysis, syntax analysis, and semantic analysis. Lexical analysis converts the input text into tokens. Syntax analysis generates a parse tree from these tokens. Semantic analysis checks for semantic correctness. 2) The synthesis phase which includes code optimization and code generation to produce the target code from the intermediate representation generated during analysis. 3) The key components of a compiler including translators like interpreters, assemblers, and compilers and the different phases they involve.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views

Module 1

The document discusses the different phases of a compiler: 1) The analysis phase which includes lexical analysis, syntax analysis, and semantic analysis. Lexical analysis converts the input text into tokens. Syntax analysis generates a parse tree from these tokens. Semantic analysis checks for semantic correctness. 2) The synthesis phase which includes code optimization and code generation to produce the target code from the intermediate representation generated during analysis. 3) The key components of a compiler including translators like interpreters, assemblers, and compilers and the different phases they involve.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 133

Content - Module -1

• Introduction to Compilation And Lexical Analysis


• Introduction to LLVM
• Structure and Phases of a Compiler
• Design Issues
• Patterns & Lexemes
• Tokens & Attributes
• Specification of Tokens
• Extended Regular Expression
• Regular Expression to Deterministic Finite Automata (Direct method)
• Lex - A Lexical Analyzer Generator
Translator
• A translator is a program that takes one form of program as
input and converts it into another form.
• Types of translators are:
1. Interpreter
Source
2. Assembler Translator Target
Program Program
3. Compiler

Error
Messages
Interpreter
• Interpreter is a translator that reads a program written in source
language and translates it into an equivalent program in target
language line by line

void main() 0000 1100 0010


{ 0000
int a=1,b=2,c; Interpreter 1111 1100 0010
c=a+b; 1010 1100 0010
printf(“%d”,c); 0011 1100 0010
} 1111

Source Error Target


Program Messages Program
Assembler
• Assembler is a translator which takes the assembly code as an
input and generates the machine code as an output.

MOV id3, R1 0000 1100 0010


MUL #2.0, R1 0100
MOV id2, R2 0111 1000 0001
MUL R2, R1 Assembler 1111 0101 1110
MOV id1, R2 1100 0000 1000
ADD R2, R1 1011
MOV R1, id1 1100 0000 1000

Assembly Error
Messages Machine Code
Code
Compiler
• A compiler is a translator that reads a program written in source
language and translates it into an equivalent program in target
language. (Object/Machine Language/0’s, 1’s).

void main() 0000 1100 0010


{ 0100
int a=1,b=2,c; Compiler 0111 1000 0001
c=a+b; 1111 0101 1110
printf(“%d”,c); 1100 0000 1000
} 1011

Source Error Target


Program Messages Program
Features of a Compiler
• Speed of machine code

• The correctness of machine code

• The meaning of code should not change

• Good error detection

• Checking the code correctly according to grammar


Uses of Compilers

• Helps to make the code independent of the platform

• Makes the code free of syntax and semantic errors

• Generate executable files of code

• Translates the code from one language to another


Applications of Compilers
• Optimization of computer architectures
• Design of new computer architectures
• Artificial Intelligence
• Security
• Embedded Systems
• High-Performance Computing
•…
Analysis Synthesis model of compilation
• There are two parts of compilation.
1. Analysis Phase
2. Synthesis Phase

void main() Analysis Synthesis


{ Phase Phase 0000 1100
int a=1,b=2,c; 0111 1000
c=a+b; 0001
printf(“%d”,c); Intermediate 1111 0101
} Representation 1000
1011
Source Code Target Code
Analysis phase & Synthesis phase
Analysis Phase Synthesis Phase
• Analysis part breaks up the source • The synthesis part constructs
program into constituent pieces the desired target program from
and creates an intermediate the intermediate representation.
representation of the source • Synthesis phase consist of the
program. following sub phases:
• Analysis phase consists of three 1. Code optimization
sub phases:
2. Code generation
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
Phases of compiler
Compiler

Synthesis
Analysis phase phase

Lexical
analysis Intermediate Code
code optimization
Syntax
analysis generation
Code
Semantic
generation
analysis
Lexical analysis
• Lexical Analysis is also called linear analysis or
scanning. Position = initial + rate * 60
• Lexical Analyzer divides the given source
statement into the tokens.
• Ex: Position = initial + rate * 60 would be Lexical analysis
grouped into the following tokens:
id1 = id2 + id3 * 60
Position (identifier)
= (Assignment operator)
initial (identifier)
+ (Plus Operator)
rate (identifier)
* (Multiplication Operator)
60 (digit/Number)
Lexeme
• Reads the stream of char making up the source program &
group the char into meaningful sequences called lexeme.
• Lexical analyzer represents the lexeme in the form of tokens.
• It is a sequence of characters in the source code that are matched
by given predefined language rules for every lexeme to be
specified as a valid token.
• Example:
• main is lexeme of type keyword(token)
• (,),{,} are lexemes of type punctuation(token)
Patterns
• Specifies a set of rules that a scanner follows to create a token.

• Example:
• For a keyword to be identified as a valid token, the pattern is the
sequence of characters that make the keyword.

• For identifier to be identified as a valid token, the pattern is the


predefined rules that it must start with alphabet, followed by alphabet
or a digit.
Token vs Lexeme Vs Pattern
Criteria Token Lexeme Pattern
Interpretation of All the reserved keywords of int, goto The sequence of characters
type Keyword that language(main, printf, etc.) that make the keyword.
Interpretation of Name of a variable, function, etc addsum, a, id1 It must start with the
type Identifier alphabet, followed by the
alphabet or a digit.
Interpretation of All the operators are considered +, = +, =
type Operator tokens.
Interpretation of Each kind of punctuation is (, ), {, } (, ), {, }
type Punctuation considered a token. E.G.
Semicolon, bracket, comma, etc.

Interpretation of A grammar rule or boolean “Welcome to Any string of characters


type Literal literal. GeeksforGeeks!” (except ‘ ‘) between ” and “
Phases of Compiler
Compiler

Synthesis
Analysis phase phase

Lexical
analysis Intermediate Code
code optimization
Syntax
analysis generation
Code
Semantic
generation
analysis
Syntax analysis
Position = initial + rate*60

• Syntax Analysis is also called Parsing or Lexical analysis


Hierarchical Analysis.
id1 = id2 + id3 * 60
• It takes token produced by lexical analyzer
as Input & generates the parse tree. Syntax analysis

• The syntax analyzer checks each line of the


=
code and spots every error.
• If code is error free then syntax analyzer id1 +
generates the tree. *
id2

id3 60
Phases of compiler
Compiler

Synthesis
Analysis phase phase

Lexical
analysis Intermediate Code
code optimization
Syntax
analysis generation
Code
Semantic
generation
analysis
Semantic analysis
• Semantic analyzer determines the meaning =
of a source string. id1 +
• It performs following operations: id2 * int to
1. Matching of parenthesis in the expression. real
id3 60
2. Matching of if..else statement.
3. Performing arithmetic operation that are type
Semantic analysis
compatible.
4. Checking the scope of operation. =
*Note: Consider id1, id2 and id3 are real
id1 +

id2 *
id3 inttoreal

60
Phases of compiler
Compiler

Synthesis
Analysis phase phase

Lexical
analysis Intermediate Code
code optimization
Syntax
analysis generation
Code
Semantic
generation
analysis
Intermediate code generator
• Two important properties of intermediate =
code : +
id1
1. It should be easy to produce.
id2 *
2. Easy to translate into target program.
t3 id3 inttoreal
• Intermediate form can be represented using t2 t1
60
“three address code”.[Postfix notation & Syntax
Analysis tree] Intermediate code

• Three address code consist of a sequence of t1= int to real(60)


instruction, each of which has at most three t2= id3 * t1
operands. t3= id2 + t2
id1= t3
Phases of compiler
Compiler

Synthesis
Analysis phase phase

Lexical
analysis Intermediate Code
code optimization
Syntax
analysis generation
Code
Semantic
generation
analysis
Code optimization
• It improves the intermediate code.
• This is necessary to have a faster Intermediate code
execution of code or less consumption
t1= int to real(60)
of memory. t2= id3 * t1
t3= t2 + id2
id1= t3

Code optimization

t1= id3 * 60.0


id1 = id2 + t1
Phases of compiler
Compiler

Synthesis
Analysis phase phase

Lexical
analysis Intermediate Code
code optimization
Syntax
analysis generation
Code
Semantic
generation
analysis
Code generation
• The intermediate code instructions are
translated into sequence of machine Code optimization
instruction.
t1= id3 * 60.0
id1 = id2 + t1

Code generation

MOV id3, R2
MUL #60.0, R2
MOV id2, R1
ADD R2,R1
MOV R1, id1

Id3→R2
Id2→R1
Symbol table
• Symbol table are data structures that are used by compilers to hold
information about source-program constructs.
• It is used to store information about the occurrences of various entities
such as, objects, classes, variable names, functions, etc.,
• It is used by both analysis phase and synthesis phase.
• Symbol table is used for the following purposes
• It is used to store the name of all the entities in a structured form at one place
• It is used to verify if a variable has been declared
• It is used to determine the scope of a name
• It is used to implement type checking by verifying assignments and
expression in the source code are semantically correct.
Cont.,
• Symbol table can be a linear (Linked list) or hash table
• It maintain a entry for each name as,
• <symbol name, type, attribute>

• Example:
• If a symbol table has to store information about the following
variable declaration:
• static int interest;
• then it should store the entry such as:
• <interest, int, static>
Operations of Symbol table
Scope Management
void pro_one() void pro_two()
{ {
int one_1; int two_1;
int one_2; int two_2;

{ \ { \
int one_3; |_ inner scope 1 int two_3; |_ inner scope 3
int one_4; | int two_4; |
} / } /

int one_5; int two_5;


}
{ \ ...
int one_6; |_ inner scope 2
int one_7; |
} /
}
Phases of compiler
Source program

Analysis
Lexical analysis Phase

Syntax analysis

Semantic analysis
Symbol
Error detection
table
and recovery
Intermediate code

Variable Type Address Code optimization


Name
Position Float 0001 Code generation Synthesis
Initial Float 0005 Phase
Rate Float 0009 Target Program
Exercise 1
• Write output of all the phases of compiler for following
statements:
1. X = b-c*2
2. I = p*n*r/100
3. F = C * (9/5) + 32
4. α = [-b + √(b^2 – 4*a*c)]/2*a
5. Volume of frustum: V = (Π*h)/12 (d^2 + d*b + b^2)
Grouping of Phases
Front end & back end (Grouping of phases)
Front end
• Depends primarily on source language and largely independent of the target machine.
• It includes following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Creation of symbol table

Back end
 Depends on target machine and do not depends on source program.
 It includes following phases:
1. Code optimization
2. Code generation phase
3. Error handling and symbol table operation
Context of Compiler
(Cousins of compiler)
Context of compiler (Cousins of compiler)
Skeletal Source Program
• In addition to compiler, many other
system programs are required to Preprocessor
generate absolute machine code. Source
Program
• These system programs are: Compiler

Target Assembly
• Preprocessor Program
• Assembler Assembler
• Linker Relocatable
• Loader Object Code
Libraries & Linker / Loader
Object Files

Absolute Machine
Code
Context of compiler (Cousins of compiler)
Skeletal Source Program
Preprocessor
 Some of the task performed by preprocessor: Preprocessor

1. Macro processing: Allows user to define macros. Source


Ex: #define PI 3.14159265358979323846 Program

2. File inclusion: A preprocessor may include the Compiler


header file into the program. Ex: Target Assembly
#include<stdio.h> Program
3. Rational preprocessor: It provides built in macro
for construct like while statement or if statement. Assembler

4. Language extensions: Add capabilities to the Relocatable


language by using built-in macros. Object Code
Libraries &
 Ex: the language SQL is a database query Linker / Loader
Object Files
language embedded in C. Statement beginning
with ## are taken by preprocessor to be
database access statement unrelated to C and Absolute Machine
translated into procedure call on routines that Code
perform the database access.
Context of compiler (Cousins of compiler)
Skeletal Source Program
Compiler
Preprocessor
 A compiler is a program that reads a program
written in source language and translates it Source
Program
into an equivalent program in target language.
Compiler

Target Assembly
Program
Assembler

Relocatable
Object Code
Libraries & Linker / Loader
Object Files

Absolute Machine
Code
Context of compiler (Cousins of compiler)
Skeletal Source Program
Assembler
Preprocessor
 Assembler is a translator which takes the
assembly program (mnemonic) as an input Source
Program
and generates the machine code as an output.
Compiler

Target Assembly
Program
Assembler

Relocatable
Object Code
Libraries & Linker / Loader
Object Files

Absolute Machine
Code
Context of compiler (Cousins of compiler)
Skeletal Source Program
Linker
 Linker makes a single program from a several files of Preprocessor
relocatable machine code. Source
Program
 These files may have been the result of several
Compiler
different compilation, and one or more library files.
Target Assembly
Loader Program
Assembler
 The process of loading consists of:
 Taking relocatable machine code Relocatable
Machine Code
 Altering the relocatable address
Libraries & Linker / Loader
 Placing the altered instructions and data in Object Files
main memory at the proper location.
Absolute Machine
Code
Pass structure
Pass structure
• One complete scan of a source program is called pass.
• Pass includes reading an input file and writing to the output file.
• In a single pass compiler analysis of source statement is immediately
followed by synthesis of equivalent target statement.
• In a two pass compiler intermediate code is generated between analysis
and synthesis phase.
• It is difficult to compile the source program into single pass due to:
forward reference
Pass structure
Forward reference: A forward reference of a program entity is a reference
to the entity which precedes its definition in the program.
• Can be solved by postponing the generation of target code until more
information concerning the entity becomes available.
• It leads to multi pass model of compilation.
Pass I:

• Perform analysis of the source program and note relevant information.

Pass II:

• In Pass II: Generate target code using information noted in pass I.


Types of compiler
Types of compiler
1. One pass compiler
• It is a type of compiler that compiles whole process in one-pass.
2. Two pass compiler
• It is a type of compiler that compiles whole process in two-pass.
• It generates intermediate code.
3. Incremental compiler
• The compiler which compiles only the changed line from the source code and
update the object code.
4. Native code compiler
• The compiler used to compile a source code for a same type of platform only.
5. Cross compiler
• The compiler used to compile a source code for a different kinds platform.
Types of compiler
1. One pass compiler

2. Two pass compiler


Token, Pattern & Lexemes
Interaction of scanner & parser
Token
Source Lexical
Parser
Program Analyzer
Get next
token

Symbol Table

• Upon receiving a “Get next token” command from parser, the lexical
analyzer reads the input character until it can identify the next token.
Lexical Analysis
• Secondary tasks:
• Stripping out from the source program comments and white
spaces in the form of blank, tab, and new line characters.
• Correlating error messages from the compiler with the
source program.
• Inserting lexemes for user-defined names into the symbol
table.
Issues in Lexical and Syntax Analysis
Why to separate lexical analysis & parsing?
1. Simplicity in design
• Separation allows the simplification of one or the other.
• Example: A parser with comments or white spaces is more complex
2. Improves compiler efficiency
• Optimization of lexical analysis because a large amount of time is
spent reading the source program and partitioning it into tokens.
3. Enhance compiler portability
• Input alphabet peculiarities and other device-specific anomalies can
be restricted to the lexical analyzer.
Token, Pattern & Lexemes
Token Pattern
The set of rules called pattern
Sequence of character having a
associated with a token.
collective meaning is known as
Example: “non-empty sequence of digits”,
token. “letter followed by letters and digits”
Categories of Tokens:
1.Identifier Lexemes

2.Keyword The sequence of character in a source


program matched with a pattern for a
3.Operator token is called lexeme.
4.Special symbol Example: Rate, DIET, count, Flag
5.Constant
Example: Token, Pattern & Lexemes
C code:
printf("Total = %d\n", score);
▪ printf and score are lexemes matching the pattern for token id,
▪ "Total = %d\n" is a lexeme matching literal

In many programming languages, the following classes cover most or all of


the tokens:
Example: Token, Pattern & Lexemes
Example: total = sum + 45
Tokens:
total Identifier1

= Operator1
Tokens
sum Identifier2

+ Operator2

45 Constant1

Lexemes
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
Attributes of Tokens
•The
When
token
more
names
than and
one associated
lexeme canattribute
match a values
pattern,for
thethe
lexical
Fortran
analyzer
statement
must
E =provide
M * C **the
2 subsequent compiler phases additional information about the
areparticular
written below
lexeme
asthat
a sequence
matched.
of pairs.
<id,• pointer
For example,
to symbol-table
the pattern
entry
for for
token
E> number matches both 0 and 1
•<assign
The lexical
op> analyzer returns to the parser not only a token name, but an
<id,
attribute
pointervalue
to symbol-table
that describes
entry
thefor
lexeme
M> represented by the token;
• Normally,
<mult op> information about an identifier e.g., its lexeme, its type, and the
<id, pointer
location
to symbol-table
at which it isentry
firstfor
found
C> is kept in the symbol table. Thus, the
<exp op>
appropriate attribute value for an identifier is a pointer to the symbol-table
<number,
entryinteger
for thatvalue
identifier.
2>
Divide the following program into appropriate lexemes

<float> <id, limitedSquaare> <(> <id, x> <)> <{>


<float> <id, x>
<return> <(> <id, x> <op,"<="> <num, -10.0> <op, "||"> <id, x> <op, ">="> <num, 10.0>
<)> <op, "?"> <num, 100> <op, ":"> <id, x> <op, "*"> <id, x> <}>
Input Buffering
• Finding lexemes:
• We often have to look one or more characters beyond the next lexeme
before we can be sure we have the right lexeme.
• we need to look at least one additional character ahead.
• For instance,
• We cannot be sure we have seen the end of an identifier until we see a character
that is not a letter or digit, and therefore is not part of the lexeme for id.
• In C, single-character operators like -, =, or < could also be the beginning of a two-
character operator like ->, ==, or <=.
• A two-buffer scheme that handles large lookaheads safely
• An improvement involving “sentinels" that saves time checking for the
ends of buffers.
Input buffering
• There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer Pair

• The lexical analysis scans the input string from left to right one character at a time.
• A specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character
• Buffer divided into two N-character halves, where N is the number of character on
one disk block [size - 4096 bytes].
• Using one system read command we can read N characters into a buffer.
• If fewer than N characters remain in the input file, then a special character, represented by eof
: : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :
Buffer pairs
: : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :

forward forward
lexeme_beginnig

• Pointer Lexeme Begin, marks the beginning of the current lexeme.


• Pointer Forward, scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to character at its right end.
• Lexeme Begin is set to the character immediately after the lexeme just found.
• If forward pointer is at the end of first buffer half then second is filled with N input
character.
• If forward pointer is at the end of second buffer half then first is filled with N input
character.
Buffer pairs
: : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :

forward forward
lexeme_beginnig
Code to advance forward pointer
if forward at end of first half then begin
reload second half;
forward := forward + 1;
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half;
end
else forward := forward + 1;
Sentinels
: : E : : = : : Mi : * : eof : C: * : * : 2 : eof : : eof

forward
lexeme_beginnig

• In buffer pairs for each character read, we make two tests.


• We can combine the buffer-end test with the test for the current character.
• We can reduce the two tests to one if we extend each buffer to hold a
sentinel character at the end.
• The sentinel is a special character that cannot be part of the source
program, and a natural choice is the character EOF.
Sentinels
: : E : : = : : Mi : * : eof : C: * : * : 2 : eof : : eof

forward
forward forward
lexeme_beginnig
forward := forward + 1;
if forward = eof then begin
if forward at end of first half then begin
reload second half;
forward := forward + 1;
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half;
end
else terminate lexical analysis;
end
Specification of tokens
Strings and languages
Term Definition
Prefix of s A string obtained by removing zero or more trailing
symbol of string S.
e.g., ban is prefix of banana.
Suffix of S A string obtained by removing zero or more leading
symbol of string S.
e.g., nana is suffix of banana.
Sub string of S A string obtained by removing prefix and suffix from S.
e.g., nan is substring of banana
Proper prefix, suffix Any nonempty string x that is respectively proper prefix,
and substring of S suffix or substring of S, such that s≠x.
Subsequence of S A string obtained by removing zero or more not
necessarily contiguous symbol from S.
e.g., baaa is subsequence of banana.
Exercise
• Write prefix, suffix, substring, proper prefix, proper suffix and
subsequence of following string:
String: Compiler
Operations on languages
Operation Definition
Union of L and M L U M = {s | t is in L or t is in M }
Written as L U M
Concatenation of L
and M LM = {st | s is in L and t is in M }
Written as LM
Kleene closure of L
L∗ denotes “zero or more concatenation of” L.
Written as L∗
Positive closure of L
L+ denotes “one or more concatenation of” L.
Written as L+
Regular Expression &
Regular Definition
Regular expression
• A regular expression is a sequence of characters that define a pattern.
Notational shorthand's
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instances: ?
4. Alphabets: Σ
Rules to define regular expression
1. ∈ is a regular expression that denotes {∈}, the set containing empty
string.
2. If 𝑎 is a symbol in 𝛴 then 𝑎 is a regular expression, 𝐿 𝑎 = {𝑎}
3. Suppose 𝑟 and 𝑠 are regular expression denoting the languages 𝐿 𝑟
and 𝐿 𝑠 . Then,
a. r | s is a regular expression denoting L r U L(s)
b. r s is a regular expression denoting L r L s

c. r * is a regular expression denoting L r
d. r is a regular expression denoting L( r )

The language denoted by regular expression is said to be a regular set.


Regular expression
• L = Zero or More Occurrences of a = a*

𝜖
a
aa Infinite
aaa
…..
aaaa
aaaaa…..
Regular expression
• L = One or More Occurrences of a = a+

a
aa
aaa Infinite …..
aaaa
aaaaa…..
Precedence and associativity of operators
Operator Precedence Associative
Kleene * 1 left
Concatenation . 2 left
Union | 3 left
Regular expression examples
1. 0 or 1
𝐒𝐭𝐫𝐢𝐧𝐠𝐬: 𝟎, 𝟏 𝐑. 𝐄. = 𝟎 | 𝟏

2. 0 or 11 or 111
𝐒𝐭𝐫𝐢𝐧𝐠𝐬: 𝟎, 𝟏𝟏, 𝟏𝟏𝟏 𝐑. 𝐄. = 𝟎 𝟏𝟏 𝟏𝟏𝟏

3. String having zero or more a. ∗


𝐒𝐭𝐫𝐢𝐧𝐠𝐬: 𝛜, 𝐚, 𝐚𝐚, 𝐚𝐚𝐚, 𝐚𝐚𝐚𝐚 … . . 𝐑. 𝐄. = 𝐚

4. String having one or more a. +


𝐒𝐭𝐫𝐢𝐧𝐠𝐬: 𝐚, 𝐚𝐚, 𝐚𝐚𝐚, 𝐚𝐚𝐚𝐚 … . . 𝐑. 𝐄. = 𝐚

5. Regular expression over 𝛴 = {𝑎, 𝑏, 𝑐} that represent all string of length 3.


𝐒𝐭𝐫𝐢𝐧𝐠𝐬: 𝐚𝐛𝐜, 𝐛𝐜𝐚, 𝐛𝐛𝐛, 𝐜𝐚𝐛, 𝐚𝐛𝐚 … . 𝐑. 𝐄. = 𝐚 𝐛 𝐜 𝐚 𝐛 𝐜 (𝐚 𝐛 𝐜)

6. All binary string


𝐒𝐭𝐫𝐢𝐧𝐠𝐬: 𝟎, 𝟏𝟏, 𝟏𝟎𝟏, 𝟏𝟎𝟏𝟎𝟏, 𝟏𝟏𝟏𝟏 … 𝐑. 𝐄. = (𝟎 | 𝟏)+
Regular expression examples
7. 0 or more occurrence of either a or b or both
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝝐, 𝒂, 𝒂𝒂, 𝒂𝒃𝒂𝒃, 𝒃𝒂𝒃 … 𝑹. 𝑬. = (𝒂 | 𝒃) ∗

8. 1 or more occurrence of either a or b or both


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝒂, 𝒂𝒂, 𝒂𝒃𝒂𝒃, 𝒃𝒂𝒃, 𝒃𝒃𝒃𝒂𝒂𝒂 … 𝑹. 𝑬. = (𝒂 | 𝒃)+

9. Binary no. ends with 0


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎, 𝟏𝟎, 𝟏𝟎𝟎, 𝟏𝟎𝟏𝟎, 𝟏𝟏𝟏𝟏𝟎 … 𝑹. 𝑬. = (𝟎 | 𝟏)* 𝟎

10. Binary no. ends with 1


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟏, 𝟏𝟎𝟏, 𝟏𝟎𝟎𝟏, 𝟏𝟎𝟏𝟎𝟏, … 𝑹. 𝑬. = (𝟎 | 𝟏) ∗ 𝟏

11. Binary no. starts and ends with 1


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟏𝟏, 𝟏𝟎𝟏, 𝟏𝟎𝟎𝟏, 𝟏𝟎𝟏𝟎𝟏, … 𝑹. 𝑬. = 𝟏 (𝟎 | 𝟏) ∗ 𝟏

12. String starts and ends with same character


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎, 𝟏𝟎𝟏, 𝒂𝒃𝒂, 𝒃𝒂𝒂𝒃 … 𝑹. 𝑬. = 𝟏 (𝟎 | 𝟏) ∗ 𝟏 𝐨𝐫 𝟎 (𝟎 | 𝟏) ∗ 𝟎
∗ ∗
𝒂 (𝒂 | 𝒃) 𝒂 𝐨𝐫 𝒃 (𝒂 | 𝒃) 𝒃
Regular expression examples
13. All string of a and b starting with a
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝒂, 𝒂𝒃, 𝒂𝒂𝒃, 𝒂𝒃𝒃… 𝑹. 𝑬. = 𝒂(𝒂 | 𝒃)*

14. String of 0 and 1 ends with 00


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎, 𝟏𝟎𝟎, 𝟎𝟎𝟎, 𝟏𝟎𝟎𝟎, 𝟏𝟏𝟎𝟎… 𝑹. 𝑬. = (𝟎 | 𝟏)∗ 𝟎𝟎

15. String ends with abb


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝒂𝒃𝒃, 𝒃𝒂𝒃𝒃, 𝒂𝒃𝒂𝒃𝒃… 𝑹. 𝑬. = (𝒂 | 𝒃)∗ 𝒂𝒃𝒃

16. String starts with 1 and ends with 0


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟏𝟎, 𝟏𝟎𝟎, 𝟏𝟏𝟎, 𝟏𝟎𝟎𝟎, 𝟏𝟏𝟎𝟎… 𝑹. 𝑬. = 𝟏(𝟎 | 𝟏)∗ 𝟎

17. All binary string with at least 3 characters and 3rd character should be zero
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎𝟎, 𝟏𝟎𝟎, 𝟏𝟏𝟎𝟎, 𝟏𝟎𝟎𝟏… 𝑹. 𝑬. = 𝟎 𝟏 𝟎 𝟏 𝟎(𝟎 | 𝟏) ∗

18. Language which consist of exactly two b’s over the set 𝛴 = {𝑎, 𝑏}
∗ ∗ ∗
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝒃𝒃, 𝒃𝒂𝒃, 𝒂𝒂𝒃𝒃, 𝒂𝒃𝒃𝒂… 𝑹. 𝑬. = 𝒂 𝒃 𝒂 𝒃 𝒂
Regular expression examples
24. The language with 𝛴 = {𝑎, 𝑏, 𝑐} where 𝑎 should be multiple of 3
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝒂𝒂𝒂, 𝒃𝒂𝒂𝒂, 𝒃𝒂𝒄𝒂𝒃𝒂, 𝒂𝒂𝒂𝒂𝒂𝒂. . 𝑹. 𝑬. = ( 𝒃|𝒄 ∗ 𝒂 𝒃|𝒄 ∗ 𝒂 𝒃|𝒄 ∗ 𝒂 𝒃|𝒄 ∗ )∗
25. Even no. of 0
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎, 𝟎𝟏𝟎𝟏, 𝟎𝟎𝟎𝟎, 𝟏𝟎𝟎𝟏𝟎𝟎…. 𝑹. 𝑬. = (𝟏∗ 𝟎𝟏∗ 𝟎𝟏∗ )∗
26. String should have odd length
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎, 𝟎𝟏𝟎, 𝟏𝟏𝟎, 𝟎𝟎𝟎, 𝟏𝟎𝟎𝟏𝟎…. 𝑹. 𝑬. = 𝟎|𝟏 ( 𝟎 𝟏 (𝟎|𝟏))∗
27. String should have even length
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎, 𝟎𝟏𝟎𝟏, 𝟎𝟎𝟎𝟎, 𝟏𝟎𝟎𝟏𝟎𝟎…. 𝑹. 𝑬. = ( 𝟎 𝟏 (𝟎|𝟏))∗
28. String start with 0 and has odd length
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎, 𝟎𝟏𝟎, 𝟎𝟏𝟎, 𝟎𝟎𝟎, 𝟎𝟎𝟎𝟏𝟎…. 𝑹. 𝑬. = 𝟎 ( 𝟎 𝟏 (𝟎|𝟏))∗
30. String start with 1 and has even length
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟏𝟎, 𝟏𝟏𝟎𝟎, 𝟏𝟎𝟎𝟎, 𝟏𝟎𝟎𝟏𝟎𝟎…. 𝑹. 𝑬. = 𝟏(𝟎|𝟏)( 𝟎 𝟏 (𝟎|𝟏))∗
31. All string begins or ends with 00 or 11

𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎𝟏𝟎𝟏, 𝟏𝟎𝟏𝟎𝟎, 𝟏𝟏𝟎, 𝟎𝟏𝟎𝟏𝟏 … 𝑹. 𝑬. = (𝟎𝟎|𝟏𝟏)(𝟎 | 𝟏) ∗ | 𝟎 𝟏 (𝟎𝟎|𝟏𝟏)
Regular expression examples
31.Language of all string containing both 11 and 00 as substring
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟎𝟏𝟏, 𝟏𝟏𝟎𝟎, 𝟏𝟎𝟎𝟏𝟏𝟎, 𝟎𝟏𝟎𝟎𝟏𝟏 … 𝑹. 𝑬. = ((𝟎|𝟏)∗ 𝟎𝟎(𝟎|𝟏)∗ 𝟏𝟏(𝟎|𝟏)∗ ) | ((𝟎|𝟏)∗ 𝟏𝟏(𝟎|𝟏)∗ 𝟎𝟎(𝟎|𝟏)∗ )

32.String ending with 1 and not contain 00


𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝟎𝟏𝟏, 𝟏𝟏𝟎𝟏, 𝟏𝟎𝟏𝟏 … . +
𝑹. 𝑬. = 𝟏 𝟎𝟏

33.Language of C identifier
𝑺𝒕𝒓𝒊𝒏𝒈𝒔: 𝒂𝒓𝒆𝒂, 𝒊, 𝒓𝒆𝒅𝒊𝒐𝒖𝒔, 𝒈𝒓𝒂𝒅𝒆𝟏 … . 𝑹. 𝑬. = (_ + 𝑳)(_ + 𝑳 + 𝑫)∗
𝒘𝒉𝒆𝒓𝒆 𝑳 𝒊𝒔 𝑳𝒆𝒕𝒕𝒆𝒓 & 𝐃 𝐢𝐬 𝐝𝐢𝐠𝐢𝐭
Regular definition
• A regular definition gives names to certain regular expressions and uses
those names in other regular expressions.
• Regular definition is a sequence of definitions of the form:
𝑑1 → 𝑟1
𝑑2 → 𝑟2
……
𝑑𝑛 → 𝑟𝑛
Where 𝑑𝑖 is a distinct name & 𝑟𝑖 is a regular expression.
▪ Example: Regular definition for identifier
letter → A|B|C|………..|Z|a|b|………..|z
digit → 0|1|…….|9|
id→ letter (letter | digit)*
Regular definition example
• Example: Unsigned Pascal numbers
3
5280
39.37
6.336E4
1.894E-4
2.56E+7
Regular Definition
digit → 0|1|…..|9
digits → digit digit*
optional_fraction → .digits | 𝜖
optional_exponent → (E(+|-|𝜖)digits)|𝜖
num → digits optional_fraction optional_exponent
Practice exercise
Write regular definitions for the following languages:
• All strings of lowercase letters that contain the five vowels in order.

• All strings of lowercase letters in which the letters are in ascending


lexicographic order.

• Comments, consisting of a string surrounded by /* and */, without an


intervening */, unless it is inside double-quotes (").

• All strings of a’s and b’s that do not contain the substring abb.

• All strings of a’s and b’s that do not contain the subsequence abb.
Solutions
Transition Diagram
Transition Diagram
• A stylized flowchart is called transition diagram.

is a state

is a transition

is a start state

is a final state
Transition Diagram : Relational operator

< =
0 1 2 return (relop,LE)

>
3 return (relop,NE)
=
other
5
4 return (relop,LT)
return (relop,EQ)
>
=
6 7 return (relop,GE)

other
8 return (relop,GT)
Transition diagram : Unsigned number

digit digit digit

start digit . digit E +or - digit other


1 2 3 4 5 6 7 8

E digit
3 other other
5280
9 10
39.37
1.894 E - 4 A transition diagram for whitespace
2.56 E + 7
45 E + 6
96 E 2
Hard coding and automatic generation lexical analyzers
• Hard coding lexical analyzer
• Lexical analysis is about identifying the pattern from the input.
• To recognize the pattern, transition diagram is constructed.
• Example: to represent identifier in ‘C’, the first character must be letter and other
characters are either letter or digits.
• To recognize this pattern, hard coding lexical analyzer will work with a transition
diagram.
• The automatic generation lexical analyzer takes special notation as input.
• Example: lex compiler tool will take regular expression as input and finds out the
pattern matching to that regular expression.

Letter or digit

Start Letter other


1 2 3
Finite Automata
• Finite Automata are recognizers.
• FA simply say “Yes” or “No” about each possible input string.
• Finite Automata is a mathematical model consist of:
1. Set of states 𝑺
2. Set of input symbol 𝜮
3. A transition function move
4. Initial state 𝑺𝟎
5. Final states or accepting states 𝐅
Types of finite automata
• Types of finite automata are:
DFA
b

 Deterministic finite automata (DFA):


have for each state exactly one edge a b b
1 2 3 4
leaving out for each symbol.
a
a
b a
NFA DFA
 Nondeterministic finite automata a
(NFA): There are no restrictions on the
edges leaving a state. There can be a b b
1 2 3 4
several with the same symbol as label
and some edges can be labeled with 𝜖.
b NFA
Regular expression to NFA using Thompson's rule
1. For ∈ , construct the NFA 3. For regular expression 𝑠𝑡

start
start 𝜖 𝑖 N(s) N(t) 𝑓
𝑖 𝑓

Ex: ab
2. For 𝑎 in 𝛴, construct the NFA

start a a b
𝑖 𝑓 1 2 3
Regular expression to NFA using Thompson's rule
4. For regular expression 𝑠|𝑡 5. For regular expression 𝑠*
𝜖
N(s) 𝜖
𝜖
start 𝜖 𝜖
start 𝑖 N(s) 𝑓
𝑖 𝑓

𝜖 N(t) 𝜖 𝜖

Ex: a*
𝜖
Ex: (a|b) a
2 3
𝜖 𝜖 𝜖 𝑎 𝜖
1 2 3 4
1 6

𝜖 𝜖 𝜖
4 5
b
Regular expression to NFA using Thompson's rule
• a*b

𝜖 𝑎 𝜖 𝑏
1 2 3 4 5

𝜖
• b*ab
𝜖 𝑏 𝜖 𝑎 𝑏
1 2 3 4 5 6

𝜖
Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10. (0+1)*010(0+1)*
11. (010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to
DFA using subset
construction method
Subset construction algorithm
Input: An NFA 𝑁.
Output: A DFA D accepting the same language.
Method: Algorithm construct a transition table 𝐷𝑡𝑟𝑎𝑛 for D. We use the
following operation:
OPERATION DESCRIPTION
 − 𝑐𝑙𝑜𝑠𝑢𝑟𝑒(𝑠) Set of NFA states reachable from NFA state 𝑠
on – transition alone.
𝑴𝒐𝒗𝒆 (𝑇, 𝑎) Set of NFA states to which there is a
transition on input symbol 𝑎 from some NFA
state 𝑠 in 𝑇.
Conversion from NFA to DFA
(a|b)*abb 𝜖

a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10

𝜖 𝜖
4 5
b

𝜖
Conversion from NFA to DFA
𝜖

a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10

𝜖 𝜖
4 5
b

𝜖- Closure(0)= {0, 1, 7, 2, 4}

= {0,1,2,4,7} ---- A
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8}

𝜖 𝜖
4 5
b

𝜖
A= {0, 1, 2, 4, 7}
Move(A,a) = {3,8}
𝜖- Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
𝜖 𝜖
4 5
b

𝜖
A= {0, 1, 2, 4, 7}
Move(A,b) = {5}
𝜖- Closure(Move(A,b)) = {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B
C = {1,2,4,5,6,7}
𝜖 𝜖
4 5
b

𝜖
B = {1, 2, 3, 4, 6, 7, 8}
Move(B,a) = {3,8}
𝜖- Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7}
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9}
b

B= {1, 2, 3, 4, 6, 7, 8}
Move(B,b) = {5,9}
𝜖- Closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9}
= {1,2,4,5,6,7,9} ---- D
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9}
b

C= {1, 2, 4, 5, 6 ,7}
Move(C,a) = {3,8}
𝜖- Closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9}
b

𝜖
C= {1, 2, 4, 5, 6, 7}
Move(C,b) = {5}
𝜖- Closure(Move(C,b))= {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9} B
b

D= {1, 2, 4, 5, 6, 7, 9}
Move(D,a) = {3,8}
𝜖- Closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9} B E
b
E = {1,2,4,5,6,7,10}
𝜖

D= {1, 2, 4, 5, 6, 7, 9}
Move(D,b)= {5,10}
𝜖- Closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10}
= {1,2,4,5,6,7,10} ---- E
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9} B E
b
E = {1,2,4,5,6,7,10} B
𝜖
E= {1, 2, 4, 5, 6, 7, 10}
Move(E,a) = {3,8}
𝜖- Closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖

a
2 3 States a b
𝜖 𝜖
A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D = {1,2,4,5,6,7,9} B E
b
E = {1,2,4,5,6,7,10} B C
𝜖
E= {1, 2, 4, 5, 6, 7, 10}
Move(E,b)= {5}
𝜖- Closure(Move(E,b))= {5,6,7,1,2,4}
= {1,2,4,5,6,7} ---- C
Conversion from NFA to DFA
a

b
States a b B D
a
A = {0,1,2,4,7} B C a
B = {1,2,3,4,6,7,8} B D
A a a b
C = {1,2,4,5,6,7} B C
D = {1,2,4,5,6,7,9} B E b
C E
E = {1,2,4,5,6,7,10} B C b

Transition Table
b
Note:
• Accepting state in NFA is 10 DFA
• 10 is element of E
• So, E is acceptance state in DFA
Exercise
• Convert following regular expression to DFA using subset construction
method:
1. (a+b)*a(a+b)
2. (a+b)*ab*a
DFA optimization
1. Construct an initial partition Π of the set of states with two groups: the
accepting states 𝐹 and the non-accepting states 𝑆 − 𝐹.
2. Apply the repartition procedure to Π to construct a new partition Π𝑛𝑒𝑤.
3. If Π 𝑛𝑒𝑤 = Π, let Π𝑓𝑖𝑛𝑎𝑙 = Π and continue with step (4). Otherwise,
repeat step (2) with Π = Π𝑛𝑒𝑤.
for each group 𝐺 of Π do begin
partition 𝐺 into subgroups such that two states 𝑠 and 𝑡
of 𝐺 are in the same subgroup if and only if for all
input symbols 𝑎, states 𝑠 and 𝑡 have transitions on 𝑎
to states in the same group of Π.
replace 𝐺 in Π𝑛𝑒𝑤 by the set of all subgroups formed.
end
DFA optimization
4. Choose one state in each group of the partition Π𝑓𝑖𝑛𝑎𝑙 as the
representative for that group. The representatives will be the states of
𝑀′. Let s be a representative state, and suppose on input a there is a
transition of 𝑀 from 𝑠 to 𝑡. Let 𝑟 be the representative of 𝑡′s group. Then
𝑀′ has a transition from 𝑠 to 𝑟 on 𝑎. Let the start state of 𝑀′ be the
representative of the group containing start state 𝑠0 of 𝑀, and let the
accepting states of 𝑀′ be the representatives that are in 𝐹.
5. If 𝑀′ has a dead state 𝑑, then remove 𝑑 from 𝑀′. Also remove any state
not reachable from the start state.
DFA optimization
States a b
{𝐴, 𝐵, 𝐶, 𝐷, 𝐸}
A B C
B B D
Nonaccepting States Accepting States
C B C
{𝐴, 𝐵, 𝐶, 𝐷} {𝐸}
D B E
E B C
{𝐴, 𝐵, 𝐶} {𝐷}

States a b
{𝐴, 𝐶} {𝐵}
A B A
B B D

• Now no more splitting is possible. D B E


E B A
• If we chose A as the representative Optimized
for group (AC), then we obtain Transition Table
reduced transition table
Conversion from regular
expression to DFA
Rules to compute nullable, firstpos, lastpos
• nullable(n)
• The subtree at node 𝑛 generates languages including the empty string.

• firstpos(n)
• The set of positions that can match the first symbol of a string generated by the subtree at
node 𝑛.

• lastpos(n)
• The set of positions that can match the last symbol of a string generated be the subtree at
node 𝑛.

• followpos(i)
• The set of positions that can follow position 𝑖 in the tree.
Rules to compute nullable, firstpos, lastpos
Node n nullable(n) firstpos(n) lastpos(n)
A leaf labeled by  true ∅ ∅
A leaf with
false {i} {i}
position 𝐢

n | nullable(c1) firstpos(c1) lastpos(c1)


or  
c1 c2 nullable(c2) firstpos(c2) lastpos(c2)
if (nullable(c1)) if (nullable(c2))
n . nullable(c1)
then firstpos(c1)  then lastpos(c1) 
and
c1 firstpos(c2) lastpos(c2)
c2 nullable(c2)
else firstpos(c1) else lastpos(c2)
n ∗
true firstpos(c1) lastpos(c1)
c1
Rules to compute followpos
1. If n is concatenation node with left child c1 and right child c2 and i is a
position in lastpos(c1), then all position in firstpos(c2) are in followpos(i)

2. If n is * node and i is position in lastpos(n), then all position in


firstpos(n) are in followpos(i)
Conversion from regular expression to DFA
Example 1:
(a|b)* abb # Step 1: Construct Syntax Tree
. Step 2: Nullable node
.
#
𝟔
Here, * is only nullable node
.
𝑏
. 𝟓
𝑏
𝟒
∗ 𝑎
𝟑

𝑎 𝑏
𝟏 𝟐
Conversion from regular expression to DFA
Step 3: Calculate firstpos
Firstpos
{1,2,3} .

{1,2,3} . A leaf with position 𝒊 = {𝒊}


{6} #
{1,2,3} . 𝟔
{5} 𝑏 n
|
. 𝟓 firstpos(c1)  firstpos(c2)
{1,2,3} {4} 𝑏
𝟒 c1 c2
{1,2} ∗ {3} 𝑎
n ∗
𝟑 firstpos(c1)
c1
{1,2} |
n if (nullable(c1))
.
𝑎 𝑏 thenfirstpos(c1) 
{1} 𝟏 {2}𝟐 firstpos(c2)
c1 c2 else firstpos(c1)
Conversion from regular expression to DFA
Step 3: Calculate lastpos
Lastpos
{1,2,3} . {6}

{1,2,3} . {5} A leaf with position 𝒊 = {𝒊}


{6} # {6}
{1,2,3} . {4} 𝟔
n
{5} 𝑏 {5} |
{1,2,3} . {3} {4} 𝑏 {4} 𝟓 lastpos(c1)  lastpos(c2)
c1 c2
𝟒
{1,2} ∗ {1,2} {3} 𝑎 {3} n ∗
𝟑 lastpos(c1)
c1
{1,2} | {1,2}
n if (nullable(c2)) then
.
𝑎 𝑏 lastpos(c1)  lastpos(c2)
{1} {1} {2} {2} else lastpos(c2)
𝟏 𝟐 c1 c2
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos {1,2,3} . {6}
Lastpos
{1,2,3} . {5}
{6} # {6}
{1,2,3} . {4} 𝟔
{5} 𝑏 {5}
{1,2,3} . {3} {4} 𝑏 {4} 𝟓 .
𝟒
{1,2} ∗ {1,2} {3} 𝑎 {3} {1,2,3} 𝒄 {5}
𝟏
{6} 𝒄𝟐 {6}
𝟑

{1,2} | {1,2}
𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {5}
𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 6
𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 5 = 6
{1} {1} {2} {2}
𝟏 𝟐
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
{1,2,3} . {6}
4 5
{1,2,3} . {5}
{6} # {6}
{1,2,3} . {4} 𝟔
{5} 𝑏 {5}
{1,2,3} . {3} {4} 𝑏 {4} 𝟓 .
𝟒
{1,2} ∗ {1,2} {3} 𝑎 {3} {1,2,3} 𝒄 {4}
𝟏
{5} 𝒄𝟐 {5}
𝟑

{1,2} | {1,2}
𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {4}
𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 5
𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 4 = 5
{1} {1} {2} {2}
𝟏 𝟐
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos {1,2,3} . {6}
4 5
Lastpos
{1,2,3} . {5} 3 4
{6} # {6}
{1,2,3} . {4} 𝟔
{5} 𝑏 {5}
{1,2,3} . {3} {4} 𝑏 {4} 𝟓 .
𝟒
{1,2} ∗ {1,2} {3} 𝑎 {3} {1,2,3} 𝒄 {3}
𝟏
{4} 𝒄𝟐 {4}
𝟑

{1,2} | {1,2}
𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {3}
𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 4
𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 3 = 4
{1} {1} {2} {2}
𝟏 𝟐
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos {1,2,3} . {6}
4 5
Lastpos
{1,2,3} . {5} 3 4
{6} # {6}
2 3
{1,2,3} . {4} 𝟔
{5} 𝑏 {5} 1 3
{1,2,3} . {3} {4} 𝑏 {4} 𝟓 .
𝟒
{1,2} ∗ {1,2} {3} 𝑎 {3} {1,2} 𝒄 {1,2} {3} 𝒄 {3}
𝟏 𝟐
𝟑

{1,2} | {1,2}
𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {1,2}
𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 3
𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 1 = 3
{1} {1} {2} {2}
𝟏 𝟐 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 2 = 3
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos {1,2,3} . {6}
4 5
Lastpos
{1,2,3} . {5} 3 4
{6} # {6}
2 1,2, 3
{1,2,3} . {4} 𝟔
{5} 𝑏 {5} 1 1,2, 3
{1,2,3} . {3} {4} 𝑏 {4} 𝟓
𝟒 {1,2} * {1,2}
{1,2} ∗ {1,2} {3} 𝑎 {3} 𝒏
𝟑

{1,2} | {1,2}
𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑛) = {1,2}
𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑛 = 1,2
𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 1 = 1,2
{1} {1} {2} {2}
𝟏 𝟐 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 2 = 1,2
Conversion from regular expression to DFA
Initial state = 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 of root = {1,2,3} ----- A
Position followpos
State A 5 6
δ( (1,2,3),a) = followpos(1) U followpos(3) 4 5
3 4
=(1,2,3) U (4) = {1,2,3,4} ----- B
2 1,2,3
1 1,2,3
δ( (1,2,3),b) = followpos(2)
=(1,2,3) ----- A States a b
A={1,2,3} B A
B={1,2,3,4}
Conversion from regular expression to DFA
State B
Position followpos
δ( (1,2,3,4),a) = followpos(1) U followpos(3)
5 6
=(1,2,3) U (4) = {1,2,3,4} ----- B 4 5
3 4
δ( (1,2,3,4),b) = followpos(2) U followpos(4) 2 1,2,3

=(1,2,3) U (5) = {1,2,3,5} ----- C 1 1,2,3

State C
δ( (1,2,3,5),a) = followpos(1) U followpos(3) States a b
A={1,2,3} B A
=(1,2,3) U (4) = {1,2,3,4} ----- B
B={1,2,3,4} B C
C={1,2,3,5} B D
δ( (1,2,3,5),b) = followpos(2) U followpos(5) D={1,2,3,6}

=(1,2,3) U (6) = {1,2,3,6} ----- D


Conversion from regular expression to DFA
State D
Position followpos
δ( (1,2,3,6),a) = followpos(1) U followpos(3) 5 6
=(1,2,3) U (4) = {1,2,3,4} ----- B 4 5
3 4

δ( (1,2,3,6),b) = followpos(2) 2 1,2,3


1 1,2,3
=(1,2,3) ----- A
b
a States a b
A={1,2,3} B A
a b b B={1,2,3,4} B C
A B C D
C={1,2,3,5} B D
a
a D={1,2,3,6} B A
b

DFA
Example 2
Convert the following RE to
DFA (Direct Method):
ba(a+b)*ab
Practice Problems
Convert the following regular expressions to deterministic finite
automata:
1. (a|b)*
2. (a*|b*)*
3. ((ε|a)|b*)*
4. (a|b)*abb(a|b)*
Solutions to practice problems
(a*|b*)*
(a|b)*

You might also like