Unit 2: Lexical Analysis

It is the process by which a program reads an input stream of
characters, groups them into lexemes and converts them into
tokens to be analyzed by the parser.
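For example (a standard illustration, with illustrative token names), for the input statement
position = initial + rate * 60
the lexemes are position, =, initial, +, rate, * and 60, and the corresponding token stream is
<id, position> <assign> <id, initial> <plus> <id, rate> <times> <number, 60>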
Errors in lexical analysis phase

1. Undefined symbol
2. Invalid character
3. Unterminated comments
Main functions of lexical analyzer

1. Separate the lexemes.
2. Handle reserved words, operators and special symbols according to the
specification of terminal and non-terminal symbols.
3. Identify user-defined words, i.e. words that do not contain a blank, hash or
slash character, and check whether each user-defined word follows the rules
defined by the language.
4. Tokenizing: tokens of the form <token, value> are formed.
5. Enter the information into the symbol table.
Additional Task

 It removes spaces and comments from the input text.

 For every ‘\n’ it assigns a line number.

 It identifies a string of digits and converts it into its internal binary
representation (see the sketch after this list).

 It processes single and double special characters (e.g. < versus <=).

 It recognizes string constants.

 It deletes characters and recovers from errors while processing
illegal strings.
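As an illustrative sketch (hypothetical helper, not from the slides), converting a string of digits into its internal binary (integer) representation can be done as follows:

    /* Hypothetical helper: convert a run of digits starting at *p
       into its internal (binary) integer value and advance the pointer. */
    #include <ctype.h>

    int scan_number(const char **p)
    {
        int value = 0;
        while (isdigit((unsigned char)**p)) {
            value = value * 10 + (**p - '0');   /* accumulate the decimal digits  */
            (*p)++;                             /* advance the forward pointer    */
        }
        return value;                           /* internal binary representation */
    }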
Error recovery in lexical analysis
 There are two issues to be considered if an error is diagnosed during lexical
analysis.
 One option is that the lexical analyzer abandons the statement completely and goes on to the
next step.
 The other is to identify the error, correct it internally and call the parser for syntax
analysis.

 A situation may arise in which the lexical analyzer is unable to
proceed because none of the patterns for tokens matches a prefix of the remaining
input. For this the simplest recovery strategy, panic mode recovery, is used.
 In this strategy, if an error is diagnosed, successive characters are deleted from the
remaining input until the analyzer can find a well-formed token (a minimal sketch follows).
By itself this is not a proper way to handle a lexical error, because the
tokens have to be sent to the parser for syntax analysis.
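A minimal sketch of panic-mode recovery, assuming a hypothetical helper name and a fixed set of characters that may begin a token:

    /* Illustrative panic-mode recovery: delete successive characters from the
       remaining input until one is found that can begin some token. */
    #include <ctype.h>
    #include <string.h>

    const char *panic_recover(const char *p)
    {
        while (*p != '\0' &&
               !isalnum((unsigned char)*p) &&
               strchr("+-*/=<>();,", *p) == NULL)
            p++;            /* the offending characters are skipped (deleted) */
        return p;           /* scanning resumes from this position            */
    }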
Other possible recovery methods

Inserting a missing character.

Replacing an incorrect character by the correct character.

Transposing two adjacent characters.

Deleting an extra character.


Token, Patterns ,Lexemes

----Token describes the class or category of the input string,
e.g. identifiers, keywords and constants are tokens.

----Pattern defines the set of rules that describe the token.

----Lexemes are the sequences of characters in the source program
that match the pattern for a token.
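For example, in the statement count = count + 1, the lexeme count matches the pattern letter (letter | digit)* and yields the token <id, pointer to the symbol-table entry for count>, while the lexeme 1 matches the pattern digit+ and yields the token <number, 1>.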
Token

A token is a pair consisting of a token name and an optional attribute value.

A token name is an abstract symbol representing a kind of lexical unit.

Tokens are the input symbols that the parser processes.
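A minimal sketch in C (hypothetical names) of the <token name, attribute value> pair:

    /* Hypothetical representation of a token as a <name, attribute> pair. */
    enum TokenName { TOK_ID, TOK_NUMBER, TOK_IF, TOK_ASSIGN, TOK_PLUS };

    struct Token {
        enum TokenName name;   /* abstract symbol: the kind of lexical unit     */
        int            attr;   /* optional attribute, e.g. a symbol-table index */
    };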
Pattern
 A pattern is a description of the form that the lexemes of a token may take.

 In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword.

 For identifiers and some other tokens, the pattern is a more complex structure
that is matched by many strings.


Operations on Languages: union, concatenation and Kleene closure (defined under Notations below).
Regular Expression

Regular expressions are a mathematical notation that describes the set of
strings of a specific language.

They provide a convenient and useful notation for representing tokens.

The grammar defined by regular expressions is known as a regular grammar;
the language defined by a regular grammar is known as a regular language.
Notations
If r and s are regular expressions denoting the
languages L(r) and L(s), then

Union : (r)|(s) is a regular expression denoting L(r) ∪ L(s).

Concatenation : (r)(s) is a regular expression denoting L(r)L(s).

Kleene closure : (r)* is a regular expression denoting (L(r))*.

(r) is a regular expression denoting L(r).
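For example, (a|b)*abb is a regular expression denoting the language of all strings of a's and b's that end in abb.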


Precedence and Associativity

*, concatenation (.) and | (pipe) are left associative.

* has the highest precedence.

Concatenation (.) has the second highest precedence.

| (pipe) has the lowest precedence of all.
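Under these conventions, for example, the regular expression (a)|((b)*(c)) may be written simply as a|b*c; both denote the set of strings that are either a single a or zero or more b's followed by a c.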


Regular definition
 If Σ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form:

d1 → r1
d2 → r2
.
.
dn → rn

where each di is a new symbol and each ri is a regular expression
over the alphabet Σ ∪ {d1, d2, …, di-1}.
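For example, identifiers can be given by the regular definition:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*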
Recognition of tokens

For this, transition diagrams are used.

A transition diagram depicts the actions that take place when the lexical
analyzer is called by the parser to get the next token.

It is used to keep track of information about the characters that are seen
as the forward pointer scans the input (an illustrative simulation follows).
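As an illustrative sketch (hypothetical function name), the transition diagram for identifiers, a letter followed by letters or digits, can be simulated in C as follows:

    /* Illustrative simulation of the transition diagram for identifiers:
       state 0 expects a letter, state 1 consumes letters/digits, and on any
       other character the diagram retracts forward and accepts. */
    #include <ctype.h>

    int match_identifier(const char *lexemeBegin, const char **forward)
    {
        const char *p = lexemeBegin;
        if (!isalpha((unsigned char)*p))    /* state 0: first character must be a letter */
            return 0;                       /* fail: another diagram will be tried       */
        p++;
        while (isalnum((unsigned char)*p))  /* state 1: letters and digits               */
            p++;
        *forward = p;                       /* the lexeme is [lexemeBegin, p)            */
        return 1;                           /* accepting state reached                   */
    }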


The lexical analyzer Generator Lex

Lex: the Lex compiler is a tool that allows one to specify a lexical
analyzer by giving regular expressions to describe the patterns for tokens.

Inputs are specified in the Lex language, and the tool itself is the
Lex compiler.

The Lex compiler transforms the input patterns into a transition
diagram and generates code in a file called lex.yy.c.


Use of Lex
Overview of Lex

  An input file lex.l is written in the Lex language and describes
the lexical analyzer to be generated.

  The Lex compiler transforms lex.l to a C program, in a file that is
always named lex.yy.c.

  The latter file is compiled by the C compiler into a file called
a.out.

  The C compiler output is a working lexical analyzer that can take
a stream of input characters and produce a stream of tokens.


More about a.out
It is used as a subroutine of the parser.

It is a C function that returns an integer, which is a code for one of the
possible token names.

The attribute value is placed in a global variable yylval, which is shared
between the lexical analyzer and the parser, thereby making it simple to
return both the name and an attribute value of a token.
Structure of a Lex Program

A Lex program consists of

declarations
%%
translation rules
%%
auxiliary functions
Structure of a Lex Program

The declarations section includes declarations of variables,
constants and regular definitions.

The translation rules each have the form

Pattern {Action}

Each pattern is a regular expression, which may use the
regular definitions of the declarations section.


Structure of a Lex Program

The actions are fragments of code, typically written in C.

The third section holds additional functions used in the actions.
Alternatively, these functions can be compiled separately and loaded
with the lexical analyzer.
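A minimal illustrative Lex specification (not the one from the slides; the token codes and token set are assumed for the example) that follows this three-part structure; it can be processed with lex lex.l followed by cc lex.yy.c -ll:

    %{
    /* Declarations section: C code and constants copied into lex.yy.c. */
    #include <stdlib.h>
    #define NUMBER 256            /* hypothetical token codes returned to the parser */
    #define ID     257
    int yylval;                   /* attribute value shared with the parser          */
    %}
    digit   [0-9]
    letter  [A-Za-z]
    %%
    [ \t\n]+                      { /* skip blanks, tabs and newlines */ }
    {digit}+                      { yylval = atoi(yytext); return NUMBER; }
    {letter}({letter}|{digit})*   { return ID; }
    .                             { return yytext[0];  /* single special character */ }
    %%
    int yywrap(void) { return 1; }   /* auxiliary function: stop at end of input */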
Finite Automata(FA)
Finite Automata (FA) is the simplest machine to recognize patterns.

A finite automaton consists of the following:

 Q : finite set of states.
 ∑ : set of input symbols.
 q : initial state.
 F : set of final states.
 δ : transition function.

The formal specification of the machine is { Q, ∑, q, F, δ }.
Deterministic Finite Automata (DFA)

A DFA consists of the 5-tuple {Q, ∑, q, F, δ}.

Q : set of all states.
∑ : set of input symbols (symbols which the machine takes as input).
q : initial state (starting state of the machine).
F : set of final states.
δ : transition function, defined as δ : Q × ∑ → Q.
Deterministic Finite Automata (DFA)

In a DFA, for a particular input character, the machine goes to one state only.

A transition function is defined on every state for every input symbol.

Also, in a DFA null (or ε) moves are not allowed, i.e., a DFA cannot change
state without consuming an input character.
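An illustrative table-driven sketch (assumed state numbering) of the DFA that accepts strings of a's and b's ending in abb:

    /* DFA for (a|b)*abb: Q = {0,1,2,3}, q = 0, F = {3}.
       Row = current state, column 0 = input 'a', column 1 = input 'b'. */
    static const int next_state[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0}    /* state 3 (accepting) */
    };

    int dfa_accepts(const char *w)
    {
        int state = 0;                               /* q: the initial state     */
        for (; *w != '\0'; w++) {
            if (*w != 'a' && *w != 'b')
                return 0;                            /* symbol not in ∑          */
            state = next_state[state][*w == 'b'];    /* δ(state, symbol)         */
        }
        return state == 3;                           /* accept iff state is in F */
    }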
Nondeterministic Finite Automata(NFA)

An NFA is similar to a DFA except for the following additional features:

1. Null (or ε) moves are allowed, i.e., it can move forward without reading
a symbol.

2. The ability to transition to any number of states for a particular input.
Nondeterministic Finite Automata(NFA)

However, these additional features do not add any power to the NFA; if we
compare the two in terms of power, both are equivalent.

Because of these additional features, an NFA has a different transition
function; the rest is the same as a DFA.

δ : transition function, defined as δ : Q × (∑ ∪ ε) → 2^Q.
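For example, an NFA for (a|b)*abb can have δ(q0, a) = {q0, q1}: on input a the machine may either stay in q0 or move to q1, so the simulation tracks a set of possible current states (an element of 2^Q).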
Some Important Points:
1. Every DFA is an NFA, but not vice versa.

2. Both NFA and DFA have the same power, and each NFA can be translated
into a DFA.

3. There can be multiple final states in both a DFA and an NFA.

4. NFA is more of a theoretical concept.

5. DFA is used in lexical analysis in a compiler.


Applications: Finite Automata in Practice

 Grep : Global Regular Expression Print

A Unix pattern-matching utility that searches files for a string of text and outputs the lines that
contain the pattern.


 Thermostats

 Coke Machines

 Elevators

 Train Track Switches

 Security Properties

 Lexical Analyzers for Parsers


The structure of the generated analyzer

 The fig gives an overview of the architecture of a lexical analyzer
generated by Lex.

 The generated lexical analyzer program includes:

 • a program to simulate the automaton;
 • components created from the Lex program by Lex itself, which are
listed as follows:
 o a transition table for the automaton;
 o functions that are passed directly through Lex to the output;
 o actions from the input program (fragments of code) which are
invoked by the automaton simulator when needed.
The structure of the generated analyzer

Steps to construct the automaton:

Step 1: Convert each regular expression into an NFA, either by
Thompson's construction or by the direct method.
Step 2: Combine all the NFAs into one by introducing a new start state
with ε-transitions to each of the start states of the NFAs Ni for
patterns Pi.
Step 2 is needed because the objective is to construct a single
automaton that recognizes lexemes matching any of the patterns.
a    { action A1 for pattern P1 }
abb  { action A2 for pattern P2 }
a*b+ { action A3 for pattern P3 }

For the string abb, both pattern P2 and pattern P3 match, but pattern P2
will be taken into account because it is listed first in the Lex program.
A string such as aabbb··· has many prefixes that match the third
pattern P3; Lex chooses the longest matching prefix.
Fig. shows NFAs for recognizing the three patterns mentioned above.
The combined NFA for all three given patterns is shown in Fig.
Pattern Matching Based on NFAs

 The lexical analyzer reads input from the input buffer starting at the beginning of the
lexeme, pointed to by the pointer lexemeBegin.

 The forward pointer is used to move ahead through the input symbols, and the simulator
calculates the set of states it is in at each point.

 If the NFA simulation has no next state for some input symbol, then no longer prefix
that reaches an accepting state can exist.

 In that case, the decision is based on the longest prefix seen so far, i.e., a
lexeme matching some pattern.

 The process is repeated until one or more accepting states are reached. If there
are several accepting states, then the pattern Pi which appears earliest in
the list of the Lex program is chosen.

e.g. W = aaba
Explanation
The process starts with the ε-closure of the initial state 0. After
processing all the input symbols, no state is found, because there is no
transition out of state 8 on input a. Hence, we look for an accepting
state by retracting to a previous state. From Fig., state 2, which is an
accepting state, is reached after reading the input symbol a, and
therefore the pattern a has been matched. At state 8, the string aab has
been matched with the pattern a*b+. By the Lex rule, the longest matching
prefix should be considered; hence, action A3, corresponding to pattern
P3, will be executed for the string aab.
