Unit 2: Lexical Analysis
Lexical analysis is the process by which a program reads an input stream of
characters, groups them into lexemes, and converts them into tokens to be
analyzed by the parser.
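For example, given the input statement position = initial + rate * 60, the
lexical analyzer produces a stream of tokens such as:
<id, position> <=> <id, initial> <+> <id, rate> <*> <number, 60>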
Errors in the lexical analysis phase
1. Undefined symbol
2. Invalid character
3. Unterminated comment
Main functions of the lexical analyzer
1. Separate out the lexemes.
2. Recognize reserved words, operators and special symbols according to
the specification of terminal and non-terminal symbols.
3. Identify user-defined words (words that contain no blank, hash or
slash character) and check whether each user-defined word follows the
rules defined by the language.
4. Tokenizing: form tokens of the form <token, value>.
5. Enter the information into the symbol table.
Additional Tasks
It removes spaces and comments from the input text.
For every '\n' it assigns a line number.
It identifies a string of digits and converts it into its internal
binary representation.
It processes single special characters and double special characters
(e.g. < and <=).
It recognizes string constants.
It deletes erroneous characters and recovers from errors while
processing illegal strings.
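A minimal C sketch of these housekeeping tasks, assuming a simple
stdio-based scanner (the function scan_number and its interface are
illustrative, not part of any standard API):

#include <ctype.h>
#include <stdio.h>

int line_no = 1;  /* a line number is assigned for every '\n' seen */

/* Skip blanks (counting lines), then convert a run of digits into
   its internal binary (int) representation. Returns the value, or
   -1 if the next non-blank character is not a digit. */
int scan_number(FILE *in)
{
    int c = fgetc(in);
    while (c == ' ' || c == '\t' || c == '\n') {
        if (c == '\n')
            line_no++;               /* assign line numbers on '\n' */
        c = fgetc(in);
    }
    if (!isdigit(c)) {
        if (c != EOF) ungetc(c, in);
        return -1;
    }
    int value = 0;
    while (c != EOF && isdigit(c)) {
        value = value * 10 + (c - '0');  /* digit string -> binary value */
        c = fgetc(in);
    }
    if (c != EOF) ungetc(c, in);
    return value;
}

int main(void)
{
    printf("line %d: value %d\n", line_no, scan_number(stdin));
    return 0;
}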
Error recovery in lexical analysis
There are two options to consider when an error is diagnosed during lexical
analysis:
1. The lexical analyzer skips the erroneous statement completely and goes
on to the next one.
2. The lexical analyzer identifies the error, corrects it internally, and
calls the parser for syntax analysis.
A situation may arise in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches a prefix of the remaining
input. For this we use the simplest recovery strategy, panic-mode recovery:
successive characters are deleted from the remaining input until the lexical
analyzer can find a well-formed token. This is not always the proper way to
handle a lexical error, however, because the tokens still have to be sent to
the parser for syntax analysis.
Other possible recovery methods
Inserting a missing character.
Replacing an incorrect character with the correct character.
Transposing two adjacent characters.
Deleting an extra character.
For example, the misspelled keyword "whlie" can be repaired by transposing
the adjacent characters l and i to obtain "while".
Token, Pattern, Lexeme
---- A token describes a class or category of input strings,
e.g. identifiers, keywords and constants are tokens.
---- A pattern is the set of rules that describes a token.
---- A lexeme is a sequence of characters in the source program
that matches the pattern for a token.
Token
A token is a pair consisting of a token name and an
optional attribute value.
A token name is an abstract symbol representing a
kind of lexical unit.
They are the input symbols that the parser
processes.
Pattern
A pattern is a description of the form that the lexemes of
a token may take.
In the case of a keyword as a token, the pattern is just
the sequence of characters that forms the keyword.
For identifiers and some other tokens, the pattern is a
more complex structure that is matched by many strings.
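For example, the pattern for the token id can be written as
letter (letter | digit)*, which is matched by lexemes such as rate, x1
and initial.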
Regular Expression
Regular expressions are a mathematical notation that describes
the set of strings of a specific language.
They provide a convenient and useful notation for representing
the patterns of tokens.
The grammar defined by regular expressions is known
as regular grammar. The language defined by regular
grammar is known as regular language.
Notations
If r and s are regular expressions denoting the
languages L(r) and L(s), then:
Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s)
Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure: (r)* is a regular expression denoting (L(r))*
Parentheses: (r) is a regular expression denoting L(r)
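For example, over the alphabet ∑ = {a, b}: a|b denotes the language
{a, b}; (a|b)(a|b) denotes {aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, …};
and (a|b)*abb denotes all strings of a's and b's ending in abb.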
Precedence and Associativity
The operators *, concatenation (.) and | (union) are left
associative.
* has the highest precedence.
Concatenation (.) has the second highest precedence.
| has the lowest precedence of all.
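Under these conventions, for example, the regular expression a|b*c is
interpreted as a|((b*)c): the closure applies only to b, the
concatenation groups b* with c, and the union is applied last.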
Regular definition
If Σ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form:
d1 → r1
d2 → r2
.
dn → rn
where each di is a new symbol and each ri is a regular expression
over the alphabet Σ ∪ {d1, d2, …, di-1}.
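For example, identifiers can be described by the regular definition:
letter → a | b | … | z | A | B | … | Z
digit  → 0 | 1 | … | 9
id     → letter ( letter | digit )*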
Recognition of tokens
For this, transition diagrams are used.
It is a diagrammatic representation to depict the action
that will take place when a lexical analyzer is called by
the parser to get the next token.
It is used to keep track of information about the characters
that are seen as the forward pointer scans the input.
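As a sketch, the transition diagram for identifiers
(letter (letter | digit)*) can be simulated in C as follows; the state
numbering and the scan_identifier interface are illustrative assumptions:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Simulate the transition diagram for identifiers:
   state 0 --letter--> state 1; state 1 --letter|digit--> state 1;
   state 1 --other--> accept (the non-matching character is never
   consumed, which plays the role of retracting the forward pointer). */
int scan_identifier(const char **forward, char *lexeme)
{
    const char *p = *forward;    /* the forward pointer */
    const char *begin = p;       /* lexemeBegin */
    int state = 0;
    for (;;) {
        unsigned char c = (unsigned char)*p;
        switch (state) {
        case 0:
            if (isalpha(c)) { state = 1; p++; }
            else return 0;       /* no identifier starts here */
            break;
        case 1:
            if (isalnum(c)) { p++; }       /* stay in state 1 */
            else {                         /* accepting: copy out lexeme */
                size_t n = (size_t)(p - begin);
                memcpy(lexeme, begin, n);
                lexeme[n] = '\0';
                *forward = p;
                return 1;
            }
            break;
        }
    }
}

int main(void)
{
    const char *fwd = "rate = 60";
    char lexeme[64];
    if (scan_identifier(&fwd, lexeme))
        printf("id lexeme: %s\n", lexeme);   /* prints "rate" */
    return 0;
}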
The lexical analyzer Generator Lex
Lex: The Lex compiler is a tool that allows one to specify
a lexical analyzer by giving regular expressions that
describe the patterns for tokens.
The input is specified in the Lex language, and the tool itself
is the Lex compiler.
The Lex compiler transforms the input patterns into a transition
diagram and generates code in a file called lex.yy.c.
Overview of Lex
An input file lex.l is written in the Lex language and describes
the lexical analyzer to be generated.
The Lex compiler transforms lex.l into a C program, in a file that is
always named lex.yy.c.
The latter file is compiled by the C compiler into a file called
a.out.
The C compiler output is a working lexical analyzer that can take
a stream of input characters and produce a stream of tokens.
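On a typical Unix system the whole pipeline is (the command names are
the standard ones; the input file name lex.l is by convention):

lex lex.l          # produces lex.yy.c
cc lex.yy.c -ll    # compiles it, linking the Lex library, into a.out
./a.out < source   # runs the generated analyzer on an input stream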
More about a.out
It is a subroutine of the parser.
It is a C function that returns an integer,
which is a code for one of the possible token
names.
The attribute value is placed in a global
variable yylval, which is shared between the
lexical analyzer and the parser, thereby making
it simple to return both the name and an
attribute value of a token.
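A small complete Lex program sketching this interface (the token code
NUMBER is an illustrative constant; in practice it would come from the
parser, e.g. via y.tab.h):

%{
#include <stdio.h>
#include <stdlib.h>
#define NUMBER 258   /* illustrative token code; normally defined by the parser */
int yylval;          /* attribute value shared with the parser */
%}
%%
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
[ \t\n]   { /* strip whitespace */ }
.         { return yytext[0]; }
%%
int yywrap(void) { return 1; }
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)   /* yylex() returns an integer token code */
        printf("token %d (yylval = %d)\n", tok, yylval);
    return 0;
}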
Structure of a Lex Program
A Lex program consists of
declarations
%%
translation rules
%%
auxiliary functions
The declaration section includes declarations of variables,
constants and regular definitions.
The translation rules each have the form:
Pattern { Action }
Each pattern is a regular expression which may use the
regular definitions of the declaration section.
The actions are fragments of code typically written in C.
The third section holds additional functions used in the
actions. Alternatively these functions can be compiled
separately and loaded with the lexical analyzer.
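For instance, a word-and-line counter showing all three sections (a
sketch; the helper count_word exists only to illustrate the
auxiliary-functions section):

%{
/* declarations section: C declarations and regular definitions below */
#include <stdio.h>
int words = 0, lines = 0;
void count_word(void);
%}
word    [A-Za-z]+
%%
{word}  { count_word(); }   /* pattern {action} */
\n      { lines++; }
.       { /* ignore everything else */ }
%%
/* auxiliary functions section */
void count_word(void) { words++; }
int yywrap(void) { return 1; }
int main(void)
{
    yylex();
    printf("%d words, %d lines\n", words, lines);
    return 0;
}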
Finite Automata (FA)
A finite automaton (FA) is the simplest machine that recognizes
patterns.
A finite automaton consists of the following:
Q : Finite set of states.
∑ : set of Input Symbols.
q : Initial state.
F : set of Final States.
δ : Transition Function. Formal specification of machine is
{ Q, ∑, q, F, δ }.
Deterministic Finite Automata (DFA)
A DFA is a 5-tuple {Q, ∑, q, F, δ}.
Q : set of all states.
∑ : set of input symbols. ( Symbols which
machine takes as input )
q : Initial state. ( Starting state of a machine )
F : set of final state.
δ : Transition function, defined as δ : Q × ∑ → Q.
In a DFA, for a particular input character, the
machine goes to one state only.
A transition function is defined on every state
for every input symbol.
Also, in a DFA, null (or ε) moves are not allowed,
i.e., a DFA cannot change state without consuming an
input character.
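As a sketch, a DFA is naturally simulated in C with a transition table;
the machine below (assumed here for illustration) accepts strings over
{a, b} ending in abb:

#include <stdio.h>

/* Transition table for a DFA accepting (a|b)*abb.
   Rows are states; columns are inputs: 0 = 'a', 1 = 'b'. */
static const int delta[4][2] = {
    {1, 0},   /* state 0 */
    {1, 2},   /* state 1 */
    {1, 3},   /* state 2 */
    {1, 0},   /* state 3 (accepting) */
};

int dfa_accepts(const char *s)
{
    int state = 0;                        /* q: initial state */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;                     /* symbol outside the alphabet */
        state = delta[state][*s == 'b'];  /* exactly one next state */
    }
    return state == 3;                    /* F = {3} */
}

int main(void)
{
    printf("%d\n", dfa_accepts("aababb"));  /* 1: ends in abb */
    printf("%d\n", dfa_accepts("abba"));    /* 0 */
    return 0;
}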
Nondeterministic Finite Automata(NFA)
An NFA is similar to a DFA except for the following
additional features:
1. Null (or ε) moves are allowed, i.e., it can move
forward without reading symbols.
2. The ability to transition to any number of states
for a particular input.
However, the above features do not add any power to
the NFA: if we compare the two in terms of power,
both are equivalent.
Due to the above additional features, an NFA has a
different transition function; the rest is the same
as for a DFA.
δ : Transition function, defined as δ : Q × (∑ ∪ {ε}) → 2^Q.
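For example, in the standard NFA for (a|b)*abb with start state q0,
δ(q0, a) = {q0, q1}: a single input symbol can lead to a set of states,
and δ(q, ε) may be non-empty as well.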
Some important points:
1. Every DFA is an NFA, but not vice versa.
2. Both NFA and DFA have the same power, and each NFA can be translated
into a DFA.
3. There can be multiple final states in both a DFA and an NFA.
4. The NFA is more of a theoretical concept.
5. The DFA is used in lexical analysis in a compiler.
Applications: Finite Automata in Practice
Grep (Global Regular Expression Print)
A Unix pattern-matching utility that searches files for a string of text and outputs the lines
that contain the pattern.
Thermostats
Coke Machines
Elevators
Train Track Switches
Security Properties
Lexical Analyzers for Parsers
The structure of the generated analyzer
The figure gives an overview of the architecture of a lexical
analyzer generated by Lex.
The lexical analyzer program includes:
• A program to simulate automata
• Components created from the Lex program by Lex itself, which are
listed as follows:
o A transition table for the automaton.
o Functions that are passed directly through Lex to the
output.
o Actions from the input program (fragments of code) which are
invoked by the automaton simulator when needed.
Steps to construct the automaton
Step 1: Convert each regular expression into an NFA,
either by Thompson's construction or by the direct
method.
Step 2: Combine all the NFAs into one by introducing
a new start state with ε-transitions to each of the
start states of the NFAs Ni for pattern Pi.
Step 2 is needed because the objective is to
construct a single automaton that recognizes lexemes
matching any of the patterns. For example:
a     { action A1 for pattern P1 }
abb   { action A2 for pattern P2 }
a*b+  { action A3 for pattern P3 }
For the string abb, both pattern P2 and pattern P3
match, but pattern P2 will be taken into account
because it is listed first in the Lex program.
For the string aabbb···, many prefixes match pattern
P3; the longest matching prefix is chosen.
Fig. shows NFAs for recognizing the three patterns
mentioned above.
The combined NFA for all three given
patterns is shown in Fig.
Pattern Matching Based on NFAs
The lexical analyzer reads input from the input buffer, starting from the
beginning of the lexeme pointed to by the pointer lexemeBegin.
The forward pointer is used to move ahead through the input symbols, and
the analyzer calculates the set of states it is in at each point.
If the NFA simulation has no next state for some input symbol, then no
longer prefix reaching an accepting state can exist.
In such cases, the decision is made on the longest prefix seen so far,
i.e., the lexeme matching some pattern.
The process is repeated until one or more accepting states are reached.
If there are several accepting states, then the pattern Pi which appears
earliest in the list of the Lex program is chosen.
e.g. w = aaba
Explanation
The process starts with the ε-closure of the initial state 0. After
processing all the input symbols, no state is found, as
there is no transition out of state 8 on input a. Hence, we
look for an accepting state by retracting to a previous
state. From Fig., state 2, which is an accepting state, is
reached after reading the input symbol a, and therefore
the pattern a has been matched. At state 8, the string aab
has been matched with pattern a*b+. By the Lex rule, the
longest matching prefix should be considered. Hence,
the action A3 corresponding to pattern P3 is executed
for the string aab.
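Assuming the standard combined NFA for the three patterns (states 0–8,
with 0 the new start state, as in the figure), the simulation of w = aaba
computes the following state sets:
ε-closure(0) = {0, 1, 3, 7}
on a → {2, 4, 7}   (state 2 accepts: pattern a, prefix "a")
on a → {7}
on b → {8}         (state 8 accepts: pattern a*b+, prefix "aab")
on a → ∅           (no move; retract)
The last configuration containing an accepting state was reached after
reading aab, so the longest matched prefix is aab and action A3 fires.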