CD Unit-2.notes
Unit-2 syllabus
Syntax Analysis
Introduction, Context-Free Grammars, Writing a Grammar, Top-Down Parsing, Bottom-
Up Parsing, Introduction to LR Parsing: Simple LR, More Powerful LR Parsers, Using
Ambiguous Grammars and Parser Generators.
Syntax analysis:
An input string is passed through several phases in a compiler. The first phase is
the lexical analysis, where the input is scanned and is divided into tokens. Syntax
analysis is the second phase of a compiler. The output of syntax analysis is used as
input to the semantic analyzer.
In syntax analysis, the compiler checks the syntactic structure of the input string,
i.e., whether the given string follows the grammar or not. It uses a data structure
called a parse tree (or syntax tree) for this check. The parse tree is formed by
matching the input string with the pre-defined grammar. If the parsing is
successful, the given string can be generated by the grammar; otherwise, an error is
reported.
Importance of Syntax Analysis
It checks whether the code is grammatically correct or not.
It helps us to detect syntax errors.
It gives a description of the error and its position.
It rejects invalid code before actual compilation begins.
Parsing Techniques
The parsing techniques can be divided into two types:
Top-down parsing: The parse tree is constructed from the root to the leaves in top-down
parsing. Some most common top-down parsers are Recursive Descent Parser and LL
parser.
Bottom-up parsing: The parse tree is constructed from the leaves to the tree’s root in
bottom-up parsing. Some examples of bottom-up parsers are the LR parser, SLR parser,
CLR parser, etc.
Derivation
The derivation is the process of using the production rules (grammar) to derive the input
string. There are two decisions that the parser must make while deriving the input string:
1. Deciding which non-terminal is to be replaced. There are two options:
a) Left-most Derivation: When the leftmost non-terminal is replaced at each step, it is
called left-most derivation.
b) Right-most Derivation: When the rightmost non-terminal is replaced at each step, it is
called right-most derivation.
2. Deciding the production rule with which the non-terminal will be replaced.
Parse Tree
A parse tree is a graphical representation of the derivation. It is used to see how the given
string is derived from the start symbol. The start symbol is the root of the parse tree, and
the characters of the input string become the leaves.
Example
Consider the following set of production rules where ‘E’ is a non-terminal and ‘id’ is a
terminal:
E -> E + E (This means that E can be replaced with E + E)
E -> E * E (This means that E can be replaced with E * E)
E -> id (This means that E can be replaced with ‘id’. Since ‘id’ is a terminal, it will
not be replaced further; thus, it will form a leaf.)
We will construct a parse tree using the left-most derivation of “id + id * id”.
Step-1: Replace E with E * E. Result: E * E
Step-2: Replace the leftmost E with E + E. Result: E + E * E
Steps-3,4,5: Replace each remaining E with id. Result: id + id * id
Thus, we can generate a given string by following the production rules using the parse
trees.
Ambiguity: A grammar is ambiguous if there is more than one parse tree for some string.
For example, for the above string and grammar, we can construct two parse trees:
Ambiguous grammar is not considered suitable for compiler design. There is no general
algorithm that can detect or remove ambiguity (the problem is undecidable). If there is
ambiguity in the grammar, one has to remove it by either rewriting the whole grammar or
by following associativity and precedence constraints.
Limitations of Syntax Analysis
It cannot determine whether a token is a valid token or not.
It cannot determine whether a variable is declared before it is used.
It cannot determine whether an operation performed on the tokens is semantically valid.
It cannot tell whether a variable was initialized or not.
These checks are left to the semantic analyzer.
Parse Tree-
1. Leftmost Derivation-
The process of deriving a string by expanding the leftmost non-terminal at each step is
called leftmost derivation.
The geometrical representation of a leftmost derivation is called a leftmost derivation
tree.
Example-
Consider the grammar S → aB / bA, A → a / aS / bAA, B → b / bS / aBB, and the
string w = aaabbabbba.
Leftmost Derivation-
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbB (Using B → b)
→ aaabbaBB (Using B → aBB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
2. Rightmost Derivation-
The process of deriving a string by expanding the rightmost non-terminal at each step is
called rightmost derivation.
The geometrical representation of a rightmost derivation is called a rightmost
derivation tree.
Example- (same grammar and string as above)
S → aB
→ aaBB (Using B → aBB)
→ aaBaBB (Using B → aBB)
→ aaBaBbS (Using B → bS)
→ aaBaBbbA (Using S → bA)
→ aaBaBbba (Using A → a)
→ aaBabbba (Using B → b)
→ aaaBBabbba (Using B → aBB)
→ aaaBbabbba (Using B → b)
→ aaabbabbba (Using B → b)
For unambiguous grammars, the leftmost derivation and the rightmost derivation represent the
same parse tree.
For ambiguous grammars, the leftmost derivation and the rightmost derivation may represent
different parse trees.
Here, the given grammar is unambiguous; that is why the leftmost derivation and the
rightmost derivation represent the same parse tree.
1. Ambiguous Grammar-
A grammar is said to be ambiguous if, for some string generated by it, it produces more
than one-
Parse tree
Or derivation tree
Or syntax tree
Or leftmost derivation
Or rightmost derivation
Example-
Reason-01:
Since two parse trees exist for the string w, the grammar is ambiguous.
2. Unambiguous Grammar-
A grammar is said to be unambiguous if, for every string generated by it, it produces
exactly one-
Parse tree
Or derivation tree
Or syntax tree
Or leftmost derivation
Or rightmost derivation
Context-Free Grammar (CFG)
G = (V, T, P, S)
Explanation
G in the above expression stands for the grammar. It consists of the set of production
rules with the help of which we can generate the strings of a language.
V stands for the set of non-terminal symbols (variables); we use capital letters to
denote them.
T stands for the finite set of terminal symbols; we use lowercase letters to show them.
P is the collection of all production rules used to replace the non-terminal symbols of
a string with terminal or other non-terminal symbols.
S stands for the start symbol, from which derivation of strings begins.
Examples
Example 1
Construct the CFG for having any number of b’s over the set ∑= {b}.
Solution:
The regular expression for the above language is
r.e. = b*
Production rules are as given below:
S → bS rule 1
S → ε rule 2
Now, we will derive the string “bbbbbb.”
S
→ bS (rule 1)
→ bbS (rule 1)
→ bbbS (rule 1)
→ bbbbS (rule 1)
→ bbbbbS (rule 1)
→ bbbbbbS (rule 1)
→ bbbbbbε (rule 2)
= bbbbbb
The r.e. b* generates the set of strings {ε, b, bb, bbb, .....}.
The parser is the phase of the compiler which takes a token string as input and, with the
help of the existing grammar, converts it into the corresponding parse tree. The parser is
also known as the syntax analyzer.
Types of Parser:
Parsers are mainly classified into 2 categories: Top-down Parser and Bottom-up Parser.
These are explained below.
1. Top-down Parser:
A top-down parser generates the parse tree for the given input string with the help of
grammar productions by expanding the non-terminals, i.e., it starts from the start
symbol and ends at the terminals. It uses leftmost derivation.
A top-down parser with backtracking is also known as a brute-force parser: it
generates the parse tree by trying productions and backtracking on failure.
2. Bottom-up Parser:
A bottom-up parser generates the parse tree for the given input string with the help of
grammar productions by reducing towards the non-terminals, i.e., it starts from the
terminals of the string and ends at the start symbol. It uses the reverse of the
rightmost derivation.
Bottom-up parsers are further classified into 2 types: LR parsers and operator
precedence parsers.
(i). LR parser:
The LR parser is a bottom-up parser which generates the parse tree for the given string
by using an unambiguous grammar. It follows the reverse of the rightmost derivation.
LR parsers are of 4 types:
(a). LR(0)
(b). SLR(1)
(c). CLR(1)
(d). LALR(1)
(ii). Operator precedence parser:
An operator precedence parser generates the parse tree using precedence relations
between the terminals of an operator grammar (a grammar with no ε-production and no
two adjacent non-terminals on any right side).
Top-down Parsing:
In the top-down technique, the parse tree is constructed from the top and the input is
read from left to right. The parser starts from the start symbol and proceeds towards
the string; it follows leftmost derivation.
The difficulty with a top-down parser is that, if a variable has more than one
production, selecting the correct one is difficult.
Working of Top down Parser:
Let’s consider an example where grammar is given and you need to construct a parse
tree by using top down parser technique.
Example –
S -> aABe
A -> Abc | b
B -> d
Now, let us consider the input to read and construct a parse tree with the top-down
approach.
Input –
abbcde$
Here is how the top-down approach generates the input string from the grammar:
First, start with S -> aABe; the input string has a at the beginning and e at the end.
Now, we need to generate abbcde.
Expand A -> Abc and expand B -> d.
Now we have the string aAbcde, and the input string is abbcde.
Expand A -> b.
We finally get the string abbcde.
The diagram below explains the construction of the top-down parse tree: it shows how
the input string is generated from the grammar with the top-down approach.
WRITING GRAMMAR:
Recursion-
Left Recursion
Right Recursion
General Recursion
1. Left Recursion-
A production of a grammar is said to have left recursion if the leftmost variable of its
RHS is the same as the variable of its LHS.
A grammar containing a production having left recursion is called a Left Recursive
Grammar.
Example-
S → Sa / ∈
(Left Recursive Grammar)
In general, for a left-recursive pair of productions A → Aα / β, we can eliminate the
left recursion by replacing the pair of productions with-
A → βA’
A’ → αA’ / ∈
(Right Recursive Grammar)
2. Right Recursion-
A production of grammar is said to have right recursion if the rightmost variable of its
RHS is same as variable of its LHS.
A grammar containing a production having right recursion is called as Right Recursive
Grammar.
Example-
S → aS / ∈
(Right Recursive Grammar)
Right recursion does not create any problem for the Top down parsers.
Therefore, there is no need of eliminating right recursion from the grammar.
3. General Recursion-
Recursion which is neither left recursion nor right recursion is called general
recursion.
Example-
S → aSb / ∈
Common Prefixes-
If the RHS of more than one production starts with the same symbol, the grammar has
common prefixes.
Example-
A → αβ1 / αβ2
This kind of grammar creates a problematic situation for top-down parsers: with one
symbol of look-ahead, the parser cannot decide which production must be chosen to
parse the string in hand.
To remove this confusion, we use left factoring.
Left Factoring-
Left factoring is a process by which the grammar with common prefixes is transformed
to make it useful for Top down parsers.
How?
In left factoring,
We make one production for each common prefix.
The common prefix may be a terminal, a non-terminal, or a combination of both.
The rest of the derivation is added by new productions.
The grammar obtained after the process of left factoring is called a Left Factored
Grammar.
Example-
Problem-01:
Consider the left-recursive grammar:
E –> E + T | T
T –> T * F | F
F –> ( E ) | id
After eliminating left recursion, we get:
E –> T E’
E’ –> + T E’ | e
T –> F T’
T’ –> * F T’ | e
F –> ( E ) | id
**Here e is Epsilon
For a Recursive Descent Parser, we write one procedure (program) for every variable.
Example:
Predictive parser (or) LL(1) parser (or) dynamic parser (or) without-backtracking
parser
LL(1) Parsing: Here, the first L represents that the scanning of the input will be done
from left to right, and the second L shows that in this parsing technique we are going
to use the leftmost derivation tree.
Finally, the 1 represents the number of look-ahead symbols, which means how many
symbols you are going to see when you want to make a decision.
Predictive Parser:
Example-1:
Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
*ε denotes epsilon
Find their First and Follow sets:

Production | First | Follow
E –> TE’ | { id, ( } | { $, ) }
E’ –> +TE’/ε | { +, ε } | { $, ) }
T –> FT’ | { id, ( } | { +, $, ) }
T’ –> *FT’/ε | { *, ε } | { +, $, ) }
F –> id/(E) | { id, ( } | { +, *, $, ) }

The LL(1) parsing table has one column for each terminal: id, +, *, (, ), $.
In the table, all the ε-productions are placed under the Follow set of the
corresponding non-terminal, and all the remaining productions are placed under the
First set of that non-terminal.
Note: Not every grammar is feasible for an LL(1) parsing table. It may be possible that
one cell contains more than one production.
BOTTOM-UP PARSING:
A bottom-up parser builds the parse tree from the leaves to the root. Bottom-up parsing
can be defined as an attempt to reduce the input string w to the start symbol of the
grammar by tracing out the rightmost derivation of w in reverse.
Shift-Reduce Parsing
o Shift-reduce parsing performs two actions: shift and reduce. That is why it is
known as shift-reduce parsing.
o At a shift action, the current symbol in the input string is pushed onto a stack.
o At each reduction, the symbols on top of the stack matching the right side of a
production are replaced by the non-terminal on the left side of that production.
Basic Operations –
Shift: This involves moving symbols from the input buffer onto the stack.
Reduce: If the handle appears on top of the stack then, its reduction by using
appropriate production rule is done i.e. RHS of a production rule is popped out of a
stack and LHS of a production rule is pushed onto the stack.
Accept: If only the start symbol is present in the stack and the input buffer is empty,
then the parsing action is called accept. When the accept action is obtained, it
means parsing has completed successfully.
Error: This is the situation in which the parser can neither perform a shift action nor
a reduce action, nor even an accept action.
Example 1 – Consider the grammar
S –> S + S
S –> S * S
S –> id
Perform Shift Reduce parsing for input string “id + id + id”.
The two main bottom-up parsing techniques are:
1. Operator-Precedence Parsing
2. LR Parsing
Operator-Precedence Parsing-
The precedence relation between two adjacent terminals a (on the left) and b (on the
right) is defined by the following rules:
Rule-01:
If precedence of b is higher than precedence of a, then we define a < b
If precedence of b is same as precedence of a, then we define a = b
If precedence of b is lower than precedence of a, then we define a > b
Rule-02:
An identifier is always given higher precedence than any other symbol.
$ symbol is always given the lowest precedence.
Rule-03:
If two operators have the same precedence, then we go by checking their associativity.
Step-01:
Insert the $ symbol at both ends of the string, and insert the precedence relation
between every two adjacent symbols of the string.
Step-02:
Start scanning the string from the LHS in the forward direction until the first >
symbol is encountered.
Keep a pointer on that location.
Step-03:
Start scanning the string from RHS in the backward direction until < symbol is
encountered.
Keep a pointer on that location.
Step-04:
Everything that lies in the middle of < and > forms the handle.
Replace the handle with the head of the respective production.
Step-05:
Keep repeating the cycle from Step-02 to Step-04 until the start symbol is reached.
LR parser:
LR parsing is a type of bottom-up parsing used to parse a large class of grammars. In
the name LR(k), L stands for scanning the input from left to right, R stands for
constructing a rightmost derivation in reverse, and k is the number of look-ahead
input symbols used to make parsing decisions.
LR algorithm:
The LR algorithm requires a stack, input, output and a parsing table. In all types of
LR parsing, the input, output and stack are the same, but the parsing table is
different.
The input buffer contains the string to be parsed followed by a $ symbol, which is
used to indicate the end of the input.
A stack is used to contain a sequence of grammar symbols with a $ at the bottom of the
stack.
Parsing table is a two dimensional array. It contains two parts: Action part and Go To
part.
LR(0) Parser
We need two functions –
Closure()
Goto()
Augmented Grammar
If G is a grammar with start symbol S then G’, the augmented grammar for G, is the
grammar with new start symbol S’ and a production S’ -> S. The purpose of this new
starting production is to indicate to the parser when it should stop parsing and announce
acceptance of input.
Let a grammar be S -> AA
A -> aA | b
The augmented grammar for the above grammar will be
S’ -> S
S -> AA
A -> aA | b
LR (0) Items
An LR(0) item of a grammar G is a production of G with a dot at some position in the
right side.
S -> ABC yields four items
S -> .ABC
S -> A.BC
S -> AB.C
S -> ABC.
The production A -> ε generates only one item A -> .ε
Closure Operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed
from I by the two rules:
1. Initially, add every item in I to closure(I).
2. If A → α•Bβ is in closure(I) and B → γ is a production, then add the item B → •γ
to closure(I), if it is not already there. Apply this rule until no more new items
can be added.
Goto Operation:
Goto(I, X) is computed in two steps:
1. Take every item in I with the dot immediately before X, and move the dot to after X.
2. Apply the closure operation to the set obtained in the first step.
The action function takes as arguments a state i and a terminal a (or $ , the input end
marker). The value of ACTION[i, a] can have one of four forms:
1. Shift j, where j is a state.
2. Reduce A -> β.
3. Accept
4. Error
We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii, A] =
Ij, then GOTO also maps state i and non-terminal A to state j.
EXAMPLE:
Consider the grammar S ->AA
A -> aA | b
Augmented grammar S’ -> S
S -> AA
A -> aA | b
The Action part of the table contains all the terminals of the grammar, whereas the
Goto part contains all the non-terminals.
For every state of the goto graph, we write all the goto operations in the table. If
goto is applied to a terminal, it is written in the Action part; if goto is applied
on a non-terminal, it is written in the Goto part. If on applying goto a production
is reduced (i.e., the dot reaches the end of the production and no further closure
can be applied), it is denoted as Ri; if the production is not reduced (shifted), it
is denoted as Si.
If a production is reduced, it is written under the terminals given by the FOLLOW set
of the left side of the production which is reduced. For example, in I5, S → AA is
reduced, so R1 is written under the terminals in FOLLOW(S) = {$}.
LR (0) Parsing
Augment Grammar
Augmented grammar G` will be generated if we add one more production in the given
grammar G. It helps the parser to identify when to stop the parsing and announce the
acceptance of the input.
Example
Given grammar
1. S → AA
2. A → aA | b
1. S`→ S
2. S → AA
3. A → aA | b
Canonical Collection of LR(0) items
An LR(0) item is a production of G with a dot at some position on the right side of the
production.
LR(0) items are useful to indicate how much of the input has been scanned up to a
given point in the process of parsing.
Example:
Given grammar:
1. S → AA
2. A → aA | b
Add Augment Production and insert '•' symbol at the first position for every production in
G
1. S` → •S
2. S → •AA
3. A → •aA
4. A → •b
I0 State:
Add all productions starting with S into the I0 state, because "•" is followed by the
non-terminal S. Then add all productions starting with A, because "•" is followed by
the non-terminal A. So, the I0 state becomes
I0 = S` → •S
S → •AA
A → •aA
A → •b
I1 = Goto(I0, S) = closure(S` → S•) = S` → S•
I2 = Goto(I0, A) = closure(S → A•A) = S → A•A
A → •aA
A → •b
I3 = Goto(I0, a) = closure(A → a•A) = A → a•A
A → •aA
A → •b
I4 = Goto(I0, b) = closure(A → b•) = A → b•
Goto(I2, a) = closure(A → a•A) = (same as I3)
Goto(I2, b) = closure(A → b•) = (same as I4)
Goto(I3, a) = closure(A → a•A) = (same as I3)
Goto(I3, b) = closure(A → b•) = (same as I4)
I5 = Goto(I2, A) = closure(S → AA•) = S → AA•
I6 = Goto(I3, A) = closure(A → aA•) = A → aA•
Drawing DFA:
LR (0) Table
o If a state is going to some other state on a terminal, then it corresponds to a
shift move.
o If a state is going to some other state on a variable (non-terminal), then it
corresponds to a goto move.
o If a state contains a final item (an item with the dot at the rightmost end), then
write the reduce move in the entire row for that state.
Explanation:
I0 on S is going to I1 so write it as 1.
D.SANJEEVA REDDY Page 34
[CD] Unit-2
I0 on A is going to I2 so write it as 2.
I2 on A is going to I5 so write it as 5.
I3 on A is going to I6 so write it as 6.
I0, I2 and I3 on a are going to I3, so write it as S3, which means shift 3.
I0, I2 and I3 on b are going to I4, so write it as S4, which means shift 4.
I4, I5 and I6 all contain a final item, because they contain • at the rightmost
end. So number the productions:
S → AA ... (1)
A → aA ... (2)
A → b ... (3)
I1 contains the final item S` → S•, so action{I1, $} = Accept.
I4 contains the final item A → b•, and that production corresponds to production
number 3, so write r3 in the entire row.
I5 contains the final item S → AA•, and that production corresponds to production
number 1, so write r1 in the entire row.
I6 contains the final item A → aA•, and that production corresponds to production
number 2, so write r2 in the entire row.
SLR (1) refers to simple LR parsing. It is similar to LR(0) parsing; the only
difference is in the parsing table. To construct the SLR(1) parsing table, we use the
canonical collection of LR(0) items.
In SLR(1) parsing, we place a reduce move only under the terminals in the FOLLOW set
of the left-hand side of the reduced production.
The steps used to construct the SLR(1) table are given below:
If a state (Ii) is going to some other state (Ij) on a terminal, then it corresponds to
a shift move in the Action part.
If a state (Ii) is going to some other state (Ij) on a variable, then it corresponds to
a goto move in the Goto part.
If a state (Ii) contains a final item like A → ab•, which has no transition to a next
state, then the production is known as a reduce production. For all terminals X in
FOLLOW(A), write the reduce entry along with the production number.
Example
Consider a state containing the items:
1. S -> •Aa
2. A -> αβ•
with Follow(S) = {$} and Follow(A) = {a}. The reduce move for A -> αβ is placed only
under the terminal a, i.e., under Follow(A).
EXAMPLE – Construct SLR parsing table for the given context-free grammar
S–>AA
A–>aA|b
Solution:
STEP1 – Find augmented grammar
The augmented grammar of the given grammar is:-
S’–>.S [0th production]
S–>.AA [1st production]
A–>.aA [2nd production]
A–>.b [3rd production]
STEP2 – Find LR(0) collection of items
Below is the figure showing the LR(0) collection of items. We will understand
everything one by one.
CLR refers to canonical look-ahead LR. CLR parsing uses the canonical collection of
LR(1) items to build the CLR(1) parsing table. The CLR(1) parsing table produces more
states than the SLR(1) parsing table.
In CLR(1), we place a reduce move only under the look-ahead symbols.
LR (1) item:
An LR(1) item is an LR(0) item together with a look-ahead terminal. The look-ahead is
used to determine where we place the final (reduce) item.
The look-ahead for the augment production is always the $ symbol.
Example:
1. S → AA
2. A → aA
3. A → b
Add Augment Production, insert '•' symbol at the first position for every production in G
and also add the look ahead.
1. S` → •S, $
2. S → •AA, $
3. A → •aA, a/b
4. A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "." is followed by the non-
terminal. So, the I0 State becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A in the modified I0 state, because "•" is followed
by the non-terminal. So, the I0 state becomes
I0 = S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
1. If any non-terminal has ‘•’ preceding it, we have to write all its productions and
add ‘•’ preceding each of them.
2. From each state to the next state, the ‘•’ shifts one place to the right.
3. All the rules of look-ahead apply here.
In the figure, I0 consists of the augmented grammar.
I0 goes to I1 when ‘•’ of the 0th production is shifted towards the right of S
(S’ -> S•). This state is the accept state; S is seen by the compiler. Since I1 is a
part of the 0th production, the look-ahead is the same, i.e. $.
I0 goes to I2 when ‘•’ of the 1st production is shifted towards the right
(S -> A•A). A is seen by the compiler. Since I2 is a part of the 1st production, the
look-ahead is the same, i.e. $.
I0 goes to I3 when ‘•’ of the 2nd production is shifted towards the right
(A -> a•A). a is seen by the compiler. Since I3 is a part of the 2nd production, the
look-ahead is the same, i.e. a/b.
I0 goes to I4 when ‘•’ of the 3rd production is shifted towards the right
(A -> b•). b is seen by the compiler. Since I4 is a part of the 3rd production, the
look-ahead is the same, i.e. a/b.
I2 goes to I5 when ‘•’ of the 1st production is shifted towards the right
(S -> AA•). A is seen by the compiler. Since I5 is a part of the 1st production, the
look-ahead is the same, i.e. $.
1. S → AA ... (1)
2. A → aA ....(2)
3. A → b ... (3)
The placement of the shift moves in the CLR(1) parsing table is the same as in the
SLR(1) parsing table; the only difference is in the placement of the reduce moves.
I4 contains the final item (A → b•, a/b), so action{I4, a} = R3 and action{I4, b} = R3.
I5 contains the final item (S → AA•, $), so action{I5, $} = R1.
I7 contains the final item (A → b•, $), so action{I7, $} = R3.
I8 contains the final item (A → aA•, a/b), so action{I8, a} = R2 and action{I8, b} = R2.
I9 contains the final item (A → aA•, $), so action{I9, $} = R2.
LALR refers to the look ahead LR. To construct the LALR (1) parsing table, we use the
canonical collection of LR (1) items.
In LALR(1) parsing, the LR(1) items which have the same productions but different
look-aheads are combined to form a single set of items.
LALR(1) parsing is the same as CLR(1) parsing; the only difference is in the parsing
table.
Example
1. S → AA
2. A → aA
3. A → b
Add Augment Production, insert '•' symbol at the first position for every production in G
and also add the look ahead.
1. S` → •S, $
2. S → •AA, $
3. A → •aA, a/b
4. A → •b, a/b
DFA:
Clearly, I3 and I6 are the same in their LR(0) items but differ in their look-aheads,
so we can combine them and call the union I36:
I36 = A → a•A, a/b/$
A → •aA, a/b/$
A → •b, a/b/$
I4 and I7 are the same but differ only in their look-aheads, so we can combine them
and call the union I47.
I8 and I9 are the same but differ only in their look-aheads, so we can combine them
and call the union I89.
LL Parser | LR Parser
First L of LL is for left-to-right scanning; second L is for leftmost derivation. | L of LR is for left-to-right scanning; R is for rightmost derivation (in reverse).
Ends when the stack used becomes empty. | Starts with an empty stack.
Pre-order traversal of the parse tree. | Post-order traversal of the parse tree.
Terminal is read after popping it out of the stack. | Terminal is read before pushing it onto the stack.
Let us see the comparison between the SLR, CLR, and LALR parsers.

SLR Parser | LALR Parser | CLR Parser
It is very easy and cheap to implement. | It is also easy and cheap to implement. | It is expensive and difficult to implement.
SLR parser is the smallest in size. | LALR and SLR have the same size, as they have fewer states. | CLR parser is the largest, as the number of states is very large.
It requires less time and space. | It requires more time and space. | It also requires more time and space.
PARSER GENERATOR:
Yet Another Compiler Compiler (YACC) is a tool that generates a parser program for
a given LALR(1) grammar. It processes the grammar and outputs a C program. YACC
takes help from the Lex tool (a lexical analyzer generator) to tokenize an input stream.
Parts of the YACC program
A YACC program contains three main sections:
Definitions
Rules
Auxiliary routines
Definitions
Definitions are at the top of the YACC input file. They include header files and any
information about the regular definitions or tokens. We define the tokens using the
percent sign (as in %token), whereas we place C-specific code, such as header file
includes, within %{ and %}.
Some examples are as follows:
%token ID
%{ #include <stdio.h> %}
Rules
Rules are placed between the two %% markers. They define the actions we take when we
scan the tokens, and they execute whenever a token in the input stream matches the
grammar. Any C action is placed between curly brackets { }.
Auxiliary routines
Auxiliary routines include functions that we may require in the rules section. Here we
write functions in regular C syntax. This section includes the main() function, in
which the yyparse() function is called.
The yyparse() function reads the tokens, performs the actions, and returns to main
when it reaches the end of the file or when an error occurs. It returns 0 if the
parsing is successful and 1 if it is unsuccessful.
Example
The following code is an example of a YACC program of a simple calculator taking two
operands. The .y extension file contains the YACC code for the parser generation, which
uses the .l extension file that includes the lexical analyzer.
file.y
%{
#include <ctype.h>
#include <stdio.h>
int yylex();
void yyerror();
int tmp=0;
%}
%token num
%left '+' '-'
%left '*' '/'
%%
/* Rules section: a typical two-operand calculator grammar for these
   declarations (reconstructed; the exact original rules are not shown). */
start : e { if (tmp == 0) printf("Result = %d\n", $1); }
      ;
e : e '+' e   { $$ = $1 + $3; }
  | e '-' e   { $$ = $1 - $3; }
  | e '*' e   { $$ = $1 * $3; }
  | e '/' e   { $$ = $1 / $3; }
  | '(' e ')' { $$ = $2; }
  | num       { $$ = $1; }
  ;
%%
void yyerror(){
printf("The arithmetic expression is incorrect\n");
tmp=1;
}
int main(){
printf("Enter an arithmetic expression(can contain +,-,*,/ or parenthesis):\n");
yyparse();
}
lexfile.l
%option noinput nounput noyywrap
%{
#include <stdlib.h>
#include <stdio.h>
#include "y.tab.h"
extern int yylval;
%}
%%
[0-9]+ { yylval = atoi(yytext); return num; } /* digits form the num token (reconstructed rule) */
[ \t] ;
[\n] return 0;
. return yytext[0]; /* pass operators and parentheses through (reconstructed rule) */
In the file.y YACC file:
The header files and function declarations are initialized in the definitions section.
The tokens and operator precedences for the grammar are declared with %token and %left.
The grammar used to build the calculator is defined in the rules section.
An error function yyerror() is defined along with the main function.
In the lexfile.l Lex file:
The header files are initialized, along with the YACC-generated header (y.tab.h)
included as a header file.
The regular definitions of the expected tokens are given; for example, the calculator
input will contain the digits 0-9.
Execution
To execute the code, we type the following commands in the terminal:
We type lex lexfile.l and press "Enter" to compile the Lex file.
We type yacc -d file.y and press "Enter" to compile the YACC file.
We type gcc -Wall -o output lex.yy.c y.tab.c and press "Enter" to compile the
generated C files.
We type ./output to execute the program.