Compiler Design
Lexeme: In a compiler, a lexeme refers to the smallest meaningful unit in the source code of a
program. It is a sequence of characters in the source code that represents a single, indivisible
element, such as a keyword, identifier, operator, or literal. Lexemes are the building blocks of a
program's syntax and semantics, and they are used by the lexical analyzer (lexer) to generate tokens.
A lexer is a crucial component of the compiler that performs lexical analysis or scanning. Its primary
role is to read the source code character by character, identify lexemes, and categorize them into
tokens, associating each token with a specific type and, in some cases, additional attributes. Tokens
are then passed to the parser for further analysis and processing.
Here are some examples of lexemes and the corresponding tokens they may generate in a
programming language like C:
Lexeme: "while"
Token Type: Keyword
Description: The lexeme "while" is recognized as a keyword in C, indicating the start of a while loop.
Lexeme: "count"
Token Type: Identifier
Description: The lexeme "count" represents an identifier, which could be a variable or function name.
Lexeme: "+"
Token Type: Operator
Description: The lexeme "+" is recognized as an operator, indicating addition.
Lexeme: "42"
Token Type: Integer Literal
Description: The lexeme "42" represents an integer literal, a numeric constant.
Lexeme: "3.14"
Token Type: Floating-Point Literal
Description: The lexeme "3.14" represents a floating-point literal, a decimal number with a fractional
part.
Lexeme: "("
Token Type: Left Parenthesis
Description: The lexeme "(" is recognized as a left parenthesis, often used to group expressions.
In summary, a lexeme in a compiler is a sequence of characters in the source code that represents a
single, meaningful element of the program. Lexemes are identified and categorized into tokens
during the lexical analysis phase, and these tokens are then used by subsequent phases of the
compiler for parsing, semantic analysis, and code generation.
Type of grammar
CFG Simplification
[Video lecture-11: Knowledge Gate]
In a CFG, it may happen that all the production rules and symbols are not
needed for the derivation of strings. Besides, there may be some null
productions and unit productions. Elimination of these productions and
symbols is called simplification of CFGs. Simplification essentially comprises
of the following steps −
Reduction of CFG
Removal of Unit Productions
Removal of Null Productions
Reduction of CFG
CFGs are reduced in two phases −
Phase 1 − Derivation of an equivalent grammar, G’, from the CFG, G, such
that each variable derives some terminal string.
Derivation Procedure −
Step 1 − Include all symbols, W1, that derive some terminal and
initialize i=1.
Step 2 − Include all symbols, Wi+1, that derive Wi.
Step 3 − Increment i and repeat Step 2, until Wi+1 = Wi.
Step 4 − Include all production rules that have Wi in it.
Phase 2 − Derivation of an equivalent grammar, G”, from the CFG, G’,
such that each symbol appears in a sentential form.
Derivation Procedure −
Step 1 − Include the start symbol in Y1 and initialize i = 1.
Step 2 − Include all symbols, Yi+1, that can be derived from Yi and include
all production rules that have been applied.
Step 3 − Increment i and repeat Step 2, until Yi+1 = Yi.
Problem
Find a reduced grammar equivalent to the grammar G, having production
rules, P: S → AC | B, A → a, C → c | BC, E → aA | e
Solution
Phase 1 −
T = { a, c, e }
W1 = { A, C, E } from rules A → a, C → c and E → aA
W2 = { A, C, E } U { S } from rule S → AC
W3 = { A, C, E, S } U ∅
Since W2 = W3, we can derive G’ as −
G’ = { { A, C, E, S }, { a, c, e }, P, {S}}
where P: S → AC, A → a, C → c , E → aA | e
Phase 2 −
Y1 = { S }
Y2 = { S, A, C } from rule S → AC
Y3 = { S, A, C, a, c } from rules A → a and C → c
Y4 = { S, A, C, a, c }
Since Y3 = Y4, we can derive G” as −
G” = { { A, C, S }, { a, c }, P, {S}}
where P: S → AC, A → a, C → c
Removal of Unit Productions
Any production rule in the form A → B where A, B ∈ Non-terminal is
called unit production..
Removal Procedure −
Step 1 − To remove A → B, add production A → x to the grammar rule
whenever B → x occurs in the grammar. [x ∈ Terminal, x can be Null]
Step 2 − Delete A → B from the grammar.
Step 3 − Repeat from step 1 until all unit productions are removed.
Problem
Remove unit production from the following −
S → XY, X → a, Y → Z | b, Z → M, M → N, N → a
Solution −
There are 3 unit productions in the grammar −
Y → Z, Z → M, and M → N
At first, we will remove M → N.
As N → a, we add M → a, and M → N is removed.
The production set becomes
S → XY, X → a, Y → Z | b, Z → M, M → a, N → a
Now we will remove Z → M.
As M → a, we add Z→ a, and Z → M is removed.
The production set becomes
S → XY, X → a, Y → Z | b, Z → a, M → a, N → a
Now we will remove Y → Z.
As Z → a, we add Y→ a, and Y → Z is removed.
The production set becomes
S → XY, X → a, Y → a | b, Z → a, M → a, N → a
Now Z, M, and N are unreachable, hence we can remove those.
The final CFG is unit production free −
S → XY, X → a, Y → a | b
Removal of Null Productions
In a CFG, a non-terminal symbol ‘A’ is a nullable variable if there is a
production A → ε or there is a derivation that starts at A and finally ends
up with
ε: A → .......… → ε
Removal Procedure
Step 1 − Find out nullable non-terminal variables which derive ε.
Step 2 − For each production A → a, construct all productions A →
x where x is obtained from ‘a’ by removing one or multiple non-terminals
from Step 1.
Step 3 − Combine the original productions with the result of step 2 and
remove ε - productions.
Problem
Remove null production from the following −
S → ASA | aB | b, A → B, B → b | ∈
Solution −
There are two nullable variables − A and B
At first, we will remove B → ε.
After removing B → ε, the production set becomes −
S→ASA | aB | b | a, A ε B| b | &epsilon, B → b
Now we will remove A → ε.
After removing A → ε, the production set becomes −
S→ASA | aB | b | a | SA | AS | S, A → B| b, B → b
This is the final production set without null transition.
Question : In general how many normal forms are for CFG (Context Free Grammar)
Ans. 2 Chomsky Normal Form and Greiback Normal Form
Important terms
The most general phase of structured grammar is?
Answer: a. Context Sensitive Grammar
Explanation: Context-sensitive grammar is the most general phase of
structured grammar because, in this grammar, the left-hand side and the right
side contain the terminals or non-terminals.
In the compiler, the function of using intermediate code is:
Answer: d. to increase the chances of re-using the machine-independent code
optimizer in other compilers.
Explanation: After semantic analysis, the intermediate code increases the
chances of reusing the machine-independent code optimizer in other compilers.
In how many types of optimization can be divided?
Answer: a. two types
Explanation: The code optimization technique is divided into machine-
dependent and machine-independent types.
Which language is accepted by the push-down automata?
Answer: c. Type 2 language
Explanation: According to the Chomsky hierarchy, push down automata accepts
the Type 2 language, which is used for context-free language.
Which of the following parser is a top-down parser?
Answer: d. Recursive descent parser
Explanation: Recursive descent parser is a type of top-down parser which
generates the parse tree from top to bottom and reads the input string from left
to right.
Which parser is most powerful in the following parsers?
a. Operator Precedence
b. SLR
c. Canonical LR
d. LALR
Answer: c. Canonical LR
Explanation: Canonical LR (CLR) is the most powerful parser than LALR and SLR.
The value of which variable is updated inside the loop by a loop-invariant value?
a. loop
b. strength
c. induction
d. invariable
Answer: c. induction
Explanation: The value of the induction variable is updated inside the loop by a
loop-invariant value.
Which of the following is not a characteristic of the compiler?
a. More execution time
b. Debugging process is slow
c. The execution takes place after the removal of all syntax errors
d. Firstly scans the entire program and then transforms it into machine-
understandable code
Answer: a. More execution time
Explanation: The compiler does not take more time to execute. So, more
execution time is not a characteristic of the compiler.
Which method merges the multiple loops into the single one?
a. Constant Folding
b. Loop rolling
c. Loop fusion or jamming
d. None of the above
Answer: c. Loop fusion or Loop jamming
Explanation: Loop fusion is an optimization technique which merges the multiple
bodies of loops into a single body. This programming technique may reduce the
runtime performance of the program.
Which parser is known as the shift-reduce parser?
a. Bottom-up parser
b. Top-down parser
c. Both Top-down and bottom-up
d. None of the Above
Answer: a. Bottom-up parser
Explanation: Bottom-up parser in the compiler is also called the shift-reduce
parser.