Syntax Analysis
3.1 Parsing
Parsing involves analyzing a string of tokens to determine its grammatical structure. It
ensures the source code conforms to the language's syntax rules.
Types of Parsing:
1. Top-down Parsing: Constructs the parse tree from root to leaves (e.g., Recursive Descent
Parsing).
2. Bottom-up Parsing: Constructs the parse tree from leaves to root (e.g., LR Parsing).
Example:
Grammar:
E→E+T|T
T→T*F|F
F → (E) | id
Input: id + id * id
Parsing determines if the input matches the start symbol E.
3.2 Top-down Parsing
Top-down parsing starts at the root of the parse tree and expands non-terminals using
production rules to match the input string from left to right.
Example:
Grammar:
S → aA
A→b
Input: ab
Steps:
1. Start with S.
2. Expand S → aA.
3. Match a.
4. Expand A → b and match b.
Parse Tree:
S
/\
a A
|
b
Case study on parse tree construction (top-down)
In this case study, we'll explore the process of parsing a simple expression language
that supports basic arithmetic operations and parentheses. Parsing is a fundamental
concept in compiler design and language processing, and it involves breaking down
a string of symbols into its syntactic structure.
The Grammar
To define the structure of our expression language, we'll use a context-free
grammar:
expression -> expression '+' term | expression '-' term | term
term -> term '*' factor | term '/' factor | factor
factor -> '(' expression ')' | NUMBER
This grammar defines the following rules:
Expression:
An expression can be an expression added to a term.
An expression can be an expression subtracted from a term.
An expression can be a single term.
Term:
A term can be a term multiplied by a factor.
A term can be a term divided by a factor.
A term can be a single factor.
Factor:
A factor can be an expression enclosed in parentheses.
A factor can be a number.
Parsing Process
To parse an expression, we can use a parsing algorithm like recursive descent or
shift-reduce. These algorithms analyze the input string token by token, building a
parse tree that represents the syntactic structure of the expression.
Example:
Consider the expression 2 * (3 + 4). Here's how it would be parsed:
Match 2 *: This matches the term -> term * factor rule.
Match 2: This matches the factor -> NUMBER rule.
Match (3 + 4): This matches the factor -> ( expression ) rule.
Match 3 + 4: This matches the expression -> expression + term rule.
Match 3 and 4: Both match the factor -> NUMBER rule.
The resulting parse tree would look like this:
expression
/ \
term expression
/ \ / \
factor * factor + factor
| | | | |
NUMBER NUMBER NUMBER NUMBER NUMBER
Implementation
We can implement a parser using a programming language like C++, Java, or Python.
Parser generators like Yacc or Bison can automate the process of generating a
parser from a grammar specification.
Key Points:
- Tokenization: The input string is broken down into tokens (e.g., numbers,
operators, parentheses).
- Parsing: The tokens are analyzed to determine the syntactic structure of the
expression.
- Semantic Analysis: The meaning of the expression is checked, and type checking is
performed.
- Code Generation: The parsed expression is translated into machine code or
intermediate code.
Conclusion
Parsing is a fundamental technique in compiler design and language processing. By
understanding the grammar of a language and using appropriate parsing
algorithms, we can analyze and interpret input strings. This case study provides a
basic introduction to parsing and demonstrates how to construct a parse tree for a
simple expression language.
3.3.1 Predictive Parsing
Predictive parsing uses lookahead tokens to decide which production to use. It avoids
backtracking and requires the grammar to be LL(1).
Example:
Grammar:
E → T E'
E' → + T E' | ε
T → id
Input: id + id
Steps:
1. Start with E.
2. Expand E → T E'.
3. Match T → id, expand E' → + T E', and match tokens.
3.4.1 Top-down Parsing Principles of CFG
Context-Free Grammar (CFG) consists of terminals, non-terminals, a start symbol, and
production rules.
Principles:
- Always expand the leftmost non-terminal (Leftmost Derivation).
- Eliminate Left Recursion for top-down parsing.
Example:
Original Grammar: A → Aα | β
After Elimination: A → βA', A' → αA' | ε
3.5 Regular Expressions vs CFG
| Aspect | Regular Expressions | Context-Free Grammar (CFG) |
|-----------------|----------------------------|----------------------------|
| Expressiveness | Regular languages | Context-free languages |
| Applications | Token patterns | Programming constructs |
| Parsing | Finite automata | Parsers (Top-down/Bottom-up)|
Example:
Regex: [a-zA-Z_][a-zA-Z_0-9]*
CFG: E → T + E | T
3.6 Recursive Descent Parsing
Recursive Descent Parsing uses recursive functions corresponding to grammar non-
terminals.
Example Grammar:
E→T+E|T
T → id
Pseudo-code:
void E() {
T();
if (lookahead == '+') {
match('+'); E(); }}
3.7 Bottom-Up Parsing
Bottom-up parsing reduces input symbols to the start symbol.
Shift-Reduce Parsing Example:
Grammar:
E→E+T|T
T → id
Input: id + id
Steps:
1. Shift id, reduce to T.
2. Shift +, id, reduce to T.
3. Reduce T + T to E.
Example : Bottom-up parsing is a parsing technique that starts from the input tokens and
gradually reduces them to the start symbol of the grammar. It's like building a house from
the ground up, starting with the foundation and working your way to the roof.
Consider the following grammar:
S -> A B C
A -> a
B -> b
C -> c
And the input string: abc
Parsing Steps:
1. **Shift**:
- The parser reads the first input symbol 'a' and shifts it onto the stack.
- Stack: a
- Input: bc
2. **Reduce**:
- The top of the stack 'a' matches the right-hand side of the production A -> a.
- The parser reduces 'a' to 'A'.
- Stack: A
- Input: bc
3. **Shift**:
- The parser reads the next input symbol 'b' and shifts it onto the stack.
- Stack: Ab
- Input: c
4. **Reduce**:
- The top of the stack 'b' matches the right-hand side of the production B -> b.
- The parser reduces 'b' to 'B'.
- Stack: AB
- Input: c
5. **Shift**:
- The parser reads the next input symbol 'c' and shifts it onto the stack.
- Stack: ABC
- Input: (empty)
6. **Reduce**:
- The top three symbols on the stack 'ABC' match the right-hand side of the production S ->
ABC.
- The parser reduces 'ABC' to 'S'.
- Stack: S
- Input: (empty)
Now, the stack contains only the start symbol 'S', which means the input string has been
successfully parsed.
Key Points:
- **Shift-Reduce Actions**: The parser alternates between shifting input symbols onto the
stack and reducing sequences of symbols to non-terminals.
- **Handle Identification**: The parser must correctly identify the handle (the rightmost
substring that can be reduced) at each step.
- **Parse Table**: A parse table is used to determine the appropriate action (shift or reduce)
based on the current state and input symbol.
Common Bottom-Up Parsing Algorithms:
- **LR(0)**: Simple but limited in its ability to handle ambiguous grammars.
- **SLR(1)**: More powerful than LR(0), but still has limitations.
- **LR(1)**: More powerful than SLR(1), but can be complex to implement.
- **LALR(1)**: A compromise between SLR(1) and LR(1), providing a good balance of
power and simplicity.