Day 5 - Syntax Analysis
Day 5 - Syntax Analysis
Yared Y.
Syntax Analysis
(Parsing)
• After lexical analysis (scanning), we have a series of
tokens.
• In syntax analysis (or parsing), we want to interpret
what those tokens mean.
• Goals:
Recover the structure described by that series of
tokens.
Report errors if those tokens do not properly encode
a structure
Syntax Analysis at a glance
Syntax Analysis
• The second phase of a compiler is called syntax
analysis. The input to this phase consists of a stream of
tokens put out by the lexical analysis phase. They are
then checked for proper syntax, i.e. the compiler checks
to make sure the statements and expressions are
correctly formed.
• Some examples of syntax errors in Java are:
So,
Our grammar is: (N, T, P, S), where N – nonterminals, T-
terminals, P- production rules, S - Sentences
“Learners are awesome “ can be
specific as:
S cAd
A bc | a
Input string is “cad”
• As you can see the parser can be used not only to check
for proper syntax, but to produce output as well. This
process is called syntax directed translation.
• Syntax-Directed Translation (SDT) is a method used in compiler design where the translation
of a source program is driven by the syntax of the program as defined by a context-free grammar.
The translation process is directed by the syntax, meaning that as the parser recognizes the
structure of the source code (using grammar rules), it simultaneously generates an output, often in
the form of intermediate code, abstract syntax trees (AST), or other representations.
Why do you need Syntax analyzer
• Check if the code is valid grammatically
• The syntactical analyser helps you to apply rules to the code
• Helps you to make sure that each opening brace has a corresponding
closing balance
• Each declaration has a type and that the type must be exists
• Helps you to detect all types of Syntax errors
• Find the position at which error has occurred
• Clear & accurate description of the error.
• Recovery from an error to continue and find further errors in the code.
• Should not affect compilation of “correct” programs.
• The parse must reject invalid texts by reporting syntax errors
Grammar, Languages, and
Pushdown machines
• Languages can be specified using regular expressions,
finite state machines, grammars
• A grammar is a list of rules which can be used to
produce or generate all the strings of a language, and
which does not generate any strings which are not in
the language.
• A grammar is a set of structural rules which describe a
language. Grammars assign structure to any sentence
• A grammar consists of:
• A finite set of characters, called the input alphabet, the input
symbols, or terminal symbols.
• A finite set of symbols, distinct from the terminal symbols, called
nonterminal symbols, exactly one of which is designated the
starting nonterminal (if no nonterminal is explicitly designated
as the starting nonterminal, it is assumed to be the nonterminal
defined in the first rule).
• A finite list of rewriting rules, also called productions, which
define how strings in the language may be generated. Each of
these rewriting rules is of the form a → b, where a and b are
arbitrary strings of terminals and nonterminals, and a is not null.
Terminal vs Nonterminal symbols
• Terminal symbols (in short: terminals) are the symbols from
which the programs of the programming language are built.
• Examples for such symbols are reserved keywords of the language, or
• symbol classes such as identifiers which comprise a set of symbols.
• Nonterminals of the grammar denote sets of words that can
be produced from them by means of the production rules of
the grammar.
• Nonterminal symbols are terms that have to be defined further
somewhere in the grammar (they have to appear on a left-hand side),
and terminal symbols such as tokens which need no further definition.
The nonterminal nodes form interior nodes in the parse tree; the
terminal nodes are all at the ends of branches.
• The grammar specifies a language in the following way:
beginning with the starting nonterminal, any of the
rewriting rules are applied repeatedly to produce a
sentential form, which may contain a mix of terminals
and nonterminals.
If at any point, the sentential form contains no
nonterminal symbols, then it is in the language of this
grammar.
• Convention(can be changed)
all lower case letters and numbers are terminal symbols,
and all upper case letters (or words which begin with an
upper case letter) are nonterminal symbols.
The starting nonterminal is always S unless otherwise
specified.
Each of the grammars will be numbered (G1, G2, G3, ...)
for reference purposes.
Example:
Show three different derivations
using the grammar shown below:
Solution
S⇒aSA⇒aBAA⇒aBabA⇒aBabab⇒
abAabab⇒abababab
S⇒aSA⇒aSab⇒aBAab⇒abAAab⇒
ababAab⇒abababab
S⇒BA⇒bAA⇒babA⇒babab
Derivation (revisited)
• A derivation is basically a sequence of production rules,
in order to get the input string.
• During parsing, we take two decisions for some
sentential form of input:
• Deciding the non-terminal which is to be replaced.
• Deciding the production rule, by which, the non-terminal will
be replaced.
• To decide which non-terminal to be replaced with
production rule, we can have two options.
• Left-most derivation
• Right-most derivation
Left-most derivation & right-most
derivation
• If the sentential form of an input is scanned and
replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most
derivation is called the left-sentential form.
EE+E*E
E id + id * id
• In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives original input string.
• A parse tree depicts associativity and precedence of
operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence
over the operator which is in the parent nodes.
• Terminals (constants) are leaf nodes.
• Terminals are the basic symbols from which strings are
formed.
• Terminals are the elementary symbols of the language defined
by the grammar
• Non-terminals (variables) are interior nodes.
• Non-terminals are syntactic variables that denote sets of
strings. The sets of strings denoted by nonterminals help
define the language generated by the grammar.
• Nonterminals impose a hierarchical structure on the language
that is key to syntax analysis and translation
List out the terminals, nonterminals
and start symbol
•G
EE+T
EE–T
E T (read this as: E drives T)
TT*F
TT/F
TF
F(E)
F id
Ambiguity
• A grammar G is said to be ambiguous if it has more than one
parse tree (left or right derivation) for at least one string.
Example:
EE+E
EE–E
E id