Day 5 - Syntax Analysis

Nlp ml

Uploaded by

Yeabsira
Copyright: © All Rights Reserved

Syntax Analysis

Yared Y.
Syntax Analysis
(Parsing)
• After lexical analysis (scanning), we have a series of
tokens.
• In syntax analysis (or parsing), we want to interpret
what those tokens mean.
• Goals:
  • Recover the structure described by that series of tokens.
  • Report errors if those tokens do not properly encode a structure.
Syntax Analysis at a glance
Syntax Analysis
• The second phase of a compiler is called syntax
analysis. The input to this phase consists of a stream of
tokens put out by the lexical analysis phase. They are
then checked for proper syntax, i.e. the compiler checks
to make sure the statements and expressions are
correctly formed.
• Some examples of syntax errors in Java are:

  x = (2+3) * 9);        // mismatched parentheses
  if x>y x = 2;          // missing parentheses
  while (x==3) do f1();  // invalid keyword do
• Syntax analysis is all about discovering structure in
code.
• It determines whether or not a text follows the expected
format. The main aim of this phase is to make sure that the
source code written by the programmer is syntactically correct.
• Tasks performed in this phase
• Obtain tokens from the lexical analyzer
• Checks if the expression is syntactically correct or not
• Report all syntax errors
• Construct a hierarchical structure which is known as a parse
tree
• The output of the syntax analysis phase (if there are no
syntax errors) could be a stream of atoms or syntax
trees.
• An atom is a primitive operation which is found in most
computer architectures, or which can be implemented
using only a few machine language instructions.
Purpose of Syntax analysis
• Validation:
Syntax analysis ensures that the input program's syntax
is correct according to the programming language's
grammar rules.
If the syntax is incorrect, the parser generates error
messages.
• Structure Representation:
It constructs a data structure, usually a parse tree or
abstract syntax tree (AST), which represents the
syntactical structure of the program.
Key concepts of a Parse Tree
• Nodes
• Root Node: Represents the start symbol of the grammar (usually the
whole expression or statement being parsed).
• Leaf Nodes: Represent terminal symbols, which are the actual tokens
or literals in the input string.
• Internal Nodes: Represent non-terminal (uppercase, set of
variables) symbols of the grammar, which are derived using the
production rules.
• Edges
• The edges in the parse tree connect nodes and represent the application
of production rules.
• Each edge corresponds to a derivation step in the grammar.
• Ambiguity
Parse tree for: (a + b)* c
• each interior node represents an operation
• the children of the node represent the arguments of the
operation
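The idea that interior nodes are operations and their children are arguments can be sketched in code. This is an illustrative sketch, not from the slides: the Node class and evaluate() helper are hypothetical names.

```python
# A minimal parse-tree sketch for the expression (a + b) * c.
# Interior nodes represent operations; children are their arguments.
class Node:
    def __init__(self, label, children=None):
        self.label = label              # operator or operand name
        self.children = children or []

def evaluate(node, env):
    """Evaluate children (the arguments) first, then apply the operator."""
    if not node.children:
        return env[node.label]          # leaf: look up the operand's value
    left = evaluate(node.children[0], env)
    right = evaluate(node.children[1], env)
    return left + right if node.label == "+" else left * right

# Tree for (a + b) * c: "*" at the root, "+" in the deeper sub-tree.
tree = Node("*", [Node("+", [Node("a"), Node("b")]), Node("c")])
print(evaluate(tree, {"a": 2, "b": 3, "c": 4}))  # (2 + 3) * 4 = 20
```

Note how the parenthesized sub-expression sits deeper in the tree, so it is evaluated first.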
Formal Grammars

• A formal grammar is a system of rules (called productions)
that control the order in which words may occur in a sentence.

• Take an English sentence:
Generation process for our example
<S> → <Noun> <Verb Phrase>
<Verb Phrase> → <Verb> <Adjective>
<Noun> → Learners | You | Alex
<Verb> → are | is
<Adjective> → awesome | good

So,
our grammar is (N, T, P, S), where N – nonterminals, T –
terminals, P – production rules, S – start symbol.

"Learners are awesome" can be generated, as can:

  "You are good"
  "Alex is good"

{<Noun>, <Verb>, <Adjective>, <Verb Phrase>} are variables, or
nonterminals.

{You, are, is, Alex, good} are terminals: fixed, specific constants.
• Our grammar is (N, T, P, S), where:

  N is a finite, non-empty set of Nonterminals
  T is a finite, non-empty set of Terminals
  S is a special non-terminal called the Start symbol
  P is a finite set whose elements are of the form a → B
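The (N, T, P, S) four-tuple for the English example above can be sketched directly as data. The dictionaries and the expand() helper below are illustrative, not part of the slides:

```python
import random

# The example grammar as a four-tuple (N, T, P, S).
N = {"S", "Noun", "VerbPhrase", "Verb", "Adjective"}
T = {"Learners", "You", "Alex", "are", "is", "awesome", "good"}
P = {
    "S": [["Noun", "VerbPhrase"]],
    "VerbPhrase": [["Verb", "Adjective"]],
    "Noun": [["Learners"], ["You"], ["Alex"]],
    "Verb": [["are"], ["is"]],
    "Adjective": [["awesome"], ["good"]],
}
S = "S"

def expand(symbol):
    """Rewrite nonterminals (using P) until only terminals remain."""
    if symbol in T:
        return [symbol]
    rhs = random.choice(P[symbol])      # pick one production for this nonterminal
    return [word for sym in rhs for word in expand(sym)]

print(" ".join(expand(S)))  # e.g. "Learners are awesome"
```

Every sentence this generator produces is in the language of the grammar; "Learners are awesome", "You are good", and "Alex is good" are all possible outputs.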
Parse Trees
• A parse tree pictorially shows how the start symbol of a
grammar derives a string in the language.
If nonterminal A has a production A → XYZ, then a parse
tree may have an interior node labeled A with three
children labeled X, Y, and Z, from left to right:
Parse tree for: A * B + C * D
Parse Tree for: -(id + id)
• A syntax analyzer or parser takes the input from a
lexical analyzer in the form of token streams. The parser
analyzes the source code (token stream) against the
production rules to detect any errors in the code. The
output of this phase is a parse tree.
Example: Construct a parse tree given the
following production rule and an input string
(i.e., check if the input is part of the rule)

Suppose the production rules for the Grammar of a


language are:

S  cAd
A  bc | a
Input string is “cad”
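The check above can be done with a small hand-written recursive-descent parser, one function per nonterminal. This is a sketch; the function names are illustrative:

```python
# Recursive-descent check for the grammar  S -> cAd,  A -> bc | a.
def parse_S(s, i):
    """Match S -> c A d starting at position i; return the next position or -1."""
    if i < len(s) and s[i] == "c":
        j = parse_A(s, i + 1)
        if j != -1 and j < len(s) and s[j] == "d":
            return j + 1
    return -1

def parse_A(s, i):
    # Try the alternative "bc" first, then fall back to "a".
    if s[i:i+2] == "bc":
        return i + 2
    if s[i:i+1] == "a":
        return i + 1
    return -1

def accepts(s):
    """The string is in the language iff S consumes all of it."""
    return parse_S(s, 0) == len(s)

print(accepts("cad"))   # True  -- derived as S => cAd => cad
print(accepts("cbcd"))  # True  -- uses A -> bc
print(accepts("cbd"))   # False -- no rule produces lone "b"
```

So "cad" is part of the language defined by the rule, via the derivation S ⇒ cAd ⇒ cad.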
• As you can see the parser can be used not only to check
for proper syntax, but to produce output as well. This
process is called syntax directed translation.

• Syntax-Directed Translation (SDT) is a method used in compiler design where the translation
of a source program is driven by the syntax of the program as defined by a context-free grammar.
The translation process is directed by the syntax, meaning that as the parser recognizes the
structure of the source code (using grammar rules), it simultaneously generates an output, often in
the form of intermediate code, abstract syntax trees (AST), or other representations.
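A tiny example of syntax-directed translation: semantic actions (here, arithmetic) fire as the parser recognizes each rule, so the output (a value) is produced during parsing. The grammar and helper names are illustrative, assuming E → T { + T }, T → F { * F }, F → digit:

```python
# Syntax-directed translation sketch: each grammar rule carries an action
# that computes a value as soon as the rule is recognized.
def sdt_eval(tokens):
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def eat():
        pos[0] += 1
    def F():
        v = int(peek()); eat()      # action for F -> digit:   F.val = digit
        return v
    def T():
        v = F()
        while peek() == "*":        # action for T -> T * F:   T.val = T.val * F.val
            eat(); v *= F()
        return v
    def E():
        v = T()
        while peek() == "+":        # action for E -> E + T:   E.val = E.val + T.val
            eat(); v += T()
        return v
    return E()

print(sdt_eval(list("2+3*4")))  # 14 -- the T rule makes * bind tighter than +
```

No separate tree is built here; the translation is driven entirely by the order in which the parser applies the grammar rules.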
Why do you need a Syntax analyzer
• Check if the code is grammatically valid
• The syntax analyser helps you to apply rules to the code
• Helps you to make sure that each opening brace has a corresponding
  closing brace
• Each declaration has a type, and that type must exist
• Helps you to detect all types of syntax errors
• Find the position at which the error occurred
• Clear & accurate description of the error
• Recovery from an error to continue and find further errors in the code
• Should not affect compilation of “correct” programs
• The parser must reject invalid texts by reporting syntax errors
Grammar, Languages, and
Pushdown machines
• Languages can be specified using regular expressions,
finite state machines, grammars
• A grammar is a list of rules which can be used to
produce or generate all the strings of a language, and
which does not generate any strings which are not in
the language.
• A grammar is a set of structural rules which describe a
language. Grammars assign structure to any sentence
• A grammar consists of:
• A finite set of characters, called the input alphabet, the input
symbols, or terminal symbols.
• A finite set of symbols, distinct from the terminal symbols, called
nonterminal symbols, exactly one of which is designated the
starting nonterminal (if no nonterminal is explicitly designated
as the starting nonterminal, it is assumed to be the nonterminal
defined in the first rule).
• A finite list of rewriting rules, also called productions, which
define how strings in the language may be generated. Each of
these rewriting rules is of the form a → b, where a and b are
arbitrary strings of terminals and nonterminals, and a is not null.
Terminal vs Nonterminal symbols
• Terminal symbols (in short: terminals) are the symbols from
which the programs of the programming language are built.
• Examples for such symbols are reserved keywords of the language, or
• symbol classes such as identifiers which comprise a set of symbols.
• Nonterminals of the grammar denote sets of words that can
be produced from them by means of the production rules of
the grammar.
• Nonterminal symbols are terms that have to be defined further
somewhere in the grammar (they have to appear on a left-hand side),
and terminal symbols such as tokens which need no further definition.
The nonterminal nodes form interior nodes in the parse tree; the
terminal nodes are all at the ends of branches.
• The grammar specifies a language in the following way:
beginning with the starting nonterminal, any of the
rewriting rules are applied repeatedly to produce a
sentential form, which may contain a mix of terminals
and nonterminals.
If at any point, the sentential form contains no
nonterminal symbols, then it is in the language of this
grammar.

• If G is a grammar, then we designate the language
specified by this grammar as L(G).
Derivation in Compilers
• The sequence of the production rules which are used
for obtaining the input string is known as derivation.
• A derivation is a sequence of rewriting rules, applied to
the starting nonterminal, ending with a string of
terminals. A derivation thus serves to demonstrate that
a particular string is a member of the language
Example
Grammar G1 consists of 4 rules, the terminal symbols
{0,1}, and the starting nonterminal, S, see the figure

Now, an example of a derivation using G1 is:

  S ⇒ 0S0 ⇒ 00S00 ⇒ 001S100 ⇒ 0010100

Thus 0010100 is in L(G1); it is one of the strings in the language
of the grammar G1.
L(G1) = {0, 1, 000, 010, 101, 111, 00000, …}
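The figure with G1's rules is not reproduced in this text, but the derivation and the listing of L(G1) are consistent with the rules S → 0S0 | 1S1 | 0 | 1 (odd-length binary palindromes), which this sketch assumes:

```python
# Membership check for L(G1), ASSUMING G1 is  S -> 0S0 | 1S1 | 0 | 1,
# i.e. odd-length binary palindromes (consistent with the derivation
# S => 0S0 => 00S00 => 001S100 => 0010100 shown above).
def in_L_G1(s):
    if s in ("0", "1"):                  # base rules S -> 0 and S -> 1
        return True
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "01":
        return in_L_G1(s[1:-1])          # undo S -> 0S0 or S -> 1S1
    return False

print(in_L_G1("0010100"))  # True, matching the derivation above
print(in_L_G1("0011"))     # False
```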
• An input sentence can be recognized using this
grammar, with a series of replacements, as follows:
  (1) Start out with the topmost symbol in the grammar, the
      goal symbol.
  (2) Replace that symbol with one of its right-hand sides.
  (3) Continue replacing nonterminals, always replacing the
      leftmost nonterminal with its right-hand side, until there
      are no more nonterminals.
• Let the terminal symbols be {a,b} (Ꜫ represents the null string
and is not a terminal symbol).
Given G2 as shown in the figure, do the derivation:

  S ⇒ ASB ⇒ AASBB ⇒ AaSBB ⇒ AaBB ⇒ AaBb ⇒ Aabb ⇒ aabb

Thus aabb is in L(G2).

G2 specifies the set of all strings of a’s and b’s
which contain the same number of a’s as b’s and in which all the
a’s precede all the b’s. Note that the null string is permitted in a
rewriting rule.
L(G2) = {Ꜫ, ab, aabb, aaabbb, aaaabbbb, …} = {aⁿbⁿ : n ≥ 0}
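The G2 figure is likewise not reproduced here; assuming (consistently with the derivation shown) the rules S → ASB | Ꜫ, A → a, B → b, membership in L(G2) = {aⁿbⁿ : n ≥ 0} can be checked by peeling matched a…b pairs:

```python
# Membership check for L(G2) = { a^n b^n : n >= 0 }, ASSUMING the rules
# S -> ASB | e,  A -> a,  B -> b  (consistent with S => ASB => ... => aabb).
def in_L_G2(s):
    if s == "":                  # S -> e : the null string is in the language
        return True
    if s.startswith("a") and s.endswith("b"):
        return in_L_G2(s[1:-1])  # undo one S -> ASB step (one a, one b)
    return False

for w in ["", "ab", "aabb", "aab", "ba"]:
    print(repr(w), in_L_G2(w))
```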
• Two grammars, g1 and g2, are said to be equivalent if
L(g1) = L(g2); i.e., they specify the same language.

• Convention (can be changed):
all lower case letters and numbers are terminal symbols,
and all upper case letters (or words which begin with an
upper case letter) are nonterminal symbols.
The starting nonterminal is always S unless otherwise
specified.
Each of the grammars will be numbered (G1, G2, G3, ...)
for reference purposes.
Example:
Show three different derivations
using the grammar shown below:
Solution
S⇒aSA⇒aBAA⇒aBabA⇒aBabab⇒
abAabab⇒abababab

S⇒aSA⇒aSab⇒aBAab⇒abAAab⇒
ababAab⇒abababab

S⇒BA⇒bAA⇒babA⇒babab
Derivation (revisited)
• A derivation is basically a sequence of production rules,
in order to get the input string.
• During parsing, we take two decisions for some
sentential form of input:
• Deciding the non-terminal which is to be replaced.
• Deciding the production rule, by which, the non-terminal will
be replaced.
• To decide which non-terminal to be replaced with
production rule, we can have two options.
• Left-most derivation
• Right-most derivation
Left-most derivation & right-most
derivation
• If the sentential form of an input is scanned and
replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most
derivation is called the left-sentential form.

• If we scan and replace the input with production rules,


from right to left, it is known as right-most derivation.
The sentential form derived from the right-most
derivation is called the right-sentential form.
• Left-most Derivation
• When the sentential form of input is scanned and replaced in
left to right sequence, it is known as left-most derivation. The
sentential form which is derived by the left-most derivation is
called the left-sentential form.
• In left-most derivation, we always replace the non-terminal to
the far left first.
• Right-most Derivation
  • When the sentential form of input is scanned and replaced in
    right-to-left sequence, it is known as right-most derivation. The
    sentential form which is derived from the right-most derivation is
    known as the right-sentential form.
Example: given the production rules
and the input string in the fig, find
out the derivations
• Left-most derivation
  E ⇒ E * E
    ⇒ E + E * E
    ⇒ id + E * E
    ⇒ id + id * E
    ⇒ id + id * id

• Notice that the left-most non-terminal is always processed first.
• Right-most derivation
  E ⇒ E + E
    ⇒ E + E * E
    ⇒ E + E * id
    ⇒ E + id * id
    ⇒ id + id * id
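The two derivations above can be reproduced mechanically: at each step, replace the left-most (or right-most) occurrence of E with a chosen right-hand side. The grammar assumed is E → E + E | E * E | id; the step() helper is illustrative:

```python
# Replace one occurrence of the nonterminal "E" in a sentential form.
def step(form, rhs, leftmost=True):
    if leftmost:
        i = form.index("E")                           # left-most E
    else:
        i = len(form) - 1 - form[::-1].index("E")     # right-most E
    return form[:i] + rhs + form[i+1:]

# Left-most derivation of id+id*id
form = "E"
for rhs in ["E*E", "E+E", "id", "id", "id"]:
    form = step(form, rhs, leftmost=True)
    print(form)   # E*E, E+E*E, id+E*E, id+id*E, id+id*id

# Right-most derivation of the same string
form = "E"
for rhs in ["E+E", "E*E", "id", "id", "id"]:
    form = step(form, rhs, leftmost=False)
    print(form)   # E+E, E+E*E, E+E*id, E+id*id, id+id*id
```

Both derivations end in the same terminal string; only the order of replacements differs.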
Imagine that we have the following
grammar
Parse Tree (revisited)
• A parse tree is a graphical depiction of a derivation. It is
convenient to see how strings are derived from the start
symbol. The start symbol of the derivation becomes the
root of the parse tree
• We take the left-most derivation of a + b * c
• The left-most derivation is:
  E ⇒ E * E
    ⇒ E + E * E
    ⇒ id + E * E
    ⇒ id + id * E
    ⇒ id + id * id
Steps to draw parse tree
E ⇒ E * E
  ⇒ E + E * E
  ⇒ id + id * id
• In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives original input string.
• A parse tree depicts associativity and precedence of
operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence
over the operator which is in the parent nodes.
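The two traversal claims above can be demonstrated with a small sketch: in-order traversal reproduces the input string, and post-order visits the deepest sub-tree (the one holding *) before the + at the root. The Node class and traversal helpers are illustrative:

```python
# Traversals of a parse tree for  id + id * id.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def inorder(n):
    """Left child, node, right child: reproduces the original input."""
    if not n.children:
        return [n.label]
    return inorder(n.children[0]) + [n.label] + inorder(n.children[1])

def postorder(n):
    """Children first, then the node: deepest operator comes out first."""
    out = []
    for c in n.children:
        out += postorder(c)
    return out + [n.label]

# "+" at the root, "*" in the deeper sub-tree, so * takes precedence.
tree = Node("+", [Node("id"), Node("*", [Node("id"), Node("id")])])
print(" ".join(inorder(tree)))    # id + id * id   (original input)
print(" ".join(postorder(tree)))  # id id id * +   (* before +)
```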
• Terminals (constants) are leaf nodes.
• Terminals are the basic symbols from which strings are
formed.
• Terminals are the elementary symbols of the language defined
by the grammar
• Non-terminals (variables) are interior nodes.
• Non-terminals are syntactic variables that denote sets of
strings. The sets of strings denoted by nonterminals help
define the language generated by the grammar.
• Nonterminals impose a hierarchical structure on the language
that is key to syntax analysis and translation
List out the terminals, nonterminals
and start symbol
• G:
  E → E + T
  E → E – T
  E → T        (read this as: E derives T)
  T → T * F
  T → T / F
  T → F
  F → ( E )
  F → id
Ambiguity
• A grammar G is said to be ambiguous if it has more than one
parse tree (left or right derivation) for at least one string.
Example:
  E → E + E
  E → E – E
  E → id

• For the string id + id - id, the above grammar generates two
parse trees
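The ambiguity can be shown concretely: the same string id + id - id has two different left-most derivations (hence two parse trees) under E → E + E | E - E | id. The helpers below are illustrative:

```python
# Two distinct left-most derivations of the SAME string under the
# ambiguous grammar  E -> E + E | E - E | id.
def step(form, rhs):
    i = form.index("E")                    # always rewrite the left-most E
    return form[:i] + rhs + form[i+1:]

def derive(choices):
    form = "E"
    seq = [form]
    for rhs in choices:
        form = step(form, rhs)
        seq.append(form)
    return seq

one = derive(["E+E", "id", "E-E", "id", "id"])  # "+" at the root of the tree
two = derive(["E-E", "E+E", "id", "id", "id"])  # "-" at the root of the tree
print(" => ".join(one))
print(" => ".join(two))
assert one[-1] == two[-1] == "id+id-id"   # same string, different derivations
```

Because one string admits more than one left-most derivation, the grammar is ambiguous.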
Classes of Grammars
• Unrestricted Grammar
• Context-sensitive Grammar (productions may depend on the
  context surrounding a nonterminal)
• Context-free Grammar (our focus here)
• Right Linear Grammar