Unit 1 Slides
CSE 353
Unit 1 syllabus
Course Outcomes
Code  Course Outcome (CO)
CO3   Compare top-down with bottom-up parsers, and develop an appropriate
parser to produce a parse-tree representation of the input.
CO4   Design syntax-directed translation schemes for a given context-free
grammar.
CO5   Generate intermediate code for statements in a high-level language;
explain the benefits and limitations of automatic memory management.
The Language Processing System (from pure HLL source to a running program):
• Pre-Processor – The pre-processor handles all the #include directives by replacing
each with the contents of the named file (file inclusion) and all the #define directives
using macro expansion. It performs file inclusion, augmentation, macro-processing, etc.
• Assembly Language – It is neither in binary form nor high level. It is an intermediate
state: a combination of machine instructions and some other useful data needed for
execution.
• Assembler – For every platform (hardware + OS) there is a separate assembler;
assemblers are not universal, since each platform has its own. An assembler translates
assembly language into machine code; its output is called an object file.
• Interpreter – An interpreter converts a high-level language into low-level machine
language, just like a compiler, but the two differ in how they read the input. The
compiler reads the input in one go, does the processing, and executes the source code,
whereas the interpreter does the same line by line. A compiler scans the entire
program and translates it as a whole into machine code, whereas an interpreter
translates the program one statement at a time. Interpreted programs are usually
slower than compiled ones.
• Relocatable Machine Code – Code that can be loaded at any address and run. The
addresses within the program are arranged in such a way that the program can be
moved around in memory.
• Loader/Linker – Converts the relocatable code into absolute code and tries to
run the program, resulting in a running program or an error message (or sometimes
both). The linker combines a variety of object files into a single file to make it
executable. The loader then loads it into memory and executes it.
Compiler
A compiler is a translator program that takes a program written in a high-level
language (HLL), the source program, and translates it into an equivalent program
in machine-level language (MLL), the target program. An important part of a
compiler's job is reporting errors to the programmer.
Structure of Compiler
Executing a program written in an HLL programming language basically consists of two
parts: the source program must first be compiled (translated) into an object program;
then the resulting object program is loaded into memory and executed.
Phases of a Compiler (with the representation passed between phases):
Source Program → Lexical Analyzer → Stream of Tokens → Syntax Analyzer →
Parse Tree → Semantic Analyzer → Refined Parse Tree → Intermediate Code
Generator → 3-Address Code → Code Optimizer → Optimized Code → Code
Generator → Target Program
CODE GENERATION:
•It is the final phase of the compiler.
•It gets input from the code optimization phase and produces the target code or
object code as its result.
•Intermediate instructions are translated into a sequence of machine instructions
that perform the same task.
•Code generation involves:
 - allocation of registers and memory
 - generation of correct references
 - generation of correct data types
 - generation of missing code
SYMBOL TABLE MANAGEMENT:
•Symbol table is used to store all the information about identifiers used in the program.
•It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
•It allows the compiler to find the record for each identifier quickly, and to store
or retrieve data from that record.
•Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
ERROR HANDLING:
•Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
•In lexical analysis, errors occur in separation of tokens.
•In syntax analysis, errors occur during construction of syntax tree.
•In semantic analysis, errors occur when the compiler detects constructs with the
right syntactic structure but no meaning, e.g. errors found during type conversion.
•In code optimization, errors occur when the result is affected by the optimization.
In code generation, errors are reported when code is missing, etc.
Lexical Analyzer
Lexical Analysis is the first phase of the compiler; the lexical analyzer is also known
as a scanner. It converts the high-level input program into a sequence of tokens.
•Lexical analysis can be implemented with a deterministic finite automaton (DFA).
•The output is a sequence of tokens that is sent to the parser for syntax analysis.
•Its task is to read the input characters and produce as output a sequence of tokens
that the parser uses for syntax analysis
•Upon receiving a “get next token” command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA).
The process of forming tokens from an input stream of characters is called
tokenization.
Consider this expression in the C programming language: sum=3+2;

Lexeme   Token Type
sum      Identifier
=        Assignment Operator
3        Number
+        Addition Operator
2        Number
;        End of statement
The function of lexical analysis is to tokenize and separate them out from the
program or statement.
[Transition diagram for identifiers: start state 0 --letter--> state 1;
state 1 loops on letter/digit; state 1 --delimiter--> accepting state 2]
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
Solution
Question 2:
int max(x,y)
int x, y;
/* find max of x and y*/
{
return(x>y ? x:y);
}
Solution:
int max ( x , y )
int x , y ; { return
( x > y ? x :
y ) ; }
Question 3: What is the difference between the phases and passes of a compiler?
A pass refers to the number of times the compiler goes through the source
code.
Multi-pass compiler goes through the source code several times. In other
words, it allows the source code to pass through each compilation unit several
times. Each pass takes the result of the previous pass as input and creates
intermediate outputs. Therefore, the code improves in each pass. The final
code is generated after the final pass.
The main difference between phases and passes of compiler is that phases are the
steps in the compilation process while passes are the number of times
the compiler traverses through the source code.
Bootstrapping and Cross Compiler
Bootstrapping is a process in which a simple language is used to translate a
more complicated program, which in turn may handle a far more complicated
program, and so on.
Native Compiler                                  Cross Compiler
It is used to build programs for the same        It is used to build programs for another
system/machine & OS it is installed on.          system/machine, like AVR/ARM.
It can generate an executable file like .exe     It can generate raw code, e.g. .hex
Execution of Finite Automata
• A DFA can take only one path through the state graph
 – completely determined by the input
Acceptance of NFAs
• An NFA can get into multiple states
[Diagram: a small NFA over {0,1}; on the input 1 0 1 the automaton can be
in several states at once]
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of
languages (regular languages)
NFA vs. DFA (2)
• For a given language the NFA can be simpler than
the DFA
[Diagram: an NFA for a language over {0,1}, and the equivalent DFA for the
same language, which needs more states and transitions]
Regular Expressions to Finite Automata
• High-level sketch:
  Lexical Specification → Regular Expressions → NFA → DFA
Regular Expressions to NFA (1)
• For each kind of rexp, define an NFA
– Notation: NFA for rexp A
• For ε: a single ε-transition from the start state to an accepting state
• For input a: a single transition labeled a from the start state to an
  accepting state
• For AB: the accepting state of the NFA for A is connected by an
  ε-transition to the start state of the NFA for B
• For A | B: a new start state has ε-transitions into the NFAs for A and
  for B; their accepting states have ε-transitions into a new common
  accepting state
Example of RegExp -> NFA conversion
• Consider the regular expression
(1 | 0)*1
• The NFA (states A–J; one consistent reading of the figure):
  A -ε-> B,  A -ε-> H,  B -ε-> C,  B -ε-> D,
  C -1-> E,  D -0-> F,  E -ε-> G,  F -ε-> G,
  G -ε-> A,  H -ε-> I,  I -1-> J, with J accepting
NFA to DFA. The Trick
• Simulate the NFA
• Each state of resulting DFA
= a non-empty subset of states of the NFA
• Start state
= the set of NFA states reachable through ε-moves from
NFA start state
• Add a transition S --a--> S' to the DFA iff
 – S' is the set of NFA states reachable from the states in S after seeing
   the input a (considering ε-moves as well)
NFA -> DFA Example
(Using the NFA for (1|0)*1 from the previous example.) The resulting DFA:
  start state: ABCDHI (the ε-closure of A)
  ABCDHI    --0--> FGABCDHI    ABCDHI    --1--> EJGABCDHI
  FGABCDHI  --0--> FGABCDHI    FGABCDHI  --1--> EJGABCDHI
  EJGABCDHI --0--> FGABCDHI    EJGABCDHI --1--> EJGABCDHI
  EJGABCDHI is accepting, since it contains the NFA accepting state J
Issues in Lexical Analysis
Lexical analysis is the process of producing tokens from the source program.
It has the following issues:
•Lookahead
•Ambiguities
Lookahead
•Lookahead is required to decide where one token ends and the next token
begins. Simple examples with lookahead issues are i vs. if, and = vs. ==.
Therefore, a way to describe the lexemes of each token is required.
•A way is also needed to resolve ambiguities:
 • Is if two variables i and f, or the keyword if?
 • Is == two equal signs =, =, or one ==?
•Hence, the amount of lookahead to be considered, and a way to describe the
lexemes of each token, are both needed.
Ambiguities
• Among rules which matched the same number of characters, the rule given
first is preferred.
Lexical Errors
A character sequence that cannot be scanned into any valid
token is a lexical error. Important facts about lexical errors:
▪ Lexical errors are not very common, but they should be managed by the
scanner