Lexical Analysis
• Scan input program to identify valid words, removes comments, extra
white space.
• How do we specify the valid words of a language?
• Regular expression.
• How do we check if sequence of character matches the valid
words of a language?
• Finite Automata.
• token (also called word) –> set of strings defining an atomic element
with a defined meaning
• pattern -> a rule describing a set of string (specified using regular
expression)
• lexeme -> a sequence of characters that match some pattern
• symbol -> the recognized token
• At the first occurrence of the symbol, entry is made in symbol
table
• Additional information (attributes) about the symbol may be
added by the parser
Examples
Token Pattern Sample
Lexeme
while while while
relation_op = | != | < | > <
integer (0-9)* 42
string Characters “hello”
between “ “
Tokens
• Keywords, operators, identifiers (names), constants, literal strings, punctuation symbols such as
parentheses, brackets, commas, semicolons, and colons, etc.
• A unique integer representing the token is passed by LA to the parser
• Attributes for tokens (apart from the integer representing the token)
• identifier: the lexeme of the token, or a pointer into the symbol table where the lexeme is
stored by the LA
• intnum: the value of the integer (similarly for floatnum, etc.)
• string: the string itself
• The exact set of attributes are dependent on the compiler designer
Challenges in lexical analysis
• Certain languages do not have any reserved words, e.g., while, do, if, else, etc., are reserved in ’C’, but not in
PL/1
Example of using do loop in FORTRAN
• In FORTRAN, some keywords are context-dependent
• In the statement, DO 10 I = 10.86
• DO10I is an identifier, and DO is not a keyword
• But in the statement, DO 10 I = 10, 86
• DO is a keyword
• Such features require substantial look ahead for resolution
• Example above -> we cannot be sure until we see the comma (after 10) that DO is a keyword
• Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but not so in ’C’
• Lexical analysis cannot catch any significant errors except for simple errors such as, illegal symbols, etc.
• In such cases, lexical analysis skips characters in the input until a well-formed token is found
Languages
• Symbol: An abstract entity, not defined
• Examples: letters {a,b,c,…,z} and digits {0,1,..,9}
• String: A finite sequence of symbols
• abcb, caba are strings over the symbols {a,b,c}
• |w| is the length of the string w, and is the #symbols in it
• ∊ is the empty string and is of length 0
• Alphabet: A finite set of symbols (e.g., {a,b,c,…,z}, {0,1,..,9} )
• Language: A set of strings of symbols from some alphabet
• Φ (empty language) and {∊} (set with empty string) are languages
• The set of palindromes over {0,1} is an infinite language
• The set of strings, {01, 10, 111} over {0,1} is a finite language
• If Σ is an alphabet, Σ∗ is the set of all strings over Σ
• We need a ‘finite representation’ (encoded by finite string) for a language
• Regular language (or type-3) is represented by Regular expression
• Context-free language (or type-2) is represented by a Context-free grammar
• Context-sensitive language (or type-1) is represented by a Context-sensitive grammar
• type-0 language is represented by type-0 grammar
Regular Expressions (REs)
• Let Σ be an alphabet. The REs over Σ and the languages they denote (or generate) are defined as
below
• φ (empty language/set, not even empty string) is an RE. L(φ) = φ
• ∊ (empty string) is an RE. L(∊) = {∊}
• For each a ∈ Σ, a is an RE. L(a) = {a}
• E.g., Σ ={1,2,3,4} then each 1,2,3,4 are REs, L(1)={1}, L(2)={2},…
• If r and s are REs (not symbol) denoting the languages R and S, respectively
• Concatenation: (rs) is an RE, L(rs) = R.S = {xy | x ∈ R ∧ y ∈ S}
• Union: (r + s) (or (r|s)) is an RE, L(r + s) = R ∪ S
• Kleene closure/closure: (r∗) is an RE, L(r∗) = R∗ = ⋃ 𝑅𝑖
(L∗ is called the Kleene closure or closure of L)
Examples of Regular Expressions
• Given L = set of all strings of 0’s and 1’s, what is the RE?
• r = (0 + 1)* or (0|1)*
• How do we generate the string 101 ?
• (0 + 1) ∗ ⇒ (0 + 1)(0 + 1)(0 + 1) ⇒ 101
• Given L = set of all strings of 0’s and 1’s, with at least two consecutive 0’s, what is the RE?
• r = (0 + 1) ∗00(0 + 1) ∗
• Given L = {w ∈ {0, 1} ∗ | w has two or three occurrences of 1, the first and second of which are not
consecutive}, what is the RE?
• r = 0∗10∗010∗ (10∗+ ∊)
• Given r = (1 + 10)∗, what is the language?
• L = set of all strings of 0’s and 1’s, beginning with 1 and not having two consecutive 0’s
• Given r = (0 + 1)∗011, what is the language?
• L = set of all strings of 0’s and 1’s ending in 011
Examples of Regular Expressions
• Given r = c∗(a + bc∗)∗ , what is the language?
• L = set of all strings over {a,b,c} that do not have the substring ac
• Given L = {w | w ∈ {a, b}∗ ∧ w ends with a} , what is the RE?
• r = (a + b)∗a
• Given L = {if, then, else, while, do, begin, end}, what is the RE?
• r = if + then + else + while + do + begin + end
Automata
• Automata are machines (abstract machines) that accept languages
• Finite State Automata accept RLs (corresponding to REs)
• Pushdown Automata accept CFLs (corresponding to CFGs)
• Linear Bounded Automata accept CSLs (corresponding to CSGs)
• Turing Machines accept type-0 languages (corresponding to type-0 grammars)
• Applications of Automata
• Switching circuit design
• Lexical analyzer in a compiler
• String processing (grep, awk), etc.
• State charts used in object-oriented design
• Modelling control applications, e.g., elevator operation
• Parsers of all types
• Compilers
Finite State Automaton (FSA)
• An FSA is an acceptor or recognizer of regular languages
• An FSA is a 5-tuple, (Q, Σ, δ, q0, F), where
• Q is a finite set of states
• Σ is the input alphabet
• δ is the transition function, δ : Q × Σ → Q
• That is, δ(q, a) is a state for each state q and input symbol a
• q0 is the start state
• F is the set of final or accepting states
• In one move from some state q, an FSA reads an input symbol, changes the state based on δ, and
gets ready to read the next input symbol
• An FSA accepts its input string, if starting from q0, it consumes the entire input string, and
reaches a final state
• If the last state reached is not a final state, then the input string is rejected
FSA example
• Q = {q0, q1, q2, q3} -> finite set of states
• Σ = {a, b, c} -> the input alphabet
• q0 is the start state and F = {q0, q2} (F -> set
of final or accepting states)
• The transition function δ is defined by the
table below
• Language accepted by the FSA?
• is the set of all strings beginning with an
’a’ and ending with a ’c’ ( is also
accepted)