Lexical Analysis
Lexical Analysis
Lecture 02
Role of the Lexical Analyzer
Next_char() Next_token()
Input Scanner Parser
character token
Symbol
Table
Interaction of Lexical Analyzer with Parser
Next_char() Next_token()
Input
Scanner Parser
character token
Symbol
Table
• Source:
if ( x == -3.1415 ) /* test x */ then ...
• Token Stream:
< IF >
< LPAREN >
< ID, “x” >
< EQUALS >
< NUM, -3.14150000 >
< RPAREN >
< THEN >
...
Token Attributes
• More than one lexeme matches a pattern
– We need attribute to distinguish them
• e.g. “tok_relation” matches “< or <= or = or != or >= or >”
• < tok_integer_constant, 1415 >
assembly language
system-programming language
lexical-analyzer generator
Harder to Faster
implement
Managing Input Buffers
.. E = M * C * * 2 ..
forward
lexeme_beginning
.. .. E = M * C * * 2 \0
forward
• Deficiency: lexeme_beginning
Code:
if (forward at end of buffer1) then
relaod buffer2;
forward = forward +1;
else if (forward at end of buffer2) then
relaod first half;
move forward to the beginning of the buffer1
else
forward = forward +1
Sentinels
.. .. E = M * \0 C * * 2 \0 \0
forward
lexeme_beginning
forward++;
if *forward == ‘\0’ then
if forward at end of buffer #1 then
Read next N bytes into buffer #2;
forward = address of first char of buffer #2;
elseIf forward at end of buffer #2 then
Read next N bytes into buffer #1;
forward = address of first char of buffer #1;
else
// do nothing; a real \0 occurs in the input
endIf
endIf
Terminology
• Alphabet (∑) : AKA character class
– A set of symbols (“characters”)
– Examples: ∑ = { 1, 0} : binary alphabet
∑ = { 1, 2, 3, 4, 5, 6 } : Alphabet on dice outcome
• String : AKA Sentence or word
– Sequence of symbols
– Finite in length
– Example: abbadc Length of s = |s|
• Empty String (ε)
– It is a string
– |ε|=0
• Language
– A set of strings over some fixed alphabet Each string is finite in length,
– Examples: L1 = { a, baa, bccb } but the set may have an infinite
L2 = { } number of elements.
Note the difference
L3 = {ε}
L4 = {ε, ab, abab, ababab, abababab,... }
Terminology
• Prefix ...of string s
– String obtained by removing zero or more trailing symbols
– s = hello
– Prefixes: ε, h, he, hel, hell, hello
• Suffix ...of string s
– String obtained by deleting zero or more of the leading symbols
– s = hello
– Suffixes: hello, ello, llo, lo, o, ε
• Substring ...of string s
– String obtained by deleting a prefix and a suffix
– s = hello
– Substrings: ε. ell, hel, llo, hello, ….
• Proper prefix / suffix / substring ... of s
– String s1 that is respectively prefix, suffix, or substring of s such that s1 ≠ s
and s1 ≠ ε
• Subsequence….of string s
– String formed by deleting zero or more not necessarily contiguous symbols
– S=hello
– Subsequence: hll, eo, hlo, etc.
Terminology
Terminology
• Language
– A set of strings
– L = { ... }
– M = { ... }
• Union of two languages
– L ∪ M = { s | s is in L or is in M }
– Example:
– L = { a, ab }
– M = { c, dd }
– L ∪ M = { a, ab, c, dd }
• Concatenation of two languages
– L M = { st | s is in L and t is in M }
– Example:
– L = { a, ab }
– M = { c, dd }
– L M = { ac, add, abc, abdd }
Kleene closure
Positive closure
• Let: L = { a, bc }
• Example: L0 = { ε }
L1 = L = { a, bc }
L2 = LL = { aa, abc, bca, bcbc }
L3 = LLL = { aaa, aabc, abca, abcbc, bcaa, bcabc, bcbca, bcbcbc }
...etc...
LN = LN-1L = LLN-1
• The “Positive Closure” of a language:
∞
L+ = U Li = L1 ∪ L2 ∪ L3 ∪ K
i =1
Note ε is not included UNLESS it is
• Example: in L to start with
• L+ = { a, bc, aa, abc, bca, bcbc, aaa, aabc, abca, abcbc, ... }
L1 L2 L3
Example
Let: L = { a, b, c, ..., z }
D = { 0, 1, 2, ..., 9 }
• (Over alphabet ∑)
• ε is a regular expression.
• (Over alphabet ∑)
• And, given a regular expression R, what is L(R) ?
• ε is a regular expression.
– L(ε) = { ε }
• If a is a symbol (i.e., if a∈ ∑, then a is a regular expression.
– L(a) = { a }
• If R and S are regular expressions, then R|S is a regular
expression.
– L(R|S) = L(R) ∪ L(S)
• If R and S are regular expressions, then RS is a regular
expression.
– L(RS) = L(R) L(S)
• If R is a regular expression, then R* is a regular expression.
– L(R*) = (L(R))*
• If R is a regular expression, then (R) is a regular expression.
How to “Parse” Regular Expressions
• Precedence:
– * has highest precedence.
– Concatenation as middle precedence.
– | has lowest precedence.
– Use parentheses to override these rules.
• Examples:
– a b* = a (b*)
• If you want (a b)* you must use parentheses.
– a | b c = a | (b c)
• If you want (a | b) c you must use parentheses.
• Examples:
– a | b | cab {a, b, cab}
– b* {ε, b, bb, bbb, ...}
– a | b* {a, ε, b, bb, bbb, ...}
– (a | b)* {ε, a, b, aa, ab, ba, bb, aaa, ...}
“Set of all strings of a’s and b’s, including ε.”
Equality vs Equivalence
where
• Each di is a new symbol such that di ∉ Σ and di ≠dj
where j < I
• Each ri is a regular expression over Σ ∪ {d1,d2,…,di-1)
Regular Definition
Addition Notation / Shorthand
Nonregular sets
Problem: How to describe tokens?
Solution: Regular Expressions
Scanner Generators
Input: Sequence of regular definitions
Output: A lexer (e.g., a program in Java or “C”)
Approach:
– Read in regular expressions
– Convert into a Finite State Automaton (FSA)
– Optimize the FSA
– Represent the FSA with tables / arrays
– Generate a table-driven lexer (Combine “canned” code with tables.)