Letter Manipulation
What is Pattern Searching
• Pattern searching in Data Structures and Algorithms
(DSA) is a fundamental concept that involves
searching for a specific pattern or sequence of
elements within a given data structure.
• This technique is commonly used in string matching
algorithms to find occurrences of a particular pattern
within a text or a larger string.
• Pattern searching plays a crucial role in tasks such as
text processing, data retrieval, and computational
biology.
• Text Processing: Searching for keywords in a document, finding and replacing
text, spell checking, and plagiarism detection.
• Information Retrieval: Finding relevant documents in a database, web search,
and data mining.
• Bioinformatics: Searching for DNA sequences in a genome, protein analysis,
and gene expression analysis.
• Network Security: Detecting malicious patterns in network traffic, intrusion
detection, and malware analysis.
Applications
Data Mining: Identifying patterns in large datasets, customer
segmentation, and fraud detection.
Important Pattern Searching
Algorithms:
Naive String Matching
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Aho-Corasick algorithm
Examples
• Input: T[] = “THIS IS A TEST TEXT”, P[] = “TEST”
Output: Pattern found at index 10
• Input: T[] = “AABAACAADAABAABA”, P[] = “AABA”
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
Brute Force-Complexity
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares pattern to each substring of text of length M.
For example, M=5.
• This kind of case can occur for image data.
Total number of comparisons: M (N-M+1)
8
Worst case time complexity: O(MN)
Brute Force-
Complexity(cont.)
• Given a pattern M characters in length, and a text N characters in
length...
• Best case if pattern found: Finds pattern in first M positions of text.
For example, M=5.
Total number of comparisons: M
Best case time complexity: O(M) 9
Brute Force-
Complexity(cont.)
• Given a pattern M characters in length, and a text N characters in length...
• If pattern not found: Always mismatch on first character. For example, M=5.
Total number of comparisons: N
10
time complexity: O(N)
Rabin-Karp Algorithm
• The Rabin Karp Algorithm is string searching
algorithm that uses hashing to find the patterns
in strings
• It make use of hash functions and rolling hash
technique
Rabin-Karp
• The Rabin-Karp string searching algorithm calculates a hash
value for the pattern, and for each M-character subsequence of
text to be compared.
• If the hash values are unequal, the algorithm will calculate the
hash value for next M-character sequence.
• If the hash values are equal, the algorithm will do a Brute Force
comparison between the pattern and the M-character
sequence.
• In this way, there is only one comparison per text subsequence,
and Brute Force is only needed when hash values match.
12
Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100
13
Rabin-Karp Algorithm
pattern is M characters long
hash_p=hash value of pattern
hash_t=hash value of first M letters in body of text
do
if (hash_p == hash_t)
brute force comparison of pattern and selected section of
text
hash_t= hash value of next section of text, one character over
while (end of text)
14
What is the hash
function used to
calculate values for
character sequences?
Hash Function
• Let b be the number of letters in the alphabet. The text subsequence t[i .. i+M-1] is
mapped to the number
• Furthermore, given x(i) we can compute x(i+1) for the next
subsequence t[i+1 .. i+M] in constant time, as follows:
• In this way, we never explicitly compute a new value. We
simply adjust the existing value as we move over one
16
character.
Rabin-Karp Mods
• If M is large, then the resulting value (~bM) will be enormous. For this reason, we hash the value
by taking it mod a prime number q.
• The mod function is particularly useful in this case due to several of its inherent properties:
[(x mod q) + (y mod q)] mod q = (x+y) mod q
(x mod q) mod q = x mod q
• For these reasons:
h(i)=((t[i] bM-1 mod q) +(t[i+1] bM-2 mod q) + … +(t[i+M-1] mod q))mod q
h(i+1) =( h(i) b mod q
Shift left one digit
-t[i] bM mod q
Subtract leftmost digit
+t[i+M] mod q )
Add new rightmost digit
mod q
17
Rabin-Karp Algorithm
• Given a text T[0. . .n-1] and a pattern P[0. . .m-1]
Write a function search(char P[], char T[]) that prints all occurrences of
P[] present in T[] using Rabin Karp algorithm.
Note:Assume that n > m.
Limitations