Aditya W. Mahastama, S.Kom, M.Sc.
Dr. phil. Lucia D. Krisnawati
OCR Post-Correction
13 Mei 2024
Outline
Recap
Introduction
OCR Errors
OCR Post-Correction Methods
Dynamic programming
Bigram model
Introduction
(Full Text of “The History of Java”, British National Museum, https://archive.org/stream/historyjava01raffgoog/historyjava01raffgoog_djvu.txt)
Introduction
(Google Books, The History of Java, https://books.google.co.id/books?id=gJEC2q7DzpQC&printsec=frontcover&hl=id&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)
OCR Post-Correction
OCR Errors
2 types of OCR errors:
Non-word errors: a word produced by the OCR system that does not correspond to any entry in the lexicon.
“How is your day” → “Huw is your day”
Real-word errors: a word produced by the OCR system that is in the lexicon, but is grammatically incorrect in context.
“how is your day” → “how is you day”
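The distinction above can be sketched with a toy lexicon; the lexicon and sentences below are illustrative only, not from any real corpus:

```python
# Toy lexicon for illustration only.
LEXICON = {"how", "is", "your", "you", "day"}

def find_nonword_errors(tokens, lexicon):
    """Return tokens that occur nowhere in the lexicon (non-word errors)."""
    return [t for t in tokens if t.lower() not in lexicon]

print(find_nonword_errors("Huw is your day".split(), LEXICON))  # ['Huw']
# A real-word error slips through: every token is in the lexicon.
print(find_nonword_errors("how is you day".split(), LEXICON))   # []
```

A plain lexicon lookup can only flag non-word errors; detecting real-word errors needs context, which the later slides address.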
OCR Post-Correction
OCR Errors
Non-word & real-word errors fall into 3 categories:
Deletion: the error occurs when 1 or more characters are removed from within the original word.
e.g. “House” → “Hose”, “Huse”, “Hse”
Insertion: the error occurs when 1 or more extra characters are added to the original word.
e.g. “Science” → “sciience”, “sciencce”; “Kediri” → “Kediriy”; “people” → “|>eople”
Substitution: the error occurs when 1 or more characters are accidentally changed in the original word.
e.g. “computer” → “conputer”, “comquter”; “since” → “Hince”; “Brambanan” → “Brambdnan”
OCR Post-Correction
Methods of OCR Post-Correction
OCR post-correction methods can be broken down into 4 categories:
Manual error correction
Dictionary-based error correction
Context-based error correction
Machine learning
OCR Post-Correction
Methods of OCR Post-Correction
Manual error correction:
Hiring a group of people to edit the OCR output text manually
Requires continuous manual human intervention
Distributed Proofreading, initiated by the Gutenberg Project in 2000
DP is a web-based project designed to facilitate collaborative conversion and proofreading of paper books into e-books
Proofreading is done through several rounds
Advantages:
The easiest way to get a correction
The correction accuracy is relatively high
Disadvantages:
A laborious, costly, and time-consuming activity
It is still considered error-prone
OCR Post-Correction
Dictionary-based Error Correction
It requires a dictionary:
Dictionary → normal dictionary
Dictionary → lexicon derived from corpus
Dictionary → indexing term : dictionary and its posting list
The requirements of a perfect dictionary (Strohmaier et al.):
C_ocr → the corpus resulting from OCR of a corpus C
D → a perfect dictionary for postcorrection
3 principles of a perfect D:
D contains each word of C → W ⊆ D, ∀w ∈ W in C
D contains only words from C → D = C
∀w ∈ W in D, D stores the frequency of w in C
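A minimal sketch of these principles, assuming C is plain lowercase text: the lexicon D is derived from the corpus itself and stores each word's frequency:

```python
from collections import Counter
import re

def build_perfect_dictionary(corpus_text):
    """Derive D from corpus C: every word of C, only words of C,
    and the frequency of each word in C (the 3 principles above)."""
    words = re.findall(r"[a-z]+", corpus_text.lower())
    return Counter(words)

# Illustrative miniature corpus C.
D = build_perfect_dictionary("the history of java the java war")
print(D["java"])        # 2  (frequency of w in C)
print("python" in D)    # False (D contains only words from C)
```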
OCR Post-Correction
Dictionary-based Error Correction
Basic Error Detection and Correction
Given a dictionary D of all words in a language, the spell checker looks up every word of the text in D:
If a matching dictionary entry is found, the word is correct
If no matching entry is found, the word is marked as a possible error
▸ Each word of the dictionary is compared to the misspelled word
▸ Dictionary entries that are similar to this word are selected
▸ The selected dictionary entries are provided as correction suggestions for the misspelled word
To compare words, a word distance measure such as the Levenshtein distance is used
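A minimal sketch of this lookup-and-suggest loop, assuming a tiny illustrative dictionary and unit edit costs (the Levenshtein distance itself is detailed on the following slides):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Dictionary entries within max_dist edits of word, closest first."""
    ranked = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in ranked if d <= max_dist]

# Illustrative dictionary; "huse" is marked as an error, then corrected.
D = ["house", "horse", "mouse", "day", "science"]
print("huse" in D)        # False -> possible error
print(suggest("huse", D)) # ['house', 'horse', 'mouse']
```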
Excursion: Levenshtein Distance
Levenshtein Distance
a.k.a Minimum-Edit-Distance (MED) algorithm
is a common metric to measure the distance between two
words.
Character level edits include:
Substitution of one character to another
Insertion of one character
Deletion of one character
Transposition of 1 character (in an advanced MED algorithm)
Excursion: Levenshtein Distance
Minimum Edit Distance
3 steps of Minimum Edit Distance (MED) (Jurafsky & Martin, 2009):
Initialization
Recurrence relation
Termination
Excursion: Levenshtein Distance
Minimum Edit Distance
Defining the edit costs:
c(a, ε) = 1 → the cost of a deletion
c(ε, a) = 1 → the cost of an insertion
c(a, b) = 1 → the cost of a character substitution
c(a, a) = 0 → the cost of a match (zero substitution)
c(ab, ba) = 1 → the cost of a transposition (cross substitution)
Excursion
Minimum Edit Distance algorithm
function Min-Edit-Distance(target, source) returns min-distance
  n ← len(target)
  m ← len(source)
  create a distance matrix distance[n+1, m+1]
  distance[0, 0] ← 0
  for each row i from 1 to n do
    distance[i, 0] ← distance[i-1, 0] + ins-cost(target[i])
  for each column j from 1 to m do
    distance[0, j] ← distance[0, j-1] + del-cost(source[j])
  for each row i from 1 to n do
    for each column j from 1 to m do
      distance[i, j] ← MIN(distance[i-1, j] + ins-cost(target[i]),
                           distance[i-1, j-1] + subst-cost(source[j], target[i]),
                           distance[i, j-1] + del-cost(source[j]))
      (advanced variant: if target[i-1], target[i] equals source[j], source[j-1],
       also consider distance[i-2, j-2] + trans-cost(source[j], target[i]))
  return distance[n, m]
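The pseudocode above translates to Python roughly as follows; a sketch with unit costs, where the transposition case gives the advanced (Damerau) variant:

```python
def min_edit_distance(target, source, transpositions=False):
    """Fill the (n+1) x (m+1) distance matrix bottom-up; all edits cost 1."""
    n, m = len(target), len(source)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                       # initialization: first column
    for j in range(1, m + 1):
        d[0][j] = j                       # initialization: first row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = 0 if target[i - 1] == source[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # insertion
                          d[i][j - 1] + 1,          # deletion
                          d[i - 1][j - 1] + subst)  # substitution / match
            if (transpositions and i > 1 and j > 1
                    and target[i - 1] == source[j - 2]
                    and target[i - 2] == source[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

print(min_edit_distance("kediiy", "kedu"))                 # 3
print(min_edit_distance("ab", "ba", transpositions=True))  # 1
```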
Excursion: Levenshtein Distance
Minimum Edit Distance
Example:
  source: kedu
  target: kediiy

       i    0   1   2   3   4
  j             k   e   d   u
  0         0   1   2   3   4
  1    k    1   0   1   2   3
  2    e    2   1   0   1   2
  3    d    3   2   1   0   1
  4    i    4   3   2   1   1
  5    i    5   4   3   2   2
  6    y    6   5   4   3   3

The MED is shown in the bottom-right cell of the matrix: distance(kedu, kediiy) = 3.
Excursion: Levenshtein Distance
Now Your Turn!!!
Compute the MED for the following strings:
  S_src = boro bodo
  S_trg = bt^o bodo

       j    0   1   2   3   4   5   6   7   8   9
  i             b   o   r   o       b   o   d   o
  0         0   1   2   3   4   5   6   7   8   9
  1    b    1
  2    t    2
  3    ^    3
  4    o    4
  5         5
  6    b    6
  7    o    7
  8    d    8
  9    o    9
OCR Post-Correction
Drawbacks of Dictionary-based Error Correction
Requires a perfect dictionary → D contains all words of C
A regular dictionary targets only 1 specific language
Conventional dictionaries do not cover proper names such as person names, geographical names, historical sites
A standard dictionary is static
OCR Post-Correction
Context-based Error correction
The context of words is represented using so-called N-gram language models for the different languages.
N-gram models count overlapping sequences of N words in big language corpora to calculate the probability of word contexts.
These probabilities are then used to identify unlikely word sequences in the input documents.
A confusion matrix is often used to map the confusion probability.
The context of a word can also be estimated by using a POS tagger and applying morphological and syntactic rules.
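A toy sketch of this idea with a word-bigram model; the miniature corpus and the 0.1 threshold are invented for illustration:

```python
from collections import Counter

corpus = "how is your day how was your day how is your week".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w_prev, w):
    """P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def flag_unlikely(sentence, threshold=0.1):
    """Flag word pairs whose bigram probability falls below the threshold."""
    words = sentence.split()
    return [(a, b) for a, b in zip(words, words[1:])
            if bigram_prob(a, b) < threshold]

print(flag_unlikely("how is you day"))   # [('is', 'you'), ('you', 'day')]
print(flag_unlikely("how is your day"))  # []
```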
OCR Post-Correction
A Case Study on Bassil and Alwani (2012)
The overall system architecture
OCR Post-Correction
A Case Study on Bassil and Alwani (2012)
OCR Post-Correction
What is Google’s Spelling checker algorithm based on?
Corpus → the whole web
Indexing word n-grams with their collection frequency (cf)
Prediction → n-gram model
The query log → high priority as correction candidates
OCR Post-Correction
What is Google’s Spelling checker algorithm based on?
Indexing character n-grams with their count (C)
Prediction → n-gram model
unigram    count (cf)      bigram       count (cf)
#ex        7891            #ex exa      354
exa        5432            exa xam      354
xam        5432            xam amp      261
amp        4231            amp mpl      201
mpl        4123            mpl ple      198
ple        9782            ple le#      102
le#        2056            exa xan      35
xan        1234            xan anp      21
anp        1965            anp npl      20
npl        1657            npl ple      22
OCR Post-Correction
Revisiting N-gram Model
Character/Word n-grams
P(wn | wn−1) = C(wn−1, wn) / C(wn−1)
P(wn | wn−2, wn−1) = C(wn−2, wn−1, wn) / C(wn−2, wn−1)
P_Laplace(wi) = (ci + 1) / (N + V)
P_Laplace(wn | wn−1) = (C(wn−1, wn) + 1) / (C(wn−1) + V)
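The MLE and Laplace estimates above can be checked on a toy corpus (the tokens, and therefore N and V, are illustrative):

```python
from collections import Counter

tokens = "the history of java the java war".split()
N = len(tokens)                    # total number of tokens
unigram_c = Counter(tokens)
bigram_c = Counter(zip(tokens, tokens[1:]))
V = len(unigram_c)                 # vocabulary size

def p_mle(w_prev, w):
    """P(wn | wn-1) = C(wn-1, wn) / C(wn-1)."""
    return bigram_c[(w_prev, w)] / unigram_c[w_prev]

def p_laplace(w_prev, w):
    """Add-one smoothing: unseen bigrams get a small non-zero probability."""
    return (bigram_c[(w_prev, w)] + 1) / (unigram_c[w_prev] + V)

print(p_mle("the", "java"))     # 0.5
print(p_laplace("the", "war"))  # non-zero although ('the', 'war') is unseen
```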
OCR Post-Correction
Computing the spelling error correction
Given the string “exanple”, the probability P(exan) under the bigram model:
Generate some candidates & compute their probabilities:

P(exan) = (C($ex, exa) / C($ex)) × (C(exa, exan) / C(exa))
        = (354 / 7891) × (0 / 5432) = 0

P(exam) = (C($ex, exa) / C($ex)) × (C(exa, exam) / C(exa))
        = (354 / 7891) × (354 / 5432) ≈ 0.003

Note: $ stands for #
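The same computation can be reproduced in a few lines from the slide's illustrative counts (only the counts actually used are included here):

```python
# Counts taken from the toy trigram table earlier in the slides.
uni = {"#ex": 7891, "exa": 5432}
bi = {("#ex", "exa"): 354, ("exa", "exam"): 354, ("exa", "exan"): 0}

def candidate_prob(chain):
    """Multiply the conditional probabilities of consecutive trigram pairs."""
    p = 1.0
    for a, b in zip(chain, chain[1:]):
        p *= bi.get((a, b), 0) / uni[a]
    return p

print(round(candidate_prob(["#ex", "exa", "exam"]), 3))  # 0.003
print(candidate_prob(["#ex", "exa", "exan"]))            # 0.0
```

The candidate with the highest chain probability ("exam" here) is chosen as the correction.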
OCR Post-Correction
Computing the spelling error correction
Predicting correction on the word level using the context
References
Bassil, Y., & Alwani, M. (2012). OCR Post-processing Error Correction Algorithm Using Google's Online Spelling Suggestion.
Jurafsky, D., & Martin, J.H. (2009). Speech and Language Processing (2nd ed.). Pearson Prentice Hall.
Strohmaier, C.M., Ringlstetter, C., & Schulz, K.U. (n.d.). Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary?