0% found this document useful (0 votes)
33 views29 pages

OCR Post-Correction Techniques Explained

Uploaded by

Kevin Husen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views29 pages

OCR Post-Correction Techniques Explained

Uploaded by

Kevin Husen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Aditya W. Mahastama, S.Kom, M.Sc.

Dr. phil. Lucia D. Krisnawati

OCR Post-Correction

13 Mei 2024
Outline

Outline
Recap
Introduction
OCR Errors
OCR Post-Correction Methods
Dynamic programming
Bigram model

2
Introduction

(Full Text of “The History of Java”, British National Museum, https://archive.org/


stream/historyjava01raffgoog/historyjava01raffgoog_djvu.txt)
3
Introduction

(Google Book, The History of Java, https://books.google.co.id/books? 4


id=gJEC2q7DzpQC&printsec=frontcover&hl=id&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)
Introduction

5
OCR Post-Correction

OCR Errors
2 types of OCR errors:
Non-word errors: A word that is recognized by OCR system,
but it does not correspond to any entry in the lexicon.
“How is your day”
“Huw is your day”
Real-word errors : a word that is recognized by OCR & it is in
the lexicon, but it is grammatically incorrect
“how is your day”
“how is you day”

6
OCR Post-Correction

OCR Errors
Non-word & real-word errors fall in 3 categories:
Deletion: error occurs when 1 or more characters are discarded or
removed from within the original word.
e.g “House” → “Hose”, “Huse”, “Hse”,
“was” → “was ;
Insertion: error occurs when 1 or more extra characters are added
to the original word.
eg. “Science” → “sciience” , “sciencce”
“Kediri” → “Kediriy” ; “people” → “|>eople”
Substitution: errors occurs when 1 or more characters are
accidentally changed in the original word
eg. “computer” → “conputer”, “comquter”
“since” → “Hince” ;
“Brambanan” → “Brambdnan”

7
OCR Post-Correction

Methods of OCR Post-Correction


OCR post-correction methods could be broken down
into 3 categories:
Manual error correction
Dictionary-based error correction
Context-based error correction
Machine learning

8
OCR Post-Correction

Methods of OCR Post-Correction


Manual error correction:
Hiring a group of people to edit the OCR output text manually
Requires a continues manual human intervention
Distributed Proofreading initiated by Gutenberg Project in 2000
DP is a web-based project designed to facilitate collaborative
conversion and proofreading of paper books into e-book
Proofreading is done through several rounds
Advantages:
The easiest way to get a correction
The correction accuracy is relatively high
Disadvantages:
A laborious, costly, and time-consuming activity
It is still considered error-prone
9
OCR Post-Correction

10
11
OCR Post-Correction

Dictionary-based Error Correction


It requires a dictionary:
Dictionary → normal dictionary
Dictionary → lexicon derived from corpus
Dictionary → indexing term : dictionary and its posting list
The requirement of the perfect dictionary (Strohmaier et al):
Cocr → a corpus from OCR analysis of a corpus C
D → a perfect dictionary for postcorrection
3 principles of a perfect D:
D contains each word of C → (W Í D | "w ÎW in C )
D contains only words from C → D = C
"w ÎW in D, D stores the frequency of w in C

12
OCR Post-Correction

Dictionary-based Error Correction


Basic Error Detection and Correction
Given a dictionary of all words in a language the spell checker
searches every word in D
If an according dictionary entry is found, the word is correct
If an according dictionary entry is not found, the word is marked
as a possible error
▸ Each word of the dictionary is compared to the misspelled

word
▸ Dictionary entries that are similar to this word are selected

▸ The selected dictionary entries are provided as correction

suggestions for the misspelled word.


To compare different kinds of words, the word Distance
Measure, like Levenshtein Distance is used

13
Excursion: Levenshtein
Distance

Levenshtein Distance
a.k.a Minimum-Edit-Distance (MED) algorithm
is a common metric to measure the distance between two
words.
Character level edits include:
Substitution of one character to another
Insertion of one character
Deletion of one character
Transposition of 1 character (in an advanced MED algorithm)

14
Excursion: Levenshtein
Distance Excursion

Minimum Edit Distance


3 steps of Minimum Edit Distance (MED)

(Jurafsky & Martin, 2009)


Excursion: Levenshtein
Distance

Minimum Edit Distance


Defining the edit cost:
c(a, ε) = 1 a cost for deletion operation
c(ε, a) = 1 a cost for insertion operation
c(a, b) = 1 a cost for character substitution
c(a, a) = 0 a cost for zero substitution or a match
c(ab, ba) = 1 a cost for cross substitution

16
Excursion

Minimum Edit Distance algorithm


Function Min-Edit-Distance(target, source) returns min-
distance
n ← len(trg)
m ← len(src)
create a distance matrix distance[n+1, m+1]
distance[0,0] ← 0
for each column i from 0 to n do
for each row j from 0 to m do
distance[i,j] ← MIN(distance[i-1,j] + ins-cost(trg i),
distance[i-1, j-1] + subst-cost[src j, trgi],
distance[i, j-1] + del-cost(srcj)
distance[i-2, j-2] + trans-cost[srcj, trgi])
Excursion: Levenshtein
Distance

Minimum Edit Distance


i 0 1 2 3 4
algorithm
j k e d u
Example:
source: kedu, 0 0 1 2 3 4

Target: kediiy 1 k 1 0 1 2 3

The MED distance is 2 e 2 1 0 1 2


shown on the bottom
3 d 3 2 1 0 2
right colomn of the
matrix 4 i 4 3 2 1 1

5 i 5 4 3 2 2

6 y 6 5 4 3 3

18
Excursion: Levenshtein
Distance

Now Your j 0 1 2 3 4 5 6 7 8 9
Turn !!! i b o r o b o d o

Compute the 0 0 1 2 3 4 5 6 7 8 9
MED for the 1 b 1
following
2 t 2
strings:
3 ^ 3
Ssrc = boro bodo
Strg = bt^o bodo 4 0 4

5 5

6 b 6
7 o 7
8 d 8
9 o 9

19
OCR Post-Correction

Drawbacks of Dictionary-based Error Correction


Requires a perfect dictionary → D contains all words
A regular dictionary targets only 1 specific language
Conventional dictionaries do not support: proper names
such as person names, geographical names, historical sites
Standard dictionary is static

20
OCR Post-Correction

Context-based Error correction


The context of words is represented, using so-called N-
Gram language models for the different languages.
N-Gram models count overlapping sequences of N words in
big language corpora to calculate the probability of word
contexts.
These probabilities are then used to identify unlikely word
sequences in the input documents.
The confusion matrix is often used to map the confusion
probability.
The context of word could be guessed by using POS Tagger
and applying morphological and syntactic rules

21
OCR Post-Correction

A Study Case on Bassil and Alwani (2012)


The overall system architecture

22
OCR Post-Correction

A Study Case on Bassil and Alwani (2012)

23
OCR Post-Correction

What is Google’s Spelling checker algorithm based on?


Corpus → the whole web
Indexing word n-grams with their collection frequency (cf)
Prediction → n-gram model
The query log → high priority as correction candidates

24
OCR Post-Correction

What is Google’s Spelling checker algorithm based on?


Indexing character n-grams with their count (C)
Prediction → n-gram model
unigram Count (cf) bigram Count (cf)
#ex 7891 #ex exa 354
exa 5432 exa xam 354
xam 5432 xam amp 261
amp 4231 amp mpl 201
mpl 4123 mpl ple 198
ple 9782 ple le# 102
le# 2056 exa xan 35
xan 1234 xan anp 21
anp 1965 anp npl 20
npl 1657 npl ple 22
25
OCR Post-Correction

Revisiting N-gram Model


Character/Word n-grams

C(w n−1 ,w n ) C(w n−2 ,w n−1 ,w n )


P(w n∣w n−1 )= P(w n∣w n−2 ,w n−1 )=
C(w n−1 ) C(w n−2 ,w n−1 )

ci +1 C (w n−1 , w n )+1
Plaplace (w i )= P Laplace (w n∣w n−1 )=
N +V C (w n−1 )+V

26
OCR Post-Correction

Computing the spelling error correction


Given a string “exanple”, the P(exan} in bigrams:
Generate some candidates & compute their probabillities

C ($ ex , exa) C (exa , exan) 354 0


P(exan)=C ($ ex ) x ( ) x( )= X =0
C ($ ex) C (exa) 7891 5432

C ($ ex , exa) C (exa , exam) 354 354


P(exam)=C ($ ex) x ( )x( )= X =0.003
C ($ ex) C (exa) 7891 5432
Note: $ stands for #
27
OCR Post-Correction

Computing the spelling error correction


Predicting correction on the world level using the context

28
References

Bassil, Y., & Alwani, M. (2012). OCR Post-processing Error


Correction Algorithm Using Google's Online Spelling Suggestion
Strohmaier, CM. Ringlstaetter, C., & Schulz, K.U. (n.d). Lexical
Postcorrection of OCR-Results: The Web as a Dynamic
Secondary Dictionary?

29

You might also like