Aditya W. Mahastama, S.Kom, M.Sc.
Dr. phil. Lucia D. Krisnawati
OCR Post-Correction
13 Mei 2024
Outline
Recap
Introduction
OCR Errors
OCR Post-Correction Methods
Dynamic programming
Bigram model
Introduction
(Full Text of “The History of Java”, British National Museum, https://archive.org/stream/historyjava01raffgoog/historyjava01raffgoog_djvu.txt)
Introduction
(Google Books, The History of Java, https://books.google.co.id/books?id=gJEC2q7DzpQC&printsec=frontcover&hl=id&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)
OCR Post-Correction
OCR Errors
2 types of OCR errors:
Non-word errors: a word produced by the OCR system that does not correspond to any entry in the lexicon.
“How is your day” → “Huw is your day”
Real-word errors: a word produced by the OCR system that is in the lexicon, but is grammatically incorrect in context.
“how is your day” → “how is you day”
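The distinction above can be sketched with a toy lexicon; the lexicon and sentences below are illustrative only, not from any real corpus:

```python
# Toy lexicon for illustration only.
LEXICON = {"how", "is", "your", "you", "day"}

def find_nonword_errors(tokens, lexicon):
    """Return tokens that occur nowhere in the lexicon (non-word errors)."""
    return [t for t in tokens if t.lower() not in lexicon]

print(find_nonword_errors("Huw is your day".split(), LEXICON))  # ['Huw']
# A real-word error slips through: every token is in the lexicon.
print(find_nonword_errors("how is you day".split(), LEXICON))   # []
```

A plain lexicon lookup can only flag non-word errors; detecting real-word errors needs context, which the later slides address.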
OCR Post-Correction
OCR Errors
Non-word & real-word errors fall into 3 categories:
Deletion: the error occurs when 1 or more characters are removed from within the original word.
e.g. “House” → “Hose”, “Huse”, “Hse”
Insertion: the error occurs when 1 or more extra characters are added to the original word.
e.g. “Science” → “sciience”, “sciencce”; “Kediri” → “Kediriy”; “people” → “|>eople”
Substitution: the error occurs when 1 or more characters are accidentally changed in the original word.
e.g. “computer” → “conputer”, “comquter”; “since” → “Hince”; “Brambanan” → “Brambdnan”
OCR Post-Correction
Methods of OCR Post-Correction
OCR post-correction methods can be broken down into 4 categories:
Manual error correction
Dictionary-based error correction
Context-based error correction
Machine learning
OCR Post-Correction
Methods of OCR Post-Correction
Manual error correction:
Hiring a group of people to edit the OCR output text manually
Requires continuous manual human intervention
Distributed Proofreading, initiated by the Gutenberg Project in 2000
DP is a web-based project designed to facilitate collaborative conversion and proofreading of paper books into e-books
Proofreading is done through several rounds
Advantages:
The easiest way to get a correction
The correction accuracy is relatively high
Disadvantages:
A laborious, costly, and time-consuming activity
It is still considered error-prone
OCR Post-Correction
Dictionary-based Error Correction
It requires a dictionary:
Dictionary → normal dictionary
Dictionary → lexicon derived from corpus
Dictionary → indexing term : dictionary and its posting list
The requirements of a perfect dictionary (Strohmaier et al.):
C_ocr → the corpus resulting from OCR of a corpus C
D → a perfect dictionary for postcorrection
3 principles of a perfect D:
D contains each word of C → W ⊆ D, ∀w ∈ W in C
D contains only words from C → D = C
∀w ∈ W in D, D stores the frequency of w in C
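A minimal sketch of these principles, assuming C is plain lowercase text: the lexicon D is derived from the corpus itself and stores each word's frequency:

```python
from collections import Counter
import re

def build_perfect_dictionary(corpus_text):
    """Derive D from corpus C: every word of C, only words of C,
    and the frequency of each word in C (the 3 principles above)."""
    words = re.findall(r"[a-z]+", corpus_text.lower())
    return Counter(words)

# Illustrative miniature corpus C.
D = build_perfect_dictionary("the history of java the java war")
print(D["java"])        # 2  (frequency of w in C)
print("python" in D)    # False (D contains only words from C)
```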
OCR Post-Correction
Dictionary-based Error Correction
Basic Error Detection and Correction
Given a dictionary D of all words in a language, the spell checker looks up every word of the text in D:
If a matching dictionary entry is found, the word is correct
If no matching entry is found, the word is marked as a possible error
▸ Each word of the dictionary is compared to the misspelled word
▸ Dictionary entries that are similar to this word are selected
▸ The selected dictionary entries are provided as correction suggestions for the misspelled word
To compare words, a word distance measure such as the Levenshtein distance is used
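A minimal sketch of this lookup-and-suggest loop, assuming a tiny illustrative dictionary and unit edit costs (the Levenshtein distance itself is detailed on the following slides):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Dictionary entries within max_dist edits of word, closest first."""
    ranked = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in ranked if d <= max_dist]

# Illustrative dictionary; "huse" is marked as an error, then corrected.
D = ["house", "horse", "mouse", "day", "science"]
print("huse" in D)        # False -> possible error
print(suggest("huse", D)) # ['house', 'horse', 'mouse']
```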
Excursion: Levenshtein Distance
Levenshtein Distance
a.k.a Minimum-Edit-Distance (MED) algorithm
is a common metric to measure the distance between two
words.
Character level edits include:
Substitution of one character to another
Insertion of one character
Deletion of one character
Transposition of 1 character (in an advanced MED algorithm)
Excursion: Levenshtein Distance
Minimum Edit Distance
3 steps of Minimum Edit Distance (MED) (Jurafsky & Martin, 2009):
Initialization
Recurrence relation
Termination
Excursion: Levenshtein Distance
Minimum Edit Distance
Defining the edit costs:
c(a, ε) = 1 → the cost of a deletion
c(ε, a) = 1 → the cost of an insertion
c(a, b) = 1 → the cost of a character substitution
c(a, a) = 0 → the cost of a match (zero substitution)
c(ab, ba) = 1 → the cost of a transposition (cross substitution)
Excursion
Minimum Edit Distance algorithm
function Min-Edit-Distance(target, source) returns min-distance
  n ← len(target)
  m ← len(source)
  create a distance matrix distance[n+1, m+1]
  distance[0, 0] ← 0
  for each row i from 1 to n do
    distance[i, 0] ← distance[i-1, 0] + ins-cost(target[i])
  for each column j from 1 to m do
    distance[0, j] ← distance[0, j-1] + del-cost(source[j])
  for each row i from 1 to n do
    for each column j from 1 to m do
      distance[i, j] ← MIN(distance[i-1, j] + ins-cost(target[i]),
                           distance[i-1, j-1] + subst-cost(source[j], target[i]),
                           distance[i, j-1] + del-cost(source[j]))
      (advanced variant: if target[i-1], target[i] equals source[j], source[j-1],
       also consider distance[i-2, j-2] + trans-cost(source[j], target[i]))
  return distance[n, m]
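The pseudocode above translates to Python roughly as follows; a sketch with unit costs, where the transposition case gives the advanced (Damerau) variant:

```python
def min_edit_distance(target, source, transpositions=False):
    """Fill the (n+1) x (m+1) distance matrix bottom-up; all edits cost 1."""
    n, m = len(target), len(source)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                       # initialization: first column
    for j in range(1, m + 1):
        d[0][j] = j                       # initialization: first row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = 0 if target[i - 1] == source[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # insertion
                          d[i][j - 1] + 1,          # deletion
                          d[i - 1][j - 1] + subst)  # substitution / match
            if (transpositions and i > 1 and j > 1
                    and target[i - 1] == source[j - 2]
                    and target[i - 2] == source[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

print(min_edit_distance("kediiy", "kedu"))                 # 3
print(min_edit_distance("ab", "ba", transpositions=True))  # 1
```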
Excursion: Levenshtein Distance
Minimum Edit Distance
Example:
  source: kedu
  target: kediiy

       i    0   1   2   3   4
  j             k   e   d   u
  0         0   1   2   3   4
  1    k    1   0   1   2   3
  2    e    2   1   0   1   2
  3    d    3   2   1   0   1
  4    i    4   3   2   1   1
  5    i    5   4   3   2   2
  6    y    6   5   4   3   3

The MED is shown in the bottom-right cell of the matrix: distance(kedu, kediiy) = 3.
Excursion: Levenshtein Distance
Now Your Turn!!!
Compute the MED for the following strings:
  S_src = boro bodo
  S_trg = bt^o bodo

       j    0   1   2   3   4   5   6   7   8   9
  i             b   o   r   o       b   o   d   o
  0         0   1   2   3   4   5   6   7   8   9
  1    b    1
  2    t    2
  3    ^    3
  4    o    4
  5         5
  6    b    6
  7    o    7
  8    d    8
  9    o    9
OCR Post-Correction
Drawbacks of Dictionary-based Error Correction
Requires a perfect dictionary → D contains all words of C
A regular dictionary targets only 1 specific language
Conventional dictionaries do not cover proper names such as person names, geographical names, historical sites
A standard dictionary is static
OCR Post-Correction
Context-based Error correction
The context of words is represented using so-called N-gram language models for the different languages.
N-gram models count overlapping sequences of N words in big language corpora to calculate the probability of word contexts.
These probabilities are then used to identify unlikely word sequences in the input documents.
A confusion matrix is often used to map the confusion probability.
The context of a word can also be estimated by using a POS tagger and applying morphological and syntactic rules.
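A toy sketch of this idea with a word-bigram model; the miniature corpus and the 0.1 threshold are invented for illustration:

```python
from collections import Counter

corpus = "how is your day how was your day how is your week".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w_prev, w):
    """P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def flag_unlikely(sentence, threshold=0.1):
    """Flag word pairs whose bigram probability falls below the threshold."""
    words = sentence.split()
    return [(a, b) for a, b in zip(words, words[1:])
            if bigram_prob(a, b) < threshold]

print(flag_unlikely("how is you day"))   # [('is', 'you'), ('you', 'day')]
print(flag_unlikely("how is your day"))  # []
```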
OCR Post-Correction
A Case Study on Bassil and Alwani (2012)
The overall system architecture
OCR Post-Correction
A Case Study on Bassil and Alwani (2012)
OCR Post-Correction
What is Google’s Spelling checker algorithm based on?
Corpus → the whole web
Indexing word n-grams with their collection frequency (cf)
Prediction → n-gram model
The query log → high priority as correction candidates
OCR Post-Correction
What is Google’s Spelling checker algorithm based on?
Indexing character n-grams with their count (C)
Prediction → n-gram model
unigram    count (cf)      bigram       count (cf)
#ex        7891            #ex exa      354
exa        5432            exa xam      354
xam        5432            xam amp      261
amp        4231            amp mpl      201
mpl        4123            mpl ple      198
ple        9782            ple le#      102
le#        2056            exa xan      35
xan        1234            xan anp      21
anp        1965            anp npl      20
npl        1657            npl ple      22
OCR Post-Correction
Revisiting N-gram Model
Character/Word n-grams
P(wn | wn−1) = C(wn−1, wn) / C(wn−1)
P(wn | wn−2, wn−1) = C(wn−2, wn−1, wn) / C(wn−2, wn−1)
P_Laplace(wi) = (ci + 1) / (N + V)
P_Laplace(wn | wn−1) = (C(wn−1, wn) + 1) / (C(wn−1) + V)
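The MLE and Laplace estimates above can be checked on a toy corpus (the tokens, and therefore N and V, are illustrative):

```python
from collections import Counter

tokens = "the history of java the java war".split()
N = len(tokens)                    # total number of tokens
unigram_c = Counter(tokens)
bigram_c = Counter(zip(tokens, tokens[1:]))
V = len(unigram_c)                 # vocabulary size

def p_mle(w_prev, w):
    """P(wn | wn-1) = C(wn-1, wn) / C(wn-1)."""
    return bigram_c[(w_prev, w)] / unigram_c[w_prev]

def p_laplace(w_prev, w):
    """Add-one smoothing: unseen bigrams get a small non-zero probability."""
    return (bigram_c[(w_prev, w)] + 1) / (unigram_c[w_prev] + V)

print(p_mle("the", "java"))     # 0.5
print(p_laplace("the", "war"))  # non-zero although ('the', 'war') is unseen
```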
OCR Post-Correction
Computing the spelling error correction
Given the string “exanple”, the probability P(exan) under the bigram model:
Generate some candidates & compute their probabilities:

P(exan) = (C($ex, exa) / C($ex)) × (C(exa, exan) / C(exa))
        = (354 / 7891) × (0 / 5432) = 0

P(exam) = (C($ex, exa) / C($ex)) × (C(exa, exam) / C(exa))
        = (354 / 7891) × (354 / 5432) ≈ 0.003

Note: $ stands for #
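The same computation can be reproduced in a few lines from the slide's illustrative counts (only the counts actually used are included here):

```python
# Counts taken from the toy trigram table earlier in the slides.
uni = {"#ex": 7891, "exa": 5432}
bi = {("#ex", "exa"): 354, ("exa", "exam"): 354, ("exa", "exan"): 0}

def candidate_prob(chain):
    """Multiply the conditional probabilities of consecutive trigram pairs."""
    p = 1.0
    for a, b in zip(chain, chain[1:]):
        p *= bi.get((a, b), 0) / uni[a]
    return p

print(round(candidate_prob(["#ex", "exa", "exam"]), 3))  # 0.003
print(candidate_prob(["#ex", "exa", "exan"]))            # 0.0
```

The candidate with the highest chain probability ("exam" here) is chosen as the correction.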
OCR Post-Correction
Computing the spelling error correction
Predicting correction on the word level using the context
References
Bassil, Y., & Alwani, M. (2012). OCR Post-processing Error Correction Algorithm Using Google's Online Spelling Suggestion.
Jurafsky, D., & Martin, J.H. (2009). Speech and Language Processing (2nd ed.). Pearson Prentice Hall.
Strohmaier, C.M., Ringlstetter, C., & Schulz, K.U. (n.d.). Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary?