1992, Proceedings of the IEEE
It is time for a major change of approach to character recognition research. The traditional approach, focusing on the correct classification of isolated characters, has been exhausted. The demonstration of the superiority of a new classification method under operational conditions requires large experimental facilities and databases beyond the resources of most researchers. In any case, even perfect classification of individual characters is insufficient for the conversion of complex archival documents to a useful computer-readable form. Many practical OCR tasks require integrated treatment of entire documents and well-organized typographic and domain-specific knowledge. New OCR systems should take advantage of the typographic uniformity of paragraphs or other layout components. They should also exploit the unavoidable interaction with human operators to improve themselves without explicit "training."
For four years, ISRI has conducted an annual test of optical character recognition (OCR) systems known as "page readers." These systems accept as input a bitmapped image of any document page, and attempt to identify the machine-printed characters on the page. In the annual test, we measure the accuracy of this process by comparing the text that is produced as output with the correct text. The goals of the test include:
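The core of such a test is comparing OCR output against the correct text. A minimal sketch of that comparison, assuming a Levenshtein-style character metric (the actual metrics in ISRI's test reports are defined by their own tools), might look like this:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    # Accuracy = (#characters - #errors) / #characters, in the style of
    # ISRI-type accuracy reports (a simplification, not their exact formula).
    n = len(ground_truth)
    return (n - edit_distance(ground_truth, ocr_output)) / n
```

For example, an output differing from an 11-character ground truth by one substitution scores 10/11 ≈ 0.91 character accuracy.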
Vol. 19 No. 2 FEBRUARY 2021 International Journal of Computer Science and Information Security (IJCSIS), 2021
This paper provides a comprehensive overview of OCR. Optical character recognition is the ability of a computer to collect and decipher handwritten input from documents, photos or other sources. Over the years, many researchers have paid attention to this topic and proposed many methods for solving it. This paper provides a historical view and a summary of the research done in this field.
DigItalia, 2020
Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. The OCR-D project, consisting of book historians and computer scientists, aims to address this deficiency by focussing on three major issues. Our first target was to create a tool that identifies font groups automatically in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained with 35,000 images and reaches an accuracy level of 98%. It can differentiate not only between the above-mentioned font groups but also Hebrew, Greek, Antiqua and Italic. It can also identify woodcut images and irrelevant data (book covers, empty pages, etc.). In a second step, we created an online training infrastructure (okralact), which allows for the use of various open-source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. At the same time, it facilitates training of specific models for font groups. The high accuracy of the recognition tool paves the way for the unprecedented opportunity to differentiate between the fonts used by individual printers. With more training data and further adjustments, the tool could help to fill a major gap in historical research.
2003
Tools for Optical Character Recognition (OCR), commercially available today, provide different recognition rates depending on a number of factors. We analyse here the features of six of the most widely used "off-the-shelf" OCR software packages.
2005
We announce the availability of the UNLV/ISRI Analytic Tools for OCR Evaluation together with a large and diverse collection of scanned document images with the associated ground-truth text. This combination of tools and test data will allow anyone to conduct a meaningful test comparing the performance of competing page-reading algorithms. The value of this collection of software tools and test data is enhanced by knowledge of the past performance of several systems using exactly these tools and this data. These performance comparisons were published in previous ISRI Test Reports and are also provided. Another value is that the tools can be used to test the character accuracy of any page-reading OCR system for any language included in the Unicode standard. The paper concludes with a summary of the programs, test data, and documentation that is available and gives the URL where they can be located.
Lecture Notes in Computer Science, 2015
Optical Character Recognition (OCR) is a very extensive branch of pattern recognition. The existence of super effective software designed for omnifont text recognition, capable of handling multiple languages, creates an impression that all problems in this field have already been solved. Indeed, focus of research in the OCR domain has constantly been shifting from offline, typewritten, Latin character recognition towards Asiatic alphabets, handwritten scripts and online process. Still, however, it is difficult to come across an elaboration which would not only cover the topic of numerous feature extraction methods for printed, Latin derived, isolated characters conceptually, but which would also attempt to implement, compare and optimize them in an experimental way. This paper aims at closing this gap by thoroughly examining the performance of several statistical methods with respect to their recognition rate and time efficiency.
International Journal of Machine Learning and Computing, 2012
Optical Character Recognition, or OCR, is the electronic translation of handwritten, typewritten or printed text images into machine-encoded text. It is widely used to recognize and search text from electronic documents or to publish text on a website. The paper presents a survey of applications of OCR in different fields and further presents experimentation for three important applications: Captcha, institutional repositories and optical music character recognition. We make use of an enhanced image segmentation algorithm based on histogram equalization using genetic algorithms for optical character recognition. The paper will serve as a good literature survey for researchers starting to work in the field of optical character recognition.
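The genetic-algorithm optimization is the paper's contribution and is not reproduced here, but the underlying histogram-equalization step it builds on is standard and can be sketched in pure Python (assuming 8-bit grayscale input as a 2-D list of intensities):

```python
def equalize_histogram(gray):
    # gray: 2-D list of 8-bit intensities (0-255). Map each value through
    # the normalized cumulative histogram so the output spans the full
    # 0-255 range, improving contrast before segmentation.
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    cdf, running = [0] * 256, 0
    for i, h in enumerate(hist):
        running += h
        cdf[i] = running
    cdf_min = next(c for c in cdf if c > 0)   # first non-zero CDF value
    lut = [round((c - cdf_min) / (total - cdf_min) * 255) for c in cdf]
    return [[lut[v] for v in row] for row in gray]
```

A low-contrast patch such as `[[10, 20], [20, 30]]` is stretched to `[[0, 170], [170, 255]]`, which is the kind of contrast gain that helps a subsequent thresholding or segmentation pass.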
International Journal of Advanced Trends in Computer Science and Engineering, 2020
The technology associated with character recognition has emerged as a vital technology in the era of the Fourth Industrial Revolution. Character recognition is developing as a core technology needed in various fields. It is performed by extracting characters from an image and recognizing the extracted characters. Character recognition technology has been continuously developed and, with the advent of the Fourth Industrial Revolution, is now used as a core technology in many places. This paper introduces the technology associated with character recognition and a program for character recognition.
2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017
Intensive research is currently under way in the field of digitizing historical archives by converting scanned page images into searchable full text. anyOCR is a new OCR system which emphasizes the techniques required for digitizing a historical archive with high accuracy. It is open source, so the research community can easily apply the anyOCR system to the digitization of a historical archive. The anyOCR system can also be used for contemporary document images containing diverse, simple to complex, layouts. This paper describes the current state of the anyOCR system, its architecture, and its major features. The anyOCR system supports a complete document processing pipeline, including layout analysis, OCR model training and text line prediction, along with fast and interactive web-based services for layout and OCR error correction.
2005
This work proposes a new taxonomy and metric for classifying and counting the number of errors in the transcription performed by Optical Character Recognizers (OCRs). It also presents a comparative study on the performance of commercial OCR tools.
Optical Character Recognition (OCR) is one of the automatic identification techniques that fulfil automation needs in various applications. With OCR, a machine can read the information present in natural scenes or other materials in any form. Recognition of typed and printed characters is relatively uncomplicated because of their well-defined size and shape, whereas individual handwriting varies in these aspects, so a handwritten OCR system must learn these differences in order to recognize a character. In this paper, we discuss the various stages of text recognition, the classification of handwritten OCR systems according to text type, studies on Chinese and Arabic text recognition, and application-oriented recent research in OCR.
2008 The Eighth IAPR International Workshop on Document Analysis Systems, 2008
In this paper a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any knowledge of the font is presented. The methodology consists of three steps: the first two refer to creating a database for training using a set of documents, while the third refers to recognition of new document images. First, a pre-processing step that includes image binarization and enhancement takes place. Second, a top-down segmentation approach is used to detect text lines, words and characters. A clustering scheme is then adopted to group characters of similar shape. This is a semi-automatic procedure, since the user can interact at any time to correct possible clustering errors and assign an ASCII label. After this step, a database is created for use in recognition. Finally, in the third step, the same segmentation approach is applied to every new document image, and recognition is based on the character database produced in the previous step.
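The grouping idea behind such a clustering scheme can be illustrated with a toy sketch under strong assumptions (binary glyphs normalized to equal size, a Hamming-style pixel distance, greedy assignment); the paper's actual clustering is more elaborate:

```python
def cluster_glyphs(glyphs, threshold):
    # glyphs: list of equal-length binary tuples (flattened character images).
    # Greedy agglomeration: assign each glyph to the first cluster whose
    # representative differs in at most `threshold` pixels, else start a
    # new cluster. A user would then label each cluster with an ASCII code.
    clusters = []  # each entry: (representative glyph, [member indices])
    for idx, g in enumerate(glyphs):
        for rep, members in clusters:
            if sum(a != b for a, b in zip(rep, g)) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append((g, [idx]))
    return clusters
```

With a threshold of one pixel, two near-identical glyphs fall into one cluster and a dissimilar glyph starts its own, so a single manual label covers every instance of the same character shape.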
Proceedings IS&T/SPIE 20th Annual Symposium, 2008
2015
Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. This paper presents an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-based classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating information local to each BB: its spatial location, shape and size. When evaluated on a dataset containing over 72,000 manually-labeled BBs from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall.
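The two-stage structure described above — rule-based initial labels, then iterative refinement from nearby boxes — can be sketched as follows. The thresholds and the single line-alignment rule are hypothetical placeholders; the paper's rule base and features are richer:

```python
def classify_boxes(boxes, rounds=5):
    # boxes: list of dicts with keys x, y, w, h (pixel units).
    # Stage 1: rule-based initial labels from size alone
    # (hypothetical thresholds, standing in for the paper's rule base).
    labels = ["text" if 2 <= b["h"] <= 100 and b["w"] >= 2 else "noise"
              for b in boxes]
    # Stage 2: iterative refinement. Neighbours are boxes whose vertical
    # centres lie within one box-height (a crude "same text line" test);
    # a majority vote among neighbours can flip a label.
    for _ in range(rounds):
        changed = False
        for i, b in enumerate(boxes):
            cy = b["y"] + b["h"] / 2
            neigh = [labels[j] for j, o in enumerate(boxes)
                     if j != i and abs((o["y"] + o["h"] / 2) - cy) < b["h"]]
            if neigh:
                vote = "text" if neigh.count("text") > len(neigh) / 2 else "noise"
                if vote != labels[i]:
                    labels[i], changed = vote, True
        if not changed:   # stop early once labels are stable
            break
    return labels
```

Three word-sized boxes sharing a baseline keep their "text" labels, while a one-pixel-tall speck near them stays "noise" because it has no aligned neighbours at its own scale.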
Abstract—This paper presents recent trends and tools used for feature extraction that help in efficient classification of handwritten alphabets. Numerous feature extraction models have been defined by different researchers in their respective dissertations. It is found that the use of the Euler number in addition to zoning increases the speed and the accuracy of the classifier, as it reduces the search space by dividing the character set into three groups.
ACM Computing Surveys, 2021
Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions of this field.
2021
Optical Character Recognition (OCR) is the process of converting images of text or handwritten text into a machine-understandable form. Simply put, OCR means recognizing characters and converting them into a computer-readable form. It is widely used as a form of data entry from original paper data sources such as banking papers or consultation papers, whether passport documents, invoices, statements, receipts, cards, mail or any number of printed records. It is a standard method of digitizing printed texts so that they can be electronically edited, searched, and stored more compactly. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. It is the electronic translation of handwritten, typewritten or printed text into machine-encoded form, and is widely used to recognize and search text from documents or to publish text on a website. This document presents a review of Optical Character Recognition methods su...