2004
No existing document image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to industrial and government agency users. Ideally, users should be able to search, access, examine, and navigate among document images as effectively as they can among encoded data files, using familiar interfaces and tools as fully as possible. We are investigating novel algorithms and software tools at the frontiers of document image analysis, information retrieval, text mining, and visualization that will assist in the full integration of such documents into collections of textual document images as well as "born digital" documents. Our approaches emphasize versatility first: that is, methods which work reliably across the broadest possible range of documents.
International Journal on Document Analysis and Recognition (IJDAR), 2009
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.
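As an illustration of the classification stage described above, here is a minimal sketch, in Python, of a supervised figure classifier over the five classes named in the abstract. The features, file handling, and model choice are our own illustrative assumptions, not the paper's actual algorithm:

```python
# A minimal sketch (not the authors' implementation) of the supervised
# figure-classification step: crude global image features fed to an SVM
# that labels each figure with one of the five classes from the abstract.
import numpy as np
from PIL import Image
from sklearn.svm import SVC

CLASSES = ["photograph", "2d_plot", "3d_plot", "diagram", "other"]

def figure_features(path: str) -> np.ndarray:
    """Grayscale intensity histogram plus edge density as a feature vector."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    hist, _ = np.histogram(img, bins=32, range=(0.0, 1.0), density=True)
    # Edge density: fraction of pixels with a strong horizontal gradient.
    edges = np.abs(np.diff(img, axis=1)) > 0.25
    return np.concatenate([hist, [edges.mean()]])

def train_classifier(paths, labels):
    X = np.stack([figure_features(p) for p in paths])
    y = np.array([CLASSES.index(l) for l in labels])
    return SVC(kernel="rbf", gamma="scale").fit(X, y)

# Hypothetical usage, assuming labeled training figures exist on disk:
# clf = train_classifier(train_paths, train_labels)
# print(CLASSES[clf.predict([figure_features("fig3.png")])[0]])
```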
The 15th International Scientific Conference eLearning and Software for Education, 2019
We live in the century of technology, in which the rapid growth of data and science has spurred strong interest in processing, transmitting, and storing information. Where once only a human mind could extract meaningful information from image data, decades of dedicated research have produced complex systems that can identify regions, tables, and text in scanned documents, so that the extracted information can be easily accessed and shared. Books, newspapers, maps, letters, drawings: all types of documents can be scanned and processed to become available in digital format. Digital storage is tiny compared with physical documents, so such applications can replace millions of old paper volumes with a single disk, accessible simultaneously to anyone with an Internet connection and without risk of deterioration. Document image analysis systems can also address other problems, such as ecological issues and accessibility and flexibility constraints. This article presents the methods and techniques used to process paper documents and convert them to electronic form, starting at the pixel level and working up to the level of the entire document. The main purpose of Document Image Analysis Systems is to recognize text and graphics in images and to extract, format, and present the contained information according to readers' needs. We also aim to provide solid ground for practitioners implementing such systems to improve their unsupervised processing, making physical documents easily available to the masses.
1995
The conversion of documents into electronic form has proved more difficult than anticipated. Document image analysis still accounts for only a small fraction of the rapidly-expanding document imaging market. Nevertheless, the optimism manifested over the last thirty years has not dissipated. Driven partly by document distribution on CD-ROM and via the World Wide Web, there is more interest in the preservation of layout and format attributes to increase legibility (sometimes called "page reconstruction") rather than just text/non-text separation. The realization that accurate document image analysis requires fairly specific pre-stored information has resulted in the investigation of new data structures for knowledge bases and for the representation of the results of partial analysis. At the same time, the requirements of downstream software, such as word processing, information retrieval and computer-aided design applications, favor turning the results of the analysis and recognition into some standard format like SGML or DXF. There is increased emphasis on large-scale, automated comparative evaluation, using laboriously compiled test databases. The cost of generating these databases has stimulated new research on synthetic noise models. According to recent publications, the accurate conversion of business letters, technical reports, large typeset repositories like patents, postal addresses, specialized line drawings, and office forms containing a mix of handprinted, handwritten and printed material, is finally on the verge of success.
Proceedings of the 2007 international workshop on Semantically aware document processing and indexing - SADPI '07, 2007
A huge number of documents that were once available only in libraries are now on the web. Web access is a solution for protecting cultural heritage and facilitating the transmission of knowledge. Most of these documents are displayed as images of the original paper pages and are indexed by hand. In this paper, we present how and why Document Image Analysis contributes to building the digital libraries of the future. Readers expect human-centred interactive reading stations, which imply the production of hyperdocuments that fit the reader's intentions and needs. Image analysis allows the meaningful document components and relationships to be extracted and categorized; it also provides reader-adapted visualisations of the original images. Document Image Analysis is an essential prerequisite for enriching hyperdocuments that support content-based reader activities such as information seeking and navigation. This paper focuses on the function of the original image: a reference for the reader, and the input data that is processed to automatically detect what makes sense in a document.
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
This paper describes a document image analysis toolbox, including a collection of document image processing and analysis algorithms, performance metrics and evaluation tools, and graphical model tools for information integration.
Document Recognition and Retrieval XIV, 2007
We address the problem of content-based image retrieval in the context of complex document images. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation of the diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images. This is in contrast to existing datasets for content-based image retrieval, which normally contain only thousands of images. We believe that the formation of such a large dataset is essential to understanding the problems associated with realistic applications.
1995
In the late 1980's, the prevalence of fast computers, large computer memory, and inexpensive scanners fostered an increasing interest in document image analysis. With many paper documents being sent and received via fax machines and being stored digitally in large document databases, the interest grew to do more with these images than simply view and print them.
Due to the rapid increase in diverse digitized documents, there is high demand for a system that automatically retrieves document images from large collections of structured and unstructured document images. Many techniques have been developed in the literature to provide efficient and effective ways of retrieving and organizing these document images. This paper provides an overview of the methods that have been applied to document image retrieval over recent years. It has been found that, from a textual perspective, more attention has been paid to feature extraction methods that do not rely on OCR.
Document Recognition and Retrieval XXI, 2013
In this paper a semi-automated approach to document image clustering and retrieval is presented that creates links between different documents based on their content. Ideally, the initial bundling of shuffled document images can be reproduced to explore large document databases. Structural and textural features, which describe visual similarity, are extracted and used by experts (e.g. registrars) to interactively cluster the documents with a manually defined feature subset (e.g. checked paper, handwritten). The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text, and for hierarchical clustering with different feature subsets in different layers.
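To make the interactive step concrete, here is a minimal sketch of hierarchical clustering restricted to a manually chosen feature subset. The feature names and the use of Ward linkage are illustrative assumptions, not the authors' implementation:

```python
# A minimal sketch, under assumed feature names, of the expert-driven step
# the abstract describes: an expert picks a feature subset (e.g. "checked
# paper", "handwritten") and documents are clustered hierarchically on it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

FEATURES = ["checked_paper", "handwritten", "printed_text", "layout_density"]

def cluster_on_subset(X: np.ndarray, subset: list[str], n_clusters: int):
    """Hierarchical (Ward) clustering restricted to the chosen feature columns."""
    cols = [FEATURES.index(f) for f in subset]
    Z = linkage(X[:, cols], method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Hypothetical usage with a rows-are-documents feature matrix:
# labels = cluster_on_subset(doc_features, ["checked_paper", "handwritten"], 5)
```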
1997
This paper describes the Document Image Understanding Toolbox currently under development at the University of Washington's Intelligent Systems Laboratory. The Toolbox provides a common data structure and a variety of document image analysis and understanding algorithms from which Toolbox users can construct document image processing systems. An algorithm for font attribute recognition, based on the image analysis techniques available in the ISL DIU Toolbox, is also presented.
2002
Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is Optical Character Recognition (OCR) software, which recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document's contents. In this paper we briefly describe the various components of a document analysis system.
Lecture Notes in Computer Science, 2017
We present the DAE Platform in the specific context of reproducible research. DAE was developed at Lehigh University for the Document Image Analysis research community to distribute document images and associated document analysis algorithms, as well as an unlimited range of annotations and ground truth for benchmarking and evaluating new contributions to the state of the art. DAE was conceived from the beginning with reproducibility and data provenance in mind. In this paper we analyze more specifically how this approach answers a number of challenges raised by the need to provide fully reproducible experimental research. Furthermore, since DAE has been up and running without interruption since 2010, we are in a position to provide a qualitative analysis of the technological choices made at the time, and to suggest some new perspectives in light of more recent technologies and practices.
Computer, 1992
Intelligent document segmentation can bring electronic browsing within the reach of most users. The authors show how this is achieved through document processing, analysis, and parsing of the graphic sentence.

Let's quickly calculate the requirements of electronic data storage and access for a standard library of technical journals. A medium-sized research library subscribes to about 2,000 periodicals, each averaging about 500 pages per volume, for a total of one million pages per year. Although this article was output to film at 1,270 dpi (dots per inch) by an imagesetter, reproduction on a 300-dpi laser printer or display would be marginally acceptable to most readers (at least for the text and some of the art). At 300 dpi, each page contains about six million pixels (picture elements). At a conservative compression ratio of 10:1 (using existing facsimile methods), this yields 80 gigabytes per year for the entire collection of periodicals. While this volume is well beyond the storage capabilities of individual workstations, it is acceptable for a library file server. (Of course, unformatted text requires only about 6 kilobytes per page even without compression, but it is not an acceptable vehicle for technical material.)

A 10-page article can be transmitted over a high-speed network, and printed or displayed in image form, in far less time than it takes to walk to a nearby library. Furthermore, while Computer may be available in most research libraries, you may have to wait several days for an interlibrary loan through facsimile or courier. There is, therefore, ample motivation to develop systems for the electronic distribution of digitized technical material. However, even if the material is available in digital image form, not everyone has convenient access to a high-speed line, a laser printer, or a 2,100 x 3,000-dot display. We show how intelligent document segmentation can bring electronic browsing within the reach of readers equipped with only a modem and a personal computer.

Document analysis constitutes a domain of about the right degree of difficulty for current research on knowledge-based image-processing algorithms. This is important because document analysis itself is a transient application: there is no question that eventually information producers and consumers must be digitally linked. But there are also practical advantages to recognizing the location and extent of significant blocks of information on the page. This is also true for segmenting and …
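The article's storage estimate can be checked directly; the short calculation below reproduces the stated figures (the small gap to the quoted 80 gigabytes is rounding in the original):

```python
# The article's back-of-the-envelope storage estimate as plain arithmetic:
# one million pages per year, ~6 million binary pixels per page at 300 dpi,
# compressed 10:1 with facsimile-style methods.
pages_per_year = 2_000 * 500                # 2,000 periodicals x 500 pages
bits_per_page = 6_000_000                   # ~6 million 1-bit pixels at 300 dpi
compressed_bytes = bits_per_page / 8 / 10   # 10:1 compression ratio
total_gb = pages_per_year * compressed_bytes / 1e9
print(f"{total_gb:.0f} GB/year")            # ~75 GB, roughly the quoted 80 GB
```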
Proceedings of the fourth workshop on …, 2010
The goal of document image analysis is to produce interpretations that match those of a fluent and knowledgeable human when viewing the same input. Because computer vision techniques are not perfect, the text that results when processing scanned pages is frequently noisy. Building on previous work, we propose a new paradigm for handling the inevitable incomplete, partial, erroneous, or slightly orthogonal interpretations that commonly arise in document datasets. Starting from the observation that interpretations ...
International Journal of Computer Applications, 2010
The economic feasibility of creating large databases of document images has created a tremendous need for robust ways to access their information. Printed documents are scanned for archiving, or in an attempt to move towards a paperless office, and are stored as images. In this paper, we provide a survey of methods developed by researchers to access document images. The survey covers the current state of the art in research on document image retrieval based on image content such as signatures, logos, machine print, and different fonts.
Lecture Notes in Computer Science, 2006
The research goal of highly versatile document analysis systems, capable of performing useful functions on the great majority of document images, seems to be receding, even in the face of decades of research. One family of nearly universally applicable capabilities includes document image content extraction tools able to locate regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc. To solve this problem in its full generality requires coping with a vast diversity of document and image types. The severity of the methodological problems is suggested by the lack of agreement within the R&D community on even what is meant by a representative set of samples in this context. Even when this is agreed, it is often not clear how sufficiently large sets for training and testing can be collected and ground-truthed. Perhaps this can be alleviated by discovering a principled way to amplify sample sets using synthetic variations. We will then need classification methodologies capable of learning automatically from these huge sample sets in spite of their poorly parameterized, or unparameterizable, distributions. Perhaps fast expected-time approximate k-nearest neighbors classifiers are a good solution, even if they tend to require enormous data structures: hashed k-d trees seem promising. We discuss these issues and report recent progress towards their resolution.
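As a rough illustration of the k-nearest-neighbors direction the abstract suggests, here is a minimal sketch using an ordinary k-d tree from SciPy; the hashed k-d trees the authors mention would replace the plain tree, and the feature vectors and labels are assumed:

```python
# A minimal sketch of fast k-nearest-neighbor classification over region
# feature vectors, using SciPy's standard k-d tree (not the hashed k-d
# trees the abstract mentions). Labels must be non-negative integers.
import numpy as np
from scipy.spatial import cKDTree

def knn_classify(train_X, train_y, query_X, k=5):
    """Label each query vector by majority vote among its k nearest neighbors."""
    tree = cKDTree(train_X)
    _, idx = tree.query(query_X, k=k)
    votes = train_y[idx]                          # shape: (n_queries, k)
    return np.array([np.bincount(v).argmax() for v in votes])

# Hypothetical usage, assuming integer content-class labels per region:
# labels = knn_classify(train_features, train_labels, page_region_features)
```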
IEEE Transactions on Knowledge and Data Engineering, 2004
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval.
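The similarity computation can be illustrated with a plain Levenshtein-style dynamic program over primitive strings. This is a generic sketch, not the paper's inexact matching scheme, whose primitives and costs are tailored to serif and font variation:

```python
# A minimal sketch of the inexact-matching idea: word images are reduced to
# primitive strings, and an edit distance scores their similarity. The
# primitives and uniform costs here are placeholders, not the paper's scheme.
def edit_distance(a: str, b: str, sub_cost: float = 1.0) -> float:
    """Levenshtein distance between two primitive strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ca != cb) * sub_cost))  # substitution
        prev = cur
    return prev[-1]

def similarity(s1: str, s2: str) -> float:
    """Normalized similarity in [0, 1]; 1 means identical primitive strings."""
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2), 1)
```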
Document Analysis and …, 2011
This contest aims to provide a metric giving indications on the influence of individual document analysis stages to overall end-to-end applications. Contestants are provided with a full, working pipeline which operates on a page image to extract useful information. The pipeline is built with clearly identified analysis stages (e.g. binarization, skew detection, layout analysis, OCR ...) that have a formalized input and output. Contestants are invited to contribute their own algorithms as an alternative to one or more of the initially provided stages. The evaluation measures the overall impact of the contributed algorithm on the final (end-of-pipeline) output.
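The pipeline structure the contest describes, stages with formalized input and output that can be swapped one at a time, might look like the following sketch; the stage names come from the abstract's examples, while the function names are invented for illustration:

```python
# A minimal sketch of the contest's pluggable pipeline: each stage has a
# formalized input/output contract, and contestants replace exactly one
# stage while the end-to-end evaluation stays fixed.
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(page_image: Any, stages: dict[str, Stage]) -> Any:
    """Run the named stages in order, each consuming the previous output."""
    data = page_image
    for name in ("binarization", "skew_detection", "layout_analysis", "ocr"):
        data = stages[name](data)
    return data

# Hypothetical usage: swap in one contributed algorithm, keep the baseline
# for everything else, then score the end-of-pipeline output as usual.
# stages = dict(baseline_stages, binarization=my_binarizer)
# result = run_pipeline(page, stages)
```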
The 15th International Scientific Conference eLearning and Software for Education, 2019
Technology advances to make life easier for people. We tend to surround ourselves with devices that are as small as possible and have the highest computing power, and the need to access data from anywhere is an important detail. As a consequence, digital documents have been gaining ground on printed ones, and in some sectors the latter have even been replaced. The need and the obligation to preserve the written cultural heritage, represented by books and valuable documents, some of them rare and even unique, forced us to imagine a system that protects the patrimony but also makes it accessible. In order to make books easily available to the public, at the lowest possible risk to the originals, we came to the idea of designing and building an efficient digitization system for these records. The current article presents the proposed architecture of a Document Image Analysis System that processes information with individual modules for each type of operation. The main goal of such a tool is to recognize information in documents and extract it for electronic use. The flow of operations is indicated by the user, and some steps can be skipped depending on the user's needs. In order to design an efficient Document Image Analysis System, we take a three-axis approach: Education, involving students who can be assigned module-replacement tasks and have their homework validated; Research, performing various tests; and Performance, testing module interconnection and making the system highly configurable. No matter which axis is considered, the main goal is the flexibility of the system, achieved through individual modules as standalone binaries or collections of binaries linked via scripts. Each module is designed to accomplish a certain major task by executing several sub-tasks whose results, in most cases, are subject to an intelligent voting process that produces the module's output data.
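The voting process mentioned at the end might be sketched as follows; the function names are invented, and since the abstract does not specify the actual voting rule, a simple majority is assumed:

```python
# A minimal sketch of intra-module voting: several sub-task implementations
# run on the same input, and the module's output is the most frequent
# result. Names and the majority rule are assumptions; results must be
# hashable values (e.g. recognized strings).
from collections import Counter
from typing import Any, Callable

def vote(subtasks: list[Callable[[Any], Any]], data: Any) -> Any:
    """Run every sub-task on the input and return the most common result."""
    results = [task(data) for task in subtasks]
    winner, _ = Counter(results).most_common(1)[0]
    return winner

# Hypothetical usage with three interchangeable OCR sub-tasks:
# module_output = vote([ocr_engine_a, ocr_engine_b, ocr_engine_c], region)
```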