2006
We explore connections between digital libraries and interactive document image analysis. Digital libraries can provide useful data and metadata for research in automated document image analysis, and allow unbiased testing of DIA algorithms. With these goals in mind, we suggest criteria for constructing and evaluating interactive DIA tools.
2004
No existing document image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to industrial and government agency users. Ideally, users should be able to search, access, examine, and navigate among document images as effectively as they can among encoded data files, using familiar interfaces and tools as fully as possible. We are investigating novel algorithms and software tools at the frontiers of document image analysis, information retrieval, text mining, and visualization that will assist in the full integration of such documents into collections of textual document images as well as "born digital" documents. Our approaches emphasize versatility first: that is, methods which work reliably across the broadest possible range of documents.
Lecture Notes in Computer Science, 2017
We present the DAE Platform in the specific context of reproducible research. DAE was developed at Lehigh University for the Document Image Analysis research community, to distribute document images and associated document analysis algorithms, as well as an unlimited range of annotations and ground truth for benchmarking and evaluating new contributions to the state of the art. DAE was conceived from the beginning with reproducibility and data provenance in mind. In this paper we analyze more specifically how this approach answers a number of challenges raised by the need to provide fully reproducible experimental research. Furthermore, since DAE has been up and running without interruption since 2010, we are in a position to provide a qualitative analysis of the technological choices made at the time, and to suggest some new perspectives in light of more recent technologies and practices.
Proceedings of the 2007 international workshop on Semantically aware document processing and indexing - SADPI '07, 2007
A huge number of documents that were once available only in libraries are now on the web. Web access is a solution for protecting the cultural heritage and facilitating knowledge transmission. Most of these documents are displayed as images of the original paper pages and are indexed by hand. In this paper, we present how and why Document Image Analysis contributes to building the Digital Libraries of the future. Readers expect human-centred interactive reading stations, which imply the production of hyperdocuments that fit the reader's intentions and needs. Image analysis allows the meaningful document components and relationships to be extracted and categorized; it also provides reader-adapted visualisation of the original images. Document Image Analysis is an essential prerequisite for enriching hyperdocuments that support content-based reader activities such as information seeking and navigation. This paper focuses on the function of the original image: a reference for the reader and the input data that is processed to automatically detect what makes sense in a document.
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
This paper describes a document image analysis toolbox, including a collection of document image processing and analysis algorithms, performance metrics and evaluation tools, and graphical model tools for information integration.
1995
The conversion of documents into electronic form has proved more difficult than anticipated. Document image analysis still accounts for only a small fraction of the rapidly-expanding document imaging market. Nevertheless, the optimism manifested over the last thirty years has not dissipated. Driven partly by document distribution on CD-ROM and via the World Wide Web, there is more interest in the preservation of layout and format attributes to increase legibility (sometimes called "page reconstruction") rather than just text/non-text separation. The realization that accurate document image analysis requires fairly specific pre-stored information has resulted in the investigation of new data structures for knowledge bases and for the representation of the results of partial analysis. At the same time, the requirements of downstream software, such as word processing, information retrieval and computer-aided design applications, favor turning the results of the analysis and recognition into some standard format like SGML or DXF. There is increased emphasis on large-scale, automated comparative evaluation, using laboriously compiled test databases. The cost of generating these databases has stimulated new research on synthetic noise models. According to recent publications, the accurate conversion of business letters, technical reports, large typeset repositories like patents, postal addresses, specialized line drawings, and office forms containing a mix of handprinted, handwritten and printed material, is finally on the verge of success.
International Journal on Document Analysis and Recognition (IJDAR), 2009
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.
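As a purely illustrative sketch of the classification step (not the paper's algorithm, whose features and learner are not specified here), a nearest-centroid classifier over made-up figure-level feature vectors could look like this:

```python
# Illustrative sketch only: a minimal nearest-centroid classifier over
# hypothetical figure-level feature vectors, standing in for the supervised
# step that sorts figures into the five classes the paper names
# (photographs, 2-D plots, 3-D plots, diagrams, others).
from math import dist

def train(samples):
    """samples: {class label: list of feature vectors} -> class centroids."""
    return {label: tuple(sum(col) / len(col) for col in zip(*vectors))
            for label, vectors in samples.items()}

def classify(centroids, features):
    # Assign the class whose centroid is nearest in feature space.
    return min(centroids, key=lambda label: dist(centroids[label], features))
```

Any real system of this kind would replace the toy vectors with features computed from the figure images themselves; the point here is only the train/classify split of a supervised approach.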
1995
In the late 1980's, the prevalence of fast computers, large computer memory, and inexpensive scanners fostered an increasing interest in document image analysis. With many paper documents being sent and received via fax machines and being stored digitally in large document databases, the interest grew to do more with these images than simply view and print them.
The 15th International Scientific Conference eLearning and Software for Education, 2019
We live in the century of technology, where the enormous evolution of data and science has recently favored a strong interest in processing, transmitting, and storing information. If, in the past, only a human mind could extract meaningful information from image data, after decades of dedicated research scientists have managed to build complex systems that can identify different areas, tables, and texts in scanned documents, with all the obtained information easily accessed and passed from one person to another. Books, newspapers, maps, letters, drawings: all types of documents can be scanned and processed in order to become available in a digital format. In the digital world, the storage space is very small compared to physical documents, so these applications will replace millions of old paper volumes with a single memory disk, accessible simultaneously to anyone with Internet access and without risk of deterioration. Other problems, such as ecological issues and accessibility and flexibility constraints, can be solved by the use of document image analysis systems. This article presents the methods and techniques used to process on-paper documents and convert them to electronic ones, starting from the pixel level and working up to the level of the entire document. The main purpose of Document Image Analysis Systems is to recognize texts and graphical content in images and to extract, format, and present the contained information according to people's needs. We also try to provide solid ground for practitioners who implement systems in this category to enhance their unsupervised processing features, in order to make physical documents easily available to the masses.
1997
This paper describes the Document Image Understanding Toolbox currently under development at the University of Washington's Intelligent Systems Laboratory. The Toolbox provides a common data structure and a variety of document image analysis and understanding algorithms from which Toolbox users can construct document image processing systems. An algorithm for font attribute recognition based on the image analysis techniques available in the ISL DIU Toolbox is also presented.
Computer, 1992
Intelligent document segmentation can bring electronic browsing within the reach of most users. The authors show how this is achieved through document processing, analysis, and parsing the graphic sentence. Let's quickly calculate the requirements of electronic data storage and access for a standard library of technical journals. A medium-sized research library subscribes to about 2,000 periodicals, each averaging about 500 pages per volume, for a total of one million pages per year. Although this article was output to film at 1,270 dpi (dots per inch) by an imagesetter, reproduction on a 300-dpi laser printer or display would be marginally acceptable to most readers (at least for the text and some of the art). At 300 dpi, each page contains about six million pixels (picture elements). At a conservative compression ratio of 10:1 (using existing facsimile methods), this yields 80 gigabytes per year for the entire collection of periodicals. While this volume is well beyond the storage capabilities of individual workstations, it is acceptable for a library file server. (Of course, unformatted text requires only about 6 kilobytes per page even without compression, but it is not an acceptable vehicle for technical material.) A 10-page article can be transmitted over a high-speed network, and printed or displayed in image form in far less time than it takes to walk to a nearby library. Furthermore, while Computer may be available in most research libraries, you may have to wait several days for an interlibrary loan through facsimile or courier.
We show how intelligent document segmentation can bring electronic browsing within the reach of readers equipped with only a modem and a personal computer. Document analysis constitutes a domain of about the right degree of difficulty for current research on knowledge-based image-processing algorithms. This is important because document analysis itself is a transient application: there is no question that eventually information producers and consumers must be digitally linked. But there are also practical advantages to recognizing the location and extent of significant blocks of information on the page.
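The storage arithmetic in the abstract above can be checked directly; this sketch simply restates the abstract's own figures and lands near its 80-gigabyte estimate:

```python
# Reproducing the abstract's back-of-envelope storage estimate, using its
# own quoted figures as constants.
PAGES_PER_YEAR = 2_000 * 500      # periodicals x pages per volume = 1,000,000
PIXELS_PER_PAGE = 6_000_000       # ~300-dpi binary page image, 1 bit per pixel
COMPRESSION = 10                  # conservative 10:1 facsimile compression

bits_per_page = PIXELS_PER_PAGE / COMPRESSION
gigabytes_per_year = PAGES_PER_YEAR * bits_per_page / 8 / 1e9
print(f"{gigabytes_per_year:.0f} GB per year")   # ~75 GB, of the order of the quoted 80 GB
```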
Document Analysis and …, 2011
This contest aims to provide a metric giving indications on the influence of individual document analysis stages to overall end-to-end applications. Contestants are provided with a full, working pipeline which operates on a page image to extract useful information. The pipeline is built with clearly identified analysis stages (e.g. binarization, skew detection, layout analysis, OCR ...) that have a formalized input and output. Contestants are invited to contribute their own algorithms as an alternative to one or more of the initially provided stages. The evaluation measures the overall impact of the contributed algorithm on the final (end-of-pipeline) output.
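For illustration only, the stage contract the contest describes might be sketched as follows; the stage names come from the abstract, while the dictionary-based page representation and all function bodies are invented stand-ins:

```python
# Hypothetical sketch of a swappable analysis pipeline: each stage takes and
# returns a page dict (a stand-in for the contest's formalized input/output),
# so a contestant's stage can replace a provided one without touching the rest.

def binarize(page):
    return {**page, "binary": True}

def deskew(page):
    return {**page, "skew": 0.0}

def ocr(page):
    return {**page, "text": "recognized text"}

DEFAULT_STAGES = [binarize, deskew, ocr]

def run_pipeline(page, stages=DEFAULT_STAGES):
    for stage in stages:                 # each stage: page dict -> page dict
        page = stage(page)
    return page

# A contestant swaps in an alternative binarizer; only the final,
# end-of-pipeline output is evaluated.
def contestant_binarize(page):
    return {**page, "binary": True, "method": "alternative"}

result = run_pipeline({"image": "page.png"}, [contestant_binarize, deskew, ocr])
```

The design point the contest relies on is that every stage honours the same input/output contract, so the evaluation can measure one substitution's effect on the end-to-end result.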
The 15th International Scientific Conference eLearning and Software for Education, 2019
Technology advances to make life easier for people. We tend to surround ourselves with devices as small as possible and with the highest computing power, and the need for data access from everywhere is an important consideration. As a consequence, digital documents have been gaining ground on printed ones, and in some sectors the latter have even been replaced. The need and the obligation to preserve the written cultural heritage, represented by books and valuable documents, some of them rare and even unique, forced us to imagine a system that protects the patrimony but also makes it accessible. In order to make books easily available to the public, at the lowest possible risk to the originals, we came to the idea of designing and creating an efficient digitization system for these records. The current article presents the proposed architecture of a Document Image Analysis System that will process the information with individual modules for each type of operation. The main goal of such a tool is to recognize information in the documents and extract it for electronic use. The flow of operations is indicated by the user; some steps can be eliminated depending on the user's wishes and needs. In order to design an efficient Document Image Analysis System, we need a three-axis approach: Education, involving students who can receive tasks for replacing modules and have their homework validated; Research, performing various tests; and Performance, testing the module interconnection and making the system highly configurable. No matter which axis is considered, the main aim is the flexibility of the system, achieved through individual modules as physical binaries or collections of binaries that are linked via scripts. Each module is designed to accomplish a certain major task by executing several sub-tasks whose results, in most cases, are subject to an intelligent voting process that produces the module's output data.
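The per-module voting step described in this abstract might look like the following minimal sketch; a plain majority vote is an assumed simplification, since the paper's actual "intelligent voting process" is not specified here:

```python
# Assumed simplification of the module-level voting: several sub-task
# variants produce candidate answers, and the module emits the majority one.
from collections import Counter

def vote(candidates):
    """Return the most common result among sub-task outputs."""
    [(winner, _count)] = Counter(candidates).most_common(1)
    return winner
```

For instance, if three hypothetical zone classifiers label a region "table", "table", and "text", the module would report "table".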
Nowadays, Digital Libraries have become a widely used service to store and share both born-digital documents and digital versions of works held by traditional libraries. Document images are intrinsically unstructured, and the structure and semantics of the digitized documents are largely lost during the conversion. Several techniques from the Document Image Analysis research area have been proposed in the past to deal with document image retrieval applications. In this chapter a survey of the more recent techniques applied in the field of recognition and retrieval of text and graphical documents is presented. In particular we describe techniques related to recognition-free approaches.
2006
We examine some research issues in pattern recognition and image processing that have been spurred by the needs of digital libraries. Broader-and not only linguistic-context must be introduced in character recognition on low-contrast, tightly-set documents because the conversion of documents to coded (searchable) form is lagging far behind conversion to image formats. At the same time, the prevalence of imaged documents over coded documents gives rise to interesting research problems in interactive annotation of document images. At the level of circulation, reformatting document images to accommodate diverse user needs remains a challenge.
2002
Abstract Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document's contents. In this paper we briefly describe various components of a document analysis system.
Document Recognition and Retrieval XVIII, 2011
The Field of Document Recognition is bipolar. On one end lies the excellent work of academic institutions engaging in original research on scientifically interesting topics. On the other end lies the document recognition industry which services needs for high-volume data capture for transaction and back-office applications. These realms seldom meet, yet the need is great to address technical hurdles for practical problems using modern approaches from the Document Recognition, Computer Vision, and Machine Learning disciplines. We reflect on three categories of problems we have encountered which are both scientifically challenging and of high practical value. These are Doctype Classification, Functional Role Labeling, and Document Sets. Doctype Classification asks, "What is the type of page I am looking at?" Functional Role Labeling asks, "What is the status of text and graphical elements in a model of document structure?" Document Sets asks, "How are pages and their contents related to one another?" Each of these has ad hoc engineering approaches that provide 40-80% solutions, and each of them begs for a deeply grounded formulation both to provide understanding and to attain the remaining 20-60% of practical value. The practical need is not purely technical but also depends on the user experience in application setup and configuration, and in collection and groundtruthing of sample documents. The challenge therefore extends beyond the science behind document image recognition and into user interface and user experience design.
1995
A strategy for document analysis is presented which uses Portable Document Format (PDF -the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the 'semantic gap' between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
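To make the block-tagging idea concrete, here is a toy, invented example in the spirit of the paper's approach; the field names, the threshold, and the tag set are assumptions for illustration, not the paper's blackboard system:

```python
# Toy illustration (not the paper's blackboard system): tag a PDF-derived
# block using the font and layout cues PDF exposes. The dict fields and the
# 1.4x heading threshold are invented for this example.

def tag_block(block, body_font_size=10.0):
    if block.get("type") == "image":
        return "figure"
    if block.get("font_size", body_font_size) >= 1.4 * body_font_size:
        return "heading"
    return "body_text"
```

The point the abstract makes is precisely that PDF already carries cues such as font size and line spacing, so even simple rules like these start from richer input than a scanned bitmap would.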
Sensors
Document imaging/scanning approaches are essential techniques for digitalizing documents in various real-world contexts, e.g., libraries, office communication, management of workflows, and electronic archiving [...]
Journal of e-learning and knowledge society, 2014
Handwritten documents provide a rich source of data and, with the growth in the availability of digitised documents, it becomes increasingly important to improve our ability to analyse and extract “knowledge” from such sources. This paper describes an approach to the provision of tools which can extract information about the writer of handwritten documents, especially those which were written in earlier times and which constitute key elements in our heritage and culture. We show how the constraints inherent in such documents influence our analytical approach, and we also show how developing appropriate “knowledge extraction” techniques can also be essential in other, more general, important application scenarios.