2004
No existing document image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to industrial and government agency users. Ideally, users should be able to search, access, examine, and navigate among document images as effectively as they can among encoded data files, using familiar interfaces and tools as fully as possible. We are investigating novel algorithms and software tools at the frontiers of document image analysis, information retrieval, text mining, and visualization that will assist in the full integration of such documents into collections of textual document images as well as "born digital" documents. Our approaches emphasize versatility first: that is, methods which work reliably across the broadest possible range of documents.
International Journal on Document Analysis and Recognition (IJDAR), 2009
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.
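As an illustration of the classification stage described above, here is a minimal sketch, in Python, of a supervised figure classifier over the five classes named in the abstract. The features, file handling, and model choice are our own illustrative assumptions, not the paper's actual algorithm:

```python
# A minimal sketch (not the authors' implementation) of the supervised
# figure-classification step: crude global image features fed to an SVM
# that labels each figure with one of the five classes from the abstract.
import numpy as np
from PIL import Image
from sklearn.svm import SVC

CLASSES = ["photograph", "2d_plot", "3d_plot", "diagram", "other"]

def figure_features(path: str) -> np.ndarray:
    """Grayscale intensity histogram plus edge density as a feature vector."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    hist, _ = np.histogram(img, bins=32, range=(0.0, 1.0), density=True)
    # Edge density: fraction of pixels with a strong horizontal gradient.
    edges = np.abs(np.diff(img, axis=1)) > 0.25
    return np.concatenate([hist, [edges.mean()]])

def train_classifier(paths, labels):
    X = np.stack([figure_features(p) for p in paths])
    y = np.array([CLASSES.index(l) for l in labels])
    return SVC(kernel="rbf", gamma="scale").fit(X, y)

# Hypothetical usage, assuming labeled training figures exist on disk:
# clf = train_classifier(train_paths, train_labels)
# print(CLASSES[clf.predict([figure_features("fig3.png")])[0]])
```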
The 15th International Scientific Conference eLearning and Software for Education, 2019
We live in the century of technology, in which the rapid growth of data and science has spurred strong interest in processing, transmitting, and storing information. Where once only a human mind could extract meaningful information from image data, decades of dedicated research have produced complex systems that can identify regions, tables, and text in scanned documents, so that the extracted information can be easily accessed and shared. Books, newspapers, maps, letters, drawings: all types of documents can be scanned and processed to become available in digital format. Digital storage is tiny compared with physical documents, so such applications can replace millions of old paper volumes with a single disk, accessible simultaneously to anyone with an Internet connection and without risk of deterioration. Document image analysis systems can also address other problems, such as ecological issues and accessibility and flexibility constraints. This article presents the methods and techniques used to process paper documents and convert them to electronic form, starting at the pixel level and working up to the level of the entire document. The main purpose of Document Image Analysis Systems is to recognize text and graphics in images and to extract, format, and present the contained information according to readers' needs. We also aim to provide solid ground for practitioners implementing such systems to improve their unsupervised processing, making physical documents easily available to the masses.
1995
The conversion of documents into electronic form has proved more difficult than anticipated. Document image analysis still accounts for only a small fraction of the rapidly-expanding document imaging market. Nevertheless, the optimism manifested over the last thirty years has not dissipated. Driven partly by document distribution on CD-ROM and via the World Wide Web, there is more interest in the preservation of layout and format attributes to increase legibility (sometimes called "page reconstruction") rather than just text/non-text separation. The realization that accurate document image analysis requires fairly specific pre-stored information has resulted in the investigation of new data structures for knowledge bases and for the representation of the results of partial analysis. At the same time, the requirements of downstream software, such as word processing, information retrieval and computer-aided design applications, favor turning the results of the analysis and recognition into some standard format like SGML or DXF. There is increased emphasis on large-scale, automated comparative evaluation, using laboriously compiled test databases. The cost of generating these databases has stimulated new research on synthetic noise models. According to recent publications, the accurate conversion of business letters, technical reports, large typeset repositories like patents, postal addresses, specialized line drawings, and office forms containing a mix of handprinted, handwritten and printed material, is finally on the verge of success.
Proceedings of the 2007 international workshop on Semantically aware document processing and indexing - SADPI '07, 2007
A huge number of documents that were once available only in libraries are now on the web. Web access is a solution for protecting cultural heritage and facilitating the transmission of knowledge. Most of these documents are displayed as images of the original paper pages and are indexed by hand. In this paper, we present how and why Document Image Analysis contributes to building the digital libraries of the future. Readers expect human-centred interactive reading stations, which imply the production of hyperdocuments that fit the reader's intentions and needs. Image analysis allows the meaningful document components and relationships to be extracted and categorized; it also provides reader-adapted visualisations of the original images. Document Image Analysis is an essential prerequisite for enriching hyperdocuments that support content-based reader activities such as information seeking and navigation. This paper focuses on the function of the original image: a reference for the reader, and the input data that is processed to automatically detect what makes sense in a document.
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
This paper describes a document image analysis toolbox, including a collection of document image processing and analysis algorithms, performance metrics and evaluation tools, and graphical model tools for information integration.
Document Recognition and Retrieval XIV, 2007
We address the problem of content-based image retrieval in the context of complex document images. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation of the diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images. This is in contrast to existing datasets for content-based image retrieval, which normally contain only thousands of images. We believe that the formation of such a large dataset is essential to understanding the problems associated with realistic applications.
1995
In the late 1980's, the prevalence of fast computers, large computer memory, and inexpensive scanners fostered an increasing interest in document image analysis. With many paper documents being sent and received via fax machines and being stored digitally in large document databases, the interest grew to do more with these images than simply view and print them.
Due to the rapid increase in diverse digitized documents, there is high demand for a system that automatically retrieves document images from large collections of structured and unstructured document images. Many techniques have been developed in the literature to provide efficient and effective ways of retrieving and organizing these document images. This paper provides an overview of the methods that have been applied to document image retrieval over recent years. It has been found that, from a textual perspective, more attention has been paid to feature extraction methods that do not rely on OCR.
Document Recognition and Retrieval XXI, 2013
In this paper a semi-automated approach to document image clustering and retrieval is presented that creates links between different documents based on their content. Ideally, the initial bundling of shuffled document images can be reproduced to explore large document databases. Structural and textural features, which describe visual similarity, are extracted and used by experts (e.g. registrars) to interactively cluster the documents with a manually defined feature subset (e.g. checked paper, handwritten). The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text, and for hierarchical clustering with different feature subsets in different layers.
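To make the interactive step concrete, here is a minimal sketch of hierarchical clustering restricted to a manually chosen feature subset. The feature names and the use of Ward linkage are illustrative assumptions, not the authors' implementation:

```python
# A minimal sketch, under assumed feature names, of the expert-driven step
# the abstract describes: an expert picks a feature subset (e.g. "checked
# paper", "handwritten") and documents are clustered hierarchically on it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

FEATURES = ["checked_paper", "handwritten", "printed_text", "layout_density"]

def cluster_on_subset(X: np.ndarray, subset: list[str], n_clusters: int):
    """Hierarchical (Ward) clustering restricted to the chosen feature columns."""
    cols = [FEATURES.index(f) for f in subset]
    Z = linkage(X[:, cols], method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Hypothetical usage with a rows-are-documents feature matrix:
# labels = cluster_on_subset(doc_features, ["checked_paper", "handwritten"], 5)
```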
1997
This paper describes the Document Image Understanding Toolbox currently under development at the University of Washington's Intelligent Systems Laboratory. The Toolbox provides a common data structure and a variety of document image analysis and understanding algorithms from which Toolbox users can construct document image processing systems. An algorithm for font attribute recognition, based on the image analysis techniques available in the ISL DIU Toolbox, is also presented.
2002
Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is Optical Character Recognition (OCR) software, which recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document's contents. In this paper we briefly describe the various components of a document analysis system.
Lecture Notes in Computer Science, 2017
We present the DAE Platform in the specific context of reproducible research. DAE was developed at Lehigh University for the Document Image Analysis research community to distribute document images and associated document analysis algorithms, as well as an unlimited range of annotations and ground truth for benchmarking and evaluating new contributions to the state of the art. DAE was conceived from the beginning with reproducibility and data provenance in mind. In this paper we analyze more specifically how this approach answers a number of challenges raised by the need to provide fully reproducible experimental research. Furthermore, since DAE has been up and running without interruption since 2010, we are in a position to provide a qualitative analysis of the technological choices made at the time, and to suggest some new perspectives in light of more recent technologies and practices.
Computer, 1992
Intelligent document segmentation can bring electronic browsing within the reach of most users. The authors show how this is achieved through document processing, analysis, and parsing of the graphic sentence.

Let's quickly calculate the requirements of electronic data storage and access for a standard library of technical journals. A medium-sized research library subscribes to about 2,000 periodicals, each averaging about 500 pages per volume, for a total of one million pages per year. Although this article was output to film at 1,270 dpi (dots per inch) by an imagesetter, reproduction on a 300-dpi laser printer or display would be marginally acceptable to most readers (at least for the text and some of the art). At 300 dpi, each page contains about six million pixels (picture elements). At a conservative compression ratio of 10:1 (using existing facsimile methods), this yields 80 gigabytes per year for the entire collection of periodicals. While this volume is well beyond the storage capabilities of individual workstations, it is acceptable for a library file server. (Of course, unformatted text requires only about 6 kilobytes per page even without compression, but it is not an acceptable vehicle for technical material.)

A 10-page article can be transmitted over a high-speed network, and printed or displayed in image form, in far less time than it takes to walk to a nearby library. Furthermore, while Computer may be available in most research libraries, you may have to wait several days for an interlibrary loan through facsimile or courier. There is, therefore, ample motivation to develop systems for the electronic distribution of digitized technical material. However, even if the material is available in digital image form, not everyone has convenient access to a high-speed line, a laser printer, or a 2,100 x 3,000-dot display. We show how intelligent document segmentation can bring electronic browsing within the reach of readers equipped with only a modem and a personal computer.

Document analysis constitutes a domain of about the right degree of difficulty for current research on knowledge-based image-processing algorithms. This is important because document analysis itself is a transient application: there is no question that eventually information producers and consumers must be digitally linked. But there are also practical advantages to recognizing the location and extent of significant blocks of information on the page. This is also true for segmenting and …
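The article's storage estimate can be checked directly; the short calculation below reproduces the stated figures (the small gap to the quoted 80 gigabytes is rounding in the original):

```python
# The article's back-of-the-envelope storage estimate as plain arithmetic:
# one million pages per year, ~6 million binary pixels per page at 300 dpi,
# compressed 10:1 with facsimile-style methods.
pages_per_year = 2_000 * 500                # 2,000 periodicals x 500 pages
bits_per_page = 6_000_000                   # ~6 million 1-bit pixels at 300 dpi
compressed_bytes = bits_per_page / 8 / 10   # 10:1 compression ratio
total_gb = pages_per_year * compressed_bytes / 1e9
print(f"{total_gb:.0f} GB/year")            # ~75 GB, roughly the quoted 80 GB
```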
Proceedings of the fourth workshop on …, 2010
The goal of document image analysis is to produce interpretations that match those of a fluent and knowledgeable human when viewing the same input. Because computer vision techniques are not perfect, the text that results when processing scanned pages is frequently noisy. Building on previous work, we propose a new paradigm for handling the inevitable incomplete, partial, erroneous, or slightly orthogonal interpretations that commonly arise in document datasets. Starting from the observation that interpretations ...
International Journal of Computer Applications, 2010
The economic feasibility of creating large databases of document images has created a tremendous need for robust ways to access their information. Printed documents are scanned for archiving, or in an attempt to move towards a paperless office, and are stored as images. In this paper, we provide a survey of methods developed by researchers to access document images. The survey covers the current state of the art in research on document image retrieval based on image content such as signatures, logos, machine print, and different fonts.
Lecture Notes in Computer Science, 2006
The research goal of highly versatile document analysis systems, capable of performing useful functions on the great majority of document images, seems to be receding, even in the face of decades of research. One family of nearly universally applicable capabilities includes document image content extraction tools able to locate regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc. To solve this problem in its full generality requires coping with a vast diversity of document and image types. The severity of the methodological problems is suggested by the lack of agreement within the R&D community on even what is meant by a representative set of samples in this context. Even when this is agreed, it is often not clear how sufficiently large sets for training and testing can be collected and ground-truthed. Perhaps this can be alleviated by discovering a principled way to amplify sample sets using synthetic variations. We will then need classification methodologies capable of learning automatically from these huge sample sets in spite of their poorly parameterized, or unparameterizable, distributions. Perhaps fast expected-time approximate k-nearest neighbors classifiers are a good solution, even if they tend to require enormous data structures: hashed k-d trees seem promising. We discuss these issues and report recent progress towards their resolution.
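As a rough illustration of the k-nearest-neighbors direction the abstract suggests, here is a minimal sketch using an ordinary k-d tree from SciPy; the hashed k-d trees the authors mention would replace the plain tree, and the feature vectors and labels are assumed:

```python
# A minimal sketch of fast k-nearest-neighbor classification over region
# feature vectors, using SciPy's standard k-d tree (not the hashed k-d
# trees the abstract mentions). Labels must be non-negative integers.
import numpy as np
from scipy.spatial import cKDTree

def knn_classify(train_X, train_y, query_X, k=5):
    """Label each query vector by majority vote among its k nearest neighbors."""
    tree = cKDTree(train_X)
    _, idx = tree.query(query_X, k=k)
    votes = train_y[idx]                          # shape: (n_queries, k)
    return np.array([np.bincount(v).argmax() for v in votes])

# Hypothetical usage, assuming integer content-class labels per region:
# labels = knn_classify(train_features, train_labels, page_region_features)
```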
IEEE Transactions on Knowledge and Data Engineering, 2004
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval.
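The similarity computation can be illustrated with a plain Levenshtein-style dynamic program over primitive strings. This is a generic sketch, not the paper's inexact matching scheme, whose primitives and costs are tailored to serif and font variation:

```python
# A minimal sketch of the inexact-matching idea: word images are reduced to
# primitive strings, and an edit distance scores their similarity. The
# primitives and uniform costs here are placeholders, not the paper's scheme.
def edit_distance(a: str, b: str, sub_cost: float = 1.0) -> float:
    """Levenshtein distance between two primitive strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ca != cb) * sub_cost))  # substitution
        prev = cur
    return prev[-1]

def similarity(s1: str, s2: str) -> float:
    """Normalized similarity in [0, 1]; 1 means identical primitive strings."""
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2), 1)
```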
Document Analysis and …, 2011
This contest aims to provide a metric giving indications on the influence of individual document analysis stages to overall end-to-end applications. Contestants are provided with a full, working pipeline which operates on a page image to extract useful information. The pipeline is built with clearly identified analysis stages (e.g. binarization, skew detection, layout analysis, OCR ...) that have a formalized input and output. Contestants are invited to contribute their own algorithms as an alternative to one or more of the initially provided stages. The evaluation measures the overall impact of the contributed algorithm on the final (end-of-pipeline) output.
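The pipeline structure the contest describes, stages with formalized input and output that can be swapped one at a time, might look like the following sketch; the stage names come from the abstract's examples, while the function names are invented for illustration:

```python
# A minimal sketch of the contest's pluggable pipeline: each stage has a
# formalized input/output contract, and contestants replace exactly one
# stage while the end-to-end evaluation stays fixed.
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(page_image: Any, stages: dict[str, Stage]) -> Any:
    """Run the named stages in order, each consuming the previous output."""
    data = page_image
    for name in ("binarization", "skew_detection", "layout_analysis", "ocr"):
        data = stages[name](data)
    return data

# Hypothetical usage: swap in one contributed algorithm, keep the baseline
# for everything else, then score the end-of-pipeline output as usual.
# stages = dict(baseline_stages, binarization=my_binarizer)
# result = run_pipeline(page, stages)
```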
The 15th International Scientific Conference eLearning and Software for Education, 2019
Technology advances to make life easier for people. We tend to surround ourselves with devices that are as small as possible and have the highest computing power, and the need to access data from anywhere is an important detail. As a consequence, digital documents have been gaining ground on printed ones, and in some sectors the latter have even been replaced. The need and the obligation to preserve the written cultural heritage, represented by books and valuable documents, some of them rare and even unique, forced us to imagine a system that protects the patrimony but also makes it accessible. In order to make books easily available to the public, at the lowest possible risk to the originals, we came to the idea of designing and building an efficient digitization system for these records. The current article presents the proposed architecture of a Document Image Analysis System that processes information with individual modules for each type of operation. The main goal of such a tool is to recognize information in documents and extract it for electronic use. The flow of operations is indicated by the user, and some steps can be skipped depending on the user's needs. In order to design an efficient Document Image Analysis System, we take a three-axis approach: Education, involving students who can be assigned module-replacement tasks and have their homework validated; Research, performing various tests; and Performance, testing module interconnection and making the system highly configurable. No matter which axis is considered, the main goal is the flexibility of the system, achieved through individual modules as standalone binaries or collections of binaries linked via scripts. Each module is designed to accomplish a certain major task by executing several sub-tasks whose results, in most cases, are subject to an intelligent voting process that produces the module's output data.
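The voting process mentioned at the end might be sketched as follows; the function names are invented, and since the abstract does not specify the actual voting rule, a simple majority is assumed:

```python
# A minimal sketch of intra-module voting: several sub-task implementations
# run on the same input, and the module's output is the most frequent
# result. Names and the majority rule are assumptions; results must be
# hashable values (e.g. recognized strings).
from collections import Counter
from typing import Any, Callable

def vote(subtasks: list[Callable[[Any], Any]], data: Any) -> Any:
    """Run every sub-task on the input and return the most common result."""
    results = [task(data) for task in subtasks]
    winner, _ = Counter(results).most_common(1)[0]
    return winner

# Hypothetical usage with three interchangeable OCR sub-tasks:
# module_output = vote([ocr_engine_a, ocr_engine_b, ocr_engine_c], region)
```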