A* Path Planning for Line Segmentation of Handwritten Documents

Marco A. Wiering

2014

Abstract

This paper describes the use of a novel A* path-planning algorithm for performing line segmentation of handwritten documents. The novelty of the proposed approach lies in a combination of simple soft cost functions that allows an artificial agent to compute paths separating the upper and lower text fields. Because the cost functions are soft, the agent can compute near-optimal separating paths even where the upper and lower text parts overlap. We have performed experiments on the Saint Gall and Monk line segmentation (MLS) datasets. The experimental results show that the proposed method performs very well on the Saint Gall dataset and that it also copes well with the much more complicated MLS dataset.
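
The general idea of a soft-cost A* search can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `astar_separating_path`, the move set, and the weights `ink_cost` and `vertical_cost` are assumptions chosen for illustration, and the abstract does not specify the actual cost functions. The point shown is only that ink pixels are penalised rather than forbidden, so a left-to-right path still exists where the two text lines touch.

```python
import heapq


def astar_separating_path(ink, start_row, ink_cost=50.0, vertical_cost=2.0):
    """Find a cheap left-to-right path through a binary image.

    ink: list of rows, each a list of 0 (background) or 1 (ink).
    Hypothetical soft costs: every step costs 1, a step that changes
    row costs `vertical_cost` extra, and stepping onto ink costs
    `ink_cost` extra instead of being disallowed.
    """
    rows, cols = len(ink), len(ink[0])
    start = (start_row, 0)
    # Right, diagonal right, and purely vertical moves.
    moves = [(0, 1), (-1, 1), (1, 1), (-1, 0), (1, 0)]
    best = {start: 0.0}   # cheapest known cost to reach each cell
    parent = {}           # back-pointers for path reconstruction
    # Heuristic: remaining columns. Every step costs at least 1 and
    # advances at most one column, so this never overestimates.
    frontier = [(float(cols - 1), 0.0, start)]
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best[cell]:
            continue  # stale queue entry
        r, c = cell
        if c == cols - 1:  # any cell in the last column is a goal
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1], g
        for dr, dc in moves:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            step = 1.0 + vertical_cost * abs(dr) + ink_cost * ink[nr][nc]
            new_g = g + step
            if new_g < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_g
                parent[(nr, nc)] = cell
                heapq.heappush(frontier, (new_g + (cols - 1 - nc), new_g, (nr, nc)))
    return None, float("inf")


# Toy image: two text lines (rows 0 and 4) and a descender at (2, 2).
toy = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
toy_path, toy_cost = astar_separating_path(toy, start_row=2)
```

In the toy image the path bends around the single ink pixel because one row change is cheaper than crossing ink. If a whole column were ink, the search would still return a path and simply pay the ink penalty once, which is the behaviour a hard obstacle constraint cannot provide.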

Text-line extraction in handwritten documents is an important step for document image understanding, and a number of algorithms have been proposed to address this problem. To overcome this limitation, we develop a text-line extraction algorithm for cursive handwriting. Our method is based on connected components (CCs); however, unlike conventional methods, we analyze strokes and partition under-segmented CCs into normalized ones. Due to this normalization, the proposed method is able to estimate the states of CCs for a range of different languages and writing styles.

I. INTRODUCTION

Text-line extraction in document images is an essential step for various document image processing tasks such as layout analysis and optical character recognition (OCR). There has therefore been a lot of research in this area, and a number of algorithms have been proposed for the extraction of text-lines in machine-printed document images. However, text-line extraction in handwritten documents is still considered a challenging problem: the scale and orientation of characters vary spatially, inter-line distances are irregular, and characters may touch across words and/or text-lines.

Handwriting detection is the technique, or ability, of a computer to receive and interpret intelligible handwritten input from a source. Handwriting recognition is comparatively difficult because different people have different handwriting styles. In optical character recognition, segmentation is a significant phase, and the accuracy of character recognition depends heavily on the accuracy of segmentation: incorrect segmentation leads to incorrect character recognition. The segmentation phase includes text-line, word, and character segmentation. Text-line detection and separation in digital document images is a challenging task for handwritten document analysis and character recognition. The problem is compounded if the text lines in the image are connected or overlapping. Such problems are more common in handwritten documents than in printed ones because of individuals' varying handwriting styles. Researchers continue to work on these problems for different languages.

We develop a language-independent text-line extraction algorithm. Most conventional work, however, has focused on specific character sets. That is, conventional algorithms address the variations caused by individual writers by exploiting language-specific features. The situation is worse for Indian scripts, where most characters are connected. On the other hand, character components are arranged one-dimensionally in cursive Latin-based and Indian scripts, allowing us to develop horizontal bottom-up clustering rules. From the estimated states of the CCs, we build a cost function whose minimization yields text-lines. We also develop an effective CC segmentation method: by partitioning under-segmented CCs into normalized ones, we can estimate states reliably in a variety of documents.
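
As a small illustration of the connected-component starting point described above, the following sketch labels the CCs of a binary image with a breadth-first flood fill. It covers only the labelling step; the stroke analysis, the partitioning of under-segmented CCs, and the cost function are not shown. The function name `label_components` and the choice of 8-connectivity are assumptions made for this example.

```python
from collections import deque


def label_components(ink):
    """Label 8-connected components of a binary image.

    ink: list of rows, each a list of 0 (background) or 1 (ink).
    Returns (labels, count), where labels has the same shape as ink
    and holds 0 for background or a component id starting at 1.
    """
    rows, cols = len(ink), len(ink[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if ink[r0][c0] != 1 or labels[r0][c0]:
                continue  # background, or already labelled
            count += 1
            labels[r0][c0] = count
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                # Visit all eight neighbours (assumed connectivity).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and ink[nr][nc] == 1
                                and not labels[nr][nc]):
                            labels[nr][nc] = count
                            queue.append((nr, nc))
    return labels, count
```

A component that spans two text lines (for example, a descender touching the line below) would come out of this step as a single under-segmented CC, which is the case the partitioning step described above is meant to handle.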
