Configurable Table Structure Recognition

ABSTRACT
Today, PDF is one of the most popular document formats on the web. Many PDF documents are not images, but remain untagged. They have no tags for identifying the logical reading order, paragraphs, figures, and tables. One of the challenges with these documents is how to extract tables from them. The paper discusses a new system for table structure recognition in untagged PDF documents. It is formulated as a set of configurable parameters and ad-hoc heuristics for recovering table cells. We consider two different configurations of the system and demonstrate experimental results for both of them on an existing competition dataset.

CCS Concepts
• Applied computing → Document analysis; Document management and text processing;

Keywords
table extraction, table structure recognition, untagged PDF documents, PDF document analysis, PDF accessibility

1. INTRODUCTION
Today, PDF is one of the most popular document formats on the web, as can be measured by Google's "filetype:pdf" search. Many PDF documents are not images, but remain untagged. They have no tags for identifying the logical reading order, paragraphs, figures, and tables. Nganji [4] estimates that 95.5% of scientific articles published by four leading publishers are untagged PDF documents. One of the important challenges with these documents is how to extract tables from them.

Table extraction typically consists of two main steps: table detection, i.e. recovering the bounding box of a table in a document, and table structure recognition, i.e. recovering its rows, columns, and cells. Many of the existing methods for extracting tables from unstructured documents traditionally deal only with images or plain text as the source. They are considered in several surveys, including [1, 2]. These methods can be applied to PDF documents by converting PDF to these formats. However, this process leads to the loss of valuable information. Extracting tables from PDF directly can provide better results. PDF is a richer representation of documents in comparison with images and plain text. PDF documents can contain machine-readable text (text chunks with their positions, font characteristics, and order of appearance in the file), as well as vector graphics, including table rulings. We expect that these features allow tables to be extracted more accurately.

Several methods and tools for PDF table extraction have been proposed in the last two decades. Some of them are discussed in the surveys [1, 6, 7]. Ramel et al. [11] consider two techniques for detecting and recognizing tables in documents in an exchange format like PDF. The first is based on the analysis of ruling lines. The second analyzes the arrangement of text components. Hassan et al. [5] expand these ideas for PDF table extraction. In the TableSeer project, Liu et al. [8] propose methods for detecting tables in PDF documents and extracting their metadata (headers). They use text arrangement, fonts, whitespace, and keywords (e.g. "Table", "Figure"). Oro et al. [10] present PDF-TREX, a heuristic method where PDF table extraction is realized by building tables from content elements in a bottom-up way.

Yildiz et al. [15] propose a heuristic method for PDF table extraction that uses the 'pdftohtml'¹ tool to generate its input. Rastan et al. [12] consider a framework for end-to-end table processing, including the task of table structure recognition. They also use the 'pdftohtml' tool to prepare their input. However, this tool occasionally makes mistakes in combining text chunks that are located too close to each other, so the input can be corrupted. Nurminen [9] in his thesis describes comprehensive PDF table detection and structure recognition algorithms that have demonstrated high recall and precision on the "ICDAR 2013 Table Competition" dataset [4]. Some of them are implemented in the Tabula² system.

¹ [Link]
² [Link]

In contrast to the existing methods, we suggest a configurable system that is formulated as a set of customizable parameters and ad-hoc heuristics for recovering table cells from text chunks and rulings. We exploit features of table presentation in untagged PDF documents. Most of them, such as horizontal and vertical distances, fonts, and rulings, are well known and used in the existing methods. Additionally, we propose to exploit the order of appearance of text printing instructions in PDF files.
Usually, when a table printed in a PDF document was originally an object (e.g. a table in a Word document), then 1) one printing instruction forms a part or the whole of the textual content of only one physical cell; 2) the printing instructions forming the text inside each physical cell appear in the PDF file in the order that coincides with the human reading order of this text. We notice that this is true for many PDF generators. This feature can be especially useful in the case of multi-row cells in table heads without rulings.

We also consider two configurations of the system and demonstrate experimental results for both of them on the existing competition dataset, "ICDAR 2013 Table Competition".

Figure 1: Preprocessing of text chunks (a–c) and rulings (d–g).

Figure 2: Text chunks and the order of their appearance (a), and text blocks constructed from them (b).
2. TABLE STRUCTURE RECOGNITION
We present the process of table structure recognition as three consecutive steps:

1. preprocessing: generating and preparing text chunks and rulings from a source document;

2. text block recovering: combining text chunks into text blocks;

3. cell recovering: dividing the table space into rows, columns, and cells via text blocks.

2.1 Preprocessing
We operate with two kinds of objects: text chunks and rulings. A text chunk is defined as c = (b, f, o, w), where

• b = (xl, yt, xr, yb) — bounding box with four coordinates: xl = xl(c) — left, yt = yt(c) — top, xr = xr(c) — right, and yb = yb(c) — bottom, xl, yt, xr, yb ∈ R; the x-coordinate increases from left to right, and the y-coordinate increases from top to bottom;

• f = (ff, fs, fb, fi) — font with the attributes: ff = ff(c) — family (string value), fs = fs(c) — size in points, fb = fb(c): fb ∈ {true, false} — bold or not, fi = fi(c): fi ∈ {true, false} — italic or not;

• o = o(c): o ∈ N — index number in the order of the appearance of text chunks in the source PDF file;

• w = w(c): w ∈ R — space width.

Initially, each text chunk corresponds to one text printing instruction. The same text can be presented in PDF by different printing instructions, depending on the PDF generator used, as shown in Fig. 1, a–c. At first, we split all text chunks (Fig. 1, a) into one-character chunks (Fig. 1, b) and then merge them into word chunks, removing space characters and reindexing the order of their appearance (Fig. 1, c).

At this stage our system enables applying two ad-hoc heuristics for eliminating some kinds of "insular" text chunks from further processing:

• H1, eliminating itemization text chunks: if a text chunk contains only one character marking itemized lists (e.g. a bullet or a square), then it is excluded;

• H2, eliminating padding text chunks: if a text chunk consists only of a series of padding characters (e.g. a series of dots), then it is excluded.

Often, these two kinds of text chunks are visually detached from the rest of the text chunks by long spaces. This leads to improperly recovered columns. Thus, by eliminating them, we try to prevent some errors.

A ruling is defined as a bounding box with four coordinates: r = (xl, xr, yt, yb). Visual rulings can originally be presented by printing instructions for lines and rectangles. We merge all segments of one visual line (Fig. 1, d) into one ruling (Fig. 1, e). We also split each rectangle (Fig. 1, f) into four rulings corresponding to its boundaries (Fig. 1, g).
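To fix the notation, the two kinds of objects can be written down roughly as the following Python data structures. This is only a minimal sketch for illustration; the class and field names are ours, not those of the prototype, and the text field is our assumption (the paper's tuple keeps the chunk's characters implicitly).

    from dataclasses import dataclass

    @dataclass
    class BBox:            # b = (xl, yt, xr, yb)
        xl: float          # left;  x increases from left to right
        yt: float          # top;   y increases from top to bottom
        xr: float          # right
        yb: float          # bottom

    @dataclass
    class Font:            # f = (ff, fs, fb, fi)
        family: str        # ff, font family
        size: float        # fs, size in points
        bold: bool         # fb
        italic: bool       # fi

    @dataclass
    class TextChunk:       # c = (b, f, o, w)
        b: BBox            # bounding box
        f: Font            # font attributes
        o: int             # index in the order of appearance in the PDF file
        w: float           # space width
        text: str = ""     # the chunk's characters (assumed field)

    @dataclass
    class Ruling:          # r = (xl, xr, yt, yb)
        xl: float
        xr: float
        yt: float
        yb: float

    def is_padding_chunk(chunk: TextChunk) -> bool:
        # H2: a chunk made only of padding characters (e.g. a series of dots)
        return len(chunk.text) > 0 and set(chunk.text) <= {".", "\u2026"}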
2.2 Text Block Recovering
We define a text block as a set of chunks. In this step all text chunks are combined into blocks (Fig. 2). One chunk can be included in only one text block.

Text chunks are handled in pairs. We make a decision for each pair of chunks: to combine them or not. Two text chunks can be combined into one block when they satisfy the following conditions, in the case of horizontal combining:

• P1, word spacing: the horizontal distance between the chunks is less than a configurable threshold;

• P2, vertical projections: there is a configurable intersection of their vertical projections;

or, in the case of vertical combining:

• P3, line spacing: the vertical distance between the chunks is less than a configurable threshold;

• P4, horizontal projections: there is a configurable intersection of their horizontal projections.

Moreover, a configuration can specify that two combining chunks c1 and c2 have to satisfy some or all of the ad-hoc heuristics listed below:

• H3, adjacency in the order of the appearance: they are adjacent in the order of their appearance in the source PDF file, o(c1) = o(c2) + 1;
• H4, no rulings in text blocks: there are no rulings in the rectangle between the chunks defined as b(c1, c2) = (xl(c1, c2), yt(c1, c2), xr(c1, c2), yb(c1, c2)), where

  – xl(c1, c2) = min(xl(c1), xl(c2)),
  – yt(c1, c2) = min(yt(c1), yt(c2)),
  – xr(c1, c2) = max(xr(c1), xr(c2)),
  – yb(c1, c2) = max(yb(c1), yb(c2));

• H5, identical font family: ff(c1) = ff(c2);

• H6, identical font size: fs(c1) = fs(c2);

• H7, identical font bold attribute: fb(c1) = fb(c2);

• H8, identical font italic attribute: fi(c1) = fi(c2).

We suppose that each text block is the textual content of one cell, and each non-empty cell contains only one block. Thus, we try to recover non-empty cells without their arrangement in rows and columns.
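As an illustration, the pair-wise decision for horizontal combining could be sketched as follows. This is our hedged Python sketch on top of the data structures above, not the prototype's code; which of the heuristics H3–H8 are actually enabled is part of the configuration, and rulings_between is an assumed input precomputed by the caller.

    def projections_intersect(lo1, hi1, lo2, hi2):
        # True if the intervals [lo1, hi1] and [lo2, hi2] overlap
        return lo1 <= hi2 and lo2 <= hi1

    def can_combine_horizontally(c1, c2, sw):
        # c1 is assumed to be the left chunk; sw is the word-spacing threshold (P1)
        distance = c2.b.xl - c1.b.xr
        if distance >= sw:            # P1: the distance must be below sw
            return False
        # P2: their vertical projections intersect (the exact test is
        # configuration-dependent, see Section 3)
        return projections_intersect(c1.b.yt, c1.b.yb, c2.b.yt, c2.b.yb)

    def satisfies_adhoc_heuristics(c1, c2, rulings_between):
        # H3: the chunks are adjacent in the order of appearance in the PDF file
        if abs(c1.o - c2.o) != 1:
            return False
        # H4: no rulings fall into the rectangle b(c1, c2) spanned by the chunks;
        # rulings_between is assumed to be precomputed by the caller
        if rulings_between:
            return False
        # H5-H8: identical font family, size, bold and italic attributes
        return (c1.f.family == c2.f.family and c1.f.size == c2.f.size
                and c1.f.bold == c2.f.bold and c1.f.italic == c2.f.italic)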
2.3 Cell Recovering
In this step we construct the rows and columns that constitute an arrangement of cells. The system provides two algorithms for slicing a table space into rows and columns. A configuration can use one of them.

The first (A1) is based on whitespace analysis. We use the algorithm [14] to recover horizontal and vertical gaps between text blocks. Each whitespace gap corresponds to a ruling. Thus, we try to recover all rulings that separate cells in a table.
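The following is only a simplified stand-in for the whitespace analysis of [14], shown to make the idea of gap-to-ruling recovery concrete; vertical gaps only are computed (horizontal gaps are recovered symmetrically), and min_gap is an assumed tolerance.

    def vertical_whitespace_gaps(block_spans, table_xl, table_xr, min_gap=1.0):
        # block_spans: horizontal extents (xl, xr) of the recovered text blocks.
        # Returns the maximal x-intervals inside the table covered by no block;
        # each such gap plays the role of a vertical ruling between columns.
        gaps, cursor = [], table_xl
        for xl, xr in sorted(block_spans):
            if xl - cursor >= min_gap:
                gaps.append((cursor, xl))
            cursor = max(cursor, xr)
        if table_xr - cursor >= min_gap:
            gaps.append((cursor, table_xr))
        return gaps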
The second (A2) is the analysis of connected text blocks. To generate columns, we first exclude each multi-column text block, i.e. a block located in more than one column. We decide that a text block is multi-column when its horizontal projection intersects the projections of two or more text blocks located in the same line. Each column is considered as an intersection of the horizontal projections of one-column text blocks. Similarly, rows are constructed from the vertical projections of one-row text blocks.

In this step we also recover empty cells. Some of them can be erroneous, i.e. absent from the source table. The system provides an ad-hoc heuristic to dispose of erroneous empty cells:

• H9, cell singleton: if a column contains only one non-empty cell, then the column is merged with the nearest column to the left.
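A hedged sketch of this A2 column construction is given below, under the simplifying assumption that each block has already been reduced to its horizontal interval plus a line identifier; the grouping of overlapping projections is greedy here, and rows would be built analogously from vertical projections.

    def intervals_overlap(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    def is_multi_column(block, blocks):
        # A block spans several columns if its horizontal projection intersects
        # the projections of two or more blocks lying on the same line.
        per_line = {}
        for other in blocks:
            if other is block:
                continue
            if intervals_overlap(block["x"], other["x"]):
                per_line[other["line"]] = per_line.get(other["line"], 0) + 1
        return any(count >= 2 for count in per_line.values())

    def build_columns(blocks):
        # blocks: list of dicts {"x": (xl, xr), "line": line_id}
        singles = sorted((b for b in blocks if not is_multi_column(b, blocks)),
                         key=lambda b: b["x"][0])
        groups = []   # each entry: [union of projections, intersection of projections]
        for b in singles:
            if groups and intervals_overlap(groups[-1][0], b["x"]):
                (ulo, uhi), (ilo, ihi) = groups[-1]
                groups[-1] = [(min(ulo, b["x"][0]), max(uhi, b["x"][1])),
                              (max(ilo, b["x"][0]), min(ihi, b["x"][1]))]
            else:
                groups.append([b["x"], b["x"]])
        # a column interval is the intersection of the projections in its group
        return [intersection for _union, intersection in groups]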
3. TWO CONFIGURATIONS
In this paper, we consider two configurations for our system. The main difference between them consists in the estimation of word (P1) and line (P3) spacing, as well as in the algorithm used for cell construction.

The first configuration, C1, consists of the following settings:

• P1, word spacing: sw = w ∗ kw, where w is the space width of the left chunk, and kw: kw ∈ R, kw > 0 is a width factor;

• P2, vertical projections: yt(c1) ≤ yt(c2) ≤ yb(c1) or yt(c1) ≤ yb(c2) ≤ yb(c1);

• P3, line spacing: sl = h ∗ kh, where h is the height of the upper chunk, and kh: kh > 0 is a height factor;

• P4, horizontal projections: xl(c1) ≤ xl(c2) ≤ xr(c1) or xl(c1) ≤ xr(c2) ≤ xr(c1);

• Cell constructing: the A1-algorithm (whitespace analysis);

• Default values: kw = 1 and kh = 1.
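In code, the two C1 spacing estimates are straightforward. This is a sketch using the chunk structure from Section 2.1; the defaults correspond to kw = 1 and kh = 1.

    def c1_word_spacing(left_chunk, kw=1.0):
        # P1 (C1): sw = w * kw, where w is the space width of the left chunk
        return left_chunk.w * kw

    def c1_line_spacing(upper_chunk, kh=1.0):
        # P3 (C1): sl = h * kh, where h is the height of the upper chunk
        # (y grows downwards, so the height is yb - yt)
        return (upper_chunk.b.yb - upper_chunk.b.yt) * kh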
The second configuration, C2, consists of the following settings:

• P1, word spacing:

      sw = w,       if wmin < |d| ≤ wmax,
      sw = wmin,    if |d| ≤ wmin,
      sw = w ∗ kw,  otherwise,

  where w is the space width of the left chunk, kw: 0 < kw < 1 is a width factor, d is the horizontal distance between the chunks, wmin: wmin ∈ R, wmin > 0 is a threshold (the minimum width of a space), and wmax: wmax ∈ R, wmax > wmin is a threshold (the maximum width of a space);

• P2, vertical projections: yb(c1) = yb(c2);

• P3, line spacing: sl = t2, where t2 ∈ R, t2 > 0 is a threshold;

• P4, horizontal projections: xl(c1) ≤ xl(c2) ≤ xr(c1) or xl(c1) ≤ xr(c2) ≤ xr(c1);

• Cell constructing: the A2-algorithm (connected text block analysis);

• Ad-hoc heuristics: H1–H8;

• Default values: kw = 0.5, wmin = 4, and wmax = 56.
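Read as code, the piecewise word-spacing estimate of C2 looks as follows (a sketch; the defaults are the ones listed above):

    def c2_word_spacing(left_chunk, d, kw=0.5, wmin=4.0, wmax=56.0):
        # P1 (C2): the threshold sw depends on the observed horizontal distance d
        w = left_chunk.w          # space width of the left chunk
        if abs(d) <= wmin:
            return wmin           # very small distances: use wmin itself
        if abs(d) <= wmax:        # wmin < |d| <= wmax: use the space width
            return w
        return w * kw             # |d| > wmax: a reduced threshold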
4. EXPERIMENTAL EVALUATION
To evaluate both configurations we use the methodology for evaluating algorithms for table understanding in PDF documents proposed in the paper [3]. We also use the existing competition dataset³, "ICDAR 2013 Table Competition" [4]. It contains 156 tables in 67 PDF documents collected from EU and US government websites.

The evaluated prototype of our system uses the iText⁴ library for PDF interpretation to extract PDF objects from source documents and to generate the text chunks and rulings. In the evaluation, the parameters of both configurations were set to their default values, without searching for their optimal values. The experimental results are shown in Table 1. The highest F-score reaches more than 0.93.

Note that the evaluation was performed automatically using Nurminen's Python scripts⁵ for comparing ground-truth and result files, which implement this methodology with slight modifications. Therefore, our results shown in Table 1 should not be compared directly with others demonstrated on the "ICDAR 2013 Table Competition". Nevertheless, we can state that the experimental results show the high performance of our system on this recognized dataset of PDF tables.

Table 1: Experimental results

  Configuration   C1       C2
  recall          0.9121   0.9233
  precision       0.9180   0.9499
  F-score         0.9150   0.9364

³ [Link]
⁴ [Link]
Moreover, we can improve the F-score by setting optimal values for the configuration parameters. In both configurations, the numeric thresholds and factors can be set as the result of searching for the maximum of the F-score on a target dataset. For example, we have searched for the maximum of the F-score as a function of two variables (the width and height factors) in the C1-configuration on the competition dataset "ICDAR 2013 Table Competition" (Fig. 3). We have evaluated 2500 tests, where both kw and kh increased from 0 to 5 with a step of 0.1. The F-score reached its maximum (0.9189) when the width factor kw is 0.9 and the height factor kh is 1.0.

Figure 3: Searching for factor values to maximize F-score in the C1-configuration (a surface plot of the F-score over the width factor kw and the height factor kh).
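Such a search can be sketched as a plain grid search. The evaluate callback, which would run the C1-configured system with the given factors and score the result against the ground truth (e.g. with Nurminen's comparison scripts), is an assumption and is not shown.

    def grid_search_factors(evaluate, step=0.1, upper=5.0):
        # Exhaustively scores all (kw, kh) pairs on the target dataset and keeps
        # the pair with the highest F-score; with step=0.1 and upper=5.0 this
        # amounts to 50 x 50 = 2500 evaluations.
        best_f, best_kw, best_kh = -1.0, None, None
        steps = int(round(upper / step))
        for i in range(1, steps + 1):
            for j in range(1, steps + 1):
                kw, kh = i * step, j * step
                f = evaluate(kw=kw, kh=kh)
                if f > best_f:
                    best_f, best_kw, best_kh = f, kw, kh
        return best_f, best_kw, best_kh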
5. CONCLUSION AND FURTHER WORK
Unlike the existing solutions, our system provides advanced configuration options which allow it to be adapted to different sources. We have formulated a set of valuable ad-hoc heuristics that can be enhanced in the future. It is important to note that, for the first time, we have examined the possibility of applying the order of the appearance of text chunks in PDF files to table structure recognition.

The main applications of our system are in the field of data accessibility, information extraction, and unstructured data integration. In particular, we are developing an experimental web-application⁶ for PDF table extraction based on the prototype of our system. In its current state, this tool enables only manual table selection on a page of a PDF document and automatic table structure recognition. As the result of this process, an extracted table becomes accessible in an editable format, HTML or spreadsheet, that can be used as input to our rule-based spreadsheet data canonicalization system⁷ for further transforming data from arbitrary tables to relational ones [13].

Further work is in progress on expanding the set of ad-hoc heuristics. We believe that involving additional features such as text alignment, superscript, and subscript will allow us to improve our system. We also expect an advancement in the preprocessing step for excluding "messy" rulings, which originate from underlined or struck-through text. In the future, our system can also be extended to support automatic PDF table detection.

⁶ available at [Link]
⁷ available at [Link]
6. ACKNOWLEDGMENTS
We thank Tamir Hassan for the detailed discussion and explanation of the methodology for evaluating algorithms for table understanding in PDF documents [3] in the part concerning table structure recognition. We also thank Anssi Nurminen for providing his Python scripts, which have allowed us to automate the evaluation process. This work was financially supported by the Russian Foundation for Basic Research (grants 15-37-20042, 14-07-00166) and the Council for Grants of the President of the Russian Federation (grant NSh-8081.2016.9). Our web-application for PDF table extraction runs on the resources of the Shared Equipment Center of Integrated Information and Computing Network for Irkutsk Research and Educational Complex⁸.

⁸ [Link]

7. REFERENCES
[1] B. Coüasnon and A. Lemaitre. Handbook of Document Image Processing and Recognition, chapter Recognition of Tables and Forms, pages 647–677. Springer London, 2014.
[2] A. C. e Silva, A. M. Jorge, and L. Torgo. Design of an end-to-end method to extract information from tables. International Journal of Document Analysis and Recognition (IJDAR), 8(2):144–171, 2006.
[3] M. Göbel, T. Hassan, E. Oro, and G. Orsi. A methodology for evaluating algorithms for table understanding in PDF documents. In Proc. of the 2012 ACM Symposium on Document Engineering, pages 45–48, New York, NY, USA, 2012.
[4] M. Göbel, T. Hassan, E. Oro, and G. Orsi. ICDAR 2013 table competition. In Proc. of the 12th Int. Conf. on Document Analysis and Recognition, pages 1449–1453, 2013.
[5] T. Hassan and R. Baumgartner. Table recognition and understanding from PDF files. In Proc. of the 9th Int. Conf. on Document Analysis and Recognition - Volume 02, pages 1143–1147, Washington, DC, USA, 2007. IEEE Comp. Soc.
[6] J. Hu and Y. Liu. Analysis of Documents Born Digital, pages 775–804. Springer London, London, 2014.
[7] S. Khusro, A. Latif, and I. Ullah. On methods and tools of table detection, extraction and annotation in PDF documents. J. Inf. Sci., 41(1):41–57, Feb. 2015.
[8] Y. Liu, K. Bai, P. Mitra, and C. L. Giles. TableSeer: Automatic table metadata extraction and searching in digital libraries. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 91–100, 2007.
[9] A. Nurminen. Algorithmic extraction of data in tables in PDF documents. Master's thesis, Tampere University of Technology, Tampere, Finland, 2013.
[10] E. Oro and M. Ruffolo. PDF-TREX: An approach for recognizing and extracting tables from PDF documents. In Proc. of the 10th Int. Conf. on Document Analysis and Recognition, pages 906–910, 2009.
[11] J. Y. Ramel, M. Crucianu, N. Vincent, and C. Faure. Detection, extraction and representation of tables. In Proc. of the 7th Int. Conf. on Document Analysis and Recognition, pages 374–378, vol. 1, 2003.
[12] R. Rastan, H.-Y. Paik, and J. Shepherd. Texus: A task-based approach for table extraction and understanding. In Proc. of the 2015 ACM Symposium on Document Engineering, pages 25–34, 2015.
[13] A. Shigarov. Table understanding using a rule engine. Expert Systems with Applications, 42(2):929–937, 2015.
[14] A. Shigarov and R. Fedorov. Simple algorithm page layout analysis. Pattern Recognition and Image Analysis, 21(2):324–327, 2011.
[15] B. Yildiz, K. Kaiser, and S. Miksch. pdf2table: A method to extract table information from PDF files. In Proc. of the 2nd Indian Int. Conf. on Artificial Intelligence, Pune, India, pages 1773–1785, 2005.