Configurable Table Structure Recognition

ABSTRACT
Today, PDF is one of the most popular document formats on the web. Many PDF documents are not images, but remain untagged. They have no tags for identifying the logical reading order, paragraphs, figures, and tables. One of the challenges with these documents is how to extract tables from them. The paper discusses a new system for table structure recognition in untagged PDF documents. It is formulated as a set of configurable parameters and ad-hoc heuristics for recovering table cells. We consider two different configurations of the system and demonstrate experimental results for both of them on an existing competition dataset.

CCS Concepts
• Applied computing → Document analysis; Document management and text processing;

Keywords
table extraction, table structure recognition, untagged PDF documents, PDF document analysis, PDF accessibility

1. INTRODUCTION
Today, PDF is one of the most popular document formats on the web, as can be measured by Google's "filetype:pdf" search. Many PDF documents are not images, but remain untagged. They have no tags for identifying the logical reading order, paragraphs, figures, and tables. Nganji [4] estimates that 95.5% of scientific articles published by four leading publishers are untagged PDF documents. One of the important challenges with these documents is how to extract tables from them.

Table extraction typically consists of two main steps: table detection, i.e. recovering the bounding box of a table in a document, and table structure recognition, i.e. recovering its rows, columns, and cells. Many of the existing methods for extracting tables from unstructured documents traditionally deal only with images or plain text as the source. They are considered in several surveys, including [1, 2]. These methods can be applied to PDF documents by converting PDF to these formats. However, this process leads to the loss of valuable information. Extracting tables from PDF directly can provide better results. PDF is a richer representation of documents in comparison with images and plain text. PDF documents can contain machine-readable text (text chunks with their positions, font characteristics, and order of appearance in the file), as well as vector graphics, including table rulings. We expect that these features allow tables to be extracted more accurately.

Several methods and tools for PDF table extraction have been proposed in the last two decades. Some of them are discussed in the surveys [1, 6, 7]. Ramel et al. [11] consider two techniques for detecting and recognizing tables in documents in an exchange format like PDF. The first is based on the analysis of ruling lines. The second analyzes the arrangement of text components. Hassan et al. [5] expand these ideas for PDF table extraction. In the TableSeer project, Liu et al. [8] propose methods for detecting tables in PDF documents and extracting their metadata (headers). They use text arrangement, fonts, whitespace, and keywords (e.g. "Table", "Figure"). Oro et al. [10] present PDF-TREX, a heuristic method where PDF table extraction is realized by building tables from content elements in a bottom-up way.

Yildiz et al. [15] propose a heuristic method for PDF table extraction that uses the 'pdftohtml'¹ tool to generate its input. Rastan et al. [12] consider a framework for end-to-end table processing, including the task of table structure recognition. They also use the 'pdftohtml' tool to prepare their input. However, this tool occasionally makes mistakes in combining text chunks that are located too close to each other, so the input can be corrupted. Nurminen [9] in his thesis describes comprehensive PDF table detection and structure recognition algorithms that have demonstrated high recall and precision on the "ICDAR 2013 Table Competition" dataset [4]. Some of them are implemented in the Tabula² system.

¹ [Link]
² [Link]

In contrast to the existing methods, we suggest a configurable system that is formulated as a set of customizable parameters and ad-hoc heuristics for recovering table cells from text chunks and rulings. We exploit features of table presentation in untagged PDF documents. Most of them, such as horizontal and vertical distances, fonts, and rulings, are well known and used in the existing methods. Additionally, we propose to exploit the order of appearance of text printing instructions in PDF files.
Usually, when a table printed in a PDF document was originally an object (e.g. a table in a Word document), then 1) one printing instruction forms a part or the whole of the textual content of only one physical cell; 2) the printing instructions forming the text inside each physical cell appear in the PDF file in the order that coincides with the human reading order of this text. We notice that this is true for many PDF generators. This feature can be especially useful in the case of multi-row cells in table heads without rulings.

We also consider two configurations of the system and demonstrate experimental results for both of them on the existing competition dataset, "ICDAR 2013 Table Competition".

Figure 1: Preprocessing of text chunks (a–c) and rulings (d–g).

Figure 2: Text chunks and the order of their appearance (a), and text blocks constructed from them (b).
2. TABLE STRUCTURE RECOGNITION
We present the process of table structure recognition as three consecutive steps:

1. preprocessing: generating and preparing text chunks and rulings from a source document;

2. text block recovering: combining text chunks into text blocks;

3. cell recovering: dividing the table space into rows, columns, and cells via text blocks.

2.1 Preprocessing
We operate with two kinds of objects: text chunks and rulings. A text chunk is defined as c = (b, f, o, w), where

• b = (xl, yt, xr, yb) — bounding box with four coordinates: xl = xl(c) — left, yt = yt(c) — top, xr = xr(c) — right, and yb = yb(c) — bottom, xl, yt, xr, yb ∈ R; the x-coordinate increases from left to right, and the y-coordinate increases from top to bottom;

• f = (ff, fs, fb, fi) — font with the attributes: ff = ff(c) — family (string value), fs = fs(c) — size in points, fb = fb(c): fb ∈ {true, false} — bold or not, fi = fi(c): fi ∈ {true, false} — italic or not;

• o = o(c): o ∈ N — index number in the order of the appearance of text chunks in the source PDF file;

• w = w(c): w ∈ R — space width.

Initially, each text chunk corresponds to one text printing instruction. The same text can be presented in PDF by different printing instructions, depending on the PDF generator used, as shown in Fig. 1, a–c. At first, we split all text chunks (Fig. 1, a) into one-character chunks (Fig. 1, b) and then merge them into word chunks, removing space characters and reindexing the order of their appearance (Fig. 1, c).

At this stage our system enables applying two ad-hoc heuristics for eliminating some kinds of "insular" text chunks from further processing:

• H1, eliminating itemization text chunks: if a text chunk contains only one character marking itemized lists (e.g. a bullet or a square), then it is excluded;

• H2, eliminating padding text chunks: if a text chunk consists only of a series of padding characters (e.g. a series of dots), then it is excluded.

Often, these two kinds of text chunks are visually detached from the rest of the text chunks by long spaces. This leads to improperly recovered columns. Thus, by eliminating them, we try to prevent some errors.

A ruling is defined as a bounding box with four coordinates: r = (xl, xr, yt, yb). Visual rulings can originally be presented by printing instructions for lines and rectangles. We merge all segments of one visual line (Fig. 1, d) into one ruling (Fig. 1, e). We also split each rectangle (Fig. 1, f) into four rulings corresponding to its boundaries (Fig. 1, g).
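To fix the notation, the two kinds of objects can be written down roughly as the following Python data structures. This is only a minimal sketch for illustration; the class and field names are ours, not those of the prototype, and the text field is our assumption (the paper's tuple keeps the chunk's characters implicitly).

    from dataclasses import dataclass

    @dataclass
    class BBox:            # b = (xl, yt, xr, yb)
        xl: float          # left;  x increases from left to right
        yt: float          # top;   y increases from top to bottom
        xr: float          # right
        yb: float          # bottom

    @dataclass
    class Font:            # f = (ff, fs, fb, fi)
        family: str        # ff, font family
        size: float        # fs, size in points
        bold: bool         # fb
        italic: bool       # fi

    @dataclass
    class TextChunk:       # c = (b, f, o, w)
        b: BBox            # bounding box
        f: Font            # font attributes
        o: int             # index in the order of appearance in the PDF file
        w: float           # space width
        text: str = ""     # the chunk's characters (assumed field)

    @dataclass
    class Ruling:          # r = (xl, xr, yt, yb)
        xl: float
        xr: float
        yt: float
        yb: float

    def is_padding_chunk(chunk: TextChunk) -> bool:
        # H2: a chunk made only of padding characters (e.g. a series of dots)
        return len(chunk.text) > 0 and set(chunk.text) <= {".", "\u2026"}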
2.2 Text Block Recovering
We define a text block as a set of chunks. In this step all text chunks are combined into blocks (Fig. 2). One chunk can be included in only one text block.

Text chunks are handled in pairs. We make a decision for each pair of chunks: to combine them or not. Two text chunks can be combined into one block when they satisfy the following conditions, in the case of horizontal combining:

• P1, word spacing: the horizontal distance between the chunks is less than a configurable threshold;

• P2, vertical projections: there is a configurable intersection of their vertical projections;

or, in the case of vertical combining:

• P3, line spacing: the vertical distance between the chunks is less than a configurable threshold;

• P4, horizontal projections: there is a configurable intersection of their horizontal projections.

Moreover, a configuration can specify that two combining chunks c1 and c2 have to satisfy some or all of the ad-hoc heuristics listed below:

• H3, adjacency in the order of the appearance: they are adjacent in the order of their appearance in the source PDF file, o(c1) = o(c2) + 1;
• H4, no rulings in text blocks: there are no rulings in the rectangle between the chunks defined as b(c1, c2) = (xl(c1, c2), yt(c1, c2), xr(c1, c2), yb(c1, c2)), where

  – xl(c1, c2) = min(xl(c1), xl(c2)),
  – yt(c1, c2) = min(yt(c1), yt(c2)),
  – xr(c1, c2) = max(xr(c1), xr(c2)),
  – yb(c1, c2) = max(yb(c1), yb(c2));

• H5, identical font family: ff(c1) = ff(c2);

• H6, identical font size: fs(c1) = fs(c2);

• H7, identical font bold attribute: fb(c1) = fb(c2);

• H8, identical font italic attribute: fi(c1) = fi(c2).

We suppose that each text block is the textual content of one cell, and each non-empty cell contains only one block. Thus, we try to recover non-empty cells without their arrangement in rows and columns.
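As an illustration, the pair-wise decision for horizontal combining could be sketched as follows. This is our hedged Python sketch on top of the data structures above, not the prototype's code; which of the heuristics H3–H8 are actually enabled is part of the configuration, and rulings_between is an assumed input precomputed by the caller.

    def projections_intersect(lo1, hi1, lo2, hi2):
        # True if the intervals [lo1, hi1] and [lo2, hi2] overlap
        return lo1 <= hi2 and lo2 <= hi1

    def can_combine_horizontally(c1, c2, sw):
        # c1 is assumed to be the left chunk; sw is the word-spacing threshold (P1)
        distance = c2.b.xl - c1.b.xr
        if distance >= sw:            # P1: the distance must be below sw
            return False
        # P2: their vertical projections intersect (the exact test is
        # configuration-dependent, see Section 3)
        return projections_intersect(c1.b.yt, c1.b.yb, c2.b.yt, c2.b.yb)

    def satisfies_adhoc_heuristics(c1, c2, rulings_between):
        # H3: the chunks are adjacent in the order of appearance in the PDF file
        if abs(c1.o - c2.o) != 1:
            return False
        # H4: no rulings fall into the rectangle b(c1, c2) spanned by the chunks;
        # rulings_between is assumed to be precomputed by the caller
        if rulings_between:
            return False
        # H5-H8: identical font family, size, bold and italic attributes
        return (c1.f.family == c2.f.family and c1.f.size == c2.f.size
                and c1.f.bold == c2.f.bold and c1.f.italic == c2.f.italic)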
2.3 Cell Recovering
In this step we construct the rows and columns that constitute an arrangement of cells. The system provides two algorithms for slicing a table space into rows and columns. A configuration can use one of them.

The first (A1) is based on whitespace analysis. We use the algorithm [14] to recover horizontal and vertical gaps between text blocks. Each whitespace gap corresponds to a ruling. Thus, we try to recover all rulings that separate cells in a table.
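The following is only a simplified stand-in for the whitespace analysis of [14], shown to make the idea of gap-to-ruling recovery concrete; vertical gaps only are computed (horizontal gaps are recovered symmetrically), and min_gap is an assumed tolerance.

    def vertical_whitespace_gaps(block_spans, table_xl, table_xr, min_gap=1.0):
        # block_spans: horizontal extents (xl, xr) of the recovered text blocks.
        # Returns the maximal x-intervals inside the table covered by no block;
        # each such gap plays the role of a vertical ruling between columns.
        gaps, cursor = [], table_xl
        for xl, xr in sorted(block_spans):
            if xl - cursor >= min_gap:
                gaps.append((cursor, xl))
            cursor = max(cursor, xr)
        if table_xr - cursor >= min_gap:
            gaps.append((cursor, table_xr))
        return gaps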
The second (A2) is the analysis of connected text blocks. To generate columns, we first exclude each multi-column text block, i.e. a block located in more than one column. We decide that a text block is multi-column when its horizontal projection intersects the projections of two or more text blocks located in the same line. Each column is considered as an intersection of the horizontal projections of one-column text blocks. Similarly, rows are constructed from the vertical projections of one-row text blocks.

In this step we also recover empty cells. Some of them can be erroneous, i.e. absent from the source table. The system provides an ad-hoc heuristic to dispose of erroneous empty cells:

• H9, cell singleton: if a column contains only one non-empty cell, then the column is merged with the nearest column to the left.
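A hedged sketch of this A2 column construction is given below, under the simplifying assumption that each block has already been reduced to its horizontal interval plus a line identifier; the grouping of overlapping projections is greedy here, and rows would be built analogously from vertical projections.

    def intervals_overlap(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    def is_multi_column(block, blocks):
        # A block spans several columns if its horizontal projection intersects
        # the projections of two or more blocks lying on the same line.
        per_line = {}
        for other in blocks:
            if other is block:
                continue
            if intervals_overlap(block["x"], other["x"]):
                per_line[other["line"]] = per_line.get(other["line"], 0) + 1
        return any(count >= 2 for count in per_line.values())

    def build_columns(blocks):
        # blocks: list of dicts {"x": (xl, xr), "line": line_id}
        singles = sorted((b for b in blocks if not is_multi_column(b, blocks)),
                         key=lambda b: b["x"][0])
        groups = []   # each entry: [union of projections, intersection of projections]
        for b in singles:
            if groups and intervals_overlap(groups[-1][0], b["x"]):
                (ulo, uhi), (ilo, ihi) = groups[-1]
                groups[-1] = [(min(ulo, b["x"][0]), max(uhi, b["x"][1])),
                              (max(ilo, b["x"][0]), min(ihi, b["x"][1]))]
            else:
                groups.append([b["x"], b["x"]])
        # a column interval is the intersection of the projections in its group
        return [intersection for _union, intersection in groups]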
3. TWO CONFIGURATIONS
In this paper, we consider two configurations for our system. The main difference between them consists in the estimation of word (P1) and line (P3) spacing, as well as in the algorithm used for cell construction.

The first configuration, C1, consists of the following settings:

• P1, word spacing: sw = w ∗ kw, where w is the space width of the left chunk, and kw: kw ∈ R, kw > 0 is a width factor;

• P2, vertical projections: yt(c1) ≤ yt(c2) ≤ yb(c1) or yt(c1) ≤ yb(c2) ≤ yb(c1);

• P3, line spacing: sl = h ∗ kh, where h is the height of the upper chunk, and kh: kh > 0 is a height factor;

• P4, horizontal projections: xl(c1) ≤ xl(c2) ≤ xr(c1) or xl(c1) ≤ xr(c2) ≤ xr(c1);

• Cell constructing: the A1-algorithm (whitespace analysis);

• Default values: kw = 1 and kh = 1.
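In code, the two C1 spacing estimates are straightforward. This is a sketch using the chunk structure from Section 2.1; the defaults correspond to kw = 1 and kh = 1.

    def c1_word_spacing(left_chunk, kw=1.0):
        # P1 (C1): sw = w * kw, where w is the space width of the left chunk
        return left_chunk.w * kw

    def c1_line_spacing(upper_chunk, kh=1.0):
        # P3 (C1): sl = h * kh, where h is the height of the upper chunk
        # (y grows downwards, so the height is yb - yt)
        return (upper_chunk.b.yb - upper_chunk.b.yt) * kh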
The second configuration, C2, consists of the following settings:

• P1, word spacing:

      sw = w,       if wmin < |d| ≤ wmax,
      sw = wmin,    if |d| ≤ wmin,
      sw = w ∗ kw,  otherwise,

  where w is the space width of the left chunk, kw: 0 < kw < 1 is a width factor, d is the horizontal distance between the chunks, wmin: wmin ∈ R, wmin > 0 is a threshold (the minimum width of a space), and wmax: wmax ∈ R, wmax > wmin is a threshold (the maximum width of a space);

• P2, vertical projections: yb(c1) = yb(c2);

• P3, line spacing: sl = t2, where t2 ∈ R, t2 > 0 is a threshold;

• P4, horizontal projections: xl(c1) ≤ xl(c2) ≤ xr(c1) or xl(c1) ≤ xr(c2) ≤ xr(c1);

• Cell constructing: the A2-algorithm (connected text block analysis);

• Ad-hoc heuristics: H1–H8;

• Default values: kw = 0.5, wmin = 4, and wmax = 56.
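Read as code, the piecewise word-spacing estimate of C2 looks as follows (a sketch; the defaults are the ones listed above):

    def c2_word_spacing(left_chunk, d, kw=0.5, wmin=4.0, wmax=56.0):
        # P1 (C2): the threshold sw depends on the observed horizontal distance d
        w = left_chunk.w          # space width of the left chunk
        if abs(d) <= wmin:
            return wmin           # very small distances: use wmin itself
        if abs(d) <= wmax:        # wmin < |d| <= wmax: use the space width
            return w
        return w * kw             # |d| > wmax: a reduced threshold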
4. EXPERIMENTAL EVALUATION
To evaluate both configurations we use the methodology for evaluating algorithms for table understanding in PDF documents proposed in the paper [3]. We also use the existing competition dataset³, "ICDAR 2013 Table Competition" [4]. It contains 156 tables in 67 PDF documents collected from EU and US government websites.

The evaluated prototype of our system uses the iText⁴ library for PDF interpretation to extract PDF objects from source documents and to generate the text chunks and rulings. In the evaluation, the parameters of both configurations were set to their default values, without searching for their optimal values. The experimental results are shown in Table 1. The highest F-score reaches more than 0.93.

Note that the evaluation was performed automatically using Nurminen's Python scripts⁵ for comparing ground-truth and result files, which implement this methodology with slight modifications. Therefore, our results shown in Table 1 should not be compared directly with others demonstrated on the "ICDAR 2013 Table Competition". Nevertheless, we can state that the experimental results show the high performance of our system on this recognized dataset of PDF tables.

Table 1: Experimental results

  Configuration   C1       C2
  recall          0.9121   0.9233
  precision       0.9180   0.9499
  F-score         0.9150   0.9364

³ [Link]
⁴ [Link]
Moreover, we can improve the F-score by setting optimal values for the configuration parameters. In both configurations, the numeric thresholds and factors can be set as the result of searching for the maximum of the F-score on a target dataset. For example, we have searched for the maximum of the F-score as a function of two variables (the width and height factors) in the C1-configuration on the competition dataset "ICDAR 2013 Table Competition" (Fig. 3). We have evaluated 2500 tests, where both kw and kh increased from 0 to 5 with a step of 0.1. The F-score reached its maximum (0.9189) when the width factor kw is 0.9 and the height factor kh is 1.0.

Figure 3: Searching for factor values to maximize F-score in the C1-configuration (a surface plot of the F-score over the width factor kw and the height factor kh).
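Such a search can be sketched as a plain grid search. The evaluate callback, which would run the C1-configured system with the given factors and score the result against the ground truth (e.g. with Nurminen's comparison scripts), is an assumption and is not shown.

    def grid_search_factors(evaluate, step=0.1, upper=5.0):
        # Exhaustively scores all (kw, kh) pairs on the target dataset and keeps
        # the pair with the highest F-score; with step=0.1 and upper=5.0 this
        # amounts to 50 x 50 = 2500 evaluations.
        best_f, best_kw, best_kh = -1.0, None, None
        steps = int(round(upper / step))
        for i in range(1, steps + 1):
            for j in range(1, steps + 1):
                kw, kh = i * step, j * step
                f = evaluate(kw=kw, kh=kh)
                if f > best_f:
                    best_f, best_kw, best_kh = f, kw, kh
        return best_f, best_kw, best_kh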
5. CONCLUSION AND FURTHER WORK
Unlike the existing solutions, our system provides advanced configuration options which allow it to be adapted to different sources. We have formulated a set of valuable ad-hoc heuristics that can be enhanced in the future. It is important to note that, for the first time, we have examined the possibility of applying the order of the appearance of text chunks in PDF files to table structure recognition.

The main applications of our system are in the field of data accessibility, information extraction, and unstructured data integration. In particular, we are developing an experimental web-application⁶ for PDF table extraction based on the prototype of our system. In its current state, this tool enables only manual table selection on a page of a PDF document and automatic table structure recognition. As the result of this process, an extracted table becomes accessible in an editable format, HTML or spreadsheet, that can be used as input to our rule-based spreadsheet data canonicalization system⁷ for further transforming data from arbitrary tables to relational ones [13].

Further work is in progress on expanding the set of ad-hoc heuristics. We believe that involving additional features such as text alignment, superscript, and subscript will allow us to improve our system. We also expect an advancement in the preprocessing step for excluding "messy" rulings, which originate from underlined or struck-through text. In the future, our system can also be extended to support automatic PDF table detection.

⁶ available at [Link]
⁷ available at [Link]
6. ACKNOWLEDGMENTS
We thank Tamir Hassan for the detailed discussion and explanation of the methodology for evaluating algorithms for table understanding in PDF documents [3] in the part concerning table structure recognition. We also thank Anssi Nurminen for providing his Python scripts, which have allowed us to automate the evaluation process. This work was financially supported by the Russian Foundation for Basic Research (grants 15-37-20042, 14-07-00166) and the Council for Grants of the President of the Russian Federation (grant NSh-8081.2016.9). Our web-application for PDF table extraction runs on the resources of the Shared Equipment Center of Integrated Information and Computing Network for Irkutsk Research and Educational Complex⁸.

⁸ [Link]

7. REFERENCES
[1] B. Coüasnon and A. Lemaitre. Handbook of Document Image Processing and Recognition, chapter Recognition of Tables and Forms, pages 647–677. Springer London, 2014.
[2] A. C. e Silva, A. M. Jorge, and L. Torgo. Design of an end-to-end method to extract information from tables. International Journal of Document Analysis and Recognition (IJDAR), 8(2):144–171, 2006.
[3] M. Göbel, T. Hassan, E. Oro, and G. Orsi. A methodology for evaluating algorithms for table understanding in PDF documents. In Proc. of the 2012 ACM Symposium on Document Engineering, pages 45–48, New York, NY, USA, 2012.
[4] M. Göbel, T. Hassan, E. Oro, and G. Orsi. ICDAR 2013 table competition. In Proc. of the 12th Int. Conf. on Document Analysis and Recognition, pages 1449–1453, 2013.
[5] T. Hassan and R. Baumgartner. Table recognition and understanding from PDF files. In Proc. of the 9th Int. Conf. on Document Analysis and Recognition - Volume 02, pages 1143–1147, Washington, DC, USA, 2007. IEEE Comp. Soc.
[6] J. Hu and Y. Liu. Analysis of Documents Born Digital, pages 775–804. Springer London, London, 2014.
[7] S. Khusro, A. Latif, and I. Ullah. On methods and tools of table detection, extraction and annotation in PDF documents. J. Inf. Sci., 41(1):41–57, Feb. 2015.
[8] Y. Liu, K. Bai, P. Mitra, and C. L. Giles. TableSeer: Automatic table metadata extraction and searching in digital libraries. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 91–100, 2007.
[9] A. Nurminen. Algorithmic extraction of data in tables in PDF documents. Master's thesis, Tampere University of Technology, Tampere, Finland, 2013.
[10] E. Oro and M. Ruffolo. PDF-TREX: An approach for recognizing and extracting tables from PDF documents. In Proc. of the 10th Int. Conf. on Document Analysis and Recognition, pages 906–910, 2009.
[11] J. Y. Ramel, M. Crucianu, N. Vincent, and C. Faure. Detection, extraction and representation of tables. In Proc. of the 7th Int. Conf. on Document Analysis and Recognition, pages 374–378, vol. 1, 2003.
[12] R. Rastan, H.-Y. Paik, and J. Shepherd. Texus: A task-based approach for table extraction and understanding. In Proc. of the 2015 ACM Symposium on Document Engineering, pages 25–34, 2015.
[13] A. Shigarov. Table understanding using a rule engine. Expert Systems with Applications, 42(2):929–937, 2015.
[14] A. Shigarov and R. Fedorov. Simple algorithm page layout analysis. Pattern Recognition and Image Analysis, 21(2):324–327, 2011.
[15] B. Yildiz, K. Kaiser, and S. Miksch. pdf2table: A method to extract table information from PDF files. In Proc. of the 2nd Indian Int. Conf. on Artificial Intelligence, Pune, India, pages 1773–1785, 2005.