DePlot: One-shot visual language reasoning by plot-to-table translation

Fangyu Liu♠♣∗§   Julian Martin Eisenschlos♣∗
Francesco Piccinno♣   Syrine Krichene♣   Chenxi Pang♣   Kenton Lee♣
Mandar Joshi♣   Wenhu Chen♣   Nigel Collier♠   Yasemin Altun♣

♣Google DeepMind   ♠University of Cambridge
∗Equal contributions.   §Work done during Google internship.

arXiv:2212.10505v2, 23 May 2023
Abstract

Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first few-shot (one-shot) solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on thousands of data points, DePlot+LLM with just one-shot prompting achieves a 29.4% improvement over finetuned SOTA on human-written queries from the task of chart QA.¹ ²

¹ Code and models: [Link]/google-research/google-research/tree/master/deplot
² For questions please contact fl399@[Link] and eisenjulian@[Link].

1 Introduction

Multimodal reasoning on visual language such as plots and charts is an extremely complex task. For downstream tasks such as question answering (QA) on plots/charts, a model needs to first extract relevant information from the image, organize it in a sensible manner, and then perform reasoning over the extracted entries. Previous studies have proposed end-to-end solutions to such tasks (Lee et al., 2023; Liu et al., 2023a). While effective, end-to-end methods need to be finetuned on large amounts of task data, and they still lag behind on queries that require complex reasoning even after finetuning. As an example, the current SOTA model MatCha (Liu et al., 2023a) achieves only 38.2% accuracy on the human-written queries of ChartQA (Masry et al., 2022).

In the meantime, large language models (LLMs) such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022) have demonstrated exceptional few-shot reasoning skills without requiring expensive human annotations. However, it is an open question how multimodal reasoning tasks could benefit from LLMs. In this work, we propose to decompose the multimodal visual language reasoning problem into: (1) converting the input plot image to a linearized table and (2) passing the linearized table to LLMs for one-shot reasoning.

The key of the method is a modality conversion module called DePlot that maps charts and plots to the underlying data table. While there has been prior work in chart information extraction, it usually relies on hybrid systems combining complex hand-designed rules, OCR, keypoint detection, and object segmentation modules (Siegel et al., 2016; Luo et al., 2021; Masry et al., 2022). For different types of charts, distinct approaches have been used (Rane et al., 2021; Kato et al., 2022). Besides, there does not exist a unified, consistent, and accurate framework for evaluating chart information extraction – metrics specific to certain types of charts (Siegel et al., 2016) or overly simplified number-matching metrics (Luo et al., 2021) have been used. Our proposed DePlot is an end-to-end image-to-text Transformer model trained with the task of plot-to-table translation. A combination of synthetic and web-crawled charts and plots and their underlying data tables is collected as the training corpus. We demonstrate that DePlot significantly outperforms hybrid systems and can uniformly handle all types of charts.
Table generated by DePlot from the chart image:

Header: models | augmented-set | human-set
Row 1: VisionTapas | 67.2 | 22.2
Row 2: Pix2Struct | 82.9 | 30.4
Row 3: MatCha | 89.0 | 38.0
Row 4: DePlot | 91.0 | 67.6

FlanPaLM (Chain-of-Thoughts): Let's find the average of DePlot across the two sets. We add the two numbers in the DePlot row and divide by 2: 91.0+67.6=158.6/2=79.3. Let's find the second strongest model across the two sets. We find the row with the second highest average: Row 3. We find the average of MatCha across the two sets. We add the two numbers in the MatCha row and divide by 2: 89.0+38.0=127.0/2=63.5. We subtract the second strongest average from the strongest average: 79.3-63.5=15.8. The answer is 15.8.

Codex (Program-of-Thoughts):
# DePlot is on row 4, the second strongest is on row 3.
deplot_augmented_set = 91.0
deplot_human_set = 67.6
matcha_augmented_set = 89.0
matcha_human_set = 38.0
ans = (deplot_augmented_set - matcha_augmented_set + deplot_human_set - matcha_human_set) / 2

Final answer: 15.8

Figure 1: An illustration of the DePlot+LLM method. This is a real example using FlanPaLM (Chung et al., 2022) with Chain-of-Thoughts prompting (Wei et al., 2022) and Codex (Chen et al., 2021) with Program-of-Thoughts prompting (Chen et al., 2022). The light blue boxes are input (or intermediate forms of the input) to the LLM and the light red box contains the answer generated by the LLMs. Key reasoning steps are highlighted.

To accurately capture plot-to-table systems' effectiveness (and avoid error propagation to downstream tasks), we propose a novel table matching metric that considers both textual and numeric entries with relative error tolerance, and is invariant to transpositions and row and column permutations.

After accurately translating plot images to texts (as linearized tables), we can pass the output from DePlot in conjunction with a query to LLMs to compute the answer. We take advantage of novel prompting techniques such as Chain of Thoughts (CoT) (Wei et al., 2022), Self-Consistency (SC) (Wang et al., 2023), and Program of Thoughts (PoT) (Chen et al., 2022) to elicit more accurate answers. An illustration of the whole process can be seen in Figure 1.

To summarize, this work has the following contributions: (1) We standardize the plot-to-table task and propose a unified and informative metric for table comparison. (2) We propose a highly effective modality conversion model, DePlot, to translate a multimodal task into a language-only task and then leverage LLMs to solve it with just one shot. (3) DePlot+LLM achieves SOTA on ChartQA with just one-shot supervision, outperforming the second best method (which is fully supervised) by 29.4% on human-written queries.

2 Background

Plug-and-play of multimodal pretrained models. Numerous large pretrained models, either for cross-modal tasks such as CLIP (Radford et al., 2021) or for single-modal tasks, such as GPT-3 and PaLM, have been introduced in the past few years. These pretrained models' strong zero/few-shot inference capabilities have enabled creative solutions to more complex multimodal tasks. Socratic Models (Zeng et al., 2023) combine multimodal pretrained models using multimodal prompts for tasks such as multimodal assistive dialogue and robot perception & planning. Mind's Eye (Liu et al., 2023b) converts physical reasoning queries into code that can be executed in physics engines. MAGIC (Su et al., 2022) inserts visual control using CLIP into text generation models for unsupervised image captioning. Similar to our work, Yang et al. (2022) also translate natural images into texts and leverage GPT-3 for knowledge-based VQA.

However, all of the above approaches focus on natural images, and the tasks of interest usually only require
capturing very basic visual information such as types of objects. Visual language reasoning poses a different set of challenges from natural image reasoning – it requires, first, accurate and detailed information extraction (IE) from complex visual language data (plots and charts in this work); and secondly, very strong numerical reasoning skills to answer queries based on the extracted information. While end-to-end fully supervised models struggle to answer complex human-written queries, DePlot combined with LLMs can outperform the supervised SOTA by 29.4%. This is achieved by decomposing the two key challenges in visual language reasoning and leveraging two strong pretrained models that excel at their respective tasks.

Zero & few-shot reasoning over tables. Traditionally, table reasoning tasks are dominated by end-to-end neural models with table-specific architectural designs (Herzig et al., 2020; Yin et al., 2020; Andrejczuk et al., 2022). Recently, there has been a surge in using LLMs to process tables for downstream tasks such as QA. Chen (2023) shows that with just a one-shot in-context demonstration, GPT-3 can reach near-SOTA performance on table QA datasets, on par with end-to-end models trained with at least thousands of training examples. Beyond pure LLM approaches, Binder (Cheng et al., 2023), Program of Thoughts (Chen et al., 2022), and Program-Aided Language models (Gao et al., 2022) all combine LLMs with compilers/program executors for table reasoning tasks and have achieved SOTA performance. DePlot can be combined with pure LLMs and also with any of the aforementioned neural-symbolic methods in a plug-and-play style.

Information extraction from plots and charts. Prior work on plot/chart IE is usually pipeline-based, combining OCR, object detection/segmentation systems, and hand-crafted rules. Such specialized systems are frequently designed for specific types of graphs, e.g., Kato et al. (2022) for line graphs and Rane et al. (2021) for bar plots. ChartBERT (Akhtar et al., 2023) adopts an OCR-based method for text extraction from charts and uses two more stages of neural methods for processing the extracted texts. ChartOCR (Luo et al., 2021) is a hybrid system that accepts all types of chart inputs and has been adopted by downstream task models for chart QA (Masry et al., 2022) and summarization (Kantharaj et al., 2022). DePlot, as an end-to-end neural model, outperforms ChartOCR by very large margins on plot-to-table conversion.

Beyond methodology, the evaluation of plot data extraction has traditionally not been unified. Siegel et al. (2016); Luo et al. (2021); Kato et al. (2022) design different metrics for different types of charts, and the metrics can be defined upon coordinates, bounding boxes, or keypoints of the graphs' objects. However, this measures only the intermediate steps of the data extraction process rather than the quality of the data extraction itself. We formulate chart data extraction as a plot-to-table translation task since the ultimate goal of chart IE is obtaining the underlying data table. Besides our work, Masry et al. (2022) also considers chart IE as plot-to-table conversion. However, the metric used in Masry et al. (2022) is a number set matching metric, ignoring table structure (i.e., the correct organization of the extracted numbers). We propose a better table comparison metric and discuss it further in §3.

3 Standardizing the Plot-to-table Task

To perform visual language reasoning, we propose to decompose a visual language reasoning task on plots into two steps: (1) converting plots to texts (in the form of linearized tables) using DePlot and (2) inputting the linearized table to LLMs for reasoning. Accurately performing plot-to-table translation is essential for the downstream visual language reasoning tasks. Plot-to-table is also an important standalone task, as it addresses IE from plots/charts, which can benefit applications such as automatic report generation and document digitization. We standardize the plot-to-table conversion task in §3.1 and propose a new metric for evaluating plot-to-table conversion quality. Then in §3.2, we introduce the DePlot model and training procedure for performing plot-to-table conversion.

3.1 Task Definition

Prior research on table similarity metrics is limited. Masry et al. (2022) introduced a metric based on the graph IE metric proposed in Luo et al. (2021), which we denote Relative Number Set Similarity or RNSS. The metric looks only at the unordered set of numeric entries predicted and measures how the predicted set matches the target set of numbers. In the following, we first introduce RNSS more formally and then discuss our rationale for proposing a more well-rounded metric, Relative Mapping Similarity or RMS.
Relative Number Set Similarity (RNSS). Let the numbers predicted by the model be P = {p_i}_{1≤i≤N} and the numbers in the target table be T = {t_j}_{1≤j≤M}. We compute the pairwise relative distances between them:

D(p, t) = \min\left(1, \frac{\lVert p - t \rVert}{\lVert t \rVert}\right).

Then the N × M matrix of distances can be used to find a minimal-cost matching between the elements in P and T, expressed in the form of a binary matrix X ∈ R^{N×M}. The final score is computed as

\mathrm{RNSS} = 1 - \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} X_{ij}\, D(p_i, t_j)}{\max(N, M)}. \tag{1}

However, RNSS has several key limitations: it does not distinguish the position of numbers within the table; it completely ignores all non-numeric content; it gives credit to very high relative errors; and it does not distinguish precision versus recall losses in table reconstruction.

In contrast, we argue that a metric to measure similarity between tables should satisfy the following desiderata:
1. Be invariant to transpositions, as well as permutations of columns and rows.
2. Allow but penalize small errors in numeric or textual values up to a certain threshold.
3. Clearly reflect losses in precision or recall.

Relative Mapping Similarity (RMS). In order to address all of these requirements, we propose RMS, which views tables not as sets of numbers but as unordered collections of mappings from row and column headers (r, c) to a single value v, which we write p_i = (p_i^r, p_i^c, p_i^v) and t_j = (t_j^r, t_j^c, t_j^v) for each entry in the predicted table P = {p_i}_{1≤i≤N} and the target table T = {t_j}_{1≤j≤M} respectively.

Following Biten et al. (2019), the distance between textual entries can be measured with the Normalized Levenshtein Distance, NL_τ, where values above the threshold τ are set to the maximum of 1 in order to prevent partial credit for very dissimilar texts. Therefore the distance between two keys p_i and t_j is NL_τ(p^r‖p^c, t^r‖t^c), where ‖ denotes string concatenation. The distance between numeric entries is computed using the relative distance D_θ(p, t) = min(1, ‖p − t‖/‖t‖), where distances above θ are set to the maximum of 1. Combining these two distances, we can compute the similarity between two entries in a mapping, D_{τ,θ}(p, t), as (1 − NL_τ(p^r‖p^c, t^r‖t^c))(1 − D_θ(p^v, t^v)). When both the keys and values are similar, the similarity D_{τ,θ} is close to 1 (and close to 0 when they are dissimilar).

To compute RMS, we first compute the pairwise similarity between keys in P and T using the cost function 1 − NL_τ(p^r‖p^c, t^r‖t^c). We obtain a similarity matrix of shape N × M, and with this matrix we can identify the minimal-cost matching X ∈ R^{N×M} between the keys (in the form of a binary matrix). Then we can compute the precision and recall between the two full mappings as the total similarities of the correspondingly matched entries:

\mathrm{RMS}_{\mathrm{precision}} = 1 - \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} X_{ij}\, D_{\tau,\theta}(p_i, t_j)}{N}, \tag{2}

\mathrm{RMS}_{\mathrm{recall}} = 1 - \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} X_{ij}\, D_{\tau,\theta}(p_i, t_j)}{M}. \tag{3}

The RMS_F1 score is the harmonic mean of the precision and recall. Because permutations of columns and rows yield the same set of (column header, row header, value) entries, the resulting metric is invariant to them. In order to allow for table transpositions, we simply consider both the table and its transposed version and return the one that corresponds to the higher RMS_F1 score.
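For concreteness, below is a minimal sketch of an RMS_F1 computation. It assumes purely numeric cell values, illustrative thresholds τ and θ, scipy's linear_sum_assignment for the minimal-cost key matching, and precision/recall taken as the total matched similarity divided by N and M, following the textual description above; the function names and defaults are ours and not the released implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def levenshtein(a, b):
    # Standard dynamic-programming edit distance between two strings.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[n]

def nl_tau(a, b, tau=0.5):
    # Normalized Levenshtein distance, set to 1 above the threshold tau.
    d = levenshtein(a, b) / max(len(a), len(b), 1)
    return 1.0 if d > tau else d

def rel_dist(p, t, theta=0.1):
    # Relative numeric distance, set to 1 above the threshold theta.
    d = min(1.0, abs(p - t) / max(abs(t), 1e-9))
    return 1.0 if d > theta else d

def rms_f1(pred, target, tau=0.5, theta=0.1):
    # pred and target are lists of (row_header, col_header, numeric_value) triples.
    key_cost = np.array([[nl_tau(pr + pc, tr + tc, tau) for (tr, tc, _) in target]
                         for (pr, pc, _) in pred])
    rows, cols = linear_sum_assignment(key_cost)  # minimal-cost matching of keys
    sims = [(1 - key_cost[i, j]) * (1 - rel_dist(pred[i][2], target[j][2], theta))
            for i, j in zip(rows, cols)]
    precision, recall = sum(sims) / len(pred), sum(sims) / len(target)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# rms_f1([("2020", "sales", 41.0)], [("2020", "sales", 40.0)]) is close to 1,
# and reordering the rows or columns of a table leaves the score unchanged.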
3.2 Training Plot-to-table Conversion Models

Unlike prior works that combine rule-based heuristics, OCR systems, and object/keypoint segmentation/detection systems (Siegel et al., 2016; Luo et al., 2021; Kato et al., 2022), we propose DePlot as an end-to-end solution to plot information extraction. DePlot is conceptually simple yet works robustly for all types of charts (line, dot, bar, and pie charts) without requiring type-specific engineering or hybrid components. Specifically, we initialize an image-to-text encoder-decoder Transformer model with the architecture and weights of the SOTA visual language model MatCha (Liu et al., 2023a). We continue finetuning the MatCha checkpoint on the task of mapping plots to their underlying data tables. The table is linearized as a textual sequence (markdown format) with | separating cells and \n separating rows. DePlot is trained to generate the table from left to right autoregressively.
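As a small illustration of this target format, the snippet below linearizes a toy table with the | and \n conventions just described; the function name and example values are ours and only meant to show the format.

def linearize_table(header, rows):
    # Markdown-style linearization used as the decoding target:
    # " | " separates cells and "\n" separates rows.
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

print(linearize_table(["Year", "Democrats", "Republicans"],
                      [[2004, "68.1%", "45.0%"], [2006, "58.0%", "42.0%"]]))
# Year | Democrats | Republicans
# 2004 | 68.1% | 45.0%
# 2006 | 58.0% | 42.0%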
The training corpus is a set of parallel plot-table pairs collected similarly to Liu et al. (2023a) – both synthetic data and real-world plot-table pairs are combined to form a finetuning corpus. Specifically, three sources of plot-table pairs are used: (1) synthetic data generated by Liu et al. (2023a); (2) synthetic data generated by Methani et al. (2020) (also used in the PlotQA dataset); and (3) real-world data crawled by Masry et al. (2022) (also used in ChartQA). (3) is sourced from four websites: [Link], pewresearch.com, [Link], and [Link]. The three corpora are mixed in equal proportions (see Table 2). The size of each can be seen in Liu et al. (2023a). To avoid data leakage in downstream evaluation, only training-set charts from the above datasets are used. We call our finetuned checkpoint DePlot.³

³ Note that the original MatCha model is also pretrained with the task of plot derendering (which includes plot-to-table), however for a different purpose – i.e., transferring knowledge to downstream finetuning tasks. Our continued finetuning focuses solely on the task of plot-to-table conversion. We also use a much longer sequence length (512 vs. 192) to accommodate long tables.

3.3 Human Eval of Plot-to-table Metrics

To verify that RMS is indeed more sensitive and robust than the previously proposed table comparison metric, we conduct a human evaluation comparing RMS_F1 with the previously used RNSS metric. Specifically, we sample 50 plot-table pairs where the tables are predictions of the plot-to-table conversion models (introduced in more detail in §5.2). We score the 50 pairs with RNSS and RMS_F1. We then collect human judgments of the table prediction quality from 6 human annotators on the 50 examples.⁴ For each instance, the human annotators are given a plot, the model's predicted table, and three questions regarding different aspects of the quality of the predicted table. The three questions are (1) "Does the model overgenerate columns/rows or are some rows/columns missing?", (2) "Are the x, y label/index names, and title correct?", and (3) "Are numbers close to the true values and associated with the correct column, row labels/indexes?". Annotators rate the table from 1–5 (the higher the better). We attach the full annotation form in Appx. §B. The final human score for a plot-table pair is the average of the scores across the three questions and across all human annotators. We compute the Pearson's r and Spearman's ρ correlations between metric scores and human scores.

⁴ The 6 annotators are all experienced NLP researchers in information extraction with at least a Master's degree.

Metric | RNSS | RMS_F1
Pearson's r | 0.46 | 0.87
Spearman's ρ | 0.84 | 0.96

Table 1: Correlations between human judgments and metric scores (both RNSS and RMS_F1).
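The agreement numbers in Table 1 can be obtained from per-example scores with standard routines; the sketch below (names ours) assumes one metric score and one averaged 1–5 human rating per sampled plot-table pair.

from scipy.stats import pearsonr, spearmanr

def metric_human_agreement(metric_scores, human_scores):
    # metric_scores: per-example RNSS or RMS_F1 values.
    # human_scores: per-example ratings averaged over questions and annotators.
    r, _ = pearsonr(metric_scores, human_scores)
    rho, _ = spearmanr(metric_scores, human_scores)
    return r, rho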
As shown in Table 1, under both correlation metrics we observe a great improvement of RMS_F1 over the baseline RNSS, suggesting that RMS_F1 is a much more sensitive and informative metric for evaluating model-generated tables.

4 Prompting LLMs for Reasoning

With DePlot introduced in §3, we can convert a given chart/plot into its textual form (a linearized table). We can then construct textual prompts by concatenating the linearized tables and the questions for QA tasks. We follow the typical in-context learning paradigm and prepend a one-shot example before the current prompt.

The full prompts use either Chain-of-Thoughts (CoT) (Wei et al., 2022) or Program-of-Thoughts (PoT) (Chen et al., 2022) and can be seen in Appx. §C. They are slightly modified versions of the ones used by Chen (2023) and Chen et al. (2022) for evaluating reasoning on tabular data. Besides CoT prompting, we also explore combining DePlot+LLM with self-consistency (SC) (Wang et al., 2023), which samples a diverse set of reasoning paths and chooses the majority-voted answer instead of relying on a single greedily decoded answer as in CoT. To simplify performing arithmetic on large numbers, we also tested prompting the models to generate Python code that can be passed through an interpreter. To do so, we adapt the paradigm of Chen et al. (2022); Gao et al. (2022) to the context of tables. Future work could alternatively take advantage of finetuned tabular QA models such as Herzig et al. (2020) or use LLMs that generate SQL programs (Cheng et al., 2023), which might require multiple iterative LLM invocations to perform different atomic operations.
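As a concrete sketch of this pipeline, the snippet below assembles a one-shot prompt from a linearized table and a question and applies self-consistency by majority-voting over several sampled completions. The exemplar text, the call_llm function, and the answer-extraction rule are placeholders for whichever LLM API and prompt are actually used; the real prompts are given in Appx. §C.

from collections import Counter

ONE_SHOT_EXEMPLAR = """Read the table below to answer the following questions.

Header: Year | Democrats | Republicans | Independents
Row 1: 2004 | 68.1% | 45.0% | 53.0%
Row 2: 2006 | 58.0% | 42.0% | 53.0%

Q: In which year do Democrats have the higher favor rate?
A: Let's find the column of Democrats. The rates are 68.1 (Row 1, 2004) and 58.0 (Row 2, 2006). 68.1 is higher. The answer is 2004.
"""

def build_prompt(linearized_table, question):
    # One-shot in-context prompt: worked exemplar first, then the current table and question.
    return f"{ONE_SHOT_EXEMPLAR}\n{linearized_table}\n\nQ: {question}\nA:"

def self_consistent_answer(linearized_table, question, call_llm, n_samples=10):
    # Sample several chain-of-thought completions (e.g., at temperature 0.4)
    # and return the majority-voted final answer.
    answers = []
    for _ in range(n_samples):
        completion = call_llm(build_prompt(linearized_table, question), temperature=0.4)
        # Assume each rationale ends with "The answer is X." as in the exemplar.
        answers.append(completion.rsplit("The answer is", 1)[-1].strip(" .\n"))
    return Counter(answers).most_common(1)[0][0]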
5 Experiment

We introduce the experimental setup in §5.1 and then the results in §5.2, covering both plot-to-table translation and the downstream QA tasks.

5.1 Experimental Setup

Training and inference. DePlot is trained for 10k steps with a maximum sequence length of 512. The other hyperparameters are identical to MatCha pretraining as introduced in Liu et al. (2023a). For DePlot inference we set the temperature to 0 (so the output is deterministic). For LLM prompting, in all cases we use a temperature of 0.4.

Datasets and metrics. We evaluate on two chart/plot question answering benchmarks, ChartQA (Masry et al., 2022) and PlotQA (Methani et al., 2020). ChartQA contains two sets: augmented (aug.) and human, where the augmented set is synthetically generated and the human set is human written. Human-written queries are usually more diverse and complex, requiring more reasoning, while synthetic questions are usually highly templatic. PlotQA is purely synthetic. It contains v1 & v2 sets, where v1 consists mostly of extractive questions and v2 focuses more on numerical reasoning. Both RNSS and RMS_F1 are used for evaluating plot-to-table translation (though we have argued that RMS_F1 is the more informative metric). Following Masry et al. (2022); Methani et al. (2020), exact match accuracy with a 5% tolerance on numerical error is used to report all QA numbers.

We list the data statistics of plot-to-table training in Table 2. Note that the plot-table pairs come only from the ChartQA and PlotQA training sets (not their validation/test sets). The statistics of the PlotQA and ChartQA test data are listed in Table 3. Note that we also use plot-table pairs from the PlotQA test set for evaluating the plot-to-table task (plot-table pairs from v1 and v2 are identical).

Component | Rate | Size
synthetic (by us) | 33.3% | 270K
ChartQA | 33.3% | 22K
PlotQA | 33.3% | 224K

Table 2: Data statistics for the training data of the plot-to-table task.

Dataset | # Tables | # QA Pairs
ChartQA (Human) | 625 | 1,250
ChartQA (Machine) | 987 | 1,250
PlotQA (v1) | 33K | 1.2M
PlotQA (v2) | 33K | 4.3M

Table 3: Dataset statistics of the test sets for ChartQA and PlotQA.

Hardware. We train and evaluate our models using 64 GCP-TPUv3. The training of DePlot can be completed in roughly 5 hours.

Parameters. DePlot has 282M parameters. FlanPaLM has 540B parameters. Codex and GPT-3 have roughly 175B parameters.

5.2 Main Results

Plot-to-table translation. We evaluate plot-to-table conversion against ChartOCR, an OCR- and keypoint-detection-based system proposed by Luo et al. (2021). This system also relies on multiple hand-crafted rules that depend on the type of chart. We also compare against two PaLI models (Chen et al., 2023) (with different input resolutions) finetuned with the same plot-to-table corpus as DePlot. Finally, we compare with the MatCha base model off-the-shelf. The results are shown in Table 4.

Model↓, Metric→ | RNSS | RMS_F1
ChartOCR (Luo et al., 2021) | 81.0 | 60.1
PaLI-17B (res. 224) + plot-to-table | 77.2 | 24.8
PaLI-17B (res. 588) + plot-to-table | 90.5 | 74.9
MatCha (Liu et al., 2023a) | 95.4 | 92.3
DePlot | 97.1 | 94.2

Table 4: Benchmarking plot-to-table conversion accuracy on the PlotQA dataset (all individual plots in the PlotQA test sets). Both a pipeline-based method (ChartOCR) and end-to-end methods (PaLI-17B and MatCha) are used as baselines. RMS_F1 captures the shortcomings of baselines such as ChartOCR with much greater sensitivity.

On both metrics, DePlot outperforms the baseline ChartOCR by very significant margins. The gap is especially large on RMS_F1, since ChartOCR may suffice to extract numbers from the plot but can struggle to organize the extracted numbers into a structured table with the correct row and column labels. When compared against PaLI and MatCha, DePlot is also better, suggesting that a visual-language-specific architecture/initialization and task-specific finetuning both boost plot-to-table accuracy. It is also worth noting that PaLI-17B (res. 588) performs much better than the 224-resolution variant, indicating that a high input resolution is a key ingredient for chart information extraction.
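Before turning to the downstream QA results, here is a minimal sketch of the relaxed exact-match scoring described in §5.1: numeric predictions count as correct within 5% relative error of the target, and other predictions require an exact string match. The helper name and normalization details are our assumptions; the official ChartQA/PlotQA evaluation scripts define the precise rules.

def relaxed_match(prediction, target, tolerance=0.05):
    # Exact match with a 5% relative tolerance when both values parse as numbers.
    try:
        p, t = float(prediction), float(target)
        return p == t if t == 0 else abs(p - t) / abs(t) <= tolerance
    except ValueError:
        return str(prediction).strip().lower() == str(target).strip().lower()

def qa_accuracy(predictions, targets):
    # Fraction of examples whose prediction matches under the relaxed criterion.
    return sum(relaxed_match(p, t) for p, t in zip(predictions, targets)) / len(targets)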
Downstream tasks. We list the main results on ChartQA (Masry et al., 2022) and PlotQA (Methani et al., 2020) in Table 5. We evaluate different DePlot+LLM setups. We evaluate chain-of-thoughts (CoT) (Wei et al., 2022) prompts for GPT-3 (Brown et al., 2020) (text-davinci-003) and FlanPaLM (Chung et al., 2022) (540B). In addition, we use self-consistency (SC) (Wang et al., 2023) across 10 predictions. Finally, we use program-of-thoughts (PoT) (Chen et al., 2022) to prompt Codex (Chen et al., 2021) (code-davinci-002) to generate Python snippets that can subsequently be executed by an interpreter to extract an answer.⁵ Since some reasoning operations are better done in plain language (like computing an argmax) and some by code snippets (like floating-point arithmetic), we find optimal results by doing self-consistency across both CoT and PoT predictions.

⁵ We also evaluated PaLM and FlanPaLM for code generation but found Codex to be more likely to write correct code instead of doing the reasoning in comment blocks.

Model | ChartQA aug. | ChartQA human | ChartQA avg. | PlotQA v1 | PlotQA v2 | PlotQA avg.
human ceiling | - | - | - | - | - | 80.5
Fully-supervised:
CRCT | - | - | - | 76.9 | 34.4 | 55.7
VL-T5-OCR | - | - | 41.6 | 75.9 | 56.0 | 66.0
T5-OCR | - | - | 41.0 | 72.6 | 56.2 | 64.4
VisionTapas-OCR | - | - | 45.5 | 65.3 | 42.5 | 53.9
PaLI-17B (res. 224) | 6.2 | 12.6 | 9.4 | 56.9 | 13.1 | 35.0
PaLI-17B (res. 588) | 64.9 | 30.4 | 47.6 | 64.5 | 15.2 | 39.8
Pix2Struct | 81.6 | 30.5 | 56.0 | 73.2 | 71.9 | 72.5
MatCha | 90.2 | 38.2 | 64.2 | 92.3 | 90.7 | 91.5
One-shot (ours):
DePlot+GPT3 CoT | 37.3 | 36.5 | 36.9 | 31.9 | 51.3 | 41.6
DePlot+GPT3 SC | 42.6 | 41.9 | 42.3 | 35.0 | 51.6 | 43.3
DePlot+FlanPaLM CoT | 76.7 | 57.8 | 67.3 | 51.3 | 44.9 | 48.1
DePlot+FlanPaLM SC | 78.8 | 62.2 | 70.5 | 57.8 | 50.1 | 53.9
DePlot+Codex PoT SC | 91.8 | 61.6 | 76.7 | 58.8 | 69.8 | 64.3
DePlot+FlanPaLM+Codex PoT SC | 91.0 | 67.6 | 79.3 | 62.2 | 71.0 | 66.6

Table 5: Main experimental results on two plot/chart QA benchmarks, ChartQA & PlotQA. The ChartQA human set contains human-written queries, while the ChartQA aug. set and PlotQA contain synthetic queries. Detailed introductions of the baselines can be found in Appx. §A. The last six rows show DePlot+LLM results – the only one-shot setup. CoT denotes chain-of-thought prompting, SC denotes self-consistency, and PoT denotes program-of-thought prompting. The best ChartQA results are achieved by majority voting jointly across 10 CoT and 10 PoT predictions.

DePlot+LLM performs especially strongly on the ChartQA human set, which contains complex human-written queries. Compared with the prior SOTA MatCha, DePlot+LLM combined with FlanPaLM, Codex, and Self-Consistency (SC) achieves an improvement of 29.4% (38.2%→67.6%). This is also the best setup for PlotQA. On the heavily synthetic queries from PlotQA v1 and v2, DePlot+LLM models underperform the end-to-end SOTA MatCha.

In summary, DePlot+LLM significantly outperforms the finetuned SOTA on human-written chart QA queries and overall underperforms the finetuned SOTA on synthetic QA queries. We believe it is especially important to achieve good performance on the human set as it is much more diverse and reflects real-world challenges. The results suggest DePlot+LLM's strong capability to solve novel human queries unseen in the demonstration. It is also worth emphasizing again that DePlot+LLM requires much less supervision than the finetuned SOTA methods (one shot vs. tens of thousands of training examples). We discuss why DePlot+LLM underperforms on PlotQA in the error analysis (§6.1).

Besides one-shot learning, we also experimented with zero- and few-shot inference. We found that the models generally fail without a demonstration and that few-shot performs similarly to one-shot. After the submission of this paper, we experimented with RLHF-ed LLMs such as ChatGPT⁶, GPT-4 (OpenAI, 2023), and Bard⁷, finding that such aligned conversational models are capable of processing the DePlot-generated tables in a zero-shot manner. This can potentially further improve DePlot+LLM's performance on academic benchmarks by large margins.

⁶ [Link]/blog/chatgpt
⁷ [Link]
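For completeness, here is a sketch of how a program-of-thoughts completion can be turned into a final answer by executing the generated snippet and reading off the variable ans (the convention used in the prompts of Appx. §C). The minimal sandboxing shown is illustrative only and not the paper's actual execution environment.

def run_pot_snippet(code):
    # Execute the model-generated Python and return whatever it bound to `ans`.
    # A real system should sandbox this far more carefully (timeouts, no file or network access).
    namespace = {}
    try:
        exec(code, {"__builtins__": {}}, namespace)
        return namespace.get("ans")
    except Exception:
        return None

generated = """
indonesia = 2.88
ireland = 2.33
mauritania = 4.15
ans = (indonesia + ireland) - mauritania
"""
print(run_pot_snippet(generated))  # ~1.06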
6 Analyses and Discussions

In this section, we first conduct case studies and error analysis in §6.1 to better understand DePlot's wins and losses against end-to-end methods. Then in §6.2 we study the performance of DePlot when exposed to out-of-distribution web charts and plots.

6.1 Case Study and Error Analysis

To more concretely demonstrate the strengths and weaknesses of DePlot+LLM, we present two case studies on the downstream task ChartQA. We compare DePlot+FlanPaLM using either PoT or CoT.

First, in Table 6 we show an example demonstrating the benefit of using LLMs and prompting techniques for stronger numerical reasoning. While the finetuned SOTA MatCha wrongly predicts the answer, DePlot+FlanPaLM (using either CoT or PoT) produces the answer accurately.

Question: What is the difference between the sum of the shares of Indonesia and Ireland and the share of Mauritania?

MatCha: 0.07

DePlot+FlanPaLM CoT: Let's find the sum of shares of Indonesia and Ireland: 2.88+2.33=5.21. Let's subtract the share of Mauritania: 5.21-4.15=1.06. The answer is 1.06.

DePlot+Codex PoT:
indonesia = 2.88
ireland = 2.33
mauritania = 4.15
ans=(indonesia+ireland)-mauritania

Compiler output: 1.06

Table 6: DePlot+LLM shows much stronger numerical reasoning skills than the finetuned MatCha.

Second, we show an example where the DePlot+LLM framework fails in Table 7. The LLMs are unable to accurately identify the "highest value of the gray bar" since they do not have information about the color of the bars. In Table 7, though DePlot+FlanPaLM correctly predicted "Yes", it is correct for the wrong reason – FlanPaLM randomly chose the highest value among the light blue bars, which also happens to be smaller than the average of "identity theft". This is a typical failure mode where the query refers to a visual attribute but that attribute is lost in plot-to-table translation. In future work, we plan to develop a table encoding scheme that also considers visual attributes to avoid such errors.

Question: Is the average of all the bars in "identity theft" greater than the highest value of the gray bar?

MatCha: No

DePlot+FlanPaLM CoT: Let's find the average of all the bars in "Identity theft". The sum of all the bars is 66+17+16 = 99. The average is 99/3 = 33. The highest value of the gray bar is 26. 33 is greater than 26. The answer is Yes.

DePlot+Codex PoT:
#Identity theft corresponds to row 5
#Numbers on row 5 are [66, 17, 16]
#Highest value of the gray bar is 79
ans = 66 > 79

Compiler output: No

Table 7: DePlot+LLM struggles with queries related to visual attributes such as color (which are lost in plot-to-table translation).
While DePlot+LLM has surpassed the finetuned SOTA on ChartQA, we notice that the picture on PlotQA is different – DePlot underperforms the finetuned SOTA MatCha by a large margin (66.6% vs. 91.5%). Through error analysis, we observe two major reasons. First, synthetic queries are highly templatic and cover only restricted types of questions. Models finetuned with thousands of examples can learn to solve such templatic questions, even better than humans do (the human ceiling on PlotQA is just 80.5%, compared with MatCha's performance of 91.5%). However, DePlot+LLM only learns from a one-shot in-context example and thus cannot exploit the bias encoded in the training set. The second reason is the loss of information in plot-to-table translation. Synthetic queries are usually highly extractive and include questions about visual attributes such as the color, shape, or direction of objects in a plot. When plots are converted to tables, such information is lost. We plan to also decode visual attributes in future work when training the plot-to-table model. More successful and failure case analyses are available in Appx. §D.

6.2 Out-of-distribution Analysis

One limitation of our evaluation setup is that the kind and style of charts that are part of DePlot's training corpus are in the same domain as those in the evaluation sets from ChartQA and PlotQA. This raises the question of whether DePlot will generalize to charts sourced from different websites or built using completely different tools. However, few public resources exist containing both charts and their associated tables.

In order to estimate the out-of-distribution capabilities of DePlot, we annotate 10 charts from the recently released TaTa dataset (Gehrmann et al., 2022), sourced from [Link]. We skip choropleth maps since none have been seen during training. We find that DePlot obtains an average 78% RMS_F1 score in reconstructing the underlying tables. We observed two limitations in DePlot, which we outline below and which can be attributed to the nature of the training datasets used. First, the model could get distracted by adjacent text, such as references to external sources, and it benefited from cropping the chart in advance. Secondly, DePlot struggled to understand labels or values linked to their corresponding bar/pie section by an arrow. We will address these issues in future work by making the synthetic data creation pipeline more robust.

7 Conclusion

We have proposed DePlot+LLM, a method for visual language reasoning that decomposes the task into two steps. The first step converts a plot into a linearized table using an image-to-text Transformer model finetuned for the conversion. The second step combines the plot-to-text model with an off-the-shelf LLM to reason over the linearized table with just one-shot supervision.

We standardize the plot-to-table conversion task by proposing a new table similarity comparison metric that considers the structure of the table and the numeric values but is invariant to column/row permutation. With the new metric, we compare our image-to-text model DePlot with an OCR-based baseline and three end-to-end baselines, with DePlot achieving the best performance. The conversion model is then used for the downstream tasks of ChartQA and PlotQA. On the ChartQA human-query set, the one-shot DePlot+LLM model achieves a +29.4% improvement compared with an end-to-end SOTA finetuned with thousands of examples. We have also conducted comprehensive analyses to understand the wins and losses of the DePlot+LLM framework and highlight that encoding visual attributes can be a fruitful direction for future exploration.

Limitations

DePlot's strength is highly dependent on the accuracy of plot-to-text (table) conversion. To obtain effective plot-to-text conversion, large amounts of diverse and in-domain plot-table parallel data are usually needed. It is unknown to what extent DePlot can work for out-of-domain (OOD) plot-to-text conversion. We investigated this in §6.2, but in the future a wider range of web charts can be used to gain a deeper understanding of DePlot's robustness on OOD plots.

Beyond that, DePlot does not work for visual language that lacks a clear latent textual representation, such as textbook figures where the visual illustrations are created using specialized software and do not have clear structured representations.

Another limitation of the current DePlot approach is that we ignore layout information such as the orientation and color of the visual elements/objects. In future work, we can incorporate such attributional information by including it in the decoding target.
Ethics Statement Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
To the best of our knowledge, D E P LOT is of low William Saunders, Christopher Hesse, Andrew N.
risk to the society since it is an information ex- Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan
traction model that converts graphics information Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder,
from image to textual information in the form of Bob McGrew, Dario Amodei, Sam McCandlish, Ilya
table. That said, when combined with LLMs, D E - Sutskever, and Wojciech Zaremba. 2021. Evaluat-
P LOT+LLM can demonstrate potential risk such as ing large language models trained on code. arXiv
generating toxic content similar to when LLMs are preprint arXiv:2107.03374.
used standalone. As a result, we should proceed Wenhu Chen. 2023. Large language models are few(1)-
with caution when deploying D E P LOT+LLM in the shot table reasoners. In Findings of the Associa-
real-world and take necessary precautions such as tion for Computational Linguistics: EACL 2023,
pages 1120–1130, Dubrovnik, Croatia. Association
having a filtering stage after the generation. for Computational Linguistics.
In terms of data used, all training and evaluation
data are either synthetically created using rules or Wenhu Chen, Xueguang Ma, Xinyi Wang, and
William W Cohen. 2022. Program of thoughts
publicly available data on the web with appropriate prompting: Disentangling computation from reason-
permissive licenses. ing for numerical reasoning tasks. arXiv preprint
arXiv:2211.12588.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Pier-
References giovanni, Piotr Padlewski, Daniel Salz, Sebastian
Mubashara Akhtar, Oana Cocarascu, and Elena Sim- Goodman, Adam Grycner, Basil Mustafa, Lucas
perl. 2023. Reading and reasoning over chart im- Beyer, et al. 2023. Pali: A jointly-scaled multilingual
ages for evidence-based automated fact-checking. In language-image model. In The Eleventh Interna-
Findings of the Association for Computational Lin- tional Conference on Learning Representations.
guistics: EACL 2023, pages 399–414, Dubrovnik,
Croatia. Association for Computational Linguistics. Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu
Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong,
Ewa Andrejczuk, Julian Eisenschlos, Francesco Pic- Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer,
cinno, Syrine Krichene, and Yasemin Altun. 2022. et al. 2023. Binding language models in symbolic
Table-to-text generation and pre-training with TabT5. languages. In The Eleventh International Conference
In Findings of the Association for Computational on Learning Representations.
Linguistics: EMNLP 2022, pages 6758–6766, Abu
Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021.
Dhabi, United Arab Emirates. Association for Com-
Unifying vision-and-language tasks via text genera-
putational Linguistics.
tion. In International Conference on Machine Learn-
A. Furkan Biten, R. Tito, A. Mafla, L. Gomez, M. Rusi- ing, pages 1931–1942. PMLR.
nol, M. Mathew, C. Jawahar, E. Valveny, and
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
D. Karatzas. 2019. Icdar 2019 competition on scene
Maarten Bosma, Gaurav Mishra, Adam Roberts,
text visual question answering. In 2019 International
Paul Barham, Hyung Won Chung, Charles Sutton,
Conference on Document Analysis and Recognition
Sebastian Gehrmann, et al. 2022. PaLM: Scaling
(ICDAR), pages 1563–1570, Los Alamitos, CA, USA.
language modeling with pathways. arXiv preprint
IEEE Computer Society.
arXiv:2204.02311.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Hyung Won Chung, Le Hou, Shayne Longpre, Bar-
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind ret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Wang, Mostafa Dehghani, Siddhartha Brahma, et al.
Askell, et al. 2020. Language models are few-shot 2022. Scaling instruction-finetuned language models.
learners. Advances in neural information processing arXiv preprint arXiv:2210.11416.
systems, 33:1877–1901.
Alexey Dosovitskiy, Lucas Beyer, Alexander
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Henrique Ponde de Oliveira Pinto, Jared Kaplan, Thomas Unterthiner, Mostafa Dehghani, Matthias
Harrison Edwards, Yuri Burda, Nicholas Joseph, Minderer, Georg Heigold, Sylvain Gelly, et al. 2021.
Greg Brockman, Alex Ray, Raul Puri, Gretchen An image is worth 16x16 words: Transformers
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas- for image recognition at scale. In International
try, Pamela Mishkin, Brooke Chan, Scott Gray, Conference on Learning Representations.
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Kaiser, Mohammad Bavarian, Clemens Winter, Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon,
Philippe Tillet, Felipe Petroski Such, Dave Cum- Pengfei Liu, Yiming Yang, Jamie Callan, and Gra-
mings, Matthias Plappert, Fotios Chantzis, Eliza- ham Neubig. 2022. PAL: Program-aided language
beth Barnes, Ariel Herbert-Voss, William Hebgen models. arXiv preprint arXiv:2211.10435.

Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty,
Jan A Botha, Michael Chavinda, Ankur Parikh, and and Enamul Hoque. 2022. ChartQA: A benchmark
Clara Rivera. 2022. TaTa: A multilingual table-to- for question answering about charts with visual and
text dataset for african languages. arXiv preprint logical reasoning. In Findings of the Association for
arXiv:2211.00142. Computational Linguistics: ACL 2022, pages 2263–
2279, Dublin, Ireland. Association for Computational
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Linguistics.
Müller, Francesco Piccinno, and Julian Eisenschlos.
2020. TaPas: Weakly supervised table parsing via Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and
pre-training. In Proceedings of the 58th Annual Meet- Pratyush Kumar. 2020. PlotQA: Reasoning over sci-
ing of the Association for Computational Linguistics, entific plots. In Proceedings of the IEEE/CVF Win-
pages 4320–4333, Online. Association for Computa- ter Conference on Applications of Computer Vision,
tional Linguistics. pages 1527–1536.

Shankar Kantharaj, Rixie Tiffany Leong, Xiang Lin, OpenAI. 2023. GPT-4 Technical Report. arXiv preprint
Ahmed Masry, Megh Thakkar, Enamul Hoque, and arXiv:2303.08774.
Shafiq Joty. 2022. Chart-to-text: A large-scale bench- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
mark for chart summarization. In Proceedings of the Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-
60th Annual Meeting of the Association for Compu- try, Amanda Askell, Pamela Mishkin, Jack Clark,
tational Linguistics (Volume 1: Long Papers), pages et al. 2021. Learning transferable visual models
4005–4023, Dublin, Ireland. Association for Compu- from natural language supervision. In International
tational Linguistics. Conference on Machine Learning, pages 8748–8763.
PMLR.
Hajime Kato, Mitsuru Nakazawa, Hsuan-Kung Yang,
Mark Chen, and Björn Stenger. 2022. Parsing line Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
chart images using linear programming. In Proceed- Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
ings of the IEEE/CVF Winter Conference on Applica- Wei Li, Peter J Liu, et al. 2020. Exploring the limits
tions of Computer Vision, pages 2109–2118. of transfer learning with a unified text-to-text trans-
former. J. Mach. Learn. Res., 21(140):1–67.
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu,
Fangyu Liu, Julian Eisenschlos, Urvashi Khandel- Chinmayee Rane, Seshasayee Mahadevan Subramanya,
wal, Peter Shaw, Ming-Wei Chang, and Kristina Devi Sandeep Endluri, Jian Wu, and C Lee Giles.
Toutanova. 2023. Pix2Struct: Screenshot parsing 2021. ChartReader: Automatic parsing of bar-plots.
as pretraining for visual language understanding. In In 2021 IEEE 22nd International Conference on In-
Proceedings of the 40th International Conference on formation Reuse and Integration for Data Science
Machine Learning. (IRI), pages 318–325. IEEE.

Matan Levy, Rami Ben-Ari, and Dani Lischinski. 2022. Noah Siegel, Zachary Horvitz, Roie Levin, Santosh
Classification-regression for chart comprehension. In Divvala, and Ali Farhadi. 2016. FigureSeer: Pars-
European Conference on Computer Vision, pages ing result-figures in research papers. In European
469–484. Springer. Conference on Computer Vision, pages 664–680.
Springer.
Fangyu Liu, Francesco Piccinno, Syrine Krichene,
Chenxi Pang, Kenton Lee, Mandar Joshi, Nigel Col- Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani
lier, Yasemin Altun, and Julian Martin Eisenschlos. Yogatama, Yan Wang, Lingpeng Kong, and Nigel
2023a. MatCha: Enhancing visual language pretrain- Collier. 2022. Language models can see: Plugging
ing with math reasoning and chart derendering. In visual controls in text generation. arXiv preprint
Proceedings of the 61st Annual Meeting of the Asso- arXiv:2205.02655.
ciation for Computational Linguistics. Association Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le,
for Computational Linguistics. Ed Chi, Sharan Narang, Aakanksha Chowdhery, and
Denny Zhou. 2023. Self-consistency improves chain
Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, of thought reasoning in language models. In The
Soroush Vosoughi, Claire Cui, Denny Zhou, and An- Eleventh International Conference on Learning Rep-
drew M Dai. 2023b. Mind’s eye: Grounded language resentations.
model reasoning through simulation. In The Eleventh
International Conference on Learning Representa- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
tions. Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le,
and Denny Zhou. 2022. Chain of thought prompt-
Junyu Luo, Zekun Li, Jinpeng Wang, and Chin-Yew ing elicits reasoning in large language models. In
Lin. 2021. ChartOCR: Data extraction from charts Advances in Neural Information Processing Systems.
images via a deep hybrid framework. In 2021 IEEE
Winter Conference on Applications of Computer Vi- Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei
sion (WACV). The Computer Vision Foundation. Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. 2022.

An empirical study of gpt-3 for few-shot knowledge-
based vqa. Proceedings of the AAAI Conference on
Artificial Intelligence, 36(3):3081–3089.
Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Se-
bastian Riedel. 2020. TaBERT: Pretraining for joint
understanding of textual and tabular data. In Proceed-
ings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 8413–8426, On-
line. Association for Computational Linguistics.
Andy Zeng, Maria Attarian, brian ichter,
Krzysztof Marcin Choromanski, Adrian Wong,
Stefan Welker, Federico Tombari, Aveek Purohit,
Michael S Ryoo, Vikas Sindhwani, Johnny Lee, Vin-
cent Vanhoucke, and Pete Florence. 2023. Socratic
models: Composing zero-shot multimodal reasoning
with language. In The Eleventh International
Conference on Learning Representations.

A Details of Baselines from 1 to 5. The final table score is the average
score from the three questions.
We introduce below the details of the baselines
used in Table 5.
C Chain-of-thoughts and
T5 is an encode-decoder Transformer model pro-
posed by Raffel et al. (2020). The baseline model
Program-of-thoughts Prompt
T5 takes the concatenation of a linearized table (and In Figure 3 we show the one-shot prompt used
a query, when the task is QA) as input, and aims across all experiments CoT. It is taken from devel-
to decode the target (answer or summarization). opment set examples in combination with prompts
When the gold table is availible, the gold table is used by Chen (2023). We also modified this prompt
used as the input and the chart image is not used di- to output Python code when needing to compute
rectly. VL-T5 proposed by Cho et al. (2021) is sim- arithmetic operations in Figure 4.
ilar to T5 but also takes a visual input (i.e., the chart
image) on the encoder side. VisionTaPas (Masry
D More Case Study
et al., 2022) is modified from TaPas (Herzig et al.,
2020) to incorporate the visual modality by adding Successes. In Table 8 and Table 9 we demon-
a ViT model (Dosovitskiy et al., 2021) and cross- strate two more cases where D E P LOT+LLM are
modal fusion layers. T5-OCR, VL-T5-OCR, and successful due to its stronger numerical reasoning
VisionTaPas-OCR are the same model as T5, VL- capabilities.
T5, and VisionTaPas, respectively. However, they
do not assume the existence of gold table but use
an OCR-based system to extract the data table from
the chart image. The above mentioned models and
their performance numbers are all extracted from
Masry et al. (2022) and Kantharaj et al. (2022).
Please see the original paper for more details. Clas-
sification - Regression Chart Transformer (CRCT)
(Levy et al., 2022) is the best performing model
on PlotQA according to the PlotQA benchmark
on [Link]. It uses a detector
that extracts all textual and visual elements of chart
then processes these elements with a multimodal
Transformer. PaLI (Chen et al., 2023) with 17B
Question: Can you identify how many points the female
parameters is a SOTA on multiple vision-language data moved from 2009 to 2019?
tasks in the natural image domain however fails
significantly on chart understanding tasks. MatCha M AT C HA: 0.8

Liu et al. (2023a) is the strongest supervised base- D E P LOT+FlanPaLM CoT: Let’s find the row of year 2009,
line and uses a mixture of image-to-text tasks as that’s Row 11. Let’s extract the number on Row 11, that’s
5.27. Let’s find the row of year 2019, that’s Row 1. Let’s
pretraining to inject math reasoning and chart lay- extract the number on Row 1, that’s 5.9. 5.9-5.27=0.63.
out understanding knowledge to the base model. The answer is 0.63.
In downstream tasks ChartQA and PlotQA, the
D E P LOT+Codex PoT:
full-supervised models are finetuned with the cor-
responding training sets (ChartQA has ∼33k data #Python
#year 2009 corresponds to row 11
points and PlotQA has ∼37M). The fully super- #year 2019 corresponds to row 1
vised results are collected from Liu et al. (2023a). female_2009 = 5.27
female_2019 = 5.9
ans = female_2019 - female_2009
B Human Evaluation Questions
Compiler output: 0.63
We list below (Figure 2) the annotation form of the
three questions asked when producing the human Table 8: A line plot example that requires numerical
judgment scores of plot-table pairs. Each question reasoning which is easily solved by D E P LOT+LLM but
asks one aspect regarding the quality of the gener- failed by M AT C HA.
ated table and the annotator needs to rate the table

[plot]

[generated table]

Questions (rate 1 to 5; the higher the better):

1. Does the model overgenerate columns/rows or some rows/columns are missing? (1


means lots of extra or missing columns/rows, 5 means no extra columns/rows and
all necessary columns/rows are presented)
Answer:

2. Are the x, y label/index names, and title correct? (1 means nothing is accurate,
5 means all information are exactly correct)
Answer:

3. Are numbers close to the true values and associated with the correct column, row
labels/indexes? (1 means nothing is accurate, 5 means all values and their
associated labels/indexes are accurate)
Answer:

Figure 2: Questions for producing human scores of plot-table pairs.

Read the table below to answer the following questions.

Header: Year | Democrats | Republicans | Independents


Row 1: 2004 | 68.1% | 45.0% | 53.0%
Row 2: 2006 | 58.0% | 42.0% | 53.0%
Row 3: 2007 | 59.0% | 38.0% | 45.0%
Row 4: 2009 | 72.0% | 49.0% | 60.0%
Row 5: 2011 | 71.0% | 51.2% | 58.0%
Row 6: 2012 | 70.0% | 48.0% | 53.0%
Row 7: 2013 | 72.0% | 41.0% | 60.0%

Q: In which year republicans have the lowest favor rate?


A: Let's find the column of republicans. Then let's extract the favor rates, they are
[45.0, 42.0, 38.0, 49.0, 51.2, 48.0, 41.0]. The smallest number is 38.0, that's
Row 3. Row 3 is year 2007. The answer is 2007.

Q: What is the sum of Democrats' favor rates of 2004, 2012, and 2013?
A: Let's find the rows of years 2004, 2012, and 2013. We find Row 1, 6, 7. The
favor rates of Democrats on those 3 rows are 68.1, 70.0, and 72.0.
68.1+70.0+72.0=210.1. The answer is 210.1.

Q: By how many points do Independents surpass Republicans in the year of 2011?


A: Let's find the row with year = 2011. We find Row 5. We extract Independents
and Republicans' numbers. They are 58.0 and 51.2. 58.0-51.2=6.8. The answer is
6.8.

Q: Which group has the overall worst performance?


A: Let's sample a couple of years. In Row 1, year 2004, we find Republicans
having the lowest favor rate 45.0 (since 45.0<68.1, 45.0<53.0). In year 2006, Row
2, we find Republicans having the lowest favor rate 42.0 (42.0<58.0, 42.0<53.0).
The trend continues to other years. The answer is Republicans.

Q: Which party has the second highest favor rates in 2007?


A: Let's find the row of year 2007, that's Row 3. Let's extract the numbers on
Row 3: [59.0, 38.0, 45.0]. 45.0 is the second highest. 45.0 is the number of
Independents. The answer is Independents.

Figure 3: Prompt used for question answering on tables.

Read the table below and write code to answer the following questions using the
variable ans.

Header: Year | Democrats | Republicans | Independents


Row 1: 2004 | 68.1% | 45.0% | 53.0%
Row 2: 2006 | 58.0% | 42.0% | 53.0%
Row 3: 2007 | 59.0% | 38.0% | 45.0%
Row 4: 2009 | 72.0% | 49.0% | 60.0%
Row 5: 2011 | 71.0% | 51.2% | 58.0%
Row 6: 2012 | 70.0% | 48.0% | 53.0%
Row 7: 2013 | 72.0% | 41.0% | 60.0%

Q: What was the average difference in approval rates between democrats and
republicans in 2006 and 2007?
#Python
democrats_2006 = 58.0
republicans_2006 = 42.0
difference_2006 = democrats_2006 - republicans_2006
democrats_2007 = 59.0
republicans_2007 = 38.0
difference_2007 = democrats_2007 - republicans_2007
ans = (difference_2006 + difference_2007) / 2

Q: What is the average of Democrats' favor rates of 2004, 2012, and 2013?
#Python
# Years 2004, 2012, and 2013 correspond to rows 1, 6 and 7.
democrats_2004 = 68.1
democrats_2012 = 70.0
democrats_2013 = 72.0
ans = (democrats_2004 + democrats_2012 + democrats_2013) / 3

Q: Which party had less than 50% approval rate in 2013?


#Python
# year 2013 corresponds to row 7. Numbers on row 7 are [72.0, 41.0, 60.0]
# Republicans are the only with less than 50.
ans = "Republicans"

Q: What is percentage of relative increase in approval rate for democrats from 2012
to 2013?
#Python
# year 2012 and 2013 correspond to rows 6 and 7.
# Numbers on row 6 are [70.0, 48.0, 53.0]
# Numbers on row 7 are [72.0, 41.0, 60.0]
democrats_2012 = 70.0
democrats_2013 = 72.0
ans = 100 * (democrats_2013 - democrats_2012) / democrats_2012

Q: What is the difference between republicans in 2011 and democrats in 2006?


#Python
# year = 2011 corresponds to row 5 and the republicans had a 51.2 rate
republicans_2011 = 51.2
# year = 2006 corresponds to row 2 and the democrats had a 58.0 rate
democrats_2006 = 58.0
# The difference between A and B is A - B which may be negative
ans = republicans_2011 - democrats_2006

Figure 4: Prompt used for question answering on tables using Python code.

Question: What is the average penetration rate?

M AT C HA: 43

D E P LOT+FlanPaLM CoT: Let’s add all the penetra-


tion rates: 48% + 43% + 33% + 26% + 18% = 168%.
There are 5 years, so the average is 168% / 5 = 33.6%. The
answer is 33.6.

D E P LOT+Codex PoT:
#Python
# Years 2013, 2011, 2009, 2007, and
2005 correspond to rows 1, 2, 3, 4,
and 5.
penetration_2013 = 48
penetration_2011 = 43 Question: What’s the median value of the red graph?
penetration_2009 = 33
penetration_2007 = 26 M AT C HA: 10.3
penetration_2005 = 18
ans = (penetration_2013 + D E P LOT+FlanPaLM CoT: Let’s find the median
penetration_2011 + penetration_2009 value of red graph. The median is the middle value of the
+ penetration_2007 + data set. The data set is [2, 2, 3.5, 0, 4.5, 5.6, 6.6, 6.8, 0].
penetration_2005) / 5 The middle value is 4.5. The answer is 4.5.
Compiler output: 33.6 D E P LOT+Codex PoT:
Table 9: D E P LOT+LLM is especially suitable for more #Python
complex numerical computations such as computing the # The median of the red graph is the
number in the middle of the sorted
average for multiple numbers as shown in this example. list of values.
# The values are 2, 2, 3.5, 0, 4.5,
5.6, 6.6, 6.8, 0.
Failures. For questions concerning color or other # The median is 4.5.
visual attributes of the graph, the D E P LOT+LLM ans = 4.5
framework is unable to handle since such informa- Compiler output: 4.5
tion is lost in modality translation and not consid-
ered in the current textual table encoding scheme. Table 10: All models made mistakes. For the finetuned
model M AT C HA, the error seems to be numerical rea-
We show an additional examples in Table 10.
soning since its 10.3 is from the correct graph. However,
Besides color, plot-to-table conversion can ig- for D E P LOT+LLMs, the error originates from identify-
nore other visual attributes such as the example in ing the wrong graph as the red graph.
Table 11. There does not exist a one-to-one align-
ment between dots on the line graphs and x labels.
The D E P LOT model produces a table with only x
labels and the extrapolated y values and ignore the
dots in the graph.

Question: Which country’s line bar shows three data points in the same place between 2010 and 2014?

M AT C HA: Belize

D E P LOT:
Entity | 2004 | 2006 | 2008 | 2010 | 2012 | 2014 | 2016 | 2017
Belize | 65.0 | 65.0 | 61.0 | 61.0 | 58.0 | 55.0 | 62.0 | 62.0
Papua New Guinea | 44.0 | 52.0 | 44.0 | 41.0 | 37.0 | 46.0 | 47.0 | 43.0

D E P LOT+FlanPaLM CoT: Let’s find the rows with countries. We find Row 1, 2. Let’s find the years between
2010 to 2014. We find Row 6, 7. We find the data in the same place between 2010 to 2014. We find 61.0, 61.0, 41.0, 41.0.
The answer is Belize.

D E P LOT+Codex PoT:
#Python
# Row 1 corresponds to Belize. Numbers on row 1 are [65.0, 65.0, 61.0, 61.0,
58.0, 55.0, 62.0, 62.0]
# Row 2 corresponds to Papua New Guinea. Numbers on row 2 are [44.0, 52.0, 44.0,
41.0, 37.0, 46.0, 47.0, 43.0]
# Belize has three data in the same place between 2010 to 2014.
ans = "Belize"

Compiler output: Belize

Table 11: An error caused by plot-to-table translation. The dots on the lines and the x labels (years) are not exactly
aligned, causing challenges in the translation.

