Arrow-Guided VLM: Enhancing Flowchart
Understanding via Arrow Direction Encoding
Takamitsu Omasa Ryo Koshihara Masumi Morishige
Galirage Inc.
[email protected]
Abstract
Flowcharts are indispensable tools in software design and business-process anal-
ysis, yet current Vision Language Models (VLMs) frequently misinterpret the
directional arrows and graph topology that set these diagrams apart from natural
images. This paper introduces a seven-stage pipeline, grouped into three broader
processes: (1) arrow-aware detection of nodes and arrow endpoints; (2) Optical
Character Recognition (OCR) to extract node text; and (3) construction of a struc-
tured prompt that guides the VLM. Tested on a 90-question benchmark distilled
from 30 annotated flowcharts, our method raises overall accuracy from 80% to 89%
(+9 pp), a sizeable and statistically significant gain achieved without task-specific
fine-tuning of the VLMs. The benefit is most pronounced for next-step queries
(25/30 → 30/30; 100%, +17 pp); branch-result questions improve more modestly,
and before-step queries remain difficult. A parallel evaluation with an LLM-as-a-
Judge protocol shows the same trends, reinforcing the advantage of explicit arrow
encoding. Limitations include dependence on detector and OCR precision, the
small evaluation set, and residual errors at nodes with multiple incoming edges.
Future work will enlarge the benchmark with synthetic and handwritten flowcharts
and assess the approach on Business Process Model and Notation (BPMN) and
Unified Modeling Language (UML).
1 Introduction
Flowcharts distill complex control flow, decision logic, and data transformations into a handful of
boxes and arrows. In software engineering and business-process management, these diagrams are
more than illustrative artifacts: they enable automatic code generation and serve as effective
pedagogical tools [1, 2].
Within just three years, Large Language Models (LLMs) have advanced at an unprecedented pace:
accuracy on the 57-subject Massive Multitask Language Understanding (MMLU) suite climbed
from 43.9% with GPT-3 (2021) [3] to nearly 89% with GPT-4o [4]. VLMs likewise achieve
leading results across diverse multimodal benchmarks; for instance, GPT-4o excels on Mas-
sive Multi-discipline Multimodal Understanding (MMMU), MathVista, and Document Visual Ques-
tion Answering (DocVQA) [5–8]. However, this accuracy deteriorates markedly once explicit
graph-topology reasoning is required. On the simulated subset of the FlowLearn benchmark, con-
verting flowcharts to Mermaid code still proves challenging: on the link-level F1 metric, Claude-3
Opus scores 0.30 and GPT-4 with Vision (GPT-4V) only 0.22 (100-sample subset; [9], Table 7),
underscoring how current VLMs struggle to recover edge relationships.
Previous approaches can be categorized into two main types. First, some studies couple off-the-
shelf object detectors such as YOLO [10] with OCR; the resulting bounding boxes and tokens
are concatenated into a prompt for a VLM, yielding only modest gains over detector-free baselines.
Figure 1: Overview of the seven-stage pipeline: OCR, object detection, text–object fusion, arrow
anchoring, node–arrow linking, graph-structured prompt generation, and VLM-based reasoning.
Second, other work relies on zero-shot segmentation models, most prominently the Segment Anything
Model (SAM) [11]. GenFlowchart [12], for instance, converts SAM masks into bounding boxes, adds
OCR, and queries GPT-3.5 Turbo, yet still suffers from arrow-ordering ambiguities and localization
noise. A complementary strand improves the detector itself—arrow-aware models like Arrow R-
CNN halve localization errors on handwritten diagrams [13]—but these specialised detectors have
not been fused with LLMs. Their outputs feed rule-based pipelines, so branch ordering and multi-step
reasoning remain unresolved. To close this gap, we propose the first detector–VLM fusion pipeline
for flowcharts (Fig. 1). First, a fine-tuned, arrow-aware detector localises nodes and arrowheads. The
OCR stage then extracts textual labels. Finally, the (text, bounding-box) pairs are serialised into a coordinate-rich
prompt that, together with the image, is fed to GPT-4o. Unlike prior work that lists raw labels, we
annotate each token with its normalised center of mass, allowing the VLM to infer edge orientations
through the geometric priors internalised during pre-training.
Motivated by these gaps, this study investigates whether tightly coupling a flowchart-aware detector
with a VLM via a coordinate-rich prompt can close the reasoning gap on diagrammatic tasks. To
investigate this question, an arrow-aware detector is fused with GPT-4o and evaluated on a new
90-question suite spanning three query types and diverse diagram complexities, yielding up to +9
pp overall and 100% accuracy on next-step queries.
These gains rest on a relatively small test set—90 questions from 30 diagrams—and remain bounded
by the detection model’s localization accuracy. These results are therefore viewed as a first step;
scaling the benchmark, adapting the pipeline to large public corpora such as FlowLearn [9], and
exploring detector–VLM co-training are left for future work.
2 Related Work
2.1 Limitations of End-to-End VLMs on Diagram Tasks
VLMs such as GPT-4o achieve state-of-the-art scores on natural-image VQA and captioning bench-
marks; however, their accuracy drops sharply when tasks demand explicit reasoning over graph topol-
ogy or precise measurement rather than free-form visual cues [5]. Pan et al. [9] show that on the FlowLearn
benchmark GPT-4V and Claude-3 achieve only F1 = 0.22 and 0.30, respectively, when translating
simulated flowcharts into Mermaid code or answering edge-oriented questions; most errors stem
from missed arrowheads, confusion between incoming and outgoing edges, and OCR noise that
propagates through the reasoning process.
When Chen et al. [14] re-evaluated the AI2 Diagrams (AI2D) corpus originally introduced by
Kembhavi et al. [15], GPT-4V answered just 75.3% of questions on the AI2D-Test split, well below
human-level performance. Chen et al. attribute the gap to questions that can be solved without
genuine visual reasoning and to potential data leakage, while follow-up error analyses in diagram-
specific benchmarks (e.g., FlowLearn) highlight persistent failures to associate arrows, call-outs, and
legend entries with their correct textual referents.
Data-visualisation benchmarks reinforce the trend. On the ChartInsights low-level ChartQA bench-
mark [16], GPT-4V answers only 56.1% of questions with a vanilla prompt (rising to 66.4% under
a Yes/No prompt), and simple corruptions—most notably median blur—degrade accuracy by about
15 percentage points.
On the larger, real-world CharXiv corpus, Wang et al. [17] show that GPT-4o answers only 47.1% of
reasoning questions correctly. Similarly, Xia et al. [18] report that GPT-4V attains just 33% accuracy
on the ChartX question-answering task and no more than 27.2 AP on the accompanying Structured
Chart Representation Matching (SCRM) benchmark, which measures table reconstruction quality.
Across these datasets (flowcharts [9], textbook illustrations [14], and statistical charts [17, 18]), four
failure modes recur: (i) entity–label misalignment caused by invisible coordinates, (ii) cascading
OCR errors, (iii) ambiguity in arrow or series direction, and (iv) acute sensitivity to minor visual
perturbations such as color-map changes or compression artefacts. A purely end-to-end multimodal
transformer therefore lacks the geometry channel required for reliable diagrammatic reasoning,
motivating approaches that preserve spatial layout explicitly.
2.2 Object-Detection–Driven Flowchart Interpretation
Many studies mitigate VLMs’ topological blind spots via a two-stage recipe: first localize the
entities, then let the language model reason. The FlowLearn baseline exemplifies this design: a
detector–plus–OCR front-end extracts node boxes and labels, which are concatenated into a prompt
for GPT-4V; node-level detection is accurate, yet edge-level F1 drops to 0.22 because the prompt
conveys no spatial cues [9]. GenFlowchart strengthens the vision stage by replacing the task-specific
detector with the zero-shot SAM proposed by Kirillov et al. [11]. SAM’s universal masks are
collapsed to bounding boxes, optical character recognition is applied, and the resulting {mask, text}
pairs are forwarded to GPT-3.5-Turbo, following the pipeline of Arbaz et al. [12]. Although this
design boosts embedding-based textual-similarity scores, our replication shows that it still misorders
branches whenever two nodes share the same axis—a structural error the original paper does not
report. For hand-drawn sketches, Schäfer et al. [13] introduce Arrow R-CNN, which augments
Faster R-CNN with head–tail keypoint predictors and halves localization error on four datasets, but
its output flows into a rule-based graph builder rather than a modern VLM.
What unites these pipelines is the disappearance of the geometry channel: bounding-box centers,
pairwise distances, and arrow orientations are either discarded or embedded latently, so the language
model must hallucinate topology from an unordered token list. Work on natural images confirms that
explicit coordinates can help—Shikra encodes clicked points as textual tags [19], ChatSpot leverages
instruction tuning for precise region references [20], and RegionBLIP injects positional features as
soft prompts [21]—yet none of these systems target graph-based diagrams such as flowcharts.
A separate research line removes the interface altogether by predicting structure end-to-end. GRCNN
outputs node categories and an adjacency matrix in a single forward pass before emitting code with
a syntactic decoder [22]. FloCo-T5 is trained on 11,884 flowchart images and surpasses a vanilla
CodeT5 baseline with 67.4 BLEU, 75.7 CodeBLEU, and 20% Exact Match (EM) [2]. The authors
also show a sharp drop to 21.4 BLEU on 40 hand-drawn diagrams, indicating limited robustness to
noisy or off-distribution inputs. Because FloCo-T5 directly decodes a fixed "FloCo" token stream
into Python, it has not yet been evaluated for integrating external knowledge or chain-of-thought
reasoning (our observation).
Previous work splits into two extremes: (i) detector-plus-LLM pipelines that drop coordinates before
reasoning, and (ii) end-to-end models that predict the full graph in one shot but sacrifice linguistic
flexibility. We introduce the first pipeline that retains every entity as a (text, x, y) tuple and feeds this
sequence directly to a VLM, closing the gap between spatial fidelity and expressive reasoning.
3 Methodology
Our proposed inference pipeline comprises seven sequential stages: text extraction via OCR, object
detection, integration of text and objects, association of arrows with their start and end points, linking
objects to arrows, prompt construction reflecting graph structure, and finally, question generation
and VLM-based reasoning. The overall architecture is illustrated in Figure 1.
3.1 Text Extraction via OCR
First, we apply the Azure AI Document Intelligence service to each input flowchart image to extract
textual content and corresponding bounding box coordinates. Off-the-shelf OCR tools were used
without modification, leveraging their robust performance on printed and scanned text. The extracted
texts and their spatial locations form the initial input to the downstream processes.
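For illustration, one way to obtain these (text, bounding-box) records is sketched below, assuming the azure-ai-formrecognizer Python SDK and the prebuilt read model; the SDK choice and the helper names are illustrative, not part of the released implementation.

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential


def extract_text_boxes(image_path: str, endpoint: str, key: str):
    """OCR a flowchart image and return [{"text": ..., "bbox": (x0, y0, x1, y1)}, ...]."""
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(image_path, "rb") as f:
        result = client.begin_analyze_document("prebuilt-read", document=f).result()

    records = []
    for page in result.pages:
        for line in page.lines:
            # Collapse the polygon returned by the service into an axis-aligned bounding box.
            xs = [p.x for p in line.polygon]
            ys = [p.y for p in line.polygon]
            records.append({"text": line.content,
                            "bbox": (min(xs), min(ys), max(xs), max(ys))})
    return records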
3.2 Object Detection
We then detect key flowchart elements—such as processes, decisions, and arrows—using a fine-
tuned object-detection model. Specifically, we adopt the DAMO-YOLO model [23], which is
distributed under the Apache 2.0 license and delivers competitive accuracy comparable to state-of-
the-art detection models.
We annotate nine object classes within the flowcharts:
1. Text
2. Arrow
3. Terminator
4. Data
5. Process
6. Decision
7. Connection
8. Arrow Start
9. Arrow End
For classes 1–7, standard bounding boxes encapsulate the relevant regions. For Arrow Start and
Arrow End, we annotate small bounding boxes tightly around the visual start and end points of each
arrow, respectively. It is noteworthy that arrows themselves can sometimes span very large bounding
boxes, reflecting their visual prominence.
Although text was initially annotated as an object, in the final implementation, we instead relied
exclusively on the coordinates obtained from the OCR service for text information. Of the total 99
annotated diagrams, 30 were reserved for testing, while the remaining 69 were used for training and
validation.
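For concreteness, the nine classes and the per-detection record consumed by the later stages can be represented as follows; the class names follow the list above, while the Detection container itself is an illustrative assumption rather than a prescribed data structure.

from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class FlowchartClass(str, Enum):
    TEXT = "Text"
    ARROW = "Arrow"
    TERMINATOR = "Terminator"
    DATA = "Data"
    PROCESS = "Process"
    DECISION = "Decision"
    CONNECTION = "Connection"
    ARROW_START = "Arrow Start"
    ARROW_END = "Arrow End"


@dataclass
class Detection:
    cls: FlowchartClass
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image coordinates
    score: float = 1.0
    text: Optional[str] = None  # filled in by the text-integration stage (Sec. 3.3)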
3.3 Integration of Text and Object Information
Next, we merge the OCR-derived text information with the detected object information. We exclude
arrows (Arrow, Arrow Start, and Arrow End) from this integration step.
For each text bounding box, if it overlaps by more than 50% with a detected object bounding box, the
text is assigned to that object. This step effectively binds semantic content to each flowchart element.
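A minimal sketch of this rule, reusing the Detection records above and interpreting "overlaps by more than 50%" as more than half of the text box's area lying inside the object box; the helper names and the first-match assignment are illustrative assumptions.

def overlap_ratio(text_box, obj_box):
    """Fraction of the text box's area covered by the object box."""
    tx0, ty0, tx1, ty1 = text_box
    ox0, oy0, ox1, oy1 = obj_box
    iw = max(0.0, min(tx1, ox1) - max(tx0, ox0))
    ih = max(0.0, min(ty1, oy1) - max(ty0, oy0))
    text_area = max(1e-9, (tx1 - tx0) * (ty1 - ty0))
    return (iw * ih) / text_area


def attach_text(ocr_records, detections, threshold=0.5):
    """Assign each OCR text to the first non-arrow object covering more than half of it."""
    arrow_classes = {FlowchartClass.ARROW, FlowchartClass.ARROW_START, FlowchartClass.ARROW_END}
    for rec in ocr_records:
        for det in detections:
            if det.cls in arrow_classes:
                continue
            if overlap_ratio(rec["bbox"], det.bbox) > threshold:
                det.text = f"{det.text} {rec['text']}" if det.text else rec["text"]
                break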
3.4 Arrow Association
We then associate each detected Arrow with its corresponding start and end points. An Arrow is linked
to an Arrow Start and Arrow End based on two criteria:
1. The Arrow Start and Arrow End must be located near the edges of the Arrow’s bounding
box.
2. The Intersection-over-Union (IoU) between the bounding box formed by the Arrow Start
and Arrow End and the detected Arrow’s bounding box must exceed 0.5.
This matching process enables us to recover the directional information inherent in flowcharts.
Additionally, textual annotations such as “yes” or “no” that are not directly associated with any object
but are located near an Arrow are attached to that Arrow.
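The two criteria translate into simple geometry, as in the rough sketch below; the pixel tolerance `tol`, the first-match pairing strategy, and the use of the box spanned by the two endpoint boxes are our illustrative assumptions, and the attachment of nearby "yes"/"no" labels is omitted.

def iou(a, b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def near_box_edge(point, box, tol):
    """True if the point lies within `tol` pixels of the box boundary."""
    x, y = point
    x0, y0, x1, y1 = box
    inside_expanded = (x0 - tol <= x <= x1 + tol) and (y0 - tol <= y <= y1 + tol)
    dist_to_edge = min(abs(x - x0), abs(x - x1), abs(y - y0), abs(y - y1))
    return inside_expanded and dist_to_edge <= tol


def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def associate_arrow(arrow, starts, ends, tol=10, iou_thr=0.5):
    """Pick an (Arrow Start, Arrow End) pair satisfying both criteria of Sec. 3.4."""
    for s in starts:
        for e in ends:
            # Criterion 1: both endpoints lie near the edges of the arrow's bounding box.
            if not (near_box_edge(center(s.bbox), arrow.bbox, tol)
                    and near_box_edge(center(e.bbox), arrow.bbox, tol)):
                continue
            # Criterion 2: the box spanned by the two endpoints overlaps the arrow box (IoU > 0.5).
            if iou(union_box(s.bbox, e.bbox), arrow.bbox) > iou_thr:
                return {"det": arrow, "start": s, "end": e, "label": None}
    return None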
3.5 Linking Objects and Arrows
Once arrows have been associated with their start and end points, we link non-arrow objects (e.g.,
processes, decisions) to arrows.
For each non-arrow object, we associate any Arrow Start located near its bounding box edges as an
outgoing connection, and any Arrow End located near its edges as an incoming connection. This
step reconstructs the underlying control flow or decision logic of the diagram.
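Continuing the sketch above, the linking rule can be written by reusing `center` and `near_box_edge`; the source/target bookkeeping is an assumption about how the recovered connections are stored, not the released implementation.

def link_objects_to_arrows(objects, arrows, tol=10):
    """Resolve each arrow's source and target objects (Sec. 3.5).

    `objects` are non-arrow Detections; `arrows` are the dicts returned by associate_arrow.
    """
    for arrow in arrows:
        for obj in objects:
            # An Arrow Start near an object's edges marks an outgoing connection from it...
            if arrow["start"] and near_box_edge(center(arrow["start"].bbox), obj.bbox, tol):
                arrow["source"] = obj
            # ...and an Arrow End near an object's edges marks an incoming connection to it.
            if arrow["end"] and near_box_edge(center(arrow["end"].bbox), obj.bbox, tol):
                arrow["target"] = obj
    return arrows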
3.6 Prompt Construction
Using the extracted text, object categories, and relational information, we generate structured prompts
that represent the recovered graph structure. For each object, the prompt encodes:
1. The object category (e.g., process, decision)
2. The object’s text content
3. The preceding steps (connected via incoming arrows)
4. The subsequent steps (connected via outgoing arrows)
These graph-aware prompts are designed to make explicit the topology that is implicit in the visual
layout of the flowchart.
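The exact prompt wording is not reproduced here; the sketch below simply serialises the four fields listed above for each object, with hypothetical formatting.

def build_graph_prompt(objects, arrows):
    """Serialise the recovered graph into a structured, graph-aware prompt (Sec. 3.6)."""
    lines = []
    for obj in objects:
        incoming = [a for a in arrows if a.get("target") is obj and a.get("source")]
        outgoing = [a for a in arrows if a.get("source") is obj and a.get("target")]
        preceding = [a["source"].text or "(untitled)" for a in incoming]
        following = [(a["target"].text or "(untitled)")
                     + (f" (if {a['label']})" if a.get("label") else "")
                     for a in outgoing]
        lines.append(f"- Category: {obj.cls.value}")
        lines.append(f"  Text: {obj.text or '(no text)'}")
        lines.append(f"  Preceding steps: {', '.join(preceding) or 'none'}")
        lines.append(f"  Next steps: {', '.join(following) or 'none'}")
    return "\n".join(lines)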
3.7 Question Generation and VLM Inference
Finally, we formulate two types of input for the GPT-4o VLM: one without explicit graph information
and one incorporating the constructed graph prompts. For each test flowchart, we generate three
types of questions:
1. Next-step prediction: In this flowchart diagram, what is the next step after ’xxx’?
2. Conditional branch prediction: In this flowchart diagram, if ’xxx’ is ’yyy’, what is the
next step?
3. Preceding-step discrimination: In this flowchart diagram, which of the steps comes before
’xxx’, excluding ’zzz’?
We pass these questions along with the relevant flowchart prompt to the VLM and retrieve its
answers. Answer correctness is determined by comparing the VLM’s response against a human-
annotated ground-truth answer set, with the verification itself handled via an additional LLM-assisted
comparison step.
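The inference call itself can be as simple as the following sketch, which sends the image and the graph prompt to GPT-4o through the OpenAI Python SDK; the prompt layout and model identifier are illustrative rather than the verbatim configuration used in the experiments.

import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_flowchart_question(image_path, graph_prompt, question, model="gpt-4o"):
    """Query the VLM with the image plus the graph-structured prompt and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Flowchart structure:\n{graph_prompt}\n\nQuestion: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

For the input without explicit graph information, the same call would simply omit the graph block and send the question with the image alone.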
4 Results
4.1 Effectiveness of OCR and Detection Model in FlowchartQA
We compared two approaches for flowchart-based question answering (FlowchartQA): (1) an OCR
and detection model combination (Model Ocr-Dec) and (2) a no-OCR and no-detection baseline
using only raw images (Model No-Ocr-Dec). On a specially annotated corpus of 90 questions,
we evaluated how explicitly recovering arrow directions and node connections impacts overall QA
accuracy.
4.2 Experimental Setup
We conducted experiments on a manually annotated corpus consisting of 30 flowchart diagrams.
Each diagram was associated with three types of questions, totaling 90 questions across different
diagram sizes (Large, Medium, and Small). The detailed settings are summarized in Table 1.
Table 1: Summary of Experimental Settings
Item                 Details
Corpus               30 manually annotated flowcharts (10 Large / 10 Medium / 10 Small);
                     each diagram paired with three types of questions, totaling 90 questions.
Question Types       Type 1: Next Step; Type 2: Conditional Branch; Type 3: Previous Step
Size Categories      Large (>22 arrows), Medium (13–22 arrows), Small (<13 arrows)
Model Ocr-Dec        OCR + Detection: Azure AI Document Intelligence OCR + DAMO-YOLO object detector;
                     structured prompt and image input to GPT-4o.
Model No-Ocr-Dec     Baseline: direct prompt and image input to GPT-4o (no OCR, no detection).
Evaluation Metric    Primarily human evaluation, supplemented by LLM-based scoring.
To evaluate the correctness of answers generated by the LLM, we compared them with manually
prepared ground-truth answers using two methods: human judgment (primary) and LLM-based
evaluation (reference).
For the human evaluation, correctness was determined by comparing the predicted object B in the
flowchart with the ground-truth object described as "A is B." The evaluation was case-insensitive
and ignored punctuation such as periods.
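Although the primary judgments were made by humans, the stated rule (case-insensitive comparison that ignores punctuation) corresponds to a normalisation like the following helper, which is shown purely for illustration and is not part of the released protocol.

import string


def normalize(answer: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace before comparison."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def matches(predicted_b: str, reference_b: str) -> bool:
    return normalize(predicted_b) == normalize(reference_b)


# e.g. matches("Check inventory.", "check inventory") -> True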
For the LLM-based evaluation, we used GPT-4o to assess the semantic similarity between the LLM's
response and the reference answer. A prompt was designed to determine whether the two answers
were essentially equivalent in meaning.
4.3 Overall Accuracy
Table 2 summarizes the overall accuracy across all 90 questions, aggregating results from Type 1,
Type 2, and Type 3. Human evaluation is treated as the primary metric, while automatic scoring
using an LLM is provided for reference.
Table 2: Overall accuracy (%) and raw counts across all question types (n = 90).
Question Type    Ocr-Dec (Human)    No-Ocr-Dec (Human)    Ocr-Dec (LLM)    No-Ocr-Dec (LLM)
All (Total)      88.9 (80/90)       80.0 (72/90)          78.9 (71/90)     75.6 (68/90)
4.4 Accuracy by Question Type
Table 3 summarizes the accuracy results for each question type.
Table 3: Accuracy (%) and raw counts for each question type.
Question Type            Ocr-Dec (Human)    No-Ocr-Dec (Human)    Ocr-Dec (LLM)    No-Ocr-Dec (LLM)
Type 1 (Next Step)       100.0 (30/30)      83.3 (25/30)          93.3 (28/30)     76.7 (23/30)
Type 2 (Cond. Branch)    90.0 (45/50)       82.0 (41/50)          84.0 (42/50)     86.0 (43/50)
Type 3 (Previous Step)   50.0 (5/10)        60.0 (6/10)           10.0 (1/10)      20.0 (2/10)
4.5 Accuracy by Diagram Size
Table 4 shows the accuracy categorized by diagram size. Again, human evaluation is treated as
primary, with LLM automatic scoring shown for reference.
Table 4: Accuracy (%) by diagram size with supporting counts.
Diagram Size    Ocr-Dec (Human)    No-Ocr-Dec (Human)    Ocr-Dec (LLM)    No-Ocr-Dec (LLM)
Large           80.0 (24/30)       66.7 (20/30)          63.3 (19/30)     50.0 (15/30)
Medium          93.3 (28/30)       80.0 (24/30)          80.0 (24/30)     80.0 (24/30)
Small           93.3 (28/30)       93.3 (28/30)          93.3 (28/30)     96.7 (29/30)
5 Discussion
The experimental results revealed several important insights. First, for Type 1 (Next Step) questions,
the OCR and detection model achieved perfect accuracy (100%) according to human evaluation,
significantly outperforming the No-Ocr-Dec baseline by 16.7 percentage points. LLM-based scoring
similarly showed large gains (+16.7 pp), validating the robustness of this improvement.
For Type 2 (Conditional Branch) questions, Model Ocr-Dec improved by 8.0 percentage points
based on human evaluation, though LLM automatic scoring showed almost no advantage. This
discrepancy suggests that minor variations in textual explanations, which human evaluators can
tolerate, may cause automatic scorers to incorrectly penalize correct answers.
For Type 3 (Previous Step) questions, both human and LLM evaluations revealed low accuracy, with
Model No-Ocr-Dec slightly outperforming Model Ocr-Dec. This confirms that execution-order
reasoning remains difficult without explicit graph structure input.
Regarding diagram size, Model Ocr-Dec outperformed the baseline on Large and Medium diagrams
in human evaluations. Improvements were smaller or absent for Small diagrams, which tend to have
simpler structures where explicit arrow recovery has less impact.
5.1 Error Analysis and Improvement Strategies
Error analysis highlighted several recurring failure patterns. A primary source of error was mis-
linking of arrow endpoints, sometimes connecting decision branches (e.g., “Yes”/“No”) incorrectly.
Introducing an IoU-based post-correction method after detection is expected to address this issue.
Another common error was OCR over-segmentation, where contiguous phrases were split into
multiple fragments. Distance-based clustering of bounding boxes could help merge these fragmented
texts.
Furthermore, failure to recover complete graph topology, particularly when nodes had multiple
incoming edges, often led to incorrect reasoning. Representing the flowchart as a JSON-encoded
directed graph, with topological ordering explicitly embedded in prompts, is a promising solution.
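As a hypothetical illustration of this remedy, the recovered graph could be embedded in the prompt as JSON along the following lines; the node names and field names are invented for the example.

import json

# Hypothetical JSON-encoded directed graph with an explicit topological order (Sec. 5.1).
flowchart_graph = {
    "nodes": [
        {"id": "n1", "category": "terminator", "text": "Start"},
        {"id": "n2", "category": "decision", "text": "Stock available?"},
        {"id": "n3", "category": "process", "text": "Ship order"},
        {"id": "n4", "category": "process", "text": "Reorder stock"},
    ],
    "edges": [
        {"from": "n1", "to": "n2"},
        {"from": "n2", "to": "n3", "label": "yes"},
        {"from": "n2", "to": "n4", "label": "no"},
    ],
    "topological_order": ["n1", "n2", "n3", "n4"],
}
prompt_block = json.dumps(flowchart_graph, indent=2)  # embedded verbatim in the prompt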
Finally, it should be emphasized that LLM automatic scoring showed limitations in handling para-
phrases and extended explanations. Therefore, human evaluation was adopted as the principal
measure of accuracy, and LLM results were treated as supplementary indicators.
6 Conclusion
This study demonstrated that combining OCR and flowchart-specific object detection substantially
improves question answering accuracy for flowcharts, particularly in large diagrams and next-step
reasoning tasks (Type 1). By explicitly recovering text content and arrow directions, the proposed
method enabled LLMs to better understand the structural relationships embedded in flowchart dia-
grams.
Evaluation was primarily conducted via human judgment, supplemented by automatic scoring using
a secondary LLM. Human evaluation revealed that the OCR and detection model achieved perfect
accuracy for next-step questions (Type 1) and substantial improvements for conditional branch
questions (Type 2), confirming the effectiveness of explicitly structured input. However, LLM
automatic evaluation sometimes underreported accuracy, especially when model outputs included
extended explanations, highlighting the limitations of strict string-matching approaches for complex
reasoning tasks.
While significant gains were observed for next-step questions, challenges remain for conditional
branching (Type 2) and previous-step identification (Type 3). In these cases, simple text extraction
and object localization were insufficient; fine-grained understanding of control flow, decision logic,
and execution order is critical. Further improvements will require:
• High-precision detection of arrow start and end points to prevent directional ambiguity
• Explicit representation of the flowchart's graph structure in prompts, allowing the LLM to
reason over paths and dependencies
Moreover, the error analysis highlighted additional areas for refinement, such as mitigating OCR
over-segmentation errors and incorporating graph-based topological information directly into the
reasoning pipeline. Addressing these challenges is expected not only to boost performance on
complex reasoning tasks but also to improve system robustness when applied to handwritten diagrams,
BPMN, and industrial schematics.
Finally, the modular pipeline proposed here—separating visual parsing from reasoning—paves the
way for scalable, domain-adaptive flowchart understanding systems. Future work will explore
enhancing graph-structured prompting, developing confidence-aware reasoning mechanisms, and
improving automatic evaluation methods to better handle paraphrastic or explanatory outputs, thus
enabling more reliable and generalizable deployment across diverse real-world settings.
References
[1] D. Hooshyar, R. B. Ahmad, M. Yousefi, F. D. Yusop, and S.-J. Horng. A flowchart-based intelligent tutoring system for improving problem-solving skills of novice programmers. Journal of Computer Assisted Learning, 31(4):345–361, April 2015. ISSN 1365-2729. doi: 10.1111/jcal.12099. URL http://dx.doi.org/10.1111/jcal.12099.
[2] Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, and Anand Mishra. Towards making flowchart images machine interpretable. 2025. URL http://arxiv.org/pdf/2501.17441.
[3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. 2020. URL http://arxiv.org/pdf/2009.03300.
[4] OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, et al. GPT-4o system card. 2024. URL http://arxiv.org/pdf/2410.21276.
[5] OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, 2025. Accessed: 2025-04-30.
[6] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. 2023. URL http://arxiv.org/pdf/2311.16502.
[7] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. 2023. URL http://arxiv.org/pdf/2310.02255.
[8] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. 2020. URL http://arxiv.org/pdf/2007.00398.
[9] Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, and Longin Jan Latecki. FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding. IOS Press, October 2024. ISBN 9781643685489. doi: 10.3233/faia240473. URL http://dx.doi.org/10.3233/FAIA240473.
[10] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. 2015. URL http://arxiv.org/pdf/1506.02640.
[11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. 2023. URL http://arxiv.org/pdf/2304.02643.
[12] Abdul Arbaz, Heng Fan, Junhua Ding, Meikang Qiu, and Yunhe Feng. GenFlowchart: Parsing and Understanding Flowchart Using Generative AI, pages 99–111. Springer Nature Singapore, 2024. ISBN 9789819754922. doi: 10.1007/978-981-97-5492-2_8. URL http://dx.doi.org/10.1007/978-981-97-5492-2_8.
[13] Bernhard Schäfer, Margret Keuper, and Heiner Stuckenschmidt. Arrow R-CNN for handwritten diagram recognition. International Journal on Document Analysis and Recognition (IJDAR), 24(1–2):3–17, February 2021. ISSN 1433-2825. doi: 10.1007/s10032-020-00361-1. URL http://dx.doi.org/10.1007/s10032-020-00361-1.
[14] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? 2024. URL http://arxiv.org/pdf/2403.20330.
[15] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. 2016. URL http://arxiv.org/pdf/1603.07396.
[16] Yifan Wu, Lutao Yan, Leixian Shen, Yunhai Wang, Nan Tang, and Yuyu Luo. ChartInsights: Evaluating multimodal large language models for low-level chart question answering. 2024. URL http://arxiv.org/pdf/2405.07001.
[17] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. 2024. URL http://arxiv.org/pdf/2406.18521.
[18] Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Peng Ye, Min Dou, Botian Shi, et al. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. 2024. URL http://arxiv.org/pdf/2402.12185.
[19] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. 2023. URL http://arxiv.org/pdf/2306.15195.
[20] Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, et al. ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning. 2023. URL http://arxiv.org/pdf/2307.09474.
[21] Qiang Zhou, Chaohui Yu, Shaofeng Zhang, Sitong Wu, Zhibing Wang, and Fan Wang. RegionBLIP: A unified multi-modal pre-training framework for holistic and regional comprehension. 2023. URL http://arxiv.org/pdf/2308.02299.
[22] Lin Cheng and Zijiang Yang. GRCNN: Graph recognition convolutional neural network for synthesizing programs from flow charts. 2020. URL http://arxiv.org/pdf/2011.05980.
[23] Xianzhe Xu, Yiqi Jiang, Weihua Chen, Yilun Huang, Yuan Zhang, and Xiuyu Sun. DAMO-YOLO: A report on real-time object detection design. 2022. URL http://arxiv.org/pdf/2211.15444.
A Additional Evaluation Results
We provide here additional results and analysis that complement the main paper, including per-
category detection performance and relaxed IoU evaluations.
A.1 Detection Results
We evaluated the detection performance of the DAMO-YOLO model on our custom test dataset using
the COCO evaluation metrics. Table 5 shows the Average Precision (AP) and Average Recall (AR)
across different object sizes under relaxed IoU thresholds (0.10–0.50). The overall AP was 0.836
and AR reached 0.925, with large objects achieving the highest recall (AR = 0.984).
Table 5: Overall AP and AR (IoU=0.10–0.50) for different object sizes
Metric              All      Small    Medium   Large
AP@0.10–0.50        0.836    0.785    0.832    0.831
AR@maxDets=100      0.925    0.897    0.872    0.984
Table 6 reports category-wise mean Average Precision (mAP) under the standard COCO setting (IoU
= 0.50–0.95). The Arrow class achieved moderate performance (mAP = 0.4476). However, the
average mAP for all arrow-related categories including Arrow Start and Arrow End was significantly
lower (mAP = 0.2349) compared to non-arrow categories (mAP = 0.6531).
Table 6: Per-category mAP (IoU=0.50–0.95)
Category mAP
Arrow 0.4476
Arrow-related (Arrow, Arrow Start, Arrow End) 0.2349
Non-arrow categories 0.6531
All categories 0.5137
Since the bounding boxes for Arrow Start and Arrow End are very small, their detection accuracy
tends to be underestimated when evaluated with the standard IoU range of 0.50–0.95. Therefore, we
also evaluated them under a lower IoU range of 0.10–0.50. The results are shown in Table 7.
Table 7: mAP of small objects under relaxed IoU (0.10–0.50)
Category       mAP@0.10–0.50
Arrow Start    0.7541
Arrow End      0.8373
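For reference, a relaxed-IoU evaluation of this kind can be reproduced with pycocotools roughly as follows; the tool choice and file names are illustrative placeholders rather than the exact setup used to produce the numbers above.

import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def evaluate_relaxed_iou(gt_json="annotations.json", det_json="detections.json"):
    """COCO-style bbox evaluation under the relaxed IoU range 0.10-0.50."""
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(det_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    # Replace the default 0.50:0.95 thresholds with 0.10:0.50 in steps of 0.05.
    ev.params.iouThrs = np.arange(0.10, 0.50 + 1e-9, 0.05)
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats  # stats[0] is the AP averaged over the relaxed thresholds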
B LLM-as-a-Judge Evaluation Details
For the LLM-based evaluation described in the main paper, we used the following prompt to assess
the similarity between model-generated answers and reference answers:
You are a strict judge tasked with the following:
1. A question (Question)
2. A reference answer (Reference Answer)
3. A model output (Model Output)
Please evaluate the model output by following these steps:
### Step 1: Analyze the Answers
- First, compare the reference answer and the model output.
- Determine whether they essentially match in meaning or reasoning, or if the model
output is otherwise correct based on its logic and evidence.
- Provide a thorough and logical assessment, noting any gaps or inconsistencies.
### Step 2: Final Judgment
- If the model output is substantially the same as the reference answer or
equivalently valid judge it as correct.
- If there are clear mistakes, omissions, or inconsistencies, judge it as incorrect.
### Step 3: Output in the Specified Schema
- Please output your evaluation result strictly in the following JSON format:
Here, [Reference Answer] and [LLM Answer] were replaced with the actual reference and LLM-
generated answers, respectively. We also utilized Structured Outputs to ensure consistent formatting
of the evaluation results in JSON format, making the automated processing of judgments more
reliable.
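For illustration, the judge call can be wired up with Structured Outputs as in the following sketch; the verdict schema (fields `reasoning` and `is_correct`) is an assumption, since the exact JSON schema is not reproduced here, and `JUDGE_PROMPT` stands in for the evaluation prompt shown above.

from openai import OpenAI
from pydantic import BaseModel

JUDGE_PROMPT = "You are a strict judge tasked with the following: ..."  # full prompt shown above


class JudgeVerdict(BaseModel):
    reasoning: str
    is_correct: bool


client = OpenAI()


def judge(question, reference_answer, model_output, model="gpt-4o"):
    """LLM-as-a-Judge call whose output is constrained to the JudgeVerdict schema."""
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference Answer: {reference_answer}\n"
                f"Model Output: {model_output}")},
        ],
        response_format=JudgeVerdict,
    )
    return completion.choices[0].message.parsed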