Arrow-Guided VLM: Enhancing Flowchart
Understanding via Arrow Direction Encoding
Takamitsu Omasa Ryo Koshihara Masumi Morishige
Galirage Inc.
[email protected]
Abstract
Flowcharts are indispensable tools in software design and business-process anal-
ysis, yet current Vision Language Models (VLMs) frequently misinterpret the
directional arrows and graph topology that set these diagrams apart from natural
images. This paper introduces a seven-stage pipeline, grouped into three broader
processes: (1) arrow-aware detection of nodes and arrow endpoints; (2) Optical
Character Recognition (OCR) to extract node text; and (3) construction of a struc-
tured prompt that guides the VLM. Tested on a 90-question benchmark distilled
from 30 annotated flowcharts, our method raises overall accuracy from 80% to 89%
(+9 pp), a sizeable and statistically significant gain achieved without task-specific
fine-tuning of the VLMs. The benefit is most pronounced for next-step queries
(25/30 → 30/30; 100%, +17 pp); branch-result questions improve more modestly,
and before-step queries remain difficult. A parallel evaluation with an LLM-as-a-
Judge protocol shows the same trends, reinforcing the advantage of explicit arrow
encoding. Limitations include dependence on detector and OCR precision, the
small evaluation set, and residual errors at nodes with multiple incoming edges.
Future work will enlarge the benchmark with synthetic and handwritten flowcharts
and assess the approach on Business Process Model and Notation (BPMN) and
Unified Modeling Language (UML).
1 Introduction
Flowcharts distill complex control flow, decision logic, and data transformations into a handful of
boxes and arrows. In software engineering and business-process management, these diagrams are
more than illustrative artifacts: they enable automatic code generation and serve as effective
pedagogical tools [1, 2].
Within just three years, Large Language Models (LLMs) have advanced at an unprecedented pace:
accuracy on the 57-subject Massive Multitask Language Understanding (MMLU) suite climbed
from 43.9% with GPT-3 (2021) [3] to nearly 89% with GPT-4o [4]. VLMs likewise achieve
leading results across diverse multimodal benchmarks; for instance, GPT-4o excels on Mas-
sive Multi-discipline Multimodal Understanding (MMMU), MathVista, and Document Visual Ques-
tion Answering (DocVQA) [5–8]. However, this accuracy deteriorates markedly once explicit
graph-topology reasoning is required. On the simulated subset of the FlowLearn benchmark, con-
verting flowcharts to Mermaid code still proves challenging: on the link-level F1 metric, Claude-3
Opus scores 0.30 and GPT-4 with Vision (GPT-4V) only 0.22 (100-sample subset; [9], Table 7),
underscoring how current VLMs struggle to recover edge relationships.
Previous approaches can be categorized into two main types. First, some studies couple off-the-
shelf object detectors such as YOLO [10] with OCR; the resulting bounding boxes and tokens
are concatenated into a prompt for a VLM, yielding only modest gains over detector-free baselines.
Figure 1: Overview of the seven-stage pipeline: OCR, object detection, text–object fusion, arrow
anchoring, node–arrow linking, graph-structured prompt generation, and VLM-based reasoning.
Second, other work relies on zero-shot segmentation models, most prominently the Segment Anything
Model (SAM) [11]. GenFlowchart [12], for instance, converts SAM masks into bounding boxes, adds
OCR, and queries GPT-3.5 Turbo, yet still suffers from arrow-ordering ambiguities and localization
noise. A complementary strand improves the detector itself—arrow-aware models like Arrow R-
CNN halve localization errors on handwritten diagrams [13]—but these specialised detectors have
not been fused with LLMs. Their outputs feed rule-based pipelines, so branch ordering and multi-step
reasoning remain unresolved. To close this gap, we propose the first detector–VLM fusion pipeline
for flowcharts (Fig. 1). First, a fine-tuned, arrow-aware detector localises nodes and arrowheads. The
OCR stage then extracts textual labels. Finally, the (text, bounding-box) pairs are serialised into a coordinate-rich
prompt that, together with the image, is fed to GPT-4o. Unlike prior work that lists raw labels, we
annotate each token with its normalised center of mass, allowing the VLM to infer edge orientations
through the geometric priors internalised during pre-training.
Motivated by these gaps, this study investigates whether tightly coupling a flowchart-aware detector
with a VLM via a coordinate-rich prompt can close the reasoning gap on diagrammatic tasks. To
investigate this question, an arrow-aware detector is fused with GPT-4o and evaluated on a new
90-question suite spanning three query types and diverse diagram complexities, yielding up to +9
pp overall and 100% accuracy on next-step queries.
These gains rest on a relatively small test set—90 questions from 30 diagrams—and remain bounded
by the detection model’s localization accuracy. These results are therefore viewed as a first step;
scaling the benchmark, adapting the pipeline to large public corpora such as FlowLearn [9], and
exploring detector–VLM co-training are left for future work.
2 Related Work
2.1 Limitations of End-to-End VLMs on Diagram Tasks
VLMs such as GPT-4o achieve state-of-the-art scores on natural-image VQA and captioning bench-
marks; however, their accuracy drops sharply when tasks demand explicit reasoning over graph topol-
ogy or precise measurement rather than free-form visual cues [5]. Pan et al. [9] show that on the FlowLearn
benchmark GPT-4V and Claude-3 achieve only F1 = 0.22 and 0.30, respectively, when translating
simulated flowcharts into Mermaid code or answering edge-oriented questions; most errors stem
from missed arrowheads, confusion between incoming and outgoing edges, and OCR noise that
propagates through the reasoning process.
When Chen et al. [14] re-evaluated the AI2 Diagrams (AI2D) corpus originally introduced by
Kembhavi et al. [15], GPT-4V answered just 75.3% of questions on the AI2D-Test split, well below
human-level performance. Chen et al. attribute the gap to questions that can be solved without
genuine visual reasoning and to potential data leakage, while follow-up error analyses in diagram-
specific benchmarks (e.g., FlowLearn) highlight persistent failures to associate arrows, call-outs, and
legend entries with their correct textual referents.
Data-visualisation benchmarks reinforce the trend. On the ChartInsights low-level ChartQA bench-
mark [16], GPT-4V answers only 56.1% of questions with a vanilla prompt (rising to 66.4% under
a Yes/No prompt), and simple corruptions—most notably median blur—degrade accuracy by about
15 percentage points.
On the larger, real-world CharXiv corpus, Wang et al. [17] show that GPT-4o answers only 47.1% of
reasoning questions correctly. Similarly, Xia et al. [18] report that GPT-4V attains just 33% accuracy
on the ChartX question-answering task and no more than 27.2 AP on the accompanying Structured
Chart Representation Matching (SCRM) benchmark, which measures table reconstruction quality.
Across these datasets (flowcharts [9], textbook illustrations [14], and statistical charts [17, 18]), four
failure modes recur: (i) entity–label misalignment caused by invisible coordinates, (ii) cascading
OCR errors, (iii) ambiguity in arrow or series direction, and (iv) acute sensitivity to minor visual
perturbations such as color-map changes or compression artefacts. A purely end-to-end multimodal
transformer therefore lacks the geometry channel required for reliable diagrammatic reasoning,
motivating approaches that preserve spatial layout explicitly.
2.2 Object-Detection–Driven Flowchart Interpretation
Many studies mitigate VLMs’ topological blind spots via a two-stage recipe: first localize the
entities, then let the language model reason. The FlowLearn baseline exemplifies this design: a
detector–plus–OCR front-end extracts node boxes and labels, which are concatenated into a prompt
for GPT-4V; node-level detection is accurate, yet edge-level F1 drops to 0.22 because the prompt
conveys no spatial cues [9]. GenFlowchart strengthens the vision stage by replacing the task-specific
detector with the zero-shot SAM proposed by Kirillov et al. [11]. SAM’s universal masks are
collapsed to bounding boxes, optical character recognition is applied, and the resulting {mask, text}
pairs are forwarded to GPT-3.5-Turbo, following the pipeline of Arbaz et al. [12]. Although this
design boosts embedding-based textual-similarity scores, our replication shows that it still misorders
branches whenever two nodes share the same axis—a structural error the original paper does not
report. For hand-drawn sketches, Schäfer et al. [13] introduce Arrow R-CNN, which augments
Faster R-CNN with head–tail keypoint predictors and halves localization error on four datasets, but
its output flows into a rule-based graph builder rather than a modern VLM.
What unites these pipelines is the disappearance of the geometry channel: bounding-box centers,
pairwise distances, and arrow orientations are either discarded or embedded latently, so the language
model must hallucinate topology from an unordered token list. Work on natural images confirms that
explicit coordinates can help—Shikra encodes clicked points as textual tags [19], ChatSpot leverages
instruction tuning for precise region references [20], and RegionBLIP injects positional features as
soft prompts [21]—yet none of these systems target graph-based diagrams such as flowcharts.
A separate research line removes the interface altogether by predicting structure end-to-end. GRCNN
outputs node categories and an adjacency matrix in a single forward pass before emitting code with
a syntactic decoder [22]. FloCo-T5 is trained on 11,884 flowchart images and surpasses a vanilla
CodeT5 baseline with 67.4 BLEU, 75.7 CodeBLEU, and 20% Exact Match (EM) [2]. The authors
also show a sharp drop to 21.4 BLEU on 40 hand-drawn diagrams, indicating limited robustness to
noisy or off-distribution inputs. Because FloCo-T5 directly decodes a fixed "FloCo" token stream
into Python, it has not yet been evaluated for integrating external knowledge or chain-of-thought
reasoning (our observation).
Previous work splits into two extremes: (i) detector-plus-LLM pipelines that drop coordinates before
reasoning, and (ii) end-to-end models that predict the full graph in one shot but sacrifice linguistic
flexibility. We introduce the first pipeline that retains every entity as a (text, x, y) tuple and feeds this
sequence directly to a VLM, closing the gap between spatial fidelity and expressive reasoning.
3 Methodology
Our proposed inference pipeline comprises seven sequential stages: text extraction via OCR, object
detection, integration of text and objects, association of arrows with their start and end points, linking
objects to arrows, prompt construction reflecting graph structure, and finally, question generation
and VLM-based reasoning. The overall architecture is illustrated in Figure 1.
3.1 Text Extraction via OCR
First, we apply the Azure AI Document Intelligence service to each input flowchart image to extract
textual content and corresponding bounding box coordinates. Off-the-shelf OCR tools were used
without modification, leveraging their robust performance on printed and scanned text. The extracted
texts and their spatial locations form the initial input to the downstream processes.
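For illustration, one way to obtain these (text, bounding-box) records is sketched below, assuming the azure-ai-formrecognizer Python SDK and the prebuilt read model; the SDK choice and the helper names are illustrative, not part of the released implementation.

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential


def extract_text_boxes(image_path: str, endpoint: str, key: str):
    """OCR a flowchart image and return [{"text": ..., "bbox": (x0, y0, x1, y1)}, ...]."""
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(image_path, "rb") as f:
        result = client.begin_analyze_document("prebuilt-read", document=f).result()

    records = []
    for page in result.pages:
        for line in page.lines:
            # Collapse the polygon returned by the service into an axis-aligned bounding box.
            xs = [p.x for p in line.polygon]
            ys = [p.y for p in line.polygon]
            records.append({"text": line.content,
                            "bbox": (min(xs), min(ys), max(xs), max(ys))})
    return records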
3.2 Object Detection
We then detect key flowchart elements—such as processes, decisions, and arrows—using a fine-
tuned object-detection model. Specifically, we adopt the DAMO-YOLO model [23], which is
distributed under the Apache 2.0 license and delivers competitive accuracy comparable to state-of-
the-art detection models.
We annotate nine object classes within the flowcharts:
1. Text
2. Arrow
3. Terminator
4. Data
5. Process
6. Decision
7. Connection
8. Arrow Start
9. Arrow End
For classes 1–7, standard bounding boxes encapsulate the relevant regions. For Arrow Start and
Arrow End, we annotate small bounding boxes tightly around the visual start and end points of each
arrow, respectively. It is noteworthy that arrows themselves can sometimes span very large bounding
boxes, reflecting their visual prominence.
Although text was initially annotated as an object, in the final implementation, we instead relied
exclusively on the coordinates obtained from the OCR service for text information. Of the total 99
annotated diagrams, 30 were reserved for testing, while the remaining 69 were used for training and
validation.
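For concreteness, the nine classes and the per-detection record consumed by the later stages can be represented as follows; the class names follow the list above, while the Detection container itself is an illustrative assumption rather than a prescribed data structure.

from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class FlowchartClass(str, Enum):
    TEXT = "Text"
    ARROW = "Arrow"
    TERMINATOR = "Terminator"
    DATA = "Data"
    PROCESS = "Process"
    DECISION = "Decision"
    CONNECTION = "Connection"
    ARROW_START = "Arrow Start"
    ARROW_END = "Arrow End"


@dataclass
class Detection:
    cls: FlowchartClass
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image coordinates
    score: float = 1.0
    text: Optional[str] = None  # filled in by the text-integration stage (Sec. 3.3)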
3.3 Integration of Text and Object Information
Next, we merge the OCR-derived text information with the detected object information. We exclude
arrows (Arrow, Arrow Start, and Arrow End) from this integration step.
For each text bounding box, if it overlaps by more than 50% with a detected object bounding box, the
text is assigned to that object. This step effectively binds semantic content to each flowchart element.
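A minimal sketch of this rule, reusing the Detection records above and interpreting "overlaps by more than 50%" as more than half of the text box's area lying inside the object box; the helper names and the first-match assignment are illustrative assumptions.

def overlap_ratio(text_box, obj_box):
    """Fraction of the text box's area covered by the object box."""
    tx0, ty0, tx1, ty1 = text_box
    ox0, oy0, ox1, oy1 = obj_box
    iw = max(0.0, min(tx1, ox1) - max(tx0, ox0))
    ih = max(0.0, min(ty1, oy1) - max(ty0, oy0))
    text_area = max(1e-9, (tx1 - tx0) * (ty1 - ty0))
    return (iw * ih) / text_area


def attach_text(ocr_records, detections, threshold=0.5):
    """Assign each OCR text to the first non-arrow object covering more than half of it."""
    arrow_classes = {FlowchartClass.ARROW, FlowchartClass.ARROW_START, FlowchartClass.ARROW_END}
    for rec in ocr_records:
        for det in detections:
            if det.cls in arrow_classes:
                continue
            if overlap_ratio(rec["bbox"], det.bbox) > threshold:
                det.text = f"{det.text} {rec['text']}" if det.text else rec["text"]
                break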
3.4 Arrow Association
We then associate each detected Arrow with its corresponding start and end points. An Arrow is linked
to an Arrow Start and Arrow End based on two criteria:
1. The Arrow Start and Arrow End must be located near the edges of the Arrow’s bounding
box.
2. The Intersection-over-Union (IoU) between the bounding box formed by the Arrow Start
and Arrow End and the detected Arrow’s bounding box must exceed 0.5.
This matching process enables us to recover the directional information inherent in flowcharts.
Additionally, textual annotations such as “yes” or “no” that are not directly associated with any object
but are located near an Arrow are attached to that Arrow.
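The two criteria translate into simple geometry, as in the rough sketch below; the pixel tolerance `tol`, the first-match pairing strategy, and the use of the box spanned by the two endpoint boxes are our illustrative assumptions, and the attachment of nearby "yes"/"no" labels is omitted.

def iou(a, b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def near_box_edge(point, box, tol):
    """True if the point lies within `tol` pixels of the box boundary."""
    x, y = point
    x0, y0, x1, y1 = box
    inside_expanded = (x0 - tol <= x <= x1 + tol) and (y0 - tol <= y <= y1 + tol)
    dist_to_edge = min(abs(x - x0), abs(x - x1), abs(y - y0), abs(y - y1))
    return inside_expanded and dist_to_edge <= tol


def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def associate_arrow(arrow, starts, ends, tol=10, iou_thr=0.5):
    """Pick an (Arrow Start, Arrow End) pair satisfying both criteria of Sec. 3.4."""
    for s in starts:
        for e in ends:
            # Criterion 1: both endpoints lie near the edges of the arrow's bounding box.
            if not (near_box_edge(center(s.bbox), arrow.bbox, tol)
                    and near_box_edge(center(e.bbox), arrow.bbox, tol)):
                continue
            # Criterion 2: the box spanned by the two endpoints overlaps the arrow box (IoU > 0.5).
            if iou(union_box(s.bbox, e.bbox), arrow.bbox) > iou_thr:
                return {"det": arrow, "start": s, "end": e, "label": None}
    return None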
3.5 Linking Objects and Arrows
Once arrows have been associated with their start and end points, we link non-arrow objects (e.g.,
processes, decisions) to arrows.
For each non-arrow object, we associate any Arrow Start located near its bounding box edges as an
outgoing connection, and any Arrow End located near its edges as an incoming connection. This
step reconstructs the underlying control flow or decision logic of the diagram.
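Continuing the sketch above, the linking rule can be written by reusing `center` and `near_box_edge`; the source/target bookkeeping is an assumption about how the recovered connections are stored, not the released implementation.

def link_objects_to_arrows(objects, arrows, tol=10):
    """Resolve each arrow's source and target objects (Sec. 3.5).

    `objects` are non-arrow Detections; `arrows` are the dicts returned by associate_arrow.
    """
    for arrow in arrows:
        for obj in objects:
            # An Arrow Start near an object's edges marks an outgoing connection from it...
            if arrow["start"] and near_box_edge(center(arrow["start"].bbox), obj.bbox, tol):
                arrow["source"] = obj
            # ...and an Arrow End near an object's edges marks an incoming connection to it.
            if arrow["end"] and near_box_edge(center(arrow["end"].bbox), obj.bbox, tol):
                arrow["target"] = obj
    return arrows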
3.6 Prompt Construction
Using the extracted text, object categories, and relational information, we generate structured prompts
that represent the recovered graph structure. For each object, the prompt encodes:
1. The object category (e.g., process, decision)
2. The object’s text content
3. The preceding steps (connected via incoming arrows)
4. The subsequent steps (connected via outgoing arrows)
These graph-aware prompts are designed to make explicit the topology that is implicit in the visual
layout of the flowchart.
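The exact prompt wording is not reproduced here; the sketch below simply serialises the four fields listed above for each object, with hypothetical formatting.

def build_graph_prompt(objects, arrows):
    """Serialise the recovered graph into a structured, graph-aware prompt (Sec. 3.6)."""
    lines = []
    for obj in objects:
        incoming = [a for a in arrows if a.get("target") is obj and a.get("source")]
        outgoing = [a for a in arrows if a.get("source") is obj and a.get("target")]
        preceding = [a["source"].text or "(untitled)" for a in incoming]
        following = [(a["target"].text or "(untitled)")
                     + (f" (if {a['label']})" if a.get("label") else "")
                     for a in outgoing]
        lines.append(f"- Category: {obj.cls.value}")
        lines.append(f"  Text: {obj.text or '(no text)'}")
        lines.append(f"  Preceding steps: {', '.join(preceding) or 'none'}")
        lines.append(f"  Next steps: {', '.join(following) or 'none'}")
    return "\n".join(lines)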
3.7 Question Generation and VLM Inference
Finally, we formulate two types of input for the GPT-4o VLM: one without explicit graph information
and one incorporating the constructed graph prompts. For each test flowchart, we generate three
types of questions:
1. Next-step prediction: In this flowchart diagram, what is the next step after ’xxx’?
2. Conditional branch prediction: In this flowchart diagram, if ’xxx’ is ’yyy’, what is the
next step?
3. Preceding-step discrimination: In this flowchart diagram, which of the steps comes before
’xxx’, excluding ’zzz’?
We pass these questions along with the relevant flowchart prompt to the VLM and retrieve its
answers. Answer correctness is determined by comparing the VLM’s response against a human-
annotated ground-truth answer set, with the verification itself handled via an additional LLM-assisted
comparison step.
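The inference call itself can be as simple as the following sketch, which sends the image and the graph prompt to GPT-4o through the OpenAI Python SDK; the prompt layout and model identifier are illustrative rather than the verbatim configuration used in the experiments.

import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_flowchart_question(image_path, graph_prompt, question, model="gpt-4o"):
    """Query the VLM with the image plus the graph-structured prompt and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Flowchart structure:\n{graph_prompt}\n\nQuestion: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

For the input without explicit graph information, the same call would simply omit the graph block and send the question with the image alone.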
4 Results
4.1 Effectiveness of OCR and Detection Model in FlowchartQA
We compared two approaches for flowchart-based question answering (FlowchartQA): (1) an OCR
and detection model combination (Model Ocr-Dec) and (2) a no-OCR and no-detection baseline
using only raw images (Model No-Ocr-Dec). On a specially annotated corpus of 90 questions,
we evaluated how explicitly recovering arrow directions and node connections impacts overall QA
accuracy.
4.2 Experimental Setup
We conducted experiments on a manually annotated corpus consisting of 30 flowchart diagrams.
Each diagram was associated with three types of questions, totaling 90 questions across different
diagram sizes (Large, Medium, and Small). The detailed settings are summarized in Table 1.
Table 1: Summary of Experimental Settings
Item                 Details
Corpus               30 manually annotated flowcharts (10 Large / 10 Medium / 10 Small);
                     each diagram paired with three types of questions, totaling 90 questions.
Question Types       Type 1: Next Step; Type 2: Conditional Branch; Type 3: Previous Step
Size Categories      Large (>22 arrows), Medium (13–22 arrows), Small (<13 arrows)
Model Ocr-Dec        OCR + Detection: Azure AI Document Intelligence OCR + DAMO-YOLO object detector;
                     structured prompt and image input to GPT-4o.
Model No-Ocr-Dec     Baseline: direct prompt and image input to GPT-4o (no OCR, no detection).
Evaluation Metric    Primarily human evaluation, supplemented by LLM-based scoring.
To evaluate the correctness of answers generated by the LLM, we compared them with manually
prepared ground-truth answers using two methods: human judgment (primary) and LLM-based
evaluation (reference).
For the human evaluation, correctness was determined by comparing the predicted object B in the
flowchart with the ground-truth object described as "A is B." The evaluation was case-insensitive
and ignored punctuation such as periods.
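Although the primary judgments were made by humans, the stated rule (case-insensitive comparison that ignores punctuation) corresponds to a normalisation like the following helper, which is shown purely for illustration and is not part of the released protocol.

import string


def normalize(answer: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace before comparison."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def matches(predicted_b: str, reference_b: str) -> bool:
    return normalize(predicted_b) == normalize(reference_b)


# e.g. matches("Check inventory.", "check inventory") -> True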
For the LLM-based evaluation, we used GPT-4o to assess the semantic similarity between the LLM's
response and the reference answer. A prompt was designed to determine whether the two answers
were essentially equivalent in meaning.
4.3 Overall Accuracy
Table 2 summarizes the overall accuracy across all 90 questions, aggregating results from Type 1,
Type 2, and Type 3. Human evaluation is treated as the primary metric, while automatic scoring
using an LLM is provided for reference.
Table 2: Overall accuracy (%) and raw counts across all question types (n = 90).
Question Type    Ocr-Dec (Human)    No-Ocr-Dec (Human)    Ocr-Dec (LLM)    No-Ocr-Dec (LLM)
All (Total)      88.9 (80/90)       80.0 (72/90)          78.9 (71/90)     75.6 (68/90)
4.4 Accuracy by Question Type
Table 3 summarizes the accuracy results for each question type.
Table 3: Accuracy (%) and raw counts for each question type.
Question Type            Ocr-Dec (Human)    No-Ocr-Dec (Human)    Ocr-Dec (LLM)    No-Ocr-Dec (LLM)
Type 1 (Next Step)       100.0 (30/30)      83.3 (25/30)          93.3 (28/30)     76.7 (23/30)
Type 2 (Cond. Branch)    90.0 (45/50)       82.0 (41/50)          84.0 (42/50)     86.0 (43/50)
Type 3 (Previous Step)   50.0 (5/10)        60.0 (6/10)           10.0 (1/10)      20.0 (2/10)
4.5 Accuracy by Diagram Size
Table 4 shows the accuracy categorized by diagram size. Again, human evaluation is treated as
primary, with LLM automatic scoring shown for reference.
Table 4: Accuracy (%) by diagram size with supporting counts.
Diagram Size    Ocr-Dec (Human)    No-Ocr-Dec (Human)    Ocr-Dec (LLM)    No-Ocr-Dec (LLM)
Large           80.0 (24/30)       66.7 (20/30)          63.3 (19/30)     50.0 (15/30)
Medium          93.3 (28/30)       80.0 (24/30)          80.0 (24/30)     80.0 (24/30)
Small           93.3 (28/30)       93.3 (28/30)          93.3 (28/30)     96.7 (29/30)
5 Discussion
The experimental results revealed several important insights. First, for Type 1 (Next Step) questions,
the OCR and detection model achieved perfect accuracy (100%) according to human evaluation,
significantly outperforming the No-Ocr-Dec baseline by 16.7 percentage points. LLM-based scoring
similarly showed large gains (+16.7 pp), validating the robustness of this improvement.
For Type 2 (Conditional Branch) questions, Model Ocr-Dec improved by 8.0 percentage points
based on human evaluation, though LLM automatic scoring showed almost no advantage. This
discrepancy suggests that minor variations in textual explanations, which human evaluators can
tolerate, may cause automatic scorers to incorrectly penalize correct answers.
For Type 3 (Previous Step) questions, both human and LLM evaluations revealed low accuracy, with
Model No-Ocr-Dec slightly outperforming Model Ocr-Dec. This confirms that execution-order
reasoning remains difficult without explicit graph structure input.
Regarding diagram size, Model Ocr-Dec outperformed the baseline on Large and Medium diagrams
in human evaluations. Improvements were smaller or absent for Small diagrams, which tend to have
simpler structures where explicit arrow recovery has less impact.
5.1 Error Analysis and Improvement Strategies
Error analysis highlighted several recurring failure patterns. A primary source of error was mis-
linking of arrow endpoints, sometimes connecting decision branches (e.g., “Yes”/“No”) incorrectly.
Introducing an IoU-based post-correction method after detection is expected to address this issue.
Another common error was OCR over-segmentation, where contiguous phrases were split into
multiple fragments. Distance-based clustering of bounding boxes could help merge these fragmented
texts.
Furthermore, failure to recover complete graph topology, particularly when nodes had multiple
incoming edges, often led to incorrect reasoning. Representing the flowchart as a JSON-encoded
directed graph, with topological ordering explicitly embedded in prompts, is a promising solution.
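As a hypothetical illustration of this remedy, the recovered graph could be embedded in the prompt as JSON along the following lines; the node names and field names are invented for the example.

import json

# Hypothetical JSON-encoded directed graph with an explicit topological order (Sec. 5.1).
flowchart_graph = {
    "nodes": [
        {"id": "n1", "category": "terminator", "text": "Start"},
        {"id": "n2", "category": "decision", "text": "Stock available?"},
        {"id": "n3", "category": "process", "text": "Ship order"},
        {"id": "n4", "category": "process", "text": "Reorder stock"},
    ],
    "edges": [
        {"from": "n1", "to": "n2"},
        {"from": "n2", "to": "n3", "label": "yes"},
        {"from": "n2", "to": "n4", "label": "no"},
    ],
    "topological_order": ["n1", "n2", "n3", "n4"],
}
prompt_block = json.dumps(flowchart_graph, indent=2)  # embedded verbatim in the prompt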
Finally, it should be emphasized that LLM automatic scoring showed limitations in handling para-
phrases and extended explanations. Therefore, human evaluation was adopted as the principal
measure of accuracy, and LLM results were treated as supplementary indicators.
6 Conclusion
This study demonstrated that combining OCR and flowchart-specific object detection substantially
improves question answering accuracy for flowcharts, particularly in large diagrams and next-step
reasoning tasks (Type 1). By explicitly recovering text content and arrow directions, the proposed
method enabled LLMs to better understand the structural relationships embedded in flowchart dia-
grams.
Evaluation was primarily conducted via human judgment, supplemented by automatic scoring using
a secondary LLM. Human evaluation revealed that the OCR and detection model achieved perfect
accuracy for next-step questions (Type 1) and substantial improvements for conditional branch
questions (Type 2), confirming the effectiveness of explicitly structured input. However, LLM
automatic evaluation sometimes underreported accuracy, especially when model outputs included
extended explanations, highlighting the limitations of strict string-matching approaches for complex
reasoning tasks.
While significant gains were observed for next-step questions, challenges remain for conditional
branching (Type 2) and previous-step identification (Type 3). In these cases, simple text extraction
and object localization were insufficient; fine-grained understanding of control flow, decision logic,
and execution order is critical. Further improvements will require:
• High-precision detection of arrow start and end points to prevent directional ambiguity
• Explicit representation of the flowchart's graph structure in prompts, allowing the LLM to
reason over paths and dependencies
Moreover, the error analysis highlighted additional areas for refinement, such as mitigating OCR
over-segmentation errors and incorporating graph-based topological information directly into the
reasoning pipeline. Addressing these challenges is expected not only to boost performance on
complex reasoning tasks but also to improve system robustness when applied to handwritten diagrams,
BPMN, and industrial schematics.
Finally, the modular pipeline proposed here—separating visual parsing from reasoning—paves the
way for scalable, domain-adaptive flowchart understanding systems. Future work will explore
enhancing graph-structured prompting, developing confidence-aware reasoning mechanisms, and
improving automatic evaluation methods to better handle paraphrastic or explanatory outputs, thus
enabling more reliable and generalizable deployment across diverse real-world settings.
References
[1] D. Hooshyar, R. B. Ahmad, M. Yousefi, F. D. Yusop, and S.-J. Horng. A flowchart-based intelligent tutoring system for improving problem-solving skills of novice programmers. Journal of Computer Assisted Learning, 31(4):345–361, April 2015. ISSN 1365-2729. doi: 10.1111/jcal.12099. URL http://dx.doi.org/10.1111/jcal.12099.
[2] Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, and Anand Mishra. Towards making flowchart images machine interpretable. 2025. URL http://arxiv.org/pdf/2501.17441.
[3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. 2020. URL http://arxiv.org/pdf/2009.03300.
[4] OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, et al. GPT-4o system card. 2024. URL http://arxiv.org/pdf/2410.21276.
[5] OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, 2025. Accessed: 2025-04-30.
[6] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. 2023. URL http://arxiv.org/pdf/2311.16502.
[7] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. 2023. URL http://arxiv.org/pdf/2310.02255.
[8] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. 2020. URL http://arxiv.org/pdf/2007.00398.
[9] Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, and Longin Jan Latecki. FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding. IOS Press, October 2024. ISBN 9781643685489. doi: 10.3233/faia240473. URL http://dx.doi.org/10.3233/FAIA240473.
[10] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. 2015. URL http://arxiv.org/pdf/1506.02640.
[11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. 2023. URL http://arxiv.org/pdf/2304.02643.
[12] Abdul Arbaz, Heng Fan, Junhua Ding, Meikang Qiu, and Yunhe Feng. GenFlowchart: Parsing and Understanding Flowchart Using Generative AI, pages 99–111. Springer Nature Singapore, 2024. ISBN 9789819754922. doi: 10.1007/978-981-97-5492-2_8. URL http://dx.doi.org/10.1007/978-981-97-5492-2_8.
[13] Bernhard Schäfer, Margret Keuper, and Heiner Stuckenschmidt. Arrow R-CNN for handwritten diagram recognition. International Journal on Document Analysis and Recognition (IJDAR), 24(1–2):3–17, February 2021. ISSN 1433-2825. doi: 10.1007/s10032-020-00361-1. URL http://dx.doi.org/10.1007/s10032-020-00361-1.
[14] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? 2024. URL http://arxiv.org/pdf/2403.20330.
[15] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. 2016. URL http://arxiv.org/pdf/1603.07396.
[16] Yifan Wu, Lutao Yan, Leixian Shen, Yunhai Wang, Nan Tang, and Yuyu Luo. ChartInsights: Evaluating multimodal large language models for low-level chart question answering. 2024. URL http://arxiv.org/pdf/2405.07001.
[17] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. 2024. URL http://arxiv.org/pdf/2406.18521.
[18] Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Peng Ye, Min Dou, Botian Shi, et al. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. 2024. URL http://arxiv.org/pdf/2402.12185.
[19] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. 2023. URL http://arxiv.org/pdf/2306.15195.
[20] Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, et al. ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning. 2023. URL http://arxiv.org/pdf/2307.09474.
[21] Qiang Zhou, Chaohui Yu, Shaofeng Zhang, Sitong Wu, Zhibing Wang, and Fan Wang. RegionBLIP: A unified multi-modal pre-training framework for holistic and regional comprehension. 2023. URL http://arxiv.org/pdf/2308.02299.
[22] Lin Cheng and Zijiang Yang. GRCNN: Graph recognition convolutional neural network for synthesizing programs from flow charts. 2020. URL http://arxiv.org/pdf/2011.05980.
[23] Xianzhe Xu, Yiqi Jiang, Weihua Chen, Yilun Huang, Yuan Zhang, and Xiuyu Sun. DAMO-YOLO: A report on real-time object detection design. 2022. URL http://arxiv.org/pdf/2211.15444.
A Additional Evaluation Results
We provide here additional results and analysis that complement the main paper, including per-
category detection performance and relaxed IoU evaluations.
A.1 Detection Results
We evaluated the detection performance of the DAMO-YOLO model on our custom test dataset using
the COCO evaluation metrics. Table 5 shows the Average Precision (AP) and Average Recall (AR)
across different object sizes under relaxed IoU thresholds (0.10–0.50). The overall AP was 0.836
and AR reached 0.925, with large objects achieving the highest recall (AR = 0.984).
Table 5: Overall AP and AR (IoU=0.10–0.50) for different object sizes
Metric              All      Small    Medium   Large
AP@0.10–0.50        0.836    0.785    0.832    0.831
AR@maxDets=100      0.925    0.897    0.872    0.984
Table 6 reports category-wise mean Average Precision (mAP) under the standard COCO setting (IoU
= 0.50–0.95). The Arrow class achieved moderate performance (mAP = 0.4476). However, the
average mAP for all arrow-related categories including Arrow Start and Arrow End was significantly
lower (mAP = 0.2349) compared to non-arrow categories (mAP = 0.6531).
Table 6: Per-category mAP (IoU=0.50–0.95)
Category mAP
Arrow 0.4476
Arrow-related (Arrow, Arrow Start, Arrow End) 0.2349
Non-arrow categories 0.6531
All categories 0.5137
Since the bounding boxes for Arrow Start and Arrow End are very small, their detection accuracy
tends to be underestimated when evaluated with the standard IoU range of 0.50–0.95. Therefore, we
also evaluated them under a lower IoU range of 0.10–0.50. The results are shown in Table 7.
Table 7: mAP of small objects under relaxed IoU (0.10–0.50)
Category       mAP@0.10–0.50
Arrow Start    0.7541
Arrow End      0.8373
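For reference, a relaxed-IoU evaluation of this kind can be reproduced with pycocotools roughly as follows; the tool choice and file names are illustrative placeholders rather than the exact setup used to produce the numbers above.

import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def evaluate_relaxed_iou(gt_json="annotations.json", det_json="detections.json"):
    """COCO-style bbox evaluation under the relaxed IoU range 0.10-0.50."""
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(det_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    # Replace the default 0.50:0.95 thresholds with 0.10:0.50 in steps of 0.05.
    ev.params.iouThrs = np.arange(0.10, 0.50 + 1e-9, 0.05)
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats  # stats[0] is the AP averaged over the relaxed thresholds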
B LLM-as-a-Judge Evaluation Details
For the LLM-based evaluation described in the main paper, we used the following prompt to assess
the similarity between model-generated answers and reference answers:
You are a strict judge tasked with the following:
1. A question (Question)
2. A reference answer (Reference Answer)
3. A model output (Model Output)
Please evaluate the model output by following these steps:
### Step 1: Analyze the Answers
- First, compare the reference answer and the model output.
- Determine whether they essentially match in meaning or reasoning, or if the model
output is otherwise correct based on its logic and evidence.
- Provide a thorough and logical assessment, noting any gaps or inconsistencies.
### Step 2: Final Judgment
- If the model output is substantially the same as the reference answer or
equivalently valid judge it as correct.
- If there are clear mistakes, omissions, or inconsistencies, judge it as incorrect.
### Step 3: Output in the Specified Schema
- Please output your evaluation result strictly in the following JSON format:
Here, [Reference Answer] and [LLM Answer] were replaced with the actual reference and LLM-
generated answers, respectively. We also utilized Structured Outputs to ensure consistent formatting
of the evaluation results in JSON format, making the automated processing of judgments more
reliable.
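For illustration, the judge call can be wired up with Structured Outputs as in the following sketch; the verdict schema (fields `reasoning` and `is_correct`) is an assumption, since the exact JSON schema is not reproduced here, and `JUDGE_PROMPT` stands in for the evaluation prompt shown above.

from openai import OpenAI
from pydantic import BaseModel

JUDGE_PROMPT = "You are a strict judge tasked with the following: ..."  # full prompt shown above


class JudgeVerdict(BaseModel):
    reasoning: str
    is_correct: bool


client = OpenAI()


def judge(question, reference_answer, model_output, model="gpt-4o"):
    """LLM-as-a-Judge call whose output is constrained to the JudgeVerdict schema."""
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference Answer: {reference_answer}\n"
                f"Model Output: {model_output}")},
        ],
        response_format=JudgeVerdict,
    )
    return completion.choices[0].message.parsed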