GUI Agents With Foundation Models: A Comprehensive Survey
Shuai Wang1, Weiwen Liu*1, Jingxuan Chen1, Weinan Gan1, Xingshan Zeng1,
Shuai Yu1, Xinlong Hao1, Kun Shao1, Yasheng Wang1, and Ruiming Tang1
1Huawei Noah's Ark Lab
{wangshuai231, liuweiwen8}@huawei.com
*Corresponding authors.

Abstract

Recent advances in (Multimodal) Large Language Models ((M)LLMs) facilitate intelligent agents capable of performing complex tasks. By leveraging the ability of (M)LLMs to process and interpret Graphical User Interfaces (GUIs), these agents can autonomously execute user instructions by simulating human-like interactions such as clicking and typing. This survey consolidates recent research on (M)LLM-based GUI agents, highlighting key innovations in data, frameworks, and applications. We begin by discussing representative datasets and benchmarks. Next, we summarize a unified framework that captures the essential components used in prior research, accompanied by a taxonomy. Additionally, we explore commercial applications of (M)LLM-based GUI agents. Drawing from existing work, we identify several key challenges and propose future research directions. We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.

1 Introduction

Graphical User Interfaces (GUIs) serve as the primary interaction points between humans and digital devices. People interact with GUIs daily on mobile phones and websites, and a well-designed GUI agent can significantly enhance the user experience. Consequently, research on GUI agents has been extensive. However, traditional rule-based and reinforcement learning-based methods struggle with tasks that require human-like interactions (Gur et al., 2018; Liu et al., 2018), limiting their practical application.

In recent years, advancements in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have elevated their abilities in language understanding and cognitive processing to unprecedented levels (OpenAI et al., 2024), enabling them to understand human language, develop detailed plans, and execute complex tasks. These breakthroughs offer new opportunities for AI researchers to tackle challenges that were once deemed highly difficult, such as automating tasks within GUIs. As a result, numerous studies focusing on (M)LLM-based GUI agents have been published, as shown in Figure 1, especially over the last two years. However, few efforts have been made to comprehensively summarize and compare the research in this emerging field. A systematic review is urgently needed to provide a holistic understanding and inspire future developments.

This paper presents a comprehensive survey of GUI agents with foundation models. We organize the survey around three key areas: data, framework, and application. First, we investigate the available datasets and benchmarks for GUI agents and list them as a resource for researchers in this area. Second, we review recent work on (M)LLM-based GUI agents, classified by input modality and learning mode. Finally, we summarize the latest industrial applications of (M)LLM-based GUI agents, which hold significant commercial potential.

2 GUI Agents Data Source

Recent research has focused on developing datasets and benchmarks to train and evaluate the capabilities of (M)LLM-based GUI agents. These studies can be broadly classified into two categories based on whether they involve interaction with the actual environment: static datasets (Rawles et al., 2023; Zhang et al., 2024b; Li et al., 2024a; Lu et al., 2024; Venkatesh et al., 2023; Chen et al., 2024a; Li et al., 2020) and dynamic datasets (Zhou et al., 2023; Gao et al., 2024; Rawles et al., 2024; Chen et al., 2024b).
Figure 1: Illustration of the growth trend in the field of GUI Agents with Foundation Models.
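To make the static/dynamic distinction concrete, the sketch below contrasts the two evaluation regimes in Python: a static dataset stores screen-action pairs and is scored by next-action prediction, while a dynamic benchmark runs the agent in an environment that reports task success. The field names, `env` interface, and `predict`/`agent` callables are illustrative assumptions, not the schema of any specific dataset above.

```python
from dataclasses import dataclass

@dataclass
class StaticStep:
    """One screen-action pair from a static dataset (hypothetical schema)."""
    instruction: str      # natural-language task, e.g. "Turn on Wi-Fi"
    screenshot_path: str  # observation recorded offline
    gold_action: str      # the action the human demonstrator took

def step_accuracy(samples, predict):
    """Static evaluation: compare the predicted next action with the recorded one."""
    hits = sum(predict(s.instruction, s.screenshot_path) == s.gold_action for s in samples)
    return hits / max(len(samples), 1)

def episode_success(env, agent, task, max_steps=30):
    """Dynamic evaluation: the agent must actually finish the task in an environment
    (simulator or real device) that reports success, as in AndroidWorld or WebArena."""
    obs = env.reset(task)
    for _ in range(max_steps):
        obs, done = env.step(agent.act(obs))
        if done:
            break
    return env.task_successful()  # reward signal computed by the benchmark
```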
Android-In-The-Zoo Zhang et al. (2024b) introduces a benchmark dataset with 18,643 screen-action pairs and chained action reasoning annotations, aimed at advancing GUI navigation agent research.

AndroidControl Li et al. (2024a) comprises 15,283 demonstrations of daily tasks performed with Android apps, where each task instance is accompanied by both high- and low-level human-generated instructions. The dataset can be used to assess model performance both within and beyond the domain of the training data.

GUI-Odyssey Lu et al. (2024) introduces a comprehensive dataset for training and assessing cross-application navigation agents. The dataset comprises 7,735 episodes, encompassing six types of cross-application tasks, 201 distinct applications, and 1,399 application combinations.

UGIF Venkatesh et al. (2023) introduces a comprehensive multilingual, multimodal user interface (UI) localization dataset with 4,184 tasks spanning multiple languages.

PIXELHELP Li et al. (2020) proposes a new class of problems focused on translating natural language instructions into actions on mobile user interfaces. The work introduces three new datasets, PIXELHELP, ANDROIDHowTO, and RICOSCA, collectively comprising 187 multi-step instructions for model training.

WebArena Zhou et al. (2023) implements a versatile website environment covering e-commerce, social forums, collaborative software development, and content management, with 812 test examples for grounding high-level natural language instructions. Current models such as TEXT-BISON-001, GPT-3.5, and GPT-4 achieve 14.41% accuracy, compared to 78.24% for humans.

ASSISTGUI Gao et al. (2024) introduces a novel benchmark for evaluating model manipulation of the mouse and keyboard on Windows. ASSISTGUI includes 100 tasks from 9 software applications (e.g., After Effects, MS Word) with project files for accurate assessment.
AndroidWorld Rawles et al. (2024) presents an Android environment capable of delivering reward signals for 116 programmatic tasks spanning 20 real-world Android apps. The environment constructs tasks dynamically, with parameters expressed in natural language, enabling an infinite number of task variations.

SPA-Bench Chen et al. (2024b) proposes an interactive environment designed to simulate real-world conditions for evaluating GUI agents. The environment encompasses 340 tasks that involve both system and third-party apps, supporting single-app and cross-app scenarios in both English and Chinese.

3 (M)LLM-based GUI Agents

With the human-like capabilities of (M)LLMs, GUI agents aim to handle a wide range of tasks to meet users' needs. To fully exploit the abilities of (M)LLMs, the framework of a GUI agent must be carefully designed. In this section, we first summarize a systematic construction from existing work, select representative cases, and discuss the designs of their individual modules. We then give a comprehensive taxonomy for GUI agents, classifying existing work along two key dimensions: input modality and learning mode. These two dimensions cover the current major work and help new researchers gain a complete view of GUI agents.

3.1 (M)LLM-based GUI Agent Construction

The objective of GUI agents is to automatically control a device in order to complete tasks defined by the user. Typically, GUI agents take a user's query and the device's UI status as inputs, and produce a series of human-like operations to accomplish the task.

As shown in Figure 2, the construction of an (M)LLM-based GUI agent consists of five parts: GUI Perceiver, Task Planner, Decision Maker, Memory Retriever, and Executor. There are many variants of this construction. For example, Wang et al. (2024a) propose a multi-agent GUI control framework whose planning, decision, and reflection agents serve similar functions, addressing navigation challenges in mobile device operation tasks.

GUI Perceiver: To effectively complete a device task, a GUI agent needs to accurately interpret user input and detect changes in the device's UI. While language models excel at understanding user intent (Touvron et al., 2023; OpenAI et al., 2024), navigating device UIs requires a reliable visual perception model for optimal interaction.

A UI Perceiver appears explicitly or implicitly in the GUI agent framework. For agents based on single-modal LLMs (Wen et al., 2023, 2024b; Li et al., 2020), the UI Perceiver is usually an explicit module of the agent framework. However, for agents built on multi-modal LLMs (Hong et al., 2023; Zhang et al., 2023; Wang et al., 2024b), UI perception is treated as a capability of the model itself.

UI perception is also an important problem in GUI agent research in its own right; some work (You et al., 2024; Zhang et al., 2021) therefore focuses on understanding and processing the UI rather than building the agent. For example, You et al. (2024) propose a series of referring and grounding tasks, which provide valuable insights into GUI-oriented pre-training.

Task Planner: The GUI agent should effectively decompose complex tasks, often employing a Chain-of-Thought (CoT) approach (Wei et al., 2023). Due to the complexity of these tasks, recent studies (Zhang et al., 2024a; Wang et al., 2024a) have introduced an additional module to support more detailed planning.

Throughout the GUI agent's process, plans may adapt dynamically based on decision feedback, typically achieved in a ReAct style (Yao et al., 2023). For instance, Zhang et al. (2023) use on-screen observations to enhance the CoT for improved decision-making, while Wang et al. (2024a) develop a reflection agent that provides feedback to refine plans.

Decision Maker: A Decision Maker is responsible for providing the next operation(s) to control a device. Most studies (Lu et al., 2024; Zhang et al., 2024a; Wen et al., 2024a) define a set of basic UI-related actions, such as click, text entry, and scroll, as the action space. In more complicated cases, Ding (2024) encapsulates a sequence of actions into Standard Operating Procedures (SOPs) to guide further operations.

As the power of GUI agents improves, the granularity of operations becomes more refined. Recent work has progressed from element-level operations (Zhang et al., 2023; Wang et al., 2024b) to coordinate-level control (Wang et al., 2024a; Hong et al., 2023).
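To make this action space concrete, the sketch below defines a minimal set of element-level and coordinate-level operations in Python; the exact fields vary across the cited systems, so the schema is an illustrative assumption rather than a shared standard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    """One operation in a basic GUI action space."""
    kind: str                                # "click" | "type" | "scroll" | "finish"
    element_id: Optional[int] = None         # element-level target, e.g. an index in the UI tree
    point: Optional[Tuple[int, int]] = None  # coordinate-level target in screen pixels
    text: str = ""                           # payload for "type"

# Element-level control: act on a perceived UI element.
tap_login = GUIAction(kind="click", element_id=7)

# Coordinate-level control: act directly on screen coordinates.
tap_pixel = GUIAction(kind="click", point=(540, 1210))

# Text entry and task termination round out the basic space.
enter_name = GUIAction(kind="type", element_id=3, text="Alice")
done = GUIAction(kind="finish")
```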
Executor: As the link between GUI agents and devices, the Executor maps the agent's outputs to the relevant environment. For real-device execution, most studies (Zhang et al., 2023; Wang et al., 2024b,a) utilize the Android Debug Bridge (ADB) to control the device. Differently, Rawles et al. (2024) conduct tests in a simulator, where additional UI-related information can be accessed.
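As a concrete illustration of such an Executor, the sketch below wraps a few standard ADB input commands with Python's subprocess module; the mapping from abstract agent actions to these commands is an assumption for illustration, not the implementation used in the cited works.

```python
import subprocess

def adb(*args: str) -> None:
    """Run a single adb command, e.g. adb("shell", "input", "tap", "540", "960")."""
    subprocess.run(["adb", *args], check=True)

def execute(action: dict) -> None:
    """Map an abstract agent action to a real device operation via ADB."""
    if action["kind"] == "click":            # tap at screen coordinates
        adb("shell", "input", "tap", str(action["x"]), str(action["y"]))
    elif action["kind"] == "type":           # type text into the focused field
        adb("shell", "input", "text", action["text"].replace(" ", "%s"))
    elif action["kind"] == "scroll":         # swipe from one point to another
        adb("shell", "input", "swipe", "540", "1500", "540", "500")
    elif action["kind"] == "screenshot":     # capture the current screen
        png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                             check=True, capture_output=True).stdout
        with open("screen.png", "wb") as f:
            f.write(png)
```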
Memory Retriever: The Memory Retriever is designed as an additional source of information to help agents perform tasks more effectively (Wang et al., 2024c).

Generally, the memory for GUI agents is divided into internal and external categories. Internal memory (Lu et al., 2024) includes the previous actions, screenshots, and other statuses generated during execution. External memory (Zhang et al., 2023; Ding, 2024) usually includes prior knowledge and rules related to the UI or tasks. Both can serve as additional inputs to assist GUI agents.
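Putting the five modules together, the following sketch shows one possible control loop; the interfaces, prompts, and the generic `llm` and `env` callables are assumptions for illustration and do not correspond to any specific system surveyed here.

```python
def run_gui_agent(llm, env, task: str, max_steps: int = 15) -> bool:
    """Schematic loop wiring GUI Perceiver, Task Planner, Decision Maker,
    Memory Retriever, and Executor around a generic (M)LLM."""
    memory: list[str] = []  # internal memory of executed steps

    for _ in range(max_steps):
        # GUI Perceiver: turn the raw UI state into a model-readable description.
        screen = env.describe_screen()

        # Task Planner: CoT-style decomposition of the remaining work.
        plan = llm(f"Task: {task}\nScreen: {screen}\nThink step by step and outline a plan.")

        # Memory Retriever: recent history (plus any external knowledge) joins the prompt.
        history = "; ".join(memory[-5:]) or "none"

        # Decision Maker: choose the next operation from a basic action space.
        action = llm(f"Plan: {plan}\nScreen: {screen}\nHistory: {history}\n"
                     f"Reply with one action: click(<id>), type(<id>, <text>), scroll(<dir>), or finish.")

        # Executor: map the chosen action onto the device (e.g., via ADB).
        env.execute(action)
        memory.append(action)

        if action.strip().startswith("finish"):
            return True
    return False
```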
3.2 (M)LLM-based GUI Agent Taxonomy

As shown in Figure 1, existing work can be summarized along several dimensions. This paper classifies existing work by two of them: input modality and learning mode.

3.2.1 GUI Agents with Different Input Modality

LLM-based GUI Agents: Owing to limited multimodal capabilities, earlier GUI agents (Lee et al., 2023b; Li et al., 2020; Gur et al., 2022; Jiang et al., 2023; Nakano et al., 2022) often require a GUI Perceiver to convert the GUI into text-based input. For instance, Li et al. (2020) transform the screen into a series of object descriptions and apply a transformer-based method for action mapping. The problem definitions and datasets have spurred further research. Wen et al. (2024a) further convert the GUI to a simplified HTML representation for compatibility with the base model. By combining the GUI representation with app-specific knowledge, they build AutoDroid, a GUI agent based on off-the-shelf LLMs, including online GPT and on-device Vicuna.
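The sketch below illustrates the general idea of such a text-based GUI representation: a list of UI elements (as might be read from an accessibility tree) is rendered as simplified HTML for an LLM prompt. The element schema and tag mapping are assumptions for illustration, not AutoDroid's actual format.

```python
def to_simplified_html(elements: list[dict]) -> str:
    """Render UI elements as compact HTML so a text-only LLM can 'see' the screen."""
    tag_map = {"Button": "button", "EditText": "input", "TextView": "p", "CheckBox": "checkbox"}
    lines = []
    for idx, el in enumerate(elements):
        tag = tag_map.get(el.get("class", ""), "div")
        text = el.get("text") or el.get("content_desc") or ""
        lines.append(f'<{tag} id="{idx}">{text}</{tag}>')
    return "\n".join(lines)

# Example: a tiny screen with two elements becomes two HTML lines,
# which can be pasted into the prompt together with the user task.
screen = [
    {"class": "EditText", "text": "", "content_desc": "Search apps"},
    {"class": "Button", "text": "Send"},
]
print(to_simplified_html(screen))
# <input id="0">Search apps</input>
# <button id="1">Send</button>
```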
MLLM-based GUI Agents: Recent studies (Shaw et al., 2023; Wang et al., 2021; You et al., 2024; Bai et al., 2021) utilize the multimodal capabilities of advanced (M)LLMs to improve GUI comprehension and task execution.

Some works (You et al., 2024; Zhang et al., 2021; Lee et al., 2023a; Wang et al., 2021) focus on GUI understanding. For example, Pix2Struct (Lee et al., 2023a) employs a ViT-based image-encoder-text-decoder architecture, which is pre-trained on screenshot-HTML data pairs and fine-tuned for specific tasks. This method has shown strong performance on four web-based visual comprehension tasks. Similarly, Screen Recognition (Zhang et al., 2021) proposes a method to convert mobile app UIs into metadata, using extensive manual annotations of iOS UIs. The data is used to train an on-device object detection model for UI detection, and it produces accessible metadata that the screen reader can use. This work produced a dataset containing 77,637 screenshots from 4,068 iPhone apps with complete manual annotations. Screen2Words (Wang et al., 2021) is a novel approach for encapsulating a UI screen into a coherent language representation, based on a transformer encoder-decoder architecture. The encoder includes a language encoder for app descriptions or structured data and a ResNet encoder for UI screenshots; the decoder is a transformer decoder that generates the representation.

Leveraging the visual understanding capabilities of MLLMs, recent studies (Wang et al., 2024a; Li and Li, 2023; Bai et al., 2021) explore end-to-end frameworks for GUI device control. For example, Spotlight (Li and Li, 2023) proposes a vision-language model framework, pre-trained on web/mobile data and fine-tuned for UI tasks, which greatly improves UI understanding. By combining screenshots with a user focus region as input, Spotlight outperforms previous methods on multiple UI understanding tasks, showing verified gains on downstream tasks. Likewise, VUT (Li et al., 2021) is proposed for GUI understanding and multi-modal UI input modeling, using two Transformers: one for encoding and fusing image, structural, and language inputs, and the other for linking three task heads to complete five distinct UI modeling tasks and learn multiple downstream tasks end-to-end. Experiments show that VUT's multi-task learning framework achieves state-of-the-art (SOTA) performance on UI modeling tasks. UIBert (Bai et al., 2021) focuses on heterogeneous GUI features and assumes that the multi-modal information in a GUI is self-aligned. UIBert is a transformer-based joint image-text model pre-trained on large-scale unlabeled GUI data to learn feature representations of UI elements. Experiments on nine real-world downstream UI tasks show that UIBert greatly surpasses the strongest multimodal baseline.

To enhance performance, some studies (Zhang et al., 2023; Rawles et al., 2024) utilize additional invisible metadata. For instance, AndroidWorld (Rawles et al., 2024) establishes a fully functional Android environment with real-world tasks, serving as a benchmark for evaluating GUI agents. They propose M3A, a zero-shot prompting agent that uses Set-of-Marks (Yang et al., 2023) as input. Experiments with M3A variants assess how different input modalities (text, screenshots, and accessibility trees) affect GUI agent performance.

3.2.2 GUI Agents with Different Learning Mode

Prompting-based GUI Agents: Prompting is an effective approach to building agents with minimal extra computational overhead. Given the diversity of GUIs and tasks, numerous studies (Zhang et al., 2023; Li et al., 2024b; Wang et al., 2024a; Humphreys et al., 2022; Wen et al., 2024b) use prompting to create GUI agents, adopting CoT or ReAct styles.

Recent studies use prompting to build and simulate the functions of each module within a GUI agent, enabling effective GUI control. For example, Yan et al. (2023) introduce MM-Navigator, which utilizes GPT-4V for zero-shot GUI understanding and navigation. This work demonstrates, for the first time, the significant potential of LLMs, particularly GPT-4V, for zero-shot GUI tasks. Manual evaluations show that MM-Navigator achieves impressive performance in generating reasonable action descriptions and single-step instructions for iOS tasks. Additionally, Song et al. (2023) introduce a framework for interacting with the GUI using a sequential, human-like problem-solving approach. The framework includes a YOLO-based UI understanding module to locate UI elements and text, a GPT-4V-based task planning module to decompose the task, and an execution module that maps text-based actions to control the device. Wen et al. (2024b) propose DroidBot-GPT, which summarizes the app's status, historical actions, and the task into a prompt, and then uses ChatGPT to select the next action. This approach effectively integrates historical actions and user UIs without requiring any modifications to the underlying LLM. Furthermore, Zheng et al. (2024) propose SeeAct, a GPT-4V-based generalist web agent. With screenshots as input, SeeAct generates action descriptions and converts them into executable actions with designed action grounding techniques.
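The sketch below conveys the flavor of such prompting-based control, loosely following the DroidBot-GPT recipe described above (app status, history, and task summarized into one prompt, with the reply parsed into an action index). The prompt wording and the parsing rule are illustrative assumptions, not the cited systems' exact prompts.

```python
import re

def build_prompt(task: str, app_status: str, history: list[str], actions: list[str]) -> str:
    """Summarize everything the agent knows into a single next-action prompt."""
    numbered = "\n".join(f"{i}. {a}" for i, a in enumerate(actions))
    past = "; ".join(history) if history else "none"
    return (
        f"You are controlling a smartphone app.\n"
        f"Task: {task}\nCurrent screen: {app_status}\nActions taken so far: {past}\n"
        f"Available actions:\n{numbered}\n"
        f"Think briefly, then answer with the number of the best next action."
    )

def parse_choice(reply: str, n_actions: int) -> int:
    """Extract the chosen action index from the model reply (default to 0)."""
    match = re.search(r"\d+", reply)
    idx = int(match.group()) if match else 0
    return min(idx, n_actions - 1)

# Usage with any chat LLM client: send build_prompt(...) as the user message,
# then pass parse_choice(reply, len(actions)) to the executor.
```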
Some studies enable the GUI agent to fully leverage external knowledge through prompting to complete GUI tasks. AppAgent (Zhang et al., 2023) proposes a multimodal agent framework to simulate human-like mobile phone operations. The framework is divided into two phases: Exploration, where agents explore applications and document their operations, and Deployment, where these documents guide the agent in observing, thinking, acting, and summarizing tasks. Furthermore, building on this two-phase architecture, AppAgent V2 (Li et al., 2024b) improves GUI parsing, document generation, and prompt integration. Unlike the previous version, the method expands beyond an off-the-shelf parser by integrating optical character recognition (OCR) and detection tools for UI element identification. AppAgent V2 achieves outstanding performance on various benchmarks. Wang et al. (2023) use a pure in-context learning method to enable interaction between an LLM and mobile UIs. The method divides the conversations between agent and user into four categories according to originator and purpose, and designs a series of structured CoT prompts to adapt an LLM to execute mobile UI tasks. MobileGPT (Lee et al., 2023b) emulates the cognitive processes of human app use to enhance an LLM-based agent with a human-like app memory. MobileGPT uses a random explorer to explore many apps, generate screen-related subtasks, and save them as app memory. During execution, the related memory is recalled to complete tasks. MobileGPT achieves significant improvements over the GPT-3.5 and GPT-4 baselines.

SFT-based GUI Agents: Fine-tuning allows an LLM to adapt to specific domains and perform customized tasks more efficiently. Some work (Wen et al., 2023; Furuta et al., 2023; Sun et al., 2022; Humphreys et al., 2022; Lee et al., 2023b; Kim et al., 2023) uses supervised fine-tuning (SFT) to allow GUI agents to use new modal inputs, learn specific processes, or execute special tasks.

For instance, MobileAgent (Ding, 2024) extracts information from the DOM of app pages and integrates standard operating procedure (SOP) information to perform in-context learning on a fine-tuned Llama 2. Furuta et al. (2023) propose WebGUM for web navigation. WebGUM is based on a T5 transformer encoder-decoder framework with screenshots and HTML pages as inputs. It is jointly fine-tuned with an instruction-optimized language model and a vision encoder, incorporating temporal and local perceptual capabilities, and is trained on a substantial corpus of demonstrations. Evaluation results on MiniWoB show that WebGUM outperforms GPT-4-based agents. Zhang and Zhang (2023) introduce Auto-UI, a multimodal solution combining an image-language encoder-decoder architecture with a Chain of Actions policy, fine-tuned on the AitW dataset. The Chain of Actions captures intermediate previous action histories and future action plans.
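For concreteness, the sketch below shows one plausible way to serialize supervised fine-tuning samples for such agents: each demonstration step becomes a prompt built from the instruction, a textual screen description, and the action history, with the next action as the target string. The JSON-lines layout and field names are assumptions for illustration, not the format of any dataset cited above.

```python
import json

def to_sft_example(instruction: str, screen_text: str, history: list[str], next_action: str) -> dict:
    """Build one (prompt, target) pair from a recorded demonstration step."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Screen: {screen_text}\n"
        f"Previous actions: {', '.join(history) if history else 'none'}\n"
        f"Next action:"
    )
    return {"prompt": prompt, "target": next_action}

# Write a small JSON-lines file that a standard SFT trainer can consume.
steps = [
    ("Turn on airplane mode", "<p>Settings</p><button id=3>Network</button>", [], "click(id=3)"),
    ("Turn on airplane mode", "<checkbox id=7>Airplane mode</checkbox>", ["click(id=3)"], "click(id=7)"),
]
with open("gui_sft.jsonl", "w") as f:
    for instr, screen, hist, action in steps:
        f.write(json.dumps(to_sft_example(instr, screen, hist, action)) + "\n")
```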
4 Industrial Applications of (M)LLM-Based GUI Agents

Google Assistant for Android: By saying phrases like "Hey Google, start a run on Example App," users can use Google Assistant for Android to launch apps, perform tasks, and access content. App Actions, powered by built-in intents (BIIs), enhance app functionality by integrating with Google Assistant. This enables users to navigate apps and access features through voice queries, which the Assistant interprets to display the desired screen or widget.

Apple Intelligence: Apple Intelligence features on-device and cloud models running on Apple silicon, with a generic foundation model and specialized adapter models for tasks like summarization and tone adjustment. Evaluations show the on-device model outperforms or matches small models from Mistral AI, Microsoft, and Google, while the server models surpass OpenAI's GPT-3 and match GPT-4. Unlike services like ChatGPT, Apple runs its cloud models on proprietary servers with custom hardware. The system ensures software integrity by refusing connections if mismatches are detected.

New Bing: Microsoft's search engine is designed to offer users a more intuitive, efficient, and comprehensive search experience. Leveraging cutting-edge artificial intelligence and machine learning technologies, New Bing goes beyond traditional keyword searches to understand the context and intent behind user queries. This allows it to deliver more relevant results, personalized recommendations, and enhanced features like conversational search, image recognition, and real-time updates. With a sleek, user-friendly interface and deep integration with other Microsoft services, New Bing aims to redefine how people find information online, making access to the knowledge and insights they need faster and easier.

Microsoft Copilot: An AI tool integrated into Microsoft 365 apps that supports productivity with GPT-based suggestions, task automation, and content generation. It enhances workflows, creativity, and decision-making with real-time insights.
Anthropic Claude 3.5: The latest version of Claude 3.5 introduces a groundbreaking new capability, Computer Use, which allows Claude to interact with computers the way humans do: by viewing screens, moving cursors, clicking buttons, and typing text. Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete.

AutoGLM: A new series from the ChatGLM family designed for autonomous mission completion via Graphical User Interfaces on platforms such as phones and the web. Its Android capability allows it to understand user instructions autonomously without manual input, enabling it to handle complex tasks such as ordering takeout, editing comments, shopping, and summarizing articles.

MagicOS 9.0 YOYO: An advanced assistant with four main features: natural language and vision processing, user behavior and context learning, intent recognition and decision-making, and seamless app integration. It understands user habits to autonomously fulfill requests, such as ordering coffee through voice commands, by navigating apps and services.

5 Challenges

Despite the rapid developments and exciting achievements of previous work, the field of (M)LLM-based GUI agents is still at an initial stage. We summarize several significant challenges that need to be addressed as follows:

The Gap between Benchmark and Reality: Existing datasets and benchmarks are clearly divided into static and dynamic categories. A static benchmark typically stores an execution path as a sequence, where the goal is to predict the next action. In contrast, dynamic benchmarks require execution on simulators or real devices, where tasks must be fully completed. At present, the majority of both training and evaluation data is static. However, because (M)LLM-based GUI agents need to interpret extensive environmental status, existing datasets and benchmarks are inadequate for actual applications.

GUI Agent Self-evolution: Self-evolution aims to close the loop of the GUI agent improving itself. Zhang et al. (2023) introduce the concept of exploration, implementing it through documentation that automatically records operations and interface transition knowledge. Similarly, Li et al. (2017) and Wen et al. (2024b,a) propose automated frameworks to explore paths, summarizing them as a UI Transition Graph (UTG) for improved performance. However, effective exploration methods to fully realize this goal remain challenging.

Inference Efficiency: Humans are sensitive to the response time of GUIs. Typically, a delay under 200 milliseconds is acceptable; delays beyond this threshold can rapidly degrade the user experience. For current GUI agents, inference and communication delays are often measured in seconds, leading to poor user satisfaction. Addressing how to minimize these delays, or deploying the (M)LLM directly on mobile devices, is therefore a pressing issue.

6 Conclusion

In this paper, we systematically review the rapidly evolving research field of (M)LLM-based GUI agents. We examine these studies from three main perspectives: data sources, construction, and applications. We also provide a detailed taxonomy that connects existing research and summarizes the major techniques. Additionally, we propose several challenges and potential future directions for GUI agents leveraging foundation models.

References

Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. 2021. UIBert: Learning generic multimodal representations for UI understanding. arXiv preprint arXiv:2107.13731.

Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, and Lichao Sun. 2024a. GUI-World: A dataset for GUI-oriented multimodal LLM-based agents.

Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, et al. 2024b. SPA-Bench: A comprehensive benchmark for smartphone agent evaluation. In NeurIPS 2024 Workshop on Open-World Agents.
Tinghe Ding. 2024. MobileAgent: Enhancing mobile control via human-machine interaction and SOP integration.

Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, and Izzeddin Gur. 2023. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854.

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, and Mike Zheng Shou. 2024. ASSISTGUI: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108.

Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and Aleksandra Faust. 2022. Environment generation for zero-shot compositional reinforcement learning. arXiv preprint arXiv:2201.08896.

Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. 2018. Learning to navigate the web. arXiv preprint arXiv:1812.09195.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. 2023. CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914.

Peter C. Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, and Timothy Lillicrap. 2022. A data-driven approach for learning to control computers. arXiv preprint arXiv:2202.08137.

Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2023. ILuvUI: Instruction-tuned language-vision modeling of UIs from machine conversations. arXiv preprint arXiv:2310.04869.

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023a. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In Proceedings of the 40th International Conference on Machine Learning, pages 18893–18912. PMLR.

Sunjae Lee, Junyoung Choi, Jungjae Lee, Hojun Choi, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2023b. Explore, select, derive, and recall: Augmenting LLM with human-like memory for mobile task automation. arXiv preprint arXiv:2312.03003.

Gang Li and Yang Li. 2023. Spotlight: Mobile UI understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927.

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. 2024a. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679.

Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, and Yunchao Wei. 2024b. AppAgent v2: Advanced agent for flexible mobile interactions. arXiv preprint arXiv:2408.11824.

Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020. Mapping natural language instructions to mobile UI action sequences.

Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. 2021. VUT: Versatile UI transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692.

Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. DroidBot: A lightweight UI-guided test input generator for Android. In Proceedings of the 39th International Conference on Software Engineering Companion, ICSE-C '17, pages 23–26. IEEE Press.

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802.

Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, and Ping Luo. 2024. GUI Odyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. 2024. GPT-4 technical report.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the Wild: A large-scale dataset for Android device control. arXiv preprint arXiv:2307.10088.

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. 2023. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245.

Yunpeng Song, Yiheng Bian, Yongtao Tang, and Zhongmin Cai. 2023. Navigating interfaces with AI for enhanced user interaction. arXiv preprint arXiv:2312.11190.

Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. 2022. META-GUI: Towards multi-modal conversational agents on mobile GUI.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan. 2023. UGIF: UI grounded instruction following.

Bryan Wang, Gang Li, and Yang Li. 2023. Enabling conversational interaction with mobile UI using large language models. arXiv preprint arXiv:2209.08655.

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2Words: Automatic mobile UI summarization with multimodal learning. arXiv preprint arXiv:2108.03353.

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024a. Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.01014.

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024b. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception.

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024c. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Empowering LLM to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024a. AutoDroid: LLM-powered task automation in Android.

Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2024b. DroidBot-GPT: GPT-powered UI automation for Android.

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, and Lijuan Wang. 2023. GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation. arXiv preprint arXiv:2311.07562.

An Yang, Baosong Yang, Binyuan Hui, et al. 2024. Qwen2 technical report.

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models.

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-UI: Grounded mobile UI understanding with multimodal LLMs.

Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. 2024a. UFO: A UI-focused agent for Windows OS interaction.

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024b. Android in the Zoo: Chain-of-action-thought for GUI agents.

Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, Aaron Everitt, and Jeffrey P. Bigham. 2021. Screen Recognition: Creating accessibility metadata for mobile applications from pixels. arXiv preprint arXiv:2101.04893.

Zhuosheng Zhang and Aston Zhang. 2023. You Only Look at Screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a generalist web agent, if grounded.

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2023. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.