AXNav: Replaying Accessibility Tests from Natural Language

Maryam Taeb∗ (mr21cg@[Link]), Florida State University, USA
Amanda Swearngin (aswearngin@[Link]), Apple, USA
Eldon Schoop (eldon@[Link]), Apple, USA
Ruijia Cheng (rcheng23@[Link]), Apple, USA
Yue Jiang† ([Link]@[Link]), Aalto University, Finland
Jeffrey Nichols (jwnichols@[Link]), Apple, USA

arXiv:2310.02424v3 [[Link]] 5 Mar 2024

[Figure 1 overview panels: (a) Test Instructions (e.g., "VoiceOver: Share a Podcast Episode"); (b) Cloud Device (VNC); (c) LLM-Based Multi-Agent Planner; (d) AX Feature Control (Button Shapes, VoiceOver, Dynamic Text, Bold Text); (e) Action Execution, with Formatted UI Understanding Detections and Feasibility, Evaluation & Replanning; (f) Chaptered Video with Recording & Chaptering.]
Figure 1: AXNav interprets accessibility test instructions specified in natural language, executes them on a remote cloud device
using an LLM-based multiagent planner, and produces a chaptered video of the test annotated with heuristics that highlight
potential accessibility issues. To execute a test, AXNav provisions a cloud iOS device; stages the device by installing the target
app to be tested and enabling a specified assistive feature; synthesizes a tentative step-by-step plan to execute the test from the
test instructions; executes each step of the plan, updating the plan as needed; and annotates a screen recording of the test with
chapter markers and visual elements that point out potential accessibility issues.
ABSTRACT
Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs. However, to our knowledge, no one has yet explored the use of LLMs in controlling assistive technologies for the purposes of supporting accessibility testing. In this paper, we explore the requirements of a natural language based accessibility testing workflow, starting with a formative study. From this we build a system that takes a manual accessibility test instruction in natural language (e.g., "Search for a show in VoiceOver") as input and uses an LLM combined with pixel-based UI understanding models to execute the test and produce a chaptered, navigable video. In each video, to help QA testers, we apply heuristics to detect and flag accessibility issues (e.g., text size not increasing with Large Text enabled, VoiceOver navigation loops). We evaluate this system through a 10-participant user study with accessibility QA professionals, who indicated that the tool would be very useful in their current work and that it performed tests similarly to how they would manually test the features. The study also reveals insights for future work on using LLMs for accessibility testing.

∗Work done while Maryam Taeb was an intern at Apple.
†Work done while Yue Jiang was an intern at Apple.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CHI '24, May 11–16, 2024, Honolulu, HI, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0330-0/24/05.
[Link]

CCS CONCEPTS
• Human-centered computing → Accessibility systems and tools; Interactive systems and tools; • Computing methodologies → Multi-agent planning.

KEYWORDS
Accessibility, UI testing, Large language models

ACM Reference Format:
Maryam Taeb, Amanda Swearngin, Eldon Schoop, Ruijia Cheng, Yue Jiang, and Jeffrey Nichols. 2024. AXNav: Replaying Accessibility Tests from Natural Language. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 16 pages. [Link]

1 INTRODUCTION
Many mobile apps still have incomplete support for accessibility features [4, 23, 43, 56, 57]. Developers of these apps may not implement or test accessibility support due to a lack of awareness [4], organizational support [8, 43], or experience in accessibility testing [8]. For apps that do support accessibility features, developers often work in tandem with experienced accessibility quality assurance (QA) testers [8]. Employees in both roles may use automated tools such as accessibility scanners [2, 3], linters [27], and test automation [21, 55] to execute UI test scenarios. However, despite the many available tools, the majority of accessibility testing is still done manually. This may in part be due to the limitations of the tools themselves: UI tests can be brittle [32, 40] or non-existent [17, 30, 34], and scanners can produce false positives [50]. In addition, manual testing can reveal issues that cannot be detected by automated techniques alone [37].

However, manually testing all possible accessibility scenarios and features is costly and hard to scale. In a formative study with six accessibility QA testers, we found they often had difficulty keeping up with the scope of apps and features they were assigned to test. This causes testers to limit the scope of their tests, potentially letting bugs slip through, and can lead to test instructions becoming outdated. While research has addressed some of these challenges through automation [44, 45], there are still manual costs associated with writing and recording tests to be replayed. Recorded tests often need to be updated when the UI or navigation flow changes, similar to UI automation tests, which must specify each step of the navigation flow in code [32, 40].

To address some of these challenges and support the existing manual testing workflows of accessibility QA testers, we explore the use of natural language instructions to specify accessibility testing steps to a system. Manual test instructions are common artifacts within organizations, which often have large databases of manual steps for QA testers. Our system, AXNav, interprets natural language test instructions to produce a set of concrete actions that can be taken in an app, which it then adapts automatically as the interface evolves. AXNav executes these actions on a live cloud device, enabling and configuring accessibility features as needed, and runs heuristics on target screens to flag potential issues to manual testers. AXNav's output is a chaptered, annotated video that captures the interaction trace along with heuristic results.

Our approach is motivated by prior work that uses Large Language Models (LLMs) to recreate bug reports [22], test GUIs [36], and automate tasks for web interfaces [47]. To our knowledge, AXNav is the first work that uses LLMs for accessibility testing, or for controlling accessibility services [51] and settings [12].

The contributions of this work are:
• A formative study with 6 professional QA and accessibility testers revealing motivation and design considerations for a system to support accessibility testing through natural language instruction-based manual tests.
• A novel system, AXNav, that converts manual accessibility test instructions into replayable, navigable videos by using a large language model and a pixel-based UI element detection model. The system helps testers pinpoint potential issues (e.g., non-increasing text, loops) with multiple types of accessibility features (e.g., Dynamic Text, VoiceOver) and replays tasks through accessibility services to enable testers to visualize and hear the task as a user of the accessibility service might perform it.
• A user study with 10 professional QA and accessibility testers revealing key insights into how accessibility testers might use natural language-based automation within their manual testing workflow.

2 RELATED WORK
AXNav is most closely related to work that uses text instructions as input for UI automation, which is useful beyond accessibility use cases. Because we specifically target UI navigation from natural language for accessibility testing, we also review accessibility testing tools and approaches.

2.1 Large Language Models and UI Interaction
A key contribution of AXNav is its LLM-based planner that can navigate mobile apps to execute specific tasks or arrive at particular views. Our multi-agent system architecture is loosely based on ResponsibleTA, which presents a framework for facilitating collaboration between LLM agents for web UI navigation tasks [58]. Since AXNav is designed for testing rather than end-user automation, it removes some components (e.g., a system to mask user-specific information) and combines other modules (e.g., AXNav combines evaluation and completeness verification, and proposes actions and feasibility in the same step). These changes significantly reduce the number of LLM turns taken, which lowers cost and latency.

Other UI navigation systems for web and mobile apps have recently emerged. Wang et al. [52] describe prompting techniques to adapt LLMs for use with mobile UIs, and evaluate an LLM-based agent's ability to predict the UI element that will perform an action on a given screen. AXNav's UI navigation system builds upon this work by supporting more complex, multi-step tasks. Other works map from detailed, multi-step instructions to actions in mobile apps [22, 33, 49]. AutoDroid injects known interaction traces from random app crawls into an LLM prompt to help execute actions with an LLM agent [54]. AXNav can interpret a wide variety of instruction types, from highly specific step-by-step instructions to unconstrained goals within an app ("add an item to the cart"), without relying on prior app knowledge. Furthermore, AXNav is able to modify its plan when the UI changes, if it encounters errors, or if the test instructions are incorrect.

The emergence of LLM-based UI navigation systems has motivated the need for more interaction datasets.
Android in the Wild presents a large dataset of human demonstrations of tasks on mobile apps for evaluating LLM-based agents [41]. Other datasets, such as PixelHelp [33] and MoTiF [11], also collect mobile app instructions and steps. Unlike prior art, AXNav is designed to work on iOS apps, which can have different navigation flows and complexities than corresponding Android apps. Most importantly, none of the above works have been used to interact with accessibility features or support accessibility testing workflows. This is the core focus of AXNav's contribution.

2.2 Accessibility Testing Tools
Despite the availability of accessibility guidelines and checklists [1, 10, 53], linters and scanners [2, 3, 27], and platforms for test automation [21, 55], developers and QA testers still often prefer to test their apps manually [34, 35]. Testing manually by using accessibility services can reveal issues that cannot be revealed by scanners alone [37]. However, manual testing is costly and difficult to scale, leading to a variety of automated tools and testing frameworks being developed for accessibility testing [30].

There are a variety of tools to automatically check accessibility properties of apps [48]. Development-time approaches [27] use static analysis to examine code for potential issues. Run-time tools [2, 3, 21, 42] examine a running app to detect accessibility issues, which enables them to detect issues beyond static analysis; however, they still must be activated on each screen of the app to be tested.

Another approach is to automatically crawl the app to detect issues [4, 15, 20, 46]; however, such tools currently adopt random exploration and thus may not fully cover or operate the UI as an end-user might. These crawlers also do not operate through accessibility services, which leaves them unable to evaluate whether navigation paths through the app are fully accessible.

Latte [44] starts to bridge this gap by converting GUI tests for navigation flows into accessibility tests that operate using an accessibility service; however, the majority of apps still lack GUI tests [34] and often require updating the code to new navigation flows when a UI changes [40]. Removing the requirement for GUI tests to be available, A11yPuppetry [45] lets developers record UI flows through their app and replay them using accessibility services (i.e., TalkBack [24]). This idea has also been explored in prior work for web applications [9]. However, a key challenge with record-and-replay approaches is that they can also be brittle and difficult to maintain as the UI evolves [32, 40]. By using LLMs, AXNav can interpret plain text instructions at different levels of granularity and adapt them to new contexts when UIs change.

AXNav is not intended to fully scan apps for accessibility issues. Rather, it was designed to flag a subset of potential issues during test replay to aid manual accessibility QA testers, based on feedback from formative interviews. Our system architecture could also be extended to run accessibility audits during each step of the replay, similar to accessibility app crawlers [20, 46]; however, in this work we focus on navigation and replay through accessibility services and not on holistic reporting of accessibility issues.

Test case 1 (Title: iOS: VoiceOver: Search for a Show)
1. Go to Settings > Accessibility > VoiceOver, and enable VoiceOver (VO)
2. Launch the Media app
3. Search for a show and verify that everything works as expected and there are accurate labels
4. Turn off VO and verify that searching for a show works as expected

Test case 2 (Title: iOS: Media App: Dynamic Text in Search Tab)
1. In Settings > Accessibility > Display & Text Size, enable Larger Text and set it to the maximum size
2. Launch the Media App
3. Verify all text (titles, headers, etc.) font size has adjusted consistently
4. Set text size to minimum and repeat step 3
5. Reset text size to default and verify all text returns to normal

Test case 3 (Title: iOS: Media App: Button Shapes across app)
Expected Result: When testing Button Shapes, we want to make sure that all text (not emojis or glyphs) gets underlined if it is NOT inside of a button shape already. If the text is already within a button shape, it is a bug! (We see this bug frequently)

Figure 2: Three sample test cases for a video streaming media app testing the accessibility features of VoiceOver, Dynamic Type, and Button Shapes. Testing instructions typically consist of a title containing the app and feature under test, and a set of manual test instructions in natural language. The tests may also contain expected result descriptions. Some tests have specific, low-level instructions (1, 2) and others give only a high-level instruction (3).

3 FORMATIVE INTERVIEWS
To better understand the challenges and benefits of manual accessibility testing and elicit requirements for AXNav, we recruited six accessibility QA professionals through snowball recruiting at a large technology company. Participants spanned four product and services teams across four organizations, and had a minimum of 3 years of professional experience in accessibility and QA testing of iOS mobile apps. We conducted 30-minute remote interviews with each participant.

We divided our formative study into two parts. In the first part, we asked participants about the challenges and benefits of manual accessibility testing, their cadence for performing manual tests, and whether and how they write testing instructions. We also asked them to describe the areas and features they tested and to demonstrate a manual test for an app and feature of their choice.

From our domain knowledge and review of prior work, we hypothesized that a significant portion of time spent testing was manually navigating to specific screens in apps, and that a system to automatically perform this navigation from existing manual test instructions would be useful. The second half of the formative study was designed to check this assumption and elicit features that would be useful for a system to help support manual testing. In this phase, we played a screen recording of an author manually performing an accessibility test from an internal database of existing tests (Figure 2, test 1: "Search for a Show" in a media app using VoiceOver). We asked participants to imagine a system replaying the test instructions on the device and instructed them to think aloud while watching the screen recording, noting any features an automated tool should support. We asked about the benefits and drawbacks of this functionality and how it might be used in testing workflows, if at all. We include the full set of formative interview questions in our supplementary materials.

3.1 Challenges & Benefits of Manual Testing
Participants noted a key benefit of manual testing is to experience the feature as an end user might (P2-P5). One participant, P3, being a VoiceOver user, mentioned this enables them to more realistically test the feature as it is meant to be used: "the advantages are that we can test literally, from the user perspective myself, and a number of my teammates are users of the features because of various accessibility needs that we have. So we are the foremost experts in the functionality of those particular tests and what the expected results would be." (P3)

Participants from three teams brought up challenges, including an overwhelming scope of features and scenarios, leaving them to target only a few key features and tasks for testing (P2, P4-P6). P5 stated: "It's not like necessarily difficult. It is just like, repetitive and kind of boring and the scope is so big a lot of times like, if you're looking at the <App Name Anonymized> app, there's so many pages and so many views and so many buttons and different types of elements and everything. That is overwhelming and you feel you're going to miss something".

Writing manual tests was also noted as a challenge by participants from three teams, who write down or have existing test suites of manual instructions (P3-P6). Participants noted it was easy for those tests to become outdated when apps are updated, challenging less experienced QA testers' ability to interpret and follow test instructions (P3, P4). Finding the right time for accessibility testing was also mentioned by four participants, as they worked with apps that are frequently updated across various product milestones (P3-P6). Two participants also mentioned trying to develop automated tests in their work, which they described as easily breaking and not covering all possible scenarios (P5, P6).

3.2 Testing Process
All participants took part in accessibility testing at various times throughout the product lifecycle. They tested annually as new features were added, or on regular release cycles of app interfaces. The participants' daily work consists of manually performing tests for accessibility features (e.g., VoiceOver, Dynamic Type) across various products, or additionally writing accessibility frameworks and automation code.

To test purely visual accessibility features, the participants typically toggle on the feature under test and validate that the app's UI renders or behaves correctly based on the setting. For accessibility services tests (e.g., VoiceOver), they typically enable the feature, and then either navigate the app to perform a task using the feature or navigate to a specific screen to validate the navigation order or another behavior of the feature.

3.3 Granularity and Availability of Manual Accessibility Test Instructions
Manual testing instructions are an extremely common artifact within our organization, existing in both manual test databases and bug-tracking tools. One team we interviewed (two participants) noted they own a large database of manual instructions for UI tests (P3, P4), but none of these instructions are specifically for accessibility testing. They also noted that they frequently write down manual instructions for accessibility features, or "repro steps", when they are filing bugs. In their work, they often work with engineers who may lack familiarity with the accessibility feature under test, so they try to make instructions as specific as possible.

For another two participants on a different team, their accessibility testing instructions primarily consisted of a large regression test suite across ten apps with 300 individual test cases they perform annually (P5, P6). Among these tests, some had concrete low-level steps, but many were abstract, high-level, and assumed the QA tester has a high level of expertise on both the app and the accessibility feature to be tested. Figure 2 contains three example test cases for a video streaming app for the accessibility features VoiceOver, Dynamic Type, and Button Shapes. Each test case typically has a title containing the platform, feature, and app to be tested, but only some test cases have step-by-step instructions, and only some test cases have an "Expected Result" specified.

3.4 Features in a Natural Language-Based Accessibility Testing Tool
In the second part of our formative study, we elicited features by having participants imagine a system replaying manual testing instructions on an iPhone, while watching a screen recording of one of the authors performing a manual test. The video was a screen recording only and had no additional features. We then asked participants what features such a system should support in the context of accessibility testing. Here we summarize the key features revealed by both this task and part one of our interviews that we incorporated into the design of AXNav.

3.4.1 F0: Natural Language Interpretation and Replay. Our QA testers liked to observe the behavior of the interactions as they were performing manual testing. They wished for more automation in their workflows, but did not have time to spend writing and updating automated tests. They also often already had large databases of manual testing instructions available. Thus one goal of our work was to enable testers to use their existing testing instructions, written at multiple levels of abstraction, as input to a system that can interpret those instructions and replay them on a device. We hypothesized such a system could complement testers' workflows through automation without requiring writing and updating fully automated tests.

3.4.2 F1: Quickly Navigate and Visualize Executed Steps. To provide QA testers with the benefit of observing tests as an end user, we record videos of each test for the tester to examine. While watching the video demonstration of the test, multiple participants requested to review portions of the video multiple times to better understand what action the system took and to further examine screens for potential bugs. To improve video navigation, we add chapter labels to the video that either indicate the action taken or flag potential issues. We also annotate system actions on the impacted video frames with a pink '+' cursor. The chapters also allow users to skip back and repeat watching key segments quickly (Figure 1.f).

We also received feedback from two participants during our interviews requesting the system to let them replay the instructions on a live local device and take control during various parts of the test (P3, P4). This would be a more useful interaction particularly for P3, a VoiceOver user, as they were unable to interact with the
UI in the video format. Due to current constraints with our system architecture, we did not provide this in AXNav, but will explore the feasibility of supporting this along with providing a video for post-replay review.

3.4.3 F2: Flag Potential Issues. Four participants mentioned that they would like the system to flag potential issues and report failures. When asked what specific issues would be most helpful to flag, participants mentioned both visual issues like Dynamic Type resizing, and accessibility feature navigation issues (e.g., wrong navigation order, elements missing a label or not available for navigation). Participants noted that if a system could direct them to target their testing towards any potential issues, that would save time in bug filing: "If it could detect the issue and write it down or like, ..., that would be helpful so that I can write bugs or maybe bug can be automated." (P6) Based on this feedback, we developed custom heuristics in AXNav to flag a small subset of accessibility issues to evaluate the feasibility and potential impact of this idea. We use the video output to flag issues by adding a chapter label at the location of the potential issue in the video.

Some participants also requested the system to save screenshots in addition to the video output (F1) so that when they find issues, they can directly upload the screenshots to a bug-tracking tool: "if your product did that, I think that would be a huge time saver because most of my time is taking screenshots and clipping" (P5). Screenshots also enable AXNav to flag potential visual accessibility issues through postprocessing.

3.4.4 F3: Realistic VoiceOver Navigation and Captioning. In the video recording of the manual test we showed participants, we activated the UI elements for each step directly as a sighted user might, rather than swiping through elements on a screen to find UI elements as a non-sighted user might. Several participants noticed this, and noted that the system should replay the test to be as similar as possible to how a user of the accessibility feature performs the task (P2, P3-P5). Additionally, our video also included the VoiceOver captions panel for this task, which three participants mentioned was an important feature to include in our final system.

3.4.5 F4: Perform Tests With and Without Accessibility Features. Our participants shared many manual test scripts that instructed testers to perform tests with and without the accessibility feature under test toggled on. For example, the "Search for a Show" test in Figure 2 instructs the tester to first turn on VoiceOver to perform the test, and to perform the same test after turning off VoiceOver. As participants noted, testing with the feature turned on and off helps QA testers verify whether the system returns to the correct state after turning off the feature under test. Thus, AXNav repeats the navigation steps twice for most tests, first replaying the test with the feature on and then replaying the test with the feature off.

4 AXNAV SYSTEM
Based on our formative interviews with QA testers, we designed and built AXNav, a system that interprets an accessibility test authored in natural language and replays the test instructions on a mobile device while manipulating the accessibility feature to be tested (F0; subsubsection 3.4.1). AXNav interprets plain text instructions, which can be authored at varying levels of specificity, to navigate to a desired view to be tested. It then outputs a chaptered video that a tester can navigate and replay (F1; subsubsection 3.4.2), annotated with heuristics that flag potential issues in the app (F2; subsubsection 3.4.3). AXNav currently supports controlling and flagging issues with four accessibility features: VoiceOver, a gesture-based screen reader [51]; Dynamic Type, which increases text size; Bold Text, which increases text weight; and Button Shapes, which ensures clickable elements are distinguishable without color, typically by adding an underline or button background (Figure 1.d). We selected these features since, based on our interviews, they seemed to provide good coverage of real-world testing needs across different modalities. AXNav could be extended to other accessibility and device features in the future. For each user-provided test, AXNav executes the test on a specified app both with and without the specified assistive feature activated for comparison (F4; subsubsection 3.4.5).

AXNav consists of three main components that are used to prepare for, execute, and export test results: (1) Device Allocation and Control, (2) Test Planning and Execution, and (3) Test Results Export. These components work together to provision and stage a cloud iOS device for testing, automatically navigate through an app running on the cloud device to execute the test, and collect and process test results.

4.1 Device Allocation and Control
Before executing a test, AXNav provisions a remote cloud iOS device and prepares it according to the parameters it extracts from the test instructions. AXNav extracts the name of the app to be tested and the assistive technology to use in the test (e.g., Dynamic Type) from the instructions to automatically install the app and select the assistive feature to test. Instructions typically take the form of those shown in Figure 2.

During setup, AXNav installs a custom application that provides an interface to operating system APIs that silences several system notifications, controls screen recording, and interacts with assistive technologies. AXNav uses an operating system API to toggle and configure the specific accessibility feature under test (e.g., Dynamic Type size). If the test is for VoiceOver, AXNav activates the caption panel (F3; subsubsection 3.4.4) and sets the speaking rate to 0.25 to accommodate speeding up the exported video in the Test Results Export step.

When the device is ready for the test to be executed, AXNav launches the app under test and begins screen recording. The test execution engine can interact with the cloud device over a remote desktop connection and through the accessibility-specific features supported by the custom application (see subsubsection 4.2.4).

4.1.1 Accessibility Feature Control and Replay. AXNav uses different sequences to test supported accessibility features. For tests with Dynamic Type, the system launches the target application, increases the Dynamic Type size, navigates to the target screen specified in the test, takes a screenshot, kills the application, and repeats this process for all four Dynamic Type sizes and, finally, without Dynamic Type on. This enables testers to observe the corresponding changes on the screen as the size is increased.

For Bold Text and Button Shapes, AXNav navigates to the target screen specified in the test with and without the feature enabled, and

Figure 3 contents:
(a) Preparatory prompt: "Provided a goal by a user, formulate a step-by-step plan that accomplishes their goal with the current app. UI elements on the user's current screen are provided [...]"
(b) Test instructions: "Start the stopwatch using VoiceOver"
(c) Device Allocation and Control: 1. Provision cloud iOS device; 2. Activate VoiceOver; 3. Launch the Clock app; 4. Once the test is complete, kill the Clock app and rerun the test with VoiceOver off
(d) Test Plan Proposal: 1. The user is currently in the 'World Clock' tab. Tap on the 'Stopwatch' tab. 2. Tap on the 'Start' button. The stopwatch should start running.
(e) Representing UI Elements:
ID: 0 Label: Text, Text: Edit, BoundingBox from (38, 159) to (171, 252)
ID: 1 Label: Icon (Type: add), BoundingBox from (1138, 157) to (1239, 255)
ID: 2 Label: Text, Text: World Clock, BoundingBox from (37, 277) to (661, 411)
ID: 3 Label: Button, Text: Today, -7HRS, Cupertino, 1:39PM, BoundingBox from (0, 427) to (1284, 721)
ID: 4 Label: Button, Text: Today, -4HRS, New York, 4:39PM, BoundingBox from (0, 717) to (1284, 1007)
ID: 5 Label: Tab, Text: World Clock, BoundingBox from (0, 2542) to (346, 2692)
ID: 6 Label: Tab, Text: Alarm, BoundingBox from (346, 2542) to (624, 2692)
ID: 7 Label: Tab, Text: Stopwatch, BoundingBox from (624, 2542) to (979, 2692)
ID: 8 Label: Tab, Text: Timer, BoundingBox from (979, 2542) to (1267, 2692)
(f) Take Actions From Plan Steps:
1. {'action': {'type': 'tap', 'element_id': 7}}  # Stopwatch tab
2. {'action': {'type': 'tap', 'element_id': 2}}  # Start button
(g) Evaluate Action Results: 1. The 'Stopwatch' tab is now active and the 'Start' button is visible. Success. 2. The 'Start' button has changed to 'Stop', and the time has started to increase. Success.
(h) Test Results Export: No VoiceOver loops detected. No missing UI elements detected.

Figure 3: Overview of intermediate steps used by AXNav to interpret natural language test instructions; provision and stage a device for testing; formulate and execute a plan to navigate the UI for the test; and export the test results.

saves pairwise screenshots of each tested screen with the feature on and off for comparison.

For VoiceOver, AXNav replays the instructions once with VoiceOver toggled on, and again with VoiceOver off.
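To make these per-feature sequences concrete, below is a minimal sketch of the replay loops for Dynamic Type and for the visual toggle features (Bold Text, Button Shapes). The device-control helpers (set_dynamic_type_size, set_feature, launch, kill, screenshot) and the size labels are hypothetical stand-ins for AXNav's internal cloud-device APIs, which the paper does not name; VoiceOver tests follow the same pattern with the feature toggled on and then off.

```python
# Hypothetical sketch of the feature-control sequences described in Section 4.1.1.
DYNAMIC_TYPE_SIZES = ["large", "x-large", "xx-large", "xxx-large"]  # assumed labels


def replay_dynamic_type_test(device, app, navigate_to_target):
    """Capture the target screen at each Dynamic Type size, then with it off."""
    screenshots = {}
    for size in DYNAMIC_TYPE_SIZES + [None]:           # None = Dynamic Type off
        device.set_dynamic_type_size(size)
        device.launch(app)
        navigate_to_target(device)                      # LLM-driven navigation
        screenshots[size] = device.screenshot()
        device.kill(app)                                # reset app state between sizes
    return screenshots


def replay_toggle_test(device, app, feature, navigate_to_target):
    """Bold Text / Button Shapes: pairwise screenshots with the feature on and off."""
    pair = {}
    for enabled in (True, False):
        device.set_feature(feature, enabled)
        device.launch(app)
        navigate_to_target(device)
        pair[enabled] = device.screenshot()
        device.kill(app)
    return pair
```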
4.2 Test Planning and Execution
AXNav uses an LLM-based UI navigation system that can translate natural language test instructions into a set of actionable steps, execute the steps on a live device by calling APIs that interact with the device, and feed results back to improve the navigation plan (see Figure 4). We use OpenAI GPT-4 [38] in our implementation, but AXNav can be easily adapted to use other LLMs. Our system architecture is loosely inspired by ResponsibleTA [58], but eliminates some elements (e.g., masking LLM inputs) and merges others (e.g., combining feasibility with actions). It consists of three LLM-based agents: the planner agent, the action agent, and the evaluation agent. To provide device state to the LLM agents, we use existing pixel-based machine learning models to recognize UI elements, text, and icons [14, 57]. AXNav formats detected UI elements as text strings to be ingested by the LLM, described in subsubsection 4.2.1. To interact with the device, AXNav provides tools that the LLM invokes to send touch or keyboard input events and VoiceOver gestures.
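The interplay of the three agents can be summarized as a plan/act/evaluate loop (Figure 4). The sketch below assumes hypothetical wrappers call_planner, call_action_agent, and call_evaluator around the LLM prompts described in the following subsections, plus a device object exposing the UI-detection and input tools; it illustrates the control flow under those assumptions rather than AXNav's actual implementation.

```python
# Sketch of the Figure 4 loop: propose a plan, execute each step, evaluate, replan on failure.
def run_test(test_instructions, app_name, device, max_replans=5):
    ui = device.formatted_ui_detections()               # pixel-based UI detections as text
    plan = call_planner(test_instructions, app_name, ui)
    step_idx, replans = 0, 0
    while step_idx < len(plan["steps"]):
        step = plan["steps"][step_idx]
        ui_before = device.formatted_ui_detections()
        action = call_action_agent(step, ui_before)      # maps step text -> tool call
        device.execute(action)                           # Tap / Swipe / TextEntry / VoiceOver gesture
        ui_after = device.formatted_ui_detections()
        verdict = call_evaluator(test_instructions, plan, action, ui_before, ui_after)
        if verdict["result"] == "success":
            step["status"] = "success"
            step_idx += 1
        else:                                            # feed the explanation back and replan
            replans += 1
            if replans > max_replans:
                break
            plan = call_planner(test_instructions, app_name, ui_after,
                                previous_plan=plan, current_step=step_idx,
                                error=verdict["explanation"])
    return plan
```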

[Figure 4 workflow: the Planner/Replanner proposes a step-by-step plan that meets the test's goal (Tentative Plan); for each action, the system steps through the plan (Action) and uses tools to act on the UI (Tap/Swipe/Text); an Evaluation step assesses the results of taking the action and either continues with the next step or triggers a replan.]
Figure 4: Planning and replanning workflow of our LLM-Based Multi-Agent Planner.

4.2.1 Test Plan Proposal. The planner agent is the heart of AXNav (Figure 4), and it formulates a tentative plan containing instructions to navigate to a desired view in an application from its current state. The planner agent takes as input the accessibility test instructions (Figure 3.b), the name of the app under test, and the formatted UI element detections from a screenshot of an iOS device. The planner agent's prompt contains instructions to formulate a tentative plan (Figure 3.a; Figure 4, Tentative Plan) to accomplish the test goal with the current app and the set of actions that can be taken in a step. To adapt to changes in the UI or unexpected errors (e.g., permissions request dialogs), the prompt includes instructions to traverse backward through the app if an unexpected state is encountered, and to accept an imperfect plan if needed, since it can be revised later. The planner agent's prompt also instructs the model to provide reasonable search queries if the test does not specify them, based on the app name and the current context of the screen.

The expected output of the planner agent is a JSON-formatted object that contains a list of steps. Each step contains a thought designed to facilitate Chain-of-Thought (CoT) reasoning [31] that answers how the step will help achieve the user's goal; an evaluation, which suggests criteria to determine task success; an action, a brief, specific description of an input to provide on a given screen (e.g., tap, swipe, enter text); and a status field, which is initialized as "todo" and updated to "success" when a step is executed correctly. An illustrative plan is shown in Figure 3.d.
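For illustration, a planner response for the stopwatch test of Figure 3 might look like the following; the field names follow the schema described above, while the values are invented for the example.

```python
# Illustrative (not actual) planner output for "Start the stopwatch using VoiceOver".
plan = {
    "steps": [
        {
            "thought": "The stopwatch lives in the Stopwatch tab, so open that tab first.",
            "evaluation": "The Stopwatch tab is selected and a Start button is visible.",
            "action": "Tap on the 'Stopwatch' tab in the tab bar.",
            "status": "todo",
        },
        {
            "thought": "Starting the stopwatch accomplishes the user's goal.",
            "evaluation": "The Start button changes to Stop and the elapsed time increases.",
            "action": "Tap on the 'Start' button.",
            "status": "todo",
        },
    ]
}
```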
4.2.2 Representing UI Elements to the Agents. AXNav describes the UI to the LLM as a list of UI elements in plain text, where each element contains an incrementing integer as an id; the classification of the UI element (e.g., Icon, Toggle); text contained by the UI element, if any; and the coordinates of the bounding box around the element (Figure 3.e). For example, an element with ID 3 might appear as: (3) [Button (Clickable)] "Try It Free" (194, 1563) to (1042, 1744). AXNav uses this simplified list because it economizes on tokens, unlike prior approaches that format UI elements as JSON or HTML [22, 58].

AXNav infers the elements in a UI using the Screen Recognition model from Zhang et al. [57] to predict bounding boxes, labels, text content, and the clickability of UI elements from screenshot pixels of iOS devices. Using pixels to detect UI elements makes AXNav agnostic to the underlying UI framework [18]. AXNav groups and sorts detected elements in reading order, and flags an element if it is recognized as a top-left back button, using the postprocessing approaches from [57]. AXNav also detects the presence of a keyboard (to hint that a text field is selected) by detecting single-character OCR results on the lower third of a screenshot. If AXNav detects a keyboard, it filters out all UI elements detected on the keyboard except for a submit button ("return", "search", "go", etc.).
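A minimal sketch of this serialization step is shown below, assuming the detection model's output has already been sorted into reading order; the field names on the detection dictionaries are assumptions for illustration, not the model's actual output format.

```python
# Hypothetical serialization of UI detections into the plain-text list the agents consume.
def format_ui_elements(detections):
    lines = []
    for idx, det in enumerate(detections):
        label = det["type"]                              # e.g., "Button", "Tab", "Icon", "Text"
        if det.get("clickable"):
            label += " (Clickable)"
        text = f' "{det["text"]}"' if det.get("text") else ""
        (x1, y1), (x2, y2) = det["bbox"]
        lines.append(f"({idx}) [{label}]{text} ({x1}, {y1}) to ({x2}, {y2})")
    return "\n".join(lines)

# Example output line, matching the format quoted above:
# (3) [Button (Clickable)] "Try It Free" (194, 1563) to (1042, 1744)
```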

4.2.3 Mapping from Plan Steps to Concrete Actions. For each step in the plan proposed by the planning prompt, AXNav implements an LLM-based "action agent" to map from the text instruction to a concrete action (Figure 4, Action) to take on a particular UI element (Figure 3.f), inspired by prior work [22, 33]. This agent performs several critical subtasks to navigate UIs in a single step: it identifies how to map a natural language instruction to the specific context of a UI, evaluates the feasibility of the requested action, and produces arguments for a function call to execute the task. The action agent's subtask-to-action prompt contains instructions to output a specific action to take on a given screen, represented by the formatted UI detections. The available actions are:
• Tap: Tap a UI element given its ID. The prompt instructs the agent that tapping an object inferred to be non-clickable is acceptable if it is the only reasonable option on a screen.
• Swipe: Swipe in a cardinal direction (up/down/left/right) from a specified (x, y) coordinate. The system tells the agent that swiping can be used to scroll to view more options available on a screen if needed.
• TextEntry: Tap a UI element given its ID and then enter a given text string by emulating keystrokes. The agent is told to come up with appropriate text if it is not provided.
• Stop: Stop execution of the current step and prepare feedback for the replanner to update the plan as needed. The feedback must specify what information is needed in an updated plan.

The output of the action agent is a JSON-formatted object that contains a thought to elicit CoT reasoning; relevant UI IDs, a list of UI elements the agent considers relevant (also to elicit CoT reasoning); and a single action, which specifies a function call in JSON to execute interactions on the device.

4.2.4 Executing VoiceOver Actions. For action execution in VoiceOver, the system interacts with the device through VoiceOver's accessibility service. We implement this in a custom application that provides an interface to a Swift API (built on top of XCTest [55]) that can trigger key VoiceOver gestures [51] for AXNav. These gestures are executed in the same way a user of VoiceOver would perform them (F3; subsubsection 3.4.4). Supported gestures are as follows:

Right swipe through all elements (read-all). This command triggers the VoiceOver Right Swipe gesture multiple times to navigate through all exposed elements on the screen, typically in a top-left to bottom-right ordering. Our system limits the number of elements navigated to 50 to save time and avoid getting stuck in loops or screens with infinite scroll. After right-swiping through the first 50 elements, the system activates the first tab, if it exists, and navigates through all tabs from left to right in the tab bar.

Activate an element (activate-from-coordinates). This command issues VoiceOver's Right Swipe and Double Tap gestures to locate and activate an on-screen element. In our formative interviews, our prototype video demonstrated the "Search for a Show" task in VoiceOver by directly navigating to relevant UI elements using the VoiceOver Tap gesture followed by Double Tap. However, participants gave us feedback that they preferred the demonstration to be more similar to how a non-sighted user would find and activate a UI element: using Right Swipe to navigate through UI elements to find the target element, and then activating it using Double Tap (F3; subsubsection 3.4.4). To confirm this, we observed a screen reader user performing the "Search for a Show" task, who followed a roughly similar pattern.

activate-from-coordinates takes as input the x and y coordinates corresponding to the center of the UI detection bounding box to be activated, and the UI Type label from the UI detection model (e.g., Tab). If the UI Type is Tab, the system navigates the VoiceOver cursor directly to the leftmost tab element, uses the Right Swipe gesture to swipe to the first tab containing the x and y coordinates, and then activates it using Double Tap. If the UI Type is not Tab, the system navigates forward from the current element using Right Swipe until it reaches the last VoiceOver element or finds an element containing x and y, which it activates using Double Tap. If the system does not find the element, it navigates backward using Left Swipe until it reaches the first VoiceOver element; if it finds an element containing x and y along the way, it activates it using Double Tap. If the system does not find an element containing the coordinates, the command returns without activating any element.
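The activate-from-coordinates behavior can be sketched as follows, assuming a hypothetical vo wrapper over the Swift/XCTest bridge that exposes right_swipe(), left_swipe(), double_tap(), move_to_first_tab(), and focused_element(); the real gesture API is not shown in the paper.

```python
# Sketch of activate-from-coordinates under the assumed `vo` wrapper described above.
def activate_from_coordinates(vo, x, y, ui_type):
    def contains(elem):
        return elem is not None and elem.frame.contains(x, y)

    if ui_type == "Tab":
        vo.move_to_first_tab()                   # jump the cursor to the leftmost tab
        while not contains(vo.focused_element()):
            if not vo.right_swipe():             # ran past the last tab without a match
                return False
        vo.double_tap()
        return True

    # Forward pass: Right Swipe until an element containing (x, y) is focused.
    while True:
        if contains(vo.focused_element()):
            vo.double_tap()
            return True
        if not vo.right_swipe():                 # reached the last VoiceOver element
            break

    # Backward pass: Left Swipe back toward the first element.
    while True:
        if contains(vo.focused_element()):
            vo.double_tap()
            return True
        if not vo.left_swipe():                  # reached the first element without a match
            return False
```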
Scroll (Up/Down/Left/Right) (scroll-<direction>). This command issues the VoiceOver Three Finger Swipe gesture, which scrolls the current screen by one page in the given cardinal direction.

To prevent the VoiceOver caption panel from interfering with the UI detection model's assessment of the state of the app, the system removes the caption panel from the formatted UI detections using a heuristic based on a fixed height derived from the device dimensions. When the input test instructions specify a task that requires navigating through multiple UI elements and screens, the system triggers VoiceOver navigation using activate-from-coordinates when the action agent instructs a TextEntry or Tap action. If the action agent instructs the system to perform a Scroll action, the system calls the corresponding scroll-<direction> action in VoiceOver. If the instructions state to navigate to a specific screen to verify the VoiceOver elements and navigation order, the system calls read-all once it reaches the final step of UI navigation, to swipe through all exposed elements on the screen. This enables testers to determine whether all elements within that screen are accessible by VoiceOver.

4.2.5 Evaluation and Replanning. Once an action is executed on the device, AXNav uses a third LLM-based "evaluation agent" to evaluate the results of the taken action (Figure 4, Evaluation). An illustrative example of evaluation output is shown in Figure 3.g.

AXNav prompts the evaluation agent with the test goal, the entire current tentative plan, the action JSON object (including the function call and "thought"), the UI detections of the screen before the action was taken, and the UI detections of the screen after the action was taken. The prompt also includes evaluation hints designed to reduce navigation errors: if UI elements significantly change, the action likely succeeded; if the state of the current screen changes but a new view is not opened, err on the side of the action succeeding; if the last action was a scroll or swipe but the screen did not change, the action likely failed; if the target element is not visible, more scrolling may be required; and if the last action was to click on a text field, the evaluation should be whether a keyboard is visible.

The output of the evaluation agent is a JSON object that contains evaluation_criteria, to encourage CoT reasoning; a result of success, failure, or task completion; and an explanation, which the system feeds back into the planner to revise the plan if the evaluation fails.

If the evaluation result is positive, execution proceeds with the action agent being prompted with the next step in the plan. If the evaluation result is negative, the planner agent is prompted to replan, which updates the tentative plan from the current step onwards. The planner agent's replanning prompt is similar to the initial planning prompt, but includes the previous plan, the current step being executed, and information about the stop condition or evaluation error. The resulting JSON output contains a new tentative plan, revised from the current step onward.

Figure 5: Examples of issues flagged by our heuristics for Button Shapes (left) and Dynamic Text (right). The Button Shapes heuristic flags the Collections row, which has a button shape and is also underlined (a possible bug). The Dynamic Type heuristic flags several text elements with red boxes indicating the size has not increased with the DT size update (a possible bug).

4.3 Test Heuristics
AXNav can currently flag four types of potential accessibility issues in the output video: VoiceOver navigation loops and missing elements, Dynamic Type text resizing failures, and Button Shapes failures (see Figure 5).

4.3.1 VoiceOver loop detection and missing VoiceOver elements. Our system detects loops in VoiceOver navigation order during the activate-from-coordinates and read-all commands. To detect loops, the system maintains a list of all visited VoiceOver elements and flags a looping bug if any element is revisited during the command. To enable the system to navigate the remaining task steps, the system attempts to break out of the loop by finding the next VoiceOver element below the element where the looping was detected, navigating to it, and either continuing with read-all or activate-from-coordinates.
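A sketch of the loop-detection bookkeeping is given below, under the assumption that each VoiceOver element can be reduced to a stable key; element_key is hypothetical (e.g., a label plus frame), as are the wrapper methods on vo.

```python
# Sketch of the loop-detection heuristic used during read-all and activate-from-coordinates.
def swipe_through_elements(vo, max_elements=50):
    visited, issues = set(), []
    for _ in range(max_elements):                # cap to avoid infinite scroll
        elem = vo.focused_element()
        key = element_key(elem)                  # hypothetical stable identifier
        if key in visited:
            issues.append({"type": "voiceover_loop", "element": key})
            break                                # caller then breaks out of the loop
        visited.add(key)
        if not vo.right_swipe():                 # no further elements on this screen
            break
    return visited, issues
```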

4.3.2 Missing VoiceOver elements. A typical accessibility error occurs when an element detected by AXNav's UI element detection algorithm cannot be navigated by VoiceOver. The system flags this issue when a VoiceOver element cannot be found during activate-from-coordinates.

4.3.3 Dynamic Type. The Dynamic Type heuristic determines whether text elements and their associated icons increase in size when the system-wide Dynamic Type size is increased. The heuristic takes two inputs: a screenshot of a view with a baseline text size, and another screenshot of the same view with the Dynamic Type size increased by one increment.

The heuristic first uses a UI element detection model [57] on each screenshot to recognize text elements and perform OCR [6]. The heuristic then uses fuzzy string matching with Levenshtein distance to find corresponding text elements between the two screenshots, with a partial similarity threshold set to 50%. The heuristic excludes elements without matches. For a text element to pass the heuristic, its corresponding UI element must increase in size by an adjustable threshold, set to 10%, compared to the baseline screenshot.

To identify icons paired with text elements, which should typically scale along with the text, the heuristic greedily matches icons to text elements in both screenshots by minimizing the distance between the icon's right bounding box coordinate and the text element's left coordinate. To remove icons that are not to the immediate left of the text, the heuristic excludes icons with a gap of more than half the icon's width to the right text element, or whose top and bottom are not bounded by the text element's bounding box. The heuristic pairs icons with their adjacent text elements and applies the same 10% threshold in bounding box area to pass the heuristic.
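The matching-and-growth check could look roughly like the following; difflib stands in here for the Levenshtein-based fuzzy matcher the system uses, and the shape of the input element dictionaries is an assumption.

```python
# Sketch of the Dynamic Type heuristic: fuzzy-match text elements across screenshots,
# then require each matched element's bounding-box area to grow by the 10% threshold.
from difflib import SequenceMatcher


def area(box):
    (x1, y1), (x2, y2) = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def text_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dynamic_type_issues(baseline_texts, enlarged_texts, sim_threshold=0.5, growth=1.10):
    issues = []
    for elem in baseline_texts:                   # [{"text": ..., "bbox": ...}, ...]
        match = max(enlarged_texts, default=None,
                    key=lambda other: text_similarity(elem["text"], other["text"]))
        if match is None or text_similarity(elem["text"], match["text"]) < sim_threshold:
            continue                              # unmatched elements are excluded
        if area(match["bbox"]) < growth * area(elem["bbox"]):
            issues.append({"type": "text_not_resized", "text": elem["text"],
                           "bbox": match["bbox"]})
    return issues
```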

4.3.4 Button Shapes. The Button Shapes heuristic determines, for a given screenshot, whether clickable text outside of a clickable container is underlined. This heuristic takes a single screenshot of a view with Button Shapes activated. The heuristic uses a UI element detection model [57] to locate and classify elements in the UI, along with their predicted clickability. For every clickable container element (Buttons and Tabs), the heuristic flags any contained element that is also underlined, which indicates a bug. For any uncontained text element predicted as clickable, the heuristic flags it if it is not underlined.

The heuristic detects underlines in text elements by extracting the image patch of the text bounding box, binarizing the patch using Otsu's method [39], edge-detecting the image with the Canny edge detector [13], and using the Hough Line transform [19] to detect any horizontal line that spans at least 75% of the width of the patch. If a text element is underlined when it should not be (or vice versa), it fails the heuristic.
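The underline detector can be sketched with OpenCV as below; the Canny and Hough parameter values are illustrative rather than the system's actual settings.

```python
# Sketch of the underline detector: Otsu binarization, Canny edges, horizontal Hough line.
import cv2
import numpy as np


def has_underline(text_patch_bgr, min_span=0.75):
    gray = cv2.cvtColor(text_patch_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(binary, 50, 150)
    height, width = edges.shape
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=40,
                            minLineLength=int(min_span * width), maxLineGap=5)
    if lines is None:
        return False
    for x1, y1, x2, y2 in lines[:, 0]:
        # A nearly horizontal line spanning most of the patch width counts as an underline.
        if abs(y2 - y1) <= 2 and abs(x2 - x1) >= min_span * width:
            return True
    return False
```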
4.4 Output Video Generation
AXNav's output is a video of the test execution. Throughout the replay process, AXNav records the screen of the cloud device and logs timestamps of every action performed on the device, along with the actions and activated UI elements. To improve the navigability of the video, AXNav adds named chapter markers that demarcate each step of the test being performed and each issue flagged by a heuristic (F1 & F2; subsubsection 3.4.2 & subsubsection 3.4.3). Many video players include features to view all chapter markers by name and navigate directly to the start of a given chapter. To help communicate actions while watching, AXNav overlays markers on the video stream that label each action taken, with crosshairs for tap actions and arrows indicating scroll direction. Potential accessibility issues from heuristic results are also overlaid on the video stream with colored bounding boxes in either orange or cyan. AXNav also speeds up the exported video by a factor of 2.5 to minimize pauses due to the latency of its LLM-based agents.

Table 1: Total evaluation test case counts for our Regression Testing Dataset for the AX features of VoiceOver (VO), Dynamic Type (DT), Bold Text (BT), and Button Shapes (BS), totaled for the difficulty levels Easy and Hard. We report the performance of navigation replay as full success, partial success (some but not all steps completed), and failure, along with overall accuracy.

Diff.   VO   BT   DT   BS   Success   Partial   Fail   Acc.
Easy    17    3   21    3        42         0      2   95.5%
Hard    15    1    2    0        11         2      5   61.1%
Total   32    4   23    3   Overall accuracy: 85.5%

Table 2: Total evaluation test case counts for our Free Apps Dataset for the AX features VoiceOver (VO), Dynamic Type (DT), Bold Text (BT), and Button Shapes (BS), totaled for each difficulty level of Easy and Hard. We report the performance of navigation replay as full success, partial success (some but not all steps completed), and failure, along with overall accuracy.

Diff.   VO   BT   DT   BS   Success   Partial   Fail   Acc.
Easy     0    4    2    1         5         1      0   83.3%
Hard     5    1    3    4         9         3      2   64.3%
Total    5    5    5    5   Overall accuracy: 70.0%

5 TECHNICAL EVALUATION
We conducted two evaluations of AXNav to determine the accuracy of our test replay. Few datasets currently exist in the literature for UI navigation tasks for mobile apps from natural language, and we are aware of no such datasets for iOS apps specifically. Instead, we evaluated the system on a regression test suite used within our company to test a set of media apps, and created our own dataset from free apps within the Apple App Store.

5.1 Regression Testing Dataset
First, we evaluated the system on a large regression manual test suite. Some examples of this test suite are shown in Figure 2. From that test suite, we extracted 64 test cases from 5 apps testing the accessibility features that AXNav supports: VoiceOver, Dynamic Type, Button Shapes, and Bold Text. We discarded two of the tests due to our account lacking the necessary subscription to view the
screen(s) being tested. The final set contains 62 test cases. Note that this regression test suite is used for manual testing and is not constructed for the purpose of being used by any automated system. Many of the tests are very high level and assume the QA tester has a high level of expertise on the feature and the app under test. We chose to evaluate AXNav on this dataset since it is a representative set of real-world accessibility tests.

5.2 Free Apps Dataset

We also constructed a dataset of accessibility testing instructions for publicly available apps. We randomly selected apps from a public list of the 100 most popular free apps in the Apple App Store, ultimately selecting five apps from different app categories. Then, for each app, one researcher on our team drafted four manual tests, one for each of AXNav's supported accessibility features, using the regression testing suite as an example. We validated that the tests were realistic by discussing them with an expert accessibility QA tester from the formative study. The final dataset consists of 20 manual tests across five apps and four accessibility features.

5.3 Accuracy Results

We evaluated the difficulty of each test through a rubric based on prior work [26], which rates each type of instruction task as Easy or Hard for evaluation.

• Easy regular expression-based retrieval task: These tests can be completed in a single step by matching the correct UI element with the correct action, and possibly scrolling on the resulting page. The role of the planner agent in completing these tests is minimal, and in many cases the test could be completed entirely by the action agent.

• Hard structured problem-solving or open-loop planning task: These tests require the system to take multiple actions across multiple screens. That requires the planner agent to reason about the steps needed to complete the test and correct itself as needed as the test proceeds. It also requires the action and evaluation agents to ensure multiple steps are completed successfully, beyond just the one step required for easy tasks.

To group the tests into these categories, two authors independently rated each test and then met to discuss and resolve any differences. Table 1 and Table 2 show the total counts for each level across the four supported accessibility feature categories and the two separate datasets.

To repeat each test, we input the test instructions into the system, reset the phone's current state to match the initial state specified by the test, and then executed the test instructions on the device. During this process, we recorded all interactions between the system and the app. For both datasets, we report navigation replay success, which measures whether our system can follow the instructed steps successfully to reach the desired destination, and accessibility test success, which measures whether the accessibility feature test succeeded. We also report navigation partial success, which indicates that AXNav replayed one or more steps in the test but did not end up in the correct final state. We determined success through our own manual evaluation, based on the expected behavior of each accessibility feature. To ensure consistency, two researchers independently scored the system's performance on each test case and then met to discuss and resolve any differences.

For the regression testing dataset, our system successfully replayed 95.5% of easy test cases and 61.1% of hard test cases, for an overall success rate of 85.5%. Table 1 summarizes these results. Within our organization's apps, support for the supported accessibility features is already high; the accessibility test success rate across these tests was 78%. We are also working with the owners of this regression testing dataset to report the accessibility test failures in our internal bug-tracking system.

For the free apps dataset, our system successfully replayed 83.3% of easy test cases and 64.3% of hard test cases, for an overall success rate of 70.0%. Table 2 summarizes these results. Support for the accessibility features of Bold Text, Dynamic Type, and Button Shapes was unfortunately low across the five apps, resulting in an accessibility test success rate of only 15.0% across the 20 test cases. This further motivates the potential impact of using systems like ours within the app development workflow.

While the navigation replay success of our system is good for both datasets, our system fails to replay some tests. In some cases, the navigation replay fails because the test requires tapping on a certain item in a collection where only some items meet the required condition (e.g., have a subscription available), but the planner agent typically suggests activating the first item. In other cases, the planner agent cannot deduce enough knowledge about the app and predicts that key functionality for the replay does not exist in the app. In a few cases, key UI elements that needed to be activated for the test were located offscreen and required scrolling to reach, and AXNav did not continue scrolling long enough to find them. Another challenge we have seen is that the planner agent is sometimes unable to determine when to stop and gets into an infinite loop. These are areas we hope to improve in future work.
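To make the scoring above concrete, the minimal sketch below shows one way per-test outcomes could be rolled up into the reported rates. The data structure, field names, and example records are illustrative assumptions, not our actual evaluation code.

```python
# Illustrative sketch only: aggregates per-test outcomes into navigation replay,
# partial, and accessibility test success rates. Fields and sample records are
# hypothetical, not taken from the AXNav evaluation harness.
from dataclasses import dataclass

@dataclass
class TestOutcome:
    difficulty: str          # "easy" or "hard", from the rubric based on [26]
    replay_success: bool     # reached the correct final state
    partial_success: bool    # replayed >= 1 step but ended in the wrong state
    ax_test_success: bool    # the accessibility feature behaved as expected

def rates(outcomes):
    """Return replay success per difficulty, overall replay, partial, and AX test success (%)"""
    def pct(hits, total):
        return 100.0 * hits / total if total else 0.0

    by_difficulty = {
        level: pct(sum(o.replay_success for o in outcomes if o.difficulty == level),
                   sum(1 for o in outcomes if o.difficulty == level))
        for level in ("easy", "hard")
    }
    overall_replay = pct(sum(o.replay_success for o in outcomes), len(outcomes))
    partial = pct(sum(o.partial_success for o in outcomes), len(outcomes))
    ax_success = pct(sum(o.ax_test_success for o in outcomes), len(outcomes))
    return by_difficulty, overall_replay, partial, ax_success

# Example with made-up outcomes (not the paper's data):
sample = [
    TestOutcome("easy", True, False, True),
    TestOutcome("hard", False, True, False),
    TestOutcome("hard", True, False, True),
]
print(rates(sample))
```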
6 USER STUDY

We presented our system in user study sessions with 10 professional accessibility testers. The goal of the user study was to understand how AXNav could assist accessibility testers in their workflows; specifically, how well the system could replicate manual accessibility tests, aid testers in finding accessibility issues, and be integrated into existing test workflows.

6.1 Procedure

We conducted 10 1-to-1 interview-based study sessions. During each session, we first presented an overview of AXNav to the participant. We then showed three videos generated by AXNav and the associated test instructions, in randomized order. Each video showed an accessibility test on iOS media applications for e-books, news stories, and podcasts, respectively, with different UI elements and layouts. The videos were selected from the set of videos used in section 5, based on their coverage of different accessibility features, including VoiceOver, Dynamic Type, and Button Shapes. Two of the tests shown in the videos were selected from those with the difficulty level of Easy, and one test with the difficulty level of Hard. The tests shown in the videos represented real accessibility tests that our participants would perform, as they were selected from the set of test instructions authored and used by testers in the organization.

We chose to show videos to participants as they are the primary output produced by AXNav, offering a realistic representation of interaction with our system. Furthermore, since AXNav is not a production system, it was not optimized for speed, and can take several minutes to an hour to produce a video. In practice, this is not a critical limitation, since many tests can be run in parallel, possibly overnight, and reviewed all at once following their completion. The specific videos and associated test instructions that we used for the user studies are as follows:

(1) VO: This video shows a test of a podcast application. The test instruction prompts the system to share an episode of a podcast show through text message using VoiceOver. (Difficulty level Hard)¹

(2) DT: This video shows a test of Dynamic Text in a news application. The test instruction prompts the system to increase the size of the text in four different fonts in a specific tab of the application. (Difficulty level Easy)

(3) BS: This video shows a test of Button Shapes in an e-book application. The test instruction prompts the system to test the Button Shapes feature across all the tabs in the application. (Difficulty level Easy)

¹ This video does not include any issue flagged by the system. In order to show participants what heuristics in VO look like, we presented a supplementary video of another VO case where the system flags a VoiceOver navigation loop in the chapters.

All three videos contained some accessibility issues, which we prompted the participants to discover using the heuristics provided by the system. Furthermore, all videos deliberately contained errors and imperfect navigation to conservatively showcase the capabilities of our system. Specifically, the VO video shares a podcast show itself instead of an episode, and some false positive errors are flagged in the DT and BS videos. We intentionally presented those imperfections to the participants to show the performance of the system conservatively, and to trigger a discussion of limitations and future directions.

For each video, the researcher asked the participant to think aloud as they watched the video to 1) point out any accessibility issues related to the input test, and 2) point out any places where the test performed by the system could be improved. After each video, we interviewed each participant about how well the test in the video met their expectations, and how well the heuristics assisted them in finding any accessibility issues. Besides qualitative questions, we also asked the participants to provide 5-point Likert scale ratings on how similar the tests in the videos are to their manual tests, and how useful the heuristics are for identifying accessibility bugs. Following the viewing of all three videos, we asked about the participants' overall attitude toward the system, how they envisioned incorporating it into their workflow, and any areas they identified for improvement. Additionally, we asked participants to provide 5-point Likert scale ratings assessing our system's usefulness in its current form and with ideal performance within their workflow.

6.2 Participants

We recruited 10 participants who are full-time employees at a large technology company. All participants perform manual accessibility tests as part of their professional work, having professional titles of accessibility QA testers and accessibility engineers. We recruited participants via internal communication tools. In contrast to our formative study, all participants in this study were sighted and did not use screen readers. Two participants from our formative study, P5 and P6, also participated in this study. Since we did not collect information on the pronouns of our participants, we use the gender-neutral pronouns "they/them" to refer to all participants in our findings. Interview questions and participant demographics are shared in Supplemental Materials.

6.3 Data Collection and Analysis

The data collected during the study includes audio and video recordings of the study sessions, with the consent of the participants. We transcribed all the recordings into text format using an automated tool. The research team also took field notes during the sessions and used the notes to guide the analysis. The length of the sessions ranged from 29 minutes to 49 minutes, with an average length of 37 minutes. The interview with P9 only covered two videos (VO and BS) due to the participant's availability.

We performed a thematic analysis on the qualitative data from the user study [25]. Two authors of the paper first individually coded all the transcripts, then presented the codes to each other and collaboratively and iteratively constructed an affinity diagram of quotes and codes to develop themes. The following findings section presents the resulting themes. We also report the descriptive statistics of the data collected from the Likert scale rating questions, including the mean, standard deviation (SD), and sample size (N), to supplement our qualitative insights.

6.4 Findings

6.4.1 Performance of the Automatic Test Navigation.

Automatic test navigation replicates manual tests. Participants generally agreed that the system navigated applications along a similar path as they would when conducting tests manually, especially in the BS and VO test cases. For VO, participants rated the similarity of the navigation path between human testers and the AI at 4.60 (SD = 0.52, N = 10) on average (between "very good match" and "extremely good match" with their manual testing procedures). P3 was impressed by the system's ability to execute the test: "my mind is blown that it was able to find that [shared button] buried within that actions menu." Similarly, in the BS test case, participants rated the match at 4.35 (SD = 0.75, N = 10) on average. In P9's opinion, the system's heuristics might outperform most human testers in BS, since it can be subjective for a human tester to determine what constitutes a button shape. Participants also reacted positively to the chapter feature, as it enabled efficient navigation through the video.
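Throughout the findings we report Likert ratings as a mean with SD and N. As a minimal illustration of how such descriptive statistics can be computed, the sketch below uses Python's statistics module and assumes the sample standard deviation; the per-participant ratings shown are placeholders (chosen only so the output matches the first reported value), not the study data.

```python
# Illustrative sketch: descriptive statistics (mean, SD, N) for 5-point Likert
# ratings. The ratings below are placeholders, not the actual study data.
from statistics import mean, stdev

def describe(ratings):
    n = len(ratings)
    return {"mean": round(mean(ratings), 2),
            "sd": round(stdev(ratings), 2) if n > 1 else 0.0,
            "n": n}

vo_similarity = [5, 5, 4, 5, 4, 5, 5, 4, 5, 4]   # placeholder ratings
print(describe(vo_similarity))                    # {'mean': 4.6, 'sd': 0.52, 'n': 10}
```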
Differences in system and human approaches. Some of the approaches the system provided were different from what human testers would do. Compared to BS and VO, the system's performance in DT received 3.39 (SD = 0.78, N = 9) on average, a relatively lower rating that was between "moderately good match" and "good match" with manual testing procedures. A main difference is that the system always relaunches the application between the tests of different text sizes, while human testers tend to use the control center to adjust text sizes within the application without relaunching it, in order to mimic what a real user would do. In fact, participants recognized a potential benefit of AXNav's approach, as it added an additional layer of testing: "I really like that launches the app in between changing the text size, because I think it's a separate class of bug, whether or not, it responds to a change in text size versus having the text size there initially." (P8) Similarly, P9 found in the VO example that the system waited for spoken output, which was not something that a human tester would typically do, but might be beneficial for more thorough tests.

At the same time, participants also suggested that future versions of the system could enable exploratory and alternative navigation, as well as more in-depth tests of the UI structure. For example, for BS, participants mentioned that they would have explored more nested content in the application to ensure the Button Shape feature works for all elements (P2, P6). For VO, participants wished the system could support alternative, non-linear pathways that VO users could go through (P7) and navigation using both swiping and tapping gestures (P4). Another common request is the ability to scroll through the screen of an application when testing display features like DT and BS.

Reaction to navigation errors. The VO video contains a slight error in the navigation: the navigation shares a show instead of sharing an episode. Only 2 out of 10 participants (P2 and P5) were able to identify this navigation error. Most participants ignored the error, potentially due to over-reliance on the automatic navigation, as P2 said, "it worked well enough that I almost kind of let that slip. I needed to watch this video twice. Maybe I got over-reliant on [it]." To address this error, P2 elaborated on how they would re-write the test instruction so that the agent could potentially correct the mistake: "I would have [written], like, navigate to an episode, click the dot dot dot menu... I would suspect that this model would have done a better job finding the actual episode..." P5, instead, described how they would navigate the application themselves based on the instruction: "I would definitely do it the same route as it did through the more button, [but] instead of a certain episode, I would just switch it to show."

6.4.2 Identifying Accessibility Issues with Automatic Navigation. For all three cases of VO, BS, and DT, all participants spotted at least one accessibility issue, and agreed that the issues they discovered were significant enough to be filed in the internal bug reporting system within their company.

Heuristics aid discovery of issues. Overall, participants agreed that the heuristics provided by the system assisted them in finding the issues. For VO, BS, and DT respectively, participants on average rated 4.06 (SD = 1.38, N = 9) (between "useful" and "very useful"), 4.75 (SD = 0.43, N = 10), and 3.67 (SD = 1.09, N = 9) (between "moderately useful" and "useful") on the usefulness of the heuristics. Specifically, the potential issues flagged in the chapters allowed participants to navigate to where the issue was and review it with greater attention. The heuristics in particular helped direct testers' attention to potential issues that might otherwise be too subtle to discover: "Watching it in a video, as opposed to actually interacting with it, I think it is easier to potentially miss things... So, having some sort of automatic detection to surface things [is good]." (P8) Even though they sometimes resulted in false positives, participants appreciated the heuristics providing an extra layer of caution, as P10 said, "I actively like the red [annotation boxes around potential issues] because I think the red is like 'take a look at this' and then even if it's not necessarily an issue, that's not hurtful."

Risks of over-reliance on heuristics. Participants expressed concern about over-reliance on the heuristics provided by the system. In some sessions of our study, although participants found issues that were not marked by the heuristics, they were worried that those false negatives might bias testers: "if things are marked as green, and maybe there actually is an issue in there, maybe that would dissuade somebody from looking there." (P10) This could influence testers of different experience levels differently. An experienced tester might rely on their expertise to find issues, while a novice tester might over-rely on the suggested bugs (or non-bugs) made by the system. As P8 explained: "If somebody is kind of experienced with large text testing, they kind of know what to look for... If it's an inexperienced tester, they might not know that the false positives are false positives and might file bugs." (P8)

A mechanism to explain how the heuristics were generated and applied to the test cases might help with the issue of over-reliance. For example, P7 imagined it to be a series of "human-readable strings, like what it actually found... human-readable descriptions of what the error is in addition to seeing the boxes." Other suggestions focus on making the heuristics more digestible for the testers. Currently, we show the heuristics as screenshots with annotations separate from the videos. Participants suggested it would be easier to comprehend the heuristics if they were encoded in the video and separated from regular chapters (P6), and only annotated the potential issues (P1). P7 brought up the idea to include a dashboard or summary mechanism in the system, so that a tester, "instead of just having a scrub through this video," could see "a summary of the errors as well."

6.4.3 Integration in Accessibility Testing Workflow. Overall, participants reacted positively to our system. Participants rated 4.70 (SD = 0.48, N = 10) (between "useful" and "very useful") on average for how useful the system would be in their existing workflow if it performs extremely well, and 3.95 (SD = 0.96, N = 10) (between "moderately useful" and "useful") on average for the system in its current form. Participants expressed excitement about the potential of integrating the system and bringing automation to their workflow. For instance, when asked for a rating on the overall usefulness of the system, P3 answered: "[I will rate] it like a 5 million... Even with the current limitations, it is very useful... just being able to feed it some real simple steps and have it do anything at all is massively powerful." The next sections unpack a range of ways that AXNav might be integrated into existing test workflows.
Automating test planning. A compelling use case for AXNav is to automate the planning and setup of the test, which, according to our participants, is a time-consuming part of accessibility testing as it can involve an excessive amount of manual work to "go through and find all of the labels to tap through" (P3). The step-by-step executable test plan generated from natural language by our system can reduce the amount of tedious work: "rather than having to hard code navigation logic, it seems that this is able to determine those pathways for you... I think this idea is really awesome and would definitely save a lot of hours of not having to hard code the setup steps to go through a workflow with VoiceOver." (P4) P4 also envisioned using the system as a test authoring tool, which can generate templates that can be run daily.

Complementing manual tests. Participants found the system helpful in reducing workload and saving time in running tests. Some participants would like to embrace the automation provided by the system, keeping the system running a large number of tests in the background while the team could focus on more important tasks: "you can run it in an automated fashion. You don't need to be there. You can run it overnight. You can run it continually without scaling up some more people" (P7). As P8 imagined, "this could run on each new build [of the software], and then what all the QA engineer has to do is potentially a review about an hour's worth of videos that were generated by the system, potentially automatically flagging issues." The system can also provide consistency and standardization in tests, which "ensure[s] that everything is run the same way every time." (P8)

At the same time, some participants are more cautious about automation and would like to use the system as a supplement to their manual work. P4 believed that even with the flagged issues, they would still pay attention to the system-generated videos to a degree similar to how they would test them manually. P1 imagined that they would still test manually, but would use the video as validation of their tests "to see if it could catch things that I couldn't catch." (P1) Some also imagined handing lower-risk tests, such as testing Button Shapes, to the system, while using the time saved by the system to manually and carefully test higher-risk tests that will be a regulatory blocker. (P2)

Aiding downstream bug reporting. The videos generated by the system can also facilitate bug reporting in the downstream pipeline. Participants agreed that the video, along with the chapters generated by the system, could be used to triage any accessibility issues that they would report to the engineering teams. In their current practice, testers would sometimes include screenshots or screen recording video clips to demonstrate the discovered issue. Our system prepared a navigable video automatically, streamlining this process: "I thought to be able to jump to specifically when the issue is and scrub a couple of seconds back or a couple seconds forward is super useful for engineering." (P7)

Educating novices about accessibility testing. The system can also serve as an educational tool for those who are new to accessibility tests. The system can not only help new QA professionals, but also developers from under-resourced teams where there are no dedicated QA teams or pipelines. For example, P2 found the videos and heuristics helpful in terms of demonstrating certain accessibility bugs that people should be looking for: "This will be very useful for some of the folks that never do accessibility testing and [for] they [to] have a context or starting point for even knowing what a VoiceOver bug is." (P2) In a way, our system has the potential to demonstrate and raise awareness of accessibility issues among broader developer communities, even for those who do not have QA resources.

7 DISCUSSION

Accessibility QA testing is still by-and-large a manual effort, and there are benefits to not leaving such testing up to full automation [37]. The majority of QA testers we interviewed desired more automation to free up time for more complex testing. However, they lack the time and resources to effectively use existing automation methods. With AXNav, a key goal is to use testers' existing metadata (e.g., databases of manual instructions) and build a tool to complement existing workflows. Our user study indicates that AXNav, even in its current form, can be useful in their workflows. AXNav also serves as an initial exploration into using recent advances in LLMs and UI navigation in accessibility testing workflows, which other systems can build upon. In this section, we discuss some limitations of our evaluation and the AXNav system that we plan to address in future work, and potential extensions of AXNav beyond accessibility testing workflows.

7.1 Differences between automated navigation and manual testing

AXNav employs one workflow specifically for VoiceOver tests, where the system uses forward swipes until finding a target element before activating it. As shown in the user study, this may not reflect how a VoiceOver user might navigate the task, as the user may have prior knowledge of the app structure. Such knowledge would enable users to skip around to various parts of the screen to activate the desired element. While sometimes such differences can be complementary test strategies, future versions of the system could explore how to simulate alternative patterns of interaction.

7.2 Improving navigation performance

While AXNav achieves reasonable test replay accuracy, it can encounter errors arising from a lack of sufficient knowledge about apps or understanding when to stop (see section 5). We expect that improvements in modeling (i.e., by fine-tuning a model on successful navigation paths or integrating existing app knowledge into prompts [54]) can improve navigation performance in future versions of AXNav. Other approaches, such as using multimodal models [28], could be considered for future iterations.
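As a rough sketch of what integrating existing app knowledge into prompts could look like, the example below assembles a planner prompt from a hand-written map of known tabs and navigation paths. The prompt format, the app_knowledge structure, and the app name are assumptions for illustration, not AXNav's actual planner implementation.

```python
# Hypothetical sketch: injecting known app structure into a planner prompt.
# All names and the prompt wording are assumptions, not AXNav's real prompts.
app_knowledge = {
    "app": "ExamplePodcastApp",                      # hypothetical app name
    "tabs": ["Listen Now", "Browse", "Library", "Search"],
    "known_paths": {
        "share an episode": ["open show", "open episode actions menu", "tap Share"],
    },
}

def build_planner_prompt(test_instruction: str, knowledge: dict) -> str:
    hints = "\n".join(
        f"- To {goal}: " + " -> ".join(steps)
        for goal, steps in knowledge["known_paths"].items()
    )
    return (
        f"You are planning UI steps for the app '{knowledge['app']}'.\n"
        f"Known tabs: {', '.join(knowledge['tabs'])}.\n"
        f"Known navigation paths:\n{hints}\n\n"
        f"Test instruction: {test_instruction}\n"
        "Produce a numbered step-by-step plan."
    )

print(build_planner_prompt("VoiceOver: Share a Podcast Episode", app_knowledge))
```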
7.3 Mitigating errors and over-reliance

Like all machine learning and heuristic-based systems, AXNav is not expected to always produce perfect output. However, it is important to mitigate the risk of these errors for QA testers. Prior work has shown that there is a risk of over-reliance on AI systems, since users can view the AI as an authority and be reluctant to challenge it [7, 16]. This is also the case for the navigation and heuristics of AXNav. For example, only 2 out of 10 user study participants were able to spot the navigation error in the VoiceOver example (see section 6.4.1). While evaluating the correctness of LLM-based systems remains an active area of research [29], there are additional techniques that could be considered in future work to enable AXNav to report whether it executed a navigation task correctly. For example, the navigation path itself could be evaluated through heuristics, another LLM, or by using existing knowledge of apps. Another way to mitigate over-reliance in future work would be to provide transparency signals, such as confidence scores and textual explanations of how the predictions were made, echoing design guidelines on transparency and explainability for human-AI collaboration [5].

7.4 Limitations in the User Study

Our user study had participants watch and comment on videos generated by AXNav. Our study design mimicked how accessibility testers would interact with AXNav in their actual workflows (i.e., reviewing videos generated by an automatic system and spotting accessibility issues, as elaborated in section 6.4.3), but this design has some limitations. First, we only showed the same set of 3 videos to all the participants. Although the set of videos covers different types of accessibility tests, participants' feedback could be biased by this limited set of examples. Second, we only showed users videos where navigation mostly worked to probe how they would use the system in their workflow. We did not show examples where the replay failed, and therefore were not able to collect user feedback on failed replay and how it would be handled. Third, in order to keep user study sessions short, the participants did not directly write their own tests and generate videos using the tool themselves. In future work, we plan to deploy AXNav in a longitudinal study so that we can better understand how QA testers instruct the system and interact with its output.

7.5 Accessibility of AXNav

One key limitation of AXNav currently is its output video format, which is not by default accessible to screen reader users. People with disabilities are commonly employed in accessibility testing, for example as non-sighted screen reader testers. AXNav should make the video format accessible by ensuring all visual content is described, such as heuristic boxes, screen changes, and chapter annotations. Non-sighted users may also find other output formats more useful. The screen reader user in our formative study requested that AXNav replay test cases live on a local device to enable them to take control, which is feasible and something we plan to do in future work. Lastly, future versions of AXNav should be accessible to testers beyond screen reader use cases (e.g., testers with motor impairments).

7.6 Accessibility feature support and generalizability

Our studies uncovered the need to support testing additional accessibility features beyond the four that AXNav supports. Future versions of AXNav can support more navigational accessibility services (e.g., Voice Control) and other accessibility settings (e.g., display features such as contrast adjustment and motion reduction), provided the device's operating system offers APIs to control those features. AXNav currently surfaces some potential accessibility issues through its heuristics (e.g., Dynamic Type resizing issues); however, these do not cover all accessibility issues we could surface. Future versions of AXNav could incorporate existing accessibility inspection tools similar to Groundhog [46] to report issues such as missing UI element descriptions or minimum target sizes. We could also add a dashboard to summarize the issues found during AXNav's replay, as study participants proposed. AXNav could also consider focused testing for specific accessibility needs. For example, if a test is for users with motor impairments, issues like target size would be important to surface.

Lastly, we have only built AXNav to work with the iOS operating system. However, the system architecture and workflow should be extensible to other platforms where APIs are available to control the accessibility features under test. A body of work has explored general UI navigation on other platforms [22, 33, 49, 54].

7.7 More applications of the AXNav system

We have so far evaluated AXNav for QA testing, but there are many opportunities beyond this, as indicated by our user study and other work in this area. One that we would like to explore is using this system as a tool to help novice developers better understand the behaviors of accessibility features and how they should be tested, by generating realistic simulations of behavior on their own apps. Additionally, natural language instructions are used in manual UI testing, bug reports, and reproduction steps [22], and natural language automation systems may benefit from the techniques we present in this paper to reconstruct these types of tests. These are examples of use cases we hope to explore in future work.

8 CONCLUSION

In this paper, we presented a system to support accessibility test interpretation and replay through natural language instructions. Our system achieves good technical success in replaying realistic manual test instructions, achieving 70% and 85% navigation replay success on our two datasets. We evaluated our system with 10 professional accessibility testers, who indicated that the system would be very useful in their work, and the study revealed a number of promising future opportunities and insights into how we can leverage LLM-based task automation within accessibility testing.

REFERENCES
[1] Accessibility on iOS 2023. Accessibility on iOS. [Link] accessibility/
[2] Accessibility Programming Guide 2022. Accessibility Programming Guide for OS X: Testing for Accessibility on OS X. [Link] archive/documentation/Accessibility/Conceptual/AccessibilityMacOSX/ [Link]
[3] Accessibility Scanner 2023. Accessibility Scanner. [Link] apps/details?id=[Link]&hl=en_US
[4] Abdulaziz Alshayban, Iftekhar Ahmed, and Sam Malek. 2020. Accessibility issues in Android apps: state of affairs, sentiments, and ways forward. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 1323–1334. [Link]
[5] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–13. [Link]
[6] Apple. 2020. Recognizing Text in Images. [Link] documentation/vision/recognizing_text_in_images/
[7] Advait Bhat, Saaket Agashe, and Anirudha Joshi. 2021. How do people interact with biased text prediction models while writing?. In Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing, Su Lin Blodgett, Michael Madaio, Brendan O’Connor, Hanna Wallach, and Qian Yang (Eds.). Association for Computational Linguistics, Online, 116–121. [Link]
[8] Tingting Bi, Xin Xia, David Lo, John Grundy, Thomas Zimmermann, and Denae Ford. 2022. Accessibility in Software Practice: A Practitioner’s Perspective. ACM Transactions on Software Engineering Methodology 31, 4, Article 66 (July 2022), 26 pages. [Link]
[9] Jeffrey P. Bigham, Jeremy T. Brudvik, and Bernie Zhang. 2010. Accessibility by demonstration: enabling end users to guide developers to web accessibility solutions. In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (Orlando, Florida, USA) (ASSETS ’10). Association for Computing Machinery, New York, NY, USA, 35–42. [Link] 1878803.1878812
[10] Build accessible apps 2023. Build accessible apps. [Link] com/guide/topics/ui/accessibility/
[11] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. 2022. A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility. arXiv:2202.02312 [[Link]]
[12] Button Shapes 2023. Accessibility (Button Shapes). [Link] design/human-interface-guidelines/accessibility
[13] John Canny. 1986. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8, 6 (1986), 679–698. [Link]
[14] Jieshan Chen, Amanda Swearngin, Jason Wu, Titus Barik, Jeffrey Nichols, and Xiaoyi Zhang. 2022. Towards Complete Icon Labeling in Mobile Applications. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 387, 14 pages. [Link]
[15] Sen Chen, Chunyang Chen, Lingling Fan, Mingming Fan, Xian Zhan, and Yang Liu. 2022. Accessible or Not? An Empirical Investigation of Android App Accessibility. IEEE Transactions on Software Engineering 48, 10 (2022), 3954–3968. https://[Link]/10.1109/TSE.2021.3108162
[16] Valerie Chen, Q. Vera Liao, Jennifer Wortman Vaughan, and Gagan Bansal. 2023. Understanding the Role of Human Intuition on Reliance in Human-AI Decision-Making with Explanations. Proc. ACM Hum.-Comput. Interact. 7, CSCW2, Article 370 (oct 2023), 32 pages. [Link]
[17] Luis Cruz, Rui Abreu, and David Lo. 2019. To the attention of mobile software developers: guess what, test your app! Empirical Software Engineering 24, 4 (2019), 2438–2468. [Link]
[18] Morgan Dixon and James Fogarty. 2010. Prefab: Implementing Advanced Behaviors Using Pixel-Based Reverse Engineering of Interface Structure. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA) (CHI ’10). Association for Computing Machinery, New York, NY, USA, 1525–1534. [Link]
[19] Richard O. Duda and Peter E. Hart. 1972. Use of the Hough Transformation to Detect Lines and Curves in Pictures. Commun. ACM 15, 1 (Jan 1972), 11–15. [Link]
[20] Marcelo Medeiros Eler, Jose Miguel Rojas, Yan Ge, and Gordon Fraser. 2018. Automated Accessibility Testing of Mobile Apps. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, New York, NY, USA, 116–126. [Link]
[21] Espresso 2023. Espresso. [Link] espresso
[22] Sidong Feng and Chunyang Chen. 2023. Prompting Is All You Need: Automated Android Bug Replay with Large Language Models. arXiv:2306.01987 [[Link]]
[23] Raymond Fok, Mingyuan Zhong, Anne Spencer Ross, James Fogarty, and Jacob O. Wobbrock. 2022. A Large-Scale Longitudinal Analysis of Missing Label Accessibility Failures in Android Apps. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 461, 16 pages. [Link]
[24] Get started on Android with Talkback 2023. Get started on Android with Talkback. [Link]
[25] Greg Guest, Kathleen M MacQueen, and Emily E Namey. 2011. Applied thematic analysis. Sage Publications, Thousand Oaks, CA. [Link] 9781483384436
[26] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. arXiv:2307.12856 [[Link]]
[27] Improve your code 2023. Improve your code with lint checks. [Link] [Link]/studio/write/lint?hl=en
[28] Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2023. ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations. arXiv:2310.04869 [[Link]]
[29] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language Models (Mostly) Know What They Know. arXiv:2207.05221 [[Link]]
[30] Pavneet Singh Kochhar, Ferdian Thung, Nachiappan Nagappan, Thomas Zimmermann, and David Lo. 2015. Understanding the Test Automation Culture of App Developers. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST) (Graz, Austria). IEEE, New York, NY, USA, 1–10. [Link]
[31] Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., Red Hook, NY, USA, 22199–22213. [Link] file/[Link]
[32] Xiao Li, Nana Chang, Yan Wang, Haohua Huang, Yu Pei, Linzhang Wang, and Xuandong Li. 2017. ATOM: Automatic Maintenance of GUI Test Scripts for Evolving Mobile Applications. In 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE, New York, NY, USA, 161–171. [Link]
[33] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020. Mapping Natural Language Instructions to Mobile UI Action Sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8198–8210. https://[Link]/10.18653/v1/[Link]-main.729
[34] Jun-Wei Lin, Navid Salehnamadi, and Sam Malek. 2021. Test automation in open-source Android apps: a large-scale empirical study. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (Virtual Event, Australia) (ASE ’20). Association for Computing Machinery, New York, NY, USA, 1078–1089. [Link]
[35] Mario Linares-Vásquez, Cárlos Bernal-Cardenas, Kevin Moran, and Denys Poshyvanyk. 2017. How do Developers Test Android Applications?. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, New York, NY, USA, 613–622. [Link]
[36] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023. Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing. arXiv:2305.09434 [[Link]]
[37] Jennifer Mankoff, Holly Fait, and Tu Tran. 2005. Is your web page accessible? a comparative study of methods for assessing web page accessibility for the blind. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Portland, Oregon, USA) (CHI ’05). Association for Computing Machinery, New York, NY, USA, 41–50. [Link]
[38] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [[Link]]
[39] Nobuyuki Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 1 (1979), 62–66. https://[Link]/10.1109/TSMC.1979.4310076
[40] Minxue Pan, Tongtong Xu, Yu Pei, Zhong Li, Tian Zhang, and Xuandong Li. 2022. GUI-Guided Test Script Repair for Mobile Apps. IEEE Transactions on Software Engineering 48, 3 (2022), 910–929. [Link]
[41] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the Wild: A Large-Scale Dataset for Android Device Control. arXiv:2307.10088 [[Link]]
[42] Roboelectric 2021. Roboelectric. [Link]
[43] Anne Spencer Ross, Xiaoyi Zhang, James Fogarty, and Jacob O. Wobbrock. 2017. Epidemiology as a Framework for Large-Scale Mobile Application Accessibility Assessment. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (Baltimore, Maryland, USA) (ASSETS ’17). Association for Computing Machinery, New York, NY, USA, 2–11. [Link]
[44] Navid Salehnamadi, Abdulaziz Alshayban, Jun-Wei Lin, Iftekhar Ahmed, Stacy Branham, and Sam Malek. 2021. Latte: Use-Case and Assistive-Service Driven Automated Accessibility Testing Framework for Android. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, 11 pages. [Link]
[45] Navid Salehnamadi, Ziyao He, and Sam Malek. 2023. Assistive-Technology Aided Manual Accessibility Testing in Mobile Apps, Powered by Record-and-Replay. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 73, 20 pages. [Link]
[46] Navid Salehnamadi, Forough Mehralian, and Sam Malek. 2023. Groundhog: An Automated Accessibility Crawler for Mobile Apps. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (Rochester, MI, USA) (ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 50, 12 pages. [Link]
[47] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina N Toutanova. 2024. From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. Advances in Neural Information Processing Systems 36 (2024). [Link] [Link]
[48] Camila Silva, Marcelo Medeiros Eler, and Gordon Fraser. 2018. A survey on the tool support for the automatic evaluation of mobile accessibility. In Proceedings of the 8th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion (Thessaloniki, Greece) (DSAI ’18). Association for Computing Machinery, New York, NY, USA, 286–293. [Link]
[49] Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan. 2023. UGIF: UI Grounded Instruction Following. arXiv:2211.07615 [[Link]]
[50] Markel Vigo, Justin Brown, and Vivienne Conway. 2013. Benchmarking web accessibility evaluation tools: measuring the harm of sole reliance on automated tests. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (Rio de Janeiro, Brazil) (W4A ’13). Association for Computing Machinery, New York, NY, USA, Article 1, 10 pages. [Link] 2461121.2461124
[51] VoiceOver 2023. VoiceOver. [Link] voiceover-gestures-iph3e2e2281/ios
[52] Bryan Wang, Gang Li, and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI Using Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 432, 17 pages. [Link]
[53] WCAG 2 Overview 2023. WCAG 2 Overview. [Link] standards-guidelines/wcag/
[54] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Empowering LLM to use Smartphone for Intelligent Task Automation. arXiv:2308.15272 [[Link]]
[55] XCTest 2023. XCTest. [Link]
[56] Shunguo Yan and P. G. Ramachandran. 2019. The Current Status of Accessibility in Mobile Apps. ACM Transactions on Accessible Computing 12, 1, Article 3 (February 2019), 31 pages. [Link]
[57] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, Aaron Everitt, and Jeffrey P Bigham. 2021. Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 275, 15 pages. https://[Link]/10.1145/3411764.3445186
[58] Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, and Yan Lu. 2023. Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators. arXiv:2306.01242 [[Link]]

Received 14 September 2023; revised 12 December 2023; accepted 19 January 2024