
Guiding ChatGPT to Fix Web UI Tests via Explanation-Consistency Checking
Zhuolin Xu, Concordia University, Canada ([email protected])
Qiushi Li, Concordia University, Canada ([email protected])
Shin Hwei Tan, Concordia University, Canada ([email protected])

ABSTRACT

The rapid evolution of Web UI incurs time and effort in maintaining UI tests. Existing techniques in Web UI test repair focus on finding the target elements on the new web page that match the old ones so that the corresponding broken statements can be repaired. We present the first study that investigates the feasibility of using prior Web UI repair techniques for initial local matching and then using ChatGPT to perform global matching. Our key insight is that, given a list of elements matched by prior techniques, ChatGPT can leverage its language understanding to perform global view matching and use its code generation model to fix the broken statements. To mitigate hallucination in ChatGPT, we design an explanation validator that checks whether the provided explanation for the matching results is consistent, and provides hints to ChatGPT via a self-correction prompt to further improve its results. Our evaluation on a widely used dataset shows that the ChatGPT-enhanced techniques improve the effectiveness of existing Web test repair techniques. Our study also shares several important insights for improving future Web UI test repair techniques.

ACM Reference Format:
Zhuolin Xu, Qiushi Li, and Shin Hwei Tan. 2024. Guiding ChatGPT to Fix Web UI Tests via Explanation-Consistency Checking. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

When developers change the attributes of user interfaces (UI) of a Web application due to rapidly changing requirements, the corresponding Web UI tests need to be manually updated for test maintenance. To reduce manual effort in repairing broken Web UI tests, several automated approaches have been proposed [7, 26, 41]. The key step in automated repair of Web UI tests is to modify the broken statements containing outdated element locators by matching the element e_old in the old version of a Web application with the element e_new in the new version [14]. Prior Web UI test repair techniques mostly rely on a set of Document Object Model (DOM) attributes (e.g., identifiers and XPath) [7] or visual information [41] to determine whether the two elements e_old and e_new match. These techniques extract and compute the similarity of this information to select the most similar element as the result of the matching. As there can be many attributes of a Web UI element that can be used for matching, these techniques may prioritize certain attributes. For example, Water, one of the classical Web UI test repair techniques, performs matching in several steps, using a different set of attributes in each step. First, it searches for elements that are exactly the same by matching five attributes (id, XPath, class, linkText, name). If the first step fails to find a matching element in the new version that makes the test pass, it then finds similar DOM nodes using additional attributes. Specifically, it finds the elements with the same tagName and computes their similarity using the normalized Levenshtein distance between the XPaths of e_old and e_new. Then, it further matches using other attributes (e.g., the screen position of a DOM node) and computes a similarity score based on the weighted sum of XPath and other attributes, where it prioritizes XPath similarity based on the heuristic that the XPaths of the nodes "should be very similar across versions" [7]. As the prioritization and the predefined order used for matching these sets of attributes are usually based on heuristics made by the tool developers, the matching algorithm may not accurately reflect the evolution of the Web element, causing inaccuracy in the matching step and subsequently failing to find a repair for the broken statement. Meanwhile, prior learning-based techniques show promising results in combining different types of information for repairing broken GUI tests in Android apps (e.g., combining word and layout embeddings [47], or fusing GUI structure and visual information [46]). The richer representation used by these learning-based techniques has been shown to help improve the accuracy of the UI matching step.

To solve the aforementioned problem of Web UI test repair and to leverage the richer representations of learning-based approaches, we present the first feasibility study of combining ChatGPT with traditional Web UI test repair approaches for solving the element matching problem in prior techniques. Our use of ChatGPT is based on the encouraging results shown in prior studies for solving related software maintenance tasks (e.g., (1) test generation [11], and (2) automated program repair [9, 40]). However, the Web UI test repair problem is different from these tasks, as it mainly involves Web element matching, where accurate matching results will usually lead to the correct repairs being generated. Specifically, our study evaluates the effectiveness of integrating ChatGPT into two representative Web test repair techniques (Water and Vista). To further evaluate the heuristic used in Water that prioritizes XPath similarity, we also design a simplified variant of Water that performs matching using only the Levenshtein distance between the XPaths of the old element e_old and a candidate new element e_new (in this paper, we call this approach Edit Distance).
Our study focuses on Java Selenium UI tests. The key insights of our approach are twofold: (1) we first rely on traditional Web UI test repair approaches to obtain an initial list of candidate matched elements (which may be biased by the prioritization used by a particular technique), and then use ChatGPT to perform global matching to further select the best matched element in the new version; and (2) as ChatGPT may suffer from the hallucination problem [11], we design our prompt based on OpenAI's official documentation by asking ChatGPT to generate an explanation along with each selection, and propose an explanation validator that automatically checks the consistency of the provided explanation to detect hallucination and informs ChatGPT of any incorrect explanation, giving it a chance to self-correct the initial selection.

Our study of the three approaches (Water, Vista, and Edit Distance) with ChatGPT aims to answer the following questions:

RQ1: Can ChatGPT help in improving the accuracy of Web element matching of prior Web test repair approaches?
RQ2: What is the effectiveness of ChatGPT in repairing broken Selenium statements for Web test repair?
RQ3: When ChatGPT explains the element matching result, what is the quality of the explanation?
RQ4: Can our proposed explanation validator guide ChatGPT in self-correction to improve the overall matching and repair results?

Contributions. Our contributions are summarised as follows:

Study: We present the first feasibility study of combining traditional test repair approaches with ChatGPT for Web UI test repair. Our study reveals several findings: (1) the combination with ChatGPT helps to improve the accuracy of all evaluated approaches; (2) to our surprise, although Edit Distance performs the worst individually, its combination with ChatGPT outperforms all the evaluated approaches in element matching and repair; (3) the repair performance of all evaluated approaches is generally similar to the matching performance, but there are two cases where the repair performance decreases, which shows the limitations of the evaluated approaches; (4) our proposed workflow of checking for explanation consistency and generating a self-correction prompt as a hint to ChatGPT could further improve the effectiveness of matching and repair for certain combinations (e.g., Water+ChatGPT).

Technique: We propose a novel Web UI test repair technique that uses a traditional Web UI test repair approach for initial matching and then uses ChatGPT for global matching to further improve the UI matching accuracy. To combat the hallucination problem in ChatGPT-generated responses, we also design an explanation validator that automatically checks the consistency of the generated explanation to provide a mechanism for self-correction.

Evaluation: We evaluate our proposed approaches against three baselines (Water, Vista, Edit Distance) and their combinations with ChatGPT. Our evaluation on a widely used dataset [17] shows that the proposed workflow further improves the effectiveness of prior approaches in Web UI element matching and test repair.

2 BACKGROUND AND RELATED WORK

Automatic UI test repair. Different from traditional unit tests, UI tests are typically used in UI applications (e.g., Web apps and mobile apps) to validate the functionality of a particular component through a sequence of UI event actions (e.g., clicking a button) [7]. When the UI application under test evolves, the corresponding UI tests may crash, leading to significant effort to manually fix these broken tests. Several techniques have been proposed for UI test maintenance [5-7, 13, 20, 24, 26, 41, 47, 49]. Most of these techniques focus on maintaining UI tests for mobile applications [5, 24, 35, 46, 47], Web applications [6, 7, 20, 26, 41] or desktop applications [13, 49]. Existing Web UI test repair techniques [7, 26, 41] typically involve (1) executing the test, (2) extracting information from the Webpage based on various textual attributes [7] or visual information [41] of Web elements, (3) using different matching algorithms to match the element based on the extracted information, and (4) updating the locator of the matched element in the UI test to fix the broken test.

Model-based approaches [5, 13, 17, 24, 49] build a model of the application under test and modify the event flow to fix test cases. Meanwhile, several heuristic-based approaches [7, 26, 41] relocate elements using UI matching algorithms and update the broken tests by replacing the matched elements. As heuristic-based approaches do not require building a model, which may not be practical for large-scale applications, our paper mainly evaluates these approaches. A recent approach [47] first matches elements of two versions of Android apps using various similarity metrics (e.g., semantic embedding similarity, and layout similarity based on node embeddings of the GUI layout tree), and then repairs the tests by updating broken locators. We did not compare against [6, 26] as their matching algorithms are not publicly available, and we exclude the model-based tool [16] as the provided code fails to compile due to missing dependencies.

UI element matching. Given an element from an old version, the goal of a UI element matching task is to find the corresponding element in the new version. UI element matching has been applied to several domains (e.g., test reuse [30], automated compatibility testing [36], and automated maintenance of UI tests). Although there are various techniques for UI element matching, only a few of them are suitable for our evaluation. For example, COLOR [20] and GUIDER [46] use a combination of attribute and visual information to match elements. However, COLOR only recommends updated locators to testers instead of repairing broken tests, while GUIDER focuses on Android test repair. As our goal is to study UI test repair, we did not compare with tools [4, 39] that perform matching without repair. Water [7] and Vista [41] are classic UI test repair tools, which match elements according to attribute information and visual information, respectively. Existing repair work mainly relies on designing strategies based on element information [7, 41], or on combining machine learning algorithms with the repair of UI test cases [47]. Due to the differences between UI test cases and existing APR work [44], it is not suitable to directly use existing APR approaches to repair UI test cases.
In this paper, we focus on Web UI test repair. Different from these prior approaches, we explore the feasibility of using prior test repair approaches to obtain an initial ranked list of elements and using ChatGPT to perform global view matching.

Instead of fixing broken Web UI tests, several approaches focus on improving the robustness of element locators [22, 29]. Leotta et al. developed Robula+ [29], which generates locators by checking whether an attribute uniquely locates an element based on a predefined attribute priority. They also later proposed using multiple locator generators to further increase robustness [22]. They determine the final result by having different locator generators vote, giving higher weight to more reliable locator generators like Robula+. Our experiment design is similar in essence to this approach, letting ChatGPT make a selection among multiple candidates. While improving the robustness of locators is beyond the scope of this paper, it could be worthwhile future work.

LLMs in software maintenance. LLMs such as ChatGPT have been applied to various software maintenance tasks [9, 11, 31, 40, 43, 45], including related tasks such as (1) test generation [11, 21, 48], Android UI testing [12, 27], predicting flaky tests [10], and (2) automated program repair [9, 19, 31, 40]. ChatGPT is a transformer-based chatbot that assists developers in building conversational AI applications [38]. Our proposed approach uses ChatGPT for (1) Web UI element matching, (2) providing explanations for the selected elements, and (3) updating broken statements in tests. To improve code-related generation for LLM-based approaches, a recent approach uses knowledge gained during the pre-training and fine-tuning stages to augment training data, and its results show significant improvement for code summarization and code generation [43]. Similar to this approach, our proposed workflow also uses the knowledge gained during the selection of UI elements (i.e., the explanation generated by ChatGPT) to improve the UI matching results. Different from the aforementioned work, this paper focuses on using ChatGPT for solving the Web UI test repair problem, which aims to perform UI test updates by fixing broken locators. To the best of our knowledge, our study also presents the first attempt to investigate and improve the quality of explanations provided by ChatGPT, by designing an explanation validator that checks the reliability of the explanation to improve the effectiveness of UI matching and test repair.

3 MOTIVATING EXAMPLES

We show two examples where the combination with ChatGPT helps to improve the matching and test repair results.

Without self-correction. Table 1 shows the target element to be matched by Water in the old version of the MantisBT app, and the list of candidate elements returned by Water.

Table 1: An example that shows the target element to be matched, and a list of candidate elements returned by Water

Target Element: {numericId=70, id='', name='new_category', class='', xpath='/html[1]/body[1]/div[4]/form[1]/table[1]/tbody[1]/tr[2]/td[2]/input[1]', text='Category1', tagName='input', linkText='', x=363, y=278, width=261, height=21, isLeaf=true}
Candidate 1: {numericId=20, id='', name='', class='button-small', xpath='/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[3]/form[1]/input[1]', text='Switch', tagName='input', linkText='', x=951, y=121, width=51, height=20, isLeaf=true}
... ...
Candidate 6: {numericId=70, id='', name='name', class='', xpath='/html[1]/body[1]/div[3]/form[1]/table[1]/tbody[1]/tr[2]/td[2]/input[1]', text='Category1', tagName='input', linkText='', x=403, y=295, width=261, height=21, isLeaf=true}

Initially, Water incorrectly ranked "Candidate 1" as the top-1 element. Given the ranked list generated by Water, ChatGPT can make the correct matching by returning "Candidate 6" as the correct element. ChatGPT also explains that it selects the element because "they share the most similar attributes: xpath, text" (its explanation consistency (EC) is equal to 1). After matching, our approach feeds ChatGPT the repair prompt, and it generates the repair in Figure 1. The generated fix uses (1) a different locator (By.xpath instead of By.name), and (2) the XPath from "Candidate 6" for updating the broken statement.

    - driver.findElement(By.name("category")).sendKeys("Category1");
    + driver.findElement(By.xpath("/html[1]/body[1]/div[3]/form[1]/table[1]/tbody[1]/tr[2]/td[2]/input[1]")).sendKeys("Category1");

Figure 1: A ChatGPT fix for the broken statement in MantisBT that replaces the locator with a different one.

With self-correction. Figure 2 shows the fix generated by Water+ChatGPT before (result1) and after self-correction (result2). The explanation provided by ChatGPT for result1 is "Because they share the most similar attributes: xpath, text, tagName, linkText.". Our explanation validator detects that the explanations for the attributes text and linkText are inconsistent with the selection (Explanation Consistency=0.5). After self-correction, Water+ChatGPT repairs this broken statement correctly, and the result is represented as result2. The explanation for result2 is "Because they share the most similar attributes: xpath, text, tagName, linkText." (Explanation Consistency=0.75 because all attributes except for XPath are consistent with the selection).

    - driver.findElement(By.xpath("//*[@id=\"contentwrapper\"]/div[1]/ul/li[3]/a")).click();
    // result1: before self-correction
    + driver.findElement(By.xpath("/html[1]/body[1]/div[1]/div[2]/div[1]/ul[1]/li[1]/a[1]")).click();
    // result2: after self-correction
    + driver.findElement(By.xpath("/html[1]/body[1]/div[1]/div[2]/div[2]/div[1]/div[4]/div[1]/h2[1]/a[1]")).click();

Figure 2: ChatGPT fixes for the broken statement in Collabtive that show the effectiveness of self-correction.

4 METHODOLOGY

[Figure 3: The workflow of Web UI test repair using ChatGPT. The pipeline comprises (1) element extraction from the old and new Webpages; (2) candidate selection via Water, Vista, or XPath Edit Distance; (3) the matching prompt (matching instruction, target element, candidate elements); (4) the explanation validator, which checks explanation consistency (EC); (5) the repair prompt (repair instruction, matched elements, broken statement); (6) the repair validator, which checks the matching result, locator, and assertion; (7) the self-correction prompt, issued on repair failure with EC < 1; and (8) a repair prompt carrying the history prompt and answer. Matching and repair are rerun 4 times.]
Figure 3 shows the overall workflow of our proposed approach. Our approach first extracts information about the corresponding elements on the old Webpage and all elements on the new Webpage. Then, we use a similarity algorithm (Water, Vista or Edit Distance) to rank the elements extracted from the new Webpage, where top-ranking elements are selected as candidate elements for ChatGPT to perform element matching. In the matching prompt, we ask ChatGPT to explain its selection by listing the most similar attributes it considers. Given the generated explanation, our explanation validator checks the consistency of the result. In the repair prompt, we ask ChatGPT to repair the broken statement based on its selected element. Then, we validate the repair result by checking whether the matching is correct, and whether the locator and assertion (if they exist) are updated correctly. Due to the randomness of ChatGPT, we rerun the matching and repair four times and use the best results across the four runs, following the procedure in a prior study [40]. For each incorrect repair where the explanation consistency (EC) is less than 1, our approach generates a self-correction prompt for ChatGPT as a hint to encourage it to provide another answer based on the inconsistent explanation of its matching result. After feeding the repair prompt to ChatGPT, we obtain ChatGPT's repaired broken statement.

4.1 Prompt Design

We interact with ChatGPT via the API of the gpt-3.5-turbo model [1]. To obtain the prompt with the best results, we design our prompt based on the official OpenAI documentation, including: (1) the official ChatGPT API Guide [34] and (2) OpenAI's six strategies for achieving better results [32]. Table 2 shows, for each sentence of the prompt (the "Prompt Content" column), the corresponding rule (the "Rule in OpenAI's official documentation" column) that inspires the design. The ChatGPT API Guide emphasizes the importance of system instructions in providing high-level guidance for conversations, so we use system instructions to design the Web UI test repair context part by (1) telling ChatGPT that it is a UI test repair tool, and (2) outlining the steps in fixing a broken statement.

Due to token limits, we summarize ChatGPT's matching results from the previous dialogue to inform ChatGPT of the specific element information it should use in the subsequent repair process, aligning with the tactic "For dialogue applications that require very long conversations, summarize or filter previous dialogue." [32]. Following the rule "Use delimiters to clearly indicate distinct parts of the input" [32], we use angle brackets to separate sections of text. This aids ChatGPT in recognizing the content of the target element (the element on the old Webpage to be matched) and the broken statement.

Table 2 shows the patterns of the prompt and the rules used. Our complete prompt generation rules are as follows:

Matching Prompt: Context[p1, p2, p3] + Input[p8]
Repair Prompt: Context[p1, p4, p5] + Input[p9]
Self-Correction Prompt: Context[p6, p7]
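To make the prompt construction concrete, below is a minimal sketch of how the matching prompt (Context[p1, p2, p3] + Input[p8]) could be assembled as chat messages for the gpt-3.5-turbo API. The class and method names are illustrative, not part of our implementation; the sentence fragments are taken from Table 2.

    import java.util.List;
    import java.util.Map;

    public class MatchingPromptBuilder {
        // p1: system instruction giving high-level guidance (Table 2).
        static final String P1_SYSTEM = "You are a web UI test script repair tool.";

        // p2 + p3: subtask description plus an example of the expected answer format.
        static final String P2_P3_TASK =
            "To repair the broken statement, you need to choose the element most similar "
          + "to the target element from the given candidate element list firstly. "
          + "Give me your selected element's numericId and a brief explanation containing "
          + "the attributes that are most similar to the target element. "
          + "Your answer should follow the format of this example: "
          + "\"The most similar element's numericId: 1. "
          + "Because they share the most similar attributes: id, xpath, text.\"";

        // p8: delimited input holding the target element and the candidate list.
        static List<Map<String, String>> buildMatchingPrompt(String targetElement,
                                                             List<String> candidates) {
            String input = "Target element: <" + targetElement + ">\n"
                         + "Candidate elements: <" + String.join(", ", candidates) + ">";
            return List.of(
                Map.of("role", "system", "content", P1_SYSTEM),
                Map.of("role", "user", "content", P2_P3_TASK + "\n" + input));
        }
    }

The repair and self-correction prompts can be assembled analogously from patterns p4-p7 and p9.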
4.2 Information Extraction

We used UITESTFIX [26] to extract the information about the Web elements. UITESTFIX retrieves the HTML source code of the target Webpage through a Web browser, and then employs the Jsoup library [15] to analyze the source code and extract the Web elements' information from it. Note that although we use UITESTFIX for information extraction, we did not compare against UITESTFIX as its matching algorithm is not publicly available. Specifically, we extract information about the following attributes: id, name, class, XPath, text, tagName, linkText, x, y, width, height, and isLeaf. The first two columns of Table 3 show the descriptions and examples of these attributes, whereas the "Used by Tools" column shows whether Water and Vista consider these attributes when matching elements and how these tools use them.

4.3 Candidate Selection

To address the token length limitation of ChatGPT, our approach feeds ChatGPT a list of pre-selected candidate Web elements. Specifically, our approach first obtains the 10 top-ranked candidate elements from the matching results of the prior approaches (Water, Vista, Edit Distance). We only chose 10 elements because (1) the gpt-3.5-turbo model that we use has a limit of 4096 tokens, and (2) each element is represented by 12 attributes, which can be lengthy. We briefly introduce the three approaches below:

Vista: Vista uses the Fast Normalized Cross-Correlation algorithm [3] to calculate the template matching results. It returns a list where elements of a particular screen position are ranked in descending order based on their similarity scores. Instead of using the top-1 element, we modify Vista to retrieve the top 10 elements based on the matching point coordinates as candidates for the pre-selection.

Water: As shown in Table 3, Water first checks whether there are candidate elements that have the same id, XPath, class, linkText, or name as the target element. If so, Water adds them to the candidate list. Then Water ranks the elements on the new Webpage according to their similarity score. If there are candidates with the same first score, Water calculates a second score using tagName, coordinates, clickable, visible, zindex (the screen stack order of the element), and hash (a checksum of the textual information of the element). During the second similarity score calculation, Water gives a greater weight (0.9) to the XPath similarity because the authors of Water assume that the XPaths of the matched elements are usually very similar after the version update. Instead of its original setting that returns one best matched element, as Water keeps the candidates in a ranked list, we use the top 10 elements in the ranked list as candidates for the pre-selection.
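For illustration, the sketch below shows the XPath-based ranking that underlies both Water's weighted XPath similarity and the Edit Distance baseline described next. The class and method names are ours, not taken from either tool's implementation.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class XPathRanker {
        // Classic dynamic-programming Levenshtein distance between two strings.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + subst);
                }
            }
            return d[a.length()][b.length()];
        }

        // Normalized similarity in [0, 1]; 1 means the XPaths are identical.
        static double xpathSimilarity(String oldXPath, String newXPath) {
            int maxLen = Math.max(oldXPath.length(), newXPath.length());
            return maxLen == 0 ? 1.0
                               : 1.0 - (double) levenshtein(oldXPath, newXPath) / maxLen;
        }

        // Keep the 10 candidates whose XPaths are most similar to the old element's.
        static List<String> topTenByXPath(String oldXPath, List<String> newXPaths) {
            return newXPaths.stream()
                .sorted(Comparator.comparingDouble((String x) -> -xpathSimilarity(oldXPath, x)))
                .limit(10)
                .collect(Collectors.toList());
        }
    }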
Edit Distance: Inspired by the key idea of prioritizing XPath similarity in Water, we design a simplified element matching algorithm that only considers XPath similarity. Similar to Water, we use the Levenshtein distance [37] to measure the difference between XPaths. The Levenshtein distance measures the minimum number of insertions, deletions, and substitutions required to transform one string into another: a greater XPath edit distance means that the two elements are less similar. This algorithm is also employed in other fields of Web testing (e.g., prior work has used the Levenshtein distance to detect conflicting and outdated data on Webpages [18]). This variant returns the top 10 elements ranked in descending order according to their XPath similarities.

Table 2: The patterns of the prompt and generation rules

PID | Rule in OpenAI's official documentation | Prompt Content
Web UI test repair context patterns (Context):
p1 | Use system instructions to give high-level instructions [34] | You are a web UI test script repair tool.
p2 | Split complex tasks into simpler subtasks [33] | To repair the broken statement, you need to choose the element most similar to the target element from the given candidate element list firstly. Give me your selected element's numericId and a brief explanation containing the attributes that are most similar to the target element.
p3 | Provide examples [32] | Your answer should follow the format of this example: "The most similar element's numericId: 1. Because they share the most similar attributes: id, xpath, text."
p4 | Summarize or filter previous dialogue [32] | To repair the broken statement, you chose the element <selected element> as the most similar to the target element from the given candidate element list.
p5 | Specify the steps required to complete a task [32] | Now based on your selected element, update the locator and outdated assertion of the broken statement. Give the result of the repaired statement.
p6 | Use delimiters to clearly indicate distinct parts of the input [32] | This is a previous prompt: <Matching Prompt> This is your previous answer: <Corresponding Answer>
p7 | (same as p6) | But your explanation for attributes <attributes> are inconsistent with your selection and this will influence the correctness of your answer. Please answer again.
Input pattern (Input):
p8 | Use delimiters to clearly indicate distinct parts of the input [32] | Target element: <target element> / Candidate elements: <candidate element list>
p9 | (same as p8) | Broken statement: <broken statement>

Table 3: Extracted attributes and the tools that use these attributes. The superscript number in the last column indicates the priority of an attribute for a tool.

Attribute | Description (Example) | Used by Tools
id | Unique identifier for an element () | Water^E1
name | Name for an element (submit) | Water^E5
class | Class name for an element () | Water^E3
XPath | Path of an element in the DOM tree (/html[1]/body[1]/div[1]/div[4]/form[1]/input[1]) | Water^E2, Water^L6, Edit Distance^L1
text | Textual information of an element (Enter) |
tagName | Name of the HTML tag (input) |
linkText | The text content of a hyperlink () | Water^E4
x | X-coordinate for an element (66) | Water^6
y | Y-coordinate for an element (183) | Water^6
width | Width of an element (48) | Vista^1
height | Height of an element (21) | Vista^1
isLeaf | True if the element is a leaf node in the DOM tree, and false otherwise (true) |

Notes: The superscript 'E' means the tool checks whether the candidate element's attribute is exactly the same as the target element's. The superscript 'L' indicates the tool uses the Levenshtein distance to calculate the similarity. The number in the superscript indicates the priority of the attribute. We refer to the version of Water implemented in prior work [41] to determine the precise priority of attributes in Water, and to the VISUAL mode of Vista [41] for Vista's priority.

4.4 Explanation Validator

As shown in the highlighted texts in Table 2, we instruct ChatGPT to generate a brief explanation describing the attributes that it uses for selecting the best matched element. Our intuition is that if the provided explanation is consistent with its selection, then the selection is more likely to be accurate (and a repaired statement is more likely to be successfully generated). Based on this intuition, we have developed an explanation validator that checks whether the explanation given by ChatGPT is consistent with the actual selection. Specifically, for each attribute a mentioned in the explanation, our explanation validator determines the most similar element for the consistency calculation cons(a_i, R), where cons(a_i, R)=1 if the most similar element has been selected:

Position: We use the Euclidean distance for calculating position-related attributes (e.g., x and y coordinates), and we choose the element with the minimum distance as the most similar.
Size: We calculate the product of the width and the height of an element, and consider the element with the smallest size difference as the most similar.
isLeaf: We check whether the value of isLeaf is the same (true or false).
Other attributes: We use the Levenshtein edit distance to measure similarity, with the smallest distance considered the most similar.

In cases where multiple candidate elements share the same similarity with the target element for a particular attribute, we retain the results of all such candidate elements and consider the explanation given by ChatGPT to be consistent if any one of these candidate elements has been selected.
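To make these rules concrete, below is a minimal sketch of cons and of the EC computation formalized in Definition 4.1 right after it. It assumes each element is represented as a map from attribute names to string values, reuses the levenshtein helper from the earlier sketch, and all class and method names are our own illustrative choices rather than the actual implementation.

    import java.util.List;
    import java.util.Map;

    public class ExplanationValidator {
        // cons(a, R): 1 if, judged by attribute a alone, the selected element is
        // among the most similar candidates to the target; 0 otherwise.
        static int cons(String attr, Map<String, String> target,
                        Map<String, String> selected, List<Map<String, String>> candidates) {
            double best = candidates.stream()
                .mapToDouble(c -> similarity(attr, target, c)).max().orElse(0);
            return similarity(attr, target, selected) >= best ? 1 : 0; // ties count
        }

        // Attribute-specific similarity rules from Section 4.4 (higher is more similar).
        static double similarity(String attr, Map<String, String> t, Map<String, String> c) {
            switch (attr) {
                case "location": // x/y: negated Euclidean distance
                    return -Math.hypot(num(t, "x") - num(c, "x"), num(t, "y") - num(c, "y"));
                case "size":     // width * height: negated area difference
                    return -Math.abs(num(t, "width") * num(t, "height")
                                   - num(c, "width") * num(c, "height"));
                case "isLeaf":   // exact match on the boolean value
                    return t.get("isLeaf").equals(c.get("isLeaf")) ? 1 : 0;
                default:         // strings: negated Levenshtein distance
                    return -XPathRanker.levenshtein(t.get(attr), c.get(attr));
            }
        }

        static double num(Map<String, String> e, String key) {
            return Double.parseDouble(e.get(key));
        }

        // EC(e): fraction of mentioned attributes whose most similar candidate
        // agrees with ChatGPT's selection (Definition 4.1).
        static double explanationConsistency(List<String> mentionedAttrs,
                Map<String, String> target, Map<String, String> selected,
                List<Map<String, String>> candidates) {
            int consistent = mentionedAttrs.stream()
                .mapToInt(a -> cons(a, target, selected, candidates)).sum();
            return consistent / (double) mentionedAttrs.size();
        }
    }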
Definition 4.1 (Explanation Consistency (EC)). Given the target element t (the element in the old version of the Webpage to be matched), the selection result R, and an explanation e that mentions one or more attributes A = {a_1, a_2, ..., a_n}, we calculate the Explanation Consistency (EC) of e by computing cons(a_i, R) for each attribute a_i, where cons(a_i, R) checks whether attribute a_i of R is most similar to that of the target element t (cons(a_i, R) = 1 if the most similar element based on a_i is selected in R, and cons(a_i, R) = 0 otherwise):

EC(e) = ( Σ_{i=1}^{n} cons(a_i, R) ) / |A|, where n = |A|

Definition 4.1 presents the definition of Explanation Consistency (EC). For each explanation generated by ChatGPT, our explanation validator checks whether the attributes mentioned in the explanation as similar between the selected element and the target element are consistent with the calculated values of all mentioned attributes. If the calculated values for all mentioned attributes are consistent, then our explanation validator considers the provided explanation consistent across all mentioned attributes (EC=1).

5 EXPERIMENTAL SETUP

Dataset. We use an existing dataset [17] for evaluating the effectiveness of Web test repair approaches. The dataset contains Java Selenium UI tests from five open-source Web applications (3 of which are used in the experiment of Vista [41]). All of these open-source applications are hosted on SourceForge (except for MantisBT, which is hosted on GitHub [2]). We follow the same filtering process as a prior evaluation [26] to remove duplicated tests and non-broken tests. Subsequently, we obtained 62 test cases containing 139 broken statements as our dataset.

Preparing the ground truth dataset. As our evaluation dataset only has tests for the old versions of the Web applications, we need to manually label the ground truths for the matching UI elements in the new versions. Specifically, the first two authors (i.e., annotators) independently labelled the ground truths for each UI element located in the broken statements in the dataset for the three individual baselines (Water, Vista, Edit Distance). Both annotators are graduate students with more than one year of experience in relevant research on Web UI test repair. For 12 cases out of 3*139=417 cases (139 for each of the three individual baselines), the annotators had disagreements, so they met to resolve them (Cohen's Kappa=0.94, which indicates almost perfect agreement).

Table 4 shows the old and the new updated versions of the open-source Web applications in our dataset. The "ΔV" column denotes the number of versions that occur between the old and the new versions (a greater number of versions usually denotes more substantial UI changes between the two versions, leading to more challenging matching and repair). The "Tests" column represents the number of tests for each app, whereas the "Broken Stmt" column denotes the number of broken statements for each app. Note that one test could have multiple broken statements. In our experiment, we assume that each broken statement is independent of subsequent statements (a repair failure of a preceding statement will not be propagated to subsequent statements) so that we can measure the effectiveness of ChatGPT in repairing each broken statement. This assumption is similar to prior evaluations of learning-based automated program repair techniques where the correct fault location is provided (perfect localization) [25, 28].

Table 4: Statistics of open-source Web apps in our dataset

Application | ΔV | Old Version | Updated Version | Tests | Broken Stmt
AddressBook | 8 | 4.0 | 6.1 | 2 | 2
Claroline | 29 | 1.10.7 | 1.11.5 | 27 | 53
Collabtive | 5 | 0.65 | 1 | 4 | 11
MantisBT | 38 | 1.1.8 | 1.2.0 | 25 | 66
MRBS | 24 | 1.2.6.1 | 1.4.9 | 4 | 7

All experiments are run on a computer with an Intel Core i5 processor (1.6GHz) and 12 GB RAM. For the experiments related to ChatGPT, we use the API of the gpt-3.5-turbo model and set the temperature to 0.8, as used in prior work [9]. Our experiment with ChatGPT costs USD 10.96 in total (which uses 8,429 API requests and 7,178,802 tokens).

6 RQ1: EFFECTIVENESS OF UI MATCHING

6.1 Matching Result and Analysis

Our evaluation includes a total of six approaches: the individual baselines (Vista, Water, Edit Distance) and their respective combinations with ChatGPT. For each approach, we recorded its ability to correctly match the ground truth elements of the broken statements in our dataset.

The "Matching" columns in Table 5 show the comparison results for the six approaches on the evaluated applications. If we compare the overall matching results of each individual baseline with its corresponding combination with ChatGPT, we observe that all combinations outperform the individual baselines (i.e., 97 versus 68 for Vista, 86 versus 81 for Water, and 122 versus 43 for Edit Distance). This result confirms our hypothesis that combining prior test repair approaches with ChatGPT helps to improve the UI matching results.

Finding 1: All the combinations of prior Web test repair approaches with ChatGPT outperform the corresponding standalone approach (without ChatGPT).
Implication 1: Our suggested workflow of using prior test repair approaches for pre-selecting a ranked list of candidate elements and then using ChatGPT for global matching is effective in improving the overall accuracy of UI matching.

We notice that Vista's matching algorithm, which uses visual information, is more effective for certain applications (e.g., Claroline and MRBS). However, the overall results show that Edit Distance+ChatGPT performs the best (correctly matching 122 out of a total of 139 broken statements), outperforming all the other approaches. Specifically, the combination Edit Distance+ChatGPT yields the greatest improvement (183.72%) over the individual Edit Distance approach. After combining with ChatGPT, it matches 79 broken statements that originally could not be matched, as well as the 43 broken statements that could originally be matched, without matching any originally matchable broken statement incorrectly. Considering the fact that the individual Edit Distance performs the worst among all the individual baselines, we think this result is a bit counterintuitive, because one would select the tool that performs well individually (i.e., Water) for combining with ChatGPT to achieve further improvement.
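For clarity, the reported 183.72% is the relative gain of Edit Distance+ChatGPT over the standalone Edit Distance baseline, computed from the matching totals in Table 5:

(122 - 43) / 43 = 79 / 43 ≈ 1.8372, i.e., an improvement of 183.72%.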
Table 5: The number of correct matchings and the number of correct repairs of different approaches (Matching / Repair)

Applications | Vista | Vista+ChatGPT | Water | Water+ChatGPT | Edit Distance | Edit Distance+ChatGPT
AddressBook | 1 / 1 | 1 / 1 | 2 / 2 | 2 / 2 | 0 / 0 | 2 / 2
Claroline | 48 / 47 | 51 / 51 | 18 / 18 | 18 / 18 | 1 / 1 | 47 / 47
Collabtive | 1 / 1 | 1 / 1 | 10 / 10 | 7 / 7 | 8 / 8 | 8 / 8
MantisBT | 11 / 11 | 37 / 37 | 50 / 50 | 56 / 56 | 34 / 34 | 63 / 62
MRBS | 7 / 7 | 7 / 7 | 1 / 1 | 3 / 3 | 0 / 0 | 2 / 2
Total | 68 / 67 | 97 / 97 | 81 / 81 | 86 / 86 | 43 / 43 | 122 / 121

Finding 2: Although the individual Edit Distance approach performs the worst among all the individual baselines, its combination with ChatGPT outperforms all the evaluated approaches.
Implication 2: Edit Distance, a simplified variant of Water, integrates well with ChatGPT compared to the other combinations. By ranking elements using only XPath similarity, Edit Distance delegates the responsibility of matching the other attributes to ChatGPT, which later performs global view matching.

6.2 Ranking Performance of the Baselines

As the combination Edit Distance+ChatGPT outperforms all evaluated approaches, we are interested in analyzing the reasons behind the improvement. Specifically, as the baselines Water and Vista originally return only one element as the best matching result, we analyze the ranking performance of each baseline approach. Given an element se selected by a tool t and the correct element te (i.e., the corresponding element in the ground truth), we consider that t hits if se is exactly the same as te. If one of the elements in the ranked list produced by a tool t hits, we record its ranking to evaluate the ranking performance. We employ a metric commonly used in evaluating top-N recommendation tasks [23]: the Top-K Hit Ratio (HR, or Recall; in this study, the proportion of experimental instances in which the top N candidates selected by each baseline contain the ground truth in the candidate list).
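In symbols, letting I denote the set of evaluated instances and rank_i the position of the ground-truth element in a baseline's ranked list (taken as ∞ when the ground truth is absent from the list), the metric above can be written as:

HR@N = |{ i ∈ I : rank_i ≤ N }| / |I|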
Figure 4 depicts the variation of the Hit Ratio with increasing values of N for the three baselines.

[Figure 4: Comparison of Hit Ratios for the three baselines. The plot shows the Hit Ratio (0-100%) of Vista, Water, and Edit Distance against the top-N candidates in the list, for N = 0 to 10.]

We observe several interesting trends in Figure 4: 1) When N=1, the Hit Ratio for Vista and Water has already reached around 50%, while Edit Distance's Hit Ratio is only 30.9%. This indicates that the top candidate selected by Vista and Water already effectively hits the ground truth, whereas Edit Distance's advantage is less evident. 2) As N increases, the Hit Ratios for Vista and Water increase slowly. Even when N grows to 10, their Hit Ratios only increase by less than 25% compared to when N is 1. In contrast, Edit Distance's Hit Ratio gradually increases, reaching 68.3% when N=5; and when N=9 and 10, Edit Distance's Hit Ratio reaches 84.2% and 90.1%, respectively. Hence, when the number of selected candidates is expanded to the top 10 elements in the candidate list, Edit Distance is more likely to hit the ground truth element compared to Vista and Water, leading to more potential improvement when combined with ChatGPT.

7 RQ2: EFFECTIVENESS OF TEST REPAIR

Before checking the repair correctness of each repaired statement, we first check whether ChatGPT has the correct matching result for the broken statement, because a correct repair can only be generated after a correct matching.

Repair Correctness. As the ground truth repaired statements (the repairs written by the developers of the Web apps) are unavailable in our dataset, and the repaired statements generated by ChatGPT may have diverse fix patterns, we need to manually validate the correctness of all generated repaired statements. To reduce the manual effort in validating the repair correctness of the generated repaired statements, we use a semi-automated approach to verify the correctness of each generated repaired statement. Specifically, given the original broken statement o, the repaired statement r, and the ground truth element e, we design a parser that automatically parses the locator type (e.g., By.name and By.xpath) and the expression within the locator (e.g., the XPath value) in the repaired statement r to verify their correctness with respect to the ground truth element e. For example, if ChatGPT uses By.xpath as the locator, it should use the XPath information of the element rather than the value of other attributes. To identify cases that require subsequent manual analysis, our parser also finds repaired statements where more substantial changes have been introduced by ChatGPT, including (1) different types of locators are used in o and r, or (2) additional statements have been added to the repaired statement r.
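A minimal sketch of such a locator parser is shown below. It assumes the repaired statement uses the common By.<type>("<value>") form; the regular expression and the names are illustrative, not the exact implementation.

    import java.util.Map;
    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LocatorParser {
        // Captures the locator type (e.g., xpath, name) and its quoted value,
        // tolerating escaped quotes inside the value.
        private static final Pattern LOCATOR =
            Pattern.compile("By\\.(\\w+)\\s*\\(\\s*\"((?:[^\"\\\\]|\\\\.)*)\"\\s*\\)");

        record Locator(String type, String value) {}

        static Optional<Locator> parse(String statement) {
            Matcher m = LOCATOR.matcher(statement);
            return m.find() ? Optional.of(new Locator(m.group(1), m.group(2)))
                            : Optional.empty();
        }

        // A repaired locator is deemed correct when its value equals the
        // corresponding attribute of the ground-truth element e.
        static boolean matchesGroundTruth(Locator loc, Map<String, String> groundTruth) {
            String attribute = loc.type().equals("className") ? "class" : loc.type();
            return loc.value().equals(groundTruth.get(attribute));
        }
    }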
The "Repair" columns in Table 5 show the repair results for all evaluated approaches. We can observe that the number of correct repairs is almost equal to the number of correct matchings (except for one case each in Vista and Edit Distance+ChatGPT). Vista fails to repair one statement in Claroline after correctly matching the element, as it currently does not support the repair of assertion values [41]. Meanwhile, for MantisBT, Edit Distance+ChatGPT correctly matches the element but generates a repair that modifies the original intention of the broken statement. Specifically, we consider the repair incorrect because the broken statement sends a text input "" to the element, but Edit Distance's repaired statement sends "Test" to the element, which changes the intention of the original test statement. Nevertheless, the Edit Distance+ChatGPT combination still excels, with the best performance of 121 correct repairs.

Finding 3: The repair performance of all evaluated approaches is generally similar to the matching performance, except for two cases where the generated repairs are incorrect.
Implication 3: After matching the correct element, most approaches could generate the correct repairs. For the two exceptional cases, we identify some limitations of the evaluated approaches, including a lack of diverse fix patterns (Vista) and the introduction of non-intention-preserving changes (Edit Distance+ChatGPT).

Fix Patterns used by ChatGPT. Our prompt only directs ChatGPT to modify the locator and outdated assertion in the broken statement based on the elements it selects for repair, because we want to keep the expressions outside the locator unchanged. It is crucial that ChatGPT refrains from altering expressions outside the broken statement's locator to preserve the original intention of the test. For example, actions such as changing a click operation to entering text on the element should not take place. During the manual validation of the repair results, we examine both the correctness of the repaired locator and whether the content outside the locator remains unaltered. To investigate the diversity of the generated repairs, we manually analyze and derive common fix patterns used by ChatGPT. Overall, we found five fix patterns in the repairs of ChatGPT, described below:

Single statement change: Modify locator values. 73.61% of correct repairs share this fix pattern. Figure 5 shows an example of a ChatGPT-generated repair where it modifies the value of the locator to update the XPath.

    - driver.findElement(By.xpath("//*[@id=\"content\"]/form[2]/div[4]/input")).click();
    + driver.findElement(By.xpath("/html[1]/body[1]/div[1]/div[4]/form[2]/div[3]/input[1]")).click();

Figure 5: A ChatGPT fix for the broken statement in AddressBook that only modifies the value of the locator.

Single statement change: Modify assertion values. Figure 6 shows an example of a ChatGPT-generated repair where it modifies the value of the assertion to update the XPath. As Vista does not currently support updating assertion values, ChatGPT could help Vista generate repaired statements that use more diverse fix patterns in this case. We observe that 13.82% of correct repairs involve assertions. In 7.79% of these cases, ChatGPT modified assertion values to update outdated expected values.

    - assertEquals(driver.findElement(By.xpath("//div[@id='claroBody']/div[2]/div")).getText(), "This username is already taken");
    + assertEquals(driver.findElement(By.xpath("/html[1]/body[1]/div[1]/div[2]/div[1]/div[2]")).getText(), "This username is already taken");

Figure 6: A ChatGPT fix for the broken statement in Claroline that changes the assertion value to update the XPath.

Single statement change: Use different assertions and values. Figure 7 shows an example of a ChatGPT-generated repair where it uses a different type of assertion (replacing assertTrue with assertEquals) and a different assertion value (updating the XPath). In this case, the statement that replaces assertTrue with assertEquals preserves the semantics of the assertion, as the original statement already invokes .equals() within the assertion. In 10.39% of correct repairs that involve assertions, ChatGPT used different assertions.

    - Assert.assertTrue(driver.findElement(By.xpath("//*[@id=\"loginBox\"]/h3")).getText().equals("Authentication"));
    + Assert.assertEquals("Authentication", driver.findElement(By.xpath("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]")).getText());

Figure 7: A ChatGPT fix for the broken statement in Claroline that uses a different assertion (replacing assertTrue with assertEquals) and assertion value (updated XPath).

Single statement change: Use different locators and values. Figure 1 shows an example of a ChatGPT-generated repair where it uses a different type of locator (replacing By.name with By.xpath) and a different locator value (updating the XPath). Compared to other locators (By.name and By.className), we observe that most broken statements tend to use By.xpath locators (70.50%). Among the generated correct repairs, we observed that ChatGPT often uses By.xpath when updating the broken locator (99.82%). In 26.39% of correct repairs, ChatGPT changed both the locators and their values. Among these cases, we only observe one case where ChatGPT changes By.xpath into another locator (By.className); for the remaining cases, ChatGPT replaces other locators with By.xpath.

Multi-statement change: Add statements and modify locators. 1.26% of correct repairs involve multi-statement changes. Figure 8 shows an example of a ChatGPT-generated repair where (1) it stores the element in a temporary variable element, (2) performs the click action on the variable, and (3) adds an assertion to check the validity of the text of the element. Since the test where the broken statement is located has another assertion that serves as the test oracle, adding the assertion is unnecessary in this example. However, this example demonstrates the ability of ChatGPT to generate assertions, which could be helpful for test case generation.

    - driver.findElement(By.xpath("//div[@id='claroBody']/div[2]/div")).click();
    + WebElement element = driver.findElement(By.xpath("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]"));
    + element.click();
    + String actualText = element.getText();
    + String expectedText = "The new user has been sucessfully created";
    + assertEquals(expectedText, actualText);

Figure 8: A ChatGPT-generated fix for the broken statement in Claroline that adds statements.

8 RQ3: QUALITY OF CHATGPT'S EXPLANATION ON MATCHING RESULT

To assess the quality of ChatGPT's explanations for element matching results, we use two metrics: (1) mention frequency (the number of times M an attribute a has been mentioned), and (2) mention consistency (the number of times C an attribute a has been consistently mentioned, where the consistency is determined by our explanation validator described in Section 4.4). These two metrics help in answering the research questions below:
Guiding ChatGPT to Fix Web UI Tests via Explanation-Consistency Checking Conference’17, July 2017, Washington, DC, USA

Table 6: The number of times an attribute is mentioned (M) and the number of times where the mentioned attribute is consistent
(C) in the generated explanation for each approach

id name class XPath text tagName linkText location size isLeaf


Approach
C M C M C M C M C M C M C M C M C M C M
Vista 1 16 11 30 98 111 304 542 392 498 390 412 111 122 74 201 113 201 164 169
Water 12 122 28 55 148 161 254 555 346 484 299 410 176 192 154 221 213 221 201 201
Edit Distance 121 121 23 131 159 164 291 556 452 510 316 347 126 149 128 306 179 304 178 179
Total 134 259 62 216 405 436 849 1653 1190 1492 1005 1169 413 463 356 728 505 726 543 549

RQ3a: What are the frequently mentioned attributes in ChatGPT’s categories respectively, we compute the Point-Biserial Correlation
explanation? Coefficient (𝑟 𝑝𝑏𝑖 ) [42] for all explanations. Since this is a special
RQ3b: What are the mentioned attributes that are consistent in case of the Pearson Correlation, we assess the strength of the re-
ChatGPT’s explanation? lationship by evaluating the numerical value of the correlation
Table 6 presents the mention frequency and mention consistency coefficient. For the three combinations with ChatGPT, we obtain
of ChatGPT for each attribute for each approach. It can be observed the following values for the Point-Biserial Correlation Coefficient:
in the table that ChatGPT mentions two attributes most frequently: 𝑟 𝑝𝑏𝑖 =0.51 for Vista+ChatGPT, 𝑟 𝑝𝑏𝑖 =0.84 for Water+ChatGPT, and
XPath (1653) and text (1492), wheares the least mentioned attribute 𝑟 𝑝𝑏𝑖 =0.49 for Edit Distance+ChatGPT. These values indicate that
is name (216). Compared to the priority imposed by prior test repair EC and the correctness of the matching for the Vista+ChatGPT
approaches (shown in Table 3), this result show that ChatGPT and Edit Distance+ChatGPT are only moderately correlated but
has certain preference towards particular attributes since it often strongly correlated for the Water+ChatGPT combination. The strong
mention the XPath and text attributes regardless of the baseline correlation between EC and Water+ChatGPT’s correctness of the
used for selecting the list of candidate elements. matching implies that our explanation validator will be most effec-
tive in improving the matching results for the Water+ChatGPT
Finding 4: ChatGPT frequently mentions the the XPath and combination among all the three evaluated combinations.
text attributes in the provided explanations.
Implication 4: Similar to other approaches that imposed certain Finding 6: Among the three evaluated combinations, EC for
priority, ChatGPT has certain preferences towards the XPath the Water+ChatGPT combination is strongly correlated with
and text attributes. the matching correctness.
Correlation between EC and correctness of the matching. Our explanation validator measures the explanation consistency (EC) as a mechanism to check and improve the reliability of the matching results provided by ChatGPT. However, even if our explanation validator can accurately assess EC, the final matching result may still be incorrect (i.e., it may not retrieve the target element in the labeled ground truth). To investigate the correlation between our proposed EC and the correctness of the final matching result (i.e., whether it selects the element in the ground truth), we measure the correlation between these two variables. As these two variables belong to the discrete (whether it selects the element in the ground truth is a binary variable) and continuous (EC is a value between 0 and 1) categories respectively, we compute the Point-Biserial Correlation Coefficient (r_pbi) [42] for all explanations. Since this is a special case of the Pearson Correlation, we assess the strength of the relationship by evaluating the numerical value of the correlation coefficient. For the three combinations with ChatGPT, we obtain the following values for the Point-Biserial Correlation Coefficient: r_pbi = 0.51 for Vista+ChatGPT, r_pbi = 0.84 for Water+ChatGPT, and r_pbi = 0.49 for Edit Distance+ChatGPT. These values indicate that EC and the correctness of the matching are only moderately correlated for Vista+ChatGPT and Edit Distance+ChatGPT, but strongly correlated for the Water+ChatGPT combination. The strong correlation between EC and Water+ChatGPT's matching correctness implies that our explanation validator will be most effective in improving the matching results for the Water+ChatGPT combination among all three evaluated combinations.
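This measurement can be reproduced with standard statistics tooling; below is a minimal sketch using scipy, where the input vectors are invented placeholders (one entry per matched statement).

```python
# Minimal sketch: Point-Biserial correlation between matching
# correctness (binary) and explanation consistency (continuous).
from scipy.stats import pointbiserialr

# Illustrative placeholder inputs, one entry per repaired statement.
is_correct = [1, 0, 1, 1, 0, 1]              # 1 = matched ground-truth element
ec_scores = [0.9, 0.3, 0.8, 1.0, 0.4, 0.7]   # EC values in [0, 1]

r_pbi, p_value = pointbiserialr(is_correct, ec_scores)
print(f"r_pbi = {r_pbi:.2f}, p = {p_value:.3f}")
```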
Finding 6: Among the three evaluated combinations, EC for the Water+ChatGPT combination is strongly correlated with the matching correctness.
Implication 6: The strong correlation indicates that our explanation validator, which checks EC and generates the self-correction prompt, will be most effective in helping the Water+ChatGPT combination to further improve the matching correctness.

9 RQ4: IMPROVING RELIABILITY OF CHATGPT'S GENERATED EXPLANATION
Table 7 presents the results of the VISTA, WATER, and XPath (i.e., Edit Distance) tools combined with ChatGPT before and after self-correction (SC), comparing the correct matches and the average explanation consistency (EC). We can observe that, except for Edit Distance+ChatGPT, the number of correct matches has increased for the other approaches. Notably, the improvement over Water+ChatGPT is the greatest (it gains 5 more correct matches and repairs after SC). Meanwhile, we think that Edit Distance+ChatGPT does not show any improvement due to the moderately positive correlation between EC and matching accuracy (0.49).

Finding 7: The improvement offered by the self-correction mechanism varies across different baselines. Among the three combinations, Water benefits the most from self-correction (5 more correct matches after the correction).
Implication 7: Our proposed workflow of checking for explanation consistency and generating a self-correction prompt as a hint to ChatGPT to improve the reliability of its explanation could further improve the effectiveness of matching and repair for certain combinations (e.g., Water+ChatGPT).
Table 7: The matching and repair results before and after self-correction

             VISTA+ChatGPT         WATER+ChatGPT         XPath+ChatGPT
Approach     before SC  after SC   before SC  after SC   before SC  after SC
Matching         97         97         86         91        122        123
Repair           97         97         86         91        121        122
10 IMPLICATIONS AND DISCUSSIONS
Our study identifies several key implications and suggestions for future test repair and ChatGPT research.
Prioritization of attributes by Web UI test repair tools. Table 3 shows that prior test repair tools have different ways of prioritizing attributes when matching elements (e.g., Water gives higher priority to XPath similarity, whereas Vista uses the position and the size information of an element to extract visual information for matching). As the initially selected priorities may be biased, our study shows that global matching using ChatGPT can improve over prior approaches (Finding 1) by mitigating this bias, leading to more accurate matching. In fact, our study of the frequently mentioned attributes in ChatGPT's explanation also reveals that ChatGPT has preferences towards certain attributes, e.g., XPath and text (Finding 5). Although this paper only studies the prioritization of attributes used by two traditional Web test repair approaches, we foresee that other UI matching techniques that measure similarities using multiple properties may share similar bias. In future work, it is worthwhile to study (1) the characteristics of the selected prioritization in other tasks where UI matching algorithms are used (e.g., compatibility testing [36]), and (2) improving the effectiveness of other UI matching techniques via subsequent matching using LLMs.
Test repair techniques used for selecting candidate elements. Our study that compares three baseline approaches (Water, Vista, and Edit Distance) for the pre-selection of candidate elements shows that a simplified version of Water (i.e., Edit Distance) integrates well with ChatGPT, where the combination of Edit Distance+ChatGPT outperforms all the evaluated approaches (Finding 2). Compared to Water, which matches using multiple attributes (e.g., id, XPath), and Vista, which uses complex information for matching (position information when extracting visual information), Edit Distance, which matches using only XPath information, is not effective as a standalone matching algorithm (it performs worst among all individual baselines). However, our study shows that by using only XPath similarity for matching, Edit Distance delegates the responsibility of matching using other attributes to ChatGPT, which can later perform better global view matching. Intuitively, one may think that Water, which performs the best among the individual baselines, would lead to the best combination with ChatGPT. As the best individual baseline may not be the most effective combination with ChatGPT, our study urges researchers to perform a thorough evaluation when choosing an appropriate baseline to combine with ChatGPT for solving other software maintenance tasks.
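To make the Edit Distance baseline concrete, the following is a minimal sketch of XPath-only candidate pre-selection in its spirit; the normalization, the top-k cut-off, and the function names are our own illustrative assumptions rather than the tool's actual implementation.

```python
# Minimal sketch of XPath-only candidate pre-selection, in the spirit of
# the Edit Distance baseline; the normalization and top-k cut-off are
# illustrative assumptions.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming string edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def preselect(old_xpath: str, new_xpaths: list[str], k: int = 5) -> list[str]:
    # Rank the new page's XPaths by normalized edit distance to the old
    # one; the top-k list is then handed to ChatGPT for global matching.
    def norm_dist(xp: str) -> float:
        return levenshtein(old_xpath, xp) / max(len(old_xpath), len(xp), 1)
    return sorted(new_xpaths, key=norm_dist)[:k]
```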
LLM-based test repair. Our study shows promising results in using LLMs like ChatGPT for Web test repair (Finding 3). Our manual analysis shows that it can generate correct repairs for the broken statements using a diverse set of fix patterns, including modifications and generations of assertions. This shows the promise of using LLMs for improving general-purpose test repair techniques that fix broken assertions [8].
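To illustrate such a fix pattern, here is a hypothetical broken statement and an LLM-style repair, shown with Selenium's Python bindings for brevity (the studied tests are Java Selenium tests); the page, locators, and assertion are invented for illustration only.

```python
# Hypothetical broken statement and ChatGPT-style repair; the URL,
# locators, and assertion below are invented for illustration.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")

# Before: locator broken by a UI change (the old id no longer exists).
# button = driver.find_element(By.ID, "btn-login")

# After: repaired locator plus a regenerated assertion.
button = driver.find_element(By.XPATH, "//button[text()='Log in']")
button.click()
assert driver.title == "Dashboard"  # regenerated assertion
driver.quit()
```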
Improving reliability of ChatGPT output. Prior studies have expressed concerns regarding the tendency of ChatGPT to "hallucinate" when solving specific tasks [11]. Our workflow, which checks whether the explanations provided by ChatGPT are consistent with the selected elements, aims to improve the reliability of ChatGPT-generated outputs and shows promising results for improving the matching accuracy for certain combinations (Finding 7). Although our study is limited to solving the tasks of Web UI element matching and repair, we believe that our proposed way of checking for explanation consistency is general and can be applied to improve the reliability of ChatGPT for other software maintenance tasks (e.g., test generation and automated program repair).
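A minimal sketch of this explanation-consistency idea follows; it assumes the explanation has already been parsed into attribute claims, and the dictionary-based claim format, the EC computation, and the prompt wording are simplified illustrations rather than the exact implementation.

```python
# Minimal sketch of the explanation-consistency check: verify that the
# attribute values ChatGPT's explanation claims for its chosen element
# actually hold, and otherwise ask it to reconsider. The claim format,
# EC computation, and prompt wording are illustrative assumptions.
def explanation_consistency(claims: dict, element: dict) -> float:
    # claims: attribute -> value extracted from ChatGPT's explanation.
    if not claims:
        return 0.0
    consistent = sum(element.get(attr) == val for attr, val in claims.items())
    return consistent / len(claims)

def self_correction_prompt(claims: dict, element: dict) -> str:
    wrong = [attr for attr, val in claims.items() if element.get(attr) != val]
    return ("Your explanation mentioned " + ", ".join(wrong) +
            " values that do not match the element you selected. "
            "Please re-check the candidate elements and select again.")
```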
11 THREATS TO VALIDITY
External Threats. During the ground truth construction and the evaluation of our way of calculating EC, we mitigate potential bias by first asking two annotators (two authors of the paper) to manually construct and cross-validate the "ground truth target element and patch"; the two annotators meet to resolve any disagreement during the annotation, and further discuss with the other authors until a consensus is reached. For verifiability, we also release our dataset. Due to limited resources and cost budget, our experiments use the cost-effective GPT 3.5-Turbo model, which may be less effective than newer models in solving the Web test repair task. As ChatGPT's performance may vary in different settings, our experiments may not generalize beyond the studied settings or to fixing other forms of UI tests (this paper only focuses on Java Selenium Web UI tests). We mitigate this threat by reusing settings and suggestions from prior work (e.g., referring to OpenAI's official documentation for prompt design and using temperature 0.8 as a prior paper suggested [9]), and by evaluating on several test repair tools that use different algorithms (e.g., text-based and visual-based). As our study only evaluates the test repair capability of ChatGPT, the findings may not apply to other LLMs. Nevertheless, our suggested workflow of using LLMs to perform global matching and rematching based on inconsistent explanations is still generally applicable.
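For concreteness, the settings mentioned above (GPT 3.5-Turbo, temperature 0.8) correspond to an API call of roughly the following shape; this is a sketch against the legacy openai Python package (pre-1.0 interface), and the prompt content is a placeholder rather than our actual prompt.

```python
# Sketch of the ChatGPT call with the settings described above
# (GPT 3.5-Turbo, temperature 0.8); written against the legacy
# openai Python package (<1.0). The prompt content is a placeholder.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.8,
    messages=[{"role": "user",
               "content": "<matching prompt with candidate elements>"}],
)
print(response["choices"][0]["message"]["content"])
```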
Internal Threats. Our experimental scripts may have bugs that can affect our results. To mitigate this threat, we made our scripts and results publicly available.
Conclusion Threats. Conclusion threats of our study include (1) overfitting of our dataset and (2) subjectivity of ground truth construction. To minimize (1), we manually analyze the updated UI tests to ensure that the matching and repair tasks do not overlap with the training dataset of ChatGPT. We have also manually labeled and created a ground truth dataset which can be used to support future research in Web UI test repair. We mitigate (2) by cross-validating between two annotators during ground truth construction.

12 CONCLUSIONS
This paper presents the first feasibility study that investigates the effectiveness of using prior Web UI repair techniques for initial local matching and then using ChatGPT to perform global matching, to mitigate the bias in the prioritization of attributes used by prior approaches. To mitigate hallucination in ChatGPT, we propose an explanation validator that checks the consistency of the provided explanation and provides hints to ChatGPT via a self-correction prompt to further improve its results.
Our evaluation on a widely used dataset shows that the combination with ChatGPT improves the effectiveness of existing Web test repair techniques. Our study also reveals several findings and implications. As an initial study that focuses on LLM-based Web test repair, we hope that our study can shed light on improving future Web UI test repair approaches.

REFERENCES
[1] 2022. https://platform.openai.com/docs/api-reference
[2] 2022. https://github.com/mantisbt/mantisbt
[3] Kai Briechle and Uwe D Hanebeck. 2001. Template matching using fast normalized cross correlation. In Optical Pattern Recognition XII, Vol. 4387. SPIE, 95–102.
[4] Sacha Brisset, Romain Rouvoy, Lionel Seinturier, and Renaud Pawlak. 2023. SFTM: Fast matching of web pages using Similarity-based Flexible Tree Matching. Information Systems 112 (2023), 102126.
[5] Nana Chang, Linzhang Wang, Yu Pei, Subrota K Mondal, and Xuandong Li. 2018. Change-based test script maintenance for android apps. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 215–225.
[6] Wei Chen, Hanyang Cao, and Xavier Blanc. 2021. An Improving Approach for DOM-Based Web Test Suite Repair. In Web Engineering. Springer International Publishing, 372–387.
[7] Shauvik Roy Choudhary, Dan Zhao, Husayn Versee, and Alessandro Orso. 2011. Water: Web application test repair. In Proceedings of the First International Workshop on End-to-End Test Script Engineering. 24–29.
[8] Brett Daniel, Danny Dig, Tihomir Gvero, Vilas Jagannath, Johnston Jiaa, Damion Mitchell, Jurand Nogiec, Shin Hwei Tan, and Darko Marinov. 2011. Reassert: a tool for repairing broken unit tests. In Proceedings of the 33rd International Conference on Software Engineering. 1010–1012.
[9] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1469–1481.
[10] Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. 2023. Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests. IEEE Transactions on Software Engineering 49, 4 (2023), 1912–1927. https://doi.org/10.1109/TSE.2022.3201209
[11] Robert Feldt, Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Socratest: Towards Autonomous Testing Agents via Conversational Large Language Models. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering.
[12] Sidong Feng and Chunyang Chen. 2023. Prompting Is All Your Need: Automated Android Bug Replay with Large Language Models. arXiv preprint arXiv:2306.01987 (2023).
[13] Zebao Gao, Zhenyu Chen, Yunxiao Zou, and Atif M. Memon. 2016. SITAR: GUI Test Script Repair. IEEE Transactions on Software Engineering 42, 2 (2016), 170–186. https://doi.org/10.1109/TSE.2015.2454510
[14] Mouna Hammoudi, Gregg Rothermel, and Paolo Tonella. 2016. Why do Record/Replay Tests of Web Applications Break?. In 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST). 180–190. https://doi.org/10.1109/ICST.2016.16
[15] Pete Houston. 2013. Instant jsoup How-to. Packt Publishing Ltd.
[16] Javaria Imtiaz, Muhammad Zohaib Iqbal, et al. 2021. An automated model-based approach to repair test suites of evolving web applications. Journal of Systems and Software 171 (2021), 110841.
[17] Javaria Imtiaz, Muhammad Zohaib Iqbal, et al. 2021. An automated model-based approach to repair test suites of evolving web applications. Journal of Systems and Software 171 (2021), 110841.
[18] Nour Jnoub, Wolfgang Klas, Peter Kalchgruber, and Elaheh Momeni. 2018. A Flexible Algorithmic Approach for Identifying Conflicting/Deviating Data on the Web. In 2018 International Conference on Computer, Information and Telecommunication Systems (CITS). 1–5. https://doi.org/10.1109/CITS.2018.8440185
[19] Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan Radiček. 2023. Repair Is Nearly Generation: Multilingual Program Repair with LLMs. Proceedings of the AAAI Conference on Artificial Intelligence 37, 4 (Jun. 2023), 5131–5140. https://doi.org/10.1609/aaai.v37i4.25642
[20] Hiroyuki Kirinuki, Haruto Tanno, and Katsuyuki Natsukawa. 2019. COLOR: correct locator recommender for broken test scripts using various clues in web application. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 310–320.
[21] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. 2023. CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-Trained Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE '23). IEEE Press, 919–931. https://doi.org/10.1109/ICSE48619.2023.00085
[22] Maurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo Tonella. 2015. Using multi-locators to increase the robustness of web test cases. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 1–10.
[23] Dong Li, Ruoming Jin, Jing Gao, and Zhi Liu. 2020. On Sampling Top-K Recommendation Evaluation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD '20). Association for Computing Machinery, New York, NY, USA, 2114–2124. https://doi.org/10.1145/3394486.3403262
[24] Xiao Li, Nana Chang, Yan Wang, Haohua Huang, Yu Pei, Linzhang Wang, and Xuandong Li. 2017. ATOM: Automatic Maintenance of GUI Test Scripts for Evolving Mobile Applications. In 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST). 161–171. https://doi.org/10.1109/ICST.2017.22
[25] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. DEAR: A Novel Deep Learning-based Approach for Automated Program Repair. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). 511–523. https://doi.org/10.1145/3510003.3510177
[26] Yuanzhang Lin, Guoyao Wen, and Xiang Gao. 2023. Automated Fixing of Web UI Tests via Iterative Element Matching. In 38th IEEE/ACM International Conference on Automated Software Engineering.
[27] Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1355–1367. https://doi.org/10.1109/ICSE48619.2023.00119
[28] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. Coconut: combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 101–114.
[29] M. Leotta, A. Stocco, F. Ricca, and P. Tonella. 2016. Robula+: An algorithm for generating robust XPath locators for web testing. J. Softw. Evol. Process 28 (2016), 177–204.
[30] Leonardo Mariani, Ali Mohebbi, Mauro Pezzè, and Valerio Terragni. 2021. Semantic Matching of GUI Events for Test Reuse: Are We There Yet?. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) (Virtual, Denmark). Association for Computing Machinery, New York, NY, USA, 177–190. https://doi.org/10.1145/3460319.3464827
[31] Ehsan Mashhadi and Hadi Hemmati. 2021. Applying CodeBERT for Automated Program Repair of Java Simple Bugs. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). 505–509. https://doi.org/10.1109/MSR52588.2021.00063
[32] OpenAI. 2023. Six Strategies for Getting Better Results with GPT. https://platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results Accessed: November 15, 2023.
[33] OpenAI. 2023. Strategy: Split Complex Tasks into Simpler Subtasks. https://platform.openai.com/docs/guides/prompt-engineering/strategy-split-complex-tasks-into-simpler-subtasks Accessed: November 15, 2023.
[34] OpenAI Help. 2023. ChatGPT API Transition Guide. https://help.openai.com/en/articles/7042661-chatgpt-api-transition-guide Accessed: November 15, 2023.
[35] Minxue Pan, Tongtong Xu, Yu Pei, Zhong Li, Tian Zhang, and Xuandong Li. 2020. GUI-Guided Test Script Repair for Mobile Apps. IEEE Transactions on Software Engineering (2020), 1–1. https://doi.org/10.1109/TSE.2020.3007664
[36] Yanwei Ren, Youda Gu, Zongqing Ma, Hualiang Zhu, and Fei Yin. 2022. Cross-Device Difference Detector for Mobile Application GUI Compatibility Testing. In 2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 253–260.
[37] Eric Sven Ristad and Peter N Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 5 (1998), 522–532.
[38] R. Santhosh, M. Abinaya, V. Anusuya, and D. Gowthami. 2023. ChatGPT: Opportunities, Features and Future Prospects. In 2023 7th International Conference on Trends in Electronics and Informatics (ICOEI). 1614–1622. https://doi.org/10.1109/ICOEI56765.2023.10125747
[39] Fei Shao, Rui Xu, Wasif Haque, Jingwei Xu, Ying Zhang, Wei Yang, Yanfang Ye, and Xusheng Xiao. 2021. WebEvo: taming web application evolution via detecting semantic structure changes. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 16–28.
[40] Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An Analysis of the Automatic Bug Fixing Performance of ChatGPT. In 2023 IEEE/ACM International Workshop on Automated Program Repair (APR). 23–30. https://doi.org/10.1109/APR59189.2023.00012
[41] Andrea Stocco, Rahulkrishna Yandrapally, and Ali Mesbah. 2018. Visual web test repair. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 503–514.
[42] Robert F Tate. 1954. Correlation between a discrete and a continuous variable. Point-biserial correlation. The Annals of Mathematical Statistics 25, 3 (1954), 603–607.
[43] Hung Quoc To, Nghi DQ Bui, Jin Guo, and Tien N Nguyen. 2023. Better Language Models of Code through Self-Improvement. arXiv preprint arXiv:2304.01228 (2023).
[44] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In Proceedings of the 45th International Conference on Software Engineering (ICSE 2023). Association for Computing Machinery.
[45] Tao Xiao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto. 2024. DevGPT: Studying Developer-ChatGPT Conversations. In Proceedings of the International Conference on Mining Software Repositories (MSR 2024).
[46] Tongtong Xu, Minxue Pan, Yu Pei, Guiyin Li, Xia Zeng, Tian Zhang, Yuetang Deng, and Xuandong Li. 2021. Guider: Gui structure and vision co-guided test script repair for android apps. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 191–203.
[47] Juyeon Yoon, Seungjoon Chung, Kihyuck Shin, Jinhan Kim, Shin Hong, and Shin Yoo. 2022. Repairing Fragile GUI Test Cases Using Word and Layout Embedding. In 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). 291–301. https://doi.org/10.1109/ICST53961.2022.00038
[48] Shengcheng Yu, Chunrong Fang, Yuchen Ling, Chentian Wu, and Zhenyu Chen. 2023. LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities. arXiv preprint arXiv:2309.13574 (2023).
[49] Sai Zhang, Hao Lü, and Michael D Ernst. 2013. Automatically repairing broken workflows for evolving GUI applications. In Proceedings of the 2013 International Symposium on Software Testing and Analysis. 45–55.