Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?
1 INTRODUCTION
As the Internet continues to evolve and expand, more and more websites emerge, contributing to the
diverse and ever-growing online world. As of 2024, the digital landscape comprises approximately
1.09 billion websites [2], supporting a variety of applications in people’s daily lives.
The design and development of Graphical User Interfaces (GUIs) are vital for creating a website. A
well-designed GUI not only enhances the website’s visual attractiveness but also improves usability
and user satisfaction. In this process, GUI design involves shaping the website's aesthetics [1],
such as layout, colors, and typography [9, 29]. In contrast, GUI development is about implementing
that aesthetic through programming languages. Nevertheless, such conversion is a complex and
time-consuming task. Developers must manually map visual elements to their corresponding
implementation details, which can lead to errors and discrepancies between the original design
and the final look [9, 30, 39, 40, 59].
To allow developers to transform design diagrams into functional GUI code more easily, several
automated GUI code generation methods have been proposed, which can be further categorized
Fig. 1. Interaction example and interactive element ratio of different types of webpages: (a) example of interactive elements (from https://www.fun.com/adult-cakeworthy-never-land-denim-jacket.html); (b) ratio of interactive and static elements; (c) ratio of implemented vs. unimplemented interactive elements for GPT-4o.
into two types: learning-based and LLM-based approaches. Learning-based methods, such as Pix2code [8], design models based on CNNs and LSTMs to generate user interface code by reverse-engineering a single GUI image input. Chen et al. [10] present a neural machine translator that extracts visual features from UI images, encodes their spatial layouts, and generates GUI skeletons in a unified neural network framework. However, these deep-learning-based methods exhibit compromised performance and fail to generalize to diverse webpage elements due to the limited knowledge learned from their training samples. Recently, incorporating visual
information into Large Language Models (LLMs) has led to the development of Multimodal Large
Language Models (MLLMs) [3, 12, 31, 50, 56]. Leading models in this domain, such as GPT-4o [43],
Claude-3.5 [4], and Gemini-1.5 [23], have achieved excellent performance in visual understanding
tasks [16, 55]. Furthermore, research has shown that LLMs have remarkable performance on various
code intelligence tasks [26], including code generation [19, 20, 22, 28, 32, 57], code completion [13,
17, 18, 33, 34, 42], and code summarization [6, 11, 21, 24, 36, 37]. These advances create new
opportunities for the Design-to-Code task, i.e., generating code from screenshots to replicate web
page elements, layout, text, and colors. For example, Design2Code [49] designs three types of
prompts to stimulate MLLMs' web content understanding and self-refinement capabilities for GUI
code generation. DCGen [51] proposes a divide-and-conquer-based approach to prompt MLLMs to
generate webpage elements more accurately.
Despite continuous efforts to improve model capability, the evaluation scope of existing work is restricted to static pages. More specifically, existing research [25, 49, 58] only focuses on the static appearance of the webpage (e.g., color, layouts), ignoring the dynamic interactive properties and functionality of elements, such as the size selection list and quantity adjustment button shown in Fig. 1(a), and other designs for user engagement. Additionally, we observe that such interactive elements account for a large proportion of the webpage in real-world software practice. We randomly select 10 real-world webpages on different topics to analyze the ratio of interactive elements; the results in Fig. 1(b) indicate that interactive elements account for more than 50% of elements in these cases.
Then we utilize GPT-4o [43] to generate the GUI code containing interactive elements. As shown
in Fig. 1(c), fewer than 15% of interactive elements are correctly implemented, highlighting the
current limitations in handling webpage interactive design.
Static webpages inherently limit user interaction with web elements, hindering access to new
content (such as browsing images via carousel buttons) or impeding task completion (like selecting
clothing sizes from drop-down menus), thereby impairing user experience. In this context, evalu-
ations of static pages become inadequate for real-world webpage deployments, where dynamic
elements are prevalent. Therefore, we argue that a benchmark for webpages that includes inter-
active elements is essential to enhance the practicality, usability, and user engagement of studies on
auto-generated GUI code. In this paper, we emphasize the importance of webpage interactions by
Table 1. Summary of key findings.

Limitations:
• MLLMs exhibit limited performance in reproducing fine-grained interaction features, such as structure, text, and position (Finding 1).
• Performance depends on the type of interaction: MLLMs excel at handling interactions with fixed patterns (e.g., selection lists) and clear changes (e.g., new window creation), while struggling with interactions that involve complex changes (e.g., iframe, progress) and subtle visual modifications (Finding 3).
Failure Types:
• The predominant failures are "No interaction", "Partial implementation", "Interactive element missing", and "Wrong position after interaction".
• The most critical failures include "Interactive element missing", "Effect on wrong element", "Wrong function", and "No interaction" (Finding 4).
Key Factors:
• ★ Well-designed prompts are effective: Chain-of-Thought enables step-by-step interaction analysis, while marking interaction areas provides essential visual signals. Both approaches improve the quality of generated interactions (Finding 2).
• ★ Enhanced visual saliency significantly improves interaction generation, particularly in complicated cases (Finding 5).
• ★ Supplementary textual descriptions substantially boost MLLMs' interaction generation capabilities (Finding 6).
investigating the following question: to what extent can MLLMs produce interaction code based on the visual design?
To this end, we provide a systematic analysis of MLLMs’ capability in reproducing dynamic
interactions on webpages. Specifically, we first define the Interaction-to-Code task, i.e., generating
code from a series of screenshots representing webpage interactions to replicate interactive elements.
Then we build the Interaction2Code benchmark that encompasses a diverse array of webpages
and interactions. It comprises 97 unique web pages and 213 distinct interactions, spanning 15
webpage types and 30 interaction categories. By curating a wide range of interaction types, we offer
a representative and diverse evaluation dataset for assessing the capabilities of MLLMs in producing
dynamic webpages in a more realistic scenario. We mainly investigate the following six research
questions (RQs):
• RQ1: How do different MLLMs perform in Interaction-to-Code task under different prompts?
• RQ2: How do humans evaluate the usability of interactions generated by MLLMs?
• RQ3: How do MLLMs perform in code generation across different interaction scenarios?
• RQ4: What types of mistakes do MLLMs make in generating interactions?
• RQ5: How does visual saliency influence the quality of generated interactions?
• RQ6: Which representation modality, visual signals or textual descriptions, better enhances MLLMs' ability to generate interaction code?
To address RQ1, we design three distinct prompt types: direct prompts, Chain-of-Thought
prompts, and Mark prompts (which mark the interaction areas) to evaluate the performance of
three state-of-the-art MLLMs under varying prompt conditions. For RQ2, we conduct user studies
where participants interact with the generated webpages to assess the usability. In RQ3, we analyze
MLLMs’ performance across different interaction scenarios by calculating usability rates for various
interaction types. To answer RQ4, we invite human annotators to categorize and discuss webpage
generation failures, followed by data analysis to reveal the most prevalent error patterns and
their severity. For RQ5, we evaluate the generated interactions across varying saliency levels to
<body>
  <h1>Front-End Development</h1>
  <p id="myParagraph">This is an example</p>
  <button id="myButton" draggable="true">Click Me</button>
  <script>
    const button = document.getElementById('myButton');
    const paragraph = document.getElementById('myParagraph');
    button.addEventListener('click', function() {
      paragraph.textContent = 'Button clicked!';
    });
  </script>
</body>
</html>

Fig. 2. Example front-end code of a simple interactive webpage.
investigate their impact on interaction generation performance. Finally, for RQ6, we examine
the influence of interaction representation modality by comparing three input configurations:
visual-only, textual description-only, and combined visual-textual.
Based on our experimental results, we present six key findings, shown in Table 1, including
the limitations of MLLMs, failure types and key factors for enhancing interaction generation
performance. Our contributions are summarized as follows:
• Task formulation. To the best of our knowledge, this is the first study to formulate the
Interaction-to-Code task and present a systematic study on the code generation capabilities
of MLLMs for dynamic webpage interactions.
• Benchmark. We build the first real-world webpage interaction dataset, Interaction2Code, con-
taining 97 webpages and 213 interactions, spanning 15 webpage topics and 30 interaction
categories.
• Key Findings. Our in-depth analysis reveals the limitations of MLLMs, identifies 10 representative failure types and their underlying causes, and provides key factors for enhancing performance on the Interaction-to-Code task. These key findings offer valuable implications for researchers
and developers engaged in automated front-end development.
2 BACKGROUND
2.1 Basic Knowledge about Front-end Development
Front-end development focuses on what users see and interact with in their web browsers. Visual
design and interactive implementation are two key parts of creating visually appealing and user-
friendly interfaces. The primary technologies used in front-end development are Hypertext Markup
Language (HTML), Cascading Style Sheets (CSS), and JavaScript.
2.1.1 HTML. HTML (HyperText Markup Language) is a markup language used to create web page
content. It defines the structure and content of a web page through tags, such as titles, paragraphs,
and buttons, as shown in Fig. 2; each HTML element includes an opening tag, content, and a closing tag, forming the basic building block of a webpage. HTML does not support complex interactions, but some specific elements (e.g., form, button) and attributes can be used to implement basic interactive functions. For example, the HTML code in Fig. 2 sets the "draggable" attribute to true on the button tag to allow the user to drag the button.
2.1.2 CSS. CSS (Cascading Style Sheets) is a style sheet language used to describe the style of
HTML documents. It allows web developers to control the layout, fonts, colors, spacing, and other
visual effects of the page. CSS can achieve interactive effects through pseudo-classes, pseudo-
elements, transitions, and animations. For example, the CSS program between the style tags in Fig. 2 leverages the ":hover" pseudo-class to add an interaction to the button: the button's color changes from green to blue once the mouse hovers over it. The transition ("transition: background-color 0.5s") smoothly changes the color of the button over 0.5 seconds to create an animation effect.
2.1.3 JavaScript. JavaScript is a high-level, dynamic, and versatile programming language that
is primarily used for adding interactivity and dynamic behavior to websites. JavaScript enables
developers to create rich, interactive user experiences, manipulate the Document Object Model
(DOM), and handle events. For example, Fig. 2 shows that the JavaScript program between the script tags adds an event listener to the button; once the button is clicked, the text content of the paragraph is changed to "Button clicked!".
In summary, front-end webpage interactions come from HTML tags and attributes, style changes implemented in CSS, and custom events implemented in JavaScript.
3 PROBLEM DEFINITION
To describe the interactions within a webpage, we define the Webpage Interaction Graph (WIG):
WIG = {N, E},   (1)

where N = {S_0, S_1, ..., S_n} is a finite set of nodes representing screenshots of the webpage, and E = {I_0, I_1, ..., I_m} represents a series of interaction events that connect different screenshots with directed edges, indicating transitions caused by user interactions. We number the interaction events and map them to the corresponding screenshots: the first screenshot represents the original webpage, the second screenshot represents the result after interaction 1, the third screenshot represents the result after interaction 2, and so on. Let C_o denote the original webpage file, including HTML, CSS, and JavaScript code; S_0^o denote the screenshot of the original webpage; S_n^o denote the webpage screenshot after interaction I_n; and G_o denote the webpage interaction graph of the original webpage. To achieve the Interaction-to-Code task, a model M takes G_o as input and outputs the generated code file C_g = M(G_o), which implements both the visual design and the interactions of the original webpage C_o. Fig. 3 illustrates an example of the Interaction-to-Code task.
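For illustration only (this encoding is our own and not part of the paper's artifacts), a WIG with two interactions branching from the original screenshot could be represented as a plain JavaScript object:

// Hypothetical encoding of a Webpage Interaction Graph (WIG).
// Nodes are screenshots; edges are interaction events connecting them.
const wig = {
  nodes: ["s0.png", "s1.png", "s2.png"], // S_0: original page; S_1, S_2: states after interactions
  edges: [
    { id: 1, from: 0, to: 1, event: "click size-select" },  // interaction 1: S_0 -> S_1
    { id: 2, from: 0, to: 2, event: "click quantity-plus" } // interaction 2: S_0 -> S_2
  ]
};

A model consuming such a graph receives the node screenshots in order, matching the first/second/third screenshot convention above.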
Table 2. Quantitative metrics.

                 Min    Max      Average   Std
Length (tokens)  2457   726,317  141,084   160,438
Tag Count        34     12,694   1,291     1,574
DOM Depth        6      37       18        6
Unique Tags      8      58       31        9
Total size: 97

Fig. 4. Topic distribution (topics include shop, blog, business, news, technology, video, book, product, homepage, sport, hotel, encyclopedia, form, food, study, and other).
Interaction Type Distribution. To get a sense of the range of interaction types covered in
our benchmark, we manually annotate the type of each interaction from the element tag and the visual effect perspectives. Tag categories come from HTML tags such as button, image, and link. Buttons, input boxes, and links are the most frequent types and play a major role in human-website interaction. Visual categories involve changes in color, size, position, text, etc.; the explanations are as follows:
• New component: new elements are generated after an interaction. For example, as shown in Fig. 7(c), two new input elements are generated after selecting the third choice.
• Text: the text changes after an interaction. As shown in Fig. 8(i), after clicking the "Select" button, the text on it changes to "Selected".
• Color: the color changes after an interaction. For example, the background color changes from white to dark after clicking the dark label, as illustrated in Fig. 8(c).
• New window: a new window is generated after the interaction, such as a form popping up after clicking the contact button, as shown in Fig. 8(f).
• Position: the position of the element changes after the interaction. For example, on a text editing website, clicking the right button can move the text from the left to the right.
• Size: the size of the element changes after the interaction. For example, the text size increases after clicking the large label, as shown in Fig. 8(h).
• Switch: the content is switched. For example, in Fig. 7(b), after clicking the "M" button, the clothes parameter is switched from "S" to "M".
Note that one interaction may belong to multiple tag categories and visual categories. Table 3
demonstrates that the Interaction2Code benchmark has a rich set of interaction types, including 23 tag
categories and 7 visual categories.
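To make these categories concrete, the following self-contained page (our own illustration, not an item from the benchmark) implements a "size" interaction of the kind MLLMs are asked to reproduce; it simultaneously falls under the button tag category, illustrating the multi-category membership noted above:

<!DOCTYPE html>
<html>
<body>
  <p id="sample-text">Adjustable text.</p>
  <button id="large-label">Large</button>
  <script>
    // Clicking the "Large" label enlarges the paragraph text (visual category: size).
    document.getElementById('large-label').addEventListener('click', function () {
      document.getElementById('sample-text').style.fontSize = '2em';
    });
  </script>
</body>
</html>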
5 STUDY SETUP
5.1 Evaluation Models
We employ three state-of-the-art (SOTA) MLLMs: Gemini 1.5 [23], GPT-4o [43] and Claude-3.5
[4] to evaluate their performance on the Interaction-to-Code task. The specific model versions are 20240806 for GPT-4o, 20240620 for Claude-3.5-Sonnet, and Gemini-1.5-flash-latest, accessed during October 2024. In configuring the MLLMs, we set the temperature to 1 and the maximum number of output tokens to 4096 for all three models. All other parameters were kept at their
default settings as outlined in the relevant API documentation [5, 23, 44].
5.2 Metrics
We employ both full-webpage metrics and interactive-part metrics to judge the capability of MLLMs in the Interaction-to-Code task. We measure the quality of webpages generated by MLLMs from the visual, structural, and textual perspectives:
• Visual Similarity. We use the CLIP score [46] to measure visual similarity. This metric measures
the semantic similarity between the generated and original webpages, serving as an indicator
of how effectively the generated GUI captures the intended visual elements and overall design
concept.
• Structure Similarity. SSIM [52] (Structural Similarity Index Measure) score is applied to calcu-
late the structure similarity. It evaluates the layout and compositional accuracy, emphasizing the
spatial arrangement and structural similarities between the generated and original webpages.
• Text Similarity. We first use Python OCR tools to recognize the text in the original and the
generated webpages, and then use the Bilingual Evaluation Understudy (BLEU) score [45] to
measure the text similarity between the two web pages.
For the interactive parts of webpages, in addition to the above visual, structure and text similarity,
we also evaluate them from the perspective of the position and function of the interaction.
• Position Similarity. The position similarity between the original interaction I_o and the generated interaction I_g is defined as follows:

Pos_sim(I_o, I_g) = 1 - max(|x_o - x_g|, |y_o - y_g|),   (2)

where (x_o, y_o) and (x_g, y_g) are the normalized coordinates (in [0, 1]) of the centers of the interactive areas.
• Function Usability. This metric measures whether the interactive function is usable: human annotators are asked to interact with the generated webpage and judge its usability. Let N(·) denote the quantity; we can then calculate the Usability Rate (UR), for which a computation sketch is given after this list:

UR = N(usable) / (N(usable) + N(unusable)).   (3)
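A minimal computation sketch of the two interaction-level measures defined above, assuming the normalized center coordinates and the human judgments have already been collected:

// Position similarity between the original and generated interactive areas,
// given normalized centers (x, y) in [0, 1].
function positionSimilarity(orig, gen) {
  return 1 - Math.max(Math.abs(orig.x - gen.x), Math.abs(orig.y - gen.y));
}

// Usability rate over a list of human judgments, each either "usable" or "unusable".
function usabilityRate(judgments) {
  const usable = judgments.filter(j => j === "usable").length;
  return usable / judgments.length;
}

// Example: positionSimilarity({ x: 0.50, y: 0.30 }, { x: 0.55, y: 0.28 }) ≈ 0.95
// Example: usabilityRate(["usable", "unusable", "usable", "usable"]) === 0.75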
5.3 Prompt Design
We design three types of prompt methods: direct prompt, chain-of-thought prompt, and mark
prompt, as shown in Fig. 5. In the direct prompt, the first screenshot represents the original webpage
state, while subsequent screenshots depict states after specific interactions. Requirements are applied
to guide MLLMs in replicating the webpage design and interaction. In particular, requirement 3
involves letting MLLMs number interactive elements to allow direct identification by ID, enabling
automated interaction and screenshot capture for generated webpages. For the Chain-of-Thought
(CoT) prompt [53], we use the instruction “let’s think step by step” and design three intermediate
steps: analyze the interaction effects, locate the interactive elements, and implement the interaction.
For the Mark prompt, we use red bounding boxes to highlight the areas of interaction, prompting
MLLMs to focus on the interactive parts.
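Requirement 3 (numbering interactive elements) is what enables this automation: with predictable ids, a headless browser can trigger each interaction and capture the resulting screenshot. The paper does not name its tooling; the sketch below uses Puppeteer as one hypothetical driver, with an id scheme (interact-1, interact-2, ...) following the requirement's description:

// Hypothetical driver: load a generated page, trigger the n-th numbered interaction,
// and capture a screenshot of the resulting state.
const puppeteer = require('puppeteer');

async function captureInteraction(pageUrl, n, outPath) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'networkidle0' });
  await page.click(`#interact-${n}`); // element numbered by the model per requirement 3
  await page.screenshot({ path: outPath, fullPage: true });
  await browser.close();
}

// captureInteraction('http://localhost:8000/generated.html', 1, 's1_generated.png');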
6 EXPERIMENTS
In this work, we conduct experiments to answer the following questions:
• RQ1: How do different MLLMs perform in Interaction-to-Code task under different prompts?
• RQ2: How do humans evaluate the usability of interactions generated by MLLMs?
• RQ3: How do MLLMs perform in code generation across different interaction scenarios?
• RQ4: What types of mistakes do MLLMs make in generating interactions?
• RQ5: How does visual saliency influence the quality of generated interactions?
Direct Prompt
[Instruction]:
You are a web developer proficient in HTML, CSS and JavaScript. The user provides some screenshots of a webpage. The first screenshot [image1]
shows the webpage in its original state, while others [image2, image3,…] show the webpage after the user has interacted with certain elements.
You are tasked with creating a webpage that replicates the design and interaction observed in screenshots.
[Requirements]:
1. Design Replication: Pay attention to layout, color and so on to make the webpage look identical to the first screenshot .
2. Interaction Replication : Implement the changes shown in screenshots caused by interactions (e.g., clicks).
3. Number Interactions: You need to number interactive elements from interact-1 to interact-n, interact-1 corresponds to the interaction presented in
the second screenshot, and interact-2 corresponds to the interaction presented in the third screenshot, and so on. For example, if the button is
clicked in the second screenshot, the id of the button is set to interact-1: "<button id="interact1">Click Me!</button>"
…
Combine HTML, CSS and JavaScript codes into one file and respond the codes only:
• RQ6: Which representation modality, visual signals or textual descriptions, better enhances MLLMs' ability to generate interaction code?
6.1 RQ1: How do different MLLMs perform in Interaction-to-Code task under different
prompts?
We present the results of three leading MLLMs under three different prompts in Table 4, where bold values indicate the best performance and underlined values indicate the second-best performance. First, we can make the following observations about MLLMs under direct prompting:
(1) Generation of interactive elements presents greater challenges than static full web-
page generation. Table 4 shows that the performance metrics for interactive components are
notably lower than those for complete webpages under direct prompts. Regarding visual similarity,
MLLMs attain approximately 0.73-0.78 for full pages, compared to 0.71-0.76 for interactive elements.
Structure similarity shows a more pronounced disparity, with MLLMs achieving 0.6-0.78 for full
pages but only 0.4-0.56 for interactive components. Similarly, text similarity scores reach about
0.65 for full pages, contrasting with approximately 0.5 for interactive elements.
(2) MLLMs demonstrate limitations in accurately reproducing fine-grained features
of interaction. The performance of MLLMs on fine-grained metrics (such as structure, text, and
position similarity) is notably weaker compared to their performance on coarse-grained metrics
like CLIP score. As illustrated in Table 4, for the interaction part, the CLIP similarity exceeds 0.7,
whereas text similarity hovers around 0.5, position similarity approximates 0.45-0.62, and structure
similarity ranges between 0.4 and 0.5.
(3) Claude-3.5 outperforms GPT-4o and Gemini-1.5 in the Interaction-to-Code task.
Experimental results under direct prompting reveal a consistent performance ranking, with Claude-3.5
leading, followed by GPT-4o, and Gemini-1.5 showing the lowest performance.
To improve interaction performance, we further propose the CoT and Mark prompts to force models to focus on the interaction part, resulting in the following observations:
(4) Both CoT and Mark prompts enhance model performance compared to the direct prompt, and the Mark prompt demonstrates superior performance compared to the CoT prompt. GPT-4o's interaction-part metrics (CLIP, SSIM, text, position) improve from the direct prompting
scores (0.7328, 0.4221, 0.4848, 0.6053) to (0.7212, 0.4556, 0.4902, 0.6079) with CoT, and further to
(0.7454, 0.5583, 0.5241, 0.6123) with Mark prompting. However, both prompting methods slightly
decrease full-page metrics, likely due to their focused emphasis on interactive elements rather than
overall page composition.
6.2 RQ2: How do humans evaluate the usability of interactions generated by MLLMs?
Although the above metrics measure the generation quality of interactions from different perspectives, the functional evaluation of interactions still requires human assessment.
Pairwise Model Comparison Setting. We ask three human annotators to rank a pair of
generated interactions (one from the baseline, the other from the tested methods) to decide which
one implements the reference interaction function better. We use Gemini-1.5 with the direct prompt as the baseline and collect the other eight methods' Win/Tie/Lose rates against this baseline. The results are shown in Fig. 6(a); a higher win rate and lower loss rate suggest better quality as judged
by human annotators.
Functionality Evaluation Setting. We also ask the three annotators to evaluate the functionality
(i.e., usability) of each generated interaction. If the interactive function is consistent with the ground truth, it is regarded as usable; otherwise, it is unusable. We calculate the usability rate of different schemes; the results are shown in Fig. 6(b).
Fig. 6. Human evaluation: a higher win rate indicates better quality and a higher usability rate indicates better functionality.
Table 5. Usability rates across tag categories.

Model   Prompt  button  input   link    iframe  textarea  option  select  form    progress
Gemini  Direct  0.5395  0.5172  0.4583  0.3750  0.5238    0.6667  0.7000  0.6667  0.2857
Gemini  CoT     0.5682  0.6176  0.4167  0.6250  0.6296    0.8125  0.6111  0.8750  0.4545
Gemini  Mark    0.6111  0.6750  0.5333  0.5000  0.5357    0.6875  0.7500  0.8000  0.7273
GPT     Direct  0.6742  0.8485  0.5556  0.7222  0.8571    0.8889  0.8889  0.9091  0.4000
GPT     CoT     0.6941  0.7857  0.6667  0.5000  0.7143    0.9375  0.8421  0.9000  0.2727
GPT     Mark    0.8316  0.8000  0.8276  0.7778  0.8519    0.9500  0.8947  0.8750  0.7000
Claude  Direct  0.6857  0.7750  0.8485  0.6111  0.7407    0.8235  0.9333  0.9167  0.6000
Claude  CoT     0.7071  0.8205  0.6296  0.4444  0.7586    0.9048  0.9474  1.0000  0.3636
Claude  Mark    0.8788  0.9024  0.8667  0.7368  1.0000    0.9412  0.8750  1.0000  0.5833
Average         0.6878  0.7491  0.6448  0.5880  0.7346    0.8458  0.8269  0.8825  0.4875
Results. First, our human evaluation reveals that Claude-3.5 consistently demonstrates superior
performance compared to other baseline models. Second, both CoT and Mark prompting strategies
can enhance model performance beyond direct prompting, showing higher win rates and usability
rates across most models (except Claude’s CoT prompt). Third, Mark prompting yields the most
significant improvements in usability, with GPT-4o showing 10% and 12% increases compared to
Direct and CoT prompts, respectively (Fig. 6(b)). Notably, GPT-4o with Mark prompting outperforms
Claude under both Direct and CoT conditions, highlighting the importance of visual attention. Last
but not least, these human evaluation results align with Finding 2, validating that our automatic
evaluation metrics are reasonable.
6.3 RQ3: How do MLLMs perform in code generation across different interaction
scenarios?
In this section, we study the performance of MLLMs on the Interaction-to-Code task under different
interaction types. The results of varying tag categories with high frequency and visual categories
are shown in Table 5 and Table 6, respectively.
For tag categories, form, select, and option are the easiest interaction types to generate,
achieving a usability rate higher than 80%. This is because these interaction scenarios always contain fixed patterns; for example, select and option only appear in drop-down lists, and
Table 6. Usability rates across visual categories.

Model   Prompt  new component  text    color   new window  position  size    switch
Gemini  Direct  0.5893         0.5246  0.4103  0.5000      0.5000    0.6000  0.5000
Gemini  CoT     0.6119         0.5231  0.4186  0.8125      0.4615    0.5000  0.4000
Gemini  Mark    0.5758         0.6719  0.4894  0.7647      0.7143    0.7143  0.6000
GPT     Direct  0.7164         0.7353  0.5208  0.9048      0.7500    0.6154  0.5000
GPT     CoT     0.7538         0.8060  0.5909  0.8500      0.5625    0.8750  0.4000
GPT     Mark    0.8493         0.9054  0.7907  0.8889      0.7895    0.9000  0.9000
Claude  Direct  0.7333         0.7639  0.7111  0.7917      0.7333    0.7857  0.5000
Claude  CoT     0.8205         0.8194  0.5918  0.7619      0.5000    0.6364  0.7143
Claude  Mark    0.9178         0.9189  0.8333  1.0000      0.8235    0.8182  0.7500
Average         0.7298         0.7409  0.5952  0.8083      0.6483    0.7161  0.5849
Table 7. Failure types, their influence on content, function, and user experience, and their usability rates.

Failure Object: Interactive element                    Usability Rate
  (a) Interactive element missing                      0%
  (b) No interaction                                   6.93%
  (c) Wrong interactive element                        91.96%
  (d) Wrong type of interactive element                88.89%
  (e) Wrong position of interactive element            97.83%
Failure Object: Interaction effects
  (f) Wrong position after interaction                 93.81%
  (g) Wrong type of interaction effects                55.88%
  (h) Effect on wrong element                          0%
  (i) Partial implementation                           75.29%
  (j) Wrong function                                   0%
form often merely contains input boxes. In contrast, iframe and progress elements show lower
usability rates (<60%), attributed to their complexity: iframes involve embedding external content,
while progress bars require intricate component coordination for functions like audio control or
price range adjustment, posing difficulties for MLLMs to understand.
For visual categories, MLLMs excel at generating interactions that result in prominent visual
changes, such as creating new windows and new components. However, they struggle with subtle
visual modifications, such as color shifts and positional adjustments, indicating their limitations in
handling fine-grained interaction effects.
Finding 3: Performance varies by interaction type: MLLMs are good at handling interactions with fixed patterns (e.g., selection lists) and obvious changes (e.g., new window creation),
while struggling with interactions involving complex changes (e.g., iframe, progress) and
subtle visual modifications (e.g., position change).
and refine the failure type until everyone reaches a consensus. Finally, we manually annotate the
failure types of all interactions and calculate the Usability Rate (UR) based on the human evaluation
results of RQ2. Table 7 shows the failure types and their influence; it contains 10 types of failure. Ten representative failure examples are shown in Fig. 7 and Fig. 8, where the first row shows the reference interaction and the second row shows the interaction generated by MLLMs.
Failure reason analysis. Failures (a), (c), (e), and (f) stem from MLLMs’ limitations in element
localization. Failures (d) and (g) are caused by MLLMs’ misidentification of element types. Failures
(b), (h), (i), and (j) arise from MLLMs’ misunderstanding of interaction.
Based on the failure distribution in Fig. 9, we find that the main failure modes include “No
interaction”, “Partial implementation”, “Interactive element missing”, and “Wrong posi-
tion after interaction”. Model-specific analysis reveals distinct patterns: Gemini-1.5’s failures are
dominated by “No interaction” and “Partial implementation” (>50%), while GPT-4o mainly faces
issues with “Interactive element missing” and “No interaction” (>20%). Claude-3.5’s challenges are
primarily in “No interaction” and “Wrong position after interaction” (>20%). These failures stem
from two key issues: MLLMs’ inadequate interaction comprehension leading to “No interaction”
and “Partial implementation”, and MLLMs' imprecise localization of elements and interaction effects, which results in “Interactive element missing” and “Wrong position after interaction”.
Besides, the most serious failures are “Interactive element missing”, “Effect on wrong element”, “Wrong function”, and “No interaction”. The severity of a failure depends on its usability rate (UR), with a higher UR meaning lower severity and a lower UR meaning higher severity. As illustrated in Table 7, failures (a), (b), (h), and (j) exhibit URs lower than 10%, rendering the generated interactions completely ineffective.
6.5 RQ5: How does visual saliency influence the quality of generated interactions?
The visual perception limitations of MLLMs affect their performance on visual understanding tasks,
especially when facing small low-resolution objects [60]. In this section, we examine the impact
of the interaction area ratio (i.e., visual saliency) on generation outcomes. Let I denote an interaction and S_I denote the screenshot of the webpage after interaction I; we define the visual saliency (VS) of the interaction as follows:

VS(I) = area(I) / area(S_I),   (4)

where area(·) calculates the size (in pixels) of a component. A higher VS score indicates a larger area influenced by the interaction and, consequently, a higher visual saliency.
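Given pixel bounding boxes, VS reduces to a ratio of areas; a minimal sketch (assuming the interaction-affected region has already been localized as a rectangle):

// Visual saliency: area of the interaction-affected region over the full screenshot area.
function visualSaliency(interactionBox, screenshotSize) {
  const interactionArea = interactionBox.width * interactionBox.height;
  const screenshotArea = screenshotSize.width * screenshotSize.height;
  return interactionArea / screenshotArea;
}

// Example: a 300x200 px region on a 1920x1080 screenshot gives VS ≈ 0.029.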
We first calculate the visual saliency for all interactions and plot the distribution, as shown in
Figure 11. We then divide the samples into five groups based on the distribution results, keeping the
number of samples in each group roughly balanced. The VS ranges for the five groups are as follows:
[0, 0.025), [0.025, 0.05), [0.05, 0.1), [0.1, 0.2), [0.2, 1). Figure 10 shows the box plot distribution of
metrics for Gemini-1.5 across these five groups, allowing us to draw the first observation:
(1) Groups with higher visual saliency have higher SSIM and position similarity. Although the CLIP and text similarities fluctuate among different groups, as shown in Fig. 10(a), Fig. 10(b) shows that the SSIM and position similarity increase significantly as the visual saliency increases: the group [0.2, 1) shows the highest metrics, while the group [0, 0.025) shows the lowest. This demonstrates that MLLMs are more likely to capture structural and positional features for samples with high visual saliency.
Fig. 10. Interaction part metrics distribution of different groups of Gemini-1.5 under the direct prompt.
We then randomly sample 10 webpages from failure cases and crop the screenshots to increase
the visual saliency of the interactions in the webpages (for example, if the webpage is cropped to 1/2
of the original, the visual saliency of the interaction will be doubled). Fig. 12 shows the relationship between the magnification factor and the metrics of the generation results. We observe that:
(2) Enhanced visual saliency facilitates effective generation. When the magnification factor is set to 1, all evaluation metrics yield values of 0, indicating unsuccessful interaction generation. Upon increasing VS by 1.2 times, the model is able to reproduce interactions, but
with relatively low metric scores. As the magnification factor increases from 1.2 to 3, we observe
substantial improvements in performance metrics: the CLIP and SSIM similarities approach 0.8,
while text and position similarities reach approximately 0.6. This suggests that models are effectively
overcoming the original failure cases.
Fig. 11. Visual saliency distribution.

Fig. 12. Metrics under different magnification.
Finding 5: Visual saliency affects the MLLMs’ performance on interaction generation, and
enhancing visual saliency can lead to more accurate code generation.
Table 8. Results of Gemini-1.5 and GPT-4o under different input modalities (V: visual only, T: textual description only, V+T: combined visual-textual input).

Gemini-1.5
Prompt  Modality  CLIP    SSIM    Text    Position
Direct  V         0.3338  0.1587  0.2777  0.3342
Direct  T         0.3116  0.1550  0.1687  0.3999
Direct  V+T       0.5679  0.3010  0.2732  0.5964
CoT     V         0.4357  0.1975  0.3072  0.4303
CoT     T         0.3677  0.0897  0.2290  0.4403
CoT     V+T       0.5503  0.4027  0.3558  0.5656
Mark    V         0.4502  0.3256  0.2197  0.4302
Mark    T         0.5019  0.2478  0.2921  0.5301
Mark    V+T       0.5946  0.4327  0.3416  0.4791

GPT-4o
Prompt  Modality  CLIP    SSIM    Text    Position
Direct  V         0.3737  0.1793  0.2539  0.3951
Direct  T         0.4174  0.4067  0.2316  0.4293
Direct  V+T       0.6735  0.5612  0.3919  0.7157
CoT     V         0.3871  0.3101  0.2433  0.4461
CoT     T         0.5579  0.1828  0.3045  0.5465
CoT     V+T       0.6440  0.4800  0.4287  0.7080
Mark    V         0.5015  0.4520  0.3389  0.5025
Mark    T         0.4613  0.4454  0.2805  0.4810
Mark    V+T       0.6923  0.4336  0.4248  0.7469
6.6 RQ6: Which representation modality, visual signals or textual descriptions, better enhances MLLMs' ability to generate interaction code?
To answer RQ6, we compare three input configurations: visual-only (V), textual description-only (T), and combined visual-textual input (V+T). Table 8 presents the results, with bold values
indicating the best performance and underlined values showing the second-best performance. We
can make the following observations:
(1) Integrating both visual and textual descriptions enables MLLMs to achieve optimal
performance on the Interaction-to-Code task. It is challenging to determine whether visual-
only or text description-only inputs are superior based on Table 8, as there are instances where
“V” is better and others where “T” excels. However, the combined approach (V+T) consistently
outperforms single-modality inputs in most scenarios across all three prompt types. The result
suggests a complementary relationship between visual and textual inputs, underscoring the benefits
of integrating both modalities for advanced performance.
(2) Supplementary text descriptions can bridge the performance gap across different
model capabilities and prompt strategies. Under direct prompting, Gemini-1.5 with combined
visual and textual inputs (V+T) demonstrates superior performance compared to GPT-4o using
either visual (V) or textual (T) inputs alone. Furthermore, Gemini-1.5’s performance with combined
inputs under direct prompting surpasses its own performance with visual-only input, even when
enhanced by Chain-of-Thought (CoT) or Mark prompting strategies.
Finding 6: The incorporation of visual and textual inputs considerably enhances MLLMs’
capability to generate interactions. With textual descriptions, even a weaker model can
achieve comparable performance to those of superior models without textual descriptions.
7 DISCUSSION
Implications for Researchers. The findings of our study shed light on the following future directions
to improve the quality of MLLM-generated UI code in practice.
• Enhancing MLLMs’ recognition of fine-grained webpage features. As noticed in Finding 1,
MLLMs often struggle to reproduce details of interactions, such as position, text, and structure.
Therefore, it is essential to explore strategies to improve the model’s sensitivity on these fine-
grained features.
• Correcting errors in MLLM-generated code. In RQ4, we outline common mistakes when
MLLMs generate interactive components. Developing automated methods to identify failure
types and fix errors is crucial in reproducing reliable and usable webpages.
• Enhancing the MLLM’s grounding of GUI elements and its understanding of interactions.
In RQ4, we find that existing failures arise from the inability of MLLMs to accurately locate interactive elements, understand their functionalities, and comprehend the interactions.
Therefore, it is essential to enhance the capabilities of MLLMs in this area. Alternatively, a GUI
interactive element recognition model and an interactive analysis model could be implemented
prior to MLLM input to address these limitations.
Implications for Developers. Based on our findings, we propose the following practical guidelines
for developers leveraging MLLMs in automated front-end development:
• Apply visual markers for interactive elements. Derived from Finding 2, incorporating
mark prompts with red bounding boxes significantly enhances MLLMs’ ability to generate
accurate interactions. These visual markers enable MLLMs to precisely identify both interactive
elements and their effect areas.
• Optimize interactive element visibility. Finding 5 indicates that enhanced visual saliency leads to more effective interaction generation. We recommend increasing the visual saliency of the interaction by cropping the image, or even inputting only the interactive area, to generate the code for the interaction part first, followed by integrating the generated code into the main webpage code (see the sketch after this list).
• Provide comprehensive interaction descriptions. As evidenced by Finding 6, detailed textual
descriptions improve interaction generation quality. Developers can include explicit descriptions
(such as the position, interactive elements, and effects) of the interaction in their prompts to help MLLMs understand the interaction clearly.
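One hedged sketch of the cropping suggestion above, assuming the interactive area's bounding box is already known and using the Node.js canvas package (not a tool used in the paper):

// Crop a screenshot to the region around an interactive element (plus a margin),
// which increases the interaction's visual saliency before the image is sent to an MLLM.
const fs = require('fs');
const { createCanvas, loadImage } = require('canvas');

async function cropToInteraction(srcPath, box, outPath, margin = 40) {
  const img = await loadImage(srcPath);
  const x = Math.max(0, box.x - margin);
  const y = Math.max(0, box.y - margin);
  const w = Math.min(img.width - x, box.width + 2 * margin);
  const h = Math.min(img.height - y, box.height + 2 * margin);
  const canvas = createCanvas(w, h);
  canvas.getContext('2d').drawImage(img, x, y, w, h, 0, 0, w, h);
  fs.writeFileSync(outPath, canvas.toBuffer('image/png'));
}

// cropToInteraction('screenshot.png', { x: 820, y: 460, width: 180, height: 60 }, 'interaction_crop.png');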
8 THREATS TO VALIDITY
Limited context length. As webpages become more complex with numerous interactions, the input
context expands, potentially exceeding the context window constraints of MLLMs (e.g., 128K tokens
for GPT-4o). Nevertheless, this limitation can be mitigated by employing iterative generation,
progressively producing interactions for a webpage over multiple rounds.
Model selection. This study utilizes three prominent Multimodal Large Language Models (MLLMs)
to conduct experiments. There are some open-source MLLMs, such as LLaVA [35], that we do not test; we will evaluate the performance of these models on the Interaction-to-Code task in future work.
Unable to handle interactions that require a back-end. Some complex functional interactions (e.g., login and search) are implemented with server-side languages such as Python. The benchmark we collect does not include back-end code, so we cannot verify the generation of such interactions, but we believe our work is an important step toward generating interactive websites.
9 CONCLUSION
This paper presents the first systematic evaluation of MLLMs in the Interaction-to-Code task. We
introduce a formal definition of the Interaction-to-Code paradigm and establish the comprehensive
Interaction2Code benchmark encompassing diverse interaction scenarios. Through extensive
automated and human evaluations, we assess MLLMs’ performance and usability of generated
interactions. Our key findings reveal the limitations of MLLMs in the Interaction-to-Code task,
failure types, and key factors (prompts, enhanced visual saliency, and supplementary textual
descriptions) for enhancing the interaction generation performance of MLLMs.
REFERENCES
[1] 2024. The 10 best user interface (UI) design tools to try in 2024. UX Design Institute (2024). https://www.
uxdesigninstitute.com/blog/user-interface-ui-design-tools/ Accessed: 2024-10-06.
[2] 2024. Top Website Statistics For 2024. Forbes Advisor (2024). https://www.forbes.com/advisor/business/software/
website-statistics/ Accessed: 2024-10-06.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch,
Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances
in neural information processing systems 35 (2022), 23716–23736.
[4] Anthropic. 2024. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet Accessed:
2024-09-29.
[5] Anthropic. 2024. Vision Documentation. https://docs.anthropic.com/en/docs/vision Accessed: 2024-10-18.
[6] Shushan Arakelyan, Rocktim Jyoti Das, Yi Mao, and Xiang Ren. 2023. Exploring Distributional Shifts in Large
Language Models for Code Analysis. In Conference on Empirical Methods in Natural Language Processing. https://api.semanticscholar.org/CorpusID:257557735
[7] Batuhan Aşıroğlu, Büşta Rümeysa Mete, Eyyüp Yıldız, Yağız Nalçakan, Alper Sezen, Mustafa Dağtekin, and Tolga
Ensari. 2019. Automatic HTML code generation from mock-up images using machine learning techniques. In 2019
Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). IEEE, 1–4.
[8] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the
ACM SIGCHI symposium on engineering interactive computing systems. 1–6.
[9] C. Chen, T. Su, G. Meng, Z. Xing, and Y. Liu. 2018. From UI design image to GUI skeleton: a neural machine translator
to bootstrap mobile GUI implementation. In Proceedings of the 40th International Conference on Software Engineering.
665–676.
[10] Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From ui design image to gui skeleton:
a neural machine translator to bootstrap mobile gui implementation. In Proceedings of the 40th International Conference
on Software Engineering. 665–676.
[11] Fuxiang Chen, Fateme Moradian Fard, David Lo, and Timofey Bryksin. 2022. On the Transferability of Pre-trained
Language Models for Low-Resource Programming Languages. 2022 IEEE/ACM 30th International Conference on Program
Comprehension (ICPC) (2022), 401–412. https://api.semanticscholar.org/CorpusID:248266381
[12] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. 2022. Visualgpt: Data-efficient adaptation of pretrained
language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 18030–18040.
[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda,
Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry,
Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad
Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios
Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu
Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever,
and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. ArXiv abs/2107.03374 (2021).
https://api.semanticscholar.org/CorpusID:235755472
[14] Wen-Yin Chen, Pavol Podstreleny, Wen-Huang Cheng, Yung-Yao Chen, and Kai-Lung Hua. 2022. Code generation
from a graphical user interface via attention-based encoder–decoder model. Multimedia Systems 28, 1 (2022), 121–130.
[15] André Armstrong Janino Cizotto, Rodrigo Clemente Thom de Souza, Viviana Cocco Mariani, and Leandro dos
Santos Coelho. 2023. Web pages from mockup design based on convolutional neural network and class activation
mapping. Multimedia Tools and Applications 82, 25 (2023), 38771–38797.
[16] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li,
Pascale Fung, and Steven C. H. Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with
Instruction Tuning. ArXiv abs/2305.06500 (2023). https://api.semanticscholar.org/CorpusID:258615266
[17] Victor C. Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, and Saleema Amershi. 2022.
Aligning Offline Metrics and Human Judgments of Value of AI-Pair Programmers. ArXiv abs/2210.16494 (2022).
https://api.semanticscholar.org/CorpusID:253237523
[18] Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Robert Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan,
Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, and Bing Xiang. 2023. A Static Evaluation of Code
Completion by Large Language Models. ArXiv abs/2306.03203 (2023). https://api.semanticscholar.org/CorpusID:259088657
[19] Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-collaboration Code Generation via ChatGPT. ArXiv
abs/2304.07590 (2023). https://api.semanticscholar.org/CorpusID:258179537
[20] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng,
and Yiling Lou. 2023. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation.
ArXiv abs/2308.01861 (2023). https://api.semanticscholar.org/CorpusID:260439062
[21] Shuzheng Gao, Xinjie Wen, Cuiyun Gao, Wenxuan Wang, and Michael R. Lyu. 2023. Constructing Effective In-
Context Demonstration for Code Intelligence Tasks: An Empirical Study. ArXiv abs/2304.07575 (2023). https://api.semanticscholar.org/CorpusID:263867793
[22] Henry Gilbert, Michael Sandborn, Douglas C. Schmidt, Jesse Spencer-Smith, and Jules White. 2023. Semantic Com-
pression with Large Language Models. 2023 Tenth International Conference on Social Networks Analysis, Management
and Security (SNAMS) (2023), 1–8. https://api.semanticscholar.org/CorpusID:258309482
[23] Google. 2024. Gemini API. https://ai.google.dev/gemini-api Accessed: 2024-10-06.
[24] Jian Gu, Pasquale Salza, and Harald C. Gall. 2022. Assemble Foundation Models for Automatic Code Summarization.
2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (2022), 935–946. https://api.semanticscholar.org/CorpusID:245986582
[25] Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, and Wenbin Jiang. 2024.
VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs. arXiv preprint arXiv:2404.06369
(2024).
[26] Xinying Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John C. Grundy, and Haoyu
Wang. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. ArXiv abs/2308.10620
(2023). https://api.semanticscholar.org/CorpusID:261048648
[27] Vanita Jain, Piyush Agrawal, Subham Banga, Rishabh Kapoor, and Shashwat Gulyani. 2019. Sketch2Code: transforma-
tion of sketches to UI in real-time using deep neural network. arXiv preprint arXiv:1910.08930 (2019).
[28] Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A Code Evolution Framework via Large Language
Models. ArXiv abs/2306.02907 (2023). https://api.semanticscholar.org/CorpusID:259076266
[29] Kati Kuusinen and Tommi Mikkonen. 2013. Designing User Experience for Mobile Apps: Long-Term Product
Owner Perspective. 2013 20th Asia-Pacific Software Engineering Conference (APSEC) 1 (2013), 535–540. https://api.semanticscholar.org/CorpusID:18632493
[30] Valéria Lelli, Arnaud Blouin, and Benoît Baudry. 2015. Classifying and Qualifying GUI Defects. 2015 IEEE 8th
International Conference on Software Testing, Verification and Validation (ICST) (2015), 1–10. https://api.semanticscholar.org/CorpusID:2288032
[31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with
frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.
[32] Jia Li, Ge Li, Yongming Li, and Zhi Jin. 2023. Enabling Programming Thinking in Large Language Models Toward
Code Generation. ArXiv abs/2305.06599 (2023). https://api.semanticscholar.org/CorpusID:263896057
[33] Tsz On Li, Wen yi Zong, Yibo Wang, Haoye Tian, Y. Wang, and S. C. Cheung. 2023. Nuances are the Key: Unlocking
ChatGPT to Find Failure-Inducing Tests with Differential Prompting. 2023 38th IEEE/ACM International Conference on
Automated Software Engineering (ASE) (2023), 14–26. https://api.semanticscholar.org/CorpusID:258298446
[34] Zongjie Li, Chaozheng Wang, Zhibo Liu, Hao Wang, Shuai Wang, and Cuiyun Gao. 2022. CCTEST: Testing and
Repairing Code Completion Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
(2022), 1238–1250. https://api.semanticscholar.org/CorpusID:251623193
[35] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT:
Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
[36] Antonio Mastropaolo, Luca Pascarella, and Gabriele Bavota. 2022. Using Deep Learning to Generate Complete
Log Statements. 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) (2022), 2279–2290.
https://api.semanticscholar.org/CorpusID:245906103
[37] Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and
Gabriele Bavota. 2021. Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks. 2021
IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 336–347. https://api.semanticscholar.org/CorpusID:231786586
[38] Kevin Moran, Carlos Bernal-Cárdenas, Michael Curcio, Richard Bonett, and Denys Poshyvanyk. 2018. Machine
learning-based prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering 46,
2 (2018), 196–221.
[39] Kevin Moran, Boyang Li, Carlos Bernal-Cárdenas, Dan Jelf, and Denys Poshyvanyk. 2018. Automated Reporting of
GUI Design Violations for Mobile Apps. 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE)
(2018), 165–175. https://api.semanticscholar.org/CorpusID:3634687
[40] T. A. Nguyen and C. Csallner. 2015. Reverse engineering mobile application user interfaces with remaui (t). In 2015
30th IEEE/ACM International Conference on Automated Software Engineering (ASE). 248–259.
[41] Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse engineering mobile application user interfaces with remaui
(t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 248–259.
[42] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Haiquan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.
2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In International