Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?
1 INTRODUCTION
As the Internet continues to evolve and expand, more and more websites emerge, contributing to the
diverse and ever-growing online world. As of 2024, the digital landscape comprises approximately
1.09 billion websites [2], supporting a variety of applications in people’s daily lives.
The design and development of Graphical User Interfaces (GUIs) are vital for creating a website. A
well-designed GUI not only enhances the website’s visual attractiveness but also improves usability
and user satisfaction. In this process, GUI design involves shaping the website's aesthetics [1],
such as layout, colors, and typography [9, 29]. In contrast, GUI development is about implementing
that aesthetic through programming languages. Nevertheless, such conversion is a complex and
time-consuming task. Developers must manually map visual elements to their corresponding
implementation details, which can lead to errors and discrepancies between the original design
and the final look [9, 30, 39, 40, 59].
To allow developers to transform design diagrams into functional GUI code more easily, several
automated GUI code generation methods have been proposed, which can be further categorized
Fig. 1. Interaction example and interactive element ratio of different types of webpages: (a) example of interactive elements (from https://www.fun.com/adult-cakeworthy-never-land-denim-jacket.html); (b) ratio of interactive and static elements; (c) ratio of implemented vs. unimplemented interactive elements for GPT-4o.
into two types: learning-based and LLM-based approaches. Learning-based methods, such as Pix2code [8], design models based on CNNs and LSTMs to generate user interface code by reverse-engineering a single GUI image input. Chen et al. [10] present a neural machine translator that extracts visual features from UI images, encodes their spatial layouts, and generates GUI skeletons in a unified neural network framework. However, these deep-learning-based methods exhibit compromised performance and fail to generalize to diverse webpage elements due to the limited knowledge learned from their training samples. Recently, incorporating visual
information into Large Language Models (LLMs) has led to the development of Multimodal Large
Language Models (MLLMs) [3, 12, 31, 50, 56]. Leading models in this domain, such as GPT-4o [43],
Claude-3.5 [4], and Gemini-1.5 [23], have achieved excellent performance in visual understanding
tasks [16, 55]. Furthermore, research has shown that LLMs have remarkable performance on various
code intelligence tasks [26], including code generation [19, 20, 22, 28, 32, 57], code completion [13,
17, 18, 33, 34, 42], and code summarization [6, 11, 21, 24, 36, 37]. These advances create new
opportunities for the Design-to-Code task, i.e., generating code from screenshots to replicate web
page elements, layout, text, and colors. For example, Design2Code [49] designs three types of
prompts to stimulate MLLMs' web content understanding and self-refinement capabilities for GUI
code generation. DCGen [51] proposes a divide-and-conquer-based approach to prompt MLLMs to
generate webpage elements more accurately.
Despite continuous efforts to improve model capability, the evaluation scope of existing work is restricted to static pages. More specifically, existing research [25, 49, 58] only focuses on the static appearance of the webpage (e.g., color, layouts), ignoring the dynamic interactive properties and functionality of elements, such as the size selection list and quantity adjustment button shown in Fig. 1(a), and other designs for user engagement. Additionally, we observe that such interactive elements account for a large proportion of the webpage in real-world software practice. We randomly select 10 real-world webpages on different topics to analyze the ratio of interactive elements; the results in Fig. 1(b) indicate that interactive elements account for more than 50% of elements in these cases.
Then we utilize GPT-4o [43] to generate the GUI code containing interactive elements. As shown
in Fig. 1(c), fewer than 15% of interactive elements are correctly implemented, highlighting the
current limitations in handling webpage interactive design.
Static webpages inherently limit user interaction with web elements, hindering access to new
content (such as browsing images via carousel buttons) or impeding task completion (like selecting
clothing sizes from drop-down menus), thereby impairing user experience. In this context, evalu-
ations of static pages become inadequate for real-world webpage deployments, where dynamic
elements are prevalent. Therefore, we argue that a benchmark for webpages that includes inter-
active elements is essential to enhance the practicality, usability, and user engagement of studies on
auto-generated GUI code. In this paper, we emphasize the importance of webpage interactions by
Table 1. Summary of key findings.

Limitations:
• MLLMs exhibit limited performance in reproducing fine-grained interaction features, such as structure, text, and position (Finding 1).
• Performance depends on the type of interaction: MLLMs excel at handling interactions with fixed patterns (e.g., selection lists) and clear changes (e.g., new window creation), while struggling with interactions that involve complex changes (e.g., iframe, progress) and subtle visual modifications (Finding 3).
Failure Types:
• The predominant failures are "No interaction", "Partial implementation", "Interactive element missing", and "Wrong position after interaction".
• The most critical failures include "Interactive element missing", "Effect on wrong element", "Wrong function", and "No interaction" (Finding 4).
Key Factors:
• ★ Well-designed prompts are effective: Chain-of-Thought enables step-by-step interaction analysis, while marking interaction areas provides essential visual signals. Both approaches improve the quality of generated interactions (Finding 2).
• ★ Enhanced visual saliency significantly improves interaction generation, particularly in complicated cases (Finding 5).
• ★ Supplementary textual descriptions substantially boost MLLMs' interaction generation capabilities (Finding 6).
investigating the following question: to what extent can MLLMs produce interaction code based on the visual design?
To this end, we provide a systematic analysis of MLLMs’ capability in reproducing dynamic
interactions on webpages. Specifically, we first define the Interaction-to-Code task, i.e., generating
code from a series of screenshots representing webpage interactions to replicate interactive elements.
Then we build the Interaction2Code benchmark that encompasses a diverse array of webpages
and interactions. It comprises 97 unique web pages and 213 distinct interactions, spanning 15
webpage types and 30 interaction categories. By curating a wide range of interaction types, we offer
a representative and diverse evaluation dataset for assessing the capabilities of MLLMs in producing
dynamic webpages in a more realistic scenario. We mainly investigate the following six research
questions (RQs):
• RQ1: How do different MLLMs perform in Interaction-to-Code task under different prompts?
• RQ2: How do humans evaluate the usability of interactions generated by MLLMs?
• RQ3: How do MLLMs perform in code generation across different interaction scenarios?
• RQ4: What types of mistakes do MLLMs make in generating interactions?
• RQ5: How does visual saliency influence the quality of generated interactions?
• RQ6: Which representation modality, visual signals or textual descriptions, better enhances MLLMs' ability to generate interaction code?
To address RQ1, we design three distinct prompt types: direct prompts, Chain-of-Thought
prompts, and Mark prompts (which mark the interaction areas) to evaluate the performance of
three state-of-the-art MLLMs under varying prompt conditions. For RQ2, we conduct user studies
where participants interact with the generated webpages to assess the usability. In RQ3, we analyze
MLLMs’ performance across different interaction scenarios by calculating usability rates for various
interaction types. To answer RQ4, we invite human annotators to categorize and discuss webpage
generation failures, followed by data analysis to reveal the most prevalent error patterns and
their severity. For RQ5, we evaluate the generated interactions across varying saliency levels to
<body>
  <h1>Front-End Development</h1>
  <p id="myParagraph">This is an example</p>
  <button id="myButton" draggable="true">Click Me</button>
  <script>
    const button = document.getElementById('myButton');
    const paragraph = document.getElementById('myParagraph');
    button.addEventListener('click', function() {
      paragraph.textContent = 'Button clicked!';
    });
  </script>
</body>
</html>

Fig. 2. Example front-end code of a simple interactive webpage.
investigate their impact on interaction generation performance. Finally, for RQ6, we examine
the influence of interaction representation modality by comparing three input configurations:
visual-only, textual description-only, and combined visual-textual.
Based on our experimental results, we present six key findings, shown in Table 1, including
the limitations of MLLMs, failure types and key factors for enhancing interaction generation
performance. Our contributions are summarized as follows:
• Task formulation. To the best of our knowledge, this is the first study to formulate the
Interaction-to-Code task and present a systematic study on the code generation capabilities
of MLLMs for dynamic webpage interactions.
• Benchmark. We build the first real-world webpage interaction dataset, Interaction2Code, con-
taining 97 webpages and 213 interactions, spanning 15 webpage topics and 30 interaction
categories.
• Key Findings. Our in-depth analysis reveals the limitations of MLLMs, identifies 10 representative failure types and their underlying causes, and provides key factors for enhancing performance on the Interaction-to-Code task. These key findings offer valuable implications for researchers
and developers engaged in automated front-end development.
2 BACKGROUND
2.1 Basic Knowledge about Front-end Development
Front-end development focuses on what users see and interact with in their web browsers. Visual
design and interactive implementation are two key parts of creating visually appealing and user-
friendly interfaces. The primary technologies used in front-end development are Hypertext Markup
Language (HTML), Cascading Style Sheets (CSS), and JavaScript.
2.1.1 HTML. HTML (HyperText Markup Language) is a markup language used to create web page
content. It defines the structure and content of a web page through tags, such as titles, paragraphs,
and buttons, as shown in Fig. 2; each HTML element includes an opening tag, content, and a closing tag, forming the basic building block of a webpage. HTML does not support complex interactions, but some specific elements (e.g., form, button) and attributes can be used to implement basic interactive functions. For example, the HTML code in Fig. 2 sets the "draggable" attribute to true on the button tag to allow the user to drag the button.
2.1.2 CSS. CSS (Cascading Style Sheets) is a style sheet language used to describe the style of
HTML documents. It allows web developers to control the layout, fonts, colors, spacing, and other
visual effects of the page. CSS can achieve interactive effects through pseudo-classes, pseudo-
elements, transitions, and animations. For example, the CSS program between the style tags in Fig. 2 leverages the ":hover" pseudo-class to add an interaction to the button: the button's color changes from green to blue once the mouse hovers over it. The transition ("transition: background-color 0.5s") smoothly changes the color of the button over 0.5 seconds to create an animation effect.
2.1.3 JavaScript. JavaScript is a high-level, dynamic, and versatile programming language that
is primarily used for adding interactivity and dynamic behavior to websites. JavaScript enables
developers to create rich, interactive user experiences, manipulate the Document Object Model
(DOM), and handle events. For example, Fig. 2 shows that the JavaScript program between the script tags adds an event listener to the button; once the button is clicked, the text content of the paragraph is changed to "Button clicked!".
In summary, front-end webpage interactions come from HTML tags and attributes, style changes implemented in CSS, and custom events implemented in JavaScript.
3 PROBLEM DEFINITION
To describe the interactions within a webpage, we define the Webpage Interaction Graph (WIG):
WIG = {N, E},   (1)

where N = {S_0, S_1, ..., S_n} is a finite set of nodes representing screenshots of the webpage, and E = {I_0, I_1, ..., I_m} represents a series of interaction events that connect different screenshots with directed edges, indicating transitions caused by user interactions. We number the interaction events and map them to the corresponding screenshots: the first screenshot represents the original webpage, the second screenshot represents the result after interaction 1, the third screenshot represents the result after interaction 2, and so on. Let C_o denote the original webpage file, including HTML, CSS, and JavaScript code; S_0^o denote the screenshot of the original webpage; S_n^o denote the webpage screenshot after interaction I_n; and G_o denote the webpage interaction graph of the original webpage. To achieve the Interaction-to-Code task, a model M takes G_o as input and outputs the generated code file C_g = M(G_o), which implements both the visual design and the interactions of the original webpage C_o. Fig. 3 illustrates an example of the Interaction-to-Code task.
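For illustration only (this encoding is our own and not part of the paper's artifacts), a WIG with two interactions branching from the original screenshot could be represented as a plain JavaScript object:

// Hypothetical encoding of a Webpage Interaction Graph (WIG).
// Nodes are screenshots; edges are interaction events connecting them.
const wig = {
  nodes: ["s0.png", "s1.png", "s2.png"], // S_0: original page; S_1, S_2: states after interactions
  edges: [
    { id: 1, from: 0, to: 1, event: "click size-select" },  // interaction 1: S_0 -> S_1
    { id: 2, from: 0, to: 2, event: "click quantity-plus" } // interaction 2: S_0 -> S_2
  ]
};

A model consuming such a graph receives the node screenshots in order, matching the first/second/third screenshot convention above.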
Table 2. Quantitative metrics.

                 Min    Max      Average   Std
Length (tokens)  2457   726,317  141,084   160,438
Tag Count        34     12,694   1,291     1,574
DOM Depth        6      37       18        6
Unique Tags      8      58       31        9
Total size: 97

Fig. 4. Topic distribution (topics include shop, blog, business, news, technology, video, book, product, homepage, sport, hotel, encyclopedia, form, food, study, and other).
Interaction Type Distribution. To get a sense of the range of interaction types covered in
our benchmark, we manually annotate the type of each interaction from the element tag and the visual effect perspectives. Tag categories come from HTML tags such as button, image, and link. Buttons, input boxes, and links are the most frequent types and play a major role in human-website interaction. Visual categories involve changes in color, size, position, text, etc.; the explanations are as follows:
• New component: new elements are generated after an interaction. For example, as shown in Fig. 7(c), two new input elements are generated after selecting the third choice.
• Text: the text changes after an interaction. As shown in Fig. 8(i), after clicking the "Select" button, the text on it changes to "Selected".
• Color: the color changes after an interaction. For example, the background color changes from white to dark after clicking the dark label, as illustrated in Fig. 8(c).
• New window: a new window is generated after the interaction, such as a form popping up after clicking the contact button, as shown in Fig. 8(f).
• Position: the position of the element changes after the interaction. For example, on a text editing website, clicking the right button can move the text from the left to the right.
• Size: the size of the element changes after the interaction. For example, the text size increases after clicking the large label, as shown in Fig. 8(h).
• Switch: the content is switched. For example, in Fig. 7(b), after clicking the "M" button, the clothes parameter is switched from "S" to "M".
Note that one interaction may belong to multiple tag categories and visual categories. Table 3
demonstrates that the Interaction2Code benchmark has a rich set of interaction types, including 23 tag
categories and 7 visual categories.
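To make these categories concrete, the following self-contained page (our own illustration, not an item from the benchmark) implements a "size" interaction of the kind MLLMs are asked to reproduce; it simultaneously falls under the button tag category, illustrating the multi-category membership noted above:

<!DOCTYPE html>
<html>
<body>
  <p id="sample-text">Adjustable text.</p>
  <button id="large-label">Large</button>
  <script>
    // Clicking the "Large" label enlarges the paragraph text (visual category: size).
    document.getElementById('large-label').addEventListener('click', function () {
      document.getElementById('sample-text').style.fontSize = '2em';
    });
  </script>
</body>
</html>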
5 STUDY SETUP
5.1 Evaluation Models
We employ three state-of-the-art (SOTA) MLLMs: Gemini 1.5 [23], GPT-4o [43] and Claude-3.5
[4] to evaluate their performance on the Interaction-to-Code task. The specific model versions are 20240806 for GPT-4o, 20240620 for Claude-3.5-Sonnet, and Gemini-1.5-flash-latest, accessed during October 2024. In configuring the MLLMs, we set the temperature to 1 and the maximum number of output tokens to 4096 for all three models. All other parameters were kept at their
default settings as outlined in the relevant API documentation [5, 23, 44].
5.2 Metrics
We employ both full-webpage metrics and interactive-part metrics to judge the capability of MLLMs in the Interaction-to-Code task. We measure the quality of webpages generated by MLLMs from the visual, structural, and textual perspectives:
• Visual Similarity. We use the CLIP score [46] to measure visual similarity. This metric measures
the semantic similarity between the generated and original webpages, serving as an indicator
of how effectively the generated GUI captures the intended visual elements and overall design
concept.
• Structure Similarity. SSIM [52] (Structural Similarity Index Measure) score is applied to calcu-
late the structure similarity. It evaluates the layout and compositional accuracy, emphasizing the
spatial arrangement and structural similarities between the generated and original webpages.
• Text Similarity. We first use Python OCR tools to recognize the text in the original and the
generated webpages, and then use the Bilingual Evaluation Understudy (BLEU) score [45] to
measure the text similarity between the two web pages.
For the interactive parts of webpages, in addition to the above visual, structure and text similarity,
we also evaluate them from the perspective of the position and function of the interaction.
• Position Similarity. The position similarity between the original interaction I_o and the generated interaction I_g is defined as follows:

Pos_sim(I_o, I_g) = 1 - max(|x_o - x_g|, |y_o - y_g|),   (2)

where (x_o, y_o) and (x_g, y_g) are the normalized coordinates (in [0, 1]) of the centers of the interactive areas.
• Function Usability. This metric measures whether the interactive function is usable: human annotators are asked to interact with the generated webpage and judge its usability. Let N(·) denote the quantity; we can then calculate the Usability Rate (UR), for which a computation sketch is given after this list:

UR = N(usable) / (N(usable) + N(unusable)).   (3)
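A minimal computation sketch of the two interaction-level measures defined above, assuming the normalized center coordinates and the human judgments have already been collected:

// Position similarity between the original and generated interactive areas,
// given normalized centers (x, y) in [0, 1].
function positionSimilarity(orig, gen) {
  return 1 - Math.max(Math.abs(orig.x - gen.x), Math.abs(orig.y - gen.y));
}

// Usability rate over a list of human judgments, each either "usable" or "unusable".
function usabilityRate(judgments) {
  const usable = judgments.filter(j => j === "usable").length;
  return usable / judgments.length;
}

// Example: positionSimilarity({ x: 0.50, y: 0.30 }, { x: 0.55, y: 0.28 }) ≈ 0.95
// Example: usabilityRate(["usable", "unusable", "usable", "usable"]) === 0.75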
5.3 Prompt Design
We design three types of prompt methods: direct prompt, chain-of-thought prompt, and mark
prompt, as shown in Fig. 5. In the direct prompt, the first screenshot represents the original webpage
state, while subsequent screenshots depict states after specific interactions. Requirements are applied
to guide MLLMs in replicating the webpage design and interaction. In particular, requirement 3
involves letting MLLMs number interactive elements to allow direct identification by ID, enabling
automated interaction and screenshot capture for generated webpages. For the Chain-of-Thought
(CoT) prompt [53], we use the instruction “let’s think step by step” and design three intermediate
steps: analyze the interaction effects, locate the interactive elements, and implement the interaction.
For the Mark prompt, we use red bounding boxes to highlight the areas of interaction, prompting
MLLMs to focus on the interactive parts.
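Requirement 3 (numbering interactive elements) is what enables this automation: with predictable ids, a headless browser can trigger each interaction and capture the resulting screenshot. The paper does not name its tooling; the sketch below uses Puppeteer as one hypothetical driver, with an id scheme (interact-1, interact-2, ...) following the requirement's description:

// Hypothetical driver: load a generated page, trigger the n-th numbered interaction,
// and capture a screenshot of the resulting state.
const puppeteer = require('puppeteer');

async function captureInteraction(pageUrl, n, outPath) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'networkidle0' });
  await page.click(`#interact-${n}`); // element numbered by the model per requirement 3
  await page.screenshot({ path: outPath, fullPage: true });
  await browser.close();
}

// captureInteraction('http://localhost:8000/generated.html', 1, 's1_generated.png');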
6 EXPERIMENTS
In this work, we conduct experiments to answer the following questions:
• RQ1: How do different MLLMs perform in Interaction-to-Code task under different prompts?
• RQ2: How do humans evaluate the usability of interactions generated by MLLMs?
• RQ3: How do MLLMs perform in code generation across different interaction scenarios?
• RQ4: What types of mistakes do MLLMs make in generating interactions?
• RQ5: How does visual saliency influence the quality of generated interactions?
Direct Prompt
[Instruction]:
You are a web developer proficient in HTML, CSS and JavaScript. The user provides some screenshots of a webpage. The first screenshot [image1]
shows the webpage in its original state, while others [image2, image3,…] show the webpage after the user has interacted with certain elements.
You are tasked with creating a webpage that replicates the design and interaction observed in screenshots.
[Requirements]:
1. Design Replication: Pay attention to layout, color and so on to make the webpage look identical to the first screenshot .
2. Interaction Replication : Implement the changes shown in screenshots caused by interactions (e.g., clicks).
3. Number Interactions: You need to number interactive elements from interact-1 to interact-n, interact-1 corresponds to the interaction presented in
the second screenshot, and interact-2 corresponds to the interaction presented in the third screenshot, and so on. For example, if the button is
clicked in the second screenshot, the id of the button is set to interact-1: "<button id="interact1">Click Me!</button>"
…
Combine HTML, CSS and JavaScript codes into one file and respond the codes only:
• RQ6: Which representation modality, visual signals or textual descriptions, better enhances MLLMs' ability to generate interaction code?
6.1 RQ1: How do different MLLMs perform in Interaction-to-Code task under different
prompts?
We present the results of three leading MLLMs under three different prompts in Table 4, where bold values indicate the best performance and underlined values indicate the second-best performance. First, we can make the following observations about MLLMs under direct prompting:
(1) Generation of interactive elements presents greater challenges than static full web-
page generation. Table 4 shows that the performance metrics for interactive components are
notably lower than those for complete webpages under direct prompts. Regarding visual similarity,
MLLMs attain approximately 0.73-0.78 for full pages, compared to 0.71-0.76 for interactive elements.
Structure similarity shows a more pronounced disparity, with MLLMs achieving 0.6-0.78 for full
pages but only 0.4-0.56 for interactive components. Similarly, text similarity scores reach about
0.65 for full pages, contrasting with approximately 0.5 for interactive elements.
(2) MLLMs demonstrate limitations in accurately reproducing fine-grained features
of interaction. The performance of MLLMs on fine-grained metrics (such as structure, text, and
position similarity) is notably weaker compared to their performance on coarse-grained metrics
like CLIP score. As illustrated in Table 4, for the interaction part, the CLIP similarity exceeds 0.7,
whereas text similarity hovers around 0.5, position similarity approximates 0.45-0.62, and structure
similarity ranges between 0.4 and 0.5.
(3) Claude-3.5 outperforms GPT-4o and Gemini-1.5 in the Interaction-to-Code task.
Experimental results under direct prompting reveal a consistent performance ranking, with Claude-3.5
leading, followed by GPT-4o, and Gemini-1.5 showing the lowest performance.
To improve interaction performance, we further propose the CoT and Mark prompts to force models to focus on the interaction part, resulting in the following observations:
(4) Both CoT and Mark prompts enhance model performance compared to the direct prompt, and the Mark prompt demonstrates superior performance compared to the CoT prompt. GPT-4o's interaction-part metrics (CLIP, SSIM, text, position) improve from the direct prompting
scores (0.7328, 0.4221, 0.4848, 0.6053) to (0.7212, 0.4556, 0.4902, 0.6079) with CoT, and further to
(0.7454, 0.5583, 0.5241, 0.6123) with Mark prompting. However, both prompting methods slightly
decrease full-page metrics, likely due to their focused emphasis on interactive elements rather than
overall page composition.
6.2 RQ2: How do humans evaluate the usability of interactions generated by MLLMs?
Although the above metrics measure the generation quality of interactions from different perspectives, the functional evaluation of interactions still requires human assessment.
Pairwise Model Comparison Setting. We ask three human annotators to rank a pair of
generated interactions (one from the baseline, the other from the tested methods) to decide which
one implements the reference interaction function better. We use Gemini-1.5 with the direct prompt as the baseline and collect the other eight methods' Win/Tie/Lose rates against this baseline. The results are shown in Fig. 6(a); a higher win rate and lower loss rate suggest better quality as judged
by human annotators.
Functionality Evaluation Setting. We also ask the three annotators to evaluate the functionality
(i.e., usability) of each generated interaction. If the interactive function is consistent with the ground truth, it is regarded as usable; otherwise, it is unusable. We calculate the usability rate of different schemes; the results are shown in Fig. 6(b).
Fig. 6. Human evaluation: a higher win rate indicates better quality and a higher usability rate indicates better functionality.
Table 5. Usability rates across tag categories.

Model   Prompt  button  input   link    iframe  textarea  option  select  form    progress
Gemini  Direct  0.5395  0.5172  0.4583  0.3750  0.5238    0.6667  0.7000  0.6667  0.2857
Gemini  CoT     0.5682  0.6176  0.4167  0.6250  0.6296    0.8125  0.6111  0.8750  0.4545
Gemini  Mark    0.6111  0.6750  0.5333  0.5000  0.5357    0.6875  0.7500  0.8000  0.7273
GPT     Direct  0.6742  0.8485  0.5556  0.7222  0.8571    0.8889  0.8889  0.9091  0.4000
GPT     CoT     0.6941  0.7857  0.6667  0.5000  0.7143    0.9375  0.8421  0.9000  0.2727
GPT     Mark    0.8316  0.8000  0.8276  0.7778  0.8519    0.9500  0.8947  0.8750  0.7000
Claude  Direct  0.6857  0.7750  0.8485  0.6111  0.7407    0.8235  0.9333  0.9167  0.6000
Claude  CoT     0.7071  0.8205  0.6296  0.4444  0.7586    0.9048  0.9474  1.0000  0.3636
Claude  Mark    0.8788  0.9024  0.8667  0.7368  1.0000    0.9412  0.8750  1.0000  0.5833
Average         0.6878  0.7491  0.6448  0.5880  0.7346    0.8458  0.8269  0.8825  0.4875
Results. First, our human evaluation reveals that Claude-3.5 consistently demonstrates superior
performance compared to other baseline models. Second, both CoT and Mark prompting strategies
can enhance model performance beyond direct prompting, showing higher win rates and usability
rates across most models (except Claude’s CoT prompt). Third, Mark prompting yields the most
significant improvements in usability, with GPT-4o showing 10% and 12% increases compared to
Direct and CoT prompts, respectively (Fig. 6(b)). Notably, GPT-4o with Mark prompting outperforms
Claude under both Direct and CoT conditions, highlighting the importance of visual attention. Last
but not least, these human evaluation results align with Finding 2, validating that our automatic
evaluation metrics are reasonable.
6.3 RQ3: How do MLLMs perform in code generation across different interaction
scenarios?
In this section, we study the performance of MLLMs on the Interaction-to-Code task under different
interaction types. The results of varying tag categories with high frequency and visual categories
are shown in Table 5 and Table 6, respectively.
For tag categories, form, select, and option are the easiest interaction types to generate,
achieving a usability rate higher than 80%. This is because these interaction scenarios always contain fixed patterns; for example, select and option only appear in drop-down lists, and
Table 6. Usability rates across visual categories.

Model   Prompt  new component  text    color   new window  position  size    switch
Gemini  Direct  0.5893         0.5246  0.4103  0.5000      0.5000    0.6000  0.5000
Gemini  CoT     0.6119         0.5231  0.4186  0.8125      0.4615    0.5000  0.4000
Gemini  Mark    0.5758         0.6719  0.4894  0.7647      0.7143    0.7143  0.6000
GPT     Direct  0.7164         0.7353  0.5208  0.9048      0.7500    0.6154  0.5000
GPT     CoT     0.7538         0.8060  0.5909  0.8500      0.5625    0.8750  0.4000
GPT     Mark    0.8493         0.9054  0.7907  0.8889      0.7895    0.9000  0.9000
Claude  Direct  0.7333         0.7639  0.7111  0.7917      0.7333    0.7857  0.5000
Claude  CoT     0.8205         0.8194  0.5918  0.7619      0.5000    0.6364  0.7143
Claude  Mark    0.9178         0.9189  0.8333  1.0000      0.8235    0.8182  0.7500
Average         0.7298         0.7409  0.5952  0.8083      0.6483    0.7161  0.5849
Table 7. Failure types, their influence on content, function, and user experience, and their usability rates.

Failure Object: Interactive element                    Usability Rate
  (a) Interactive element missing                      0%
  (b) No interaction                                   6.93%
  (c) Wrong interactive element                        91.96%
  (d) Wrong type of interactive element                88.89%
  (e) Wrong position of interactive element            97.83%
Failure Object: Interaction effects
  (f) Wrong position after interaction                 93.81%
  (g) Wrong type of interaction effects                55.88%
  (h) Effect on wrong element                          0%
  (i) Partial implementation                           75.29%
  (j) Wrong function                                   0%
form often merely contains input boxes. In contrast, iframe and progress elements show lower
usability rates (<60%), attributed to their complexity: iframes involve embedding external content,
while progress bars require intricate component coordination for functions like audio control or
price range adjustment, posing difficulties for MLLMs to understand.
For visual categories, MLLMs excel at generating interactions that result in prominent visual
changes, such as creating new windows and new components. However, they struggle with subtle
visual modifications, such as color shifts and positional adjustments, indicating their limitations in
handling fine-grained interaction effects.
Finding 3: Performance varies by interaction type: MLLMs are good at handling interactions with fixed patterns (e.g., selection lists) and obvious changes (e.g., new window creation),
while struggling with interactions involving complex changes (e.g., iframe, progress) and
subtle visual modifications (e.g., position change).
and refine the failure type until everyone reaches a consensus. Finally, we manually annotate the
failure types of all interactions and calculate the Usability Rate (UR) based on the human evaluation
results of RQ2. Table 7 shows the failure types and their influence; it contains 10 types of failure. Ten representative failure examples are shown in Fig. 7 and Fig. 8, where the first row shows the reference interaction and the second row shows the interaction generated by MLLMs.
Failure reason analysis. Failures (a), (c), (e), and (f) stem from MLLMs’ limitations in element
localization. Failures (d) and (g) are caused by MLLMs’ misidentification of element types. Failures
(b), (h), (i), and (j) arise from MLLMs’ misunderstanding of interaction.
Based on the failure distribution in Fig. 9, we find that the main failure modes include “No
interaction”, “Partial implementation”, “Interactive element missing”, and “Wrong posi-
tion after interaction”. Model-specific analysis reveals distinct patterns: Gemini-1.5’s failures are
dominated by “No interaction” and “Partial implementation” (>50%), while GPT-4o mainly faces
issues with “Interactive element missing” and “No interaction” (>20%). Claude-3.5’s challenges are
primarily in “No interaction” and “Wrong position after interaction” (>20%). These failures stem
from two key issues: MLLMs’ inadequate interaction comprehension leading to “No interaction”
and “Partial implementation”, and MLLMs' imprecise localization of elements and interaction effects, which results in “Interactive element missing” and “Wrong position after interaction”.
Besides, the most serious failures are “Interactive element missing”, “Effect on wrong element”, “Wrong function”, and “No interaction”. The severity of a failure depends on its usability rate (UR), with a higher UR meaning lower severity and a lower UR meaning higher severity. As illustrated in Table 7, failures (a), (b), (h), and (j) exhibit URs lower than 10%, rendering the generated interactions completely ineffective.
6.5 RQ5: How does visual saliency influence the quality of generated interactions?
The visual perception limitations of MLLMs affect their performance on visual understanding tasks,
especially when facing small low-resolution objects [60]. In this section, we examine the impact
of the interaction area ratio (i.e., visual saliency) on generation outcomes. Let I denote an interaction and S_I denote the screenshot of the webpage after interaction I; we define the visual saliency (VS) of the interaction as follows:

VS(I) = area(I) / area(S_I),   (4)

where area(·) calculates the size (in pixels) of a component. A higher VS score indicates a larger area influenced by the interaction and, consequently, a higher visual saliency.
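Given pixel bounding boxes, VS reduces to a ratio of areas; a minimal sketch (assuming the interaction-affected region has already been localized as a rectangle):

// Visual saliency: area of the interaction-affected region over the full screenshot area.
function visualSaliency(interactionBox, screenshotSize) {
  const interactionArea = interactionBox.width * interactionBox.height;
  const screenshotArea = screenshotSize.width * screenshotSize.height;
  return interactionArea / screenshotArea;
}

// Example: a 300x200 px region on a 1920x1080 screenshot gives VS ≈ 0.029.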
We first calculate the visual saliency for all interactions and plot the distribution, as shown in
Figure 11. We then divide the samples into five groups based on the distribution results, keeping the
number of samples in each group roughly balanced. The VS ranges for the five groups are as follows:
[0, 0.025), [0.025, 0.05), [0.05, 0.1), [0.1, 0.2), [0.2, 1). Figure 10 shows the box plot distribution of
metrics for Gemini-1.5 across these five groups, allowing us to draw the first observation:
(1) Groups with higher visual saliency have higher SSIM and position similarity. Although the CLIP and text similarities fluctuate among different groups, as shown in Fig. 10(a), Fig. 10(b) shows that the SSIM and position similarity increase significantly as the visual saliency increases: the group [0.2, 1) shows the highest metrics, while the group [0, 0.025) shows the lowest. This demonstrates that MLLMs are more likely to capture structural and positional features for samples with high visual saliency.
Fig. 10. Interaction part metrics distribution of different groups of Gemini-1.5 under the direct prompt.
We then randomly sample 10 webpages from failure cases and crop the screenshots to increase
the visual saliency of the interactions in the webpages (for example, if the webpage is cropped to 1/2
of the original, the visual saliency of the interaction will be doubled). Fig. 12 shows the relationship between the magnification factor and the metrics of the generation results. We observe that:
(2) Enhanced visual saliency facilitates effective generation. When the magnification factor is set to 1, all evaluation metrics yield values of 0, indicating unsuccessful interaction generation. Upon increasing VS by 1.2 times, the model is able to reproduce interactions, but
with relatively low metric scores. As the magnification factor increases from 1.2 to 3, we observe
substantial improvements in performance metrics: the CLIP and SSIM similarities approach 0.8,
while text and position similarities reach approximately 0.6. This suggests that models are effectively
overcoming the original failure cases.
Fig. 11. Visual saliency distribution.

Fig. 12. Metrics under different magnification.
Finding 5: Visual saliency affects the MLLMs’ performance on interaction generation, and
enhancing visual saliency can lead to more accurate code generation.
Table 8. Results of Gemini-1.5 and GPT-4o under different input modalities (V: visual only, T: textual description only, V+T: combined visual-textual input).

Gemini-1.5
Prompt  Modality  CLIP    SSIM    Text    Position
Direct  V         0.3338  0.1587  0.2777  0.3342
Direct  T         0.3116  0.1550  0.1687  0.3999
Direct  V+T       0.5679  0.3010  0.2732  0.5964
CoT     V         0.4357  0.1975  0.3072  0.4303
CoT     T         0.3677  0.0897  0.2290  0.4403
CoT     V+T       0.5503  0.4027  0.3558  0.5656
Mark    V         0.4502  0.3256  0.2197  0.4302
Mark    T         0.5019  0.2478  0.2921  0.5301
Mark    V+T       0.5946  0.4327  0.3416  0.4791

GPT-4o
Prompt  Modality  CLIP    SSIM    Text    Position
Direct  V         0.3737  0.1793  0.2539  0.3951
Direct  T         0.4174  0.4067  0.2316  0.4293
Direct  V+T       0.6735  0.5612  0.3919  0.7157
CoT     V         0.3871  0.3101  0.2433  0.4461
CoT     T         0.5579  0.1828  0.3045  0.5465
CoT     V+T       0.6440  0.4800  0.4287  0.7080
Mark    V         0.5015  0.4520  0.3389  0.5025
Mark    T         0.4613  0.4454  0.2805  0.4810
Mark    V+T       0.6923  0.4336  0.4248  0.7469
6.6 RQ6: Which representation modality, visual signals or textual descriptions, better enhances MLLMs' ability to generate interaction code?
To answer RQ6, we compare three input configurations: visual-only (V), textual description-only (T), and combined visual-textual input (V+T). Table 8 presents the results, with bold values
indicating the best performance and underlined values showing the second-best performance. We
can make the following observations:
(1) Integrating both visual and textual descriptions enables MLLMs to achieve optimal
performance on the Interaction-to-Code task. It is challenging to determine whether visual-
only or text description-only inputs are superior based on Table 8, as there are instances where
“V” is better and others where “T” excels. However, the combined approach (V+T) consistently
outperforms single-modality inputs in most scenarios across all three prompt types. The result
suggests a complementary relationship between visual and textual inputs, underscoring the benefits
of integrating both modalities for advanced performance.
(2) Supplementary text descriptions can bridge the performance gap across different
model capabilities and prompt strategies. Under direct prompting, Gemini-1.5 with combined
visual and textual inputs (V+T) demonstrates superior performance compared to GPT-4o using
either visual (V) or textual (T) inputs alone. Furthermore, Gemini-1.5’s performance with combined
inputs under direct prompting surpasses its own performance with visual-only input, even when
enhanced by Chain-of-Thought (CoT) or Mark prompting strategies.
Finding 6: The incorporation of visual and textual inputs considerably enhances MLLMs’
capability to generate interactions. With textual descriptions, even a weaker model can
achieve comparable performance to those of superior models without textual descriptions.
7 DISCUSSION
Implications for Researchers. The findings of our study shed light on the following future directions
to improve the quality of MLLM-generated UI code in practice.
• Enhancing MLLMs’ recognition of fine-grained webpage features. As noticed in Finding 1,
MLLMs often struggle to reproduce details of interactions, such as position, text, and structure.
Therefore, it is essential to explore strategies to improve the model’s sensitivity on these fine-
grained features.
• Correcting errors in MLLM-generated code. In RQ4, we outline common mistakes when
MLLMs generate interactive components. Developing automated methods to identify failure
types and fix errors is crucial in reproducing reliable and usable webpages.
• Enhancing the MLLM’s grounding of GUI elements and its understanding of interactions.
In RQ4, we find that existing failures arise from the inability of MLLMs to accurately locate interactive elements, understand their functionalities, and comprehend the interactions.
Therefore, it is essential to enhance the capabilities of MLLMs in this area. Alternatively, a GUI
interactive element recognition model and an interactive analysis model could be implemented
prior to MLLM input to address these limitations.
Implications for Developers. Based on our findings, we propose the following practical guidelines
for developers leveraging MLLMs in automated front-end development:
• Apply visual markers for interactive elements. Derived from Finding 2, incorporating
mark prompts with red bounding boxes significantly enhances MLLMs’ ability to generate
accurate interactions. These visual markers enable MLLMs to precisely identify both interactive
elements and their effect areas.
• Optimize interactive element visibility. Finding 5 indicates that enhanced visual saliency leads to more effective interaction generation. We recommend increasing the visual saliency of the interaction by cropping the image, or even inputting only the interactive area, to generate the code for the interaction part first, followed by integrating the generated code into the main webpage code (see the sketch after this list).
• Provide comprehensive interaction descriptions. As evidenced by Finding 6, detailed textual
descriptions improve interaction generation quality. Developers can include explicit descriptions
(such as the position, interactive elements, and effects) of the interaction in their prompts to help MLLMs understand the interaction clearly.
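One hedged sketch of the cropping suggestion above, assuming the interactive area's bounding box is already known and using the Node.js canvas package (not a tool used in the paper):

// Crop a screenshot to the region around an interactive element (plus a margin),
// which increases the interaction's visual saliency before the image is sent to an MLLM.
const fs = require('fs');
const { createCanvas, loadImage } = require('canvas');

async function cropToInteraction(srcPath, box, outPath, margin = 40) {
  const img = await loadImage(srcPath);
  const x = Math.max(0, box.x - margin);
  const y = Math.max(0, box.y - margin);
  const w = Math.min(img.width - x, box.width + 2 * margin);
  const h = Math.min(img.height - y, box.height + 2 * margin);
  const canvas = createCanvas(w, h);
  canvas.getContext('2d').drawImage(img, x, y, w, h, 0, 0, w, h);
  fs.writeFileSync(outPath, canvas.toBuffer('image/png'));
}

// cropToInteraction('screenshot.png', { x: 820, y: 460, width: 180, height: 60 }, 'interaction_crop.png');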
8 THREATS TO VALIDITY
Limited context length. As webpages become more complex with numerous interactions, the input
context expands, potentially exceeding the context window constraints of MLLMs (e.g., 128K tokens
for GPT-4o). Nevertheless, this limitation can be mitigated by employing iterative generation,
progressively producing interactions for a webpage over multiple rounds.
Model selection. This study utilizes three prominent Multimodal Large Language Models (MLLMs)
to conduct experiments. There are some open-source MLLMs, such as LLaVA [35], that we do not test; we will evaluate the performance of these models on the Interaction-to-Code task in future work.
Unable to handle interactions that require a back-end. Some complex functional interactions (e.g., login and search) are implemented with server-side languages such as Python. The benchmark we collect does not include back-end code, so we cannot verify the generation of such interactions, but we believe our work is an important step toward generating interactive websites.
9 CONCLUSION
This paper presents the first systematic evaluation of MLLMs in the Interaction-to-Code task. We
introduce a formal definition of the Interaction-to-Code paradigm and establish the comprehensive
Interaction2Code benchmark encompassing diverse interaction scenarios. Through extensive
automated and human evaluations, we assess MLLMs’ performance and usability of generated
interactions. Our key findings reveal the limitations of MLLMs in the Interaction-to-Code task,
failure types, and key factors (prompts, enhanced visual saliency, and supplementary textual
descriptions) for enhancing the interaction generation performance of MLLMs.
REFERENCES
[1] 2024. The 10 best user interface (UI) design tools to try in 2024. UX Design Institute (2024). https://www.
uxdesigninstitute.com/blog/user-interface-ui-design-tools/ Accessed: 2024-10-06.
[2] 2024. Top Website Statistics For 2024. Forbes Advisor (2024). https://www.forbes.com/advisor/business/software/
website-statistics/ Accessed: 2024-10-06.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch,
Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances
in neural information processing systems 35 (2022), 23716–23736.
[4] Anthropic. 2024. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet Accessed:
2024-09-29.
[5] Anthropic. 2024. Vision Documentation. https://docs.anthropic.com/en/docs/vision Accessed: 2024-10-18.
[6] Shushan Arakelyan, Rocktim Jyoti Das, Yi Mao, and Xiang Ren. 2023. Exploring Distributional Shifts in Large
Language Models for Code Analysis. In Conference on Empirical Methods in Natural Language Processing. https://api.semanticscholar.org/CorpusID:257557735
[7] Batuhan Aşıroğlu, Büşta Rümeysa Mete, Eyyüp Yıldız, Yağız Nalçakan, Alper Sezen, Mustafa Dağtekin, and Tolga
Ensari. 2019. Automatic HTML code generation from mock-up images using machine learning techniques. In 2019
Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). IEEE, 1–4.
[8] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the
ACM SIGCHI symposium on engineering interactive computing systems. 1–6.
[9] C. Chen, T. Su, G. Meng, Z. Xing, and Y. Liu. 2018. From UI design image to GUI skeleton: a neural machine translator
to bootstrap mobile GUI implementation. In Proceedings of the 40th International Conference on Software Engineering.
665–676.
[10] Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From ui design image to gui skeleton:
a neural machine translator to bootstrap mobile gui implementation. In Proceedings of the 40th International Conference
on Software Engineering. 665–676.
[11] Fuxiang Chen, Fateme Moradian Fard, David Lo, and Timofey Bryksin. 2022. On the Transferability of Pre-trained
Language Models for Low-Resource Programming Languages. 2022 IEEE/ACM 30th International Conference on Program
Comprehension (ICPC) (2022), 401–412. https://api.semanticscholar.org/CorpusID:248266381
[12] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. 2022. Visualgpt: Data-efficient adaptation of pretrained
language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 18030–18040.
[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda,
Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry,
Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad
Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios
Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu
Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever,
and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. ArXiv abs/2107.03374 (2021).
https://api.semanticscholar.org/CorpusID:235755472
[14] Wen-Yin Chen, Pavol Podstreleny, Wen-Huang Cheng, Yung-Yao Chen, and Kai-Lung Hua. 2022. Code generation
from a graphical user interface via attention-based encoder–decoder model. Multimedia Systems 28, 1 (2022), 121–130.
[15] André Armstrong Janino Cizotto, Rodrigo Clemente Thom de Souza, Viviana Cocco Mariani, and Leandro dos
Santos Coelho. 2023. Web pages from mockup design based on convolutional neural network and class activation
mapping. Multimedia Tools and Applications 82, 25 (2023), 38771–38797.
[16] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li,
Pascale Fung, and Steven C. H. Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with
Instruction Tuning. ArXiv abs/2305.06500 (2023). https://api.semanticscholar.org/CorpusID:258615266
[17] Victor C. Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, and Saleema Amershi. 2022.
Aligning Offline Metrics and Human Judgments of Value of AI-Pair Programmers. ArXiv abs/2210.16494 (2022).
https://api.semanticscholar.org/CorpusID:253237523
[18] Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Robert Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan,
Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, and Bing Xiang. 2023. A Static Evaluation of Code
Completion by Large Language Models. ArXiv abs/2306.03203 (2023). https://api.semanticscholar.org/CorpusID:259088657
[19] Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-collaboration Code Generation via ChatGPT. ArXiv
abs/2304.07590 (2023). https://api.semanticscholar.org/CorpusID:258179537
[20] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng,
and Yiling Lou. 2023. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation.
ArXiv abs/2308.01861 (2023). https://api.semanticscholar.org/CorpusID:260439062
[21] Shuzheng Gao, Xinjie Wen, Cuiyun Gao, Wenxuan Wang, and Michael R. Lyu. 2023. Constructing Effective In-
Context Demonstration for Code Intelligence Tasks: An Empirical Study. ArXiv abs/2304.07575 (2023). https://api.semanticscholar.org/CorpusID:263867793
[22] Henry Gilbert, Michael Sandborn, Douglas C. Schmidt, Jesse Spencer-Smith, and Jules White. 2023. Semantic Com-
pression with Large Language Models. 2023 Tenth International Conference on Social Networks Analysis, Management
and Security (SNAMS) (2023), 1–8. https://api.semanticscholar.org/CorpusID:258309482
[23] Google. 2024. Gemini API. https://ai.google.dev/gemini-api Accessed: 2024-10-06.
[24] Jian Gu, Pasquale Salza, and Harald C. Gall. 2022. Assemble Foundation Models for Automatic Code Summarization.
2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (2022), 935–946. https://api.semanticscholar.org/CorpusID:245986582
[25] Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, and Wenbin Jiang. 2024.
VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs. arXiv preprint arXiv:2404.06369
(2024).
[26] Xinying Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John C. Grundy, and Haoyu
Wang. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. ArXiv abs/2308.10620
(2023). https://api.semanticscholar.org/CorpusID:261048648
[27] Vanita Jain, Piyush Agrawal, Subham Banga, Rishabh Kapoor, and Shashwat Gulyani. 2019. Sketch2Code: transforma-
tion of sketches to UI in real-time using deep neural network. arXiv preprint arXiv:1910.08930 (2019).
[28] Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A Code Evolution Framework via Large Language
Models. ArXiv abs/2306.02907 (2023). https://api.semanticscholar.org/CorpusID:259076266
[29] Kati Kuusinen and Tommi Mikkonen. 2013. Designing User Experience for Mobile Apps: Long-Term Product
Owner Perspective. 2013 20th Asia-Pacific Software Engineering Conference (APSEC) 1 (2013), 535–540. https://api.semanticscholar.org/CorpusID:18632493
[30] Valéria Lelli, Arnaud Blouin, and Benoît Baudry. 2015. Classifying and Qualifying GUI Defects. 2015 IEEE 8th
International Conference on Software Testing, Verification and Validation (ICST) (2015), 1–10. https://api.semanticscholar.org/CorpusID:2288032
[31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with
frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.
[32] Jia Li, Ge Li, Yongming Li, and Zhi Jin. 2023. Enabling Programming Thinking in Large Language Models Toward
Code Generation. ArXiv abs/2305.06599 (2023). https://api.semanticscholar.org/CorpusID:263896057
[33] Tsz On Li, Wen yi Zong, Yibo Wang, Haoye Tian, Y. Wang, and S. C. Cheung. 2023. Nuances are the Key: Unlocking
ChatGPT to Find Failure-Inducing Tests with Differential Prompting. 2023 38th IEEE/ACM International Conference on
Automated Software Engineering (ASE) (2023), 14–26. https://api.semanticscholar.org/CorpusID:258298446
[34] Zongjie Li, Chaozheng Wang, Zhibo Liu, Hao Wang, Shuai Wang, and Cuiyun Gao. 2022. CCTEST: Testing and
Repairing Code Completion Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
(2022), 1238–1250. https://api.semanticscholar.org/CorpusID:251623193
[35] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT:
Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
[36] Antonio Mastropaolo, Luca Pascarella, and Gabriele Bavota. 2022. Using Deep Learning to Generate Complete
Log Statements. 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) (2022), 2279–2290.
https://api.semanticscholar.org/CorpusID:245906103
[37] Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and
Gabriele Bavota. 2021. Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks. 2021
IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 336–347. https://api.semanticscholar.org/CorpusID:231786586
[38] Kevin Moran, Carlos Bernal-Cárdenas, Michael Curcio, Richard Bonett, and Denys Poshyvanyk. 2018. Machine
learning-based prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering 46,
2 (2018), 196–221.
[39] Kevin Moran, Boyang Li, Carlos Bernal-Cárdenas, Dan Jelf, and Denys Poshyvanyk. 2018. Automated Reporting of
GUI Design Violations for Mobile Apps. 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE)
(2018), 165–175. https://api.semanticscholar.org/CorpusID:3634687
[40] T. A. Nguyen and C. Csallner. 2015. Reverse engineering mobile application user interfaces with remaui (t). In 2015
30th IEEE/ACM International Conference on Automated Software Engineering (ASE). 248–259.
[41] Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse engineering mobile application user interfaces with remaui
(t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 248–259.
[42] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Haiquan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.
2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In International