A clean, instruction-grounded dataset for full-page HTML rewriting, generated and verified by a multi-stage LLM-driven pipeline. This repository introduces an automated data generation pipeline that uses LLMs to synthesize high-quality fine-tuning datasets for web editing tasks.
Authors: Truong Hai Dang, Jingyu Xiao, Yintong Huo
Conference: ACM AIWare 2025
Keywords: LLM, GUI Automation, Front-end Development
The evolution of web applications relies on iterative code modifications, a process that is traditionally manual and time-consuming. While Large Language Models (LLMs) can generate UI code, their ability to edit existing code according to new design requirements (e.g., "center the logo") remains a challenge. This is largely due to the absence of large-scale, high-quality tuning data to align model performance with human expectations.
In this paper, we introduce a novel, automated data generation pipeline that uses LLMs to synthesize a high-quality fine-tuning dataset for web editing, named Instruct4Edit. Our approach generates diverse instructions, applies the corresponding code modifications, and performs visual verification to ensure correctness. By fine-tuning models on Instruct4Edit, we demonstrate consistent improvement in translating human intent into precise, structurally coherent, and visually accurate code changes.
- Fully Automated Pipeline: Three-stage LLM pipeline (Instruction Generator → HTML Editor → Visual Verifier)
- High-Quality Dataset: 1,150 verified samples from 2,500 initial pairs (46% acceptance rate)
- Visual Verification: Screenshot-based validation with 88% human agreement
- Multi-Modal Evaluation: SSIM, CLIP similarity, and manual verification
- Fine-Tuning Ready: Formatted datasets for training vision-language models
Instruct4Edit/
├── .env.example # Environment variables template
├── .gitignore # Git ignore file
├── README.md # This file
├── requirements.txt # Python dependencies
├── data/
│ ├── datasets/ # Final training datasets (JSON format)
│ │ ├── instruction_tuning_data.json # Filtered high-quality data
│ │ ├── unfiltered_instruction_tuning_data.json # Complete dataset for comparison
│ │ ├── vl_instruction_tuning_data.json # VL dataset (filtered)
│ │ └── vl_unfiltered_instruction_tuning_data.json # VL dataset (unfiltered)
│ ├── evaluate_samples/ # Sample evaluation data
│ │ ├── sample_1/ # Individual sample directories
│ │ ├── sample_2/ # Each contains: original.html, instruction_N.txt, modified_N.html, screenshots, verification_N.txt
│ │ ├── sample_3/
│ │ ├── sample_4/
│ │ ├── sample_5/
│ │ ├── sample_6/
│ │ └── ...
│ ├── samples/ # Raw generated samples from pipeline
│ │ └── Instruct4Edit/ # Instruct4Edit dataset
│ └── images/ # Screenshot images for VL training (generated by utils)
├── prompts/ # LLM prompts for each pipeline stage
│ ├── instruction_generation.md
│ ├── html_editing.md
│ └── verifying.md
└── src/
├── data_generation/ # LLM-based dataset generation pipeline
│ ├── dataset_generator_gemini.py
│ └── few_shot_examples.txt
├── evaluation/ # Model evaluation and comparison
│ ├── evaluate_qwen_base.py
│ ├── evaluate_qwen.py
│ ├── evaluate_qwen_vl_base.py
│ ├── evaluate_qwen_vl.py
│ ├── evaluate_gpt_openrouter.py
│ └── evaluate_gemini.py
├── filtering/ # Dataset quality filtering and processing
│ ├── dataset_filter.py
│ └── dataset_no_filter.py
├── metrics/ # Evaluation metric implementations
│ └── metrics.py
├── train/ # Model fine-tuning scripts
│ ├── train.py # Text-only model training
│ ├── train_vl.py # Vision-language model training (filtered)
│ ├── train_vl_unfiltered.py # VL training (unfiltered)
│ ├── train_llama.py # LLaMA model training
└── utils/ # Screenshot capture and dataset utilities
├── capture_screenshot.py # Core screenshot functionality using Selenium
├── capture_screenshot_vl.py # VL dataset screenshot generation (filtered)
├── capture_screenshot_vl_unfiltered.py # VL dataset screenshot generation (unfiltered)
├── dataset_split.py # Dataset splitting utilities
├── scan_verification.py # Instruct4Edit dataset scanning tools
├── utility_original_image.py # Original image extraction and processing
└── utility_update_vl_json.py
- Clone the repository:

  ```bash
  git clone https://github.com/dangtruong01/Instruct4Edit.git
  cd Instruct4Edit
  ```

- Create and activate a virtual environment:

  ```bash
  # Create virtual environment
  python -m venv venv

  # Activate virtual environment
  # On Windows:
  venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys:
  # GOOGLE_API_KEY=your_google_api_key_here
  # OPENAI_API_KEY=your_openai_api_key_here
  # OPENROUTER_API_KEY=your_openrouter_api_key_here
  ```

- Install Chrome WebDriver (for screenshot capture):

  ```bash
  # Download ChromeDriver from https://chromedriver.chromium.org/
  # Add to PATH or specify the path in your environment variables
  ```
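If you want to confirm the WebDriver setup before running the capture scripts, a minimal sanity check (a hypothetical helper, not part of the repository) could look like the sketch below; recent Selenium versions can also locate a driver automatically via Selenium Manager.

```python
# check_webdriver.py - hypothetical sanity check that headless Chrome can be driven.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # render without opening a visible window

driver = webdriver.Chrome(options=options)  # uses ChromeDriver from PATH (or Selenium Manager)
try:
    driver.get("https://example.com")
    print("Loaded page with title:", driver.title)
finally:
    driver.quit()
```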
⚠️ Important: The Python scripts contain hardcoded file paths that may not match your local setup. Before running any scripts, please review and update the file paths in the source code to match your directory structure. Look for paths like:

- `data/datasets/instruction_tuning_data.json`
- `data/evaluate_samples/`
- `data/images/`
- Model output directories

Update these paths in the scripts according to your local environment.
Three-stage LLM pipeline for generating instruction-modification pairs
```bash
# Generate complete dataset using the Gemini-powered pipeline
python src/data_generation/dataset_generator_gemini.py
```

Pipeline Components:
- Instruction Generator: Creates 5 diverse, human-like design edit instructions per HTML sample using few-shot prompting
- HTML Editor: Applies each instruction to generate fully rewritten HTML documents
- Visual Verifier: Renders both versions and validates instruction compliance through cross-modal verification
Data Source: 500 seed HTML files from WebCode2M dataset
Outputs:
- `data/samples/Instruct4Edit/` - Raw generated samples from the pipeline
- Each sample contains instruction-HTML pairs with verification results
Dataset Statistics:
- 2,500 initial instruction-HTML pairs generated (500 × 5 instructions)
- Automated verification filters samples for quality
- 88% agreement between automated and manual verification
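The full implementation is in `src/data_generation/dataset_generator_gemini.py`; the sketch below only illustrates how the three stages chain together with the `google-generativeai` client. The prompt paths match the `prompts/` directory above, but the function layout, model name, and the way prompts are filled and parsed are simplifying assumptions (the real Visual Verifier also compares rendered screenshots rather than raw HTML).

```python
# Simplified sketch of the three-stage generation loop (not the actual script).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # model name is illustrative

def run_stage(prompt_path: str, **fields) -> str:
    """Fill a prompt template from prompts/ and return the model's text reply."""
    template = open(prompt_path, encoding="utf-8").read()
    return model.generate_content(template.format(**fields)).text

def process_sample(original_html: str) -> list[dict]:
    pairs = []
    # Stage 1: Instruction Generator - 5 diverse design-edit instructions per page.
    instructions = run_stage("prompts/instruction_generation.md", html=original_html).splitlines()
    for instruction in [i for i in instructions if i.strip()][:5]:
        # Stage 2: HTML Editor - rewrite the full document according to the instruction.
        modified_html = run_stage("prompts/html_editing.md",
                                  html=original_html, instruction=instruction)
        # Stage 3: Visual Verifier - ask whether the edit satisfies the instruction.
        verdict = run_stage("prompts/verifying.md", instruction=instruction,
                            original=original_html, modified=modified_html)
        pairs.append({"instruction": instruction,
                      "modified_html": modified_html,
                      "verified": "PASS" in verdict.upper()})
    return pairs
```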
Filter and prepare high-quality training datasets
```bash
# Create filtered high-quality dataset
python src/filtering/dataset_filter.py

# Create unfiltered dataset for comparison
python src/filtering/dataset_no_filter.py
```

Processing Steps:
- Quality Filtering: Applies verification criteria to select high-quality samples
- JSON Formatting: Converts raw samples to training-ready format
Outputs:
- `data/datasets/instruction_tuning_data.json` - Filtered high-quality samples (1,150 verified samples)
- `data/datasets/unfiltered_instruction_tuning_data.json` - Complete dataset for comparison
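The actual filtering logic lives in `src/filtering/dataset_filter.py`; the sketch below shows the general idea of keeping only samples whose verification passed and writing them in the training JSON format. The per-sample file names and the `PASS` marker are assumptions based on the evaluation-sample layout documented later in this README.

```python
# Sketch: select verified samples and emit training-ready JSON (file names/markers assumed).
import json
from pathlib import Path

samples_dir = Path("data/samples/Instruct4Edit")  # adjust to your local layout
records = []

for sample_dir in sorted(p for p in samples_dir.iterdir() if p.is_dir()):
    original_html = (sample_dir / "original.html").read_text(encoding="utf-8")
    for verification_file in sorted(sample_dir.glob("verification_*.txt")):
        idx = verification_file.stem.split("_")[-1]
        if "PASS" not in verification_file.read_text(encoding="utf-8").upper():
            continue  # drop pairs that failed visual verification
        records.append({
            "id": f"{sample_dir.name}_{idx}",
            "instruction": (sample_dir / f"instruction_{idx}.txt").read_text(encoding="utf-8").strip(),
            "original_html": original_html,
            "modified_html": (sample_dir / f"modified_{idx}.html").read_text(encoding="utf-8"),
        })

Path("data/datasets/instruction_tuning_data.json").write_text(
    json.dumps(records, indent=2), encoding="utf-8")
```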
Generate screenshot images and create multimodal datasets
```bash
# Create VL dataset from filtered data (generates screenshots + VL JSON)
python src/utils/capture_screenshot_vl.py

# Create VL dataset from unfiltered data
python src/utils/capture_screenshot_vl_unfiltered.py
```

Vision-Language Processing:
- `capture_screenshot_vl.py`: Processes the filtered dataset
  - Reads from `instruction_tuning_data.json`
  - Captures screenshots of original and modified HTML
  - Saves images to `data/images/`
  - Creates `vl_instruction_tuning_data.json`
- `capture_screenshot_vl_unfiltered.py`: Processes the unfiltered dataset
  - Reads from `unfiltered_instruction_tuning_data.json`
  - Captures screenshots for all samples
  - Creates `vl_unfiltered_instruction_tuning_data.json`
- `capture_screenshot.py`: Core screenshot functionality using Selenium WebDriver
Outputs:
- `data/images/` - Screenshot images (original_*.png, modified_*.png)
- `data/datasets/vl_instruction_tuning_data.json` - VL dataset with image paths
- `data/datasets/vl_unfiltered_instruction_tuning_data.json` - Unfiltered VL dataset
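`src/utils/capture_screenshot.py` implements the real capture logic; a minimal sketch of the underlying idea (render an HTML string in headless Chrome and save a screenshot) could look like this. The window size and temporary-file handling are assumptions.

```python
# Sketch: render an HTML string with headless Chrome and save a screenshot.
import tempfile
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def capture(html: str, out_path: str, width: int = 1280, height: int = 960) -> None:
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument(f"--window-size={width},{height}")
    driver = webdriver.Chrome(options=options)
    try:
        # Selenium navigates to URLs, so write the HTML to a temporary file first.
        with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
            f.write(html)
            tmp_path = f.name
        driver.get(Path(tmp_path).resolve().as_uri())
        driver.save_screenshot(out_path)
    finally:
        driver.quit()

capture("<html><body><h1>Hello</h1></body></html>", "data/images/example.png")
```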
Train models on Instruct4Edit datasets
```bash
# Fine-tune text-only model on the filtered dataset
python src/train/train.py

# Fine-tune vision-language model on the filtered dataset
python src/train/train_vl.py

# Fine-tune VL model on the unfiltered dataset (for comparison)
python src/train/train_vl_unfiltered.py

# Alternative: fine-tune a LLaMA model
python src/train/train_llama.py
```

Training Configurations:
- `train.py`: Text-only Qwen2.5-7B fine-tuning
  - Input: `instruction_tuning_data.json`
  - Output: `./models/qwen2.5-7b-instruct-finetuned-design-edit`
- `train_vl.py`: Vision-language Qwen2.5-VL-7B training
  - Input: `vl_instruction_tuning_data.json`
  - Output: `./models/qwen2.5-vl-7b-finetuned-design-edit`
- `train_vl_unfiltered.py`: VL training on unfiltered data
Training Settings:
- Method: LoRA (Low-Rank Adaptation) for parameter efficiency
- Batch Size: 1, Gradient Accumulation: 8 steps
- Learning Rate: 2e-5, Epochs: 3
- Max Sequence Length: 8192 tokens for full HTML documents
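As a rough illustration of how these settings map onto `transformers` and `peft`, a condensed sketch is shown below. The LoRA rank, alpha, and target modules are assumptions; the authoritative values live in `src/train/train.py`.

```python
# Condensed LoRA fine-tuning sketch matching the settings above (LoRA hyperparameters assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,            # rank/alpha/dropout are illustrative
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./models/qwen2.5-7b-instruct-finetuned-design-edit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)
# Sequences are truncated/padded to 8192 tokens during tokenization (not shown here).
# trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset)
# trainer.train()
```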
Evaluate different models on web editing tasks
```bash
# Evaluate base Qwen2.5-7B model (no fine-tuning)
python src/evaluation/evaluate_qwen_base.py

# Evaluate base Qwen2.5-VL model with vision input
python src/evaluation/evaluate_qwen_vl_base.py

# Evaluate fine-tuned models
python src/evaluation/evaluate_samples.py
```

Evaluation Process:
- Tests models on held-out samples from `data/evaluate_samples/`
- Generates modified HTML for each instruction
- Captures screenshots for visual comparison
- Saves results for metric calculation
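Each `evaluate_*.py` script follows roughly this pattern; the sketch below is a generic outline in which `generate_modified_html` is a placeholder for the model-specific call.

```python
# Sketch: run a model over the held-out samples and save its outputs for scoring.
from pathlib import Path

def generate_modified_html(original_html: str, instruction: str) -> str:
    """Placeholder: each evaluate_*.py script implements its own model call."""
    raise NotImplementedError

for sample_dir in sorted(p for p in Path("data/evaluate_samples").iterdir() if p.is_dir()):
    original_html = (sample_dir / "original.html").read_text(encoding="utf-8")
    for instruction_file in sorted(sample_dir.glob("instruction_*.txt")):
        idx = instruction_file.stem.split("_")[-1]
        instruction = instruction_file.read_text(encoding="utf-8").strip()
        modified_html = generate_modified_html(original_html, instruction)
        (sample_dir / f"modified_{idx}.html").write_text(modified_html, encoding="utf-8")
        # Screenshots of original.html and modified_N.html are captured afterwards
        # (src/utils/capture_screenshot.py) and compared in src/metrics/metrics.py.
```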
Calculate performance metrics and generate results
```bash
# Calculate SSIM and CLIP similarity scores
python src/metrics/metrics.py

# Additional metric calculations
python src/metrics/[other_metric_files].py
```

Evaluation Metrics:
- SSIM: Structural similarity between original and modified screenshots
- CLIP Score: Visual semantic similarity using CLIP embeddings
- Manual Verification: Human judgment on instruction compliance
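For reference, the two automated metrics can be computed roughly as sketched below (SSIM via scikit-image, CLIP via Hugging Face Transformers); the exact preprocessing and model checkpoint used in `src/metrics/metrics.py` may differ.

```python
# Sketch: SSIM and CLIP similarity between two screenshots (preprocessing assumed).
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from transformers import CLIPModel, CLIPProcessor

def ssim_score(path_a: str, path_b: str) -> float:
    a = np.array(Image.open(path_a).convert("L").resize((512, 512)))
    b = np.array(Image.open(path_b).convert("L").resize((512, 512)))
    return float(ssim(a, b))

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = clip_model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])  # cosine similarity of the two image embeddings
```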
| Model | SSIM | CLIP |
|---|---|---|
| Qwen2.5-7B-Instruct (Ours) | 0.952 | 0.993 |
| GPT-4o-mini | 0.896 | 0.987 |
| Gemini-2.5-Pro | 0.883 | 0.979 |
| Qwen2.5-7B-Base | 0.796 | 0.975 |
| Qwen2.5-7B-VL | 0.764 | 0.960 |
| Model | Passes | Fails | Pass Rate |
|---|---|---|---|
| GPT-4o-mini | 29 | 21 | 58% |
| Qwen2.5-7B-Instruct (Ours) | 28 | 22 | 56% |
| Gemini-2.5-pro | 26 | 24 | 52% |
| Qwen2.5-7B-Base | 24 | 26 | 48% |
| Qwen2.5-7B-VL | 18 | 32 | 36% |
Key Findings:
- Fine-tuning on Instruct4Edit improves SSIM by +0.156 and pass rate by +8%
- Text-only approach outperforms vision-language models for this task
- Competitive performance with larger commercial models using smaller open-source base
Each sample in `instruction_tuning_data.json`:

```json
{
  "id": "sample_N",
  "instruction": "Natural language design modification instruction",
  "original_html": "Complete original HTML document",
  "modified_html": "Complete modified HTML document"
}
```

Each sample in `vl_instruction_tuning_data.json`:

```json
{
  "id": "sample_N",
  "instruction": "Natural language design modification instruction",
  "original_html": "Complete original HTML document",
  "modified_html": "Complete modified HTML document",
  "original_image": "data/images/original_sample_N.png"
}
```

```
data/evaluate_samples/sample_N/
├── original.html             # Base HTML file
├── instruction_N.txt         # Modification instruction
├── modified_N.html           # Modified HTML (if generated)
├── screenshot_original.png   # Original page screenshot (if captured)
├── screenshot_modified.png   # Modified page screenshot (if captured)
└── verification_N.txt        # Verification results (if performed)
```
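How a sample is converted into a supervised (prompt, target) pair is defined inside the training scripts; the template below is only an illustrative assumption that pairs the instruction and original HTML as input with the modified HTML as the target.

```python
# Sketch: load the dataset and build a (prompt, target) pair per sample (prompt template assumed).
import json

with open("data/datasets/instruction_tuning_data.json", encoding="utf-8") as f:
    dataset = json.load(f)

def to_training_pair(sample: dict) -> tuple[str, str]:
    prompt = (
        "Apply the following design instruction to the HTML document and "
        "return the complete rewritten document.\n\n"
        f"Instruction: {sample['instruction']}\n\nHTML:\n{sample['original_html']}"
    )
    return prompt, sample["modified_html"]

prompt, target = to_training_pair(dataset[0])
print(prompt[:200])
```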
```bash
# OpenAI API Key (for GPT evaluation)
OPENAI_API_KEY=your_openai_api_key_here

# Google API Key (for Gemini evaluation)
GOOGLE_API_KEY=your_google_api_key_here

# Hugging Face Token (for model downloads)
HUGGINGFACE_TOKEN=your_huggingface_token_here
```

Key dependencies:

- `torch>=2.0.0` - PyTorch for model training
- `transformers>=4.37.0` - HuggingFace Transformers
- `google-generativeai>=0.3.0` - Gemini API
- `openai>=1.12.0` - OpenAI API
- `selenium>=4.33.0` - Web scraping and screenshot capture
- `Pillow>=9.5.0` - Image processing
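The scripts read these keys from the environment; a minimal way to load them from `.env` is `python-dotenv` (whether the repository uses this package directly is an assumption).

```python
# Sketch: load API keys from .env before constructing the LLM clients.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

google_api_key = os.getenv("GOOGLE_API_KEY")
if not google_api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; copy .env.example to .env and fill it in.")
```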
- Research: Benchmark dataset for web editing capabilities of LLMs
- Training: Fine-tune models for automated web development
- Evaluation: Compare different approaches to instruction-following in UI modification
- Development: Build interactive web development tools with natural language interfaces
- Multi-Framework Support: Extend to React, Vue, and other frontend frameworks
- Interactive Editing: Real-time collaboration between humans and AI
- Component-Level Editing: Fine-grained modifications within web components
For questions or issues, please contact:
- Truong Hai Dang: [[email protected]]