Instruct4Edit: Envisioning Future Interactive Web Development


A clean, instruction-grounded dataset for full-page HTML rewriting, generated and verified by a multi-stage LLM-driven pipeline. This repository contains the automated data generation pipeline that uses LLMs to synthesize high-quality fine-tuning datasets for web editing tasks.

Authors: Truong Hai Dang, Jingyu Xiao, Yintong Huo
Conference: ACM AIWare 2025
Keywords: LLM, GUI Automation, Front-end Development

🎯 Abstract

The evolution of web applications relies on iterative code modifications, a process that is traditionally manual and time-consuming. While Large Language Models (LLMs) can generate UI code, their ability to edit existing code based on new design requirements (e.g., "center the logo") remains a challenge. This is largely due to the absence of large-scale, high-quality tuning data to align model performance with human expectations.

In this paper, we introduce a novel, automated data generation pipeline that uses LLMs to synthesize a high-quality fine-tuning dataset for web editing, named Instruct4Edit. Our approach generates diverse instructions, applies the corresponding code modifications, and performs visual verification to ensure correctness. By fine-tuning models on Instruct4Edit, we demonstrate consistent improvement in translating human intent into precise, structurally coherent, and visually accurate code changes.

🚀 Key Features

  • Fully Automated Pipeline: Three-stage LLM pipeline (Instruction Generator → HTML Editor → Visual Verifier)
  • High-Quality Dataset: 1,150 verified samples from 2,500 initial pairs (46% acceptance rate)
  • Visual Verification: Screenshot-based validation with 88% human agreement
  • Multi-Modal Evaluation: SSIM, CLIP similarity, and manual verification
  • Fine-Tuning Ready: Formatted datasets for training vision-language models

📁 Project Structure

Instruct4Edit/
├── .env.example           # Environment variables template
├── .gitignore            # Git ignore file
├── README.md             # This file
├── requirements.txt      # Python dependencies
├── data/
│   ├── datasets/         # Final training datasets (JSON format)
│   │   ├── instruction_tuning_data.json           # Filtered high-quality data
│   │   ├── unfiltered_instruction_tuning_data.json # Complete dataset for comparison
│   │   ├── vl_instruction_tuning_data.json        # VL dataset (filtered)
│   │   └── vl_unfiltered_instruction_tuning_data.json # VL dataset (unfiltered)
│   ├── evaluate_samples/ # Sample evaluation data
│   │   ├── sample_1/     # Individual sample directories
│   │   ├── sample_2/     # Each contains: original.html, instruction_N.txt, modified_N.html, etc.
│   │   ├── sample_3/
│   │   ├── sample_4/
│   │   ├── sample_5/
│   │   ├── sample_6/
│   │   └── ...
│   ├── samples/          # Raw generated samples from pipeline
│   │   └── Instruct4Edit/         # Instruct4Edit dataset
│   └── images/           # Screenshot images for VL training (generated by utils)
├── prompts/              # LLM prompts for each pipeline stage
│   ├── instruction_generation.md
│   ├── html_editing.md
│   └── verifying.md
└── src/
    ├── data_generation/  # LLM-based dataset generation pipeline
    │   ├── dataset_generator_gemini.py
    │   └── few_shot_examples.txt
    ├── evaluation/       # Model evaluation and comparison
    │   ├── evaluate_qwen_base.py
    │   ├── evaluate_qwen.py
    │   ├── evaluate_qwen_vl_base.py
    │   ├── evaluate_qwen_vl.py
    │   ├── evaluate_gpt_openrouter.py
    │   └── evaluate_gemini.py
    ├── filtering/        # Dataset quality filtering and processing
    │   ├── dataset_filter.py
    │   └── dataset_no_filter.py
    ├── metrics/          # Evaluation metric implementations
    │   └── metrics.py
    ├── train/            # Model fine-tuning scripts
    │   ├── train.py      # Text-only model training
    │   ├── train_vl.py   # Vision-language model training (filtered)
    │   ├── train_vl_unfiltered.py # VL training (unfiltered)
    │   └── train_llama.py # LLaMA model training
    └── utils/            # Screenshot capture and dataset utilities
        ├── capture_screenshot.py              # Core screenshot functionality using Selenium
        ├── capture_screenshot_vl.py           # VL dataset screenshot generation (filtered)
        ├── capture_screenshot_vl_unfiltered.py # VL dataset screenshot generation (unfiltered)
        ├── dataset_split.py                   # Dataset splitting utilities
        ├── scan_verification.py               # Instruct4Edit dataset scanning tools
        ├── utility_original_image.py          # Original image extraction and processing
        └── utility_update_vl_json.py

🛠️ Installation

  1. Clone the repository:

git clone https://github.com/dangtruong01/Instruct4Edit.git
cd Instruct4Edit

  2. Create and activate a virtual environment:

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

  3. Install dependencies:

pip install -r requirements.txt

  4. Set up environment variables:

cp .env.example .env
# Edit .env with your API keys:
# GOOGLE_API_KEY=your_google_api_key_here
# OPENAI_API_KEY=your_openai_api_key_here
# OPENROUTER_API_KEY=your_openrouter_api_key_here

  5. Install Chrome WebDriver (for screenshot capture):

# Download ChromeDriver from https://chromedriver.chromium.org/
# Add it to PATH or specify its path in your environment variables

⚠️ Important: The Python scripts contain hardcoded file paths that may not match your local setup. Before running any scripts, please review and update the file paths in the source code to match your directory structure. Look for paths like:

  • data/datasets/instruction_tuning_data.json
  • data/evaluate_samples/
  • data/images/
  • Model output directories

Update these paths in the scripts according to your local environment.

🔄 Complete Pipeline Workflow

Stage 1: Automated Dataset Generation

Three-stage LLM pipeline for generating instruction-modification pairs

# Generate complete dataset using Gemini-powered pipeline
python src/data_generation/dataset_generator_gemini.py

Pipeline Components:

  1. Instruction Generator: Creates 5 diverse, human-like design edit instructions per HTML sample using few-shot prompting
  2. HTML Editor: Applies each instruction to generate fully rewritten HTML documents
  3. Visual Verifier: Renders both versions and validates instruction compliance through cross-modal verification
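
The sketch below illustrates how such a three-stage loop can be wired up with the google-generativeai client. The prompt file names match the prompts/ directory; the model name, prompt concatenation, and output parsing are assumptions for illustration, not the exact logic of dataset_generator_gemini.py.

# Illustrative three-stage generation loop (assumed structure, not the actual
# implementation in src/data_generation/dataset_generator_gemini.py).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # model name is an assumption

def load_prompt(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def run_pipeline(original_html: str) -> list[dict]:
    samples = []
    # Stage 1: generate diverse, human-like edit instructions for this page.
    instr_prompt = load_prompt("prompts/instruction_generation.md")
    instructions = model.generate_content(
        f"{instr_prompt}\n\nHTML:\n{original_html}"
    ).text.splitlines()

    for instruction in [line for line in instructions if line.strip()][:5]:
        # Stage 2: rewrite the full HTML document according to the instruction.
        edit_prompt = load_prompt("prompts/html_editing.md")
        modified_html = model.generate_content(
            f"{edit_prompt}\n\nInstruction: {instruction}\n\nHTML:\n{original_html}"
        ).text

        # Stage 3: ask the verifier whether the edit satisfies the instruction
        # (the real pipeline renders both versions and verifies visually).
        verify_prompt = load_prompt("prompts/verifying.md")
        verdict = model.generate_content(
            f"{verify_prompt}\n\nInstruction: {instruction}\n\n"
            f"Original:\n{original_html}\n\nModified:\n{modified_html}"
        ).text

        samples.append({
            "instruction": instruction,
            "original_html": original_html,
            "modified_html": modified_html,
            "verification": verdict,
        })
    return samples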

Data Source: 500 seed HTML files from WebCode2M dataset

Outputs:

  • data/samples/Instruct4Edit/ - Raw generated samples from pipeline
  • Each sample contains instruction-HTML pairs with verification results

Dataset Statistics:

  • 2,500 initial instruction-HTML pairs generated (500 × 5 instructions)
  • Automated verification filters samples for quality
  • 88% agreement between automated and manual verification

Stage 2: Dataset Preprocessing & Filtering

Filter and prepare high-quality training datasets

# Create filtered high-quality dataset
python src/filtering/dataset_filter.py

# Create unfiltered dataset for comparison
python src/filtering/dataset_no_filter.py

Processing Steps:

  1. Quality Filtering: Applies verification criteria to select high-quality samples
  2. JSON Formatting: Converts raw samples to training-ready format
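
A minimal sketch of what this filtering step could look like is shown below. The per-sample file names and the "YES"-based pass marker are assumptions; dataset_filter.py may organize the raw samples differently.

# Hypothetical filtering pass: keep only samples whose verification result
# indicates the instruction was satisfied, then emit training-ready JSON.
import json
from pathlib import Path

RAW_DIR = Path("data/samples/Instruct4Edit")                  # raw pipeline output
OUT_PATH = Path("data/datasets/instruction_tuning_data.json")

def passed(verification_text: str) -> bool:
    # Assumed convention: the verifier's answer starts with "YES" on success.
    return verification_text.strip().upper().startswith("YES")

records = []
for sample_dir in sorted(RAW_DIR.iterdir()):
    if not sample_dir.is_dir():
        continue
    original_html = (sample_dir / "original.html").read_text(encoding="utf-8")
    for instr_file in sorted(sample_dir.glob("instruction_*.txt")):
        idx = instr_file.stem.split("_")[-1]
        modified = sample_dir / f"modified_{idx}.html"
        verification = sample_dir / f"verification_{idx}.txt"
        if not (modified.exists() and verification.exists()):
            continue
        if not passed(verification.read_text(encoding="utf-8")):
            continue  # quality filter: drop unverified edits
        records.append({
            "id": f"{sample_dir.name}_{idx}",
            "instruction": instr_file.read_text(encoding="utf-8").strip(),
            "original_html": original_html,
            "modified_html": modified.read_text(encoding="utf-8"),
        })

OUT_PATH.write_text(json.dumps(records, indent=2), encoding="utf-8")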

Outputs:

  • data/datasets/instruction_tuning_data.json - Filtered high-quality training data
  • data/datasets/unfiltered_instruction_tuning_data.json - Complete dataset for comparison

Stage 3: Vision-Language Dataset Creation

Generate screenshot images and create multimodal datasets

# Create VL dataset from filtered data (generates screenshots + VL JSON)
python src/utils/capture_screenshot_vl.py

# Create VL dataset from unfiltered data  
python src/utils/capture_screenshot_vl_unfiltered.py

Vision-Language Processing:

  1. capture_screenshot_vl.py: Processes filtered dataset

  2. capture_screenshot_vl_unfiltered.py: Processes unfiltered dataset

  3. capture_screenshot.py: Core screenshot functionality using Selenium WebDriver
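
The core screenshot step with Selenium might look roughly like the following; the window size and example paths are assumptions rather than the exact settings in capture_screenshot.py.

# Render a local HTML file in headless Chrome and save a screenshot.
from pathlib import Path
from selenium import webdriver

def capture_screenshot(html_path: str, output_png: str,
                       width: int = 1280, height: int = 800) -> None:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")        # run without a visible window
    options.add_argument(f"--window-size={width},{height}")
    driver = webdriver.Chrome(options=options)    # ChromeDriver must be on PATH
    try:
        driver.get(Path(html_path).resolve().as_uri())   # load via file:// URL
        driver.save_screenshot(output_png)
    finally:
        driver.quit()

# Example: screenshot the original page of one evaluation sample.
capture_screenshot("data/evaluate_samples/sample_1/original.html",
                   "data/evaluate_samples/sample_1/screenshot_original.png")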

Outputs:

  • data/datasets/vl_instruction_tuning_data.json - VL training data (filtered)
  • data/datasets/vl_unfiltered_instruction_tuning_data.json - VL training data (unfiltered)
  • data/images/ - Screenshot images referenced by the VL datasets

Stage 4: Model Fine-Tuning

Train models on Instruct4Edit datasets

# Fine-tune text-only model on filtered dataset
python src/train/train.py

# Fine-tune vision-language model on filtered dataset  
python src/train/train_vl.py

# Fine-tune VL model on unfiltered dataset (for comparison)
python src/train/train_vl_unfiltered.py

# Alternative: Fine-tune LLaMA model
python src/train/train_llama.py

Training Configurations:

  1. train.py: Text-only Qwen2.5-7B fine-tuning

  2. train_vl.py: Vision-language Qwen2.5-VL-7B training

  3. train_vl_unfiltered.py: VL training on unfiltered data

Training Settings:

  • Method: LoRA (Low-Rank Adaptation) for parameter efficiency
  • Batch Size: 1, Gradient Accumulation: 8 steps
  • Learning Rate: 2e-5, Epochs: 3
  • Max Sequence Length: 8192 tokens for full HTML documents
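
To make these settings concrete, a LoRA fine-tuning script built on transformers, peft, and datasets could look like the sketch below. The base checkpoint name, LoRA rank/alpha, target modules, prompt template, and output directory are assumptions; only the batch size, gradient accumulation, learning rate, epochs, and 8192-token limit come from the settings above.

# Sketch of LoRA fine-tuning with the hyperparameters listed above.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "Qwen/Qwen2.5-7B-Instruct"            # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# LoRA adapter; rank, alpha, and target modules are illustrative choices.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

def to_features(example):
    # Prompt = instruction + original page; target = fully rewritten page.
    text = (f"Instruction: {example['instruction']}\n\n"
            f"Original HTML:\n{example['original_html']}\n\n"
            f"Modified HTML:\n{example['modified_html']}")
    return tokenizer(text, truncation=True, max_length=8192)

dataset = load_dataset(
    "json", data_files="data/datasets/instruction_tuning_data.json", split="train"
).map(to_features, remove_columns=["id", "instruction", "original_html", "modified_html"])

args = TrainingArguments(
    output_dir="outputs/instruct4edit-lora",       # assumed output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()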

Stage 5: Model Evaluation

Evaluate different models on web editing tasks

# Evaluate base Qwen2.5-7B model (no fine-tuning)
python src/evaluation/evaluate_qwen_base.py

# Evaluate base Qwen2.5-VL model with vision input
python src/evaluation/evaluate_qwen_vl_base.py

# Evaluate fine-tuned models (text-only and vision-language)
python src/evaluation/evaluate_qwen.py
python src/evaluation/evaluate_qwen_vl.py

# Evaluate commercial baselines
python src/evaluation/evaluate_gpt_openrouter.py
python src/evaluation/evaluate_gemini.py

Evaluation Process:

  • Tests models on held-out samples from data/evaluate_samples/
  • Generates modified HTML for each instruction
  • Captures screenshots for visual comparison
  • Saves results for metric calculation
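
A sketch of this per-sample evaluation loop, assuming the data/evaluate_samples/ layout shown later in this README; generate_edit is a hypothetical stand-in for whichever model backend is being evaluated.

# Hypothetical evaluation driver: apply every instruction of every held-out
# sample with the model under test and save the resulting HTML.
from pathlib import Path

SAMPLES = Path("data/evaluate_samples")

def generate_edit(instruction: str, original_html: str) -> str:
    # Placeholder for the model under test (fine-tuned Qwen, GPT-4o-mini, Gemini, ...).
    # A real implementation would send the instruction and HTML to the model and
    # return the rewritten document; here we simply echo the original page.
    return original_html

for sample_dir in sorted(SAMPLES.glob("sample_*")):
    original_html = (sample_dir / "original.html").read_text(encoding="utf-8")
    for instr_file in sorted(sample_dir.glob("instruction_*.txt")):
        idx = instr_file.stem.split("_")[-1]
        instruction = instr_file.read_text(encoding="utf-8").strip()
        modified_html = generate_edit(instruction, original_html)
        (sample_dir / f"modified_{idx}.html").write_text(modified_html, encoding="utf-8")
        # Screenshots for SSIM/CLIP scoring can then be captured with the
        # Selenium helper sketched in Stage 3.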

Stage 6: Metrics & Analysis

Calculate performance metrics and generate results

# Calculate SSIM and CLIP similarity scores
python src/metrics/metrics.py

Evaluation Metrics:

  • SSIM: Structural similarity between original and modified screenshots
  • CLIP Score: Visual semantic similarity using CLIP embeddings
  • Manual Verification: Human judgment on instruction compliance
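
The two automatic metrics can be computed roughly as follows; the CLIP checkpoint, image size, and pairing of screenshots are assumptions, and metrics.py may organize this differently.

# Compute SSIM and CLIP similarity between two screenshots.
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ssim_score(path_a: str, path_b: str) -> float:
    # Resize to a common shape and compare grayscale structure.
    a = Image.open(path_a).convert("L").resize((1280, 800))
    b = Image.open(path_b).convert("L").resize((1280, 800))
    return structural_similarity(np.asarray(a), np.asarray(b))

def clip_score(path_a: str, path_b: str) -> float:
    # Cosine similarity between CLIP image embeddings of the two screenshots.
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip_model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] @ feats[1]).item())

print(ssim_score("screenshot_original.png", "screenshot_modified.png"))
print(clip_score("screenshot_original.png", "screenshot_modified.png"))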

📊 Evaluation Results

Quantitative Metrics (SSIM/CLIP Scores)

Model                        SSIM    CLIP
Qwen2.5-7B-Instruct (Ours)   0.952   0.993
GPT-4o-mini                  0.896   0.987
Gemini-2.5-Pro               0.883   0.979
Qwen2.5-7B-Base              0.796   0.975
Qwen2.5-7B-VL                0.764   0.960

Human Evaluation (Pass Rate)

Model                        Passes   Fails   Pass Rate
GPT-4o-mini                  29       21      58%
Qwen2.5-7B-Instruct (Ours)   28       22      56%
Gemini-2.5-Pro               26       24      52%
Qwen2.5-7B-Base              24       26      48%
Qwen2.5-7B-VL                18       32      36%

Key Findings:

  • Fine-tuning on Instruct4Edit improves SSIM by +0.156 and pass rate by +8 percentage points over the base model
  • The text-only approach outperforms vision-language models on this task
  • A smaller open-source base model achieves performance competitive with larger commercial models

📝 Dataset Formats

Text-Only Dataset Format

Each sample in instruction_tuning_data.json:

{
  "id": "sample_N",
  "instruction": "Natural language design modification instruction",
  "original_html": "Complete original HTML document",
  "modified_html": "Complete modified HTML document"
}

Vision-Language Dataset Format

Each sample in vl_instruction_tuning_data.json:

{
  "id": "sample_N", 
  "instruction": "Natural language design modification instruction",
  "original_html": "Complete original HTML document",
  "modified_html": "Complete modified HTML document",
  "original_image": "data/images/original_sample_N.png",
}
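
To make the formats concrete, the sketch below loads one VL record and turns it into a prompt/target pair plus the rendered screenshot; the prompt wording is an assumption, while the field names come from the formats above.

# Load one VL record and build a (prompt, target, image) training triple.
import json
from PIL import Image

with open("data/datasets/vl_instruction_tuning_data.json", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
prompt = (
    "Apply the following design instruction and return the complete "
    "rewritten HTML document.\n\n"
    f"Instruction: {sample['instruction']}\n\n"
    f"Original HTML:\n{sample['original_html']}"
)
target = sample["modified_html"]

# Text-only records stop here; VL records additionally supply a screenshot
# of the original page as visual context.
image = Image.open(sample["original_image"])
print(len(prompt), len(target), image.size)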

Physical Sample Directory Structure

data/evaluate_samples/sample_N/
├── original.html           # Base HTML file
├── instruction_N.txt       # Modification instruction
├── modified_N.html         # Modified HTML (if generated)
├── screenshot_original.png # Original page screenshot (if captured)
├── screenshot_modified.png # Modified page screenshot (if captured)
└── verification_N.txt      # Verification results (if performed)

🔧 Configuration

Environment Variables (.env)

# OpenAI API Key (for GPT evaluation)
OPENAI_API_KEY=your_openai_api_key_here

# Google API Key (for Gemini evaluation)  
GOOGLE_API_KEY=your_google_api_key_here

# Hugging Face Token (for model downloads)
HUGGINGFACE_TOKEN=your_huggingface_token_here
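
Scripts that call external APIs can read these values with python-dotenv (not listed under Key Dependencies, so this is an assumption about how the scripts load them):

# Load API keys from .env into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

google_key = os.getenv("GOOGLE_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
hf_token = os.getenv("HUGGINGFACE_TOKEN")

if google_key is None:
    raise RuntimeError("GOOGLE_API_KEY is not set; copy .env.example to .env first")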

Key Dependencies

  • torch>=2.0.0 - PyTorch for model training
  • transformers>=4.37.0 - HuggingFace transformers
  • google-generativeai>=0.3.0 - Gemini API
  • openai>=1.12.0 - OpenAI API
  • selenium>=4.33.0 - Browser automation for screenshot capture
  • Pillow>=9.5.0 - Image processing

🎯 Use Cases

  1. Research: Benchmark dataset for web editing capabilities of LLMs
  2. Training: Fine-tune models for automated web development
  3. Evaluation: Compare different approaches to instruction-following in UI modification
  4. Development: Build interactive web development tools with natural language interfaces

📈 Future Work

  • Multi-Framework Support: Extend to React, Vue, and other frontend frameworks
  • Interactive Editing: Real-time collaboration between humans and AI
  • Component-Level Editing: Fine-grained modifications within web components

📧 Contact

For questions or issues, please open an issue on this repository or contact the authors listed above.
