Cloning Bench: Evaluating AI Agents on Visual Website Cloning

We introduce Cloning Bench, a benchmark for evaluating how well autonomous AI agents can clone websites. Each agent is given a reference recording of a real website (Slack, to start) and is tasked with building a React front-end that visually matches it.

Agents run unattended in isolated Docker containers with access to browser automation, visual testing tools, and the reference material. Clones are scored with SSIM (Structural Similarity Index), computed between screenshots of the clone and the reference screenshots from the recording.
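For readers unfamiliar with SSIM, here is a minimal sketch of the underlying formula. It is a simplified single-window variant over flat lists of grayscale values, written only for illustration; the function name global_ssim and its parameters are assumptions, and the benchmark's own tools most likely use a windowed implementation from an imaging library.

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over two equal-length sequences of grayscale values.

    Illustrative only: real SSIM implementations average this quantity
    over many local windows instead of computing it once per image.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    # Stabilising constants from the standard SSIM definition.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) * (a - mu_x) for a in x)
    var_y = fmean((b - mu_y) * (b - mu_y) for b in y)
    cov = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [0, 50, 100, 150, 200, 250]
    inverted = [255 - v for v in reference]
    print(global_ssim(reference, reference))  # identical inputs score 1.0
    print(global_ssim(reference, inverted))   # structurally opposite inputs score far lower
```

A score of 1.0 means the two images are identical under this measure, and lower values indicate growing structural differences.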

Results

Each model ran for 6 hours, and SSIM was measured throughout the run.

Rank	Model	Final Avg SSIM	Peak Assertion SSIM	SSIM Improvement	Test Runs	Test Success Rate	Source Lines (JSX)	Source Lines (CSS)	Assets Extracted	Interactive Features
1	Gemini	0.871	0.910	+0.254	41	71%	2,194	467 (+4.8MB prod)	62	None
2	Claude	0.757	0.790	+0.142	14	71%	925	1,657	34	Full
3	GLM	0.723	0.728	+0.060	91	25%	677	998	20	Full
4	Codex	0.583	0.606	-0.010	46	43%	483	782	19	Full

Quick Start

With Nix (recommended)

# 1. Enter the dev shell (provides Python, Node.js, uv, and all CLI tools)
nix develop

# 2. Build and run an agent container
nix build .#claude-container
docker load < result
docker run -e AWS_BEARER_TOKEN_BEDROCK \
  -v ./recordings:/bench/recordings:ro \
  -v ./workspace:/bench/workspace \
  cloning-bench-claude:latest

Without Nix

# Requires Python 3.12+ and uv
uv sync

# The CLI tools (site-test, site-test-diff, lookatdiff) are now available
site-test ./recordings/1 http://localhost:5173

How it works

cloning-bench/
│
├── recordings/              Reference recordings (screenshots, DOM, assets)
│
├── agents/
│   ├── claude/              Claude agent harness (Anthropic)
│   ├── codex/               Codex agent harness (OpenAI)
│   ├── gemini/              Gemini agent harness (Google)
│   └── glm/                 GLM agent harness
│
├── packages/
│   ├── test/                site-test: visual compliance testing framework
│   └── lookatdiff/          LLM-powered diff analysis tool
│
└── flake.nix                Nix flake for reproducible containers and dev env

Each agent runs in a Docker container built with Nix. The container includes:

Node.js 24 and Chromium for building and previewing the React app
Python 3.12 with the visual testing tools
agent-browser for headless browser automation (1280x720 viewport)
Git for version control within the workspace
The agent's own CLI (Claude Code, Codex CLI, Gemini CLI, or GLM/Pi CLI)

The agent reads the recording data, builds a Vite + React project, and enters an infinite test-fix loop:

Study the reference (DOM snapshots, accessibility trees, computed styles, assets)
Build or improve React components
Run site-test to capture screenshots and compare against the reference
Analyze visual diffs to identify remaining differences
Fix the differences and repeat

Agents are killed externally when time is up. There is no "done" state — the goal is to maximize SSIM scores in the time available.
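The external time limit can be pictured with a small sketch. This is not the benchmark's harness, only an illustration of running a long-lived process under a fixed budget and killing it when the budget runs out; run_with_budget and the placeholder child command are hypothetical.

```python
import subprocess
import sys


def run_with_budget(command, budget_seconds):
    """Run a command and kill it once the time budget is exhausted.

    Returns the process exit code, or None if the process had to be killed.
    """
    process = subprocess.Popen(command)
    try:
        return process.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        process.kill()   # the child never reaches a "done" state on its own
        process.wait()   # reap the killed process
        return None


if __name__ == "__main__":
    # Placeholder child that would otherwise run far longer than its budget.
    endless = [sys.executable, "-c", "import time; time.sleep(3600)"]
    print(run_with_budget(endless, budget_seconds=1))  # prints None
```

For a run like the ones reported above, the budget would be 6 * 60 * 60 seconds.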

Recording structure

Each recording captures a browsing session with one or more assertion points:

recordings/<index>/
├── video.mp4                   Full session video
├── screenplay.json             Test script with actions and assertions
├── screenshots/
│   ├── 0/                      Per-assertion directory
│   │   ├── screenshot.png      Reference screenshot (1280x720)
│   │   ├── dom.html            Full HTML snapshot
│   │   ├── manifest.json       Asset URL -> SHA256 hash mapping
│   │   ├── full/
│   │   │   ├── axtree.txt      Accessibility tree
│   │   │   └── styles.json     Computed CSS values
│   │   └── viewport/
│   │       ├── axtree.txt      Viewport-scoped accessibility tree
│   │       └── styles.json     Viewport-scoped styles
│   └── 1/, 2/, ...
└── assets/
    └── <sha256-hash>           Deduplicated assets (images, icons, fonts)

Agents use dom.html and axtree.txt to understand page structure, styles.json for CSS values, and manifest.json to extract assets. They must not copy dom.html verbatim or use reference screenshots as image sources — the UI must be rewritten as proper React components.
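As an illustration of asset extraction, the sketch below resolves the asset URLs in one assertion's manifest.json to files under assets/. It assumes the manifest is a flat JSON object mapping each URL to a SHA-256 hex digest and that each asset file is named by the digest of its contents; both are inferred from the layout above, not confirmed. The helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def resolve_assets(recording_dir, assertion_index):
    """Map each asset URL in one assertion's manifest to its file under assets/."""
    recording = Path(recording_dir)
    manifest_path = recording / "screenshots" / str(assertion_index) / "manifest.json"
    # Assumption: manifest.json is a flat object of {asset_url: sha256_hex}.
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {url: recording / "assets" / digest for url, digest in manifest.items()}


def matches_digest(path):
    """Check that a file's SHA-256 equals its filename (assumed naming scheme)."""
    path = Path(path)
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name


if __name__ == "__main__":
    for url, asset_path in resolve_assets("./recordings/1", 0).items():
        status = "ok" if asset_path.exists() and matches_digest(asset_path) else "missing or mismatched"
        print(f"{url} -> {asset_path} ({status})")
```

From there, an agent could copy the resolved files into the React project's public directory under readable names.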

Visual testing

The benchmark includes the following testing tools:

site-test

Executes the screenplay against a running clone, captures screenshots at each assertion point, and computes SSIM scores against the reference.

site-test <recording-dir> <clone-url>
site-test ./recordings/1 http://localhost:5173 --output-dir ./report

Output is a timestamped report folder containing:

summary.json — overall pass/fail, step counts, duration
execution-log.json — per-step results with SSIM scores
asserts/<N>/recording.png — reference screenshot
asserts/<N>/subject.png — clone screenshot
asserts/<N>/diff.png — visual diff overlay
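To review a report programmatically, the asserts/<N>/ layout above can be walked as in the following sketch. It relies only on the file names listed here and does not parse summary.json or execution-log.json, since their schemas are not described in this README; list_assertions is a hypothetical helper, and the report path in the usage example is a placeholder.

```python
from pathlib import Path


def list_assertions(report_dir):
    """Return (index, recording, subject, diff) tuples for each asserts/<N>/ folder."""
    asserts_dir = Path(report_dir) / "asserts"
    results = []
    for entry in asserts_dir.iterdir():
        # Only numeric directory names are treated as assertion indices.
        if entry.is_dir() and entry.name.isdigit():
            results.append(
                (
                    int(entry.name),
                    entry / "recording.png",  # reference screenshot
                    entry / "subject.png",    # clone screenshot
                    entry / "diff.png",       # visual diff overlay
                )
            )
    # Sort numerically so that assertion 10 comes after assertion 2.
    return sorted(results)


if __name__ == "__main__":
    # Placeholder path: point this at an actual timestamped report folder.
    for index, recording, subject, diff in list_assertions("./report"):
        print(index, recording, subject, diff)
```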

site-test-diff

Generates a visual diff between any two screenshots. LLM-based dynamic content detection is applied unless it is disabled with --no-dynamic-detection.

site-test-diff <reference.png> <subject.png> <output.png>
site-test-diff ref.png clone.png diff.png --no-dynamic-detection

Diff color coding

Color	Meaning	Action
Red	Structural differences	Must be fixed
Blue/Cyan	Dynamic content (timestamps, ads, counters)	Can be ignored

lookatdiff

Uses the Gemini API to analyze what visual differences mean and suggest fixes.

lookatdiff <subject.png> <diff.png> <actual.png> [-q "What needs fixing?"]

Requires GEMINI_API_KEY in the environment.

Agents

Four agents are currently supported:

Agent	Provider	CLI	Config
Claude	Anthropic	Claude Code	`agents/claude/CLAUDE.md`
Codex	OpenAI	Codex CLI	`agents/codex/AGENTS.md`
Gemini	Google	Gemini CLI	`agents/gemini/GEMINI.md`
GLM	—	Pi CLI	`agents/glm/AGENTS.md`

Each agent directory contains:

Agent-specific instructions (system prompt / markdown config)
Nix derivation for building the Docker container (nix/default.nix)
Skill definitions for agent-browser, site-test, and asset-handling

Building containers

# Build a specific agent container
nix build .#claude-container
nix build .#codex-container
nix build .#gemini-container
nix build .#glm-container

# Load and run
docker load < result
docker run -e AWS_BEARER_TOKEN_BEDROCK \
  -v ./recordings:/bench/recordings:ro \
  -v ./workspace:/bench/workspace \
  cloning-bench-claude:latest

Post-run analysis

After a run completes, use extract-transcripts to pull conversation logs, token usage, and cost data from the workspace:

extract-transcripts <workspace-dir>

This produces:

Per-agent conversation transcripts (JSONL or JSON)
Token usage summaries
A unified cost-report.json across all agents
Screenshot archives showing visual progression over time

Development

Nix devshell

nix develop

This provides:

Python 3.12 with uv and all project dependencies
Node.js 24 for the React build toolchain
Chromium for headless browser automation
direnv support — the .envrc activates the devshell automatically

Blogpost

https://vibrantlabs.com/blog/cloning-bench

License

MIT
