feat(browser): browser automation skill via playwright-mcp integration #2186

@bug-ops

Summary

Zeph's current web capabilities are limited to static HTTP scraping (web_scrape / fetch). This covers about 50% of real-world use cases. Modern SPAs (React/Vue/Next.js), authenticated workflows, dynamic content, and form interactions require a real browser.

This issue proposes integrating browser automation via @playwright/mcp as an optional MCP server, complemented by a new browser skill that teaches the LLM a cost-aware escalation strategy.

Current state

WebScrapeExecutor (crates/zeph-tools/src/scrape.rs):

  • Static HTML + CSS selector extraction
  • HTTPS-only + SSRF protection
  • Fast and token-efficient

Gaps:

  • No JS rendering — SPAs return empty/broken HTML
  • No interaction — cannot click, type, authenticate
  • Blocked by Cloudflare and bot-detection

Proposed solution

Phase 1 — MCP config + skill (zero new Rust code)

Add @playwright/mcp as an optional pre-configured [[mcp.servers]] entry:

[[mcp.servers]]
id = "browser"
transport = "stdio"
command = "npx"
args = ["@playwright/mcp@latest", "--headless"]

Or via Docker (headless, no Node.js on host):

docker run -d -p 8931:8931 --entrypoint node mcr.microsoft.com/playwright/mcp cli.js \
  --headless --browser chromium --no-sandbox --port 8931

Create .zeph/skills/browser/SKILL.md with a decision tree:

| Scenario | Tool |
|---|---|
| Static HTML | web_scrape (fast, no overhead) |
| SPA / JS-rendered page | browser_navigate + browser_snapshot |
| Form fill / login flow | browser_click + browser_type |
| Visual capture | browser_take_screenshot |
| JS data extraction | browser_evaluate |
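The decision tree above could be mirrored in code roughly as follows. This is a minimal sketch, not part of Zeph: the `Scenario` enum and the functions `tools_for`, `needs_browser`, and `escalate_after_scrape` are hypothetical names introduced for illustration, and the "empty scrape result means JS-rendered page" rule is an assumed heuristic based on the gap noted earlier (SPAs return empty/broken HTML).

```rust
/// Hypothetical categories mirroring the rows of the decision table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scenario {
    StaticHtml,
    JsRendered,
    FormOrLogin,
    VisualCapture,
    JsDataExtraction,
}

/// Maps a scenario to the tool names listed for it in the table.
pub fn tools_for(scenario: Scenario) -> &'static [&'static str] {
    match scenario {
        Scenario::StaticHtml => &["web_scrape"],
        Scenario::JsRendered => &["browser_navigate", "browser_snapshot"],
        Scenario::FormOrLogin => &["browser_click", "browser_type"],
        Scenario::VisualCapture => &["browser_take_screenshot"],
        Scenario::JsDataExtraction => &["browser_evaluate"],
    }
}

/// True when the scenario requires the browser MCP server
/// instead of the cheaper static scraper.
pub fn needs_browser(scenario: Scenario) -> bool {
    scenario != Scenario::StaticHtml
}

/// Assumed heuristic: if a static scrape yielded no text, treat the page
/// as JS-rendered and escalate; otherwise stay on the cheap path.
pub fn escalate_after_scrape(extracted_text: &str) -> Option<Scenario> {
    if extracted_text.trim().is_empty() {
        Some(Scenario::JsRendered)
    } else {
        None
    }
}
```

In the proposal itself this logic lives in SKILL.md as instructions for the LLM; the sketch only makes the cost ordering explicit (try web_scrape first, escalate to the browser tools when it comes back empty).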

Key playwright-mcp tools to expose (core group only, 19 tools):

  • browser_navigate, browser_snapshot (accessibility tree — token-efficient)
  • browser_click, browser_type, browser_hover, browser_select_option
  • browser_evaluate (arbitrary JS execution)
  • browser_take_screenshot, browser_console_messages, browser_wait_for
  • Tab management: browser_tab_new, browser_tab_close, browser_tab_list

Phase 2 — BrowserConfig + init wizard

  • Add [browser] config section to zeph-tools/src/config.rs
  • Wire into --init wizard: detect Node.js/Docker, offer auto-config
  • Wire into --migrate-config for adding [browser] defaults

Proposed config schema:

[browser]
enabled = false
transport = "stdio"           # "stdio" | "http"
command = "npx"
args = ["@playwright/mcp@latest", "--headless"]
url = ""                      # for http transport
caps = []                     # optional: ["vision", "pdf"]
max_tabs = 5
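One possible shape for the struct behind this schema is sketched below, using only the standard library. The field names and defaults follow the proposed TOML; everything else is an assumption: the `BrowserTransport` enum, the `validate` method, and its rules are hypothetical, and a real implementation would presumably derive a deserializer (for example via serde) rather than be constructed by hand.

```rust
/// Transport options from the schema comment: "stdio" | "http".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BrowserTransport {
    Stdio,
    Http,
}

/// Sketch of the [browser] section; fields mirror the proposed TOML keys.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BrowserConfig {
    pub enabled: bool,
    pub transport: BrowserTransport,
    pub command: String,
    pub args: Vec<String>,
    pub url: String,       // only meaningful for the http transport
    pub caps: Vec<String>, // optional capabilities, e.g. "vision", "pdf"
    pub max_tabs: usize,
}

impl Default for BrowserConfig {
    /// Defaults copied from the proposed schema.
    fn default() -> Self {
        Self {
            enabled: false,
            transport: BrowserTransport::Stdio,
            command: "npx".to_string(),
            args: vec!["@playwright/mcp@latest".to_string(), "--headless".to_string()],
            url: String::new(),
            caps: Vec::new(),
            max_tabs: 5,
        }
    }
}

impl BrowserConfig {
    /// Hypothetical consistency check: a disabled section is always valid;
    /// an enabled one needs the field its transport relies on.
    pub fn validate(&self) -> Result<(), String> {
        if !self.enabled {
            return Ok(());
        }
        match self.transport {
            BrowserTransport::Stdio if self.command.trim().is_empty() => {
                return Err("stdio transport requires a non-empty command".to_string());
            }
            BrowserTransport::Http if self.url.trim().is_empty() => {
                return Err("http transport requires a non-empty url".to_string());
            }
            _ => {}
        }
        if self.max_tabs == 0 {
            return Err("max_tabs must be at least 1".to_string());
        }
        Ok(())
    }
}
```

Validating at load time would let the --init wizard and --migrate-config report a missing url or command before an MCP connection is attempted.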

Phase 3 — Native BrowserExecutor (optional)

If MCP latency or the Node.js dependency proves unacceptable, implement crates/zeph-tools/src/browser.rs as a native ToolExecutor using a Rust WebDriver/CDP crate. Pursue this only if Phase 1 proves insufficient.

Why playwright-mcp

  • Maintained by Microsoft (Playwright team); GitHub Copilot and Claude Code use it natively
  • Both stdio and HTTP/SSE transports — directly compatible with Zeph's rmcp client
  • Accessibility snapshot mode (default): structured refs, ~4x fewer tokens than the screenshot approach
  • Official Docker image for headless deployment
  • Apache 2.0 license

Alternatives evaluated:

  • @modelcontextprotocol/server-puppeteer (deprecated, archived May 2025)
  • browserbase — paid cloud, vendor lock-in
  • browsermcp.io — desktop Chrome extension only, not headless
  • rust-browser-mcp — community project, immature, WebDriver limitations

Acceptance criteria

  • [[mcp.servers]] example for playwright-mcp in config.toml.example / docs
  • .zeph/skills/browser/SKILL.md with escalation decision tree and tool usage guide
  • [browser] config section backed by a BrowserConfig struct
  • --init wizard detects Node.js/Docker and offers browser auto-config
  • --migrate-config adds [browser] defaults
  • Docs: docs/src/configuration.md browser section
  • Live session test: navigate to a JS-rendered page, extract content via browser_snapshot

Open questions

  1. Should browser state (cookies, localStorage) persist across agent turns?
  2. Screenshots: inline base64 in ToolOutput or written to .local/ + referenced by path?
  3. Token budget: should browser_snapshot output be summarized before feeding to LLM on large pages?
  4. SSRF: browser can access internal network — should the same SSRF rules from WebScrapeExecutor apply via skill constraints?
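For question 4, one building block would be an address check applied to a navigation target before it is handed to browser_navigate. The sketch below is an assumption-laden illustration: `is_internal_address` is a hypothetical helper, the blocked ranges are a common baseline rather than the actual WebScrapeExecutor rules (which are not shown here), and it does not cover DNS resolution, redirects, or subresource requests made by the page itself.

```rust
use std::net::IpAddr;

/// Returns true for addresses a browser tool should probably not reach:
/// loopback, private, link-local, unspecified, and broadcast ranges.
/// (Hypothetical helper; the real SSRF rules may differ.)
pub fn is_internal_address(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback()
                || v4.is_private()
                || v4.is_link_local() // includes 169.254.169.254 metadata endpoints
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            // IPv4-mapped addresses (::ffff:a.b.c.d) are checked as IPv4.
            if let Some(mapped) = v6.to_ipv4_mapped() {
                return is_internal_address(IpAddr::V4(mapped));
            }
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}
```

Because a real browser resolves names and follows redirects on its own, a check like this can only be a first filter; enforcing the same policy inside the browser would likely need network-level controls on the playwright-mcp process or container.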

Research notes

Full research report: .local/reports/browser-skill-research.md

Labels: P3 (Research, medium-high complexity), enhancement (New feature or request), skills (zeph-skills crate)