Web scraper with LLM-powered structured data extraction.
Ares fetches web pages, converts HTML to Markdown, and uses LLM APIs to extract structured data defined by JSON Schemas. It exposes both a CLI and a REST API, supports persistent job queues with retries, circuit breaking, rate-limiting, change detection, and graceful shutdown.
Named after the Greek god of war and courage.
Conceptual sibling of Ceres — same philosophy, different temperament. Where Ceres is the nurturing goddess of harvest, Ares charges headfirst into the web and takes what it needs.
💡 Claude Code user? Install the Ares Claude Skill to give Claude deep knowledge of Ares — architecture, traits, CLI, REST API, schemas, and extension patterns.
Ares 0.1.0 releases on crates.io on March 1st, 2026. This skill will be published to the Anthropic Skill Marketplace shortly after.
| Crate | Description |
|---|---|
| `ares-cli` | CLI interface — arg parsing, wiring, delegation |
| `ares-api` | REST API — Axum HTTP server, OpenAPI/Swagger UI, Bearer auth |
| `ares-core` | Business logic — ScrapeService, WorkerService, CircuitBreaker, Throttle, traits |
| `ares-client` | External adapters — HTTP fetcher, headless browser fetcher, HTML cleaner, LLM client |
| `ares-db` | PostgreSQL persistence — ExtractionRepository, ScrapeJobRepository, migrations |
All external dependencies are behind traits (Fetcher, Cleaner, Extractor, ExtractionStore, ExtractorFactory, JobQueue), enabling full mock-based testing. The Fetcher trait has two implementations: ReqwestFetcher for static pages and BrowserFetcher (feature-gated behind browser) for JS-rendered SPAs.
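To illustrate the seam these traits provide, here is a minimal, hypothetical sketch. The `Fetcher` name comes from the list above, but the method signature and the `FakeFetcher` test double are assumptions made for this example (the real trait is likely async):

```rust
// Hypothetical sketch of the trait seam described above. Putting I/O behind
// a trait lets tests swap in an in-memory fake instead of hitting the network.

trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

// A test double: returns canned HTML instead of performing an HTTP request.
struct FakeFetcher {
    body: String,
}

impl Fetcher for FakeFetcher {
    fn fetch(&self, _url: &str) -> Result<String, String> {
        Ok(self.body.clone())
    }
}

// Service code depends only on the trait, so it runs identically against a
// production fetcher (e.g. ReqwestFetcher) and against FakeFetcher in tests.
fn scrape(fetcher: &dyn Fetcher, url: &str) -> Result<usize, String> {
    let html = fetcher.fetch(url)?;
    Ok(html.len())
}

fn main() {
    let fake = FakeFetcher { body: "<html>hi</html>".into() };
    println!("{}", scrape(&fake, "https://example.com").unwrap());
}
```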
- Rust 1.88+ (edition 2024)
- Docker (for PostgreSQL and integration tests)
- An OpenAI-compatible API key (OpenAI, Gemini, or any compatible endpoint)
- Chromium / Chrome (only when using `--browser` for JS-rendered pages)
```bash
# Clone and build
git clone <repo-url> && cd Ares
cargo build

# Start PostgreSQL
docker compose up -d

# Configure environment
cp .env.example .env
# Edit .env with your API key and settings
```
```bash
# One-shot scrape (stdout only)
cargo run -- scrape -u https://example.com -s schemas/blog/1.0.0.json

# Scrape a JS-rendered page with headless browser
cargo run --features browser -- scrape -u https://spa-example.com -s blog@latest --browser

# Scrape and persist to database
cargo run -- scrape -u https://example.com -s blog@latest --save

# View extraction history
cargo run -- history -u https://example.com -s blog

# Create a background job
cargo run -- job create -u https://example.com -s blog@latest

# Start a worker to process jobs
cargo run -- worker
```

The `scrape` command performs a one-shot extraction: it fetches the URL, cleans the HTML to Markdown, sends it to the LLM together with the JSON Schema, and prints the extracted data to stdout.
| Flag | Env Var | Description |
|---|---|---|
| `-u, --url` | | Target URL |
| `-s, --schema` | | Schema path or `name@version` |
| `-m, --model` | `ARES_MODEL` | LLM model (e.g., `gpt-4o-mini`) |
| `-b, --base-url` | `ARES_BASE_URL` | API base URL (default: OpenAI) |
| `-a, --api-key` | `ARES_API_KEY` | API key |
| `--save` | | Persist result to database |
| `--schema-name` | | Override schema name for storage |
| `--browser` | | Use headless browser for JS-rendered pages (requires `browser` feature) |
| `--fetch-timeout` | | HTTP fetch timeout in seconds (default: 30) |
| `--llm-timeout` | | LLM API timeout in seconds (default: 120) |
| `--system-prompt` | | Custom system prompt for LLM extraction |
| `--skip-unchanged` | | Skip saving when extracted data hasn't changed (requires `--save`) |
The `history` command shows the extraction history for a URL + schema pair, with change detection.

The `job` subcommands manage persistent scrape jobs in the PostgreSQL queue.

The `worker` command starts a background worker that polls the job queue, processes scrape jobs through the circuit breaker, handles retries with exponential backoff, and supports graceful shutdown via Ctrl+C.
| Flag | Env Var | Description |
|---|---|---|
| `--worker-id` | | Custom worker ID (auto-generated if omitted) |
| `--poll-interval` | | Seconds between job queue polls (default: 5) |
| `-a, --api-key` | `ARES_API_KEY` | API key |
| `--browser` | | Use headless browser for JS-rendered pages (requires `browser` feature) |
| `--fetch-timeout` | | HTTP fetch timeout in seconds (default: 30) |
| `--llm-timeout` | | LLM API timeout in seconds (default: 120) |
| `--system-prompt` | | Custom system prompt for LLM extraction |
| `--skip-unchanged` | | Skip saving when extracted data hasn't changed |
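The worker's retry delays can be sketched as a simple capped exponential backoff. This standalone example is illustrative, not Ares's actual implementation; the function name, the 2-second base, and the 60-second cap are all assumptions:

```rust
// Illustrative capped exponential backoff: delay = base * 2^attempt, clamped
// to a maximum, so repeated failures back off quickly but never unboundedly.
use std::time::Duration;

fn backoff_delay(attempt: u32, base_secs: u64, max_secs: u64) -> Duration {
    // Clamp the shift so 2^attempt cannot overflow u64 for large attempts.
    let secs = base_secs.saturating_mul(1u64 << attempt.min(20));
    Duration::from_secs(secs.min(max_secs))
}

fn main() {
    // With base 2s and cap 60s: 2, 4, 8, 16, 32, then pinned at 60.
    for attempt in 0..7 {
        println!("{}", backoff_delay(attempt, 2, 60).as_secs());
    }
}
```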
Ares ships a standalone HTTP server (ares-api) built on Axum with auto-generated OpenAPI documentation.
```bash
# Run locally
cargo run --bin ares-api

# Or with Docker
docker build -t ares-api:latest .
docker run -p 3000:3000 --env-file .env ares-api:latest
```

Once running, interactive API docs are available at `/swagger-ui`.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | `/v1/scrape` | Bearer | One-shot scrape and extract |
| POST | `/v1/jobs` | Bearer | Create a scrape job |
| GET | `/v1/jobs` | Bearer | List jobs (filter by status, limit) |
| GET | `/v1/jobs/{id}` | Bearer | Get job details |
| DELETE | `/v1/jobs/{id}` | Bearer | Cancel a pending job |
| GET | `/v1/extractions` | Bearer | Query extraction history |
| GET | `/v1/schemas` | Bearer | List all schemas |
| GET | `/v1/schemas/{name}/{version}` | Bearer | Get schema definition |
| POST | `/v1/schemas` | Bearer | Create/upload a schema version |
| GET | `/health` | — | Health check (database connectivity) |
Protected endpoints require a Bearer token set via `ARES_ADMIN_TOKEN`. Token comparison uses constant-time equality (the `subtle` crate) to prevent timing attacks.

```bash
curl -H "Authorization: Bearer $ARES_ADMIN_TOKEN" http://localhost:3000/v1/jobs
```

If `ARES_ADMIN_TOKEN` is not set, all protected endpoints return `403 Forbidden`.
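The constant-time comparison can be illustrated with a small standalone sketch. The real code uses the `subtle` crate; the hand-rolled `ct_eq` below just shows the idea:

```rust
// Sketch of constant-time equality: XOR every byte pair and OR the results,
// so for equal-length inputs the runtime does not depend on where the first
// mismatch occurs (unlike an early-exit `==` on byte slices).
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    println!("{}", ct_eq(b"secret", b"secret")); // true
    println!("{}", ct_eq(b"secret", b"Secret")); // false
}
```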
Schemas are versioned JSON Schema files stored in `schemas/`:

```
schemas/
  registry.json      # {"blog": "1.0.0"}
  blog/
    1.0.0.json       # JSON Schema definition
```

Reference a schema by path (`schemas/blog/1.0.0.json`) or by name (`[email protected]`, `blog@latest`).
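For illustration, a minimal `schemas/blog/1.0.0.json` might look like the following; the field names here are hypothetical, not taken from the repository:

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "author": { "type": "string" },
    "published_at": { "type": "string" }
  },
  "required": ["title"]
}
```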
| Variable | Required | Default | Description |
|---|---|---|---|
| `ARES_API_KEY` | Yes | | LLM API key |
| `ARES_MODEL` | Yes | | LLM model name |
| `ARES_BASE_URL` | No | `https://api.openai.com/v1` | OpenAI-compatible endpoint |
| `DATABASE_URL` | For persistence | | PostgreSQL connection string |
| `DATABASE_MAX_CONNECTIONS` | No | `5` | PostgreSQL connection pool size |
| `ARES_ADMIN_TOKEN` | No | | Bearer token for REST API auth |
| `ARES_SERVER_PORT` | No | `3000` | HTTP server listen port |
| `ARES_SCHEMAS_DIR` | No | `schemas` | Path to schemas directory |
| `ARES_CORS_ORIGIN` | No | | Allowed CORS origins (comma-separated, or `*`) |
| `ARES_RATE_LIMIT_BURST` | No | `30` | Max burst requests per IP |
| `ARES_RATE_LIMIT_RPS` | No | `1` | Request replenish rate (per second) |
| `ARES_BODY_SIZE_LIMIT` | No | `2097152` | Max request body size in bytes (2 MB) |
| `CHROME_BIN` | No | Auto-detected | Override path to Chrome/Chromium binary |
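The two rate-limit knobs map naturally onto a token bucket: `ARES_RATE_LIMIT_BURST` is the bucket capacity and `ARES_RATE_LIMIT_RPS` the refill rate. A hypothetical sketch of that model, not Ares's actual middleware:

```rust
// Token bucket: each request spends one token; tokens refill continuously at
// `refill_per_sec`, and the bucket never holds more than `capacity` tokens.
struct TokenBucket {
    tokens: f64,
    capacity: f64,
    refill_per_sec: f64,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { tokens: capacity, capacity, refill_per_sec }
    }

    // `elapsed_secs` is the time since the previous call; returns whether
    // the request is allowed.
    fn allow(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Burst of 2, refilling 1 token per second.
    let mut bucket = TokenBucket::new(2.0, 1.0);
    println!("{}", bucket.allow(0.0)); // true  (burst)
    println!("{}", bucket.allow(0.0)); // true  (burst)
    println!("{}", bucket.allow(0.0)); // false (bucket empty)
    println!("{}", bucket.allow(1.0)); // true  (one token refilled)
}
```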
Gemini works via the OpenAI-compatible endpoint:

```bash
export ARES_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
export ARES_MODEL="gemini-2.5-flash"
```

```bash
# Build the image
docker build -t ares-api:latest .

# Run with docker compose (PostgreSQL + pgAdmin + server)
docker compose up -d
```

The Dockerfile uses a multi-stage build (Rust builder → Debian slim runtime) with Chromium pre-installed for browser-based scraping. The release binary is compiled with LTO and symbol stripping to minimize image size.
```bash
# Format, lint, and test
make all

# Run unit tests only
make test-unit

# Run integration tests (requires Docker)
make test-integration

# Run database migrations
make migrate

# Start/stop PostgreSQL
make docker-up
make docker-down
```

CI runs on every push and PR via GitHub Actions: formatting, Clippy, unit tests, integration tests (with a Postgres service container), and a cargo-deny security audit.

