A browser agent that audits a GitHub account against a CIS-style security checklist by actually opening the settings pages, reading what's rendered, and asking Claude whether each setting matches the expected posture.
It's a narrow browser agent on purpose: one application (GitHub), one benchmark (a hand-rolled CIS-style subset), one output format (findings JSON). The point is to show the agent loop end-to-end (plan, navigate, observe, classify, report) on a real public site, not to be a general scanner.
For each check in a YAML benchmark, the agent:
- Launches a Chromium browser via Playwright with a saved login session.
- Navigates to the GitHub settings page that holds the relevant setting.
- Snapshots the page's accessibility tree (semantic, less brittle than raw DOM).
- Asks Claude: "given this page content and this benchmark check, is the setting compliant? Quote the evidence."
- Records a structured finding (PASS / FAIL / UNKNOWN, with a quoted evidence string and a screenshot path).
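
The steps above can be sketched as a small per-check loop. This is a minimal illustration, not the project's actual agent.py: the `Check` and `Finding` types, the `url` and `expected` fields, and the `observe` and `classify` callables are all hypothetical names. In a real run `observe` would wrap Playwright and `classify` would wrap the Anthropic SDK; here they are injected so the loop can be exercised without a browser or an API key.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    id: str
    title: str
    url: str       # hypothetical field: settings page that holds the setting
    expected: str  # hypothetical field: description of the compliant state


@dataclass(frozen=True)
class Finding:
    id: str
    title: str
    status: str  # "PASS", "FAIL", or "UNKNOWN"
    evidence: str


# observe(url) returns the page's accessibility snapshot as text.
Observe = Callable[[str], str]
# classify(check, snapshot) returns (status, evidence).
Classify = Callable[[Check, str], tuple[str, str]]


def run_check(check: Check, observe: Observe, classify: Classify) -> Finding:
    """Navigate, observe, classify, and record one benchmark check."""
    try:
        snapshot = observe(check.url)
    except Exception as exc:  # navigation failures become UNKNOWN, not crashes
        return Finding(check.id, check.title, "UNKNOWN", f"observe failed: {exc}")
    if not snapshot.strip():
        return Finding(check.id, check.title, "UNKNOWN", "empty accessibility tree")
    status, evidence = classify(check, snapshot)
    if status not in {"PASS", "FAIL", "UNKNOWN"}:
        status = "UNKNOWN"  # never trust a label outside the schema
    return Finding(check.id, check.title, status, evidence)


def run_benchmark(checks: list[Check], observe: Observe, classify: Classify) -> list[Finding]:
    """Run every check in order and collect the findings."""
    return [run_check(c, observe, classify) for c in checks]
```

Keeping the browser and the model behind plain callables also makes it easy to replay a saved snapshot through the classifier when debugging a surprising verdict.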
Output is a JSON file like:
{
  "target": "b9nn",
  "generated_at": "2026-05-17T22:14:03Z",
  "benchmark": "github_user_cis_subset_v1",
  "findings": [
    {"id": "GH-USER-001", "title": "Two-factor authentication enabled",
     "status": "PASS", "evidence": "Two-factor authentication is enabled."},
    {"id": "GH-USER-004", "title": "No active classic personal access tokens",
     "status": "FAIL", "evidence": "3 personal access tokens (classic) are active."}
  ]
}

The hard parts of a real browser agent show up here, just at small scale:
- Auth. GitHub gates everything behind a session cookie. The agent uses Playwright's storage_state to persist a logged-in session you create once interactively, so no password automation or TOTP juggling.
- Selector drift. GitHub redesigns its settings UI. The agent reads the accessibility tree rather than relying on brittle CSS selectors, then delegates the "did I find the setting?" judgment to the model.
- Non-determinism. Claude's classification of a rendered page isn't deterministic, so each check requires a structured output schema, an explicit UNKNOWN state, and an evidence quote for the human to verify.
- Vision fallback. If the accessibility tree comes back empty or the model says UNKNOWN, the agent retries with a screenshot via Claude's vision capability instead.
Requires Python 3.11+ and an Anthropic API key.
git clone https://github.com/b9nn/posture-probe.git
cd posture-probe
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -e .
playwright install chromium
cp .env.example .env # then edit .env with your ANTHROPIC_API_KEY

GitHub auth happens once, interactively, and the session is saved to .auth/github.json. The agent reuses that on every subsequent run.
python -m posture_probe login

A browser window opens. Sign in to GitHub (do 2FA as usual). When you see your dashboard, switch back to the terminal and press Enter. The session state is written and the window closes.
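
The one-time login could be implemented roughly as follows. This is a sketch assuming Playwright's sync API; `save_login_session` and `looks_like_storage_state` are hypothetical names, and the login URL and default path are taken from the description above rather than from the project's code. Playwright is imported lazily so the validation helper works without it installed.

```python
import json
from pathlib import Path


def save_login_session(path: str = ".auth/github.json") -> None:
    """Open a headed browser, wait for a manual sign-in, then persist the session."""
    from playwright.sync_api import sync_playwright  # third-party; imported lazily

    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://github.com/login")
        input("Sign in (including 2FA), then press Enter here... ")
        context.storage_state(path=path)  # writes cookies and localStorage as JSON
        browser.close()


def looks_like_storage_state(path: str) -> bool:
    """Cheap sanity check that a saved session file has the expected top-level shape."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False
    return isinstance(data, dict) and isinstance(data.get("cookies"), list)
```

Later runs would hand the file back with `browser.new_context(storage_state=path)`. Because the file holds live session cookies, it should stay out of version control.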
Then run the audit:

python -m posture_probe audit --target b9nn --benchmark benchmarks/github_user.yaml

Findings land in out/findings-<timestamp>.json, alongside a redacted Markdown report at out/report-<timestamp>.md. Screenshots used as evidence go into out/screenshots/.
posture-probe/
  posture_probe/
    __main__.py          CLI entry (login, audit)
    agent.py             Per-check agent loop (navigate, observe, classify)
    browser.py           Playwright session wrapper
    llm.py               Anthropic SDK wrapper with the prompt + JSON schema
    benchmark.py         YAML loader
    findings.py          Output schema + report rendering
  benchmarks/
    github_user.yaml     The checks themselves
  examples/
    sample_findings.json
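
The benchmark file's schema is not shown here, so the following is only a guess at what a loader like benchmark.py might do. The `id` and `title` fields match the sample findings; the top-level `checks` key, the `url` and `expected` fields, and both function names are hypothetical. PyYAML is assumed for parsing and imported lazily so the validation logic runs without it.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("id", "title", "url", "expected")


@dataclass(frozen=True)
class Check:
    id: str
    title: str
    url: str       # hypothetical: settings page to open
    expected: str  # hypothetical: description of the compliant state


def parse_checks(raw: dict) -> list[Check]:
    """Validate a parsed benchmark document and return its checks."""
    checks: list[Check] = []
    seen: set[str] = set()
    for item in raw.get("checks", []):
        missing = [k for k in REQUIRED_FIELDS if k not in item]
        if missing:
            raise ValueError(f"check {item.get('id', '?')} is missing {missing}")
        if item["id"] in seen:
            raise ValueError(f"duplicate check id {item['id']}")
        seen.add(item["id"])
        checks.append(Check(*(item[k] for k in REQUIRED_FIELDS)))
    return checks


def load_benchmark(path: str) -> list[Check]:
    """Read a YAML benchmark file from disk and validate it."""
    import yaml  # PyYAML, third-party; imported lazily

    with open(path, encoding="utf-8") as fh:
        return parse_checks(yaml.safe_load(fh) or {})
```

Failing fast on a missing field or duplicate id keeps a malformed benchmark from silently producing an incomplete audit.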
v1 covers GitHub user-account checks. v2 will add per-repository checks (iterating over repos and applying repo-level CIS items) and a vision-LLM fallback path for cases where the accessibility tree is too noisy to judge from text alone.
MIT.