feat: hardware-tiered benchmark infrastructure

## Problem

NEXUS recommends local models based on detected GPU hardware, but CI only runs on standard GitHub runners (no GPU). There is no automated validation that:
- `detectGPU()` parses real `nvidia-smi` / `system_profiler` output correctly
- Recommended models actually fit in detected VRAM
- MCP tool response times match documented benchmarks

## Design Principles

- **Benchmarks never block releases.** CI is the release gate. Benchmarks are a post-release quality signal.
- **Results are editable after release.** Run benchmarks when hardware is available, update release notes as results come in. No new tag needed.
- **Contributors can submit results.** Anyone with hardware runs the script, submits JSON via PR or issue.
- **TUI integration.** `nexus benchmark` command runs the suite locally. Opt-in anonymous upload for community dataset.
- **Uses unified data schema.** See #74 for the shared schema across benchmarks, sessions, and analytics.

## Hardware Tiers

| Tier | Example Hardware | VRAM | Supervisor | Logic |
|------|-----------------|------|-----------|-------|
| Low | RTX 3050 Mobile | 4 GB | qwen2.5-coder:1.5b | llama3.2:3b |
| Mid | RTX 3060 / M3 Pro | 8-18 GB | qwen2.5-coder:7b | llama3.1:8b |
| High | RTX 4090 / M3 Max | 24+ GB | qwen2.5-coder:14b | qwen2.5:32b |

## Release Notes Format

```markdown
## Hardware Benchmarks
- ✅ Low tier (RTX 3050, 4GB) — tested
- ⏳ Mid tier — pending
- ⏳ High tier — pending
```

Updated as results arrive. GitHub release notes are editable post-publish.

## Phased Rollout

### Phase 1 — v0.2.0: Benchmark script + repo storage

- Ship `scripts/benchmark.sh` and TUI `nexus benchmark` command
- Detects hardware, pulls recommended models, runs all 6 MCP tools with standardized inputs
- Outputs schema-compliant JSON (#74)
- Store results in `docs/benchmarks/<version>/<tier>.json`
- Release checklist includes benchmark run (non-blocking)

### Phase 2 — v0.3.0: Cloud GPU + remote storage

- Cloud GPU scripting (Vast.ai or Lambda Cloud): spin up, benchmark, tear down (~pennies per release)
- Migrate storage from repo to S3/R2
- TUI opt-in upload for community-contributed results
- Quality score (1-5) added to benchmark results

### Phase 3 — v0.4.0+: Website + automation

- Website reads benchmark data from S3/R2, renders charts and hardware comparisons
- Optional self-hosted GitHub Actions runners for fully automated per-release benchmarks
- Same data store serves website, TUI dashboard, and future Pro analytics

## Acceptance Criteria (Phase 1)

- [ ] `scripts/benchmark.sh` exists and produces schema-compliant JSON (#74)
- [ ] TUI `nexus benchmark` command wraps the same logic
- [ ] Covers: GPU detection, model pull, all 6 MCP tools with sample inputs, response times
- [ ] Baseline thresholds defined per tier
- [ ] `docs/benchmarks/` directory with at least one versioned result
- [ ] Release checklist documents benchmark as non-blocking step

## Related

- #74 Unified data schema
- #9 Session logging
- #11 Cost tracker
- #29 Hardware detection (shipped in #73)
- #30 Benchmarks
- `docs/model-configuration.md` (existing manual test results)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: hardware-tiered benchmark infrastructure #75

Problem

Design Principles

Hardware Tiers

Release Notes Format

Phased Rollout

Phase 1 — v0.2.0: Benchmark script + repo storage

Phase 2 — v0.3.0: Cloud GPU + remote storage

Phase 3 — v0.4.0+: Website + automation

Acceptance Criteria (Phase 1)

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Tier	Example Hardware	VRAM	Supervisor	Logic
Low	RTX 3050 Mobile	4 GB	qwen2.5-coder:1.5b	llama3.2:3b
Mid	RTX 3060 / M3 Pro	8-18 GB	qwen2.5-coder:7b	llama3.1:8b
High	RTX 4090 / M3 Max	24+ GB	qwen2.5-coder:14b	qwen2.5:32b

feat: hardware-tiered benchmark infrastructure #75

Description

Problem

Design Principles

Hardware Tiers

Release Notes Format

Phased Rollout

Phase 1 — v0.2.0: Benchmark script + repo storage

Phase 2 — v0.3.0: Cloud GPU + remote storage

Phase 3 — v0.4.0+: Website + automation

Acceptance Criteria (Phase 1)

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions