
feat: hardware-tiered benchmark infrastructure #75

@canoo


Problem

NEXUS recommends local models based on detected GPU hardware, but CI only runs on standard GitHub runners (no GPU). There is no automated validation that:

  • detectGPU() parses real nvidia-smi / system_profiler output correctly (see the parsing sketch after this list)
  • Recommended models actually fit in detected VRAM
  • MCP tool response times match documented benchmarks
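
To make the first bullet concrete, here is a minimal sketch of the kind of parsing a test for detectGPU() has to cover on the NVIDIA path. The nvidia-smi query flags are standard; the DetectedGPU shape, the detectNvidiaGPU name, and the single-GPU assumption are placeholders for illustration, not the actual NEXUS implementation.

```ts
import { execSync } from "node:child_process";

// Hypothetical result shape for illustration; the real schema lives in #74.
interface DetectedGPU {
  name: string;
  vramMB: number;
}

// Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`,
// which prints one line per GPU, e.g. "NVIDIA GeForce RTX 3050 Laptop GPU, 4096".
// The Apple path (system_profiler SPDisplaysDataType) would need its own parser.
function detectNvidiaGPU(): DetectedGPU | null {
  let out: string;
  try {
    out = execSync(
      "nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits",
      { encoding: "utf8" },
    );
  } catch {
    return null; // nvidia-smi not installed, or no NVIDIA GPU present
  }
  const line = out.trim().split("\n")[0];
  if (!line) return null;
  const comma = line.lastIndexOf(","); // memory.total is the last field
  const name = line.slice(0, comma).trim();
  const vramMB = Number(line.slice(comma + 1).trim());
  return comma > 0 && Number.isFinite(vramMB) ? { name, vramMB } : null;
}
```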

Design Principles

  • Benchmarks never block releases. CI is the release gate. Benchmarks are a post-release quality signal.
  • Results are editable after release. Run benchmarks when hardware is available, update release notes as results come in. No new tag needed.
  • Contributors can submit results. Anyone with hardware runs the script, submits JSON via PR or issue.
  • TUI integration. nexus benchmark command runs the suite locally. Opt-in anonymous upload for community dataset.
  • Uses the unified data schema. See #74 (Benchmark data schema: hardware specs, model throughput, and test result format) for the shared schema across benchmarks, sessions, and analytics; an illustrative record shape is sketched after this list.
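
Purely as an illustration of what a schema-compliant record could carry, a rough TypeScript shape follows. Every field name here is a placeholder; the authoritative schema is the one defined in #74.

```ts
// Placeholder shape only; see #74 for the authoritative schema.
interface BenchmarkResult {
  schemaVersion: string;          // shared schema version from #74
  release: string;                // NEXUS release benchmarked, e.g. "v0.2.0"
  hardware: {
    gpu: string;                  // e.g. "RTX 3050 Mobile"
    vramGB: number;
    tier: "low" | "mid" | "high";
  };
  models: {
    name: string;                 // e.g. "qwen2.5-coder:1.5b"
    tokensPerSecond: number;
  }[];
  tools: {
    name: string;                 // one of the 6 MCP tools
    responseMs: number;
  }[];
}
```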

Hardware Tiers

| Tier | Example Hardware  | VRAM    | Supervisor Logic                |
|------|-------------------|---------|---------------------------------|
| Low  | RTX 3050 Mobile   | 4 GB    | qwen2.5-coder:1.5b, llama3.2:3b |
| Mid  | RTX 3060 / M3 Pro | 8-18 GB | qwen2.5-coder:7b, llama3.1:8b   |
| High | RTX 4090 / M3 Max | 24+ GB  | qwen2.5-coder:14b, qwen2.5:32b  |
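
The tier boundaries translate directly into a supervisor lookup. A minimal sketch using the cut-offs from the table; the table leaves the 18-24 GB band unspecified, so the mid/high boundary below is an assumption.

```ts
type Tier = "low" | "mid" | "high";

// Cut-offs taken from the tier table above. The table does not pin down
// the 18-24 GB band, so treating 24 GB as the high boundary is a guess.
function tierForVRAM(vramGB: number): Tier {
  if (vramGB >= 24) return "high";
  if (vramGB >= 8) return "mid";
  return "low";
}
```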

Release Notes Format

## Hardware Benchmarks
- ✅ Low tier (RTX 3050, 4GB) — tested
- ⏳ Mid tier — pending
- ⏳ High tier — pending

Updated as results arrive; GitHub release notes remain editable after publishing (e.g. via gh release edit).

Phased Rollout

Phase 1 — v0.2.0: Benchmark script + repo storage

Phase 2 — v0.3.0: Cloud GPU + remote storage

  • Cloud GPU scripting (Vast.ai or Lambda Cloud): spin up, benchmark, tear down (~pennies per release)
  • Migrate storage from repo to S3/R2
  • TUI opt-in upload for community-contributed results
  • Quality score (1-5) added to benchmark results

Phase 3 — v0.4.0+: Website + automation

  • Website reads benchmark data from S3/R2, renders charts and hardware comparisons
  • Optional self-hosted GitHub Actions runners for fully automated per-release benchmarks
  • Same data store serves website, TUI dashboard, and future Pro analytics

Acceptance Criteria (Phase 1)

  • scripts/benchmark.sh exists and produces schema-compliant JSON (schema defined in #74)
  • TUI nexus benchmark command wraps the same logic
  • Covers: GPU detection, model pull, all 6 MCP tools with sample inputs, response times (see the timing sketch after this list)
  • Baseline thresholds defined per tier
  • docs/benchmarks/ directory with at least one versioned result
  • Release checklist documents benchmark as non-blocking step
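
For the response-time criterion, a sketch of a per-tool timing wrapper is given below. callTool stands in for however scripts/benchmark.sh or the TUI actually invokes an MCP tool; none of the names here come from the real harness.

```ts
// Hypothetical wrapper: time a single MCP tool call with a sample input.
// `callTool` is a placeholder for the real invocation mechanism.
async function timeTool(
  name: string,
  input: unknown,
  callTool: (name: string, input: unknown) => Promise<unknown>,
): Promise<{ name: string; responseMs: number }> {
  const start = performance.now();
  await callTool(name, input);
  return { name, responseMs: Math.round(performance.now() - start) };
}
```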
