feat(ci): migrate Linux CI jobs to self-hosted runners on hanzo-k8s by zooqueen · Pull Request #31 · hanzoai/bot

zooqueen · 2026-02-25T22:56:21Z

Summary

- Route all Linux CI jobs to self-hosted runners (hanzo-k8s label) on our DOKS cluster
- Eliminate GitHub Actions queue wait times: jobs start immediately on dedicated pods
- Runners: 3 github-runner pods + 2 github-runner-build pods in the actions-runner-system namespace
- Set BOT_TEST_WORKERS=1 to fit the 4GiB memory limit of the runner pods

What's migrated

| Workflow | Jobs moved | Runner label |
|---|---|---|
| ci.yml | 8 Linux jobs (docs-scope, changed-scope, check, checks, build-artifacts, release-check, check-docs, secrets) | hanzo-k8s |
| workflow-sanity.yml | no-tabs | hanzo-k8s |
| install-smoke.yml | docs-scope, install-smoke | hanzo-k8s |
| formal-conformance.yml | formal_conformance | hanzo-k8s |
| labeler.yml | 3 jobs | hanzo-k8s |
| auto-response.yml | respond | hanzo-k8s |
| stale.yml | stale | hanzo-k8s |

What stays on GitHub runners

| Job | Runner | Reason |
|---|---|---|
| Android | ubuntu-latest | Needs JDK + Gradle + Android SDK |
| Windows | windows-latest | Needs Windows OS |
| macOS | macos-latest | Needs Xcode + Swift |
| Docker builds | ubuntu-latest / ubuntu-24.04-arm | Needs Docker + registry access |
| npm release | ubuntu-latest | Needs npm registry auth |
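One way to confirm that the routing in the two tables above matches the workflow files is to list every `runs-on` value per file. The sketch below is a hypothetical standard-library helper (`runs_on_labels` and `audit` are not part of hanzoai/bot); it assumes each job declares `runs-on` as a single value on one line, so block-style lists spanning several lines are missed and trailing YAML comments are not stripped.

```python
import re
from pathlib import Path

# Hypothetical audit helper (not part of hanzoai/bot). Matches one-line forms
# such as "runs-on: hanzo-k8s" or "runs-on: 'ubuntu-latest'". Expressions and
# inline lists are returned verbatim; block-style lists on later lines are missed.
RUNS_ON = re.compile(r"^\s*runs-on:\s*(\S.*?)\s*$")


def runs_on_labels(workflow_text: str) -> list[str]:
    """Return every one-line runs-on value in a workflow file, in file order."""
    labels = []
    for line in workflow_text.splitlines():
        match = RUNS_ON.match(line)
        if match:
            labels.append(match.group(1).strip("'\""))
    return labels


def audit(workflow_dir: str = ".github/workflows") -> dict[str, list[str]]:
    """Map each *.yml workflow file name to the runner labels it requests."""
    return {
        path.name: runs_on_labels(path.read_text(encoding="utf-8"))
        for path in sorted(Path(workflow_dir).glob("*.yml"))
    }


if __name__ == "__main__":
    for name, labels in audit().items():
        print(f"{name}: {', '.join(labels) or '(no one-line runs-on found)'}")
```

Run from the repository root, each migrated workflow should list only hanzo-k8s, while the Android, Windows, macOS, Docker, and npm release jobs should still show their GitHub-hosted labels.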

Test plan

- CI jobs pick up self-hosted runners (check runner name in job logs)
- All migrated jobs pass (lint, types, tests, protocol, secrets)
- Jobs that stayed on GitHub runners still work (Windows, Android)
- No queue wait times for Linux jobs

🤖 Generated with Claude Code

feat(ci): migrate Linux CI jobs to self-hosted runners on hanzo-k8s
Route all Linux CI jobs to self-hosted runners (hanzo-k8s label) on our DOKS cluster instead of GitHub-hosted ubuntu-latest. This eliminates queue wait times and runs jobs on dedicated infrastructure. Migrated workflows: ci.yml (8 jobs), workflow-sanity.yml, install-smoke.yml, formal-conformance.yml, labeler.yml, auto-response.yml, stale.yml. Kept on GitHub runners: Android (needs JDK/Gradle/SDK), Windows (needs Windows OS), macOS (needs Xcode/Swift), Docker release (needs Docker), npm release (needs registry auth). Adjusted test parallelism: BOT_TEST_WORKERS=1 for 4GiB runner pods. Runners: 3x github-runner + 2x github-runner-build pods in the actions-runner-system namespace, org-scoped for hanzoai.
Co-Authored-By: Claude Opus 4.6

fix(ci): use python3 instead of python for self-hosted runner compat
The myoung34/github-runner:ubuntu-jammy image ships only python3, without a python symlink. Update the no-tabs check to invoke python3.
Co-Authored-By: Claude Opus 4.6
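The no-tabs check itself is not shown in this PR. As a hypothetical illustration of a check that depends only on `python3` (note the `python3` shebang, since a bare `python` is absent on the jammy image), a standalone script could look like this; the file name `no_tabs.py` and both functions are assumptions, not the repository's code.

```python
#!/usr/bin/env python3
"""Hypothetical no-tabs checker: exits non-zero if any given file contains a tab."""
import sys
from pathlib import Path


def tab_lines(text: str) -> list[int]:
    """Return the 1-based line numbers that contain a tab character."""
    return [n for n, line in enumerate(text.splitlines(), start=1) if "\t" in line]


def main(paths: list[str]) -> int:
    """Print every offending file:line and return 1 if any tab was found."""
    status = 0
    for name in paths:
        for lineno in tab_lines(Path(name).read_text(encoding="utf-8")):
            print(f"{name}:{lineno}: tab character")
            status = 1
    return status


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation: python3 no_tabs.py .github/workflows/*.yml
    sys.exit(main(sys.argv[1:]))
```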

fix(ci): bump runner memory to 8GiB and revert labeler to ubuntu-latest
- Scale self-hosted runners from 4GiB to 8GiB (the Node test job was OOM-killed)
- Restore BOT_TEST_WORKERS=2 (fits in 8GiB)
- Revert labeler.yml to ubuntu-latest (the GH_APP_PRIVATE_KEY org secret is not available to self-hosted runners)
Co-Authored-By: Claude Opus 4.6

fix(test): use path outside tmpdir for media root rejection test
On self-hosted runners the workspace is /tmp/runner/work/..., which falls under os.tmpdir() (/tmp), an allowed media root. The test used process.cwd()/package.json, which therefore passed the root check, so the Discord token validation fired first and produced a different error. Use /usr/share/ instead, which is never under an allowed root.
Co-Authored-By: Claude Opus 4.6
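To see why a workspace file passes the root check while /usr/share/ is rejected, here is a Python analogue of an "is this path under an allowed root?" test. `is_under_root` is a hypothetical helper, not the project's media-root code (which runs in Node); `tempfile.gettempdir()` plays the role of Node's `os.tmpdir()`. Comparing whole path components with `os.path.commonpath` avoids treating `/tmpx` as inside `/tmp`.

```python
import os
import tempfile


def is_under_root(path: str, roots: list[str]) -> bool:
    """Return True if path resolves to one of the roots or somewhere beneath it.

    Hypothetical helper illustrating an allowed-root check; it is not the
    project's implementation.
    """
    target = os.path.realpath(path)
    for root in roots:
        base = os.path.realpath(root)
        # commonpath compares whole path components, so "/tmpx" is not under "/tmp".
        if os.path.commonpath([target, base]) == base:
            return True
    return False


if __name__ == "__main__":
    allowed = [tempfile.gettempdir()]  # analogous to Node's os.tmpdir()
    workspace_file = os.path.join(tempfile.gettempdir(), "runner", "work", "bot", "package.json")
    print(is_under_root(workspace_file, allowed))  # True: inside the temp dir
    print(is_under_root("/usr/share/", allowed))   # False: outside every root
```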

fix(ci): reduce Node test heap to 2048MB to avoid OOM on self-hosted runners
2 workers × 3072 MB = 6 GB of V8 heap on 8 GB runners leaves no room for OS and Node RSS overhead, causing OOM kills during test cleanup. Reducing to 2048 MB per worker (4 GB total) leaves enough headroom.
Co-Authored-By: Claude Opus 4.6

fix(ci): reduce test workers to 1 per group to fit 12GB pod limit
test-parallel.mjs runs 3 vitest groups in parallel, each spawning BOT_TEST_WORKERS processes. With 2 workers per group, the total V8 heap was 3×2×3072MB = 18GB, so pods were OOMKilled on 16GB runner nodes. Reducing to 1 worker per group gives 3×3072MB = 9GB, which fits in the 12GB pod limit.
Co-Authored-By: Claude Opus 4.6
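The memory arithmetic in these commits is groups × workers per group × per-worker heap, compared against the pod limit. A small sketch of that budget check, assuming each worker's heap is capped by something like Node's `--max-old-space-size` (the commits don't say how the cap is set) and treating the commits' MB/GB/Gi as MiB/GiB; `headroom_mib` is a hypothetical allowance for OS and RSS outside the V8 heap, for which the commits give no number.

```python
GIB = 1024  # MiB per GiB; the commits' MB/GB/Gi are treated here as MiB/GiB


def total_heap_mib(groups: int, workers_per_group: int, heap_mib: int) -> int:
    """Sum of per-worker heap caps: the worst case if every worker reaches its cap."""
    return groups * workers_per_group * heap_mib


def fits(groups: int, workers_per_group: int, heap_mib: int,
         limit_mib: int, headroom_mib: int = 0) -> bool:
    """True if the summed heap caps plus non-heap headroom stay within the limit.

    headroom_mib is a hypothetical allowance for OS and Node RSS outside the
    V8 heap; the commits do not give a number for it.
    """
    return total_heap_mib(groups, workers_per_group, heap_mib) + headroom_mib <= limit_mib


if __name__ == "__main__":
    scenarios = {
        "3 groups x 2 workers x 3072 MiB on 16 GiB": (3, 2, 3072, 16 * GIB),
        "3 groups x 1 worker x 3072 MiB on 12 GiB": (3, 1, 3072, 12 * GIB),
        "3 groups x 2 workers x 2048 MiB on 14 GiB": (3, 2, 2048, 14 * GIB),
    }
    for name, (g, w, heap, limit) in scenarios.items():
        total = total_heap_mib(g, w, heap)
        print(f"{name}: {total / GIB:g} GiB heap, fits={fits(g, w, heap, limit)}")
```

In the final ARC configuration, 3 × 2 × 2048 MiB = 12 GiB of heap caps in a 14Gi pod leaves about 2 GiB for everything outside the V8 heap.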

feat(ci): switch to ARC (Actions Runner Controller) for auto-scaling
Replace the manual myoung34/github-runner deployments with GitHub's official Actions Runner Controller (ARC). Runner pods now scale on demand (0→10) as jobs queue, and each runner gets its own 14Gi pod on the runner-pool node pool. Workers stay at 2 per vitest group with a 2048MB heap each (3×2×2048MB = 12GB, which fits in 14Gi).
Co-Authored-By: Claude Opus 4.6
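A simplified model of the 0→10 scaling described above, assuming one runner pod per queued job and a pod size of 14Gi. ARC's actual listener logic is more involved; the bounds presumably correspond to the runner scale set's `minRunners`/`maxRunners` settings, which this PR does not show. Both functions are hypothetical.

```python
def desired_runners(queued_jobs: int, max_runners: int = 10) -> int:
    """Simplified model: one runner pod per queued job, from 0 up to max_runners."""
    if queued_jobs < 0:
        raise ValueError("queued_jobs must be non-negative")
    return min(queued_jobs, max_runners)


def requested_memory_gib(runner_pods: int, pod_memory_gib: int = 14) -> int:
    """Memory the node pool must offer if every runner pod is sized at pod_memory_gib."""
    return runner_pods * pod_memory_gib


if __name__ == "__main__":
    for queued in (0, 3, 25):
        pods = desired_runners(queued)
        print(f"{queued} queued jobs -> {pods} runner pods, {requested_memory_gib(pods)} Gi")
```

At the cap of 10 runners, the runner-pool node pool would need roughly 140Gi of allocatable memory to schedule every pod at once.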


zooqueen and others added 4 commits February 25, 2026 20:44

fix(ci): use python3 instead of python for self-hosted runner compat

0b1f261


zooqueen force-pushed the feat/self-hosted-runners branch from 84502ab to 791367c on February 26, 2026 04:45

zooqueen force-pushed the feat/self-hosted-runners branch from d0c94c6 to 3ae601e on February 26, 2026 05:40

zooqueen and others added 2 commits February 25, 2026 22:05

zooqueen merged commit 5f3401f into main Feb 26, 2026
18 of 20 checks passed

zooqueen merged 7 commits into main from feat/self-hosted-runners
