feat(ci): migrate Linux CI jobs to self-hosted runners on hanzo-k8s #31

Merged
zooqueen merged 7 commits into main from feat/self-hosted-runners on Feb 26, 2026
Conversation

@zooqueen

Summary

  • Route all Linux CI jobs to self-hosted runners (hanzo-k8s label) on our DOKS cluster
  • Eliminates GitHub Actions queue wait times — jobs start immediately on dedicated pods
  • 3 github-runner pods + 2 github-runner-build pods in actions-runner-system namespace
  • Set BOT_TEST_WORKERS=1 so the test run fits within the 4GiB memory limit of each runner pod

What's migrated

| Workflow | Jobs moved | Runner label |
| --- | --- | --- |
| ci.yml | 8 Linux jobs (docs-scope, changed-scope, check, checks, build-artifacts, release-check, check-docs, secrets) | hanzo-k8s |
| workflow-sanity.yml | no-tabs | hanzo-k8s |
| install-smoke.yml | docs-scope, install-smoke | hanzo-k8s |
| formal-conformance.yml | formal_conformance | hanzo-k8s |
| labeler.yml | 3 jobs | hanzo-k8s |
| auto-response.yml | respond | hanzo-k8s |
| stale.yml | stale | hanzo-k8s |

What stays on GitHub runners

| Job | Runner | Reason |
| --- | --- | --- |
| Android | ubuntu-latest | Needs JDK + Gradle + Android SDK |
| Windows | windows-latest | Needs Windows OS |
| macOS | macos-latest | Needs Xcode + Swift |
| Docker builds | ubuntu-latest / ubuntu-24.04-arm | Needs Docker + registry access |
| npm release | ubuntu-latest | Needs npm registry auth |
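
One way to audit the split between the two tables is to scan each workflow file for its `runs-on` values. The sketch below is illustrative only: `runner_labels`, `is_self_hosted`, and the `SAMPLE` workflow text are hypothetical names and data, not part of this PR, and the regex handles only simple scalar `runs-on:` entries.

```python
import re

# Matches a simple scalar `runs-on:` entry on one line.
# List or expression forms are returned as raw text, not parsed.
RUNS_ON = re.compile(r"^\s*runs-on:\s*(.+?)\s*$", re.MULTILINE)


def runner_labels(workflow_text: str) -> list[str]:
    """Return every runs-on value found in a workflow file's text."""
    return [m.group(1).strip("'\"") for m in RUNS_ON.finditer(workflow_text)]


def is_self_hosted(label: str, self_hosted_label: str = "hanzo-k8s") -> bool:
    """True if the label is the self-hosted label used in this PR."""
    return label == self_hosted_label


# Hypothetical workflow snippet used only to exercise the functions.
SAMPLE = """\
jobs:
  check:
    runs-on: hanzo-k8s
  windows:
    runs-on: windows-latest
"""

for label in runner_labels(SAMPLE):
    kind = "self-hosted" if is_self_hosted(label) else "GitHub-hosted"
    print(f"{label}: {kind}")
```

Running this over the real files under `.github/workflows/` would give a quick check that only the intended jobs carry the `hanzo-k8s` label.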

Test plan

  • CI jobs pick up self-hosted runners (check runner name in job logs)
  • All migrated jobs pass (lint, types, tests, protocol, secrets)
  • Jobs that stayed on GitHub runners still work (Windows, Android)
  • No queue wait times for Linux jobs

🤖 Generated with Claude Code

zooqueen and others added 4 commits February 25, 2026 20:44
)

Route all Linux CI jobs to self-hosted runners (hanzo-k8s label) on
our DOKS cluster instead of GitHub-hosted ubuntu-latest. This
eliminates queue wait times and runs jobs on dedicated infrastructure.

Migrated workflows: ci.yml (8 jobs), workflow-sanity.yml,
install-smoke.yml, formal-conformance.yml, labeler.yml,
auto-response.yml, stale.yml.

Kept on GitHub runners: Android (needs JDK/Gradle/SDK),
Windows (needs Windows OS), macOS (needs Xcode/Swift),
Docker release (needs Docker), npm release (needs registry).

Adjusted test parallelism: BOT_TEST_WORKERS=1 for 4GiB runner pods.

Runners: 3x github-runner + 2x github-runner-build pods in
actions-runner-system namespace, org-scoped for hanzoai.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

The myoung34/github-runner:ubuntu-jammy image only has python3,
not the python symlink. Update the no-tabs check accordingly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
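
The actual no-tabs step is not shown in this PR, so the following is only a sketch of what such a check might look like when invoked as `python3` rather than `python`. The function names `find_tab_lines` and `check_files` are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical tab check; the real no-tabs step is not shown in this PR."""


def find_tab_lines(text: str) -> list[int]:
    """Return 1-based line numbers that contain a tab character."""
    return [n for n, line in enumerate(text.splitlines(), start=1) if "\t" in line]


def check_files(paths: list[str]) -> int:
    """Print each offending line and return the number of lines with tabs."""
    offending = 0
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for n in find_tab_lines(fh.read()):
                print(f"{path}:{n}: tab character found")
                offending += 1
    return offending


# Small in-memory demonstration; lines 2 and 3 contain tabs.
print(find_tab_lines("clean\n\tindented\nkey:\tvalue\n"))
```

Using the `#!/usr/bin/env python3` shebang, or calling `python3` explicitly in the workflow step, avoids depending on a `python` symlink that the runner image does not provide.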
- Scale self-hosted runners from 4GiB to 8GiB (node test OOM'd)
- Restore BOT_TEST_WORKERS=2 (fits in 8GiB)
- Revert labeler.yml to ubuntu-latest (GH_APP_PRIVATE_KEY org secret
  not available to self-hosted runners)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

On self-hosted runners the workspace is /tmp/runner/work/... which
falls under os.tmpdir() (/tmp) — an allowed media root. The test
was using process.cwd()/package.json which passed the root check,
letting the Discord token validation fire first with a different error.

Use /usr/share/ instead, which is never under an allowed root.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
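
To illustrate why the workspace location changed the test's outcome, here is a minimal sketch of a root-containment check. It is not the project's implementation: `is_under_any_root` and `ALLOWED_ROOTS` are hypothetical, the `repo` path segment is a placeholder for the elided workspace path, and the sketch compares path components only (no symlink or `..` resolution).

```python
from pathlib import PurePosixPath


def is_under_any_root(path: str, allowed_roots: list[str]) -> bool:
    """True if `path` equals or is nested under one of `allowed_roots`."""
    candidate = PurePosixPath(path)
    for root in allowed_roots:
        try:
            # relative_to raises ValueError when candidate is outside root.
            candidate.relative_to(PurePosixPath(root))
            return True
        except ValueError:
            continue
    return False


# Assumption: the temp directory (/tmp) is the only allowed root here.
ALLOWED_ROOTS = ["/tmp"]

# A workspace under /tmp passes the root check, so the rejection never fires.
print(is_under_any_root("/tmp/runner/work/repo/package.json", ALLOWED_ROOTS))
# /usr/share/ is outside every allowed root, so it is rejected as intended.
print(is_under_any_root("/usr/share/", ALLOWED_ROOTS))
```

On a GitHub-hosted runner the workspace sits outside `/tmp`, which is presumably why the original `process.cwd()` path was rejected there and the test passed.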
zooqueen force-pushed the feat/self-hosted-runners branch from 84502ab to 791367c on February 26, 2026 04:45
…runners

2 workers × 3072 MB = 6 GB V8 heap on 8 GB runners leaves no room for
OS + Node RSS overhead, causing OOM kills during test cleanup. Reducing
to 2048 MB per worker (4 GB total) leaves enough headroom.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
zooqueen force-pushed the feat/self-hosted-runners branch from d0c94c6 to 3ae601e on February 26, 2026 05:40
zooqueen and others added 2 commits February 25, 2026 22:05
test-parallel.mjs runs 3 vitest groups in parallel, each spawning
BOT_TEST_WORKERS processes. With 2 workers per group, total V8 heap
was 3×2×3072MB = 18GB, causing OOMKilled on 16GB runner nodes.
Reducing to 1 worker per group gives 3×3072MB = 9GB — fits in 12GB.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Replace manual myoung34/github-runner deployments with GitHub's official
ARC controller. Runner pods now scale on-demand (0→10) when jobs queue,
each getting its own 14Gi pod on the runner-pool node pool. Workers stay
at 2 per vitest group with 2048MB heap (3×2×2048 = 12GB, fits in 14Gi).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
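
The heap arithmetic in the commit messages above can be reproduced with a small calculation. This is a sketch under one assumption: the "MB" and "GB" figures are treated as MiB and GiB, which matches the commits' own rounding (3×2×2048 = 12GB). The helper names are hypothetical.

```python
MIB_PER_GIB = 1024


def total_heap_mib(groups: int, workers_per_group: int, heap_mib: int) -> int:
    """Combined V8 heap ceiling across all concurrently running workers."""
    return groups * workers_per_group * heap_mib


def headroom_mib(limit_gib: int, groups: int, workers_per_group: int, heap_mib: int) -> int:
    """Memory left for the OS and non-heap Node RSS; negative means over budget."""
    return limit_gib * MIB_PER_GIB - total_heap_mib(groups, workers_per_group, heap_mib)


# (description, memory limit in GiB, vitest groups, workers per group, heap per worker in MiB)
CONFIGS = [
    ("2 workers per group, 3072 MB heap, 16 GB node", 16, 3, 2, 3072),
    ("1 worker per group, 3072 MB heap, 12 GB pod", 12, 3, 1, 3072),
    ("2 workers per group, 2048 MB heap, 14Gi ARC pod", 14, 3, 2, 2048),
]

for name, limit, groups, workers, heap in CONFIGS:
    used = total_heap_mib(groups, workers, heap)
    spare = headroom_mib(limit, groups, workers, heap)
    print(f"{name}: heap {used} MiB, headroom {spare} MiB")
```

The heap ceiling is only part of a Node process's resident memory, so a positive headroom figure is a necessary condition for avoiding OOM kills, not a guarantee.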
zooqueen merged commit 5f3401f into main on Feb 26, 2026
18 of 20 checks passed
zooqueen added a commit that referenced this pull request Mar 6, 2026
* feat(ci): migrate Linux CI jobs to self-hosted runners on hanzo-k8s (#31)

* fix(ci): use python3 instead of python for self-hosted runner compat

* fix(ci): bump runner memory to 8GiB and revert labeler to ubuntu-latest

* fix(test): use path outside tmpdir for media root rejection test

* fix(ci): reduce Node test heap to 2048MB to avoid OOM on self-hosted runners

* fix(ci): reduce test workers to 1 per group to fit 12GB pod limit

* feat(ci): switch to ARC (Actions Runner Controller) for auto-scaling

---------