
fix: stabilize CI — TS widen, sys.modules restore, WS subscriber race #17836

Merged
teknium1 merged 1 commit into main from fix/ci-cleanup-after-17828
Apr 30, 2026

Conversation

@teknium1
Contributor

Follow-up to #17828 to turn the remaining red checks green.

Summary

Three narrow, unrelated fixes — each addressing one failing check on main after #17828 landed.

Changes

1. ui-tui/src/app/slash/commands/ops.ts — fixes Docker Build

The local params object in /reload-mcp was annotated session_id: string, but ctx.sid is typed string | null, which produced TS2322 on line 86. The rpc signature takes Record<string, unknown> (so null is accepted), every other rpc call site passes session_id: ctx.sid without narrowing, and the existing vitest test passes { session_id: null } to this rpc. Widening the local type from string to string | null matches reality: a one-line change with no runtime effect.

2. tests/plugins/test_achievements_plugin.py — fixes 13 cascading test failures

_install_fake_session_db did a raw sys.modules['hermes_state'] = fake_module with no restoration, leaking the fake past test boundaries. In xdist workers that run this test before tests/test_hermes_state.py or tests/run_agent/test_860_dedup.py, later tests see SessionDB = lambda: fake_db instead of the real class — causing AttributeError: 'function' object has no attribute '_sanitize_fts5_query' (6x) and TypeError: got unexpected keyword argument 'db_path' (7x).

Fix: stash monkeypatch on the plugin_api module via the fixture, and route the sys.modules swap through monkeypatch.setitem(...) so teardown auto-restores.
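The difference between the leaking assignment and a scoped swap can be sketched without pytest. This is a minimal illustration, not the code from this PR: it uses the standard library's `unittest.mock.patch.dict` as a stand-in for `monkeypatch.setitem` (both put `sys.modules` back the way it was on exit), and `hermes_state_demo`, `make_fake_module`, `install_leaky`, and `install_scoped` are hypothetical names chosen so that the sketch never touches a real module.

```python
import sys
import types
from unittest import mock

# Hypothetical module name, used so the sketch cannot clobber a real import.
MODULE_NAME = "hermes_state_demo"


def make_fake_module():
    """Build a fake module whose SessionDB is a plain function, not a class."""
    fake_db = object()
    fake = types.ModuleType(MODULE_NAME)
    fake.SessionDB = lambda: fake_db
    return fake


def install_leaky():
    """Raw assignment: nothing removes or restores the entry afterwards."""
    sys.modules[MODULE_NAME] = make_fake_module()


def install_scoped():
    """Scoped swap: sys.modules returns to its previous state on exit."""
    return mock.patch.dict(sys.modules, {MODULE_NAME: make_fake_module()})


# Scoped: the fake is visible only inside the block.
with install_scoped():
    inside_scoped = MODULE_NAME in sys.modules
after_scoped = MODULE_NAME in sys.modules

# Leaky: the fake is still registered for whoever imports the name next.
install_leaky()
after_leaky = MODULE_NAME in sys.modules
del sys.modules[MODULE_NAME]  # manual cleanup that a leaking helper never performs
```

In a pytest test the same effect comes from `monkeypatch.setitem(sys.modules, name, fake_module)`, where the restore happens at fixture teardown instead of at the end of a `with` block.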

3. tests/hermes_cli/test_web_server.py — fixes WS broadcast race timeout

TestPtyWebSocket::test_pub_broadcasts_to_events_subscribers hit the 30s test timeout on CI. websocket_connect returns after ws.accept(), but the server registers the subscriber in _event_channels on the next await (inside _event_lock). A publish issued immediately after connect can therefore run before registration: the broadcast finds no subscribers, the message is dropped, and the subsequent receive_text() blocks until the test timeout fires.

Fix: after the subscriber connects, poll _event_channels until the registration is visible before opening the publisher.
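The polling idea can be sketched in isolation. This is an assumption-laden illustration, not the test code from this PR: `wait_until`, `register_later`, `is_registered`, and `publish` are hypothetical helpers, the module-level `event_channels` dict only mimics the role of `_event_channels`, and a background thread with a short sleep stands in for the server registering the subscriber slightly after the connection is accepted.

```python
import threading
import time

# Hypothetical stand-in for the server-side registry of subscribers per channel.
event_channels = {}
registry_lock = threading.Lock()


def wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())


def register_later(channel, inbox, delay=0.2):
    """Simulate a server that registers the subscriber shortly after accept()."""
    def _register():
        time.sleep(delay)
        with registry_lock:
            event_channels.setdefault(channel, []).append(inbox)

    thread = threading.Thread(target=_register, daemon=True)
    thread.start()
    return thread


def is_registered(channel):
    """Report whether at least one subscriber is visible on the channel."""
    with registry_lock:
        return bool(event_channels.get(channel))


def publish(channel, message):
    """Deliver to the currently registered subscribers; return the count."""
    with registry_lock:
        subscribers = list(event_channels.get(channel, []))
    for inbox in subscribers:
        inbox.append(message)
    return len(subscribers)


inbox = []
register_later("demo", inbox)

# Publishing here would race the registration and would most likely be dropped.
# Waiting for the registration to become visible removes the race.
registered = wait_until(lambda: is_registered("demo"))
delivered = publish("demo", "hello")
```

Polling with a bounded timeout keeps the test deterministic without changing server code; the trade-off is that the test reaches into server-side state to know when it is safe to publish.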

Validation

| Command | Result |
| --- | --- |
| scripts/run_tests.sh tests/plugins/test_achievements_plugin.py tests/run_agent/test_860_dedup.py tests/test_hermes_state.py tests/hermes_cli/test_web_server.py | 338 passed |
| cd ui-tui && npm run type-check | clean |
| cd ui-tui && npm run build | clean |

Not fixed (intentional)

Nix has been failing on main for 5+ consecutive runs due to GH Actions cache TwirpErrorResponse { code: ResourceExhausted } on ubuntu and npm openssl-legacy on macos. Pure infra flake — nothing to fix in the codebase.

Three narrow fixes targeting the remaining red checks after #17828:

1. ui-tui/src/app/slash/commands/ops.ts (Docker Build):
   /reload-mcp's local params type annotated session_id: string
   while ctx.sid is string | null. Widen to string | null —
   matches every other rpc call site and the test harness which passes
   { session_id: null }. Fixes TS2322 on line 86. The rpc signature
   itself is Record<string, unknown>, so this is purely a local
   typing fix, no behavioral change.

2. tests/plugins/test_achievements_plugin.py (13 cascading test failures):
   _install_fake_session_db did a raw sys.modules['hermes_state'] =
   fake_module without restoration, leaking the fake across xdist
   worker boundaries. Downstream tests doing from hermes_state import
   SessionDB got a module whose SessionDB was lambda: fake_db
   — 6 test_hermes_state.py tests failed with AttributeError: 'function'
   object has no attribute '_sanitize_fts5_query' / _contains_cjk,
   and 7 test_860_dedup.py tests failed with TypeError: got unexpected
   keyword argument 'db_path' (real code calls SessionDB(db_path=...)).

   Fix: stash monkeypatch on the plugin_api module object in the
   fixture, and have the helper do monkeypatch.setitem(sys.modules,
   'hermes_state', fake_module) for auto-restoration at test teardown.

3. tests/hermes_cli/test_web_server.py (WS race):
   TestPtyWebSocket::test_pub_broadcasts_to_events_subscribers hit the
   30s test timeout on CI. websocket_connect returns after
   ws.accept() — but /api/events registers the subscriber in
   _event_channels on the NEXT await (inside _event_lock). A
   publish immediately after connect could race ahead of registration
   and be dropped, and the subsequent receive_text() blocked until
   SIGALRM killed the test. Fix: poll _event_channels after the
   subscriber connects, before publishing.

Validation:
scripts/run_tests.sh tests/plugins/test_achievements_plugin.py
                     tests/run_agent/test_860_dedup.py
                     tests/test_hermes_state.py
                     tests/hermes_cli/test_web_server.py    338 passed
cd ui-tui && npm run type-check                             clean
cd ui-tui && npm run build                                  clean

Remaining red checks are pure infra (Nix ubuntu hits
TwirpErrorResponse ResourceExhausted on the GH Actions cache API; Nix
macos bounces between npm build openssl-legacy and cache rate-limits)
and cannot be fixed in the codebase.
@teknium1 merged commit fd07969 into main Apr 30, 2026
11 checks passed
@teknium1 deleted the fix/ci-cleanup-after-17828 branch April 30, 2026 08:34
@alt-glitch added labels Apr 30, 2026: type/test (Test coverage or test infrastructure), P2 (Medium: degraded but workaround exists), comp/tui (Terminal UI: ui-tui/ + tui_gateway/), comp/plugins (Plugin system and bundled plugins), comp/gateway (Gateway runner, session dispatch, delivery)
alt-glitch pushed a commit that referenced this pull request Apr 30, 2026
magic-nix-cache caused recurring CI failures (TwirpErrorResponse
ResourceExhausted) by hitting GitHub Actions Cache's 10 GB limit and
200 req/min rate limit. This was flagged as 'unfixable infra flake' in
#17836 but is actually a fixable architecture choice.

Switch to Cachix (dedicated binary cache, no GHA quota dependency):
- Replace DeterminateSystems/magic-nix-cache-action with cachix/cachix-action
- Add cachix-auth-token input to nix-setup composite action
- Pass CACHIX_AUTH_TOKEN secret through all three nix workflows
- continue-on-error: true so cache failures never block CI

Cache 'hermes-agent' is public at hermes-agent.cachix.org.
Devs can pull locally with: cachix use hermes-agent
alt-glitch added a commit that referenced this pull request Apr 30, 2026
* fix(nix): replace magic-nix-cache with Cachix

* fix: correct cachix-action commit SHA pin

---------

Co-authored-by: Hermes Agent <[email protected]>