feat(gateway): add channel-backed readiness probes#38285

Merged
vincentkoc merged 9 commits into main from vincentkoc-code/gateway-readiness-probes
Mar 6, 2026
Conversation


@vincentkoc vincentkoc commented Mar 6, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: /ready and /readyz already exist on main, but they still behave like shallow probe aliases instead of reflecting whether managed channels are actually ready to serve traffic.
  • Why it matters: container and Kubernetes deployments need readiness to stay separate from liveness, while the route-precedence fixes from Gateway: add healthz/readyz probe endpoints for container checks #31272 and fix(gateway): keep probe routes reachable with root-mounted control ui #38199 must remain intact.
  • What changed: wire a channel-backed readiness checker through gateway bootstrap and HTTP handling so /ready and /readyz return 200 or 503 based on managed channel state, cache readiness evaluations briefly to avoid repeated snapshot rebuilds, and only reveal detailed failing-channel payloads to local or authenticated callers.
  • What did NOT change (scope boundary): /health and /healthz remain shallow liveness probes; plugin route precedence and the root-mounted Control UI probe fix stay unchanged; no new dependencies or config knobs were added.
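The wiring described above (channel-backed evaluation plus a short-lived cache) could be sketched roughly as follows. The type and function names here mirror the PR description, but the shapes and signatures are assumptions for illustration, not the actual gateway code:

```typescript
// Hypothetical sketch of a channel-backed readiness checker with brief caching.
type ChannelSnapshot = { id: string; managed: boolean; connected: boolean };

type ReadinessResult = { ready: boolean; failing: string[] };

function createReadinessChecker(
  getSnapshot: () => ChannelSnapshot[],
  cacheTtlMs = 1000,
  now: () => number = Date.now,
) {
  let cached: { at: number; result: ReadinessResult } | null = null;
  return function check(): ReadinessResult {
    const t = now();
    // Reuse a short-lived cached evaluation so repeated probes do not
    // rebuild the channel snapshot every time.
    if (cached && t - cached.at < cacheTtlMs) return cached.result;
    // Only managed channels count toward readiness.
    const failing = getSnapshot()
      .filter((c) => c.managed && !c.connected)
      .map((c) => c.id);
    cached = { at: t, result: { ready: failing.length === 0, failing } };
    return cached.result;
  };
}
```

The real implementation delegates to the existing channel health policy (with grace and backoff handling); this sketch only shows the snapshot-filter-and-cache structure.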

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • /ready and /readyz now report managed channel readiness instead of a shallow static ready response.
  • Readiness stays green during the startup grace period, stays green through active channel restart backoff windows, and flips to 503 when managed channels are still disconnected after grace expires or disconnect later.
  • Unauthenticated remote readiness probes now get only { ready } plus the 200/503 status, while local or authenticated callers keep the detailed { ready, failing, uptimeMs } payload.
  • /health and /healthz remain unchanged shallow liveness probes.
  • Docker docs now spell out the liveness vs readiness split for the probe endpoints.
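The payload split in the list above can be illustrated with a small sketch. `shapeReadinessPayload` and `readinessStatus` are illustrative names, not the gateway's actual functions; only the field names (`ready`, `failing`, `uptimeMs`) and the 200/503 mapping come from the PR description:

```typescript
// Sketch: detailed payload for local/authenticated callers, boolean-only
// payload for unauthenticated remote callers.
type ReadinessDetails = { ready: boolean; failing: string[]; uptimeMs: number };

function shapeReadinessPayload(
  details: ReadinessDetails,
  includeDetails: boolean,
): Record<string, unknown> {
  // Unauthenticated remote probes only learn the boolean readiness state.
  if (!includeDetails) return { ready: details.ready };
  return details;
}

// The HTTP status tracks readiness regardless of payload shape.
function readinessStatus(details: ReadinessDetails): number {
  return details.ready ? 200 : 503;
}
```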

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node.js 22 + pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): gateway probe layer
  • Relevant config (redacted): default test config with probe endpoints enabled

Steps

  1. Start the gateway with at least one managed channel configured.
  2. Request GET /readyz during startup grace, during a channel restart backoff window, and after a managed channel disconnect.
  3. Request GET /readyz from an unauthenticated remote client and from an authenticated remote client.
  4. Request GET /healthz in the same scenarios.

Expected

  • /ready and /readyz reflect managed channel readiness with 200 or 503.
  • Repeated readiness probes reuse a short-lived cached evaluation instead of rebuilding the snapshot every time.
  • Unauthenticated remote callers do not receive failing channel IDs.
  • /health and /healthz stay shallow liveness probes.

Actual

  • Matches expected in focused gateway tests and local build verification.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Focused verification:

  • pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/server-http.probe.test.ts src/gateway/server/readiness.test.ts src/gateway/server.plugin-http-auth.test.ts
  • pnpm build

Repo-wide note:

  • pnpm check is still blocked on existing main errors in extensions/feishu/src/media.ts (TS2353 on unsupported timeout properties). This PR does not touch that extension.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: local detailed readiness payloads, unauthenticated remote minimal payloads, authenticated remote detailed payloads, HEAD probe behavior, typed internal fallback response, readiness caching, restart-backoff handling, and preserved probe precedence coverage alongside the existing plugin/root-mounted Control UI tests.
  • Edge cases checked: startup grace keeps readiness green; active restart backoff keeps readiness green; stale-socket channels do not force readiness red; /healthz stays shallow even when readiness is red.
  • What you did not verify: end-to-end behavior in a live Kubernetes cluster.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR; probe semantics will fall back to the current shallow /ready behavior.
  • Files/config to restore: src/gateway/server-http.ts, src/gateway/server-runtime-state.ts, src/gateway/server.impl.ts, src/gateway/server/readiness.ts, src/gateway/server-channels.ts, src/gateway/channel-health-policy.ts, the gateway probe/readiness tests, docs/install/docker.md, and CHANGELOG.md.
  • Known bad symptoms reviewers should watch for: unexpected 503 readiness responses on healthy managed channels after startup grace expires, or unexpected detailed readiness payloads returned to unauthenticated remote callers.

Risks and Mitigations

  • Risk: a managed channel that is intentionally disconnected after startup grace will mark readiness red and keep the pod out of service.
    • Mitigation: this uses the existing channel health policy, only counts managed channels, preserves startup grace, keeps active restart backoff green, and still surfaces hard failures once retries are exhausted.
  • Risk: unauthenticated readiness probes could become a hot path on exposed gateways.
    • Mitigation: readiness evaluations are cached briefly, and unauthenticated remote callers only get the boolean readiness state instead of per-channel details.

@vincentkoc vincentkoc self-assigned this Mar 6, 2026
@openclaw-barnacle openclaw-barnacle bot added labels: docs (Improvements or additions to documentation), gateway (Gateway runtime), docker (Docker and sandbox tooling), size: M, maintainer (Maintainer-authored PR) — Mar 6, 2026
@vincentkoc vincentkoc marked this pull request as ready for review March 6, 2026 19:43

aisle-research-bot bot commented Mar 6, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue in this PR:

1. 🟠 High — Unthrottled auth oracle on /ready readiness details enables brute-force of gateway shared secret

  • Severity: High
  • CWE: CWE-307
  • Location: src/gateway/server-http.ts:173-180

Description

The /ready and /readyz probe endpoints now conditionally reveal a more detailed readiness payload (including failing channel IDs and uptime) when a request is considered local or when authorizeHttpGatewayConnect() succeeds.

For remote requests, the code attempts authentication using the Authorization: Bearer ... header, but does not pass the configured rateLimiter to authorizeHttpGatewayConnect(). This creates an unthrottled authentication oracle:

  • Input: attacker-controlled Authorization header parsed by getBearerToken(req)
  • Auth check: authorizeHttpGatewayConnect({ connectAuth: { token, password: token }, ... }) is called without rateLimiter
  • Oracle: response body shape differs depending on auth success ({ready} vs {ready,failing,uptimeMs}), enabling high-rate guessing of the gateway token/password.

Because the same shared secret protects other gateway surfaces, successful guessing here can enable broader unauthorized access.

Vulnerable code:

const bearerToken = getBearerToken(params.req);
const authResult = await authorizeHttpGatewayConnect({
  auth: params.resolvedAuth,
  connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
  req: params.req,
  trustedProxies: params.trustedProxies,
  allowRealIpFallback: params.allowRealIpFallback,
});
return authResult.ok;

Recommendation

Apply brute-force protection to the /ready readiness-details auth check, or avoid authenticating on probe endpoints.

Option A (recommended): pass the gateway rateLimiter into this auth attempt so failed guesses are throttled consistently:

async function canRevealReadinessDetails(params: {
  req: IncomingMessage;
  resolvedAuth: ResolvedGatewayAuth;
  trustedProxies: string[];
  allowRealIpFallback: boolean;
  rateLimiter?: AuthRateLimiter;
}): Promise<boolean> {
  if (isLocalDirectRequest(params.req, params.trustedProxies, params.allowRealIpFallback)) {
    return true;
  }
  if (params.resolvedAuth.mode === "none") {
    return false;
  }

  const bearerToken = getBearerToken(params.req);
  const authResult = await authorizeHttpGatewayConnect({
    auth: params.resolvedAuth,
    connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
    req: params.req,
    trustedProxies: params.trustedProxies,
    allowRealIpFallback: params.allowRealIpFallback,
    rateLimiter: params.rateLimiter,
  });
  return authResult.ok;
}

Then thread rateLimiter through handleGatewayProbeRequest(...) from createGatewayHttpServer().

Option B: make /ready always return the shallow {ready:boolean} response for remote requests (even if authenticated), and expose detailed readiness only on a separate, explicitly-authenticated endpoint behind the normal auth middleware + rate limiter.

Also consider wrapping the auth call in a try/catch and treating errors as includeDetails=false to avoid turning unexpected exceptions into noisy 500s.
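The fail-closed behavior suggested in that last point could look like the following. `revealDetailsFailClosed` is a hypothetical helper name; the inner `check` callback stands in for the `canRevealReadinessDetails` call shown in Option A:

```typescript
// Fail-closed wrapper: any error thrown by the auth check is treated as
// "do not reveal details" instead of bubbling up as a 500.
async function revealDetailsFailClosed(
  check: () => Promise<boolean>,
): Promise<boolean> {
  try {
    return await check();
  } catch {
    // Treat unexpected auth errors as includeDetails = false.
    return false;
  }
}
```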


Analyzed PR: #38285 at commit a2c9d38

Last updated on: 2026-03-06T20:35:49Z


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a26946c8b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@vincentkoc vincentkoc force-pushed the vincentkoc-code/gateway-readiness-probes branch from a26946c to 3525479 on March 6, 2026 20:03

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR upgrades /ready and /readyz from shallow static probes to channel-backed readiness probes, returning { ready, failing, uptimeMs } with 200 or 503 based on managed channel state. /health and /healthz remain unchanged shallow liveness probes. The implementation wires a new ReadinessChecker closure through gateway bootstrap (server.impl.ts → server-runtime-state.ts → server-http.ts) and delegates to the existing evaluateChannelHealth policy for consistency with the health-state poller.

Key changes:

  • src/gateway/server/readiness.ts (new): createReadinessChecker iterates channelAccounts from the channel manager snapshot and uses evaluateChannelHealth with 120 s connect grace and 30 min stale-socket threshold; stale-socket and unmanaged channels are deliberately excluded from failing.
  • src/gateway/server-http.ts: handleGatewayProbeRequest now accepts an optional ReadinessChecker; the status === "ready" branch uses it with a typed error fallback; HEAD requests correctly inherit the actual status code with no body.
  • src/gateway/server.impl.ts: createChannelManager is moved earlier in the startup sequence (before createGatewayRuntimeState) so getReadiness is available when the HTTP server is constructed. The synchronous setRuntime(running: true, lastStartAt: now) call in startChannelInternal means startup-connect-grace kicks in almost immediately after startChannels() — the very short not-running window before that is acceptable.
  • Tests (server/readiness.test.ts, server-http.probe.test.ts): cover healthy/failing channels, disabled/unconfigured channel filtering, startup grace, stale-socket exclusion, /healthz isolation, and HEAD probe behavior. One gap: no explicit test for the running: false (not-running / restart-backoff) state after startup grace — see inline comment.
  • Docs (docs/install/docker.md): adds clear liveness vs. readiness explanation.

Confidence Score: 4/5

  • Safe to merge; the behavioral change is intentional and well-tested, with one documented behavioral edge case during restart backoffs worth understanding before deploying to Kubernetes.
  • The implementation is clean and uses the existing evaluateChannelHealth policy correctly. Tests cover the major scenarios. The two minor style comments (restart-backoff 503 not explicitly tested, grace constants not co-located with the shared policy file) are non-blocking. The main real-world concern — 503 readiness during channel restart backoff windows (5 s to 5 min) — is an acknowledged trade-off documented in the PR description. Deployment to Kubernetes should confirm that readiness probe failureThreshold and periodSeconds are tuned to tolerate this.
  • src/gateway/server/readiness.ts and src/gateway/server.impl.ts deserve a close read for the startup sequencing and restart-backoff behavior described in the inline comments.

Last reviewed commit: 3525479

Comment on lines +41 to +44
if (!health.healthy && health.reason !== "unmanaged" && health.reason !== "stale-socket") {
failing.push(channelId);
break;
}


not-running during restart backoff not excluded from failing

evaluateChannelHealth checks !snapshot.running before the startup-connect-grace logic, so a channel that is briefly running: false during a restart backoff (5 s up to ~5 min per CHANNEL_RESTART_POLICY) will land here as not-running and be added to failing. The result: readiness returns 503 for the full duration of each backoff window, even though the channel is actively restarting and will re-enter startup-connect-grace once setRuntime(…, { running: true, lastStartAt: now }) fires.

For the first reconnect this is only ~5 s, but with exponential backoff it can reach 5 min, and Kubernetes will drain the pod well before that. The PR description notes this risk at a high level, but since the stale-socket and unmanaged reasons are already explicitly excluded, it may be worth also considering whether not-running + reconnectAttempts < MAX_RESTART_ATTEMPTS (i.e. "actively restarting") should be treated more leniently, or at least adding a test that documents the current "503 during backoff" contract so a future reader understands it's intentional:

// No test currently covers running: false (not-running) for a managed channel after
// startup grace — during restart backoff the probe will return 503 for the entire
// backoff duration (up to 5 min on the 10th attempt).
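The leniency the comment floats — treating `not-running` with retries remaining as "actively restarting" rather than failing — could be sketched as below. `MAX_RESTART_ATTEMPTS = 10` and the snapshot shape are assumptions taken from the comment's wording, not verified against the repository:

```typescript
// Sketch of the suggested leniency: a managed channel that is down but still
// inside its restart-backoff retry budget is "actively restarting" and would
// not be added to `failing`.
const MAX_RESTART_ATTEMPTS = 10; // assumed from CHANNEL_RESTART_POLICY

function isActivelyRestarting(s: {
  running: boolean;
  reconnectAttempts: number;
}): boolean {
  return !s.running && s.reconnectAttempts < MAX_RESTART_ATTEMPTS;
}
```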

@vincentkoc vincentkoc merged commit ab5fcfc into main Mar 6, 2026
29 of 31 checks passed
@vincentkoc vincentkoc deleted the vincentkoc-code/gateway-readiness-probes branch March 6, 2026 20:15

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2c9d383f0


Comment on lines +174 to +177
const authResult = await authorizeHttpGatewayConnect({
auth: params.resolvedAuth,
connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
req: params.req,


P2 Badge Rate-limit readiness auth checks for detail disclosure

This readiness-detail gate authenticates bearer credentials with authorizeHttpGatewayConnect, but it does not pass the gateway's auth rate limiter, so /ready becomes an unlimited credential oracle in token/password mode: a remote caller can brute-force Authorization: Bearer ... values and detect a correct secret when the response shape changes from { ready: ... } to the detailed payload. That bypasses the lockout behavior used on normal authenticated HTTP surfaces and weakens shared-secret protection.


mrosmarin added a commit to mrosmarin/openclaw that referenced this pull request Mar 6, 2026
* main: (37 commits)
  feat(gateway): add channel-backed readiness probes (openclaw#38285)
  CI: enable report-only Knip deadcode job
  Tooling: wire deadcode scripts to Knip
  Tooling: add Knip workspace config
  CI: skip detect-secrets on main temporarily
  Install Smoke: fetch docs base on demand
  CI: fetch base history on demand
  CI: add base-commit fetch helper
  Docs: clarify main secret scan behavior
  CI: keep full secret scans on main
  Docs: update secret scan reproduction steps
  CI: scope secret scans to changed files
  Media: reject spoofed input_image MIME payloads (openclaw#38289)
  chore: code/dead tests cleanup (openclaw#38286)
  Install Smoke: cache docker smoke builds
  Install Smoke: allow reusing prebuilt test images
  Install Smoke: shallow docs-scope checkout
  CI: shallow scope checkouts
  feat(onboarding): add web search to onboarding flow (openclaw#34009)
  chore: disable contributor labels
  ...
vincentkoc added a commit to BryanTegomoh/openclaw-fork that referenced this pull request Mar 8, 2026
* Changelog: add channel-backed readiness probe entry

* Gateway: add channel-backed readiness probes

* Docs: describe readiness probe behavior

* Gateway: add readiness probe regression tests

* Changelog: dedupe gateway probe entries

* Docs: fix readiness startup grace description

* Changelog: remove stale readiness entry

* Gateway: cover readiness hardening

* Gateway: harden readiness probes
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026 (cherry picked from commit ab5fcfc)
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026 (cherry picked from commit ab5fcfc)

(Each of these pushes carries the same nine-commit list shown under the first referencing commit above.)

Labels

docker (Docker and sandbox tooling), docs (Improvements or additions to documentation), gateway (Gateway runtime), maintainer (Maintainer-authored PR), size: L
