feat(gateway): add channel-backed readiness probes#38285

Merged
vincentkoc merged 9 commits into main from vincentkoc-code/gateway-readiness-probes
Mar 6, 2026
Conversation


@vincentkoc vincentkoc commented Mar 6, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: /ready and /readyz already exist on main, but they still behave like shallow probe aliases instead of reflecting whether managed channels are actually ready to serve traffic.
  • Why it matters: container and Kubernetes deployments need readiness to stay separate from liveness, while the route-precedence fixes from Gateway: add healthz/readyz probe endpoints for container checks #31272 and fix(gateway): keep probe routes reachable with root-mounted control ui #38199 must remain intact.
  • What changed: wire a channel-backed readiness checker through gateway bootstrap and HTTP handling so /ready and /readyz return 200 or 503 based on managed channel state, cache readiness evaluations briefly to avoid repeated snapshot rebuilds, and only reveal detailed failing-channel payloads to local or authenticated callers.
  • What did NOT change (scope boundary): /health and /healthz remain shallow liveness probes; plugin route precedence and the root-mounted Control UI probe fix stay unchanged; no new dependencies or config knobs were added.
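The wiring described above (channel-backed evaluation plus a short-lived cache) could be sketched roughly as follows. The type and function names here mirror the PR description, but the shapes and signatures are assumptions for illustration, not the actual gateway code:

```typescript
// Hypothetical sketch of a channel-backed readiness checker with brief caching.
type ChannelSnapshot = { id: string; managed: boolean; connected: boolean };

type ReadinessResult = { ready: boolean; failing: string[] };

function createReadinessChecker(
  getSnapshot: () => ChannelSnapshot[],
  cacheTtlMs = 1000,
  now: () => number = Date.now,
) {
  let cached: { at: number; result: ReadinessResult } | null = null;
  return function check(): ReadinessResult {
    const t = now();
    // Reuse a short-lived cached evaluation so repeated probes do not
    // rebuild the channel snapshot every time.
    if (cached && t - cached.at < cacheTtlMs) return cached.result;
    // Only managed channels count toward readiness.
    const failing = getSnapshot()
      .filter((c) => c.managed && !c.connected)
      .map((c) => c.id);
    cached = { at: t, result: { ready: failing.length === 0, failing } };
    return cached.result;
  };
}
```

The real implementation delegates to the existing channel health policy (with grace and backoff handling); this sketch only shows the snapshot-filter-and-cache structure.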

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • /ready and /readyz now report managed channel readiness instead of a shallow static ready response.
  • Readiness stays green during the startup grace period, stays green through active channel restart backoff windows, and flips to 503 when managed channels are still disconnected after grace expires or disconnect later.
  • Unauthenticated remote readiness probes now get only { ready } plus the 200/503 status, while local or authenticated callers keep the detailed { ready, failing, uptimeMs } payload.
  • /health and /healthz remain unchanged shallow liveness probes.
  • Docker docs now spell out the liveness vs readiness split for the probe endpoints.
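The payload split in the list above can be illustrated with a small sketch. `shapeReadinessPayload` and `readinessStatus` are illustrative names, not the gateway's actual functions; only the field names (`ready`, `failing`, `uptimeMs`) and the 200/503 mapping come from the PR description:

```typescript
// Sketch: detailed payload for local/authenticated callers, boolean-only
// payload for unauthenticated remote callers.
type ReadinessDetails = { ready: boolean; failing: string[]; uptimeMs: number };

function shapeReadinessPayload(
  details: ReadinessDetails,
  includeDetails: boolean,
): Record<string, unknown> {
  // Unauthenticated remote probes only learn the boolean readiness state.
  if (!includeDetails) return { ready: details.ready };
  return details;
}

// The HTTP status tracks readiness regardless of payload shape.
function readinessStatus(details: ReadinessDetails): number {
  return details.ready ? 200 : 503;
}
```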

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node.js 22 + pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): gateway probe layer
  • Relevant config (redacted): default test config with probe endpoints enabled

Steps

  1. Start the gateway with at least one managed channel configured.
  2. Request GET /readyz during startup grace, during a channel restart backoff window, and after a managed channel disconnect.
  3. Request GET /readyz from an unauthenticated remote client and from an authenticated remote client.
  4. Request GET /healthz in the same scenarios.

Expected

  • /ready and /readyz reflect managed channel readiness with 200 or 503.
  • Repeated readiness probes reuse a short-lived cached evaluation instead of rebuilding the snapshot every time.
  • Unauthenticated remote callers do not receive failing channel IDs.
  • /health and /healthz stay shallow liveness probes.

Actual

  • Matches expected in focused gateway tests and local build verification.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Focused verification:

  • pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/server-http.probe.test.ts src/gateway/server/readiness.test.ts src/gateway/server.plugin-http-auth.test.ts
  • pnpm build

Repo-wide note:

  • pnpm check is still blocked on existing main errors in extensions/feishu/src/media.ts (TS2353 on unsupported timeout properties). This PR does not touch that extension.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: local detailed readiness payloads, unauthenticated remote minimal payloads, authenticated remote detailed payloads, HEAD probe behavior, typed internal fallback response, readiness caching, restart-backoff handling, and preserved probe precedence coverage alongside the existing plugin/root-mounted Control UI tests.
  • Edge cases checked: startup grace keeps readiness green; active restart backoff keeps readiness green; stale-socket channels do not force readiness red; /healthz stays shallow even when readiness is red.
  • What you did not verify: end-to-end behavior in a live Kubernetes cluster.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR; probe semantics will fall back to the current shallow /ready behavior.
  • Files/config to restore: src/gateway/server-http.ts, src/gateway/server-runtime-state.ts, src/gateway/server.impl.ts, src/gateway/server/readiness.ts, src/gateway/server-channels.ts, src/gateway/channel-health-policy.ts, the gateway probe/readiness tests, docs/install/docker.md, and CHANGELOG.md.
  • Known bad symptoms reviewers should watch for: unexpected 503 readiness responses on healthy managed channels after startup grace expires, or unexpected detailed readiness payloads returned to unauthenticated remote callers.

Risks and Mitigations

  • Risk: a managed channel that is intentionally disconnected after startup grace will mark readiness red and keep the pod out of service.
    • Mitigation: this uses the existing channel health policy, only counts managed channels, preserves startup grace, keeps active restart backoff green, and still surfaces hard failures once retries are exhausted.
  • Risk: unauthenticated readiness probes could become a hot path on exposed gateways.
    • Mitigation: readiness evaluations are cached briefly, and unauthenticated remote callers only get the boolean readiness state instead of per-channel details.

@vincentkoc vincentkoc self-assigned this Mar 6, 2026
@openclaw-barnacle openclaw-barnacle bot added labels: docs (Improvements or additions to documentation), gateway (Gateway runtime), docker (Docker and sandbox tooling), size: M, maintainer (Maintainer-authored PR) — Mar 6, 2026
@vincentkoc vincentkoc marked this pull request as ready for review March 6, 2026 19:43

aisle-research-bot bot commented Mar 6, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue in this PR:

1. 🟠 High — Unthrottled auth oracle on /ready readiness details enables brute-force of gateway shared secret

  • Severity: High
  • CWE: CWE-307
  • Location: src/gateway/server-http.ts:173-180

Description

The /ready and /readyz probe endpoints now conditionally reveal a more detailed readiness payload (including failing channel IDs and uptime) when a request is considered local or when authorizeHttpGatewayConnect() succeeds.

For remote requests, the code attempts authentication using the Authorization: Bearer ... header, but does not pass the configured rateLimiter to authorizeHttpGatewayConnect(). This creates an unthrottled authentication oracle:

  • Input: attacker-controlled Authorization header parsed by getBearerToken(req)
  • Auth check: authorizeHttpGatewayConnect({ connectAuth: { token, password: token }, ... }) is called without rateLimiter
  • Oracle: response body shape differs depending on auth success ({ready} vs {ready,failing,uptimeMs}), enabling high-rate guessing of the gateway token/password.

Because the same shared secret protects other gateway surfaces, successful guessing here can enable broader unauthorized access.

Vulnerable code:

const bearerToken = getBearerToken(params.req);
const authResult = await authorizeHttpGatewayConnect({
  auth: params.resolvedAuth,
  connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
  req: params.req,
  trustedProxies: params.trustedProxies,
  allowRealIpFallback: params.allowRealIpFallback,
});
return authResult.ok;

Recommendation

Apply brute-force protection to the /ready readiness-details auth check, or avoid authenticating on probe endpoints.

Option A (recommended): pass the gateway rateLimiter into this auth attempt so failed guesses are throttled consistently:

async function canRevealReadinessDetails(params: {
  req: IncomingMessage;
  resolvedAuth: ResolvedGatewayAuth;
  trustedProxies: string[];
  allowRealIpFallback: boolean;
  rateLimiter?: AuthRateLimiter;
}): Promise<boolean> {
  if (isLocalDirectRequest(params.req, params.trustedProxies, params.allowRealIpFallback)) {
    return true;
  }
  if (params.resolvedAuth.mode === "none") {
    return false;
  }

  const bearerToken = getBearerToken(params.req);
  const authResult = await authorizeHttpGatewayConnect({
    auth: params.resolvedAuth,
    connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
    req: params.req,
    trustedProxies: params.trustedProxies,
    allowRealIpFallback: params.allowRealIpFallback,
    rateLimiter: params.rateLimiter,
  });
  return authResult.ok;
}

Then thread rateLimiter through handleGatewayProbeRequest(...) from createGatewayHttpServer().

Option B: make /ready always return the shallow {ready:boolean} response for remote requests (even if authenticated), and expose detailed readiness only on a separate, explicitly-authenticated endpoint behind the normal auth middleware + rate limiter.

Also consider wrapping the auth call in a try/catch and treating errors as includeDetails=false to avoid turning unexpected exceptions into noisy 500s.
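The fail-closed behavior suggested in that last point could look like the following. `revealDetailsFailClosed` is a hypothetical helper name; the inner `check` callback stands in for the `canRevealReadinessDetails` call shown in Option A:

```typescript
// Fail-closed wrapper: any error thrown by the auth check is treated as
// "do not reveal details" instead of bubbling up as a 500.
async function revealDetailsFailClosed(
  check: () => Promise<boolean>,
): Promise<boolean> {
  try {
    return await check();
  } catch {
    // Treat unexpected auth errors as includeDetails = false.
    return false;
  }
}
```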


Analyzed PR: #38285 at commit a2c9d38

Last updated on: 2026-03-06T20:35:49Z


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a26946c8b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@vincentkoc vincentkoc force-pushed the vincentkoc-code/gateway-readiness-probes branch from a26946c to 3525479 on March 6, 2026 20:03

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR upgrades /ready and /readyz from shallow static probes to channel-backed readiness probes, returning { ready, failing, uptimeMs } with 200 or 503 based on managed channel state. /health and /healthz remain unchanged shallow liveness probes. The implementation wires a new ReadinessChecker closure through gateway bootstrap (server.impl.ts → server-runtime-state.ts → server-http.ts) and delegates to the existing evaluateChannelHealth policy for consistency with the health-state poller.

Key changes:

  • src/gateway/server/readiness.ts (new): createReadinessChecker iterates channelAccounts from the channel manager snapshot and uses evaluateChannelHealth with 120 s connect grace and 30 min stale-socket threshold; stale-socket and unmanaged channels are deliberately excluded from failing.
  • src/gateway/server-http.ts: handleGatewayProbeRequest now accepts an optional ReadinessChecker; the status === "ready" branch uses it with a typed error fallback; HEAD requests correctly inherit the actual status code with no body.
  • src/gateway/server.impl.ts: createChannelManager is moved earlier in the startup sequence (before createGatewayRuntimeState) so getReadiness is available when the HTTP server is constructed. The synchronous setRuntime(running: true, lastStartAt: now) call in startChannelInternal means startup-connect-grace kicks in almost immediately after startChannels() — the very short not-running window before that is acceptable.
  • Tests (server/readiness.test.ts, server-http.probe.test.ts): cover healthy/failing channels, disabled/unconfigured channel filtering, startup grace, stale-socket exclusion, /healthz isolation, and HEAD probe behavior. One gap: no explicit test for the running: false (not-running / restart-backoff) state after startup grace — see inline comment.
  • Docs (docs/install/docker.md): adds clear liveness vs. readiness explanation.

Confidence Score: 4/5

  • Safe to merge; the behavioral change is intentional and well-tested, with one documented behavioral edge case during restart backoffs worth understanding before deploying to Kubernetes.
  • The implementation is clean and uses the existing evaluateChannelHealth policy correctly. Tests cover the major scenarios. The two minor style comments (restart-backoff 503 not explicitly tested, grace constants not co-located with the shared policy file) are non-blocking. The main real-world concern — 503 readiness during channel restart backoff windows (5 s to 5 min) — is an acknowledged trade-off documented in the PR description. Deployment to Kubernetes should confirm that readiness probe failureThreshold and periodSeconds are tuned to tolerate this.
  • src/gateway/server/readiness.ts and src/gateway/server.impl.ts deserve a close read for the startup sequencing and restart-backoff behavior described in the inline comments.

Last reviewed commit: 3525479

Comment on lines +41 to +44
if (!health.healthy && health.reason !== "unmanaged" && health.reason !== "stale-socket") {
failing.push(channelId);
break;
}


not-running during restart backoff not excluded from failing

evaluateChannelHealth checks !snapshot.running before the startup-connect-grace logic, so a channel that is briefly running: false during a restart backoff (5 s up to ~5 min per CHANNEL_RESTART_POLICY) will land here as not-running and be added to failing. The result: readiness returns 503 for the full duration of each backoff window, even though the channel is actively restarting and will re-enter startup-connect-grace once setRuntime(…, { running: true, lastStartAt: now }) fires.

For the first reconnect this is only ~5 s, but with exponential backoff it can reach 5 min, and Kubernetes will drain the pod well before that. The PR description notes this risk at a high level, but since the stale-socket and unmanaged reasons are already explicitly excluded, it may be worth also considering whether not-running + reconnectAttempts < MAX_RESTART_ATTEMPTS (i.e. "actively restarting") should be treated more leniently, or at least adding a test that documents the current "503 during backoff" contract so a future reader understands it's intentional:

// No test currently covers running: false (not-running) for a managed channel after
// startup grace — during restart backoff the probe will return 503 for the entire
// backoff duration (up to 5 min on the 10th attempt).
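The leniency the comment floats — treating `not-running` with retries remaining as "actively restarting" rather than failing — could be sketched as below. `MAX_RESTART_ATTEMPTS = 10` and the snapshot shape are assumptions taken from the comment's wording, not verified against the repository:

```typescript
// Sketch of the suggested leniency: a managed channel that is down but still
// inside its restart-backoff retry budget is "actively restarting" and would
// not be added to `failing`.
const MAX_RESTART_ATTEMPTS = 10; // assumed from CHANNEL_RESTART_POLICY

function isActivelyRestarting(s: {
  running: boolean;
  reconnectAttempts: number;
}): boolean {
  return !s.running && s.reconnectAttempts < MAX_RESTART_ATTEMPTS;
}
```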

@vincentkoc vincentkoc merged commit ab5fcfc into main Mar 6, 2026
29 of 31 checks passed
@vincentkoc vincentkoc deleted the vincentkoc-code/gateway-readiness-probes branch March 6, 2026 20:15

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2c9d383f0


Comment on lines +174 to +177
const authResult = await authorizeHttpGatewayConnect({
auth: params.resolvedAuth,
connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
req: params.req,


P2 Badge Rate-limit readiness auth checks for detail disclosure

This readiness-detail gate authenticates bearer credentials with authorizeHttpGatewayConnect, but it does not pass the gateway's auth rate limiter, so /ready becomes an unlimited credential oracle in token/password mode: a remote caller can brute-force Authorization: Bearer ... values and detect a correct secret when the response shape changes from { ready: ... } to the detailed payload. That bypasses the lockout behavior used on normal authenticated HTTP surfaces and weakens shared-secret protection.


mrosmarin added a commit to mrosmarin/openclaw that referenced this pull request Mar 6, 2026
* main: (37 commits)
  feat(gateway): add channel-backed readiness probes (openclaw#38285)
  CI: enable report-only Knip deadcode job
  Tooling: wire deadcode scripts to Knip
  Tooling: add Knip workspace config
  CI: skip detect-secrets on main temporarily
  Install Smoke: fetch docs base on demand
  CI: fetch base history on demand
  CI: add base-commit fetch helper
  Docs: clarify main secret scan behavior
  CI: keep full secret scans on main
  Docs: update secret scan reproduction steps
  CI: scope secret scans to changed files
  Media: reject spoofed input_image MIME payloads (openclaw#38289)
  chore: code/dead tests cleanup (openclaw#38286)
  Install Smoke: cache docker smoke builds
  Install Smoke: allow reusing prebuilt test images
  Install Smoke: shallow docs-scope checkout
  CI: shallow scope checkouts
  feat(onboarding): add web search to onboarding flow (openclaw#34009)
  chore: disable contributor labels
  ...
vincentkoc added a commit to BryanTegomoh/openclaw-fork that referenced this pull request Mar 8, 2026
* Changelog: add channel-backed readiness probe entry

* Gateway: add channel-backed readiness probes

* Docs: describe readiness probe behavior

* Gateway: add readiness probe regression tests

* Changelog: dedupe gateway probe entries

* Docs: fix readiness startup grace description

* Changelog: remove stale readiness entry

* Gateway: cover readiness hardening

* Gateway: harden readiness probes
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026 (cherry picked from commit ab5fcfc)
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026 (cherry picked from commit ab5fcfc)

(Each of these pushes carries the same nine-commit list shown under the first referencing commit above.)

Labels

docker (Docker and sandbox tooling), docs (Improvements or additions to documentation), gateway (Gateway runtime), maintainer (Maintainer-authored PR), size: L
