feat(gateway): add channel-backed readiness probes #38285
Conversation
🔒 Aisle Security Analysis
We found 1 potential security issue in this PR:
1. 🟠 Unthrottled auth oracle on /ready readiness details enables brute-force of gateway shared secret
Description

For remote requests, the code attempts authentication using the bearer token from the request. Because the same shared secret protects other gateway surfaces, successful guessing here can enable broader unauthorized access.

Vulnerable code:

```ts
const bearerToken = getBearerToken(params.req);
const authResult = await authorizeHttpGatewayConnect({
  auth: params.resolvedAuth,
  connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
  req: params.req,
  trustedProxies: params.trustedProxies,
  allowRealIpFallback: params.allowRealIpFallback,
});
return authResult.ok;
```

Recommendation

Apply brute-force protection to the readiness-detail auth check.

Option A (recommended): pass the gateway's auth rate limiter into the check:

```ts
async function canRevealReadinessDetails(params: {
  req: IncomingMessage;
  resolvedAuth: ResolvedGatewayAuth;
  trustedProxies: string[];
  allowRealIpFallback: boolean;
  rateLimiter?: AuthRateLimiter;
}): Promise<boolean> {
  if (isLocalDirectRequest(params.req, params.trustedProxies, params.allowRealIpFallback)) {
    return true;
  }
  if (params.resolvedAuth.mode === "none") {
    return false;
  }
  const bearerToken = getBearerToken(params.req);
  const authResult = await authorizeHttpGatewayConnect({
    auth: params.resolvedAuth,
    connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
    req: params.req,
    trustedProxies: params.trustedProxies,
    allowRealIpFallback: params.allowRealIpFallback,
    rateLimiter: params.rateLimiter,
  });
  return authResult.ok;
}
```

Then thread the rate limiter through from the caller. Option B: make […]. Also consider wrapping the auth call in a […].

Analyzed PR: #38285 at commit […]. Last updated on: 2026-03-06T20:35:49Z
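The fix above threads an `AuthRateLimiter` into the readiness-detail check. The gateway's real limiter API is not shown in this thread, so the class below (its `check`/`recordFailure`/`recordSuccess` methods and its thresholds) is a hypothetical sketch of what such a limiter can look like, not the actual implementation:

```typescript
// Hypothetical per-IP limiter with exponential lockout after repeated
// failed bearer-token attempts. Names and thresholds are illustrative.
interface LimiterEntry {
  failures: number;
  lockedUntil: number; // epoch ms; 0 means not locked
}

class AuthRateLimiter {
  private entries = new Map<string, LimiterEntry>();

  constructor(
    private maxFailures = 5,
    private baseLockoutMs = 30_000,
    private now: () => number = Date.now,
  ) {}

  // False while the client is locked out; callers should fail closed
  // (skip the auth attempt and return the non-detailed payload).
  check(clientIp: string): boolean {
    const entry = this.entries.get(clientIp);
    return !entry || entry.lockedUntil <= this.now();
  }

  recordFailure(clientIp: string): void {
    const entry = this.entries.get(clientIp) ?? { failures: 0, lockedUntil: 0 };
    entry.failures += 1;
    if (entry.failures >= this.maxFailures) {
      // Exponential lockout: base, 2x base, 4x base, ... per extra failure.
      const exponent = entry.failures - this.maxFailures;
      entry.lockedUntil = this.now() + this.baseLockoutMs * 2 ** exponent;
    }
    this.entries.set(clientIp, entry);
  }

  recordSuccess(clientIp: string): void {
    this.entries.delete(clientIp);
  }
}
```

With this shape, `canRevealReadinessDetails` would call `check` before attempting auth and `recordFailure` after a failed attempt, matching the lockout behavior of other authenticated gateway surfaces.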
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a26946c8b9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from a26946c to 3525479.
Greptile Summary

This PR upgrades `/ready` and `/readyz` from shallow probe aliases to channel-backed readiness probes.

Key changes: […]
Confidence Score: 4/5
Last reviewed commit: 3525479
src/gateway/server/readiness.ts (lines 41-44, outdated)

```ts
if (!health.healthy && health.reason !== "unmanaged" && health.reason !== "stale-socket") {
  failing.push(channelId);
  break;
}
```
not-running during restart backoff not excluded from failing
evaluateChannelHealth checks !snapshot.running before the startup-connect-grace logic, so a channel that is briefly running: false during a restart backoff (5 s up to ~5 min per CHANNEL_RESTART_POLICY) will land here as not-running and be added to failing. The result: readiness returns 503 for the full duration of each backoff window, even though the channel is actively restarting and will re-enter startup-connect-grace once setRuntime(…, { running: true, lastStartAt: now }) fires.
For the first reconnect this is only ~5 s, but with exponential backoff it can reach 5 min, and Kubernetes will drain the pod well before that. The PR description notes this risk at a high level, but since the stale-socket and unmanaged reasons are already explicitly excluded, it may be worth also considering whether not-running + reconnectAttempts < MAX_RESTART_ATTEMPTS (i.e. "actively restarting") should be treated more leniently, or at least adding a test that documents the current "503 during backoff" contract so a future reader understands it's intentional:
```ts
// No test currently covers running: false (not-running) for a managed channel after
// startup grace — during restart backoff the probe will return 503 for the entire
// backoff duration (up to 5 min on the 10th attempt).
```
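One way to implement the lenient treatment suggested above is a small, pure classification step. The snapshot shape, the `MAX_RESTART_ATTEMPTS` value, and the function name below are paraphrased assumptions for illustration, not the gateway's actual types:

```typescript
// Illustrative only: a channel that is running: false but still within its
// restart budget is classified as "restart-backoff", which readiness could
// then exclude from `failing` alongside "stale-socket" and "unmanaged".
const MAX_RESTART_ATTEMPTS = 10; // assumed from the "10th attempt" note above

interface ChannelSnapshot {
  running: boolean;
  reconnectAttempts: number;
}

type HealthReason = "ok" | "restart-backoff" | "not-running";

function classifyChannel(snapshot: ChannelSnapshot): HealthReason {
  if (snapshot.running) return "ok";
  // Actively restarting: the supervisor will flip running back to true and
  // re-enter startup-connect-grace once the backoff window elapses.
  if (snapshot.reconnectAttempts < MAX_RESTART_ATTEMPTS) return "restart-backoff";
  return "not-running";
}
```

Whether "restart-backoff" should count as ready is a policy choice; the sketch only shows how to make that choice explicit instead of folding backoff into "not-running".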
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2c9d383f0
```ts
const authResult = await authorizeHttpGatewayConnect({
  auth: params.resolvedAuth,
  connectAuth: bearerToken ? { token: bearerToken, password: bearerToken } : null,
  req: params.req,
```
Rate-limit readiness auth checks for detail disclosure
This readiness-detail gate authenticates bearer credentials with authorizeHttpGatewayConnect, but it does not pass the gateway's auth rate limiter, so /ready becomes an unlimited credential oracle in token/password mode: a remote caller can brute-force Authorization: Bearer ... values and detect a correct secret when the response shape changes from { ready: ... } to the detailed payload. That bypasses the lockout behavior used on normal authenticated HTTP surfaces and weakens shared-secret protection.
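The oracle exists because the response shape differs by caller. A minimal sketch of that gating contract follows; the field names track the `{ ready, failing, uptimeMs }` payload described in this PR, but the function itself is illustrative, not the gateway's code:

```typescript
// Remote unauthenticated callers see only { ready }; local or authenticated
// callers also see failing channels and uptime. The HTTP status always
// reflects readiness, so the status itself reveals nothing extra.
interface ReadinessState {
  ready: boolean;
  failing: string[];
  uptimeMs: number;
}

function readinessResponse(state: ReadinessState, canRevealDetails: boolean) {
  const status = state.ready ? 200 : 503;
  const body = canRevealDetails
    ? { ready: state.ready, failing: state.failing, uptimeMs: state.uptimeMs }
    : { ready: state.ready };
  return { status, body };
}
```

Because the detailed branch is what a brute-forcer watches for, rate-limiting the auth check that sets `canRevealDetails` closes the oracle without changing this response contract.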
* main: (37 commits)
  - feat(gateway): add channel-backed readiness probes (openclaw#38285)
  - CI: enable report-only Knip deadcode job
  - Tooling: wire deadcode scripts to Knip
  - Tooling: add Knip workspace config
  - CI: skip detect-secrets on main temporarily
  - Install Smoke: fetch docs base on demand
  - CI: fetch base history on demand
  - CI: add base-commit fetch helper
  - Docs: clarify main secret scan behavior
  - CI: keep full secret scans on main
  - Docs: update secret scan reproduction steps
  - CI: scope secret scans to changed files
  - Media: reject spoofed input_image MIME payloads (openclaw#38289)
  - chore: code/dead tests cleanup (openclaw#38286)
  - Install Smoke: cache docker smoke builds
  - Install Smoke: allow reusing prebuilt test images
  - Install Smoke: shallow docs-scope checkout
  - CI: shallow scope checkouts
  - feat(onboarding): add web search to onboarding flow (openclaw#34009)
  - chore: disable contributor labels
  - ...
* Changelog: add channel-backed readiness probe entry
* Gateway: add channel-backed readiness probes
* Docs: describe readiness probe behavior
* Gateway: add readiness probe regression tests
* Changelog: dedupe gateway probe entries
* Docs: fix readiness startup grace description
* Changelog: remove stale readiness entry
* Gateway: cover readiness hardening
* Gateway: harden readiness probes

(cherry picked from commit ab5fcfc)
Summary
Describe the problem and fix in 2–5 bullets:
- `/ready` and `/readyz` already exist on `main`, but they still behave like shallow probe aliases instead of reflecting whether managed channels are actually ready to serve traffic.
- `/ready` and `/readyz` now return `200` or `503` based on managed channel state, cache readiness evaluations briefly to avoid repeated snapshot rebuilds, and only reveal detailed failing-channel payloads to local or authenticated callers.
- `/health` and `/healthz` remain shallow liveness probes; plugin route precedence and the root-mounted Control UI probe fix stay unchanged; no new dependencies or config knobs were added.

Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
- `/health` and `/healthz` can return Control UI HTML `200` instead of machine health payload #18446

User-visible / Behavior Changes
- `/ready` and `/readyz` now report managed channel readiness instead of a shallow static ready response.
- Probes return `503` when managed channels are still disconnected after grace, or disconnect later.
- Remote unauthenticated callers get only `{ ready }` plus the `200`/`503` status, while local or authenticated callers keep the detailed `{ ready, failing, uptimeMs }` payload.
- `/health` and `/healthz` remain unchanged shallow liveness probes.

Security Impact (required)
- No
- No
- No
- No
- No
- Yes, explain risk + mitigation: […]

Repro + Verification
Environment
Steps
- `GET /readyz` during startup grace, during a channel restart backoff window, and after a managed channel disconnect.
- `GET /readyz` from an unauthenticated remote client and from an authenticated remote client.
- `GET /healthz` in the same scenarios.

Expected
- `/ready` and `/readyz` reflect managed channel readiness with `200` or `503`.
- `/health` and `/healthz` stay shallow liveness probes.
Evidence
Attach at least one:
Focused verification:
- `pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/server-http.probe.test.ts src/gateway/server/readiness.test.ts src/gateway/server.plugin-http-auth.test.ts`
- `pnpm build`
`pnpm check` is still blocked on existing `main` errors in `extensions/feishu/src/media.ts` (`TS2353` on unsupported `timeout` properties). This PR does not touch that extension.

Human Verification (required)
What you personally verified (not just CI), and how:
`stale-socket` channels do not force readiness red; `/healthz` stays shallow even when readiness is red.

Compatibility / Migration
- Yes
- No
- No

Failure Recovery (if this breaks)
- Revert this PR to restore the previous `/ready` behavior.
- Touched files: `src/gateway/server-http.ts`, `src/gateway/server-runtime-state.ts`, `src/gateway/server.impl.ts`, `src/gateway/server/readiness.ts`, `src/gateway/server-channels.ts`, `src/gateway/channel-health-policy.ts`, the gateway probe/readiness tests, `docs/install/docker.md`, and `CHANGELOG.md`.
- Watch for `503` readiness responses on healthy managed channels after startup grace expires, or unexpected detailed readiness payloads returned to unauthenticated remote callers.

Risks and Mitigations