fix: canonicalize session keys at write time (#29683)#30654
fix: canonicalize session keys at write time (#29683)#30654obviyus merged 9 commits intoopenclaw:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9ca2caa60f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Greptile SummaryThis PR fixes a critical session key canonicalization bug that caused orphaned session transcripts on every restart when the configured default agent differed from "main" or when The fix is well-architected with three complementary changes:
Key changes:
Test coverage: 14 new tests cover canonicalization behavior, migration scenarios, idempotency, and edge cases. Confidence Score: 5/5
Last reviewed commit: ce5b534 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7072d2a4c4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8b6e37554
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
We're hitting this exact bug on v2026.3.24 — Discord DM sessions are silently orphaned after gateway restarts. Just commented on #29683 with a detailed reproduction. Our case: Agent had an active DM session where a new extension was discussed and confirmed as trusted. Gateway restarted (config change → SIGUSR1), and the agent got a fresh DM session with zero context — treated the extension as suspicious prompt injection and refused to use it. The We confirmed This is causing real operational impact for us — we've worked around it with AGENTS.md documentation so agents can re-orient after restarts, but that's a band-aid. Would love to see this merged. Happy to help test if that's useful. |
353bf00 to
7cdf93a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b75b5551e1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@joshavant @jacobtomlinson @gumadeiras — requesting a review on this month-old PR. This fixes silent session data loss: DM sessions are orphaned on gateway restart when the configured default agent differs from "main". A user (@Kaspre) independently confirmed the bug in production on v2026.3.24 — see #29683. What changed since the original PR:
CI core checks pass. The failing channel extension tests are pre-existing on Greptile gave 5/5 confidence on the original implementation. @Kaspre confirmed real operational impact and offered to help test. |
|
Please don’t spam-ping multiple maintainers at once. Be patient, or join our community Discord for help: https://discord.gg/clawd |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df2154f3d6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
df2154f to
54b4d93
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54b4d938b9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1045e5f4f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…ions (openclaw#29683) resolveSessionKey() uses hardcoded DEFAULT_AGENT_ID="main", but all read paths canonicalize via cfg. When the configured default agent differs (e.g. "ops" with mainKey "work"), writes produce "agent:main:main" while reads look up "agent:ops:work", orphaning transcripts on every restart. Fix all three write-path call sites by wrapping with canonicalizeMainSessionAlias: - initSessionState (auto-reply/reply/session.ts) - runWebHeartbeatOnce (web/auto-reply/heartbeat-runner.ts) - resolveCronAgentSessionKey (cron/isolated-agent/session-key.ts) Add startup migration (migrateOrphanedSessionKeys) to rename existing orphaned keys to canonical form, merging by most-recent updatedAt.
…t key P1: migrateOrphanedSessionKeys now tracks agentId alongside each store path in a Map instead of inferring from the filesystem path. This correctly handles custom session.store templates outside the default agents/<id>/ layout. P2: Pass the already-canonicalized sessionKey to getSessionSnapshot so the heartbeat snapshot reads/restores use the same key as the write path.
migrateOrphanedSessionKeys runs before detectLegacyStateMigrations, so it can canonicalize legacy keys (e.g. "main" → "agent:main:main") before the legacy detector sees them. This caused the early return path to skip logging, breaking doctor-state-migrations tests that assert log.info was called. Extract logMigrationResults helper and call it at every return point.
P1: When session.store has no {agentId}, all agents resolve to the same
file. Track all agentIds per store path (Map<path, Set<id>>) and run
canonicalization once per agent. Skip cross-agent "agent:main:*"
remapping when "main" is a legitimate configured agent sharing the store,
to avoid merging its data into another agent's namespace.
P2: Use expandHomePrefix (environment-aware ~ resolution) instead of
os.homedir() in resolveStorePathFromTemplate, matching the runtime
resolveStorePath behavior for OPENCLAW_HOME/HOME overrides.
Only remap agent:main:* keys where the suffix is a main session alias
("main" or the configured mainKey). Other agent:main:* keys — hooks,
subagents, cron sessions, per-sender keys — may be intentional
cross-agent references and must not be silently moved into another
agent's namespace.
645f058 to
d3deacb
Compare
obviyus
left a comment
There was a problem hiding this comment.
Reviewed latest changes; landing now.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3deacba84
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return canonicalizeMainSessionAlias({ | ||
| cfg: params.cfg, | ||
| agentId: params.agentId, | ||
| sessionKey: raw, |
There was a problem hiding this comment.
Preserve explicit main-agent session keys in cron isolation
This canonicalization rewrites explicit keys like agent:main:main / agent:main:<mainKey> to the job agent, so isolated cron runs can silently ignore a user-specified session namespace. Fresh evidence: cron normalizes sessionTarget: "current" into session:<sessionKey> verbatim (src/cron/normalize.ts:457-473) and server-cron passes that custom key directly into runCronIsolatedAgentTurn (src/gateway/server-cron.ts:286-293), so these forms are not always orphan aliases. In setups with both a main agent and a different default agent, this remap can make jobs write/read the wrong session history.
Useful? React with 👍 / 👎.
…thomasxm) * fix: canonicalize session keys at write time to prevent orphaned sessions (openclaw#29683) resolveSessionKey() uses hardcoded DEFAULT_AGENT_ID="main", but all read paths canonicalize via cfg. When the configured default agent differs (e.g. "ops" with mainKey "work"), writes produce "agent:main:main" while reads look up "agent:ops:work", orphaning transcripts on every restart. Fix all three write-path call sites by wrapping with canonicalizeMainSessionAlias: - initSessionState (auto-reply/reply/session.ts) - runWebHeartbeatOnce (web/auto-reply/heartbeat-runner.ts) - resolveCronAgentSessionKey (cron/isolated-agent/session-key.ts) Add startup migration (migrateOrphanedSessionKeys) to rename existing orphaned keys to canonical form, merging by most-recent updatedAt. * fix: address review — track agent IDs in migration map, align snapshot key P1: migrateOrphanedSessionKeys now tracks agentId alongside each store path in a Map instead of inferring from the filesystem path. This correctly handles custom session.store templates outside the default agents/<id>/ layout. P2: Pass the already-canonicalized sessionKey to getSessionSnapshot so the heartbeat snapshot reads/restores use the same key as the write path. * fix: log migration results at all early return points migrateOrphanedSessionKeys runs before detectLegacyStateMigrations, so it can canonicalize legacy keys (e.g. "main" → "agent:main:main") before the legacy detector sees them. This caused the early return path to skip logging, breaking doctor-state-migrations tests that assert log.info was called. Extract logMigrationResults helper and call it at every return point. * fix: handle shared stores and ~ expansion in migration P1: When session.store has no {agentId}, all agents resolve to the same file. Track all agentIds per store path (Map<path, Set<id>>) and run canonicalization once per agent. Skip cross-agent "agent:main:*" remapping when "main" is a legitimate configured agent sharing the store, to avoid merging its data into another agent's namespace. P2: Use expandHomePrefix (environment-aware ~ resolution) instead of os.homedir() in resolveStorePathFromTemplate, matching the runtime resolveStorePath behavior for OPENCLAW_HOME/HOME overrides. * fix: narrow cross-agent remap to provable orphan aliases only Only remap agent:main:* keys where the suffix is a main session alias ("main" or the configured mainKey). Other agent:main:* keys — hooks, subagents, cron sessions, per-sender keys — may be intentional cross-agent references and must not be silently moved into another agent's namespace. * fix: run orphan-key session migration at gateway startup (openclaw#29683) * fix: canonicalize cross-agent legacy main aliases in session keys (openclaw#29683) * fix: guard shared-store migration against cross-agent legacy alias remap (openclaw#29683) * refactor: split session-key migration out of pr 30654 --------- Co-authored-by: Your Name <[email protected]> Co-authored-by: Ayaan Zaidi <[email protected]>
…thomasxm) * fix: canonicalize session keys at write time to prevent orphaned sessions (openclaw#29683) resolveSessionKey() uses hardcoded DEFAULT_AGENT_ID="main", but all read paths canonicalize via cfg. When the configured default agent differs (e.g. "ops" with mainKey "work"), writes produce "agent:main:main" while reads look up "agent:ops:work", orphaning transcripts on every restart. Fix all three write-path call sites by wrapping with canonicalizeMainSessionAlias: - initSessionState (auto-reply/reply/session.ts) - runWebHeartbeatOnce (web/auto-reply/heartbeat-runner.ts) - resolveCronAgentSessionKey (cron/isolated-agent/session-key.ts) Add startup migration (migrateOrphanedSessionKeys) to rename existing orphaned keys to canonical form, merging by most-recent updatedAt. * fix: address review — track agent IDs in migration map, align snapshot key P1: migrateOrphanedSessionKeys now tracks agentId alongside each store path in a Map instead of inferring from the filesystem path. This correctly handles custom session.store templates outside the default agents/<id>/ layout. P2: Pass the already-canonicalized sessionKey to getSessionSnapshot so the heartbeat snapshot reads/restores use the same key as the write path. * fix: log migration results at all early return points migrateOrphanedSessionKeys runs before detectLegacyStateMigrations, so it can canonicalize legacy keys (e.g. "main" → "agent:main:main") before the legacy detector sees them. This caused the early return path to skip logging, breaking doctor-state-migrations tests that assert log.info was called. Extract logMigrationResults helper and call it at every return point. * fix: handle shared stores and ~ expansion in migration P1: When session.store has no {agentId}, all agents resolve to the same file. Track all agentIds per store path (Map<path, Set<id>>) and run canonicalization once per agent. Skip cross-agent "agent:main:*" remapping when "main" is a legitimate configured agent sharing the store, to avoid merging its data into another agent's namespace. P2: Use expandHomePrefix (environment-aware ~ resolution) instead of os.homedir() in resolveStorePathFromTemplate, matching the runtime resolveStorePath behavior for OPENCLAW_HOME/HOME overrides. * fix: narrow cross-agent remap to provable orphan aliases only Only remap agent:main:* keys where the suffix is a main session alias ("main" or the configured mainKey). Other agent:main:* keys — hooks, subagents, cron sessions, per-sender keys — may be intentional cross-agent references and must not be silently moved into another agent's namespace. * fix: run orphan-key session migration at gateway startup (openclaw#29683) * fix: canonicalize cross-agent legacy main aliases in session keys (openclaw#29683) * fix: guard shared-store migration against cross-agent legacy alias remap (openclaw#29683) * refactor: split session-key migration out of pr 30654 --------- Co-authored-by: Your Name <[email protected]> Co-authored-by: Ayaan Zaidi <[email protected]>
…thomasxm) * fix: canonicalize session keys at write time to prevent orphaned sessions (openclaw#29683) resolveSessionKey() uses hardcoded DEFAULT_AGENT_ID="main", but all read paths canonicalize via cfg. When the configured default agent differs (e.g. "ops" with mainKey "work"), writes produce "agent:main:main" while reads look up "agent:ops:work", orphaning transcripts on every restart. Fix all three write-path call sites by wrapping with canonicalizeMainSessionAlias: - initSessionState (auto-reply/reply/session.ts) - runWebHeartbeatOnce (web/auto-reply/heartbeat-runner.ts) - resolveCronAgentSessionKey (cron/isolated-agent/session-key.ts) Add startup migration (migrateOrphanedSessionKeys) to rename existing orphaned keys to canonical form, merging by most-recent updatedAt. * fix: address review — track agent IDs in migration map, align snapshot key P1: migrateOrphanedSessionKeys now tracks agentId alongside each store path in a Map instead of inferring from the filesystem path. This correctly handles custom session.store templates outside the default agents/<id>/ layout. P2: Pass the already-canonicalized sessionKey to getSessionSnapshot so the heartbeat snapshot reads/restores use the same key as the write path. * fix: log migration results at all early return points migrateOrphanedSessionKeys runs before detectLegacyStateMigrations, so it can canonicalize legacy keys (e.g. "main" → "agent:main:main") before the legacy detector sees them. This caused the early return path to skip logging, breaking doctor-state-migrations tests that assert log.info was called. Extract logMigrationResults helper and call it at every return point. * fix: handle shared stores and ~ expansion in migration P1: When session.store has no {agentId}, all agents resolve to the same file. Track all agentIds per store path (Map<path, Set<id>>) and run canonicalization once per agent. Skip cross-agent "agent:main:*" remapping when "main" is a legitimate configured agent sharing the store, to avoid merging its data into another agent's namespace. P2: Use expandHomePrefix (environment-aware ~ resolution) instead of os.homedir() in resolveStorePathFromTemplate, matching the runtime resolveStorePath behavior for OPENCLAW_HOME/HOME overrides. * fix: narrow cross-agent remap to provable orphan aliases only Only remap agent:main:* keys where the suffix is a main session alias ("main" or the configured mainKey). Other agent:main:* keys — hooks, subagents, cron sessions, per-sender keys — may be intentional cross-agent references and must not be silently moved into another agent's namespace. * fix: run orphan-key session migration at gateway startup (openclaw#29683) * fix: canonicalize cross-agent legacy main aliases in session keys (openclaw#29683) * fix: guard shared-store migration against cross-agent legacy alias remap (openclaw#29683) * refactor: split session-key migration out of pr 30654 --------- Co-authored-by: Your Name <[email protected]> Co-authored-by: Ayaan Zaidi <[email protected]>
…vent tests Updates the assertion in the `routes /btw replies through side-result events` test block to expect the fully resolved `agent:main:main` session key. This resolves a pre-existing failing unit test on the main branch where the mock assertions were lagging behind recent production changes to the session-key identification format. Root Cause: PR openclaw#30654 (`fix: canonicalize session keys at write time` by @thomasxm) updated the `sessionKey` format emitted by the gateway to use fully qualified keys to prevent orphans, but missed updating this specific test mock.
Summary
resolveSessionKey()uses hardcodedDEFAULT_AGENT_ID="main"for session key construction, but all read paths canonicalize viaresolveSessionStoreKey()using live config. When the configured default agent differs from "main" (ormainKeydiffers from "main"), writes and reads produce different keys, orphaning session transcripts on every restart.resolveSessionKey()call sites withcanonicalizeMainSessionAlias()so written keys match what reads produce.migrateOrphanedSessionKeys()to scan agent session stores on startup and rename orphaned raw keys to canonical form, merging duplicates by most-recentupdatedAt.Write paths fixed
src/auto-reply/reply/session.tsinitSessionStatecanonicalizeMainSessionAliassrc/web/auto-reply/heartbeat-runner.tsrunWebHeartbeatOnceresolveSessionKey()outputsrc/cron/isolated-agent/session-key.tsresolveCronAgentSessionKeycfg, apply canonicalizationMigration details
migrateOrphanedSessionKeysruns unconditionally duringautoMigrateLegacyState:agent:main:prefix in non-"main" agent stores (cross-agent orphans)skipMaintenance: trueto avoid side effects during startupTest plan
resolveCronAgentSessionKey(4 new for canonicalization)migrateOrphanedSessionKeys(new file):Closes #29683
Related: #21850, #23615, #18572, #25373, #1663