-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix(session-recovery): add fallbacks for tool_result_missing and thinking_block_order #486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix(session-recovery): add fallbacks for tool_result_missing and thinking_block_order #486
Conversation
|
All contributors have signed the CLA. Thank you! ✅ |
Greptile SummaryThis PR enhances session recovery by adding fallback mechanisms for two critical error types: Key Changes:
Impact: Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant API as Claude API
participant Hook as Session Recovery Hook
participant Index as index.ts
participant Storage as storage.ts
API->>Hook: Error: tool_result_missing
Hook->>Index: recoverToolResultMissing(failedMsg, error)
Index->>Index: extractToolUseIds(failedMsg.parts)
alt parts empty
Index->>Index: extractToolUseIdsFromError(error)
Note over Index: NEW: Regex extracts toolu_* from error message
end
Index->>API: Send tool_result parts
API-->>Hook: Recovery success
API->>Hook: Error: thinking_block_order
Hook->>Index: recoverThinkingBlockOrder(failedMsg, error)
Index->>Index: extractMessageIndex(error)
Index->>Storage: findMessageByIndexNeedingThinking(targetIndex)
Storage->>Storage: Try offsets: 0, -1, +1, -2, +2, -3, -4, -5
Note over Storage: NEW: Multiple index offsets
alt message found
Storage-->>Index: return messageID
Index->>Storage: prependThinkingPart(messageID)
else not found
Index->>Index: Try failedMsg.info.id directly
Note over Index: NEW: Direct fallback
alt direct fallback works
Index->>Storage: prependThinkingPart(failedMsgId)
else still not found
Index->>Storage: findMessagesWithOrphanThinking()
Index->>Storage: prependThinkingPart(orphanID)
end
end
Storage-->>Index: Recovery success/failure
Index-->>Hook: Return result
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
src/hooks/session-recovery/storage.ts, line 374-383 (link)logic: trying offsets without positive offsets first could match wrong message
the order tries negative offsets (-1, -2, -3, -4, -5) before larger positive offsets (+1, +2), which means if the actual target is at
targetIndex + 3, this would incorrectly match an earlier message attargetIndex - 1or similar
55 files reviewed, 1 comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 issues found across 55 files
Confidence score: 3/5
- Writing untrusted PR titles to
$GITHUB_OUTPUTvia plainechoin.github/workflows/sisyphus-agent.ymlrisks newline/control-character injection and can corrupt downstream workflow steps. blocking.tscreates a Worker but never guaranteesworker.terminate()/port1.close()on exceptions, so leaked resources or hung workers are likely under failure paths.- Pay close attention to
.github/workflows/sisyphus-agent.yml,src/features/opencode-skill-loader/blocking.ts,src/features/skill-mcp-manager/manager.ts- output handling, cleanup, and idle-client management each need hardening before merge.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/claude-code-plugin-loader/loader.ts">
<violation number="1" location="src/features/claude-code-plugin-loader/loader.ts:467">
P2: Wrapping synchronous functions in `Promise.resolve()` provides no parallelization benefit. Since `loadPluginCommands`, `loadPluginSkillsAsCommands`, `loadPluginAgents`, and `loadPluginHooksConfigs` are all synchronous, they still execute sequentially when called as arguments. Consider keeping the original sequential pattern for the sync functions and only `await` the async `loadPluginMcpServers`.</violation>
</file>
<file name="src/features/context-injector/injector.ts">
<violation number="1" location="src/features/context-injector/injector.ts:129">
P2: Using `Date.now()` alone for ID generation can cause collisions if called multiple times within the same millisecond. Consider using `crypto.randomUUID()` or combining timestamp with a random component.</violation>
</file>
<file name="src/features/context-injector/injector.test.ts">
<violation number="1" location="src/features/context-injector/injector.test.ts:203">
P2: Test helper `createMockMessage` generates inconsistent mock data: `messageID` in parts doesn't match `info.id`. The `info.id` includes `Math.random()` while `messageID` uses a separate `Date.now()` call without it, causing a mismatch. Consider storing the generated ID in a variable and reusing it.</violation>
</file>
<file name="src/features/opencode-skill-loader/blocking.ts">
<violation number="1" location="src/features/opencode-skill-loader/blocking.ts:27">
P2: Missing try/finally for resource cleanup. If an exception occurs after the Worker is created (e.g., in `postMessage` or `receiveMessageOnPort`), `worker.terminate()` and `port1.close()` will never be called, causing a resource leak.</violation>
</file>
<file name="src/features/claude-code-command-loader/loader.ts">
<violation number="1" location="src/features/claude-code-command-loader/loader.ts:190">
P2: Template literal has extra leading spaces compared to sync version. Each content line has an unwanted leading space that will cause inconsistent command template output between `loadCommandsFromDir` and `loadCommandsFromDirAsync`.</violation>
</file>
<file name="src/shared/opencode-config-dir.test.ts">
<violation number="1" location="src/shared/opencode-config-dir.test.ts:221">
P2: This assertion is too weak and doesn't match the test name. The test claims to verify 'returns null when no config exists' but the assertion accepts both `null` and `string`, meaning it can never fail. Consider using a deterministic assertion that actually verifies the expected behavior.</violation>
</file>
<file name=".github/workflows/lint-workflows.yml">
<violation number="1" location=".github/workflows/lint-workflows.yml:19">
P2: Avoid downloading and executing remote scripts via `curl | bash` pattern. Consider using the official GitHub Action `rhysd/actionlint-action` instead, which is more secure and maintainable.</violation>
</file>
<file name="src/features/skill-mcp-manager/manager.ts">
<violation number="1" location="src/features/skill-mcp-manager/manager.ts:184">
P2: `disconnectAll()` should also clear `pendingConnections` to fully reset manager state and avoid returning stale in-flight connections.</violation>
<violation number="2" location="src/features/skill-mcp-manager/manager.ts:200">
P2: Handle the Promise returned by `cleanupIdleClients()` in the interval callback to avoid unhandled rejections (and consider preventing overlapping runs if cleanup can take a while).</violation>
</file>
<file name=".github/workflows/sisyphus-agent.yml">
<violation number="1" location=".github/workflows/sisyphus-agent.yml:322">
P1: `TITLE` is written to `$GITHUB_OUTPUT` with a plain `echo "name=$value"`, which is unsafe/brittle for untrusted text and can break or inject outputs if the title contains newlines/control characters. Use the delimiter (`<<EOF`) form (and quote `$GITHUB_OUTPUT`) when setting outputs.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| const [commands, skills, agents, mcpServers, hooksConfigs] = await Promise.all([ | ||
| Promise.resolve(loadPluginCommands(plugins)), | ||
| Promise.resolve(loadPluginSkillsAsCommands(plugins)), | ||
| Promise.resolve(loadPluginAgents(plugins)), | ||
| loadPluginMcpServers(plugins), | ||
| Promise.resolve(loadPluginHooksConfigs(plugins)), | ||
| ]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Wrapping synchronous functions in Promise.resolve() provides no parallelization benefit. Since loadPluginCommands, loadPluginSkillsAsCommands, loadPluginAgents, and loadPluginHooksConfigs are all synchronous, they still execute sequentially when called as arguments. Consider keeping the original sequential pattern for the sync functions and only await the async loadPluginMcpServers.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/claude-code-plugin-loader/loader.ts, line 467:
<comment>Wrapping synchronous functions in `Promise.resolve()` provides no parallelization benefit. Since `loadPluginCommands`, `loadPluginSkillsAsCommands`, `loadPluginAgents`, and `loadPluginHooksConfigs` are all synchronous, they still execute sequentially when called as arguments. Consider keeping the original sequential pattern for the sync functions and only `await` the async `loadPluginMcpServers`.</comment>
<file context>
@@ -464,11 +464,13 @@ export interface PluginComponentsResult {
- const agents = loadPluginAgents(plugins)
- const mcpServers = await loadPluginMcpServers(plugins)
- const hooksConfigs = loadPluginHooksConfigs(plugins)
+ const [commands, skills, agents, mcpServers, hooksConfigs] = await Promise.all([
+ Promise.resolve(loadPluginCommands(plugins)),
+ Promise.resolve(loadPluginSkillsAsCommands(plugins)),
</file context>
| const [commands, skills, agents, mcpServers, hooksConfigs] = await Promise.all([ | |
| Promise.resolve(loadPluginCommands(plugins)), | |
| Promise.resolve(loadPluginSkillsAsCommands(plugins)), | |
| Promise.resolve(loadPluginAgents(plugins)), | |
| loadPluginMcpServers(plugins), | |
| Promise.resolve(loadPluginHooksConfigs(plugins)), | |
| ]) | |
| const commands = loadPluginCommands(plugins) | |
| const skills = loadPluginSkillsAsCommands(plugins) | |
| const agents = loadPluginAgents(plugins) | |
| const mcpServers = await loadPluginMcpServers(plugins) | |
| const hooksConfigs = loadPluginHooksConfigs(plugins) |
| path?: { cwd?: string; root?: string } | ||
| } | ||
|
|
||
| const syntheticMessageId = `synthetic_ctx_${Date.now()}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Using Date.now() alone for ID generation can cause collisions if called multiple times within the same millisecond. Consider using crypto.randomUUID() or combining timestamp with a random component.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/context-injector/injector.ts, line 129:
<comment>Using `Date.now()` alone for ID generation can cause collisions if called multiple times within the same millisecond. Consider using `crypto.randomUUID()` or combining timestamp with a random component.</comment>
<file context>
@@ -0,0 +1,166 @@
+ path?: { cwd?: string; root?: string }
+ }
+
+ const syntheticMessageId = `synthetic_ctx_${Date.now()}`
+ const syntheticPartId = `synthetic_ctx_part_${Date.now()}`
+ const now = Date.now()
</file context>
| { | ||
| id: `part_${Date.now()}`, | ||
| sessionID, | ||
| messageID: `msg_${Date.now()}`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Test helper createMockMessage generates inconsistent mock data: messageID in parts doesn't match info.id. The info.id includes Math.random() while messageID uses a separate Date.now() call without it, causing a mismatch. Consider storing the generated ID in a variable and reusing it.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/context-injector/injector.test.ts, line 203:
<comment>Test helper `createMockMessage` generates inconsistent mock data: `messageID` in parts doesn't match `info.id`. The `info.id` includes `Math.random()` while `messageID` uses a separate `Date.now()` call without it, causing a mismatch. Consider storing the generated ID in a variable and reusing it.</comment>
<file context>
@@ -0,0 +1,292 @@
+ {
+ id: `part_${Date.now()}`,
+ sessionID,
+ messageID: `msg_${Date.now()}`,
+ type: "text" as const,
+ text,
</file context>
| const signal = new Int32Array(new SharedArrayBuffer(4)) | ||
| const { port1, port2 } = new MessageChannel() | ||
|
|
||
| const worker = new Worker(new URL("./discover-worker.ts", import.meta.url), { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Missing try/finally for resource cleanup. If an exception occurs after the Worker is created (e.g., in postMessage or receiveMessageOnPort), worker.terminate() and port1.close() will never be called, causing a resource leak.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/opencode-skill-loader/blocking.ts, line 27:
<comment>Missing try/finally for resource cleanup. If an exception occurs after the Worker is created (e.g., in `postMessage` or `receiveMessageOnPort`), `worker.terminate()` and `port1.close()` will never be called, causing a resource leak.</comment>
<file context>
@@ -0,0 +1,62 @@
+ const signal = new Int32Array(new SharedArrayBuffer(4))
+ const { port1, port2 } = new MessageChannel()
+
+ const worker = new Worker(new URL("./discover-worker.ts", import.meta.url), {
+ workerData: { signal }
+ })
</file context>
| ${body.trim()} | ||
| </command-instruction> | ||
| <user-request> | ||
| $ARGUMENTS | ||
| </user-request>` | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Template literal has extra leading spaces compared to sync version. Each content line has an unwanted leading space that will cause inconsistent command template output between loadCommandsFromDir and loadCommandsFromDirAsync.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/claude-code-command-loader/loader.ts, line 190:
<comment>Template literal has extra leading spaces compared to sync version. Each content line has an unwanted leading space that will cause inconsistent command template output between `loadCommandsFromDir` and `loadCommandsFromDirAsync`.</comment>
<file context>
@@ -129,3 +130,130 @@ export function loadOpencodeProjectCommands(): Record<string, CommandDefinition>
+ const { data, body } = parseFrontmatter<CommandFrontmatter>(content)
+
+ const wrappedTemplate = `<command-instruction>
+ ${body.trim()}
+ </command-instruction>
+
</file context>
| ${body.trim()} | |
| </command-instruction> | |
| <user-request> | |
| $ARGUMENTS | |
| </user-request>` | |
| ${body.trim()} | |
| </command-instruction> | |
| <user-request> | |
| $ARGUMENTS | |
| </user-request>` |
| const result = detectExistingConfigDir("opencode", "1.0.200") | ||
|
|
||
| // #then result is either null or a valid string path | ||
| expect(result === null || typeof result === "string").toBe(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: This assertion is too weak and doesn't match the test name. The test claims to verify 'returns null when no config exists' but the assertion accepts both null and string, meaning it can never fail. Consider using a deterministic assertion that actually verifies the expected behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/shared/opencode-config-dir.test.ts, line 221:
<comment>This assertion is too weak and doesn't match the test name. The test claims to verify 'returns null when no config exists' but the assertion accepts both `null` and `string`, meaning it can never fail. Consider using a deterministic assertion that actually verifies the expected behavior.</comment>
<file context>
@@ -0,0 +1,224 @@
+ const result = detectExistingConfigDir("opencode", "1.0.200")
+
+ // #then result is either null or a valid string path
+ expect(result === null || typeof result === "string").toBe(true)
+ })
+ })
</file context>
|
|
||
| - name: Install actionlint | ||
| run: | | ||
| bash <(curl -sSL https://raw.githubusercontent.com/rhysd/actionlint/v1.7.10/scripts/download-actionlint.bash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Avoid downloading and executing remote scripts via curl | bash pattern. Consider using the official GitHub Action rhysd/actionlint-action instead, which is more secure and maintainable.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/lint-workflows.yml, line 19:
<comment>Avoid downloading and executing remote scripts via `curl | bash` pattern. Consider using the official GitHub Action `rhysd/actionlint-action` instead, which is more secure and maintainable.</comment>
<file context>
@@ -0,0 +1,22 @@
+
+ - name: Install actionlint
+ run: |
+ bash <(curl -sSL https://raw.githubusercontent.com/rhysd/actionlint/v1.7.10/scripts/download-actionlint.bash)
+
+ - name: Run actionlint
</file context>
|
|
||
| async disconnectAll(): Promise<void> { | ||
| for (const [, managed] of this.clients.entries()) { | ||
| this.stopCleanupTimer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: disconnectAll() should also clear pendingConnections to fully reset manager state and avoid returning stale in-flight connections.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/skill-mcp-manager/manager.ts, line 184:
<comment>`disconnectAll()` should also clear `pendingConnections` to fully reset manager state and avoid returning stale in-flight connections.</comment>
<file context>
@@ -102,26 +164,64 @@ export class SkillMcpManager {
async disconnectAll(): Promise<void> {
- for (const [, managed] of this.clients.entries()) {
+ this.stopCleanupTimer()
+ const clients = Array.from(this.clients.values())
+ this.clients.clear()
</file context>
| this.stopCleanupTimer() | |
| this.stopCleanupTimer() | |
| this.pendingConnections.clear() |
| private startCleanupTimer(): void { | ||
| if (this.cleanupInterval) return | ||
| this.cleanupInterval = setInterval(() => { | ||
| this.cleanupIdleClients() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Handle the Promise returned by cleanupIdleClients() in the interval callback to avoid unhandled rejections (and consider preventing overlapping runs if cleanup can take a while).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/skill-mcp-manager/manager.ts, line 200:
<comment>Handle the Promise returned by `cleanupIdleClients()` in the interval callback to avoid unhandled rejections (and consider preventing overlapping runs if cleanup can take a while).</comment>
<file context>
@@ -102,26 +164,64 @@ export class SkillMcpManager {
+ private startCleanupTimer(): void {
+ if (this.cleanupInterval) return
+ this.cleanupInterval = setInterval(() => {
+ this.cleanupIdleClients()
+ }, 60_000)
+ this.cleanupInterval.unref()
</file context>
| this.cleanupIdleClients() | |
| void this.cleanupIdleClients().catch(() => {}) |
| echo "type=issue" >> $GITHUB_OUTPUT | ||
| echo "number=${ISSUE_NUM}" >> $GITHUB_OUTPUT | ||
| fi | ||
| echo "title=${TITLE}" >> $GITHUB_OUTPUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: TITLE is written to $GITHUB_OUTPUT with a plain echo "name=$value", which is unsafe/brittle for untrusted text and can break or inject outputs if the title contains newlines/control characters. Use the delimiter (<<EOF) form (and quote $GITHUB_OUTPUT) when setting outputs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/sisyphus-agent.yml, line 322:
<comment>`TITLE` is written to `$GITHUB_OUTPUT` with a plain `echo "name=$value"`, which is unsafe/brittle for untrusted text and can break or inject outputs if the title contains newlines/control characters. Use the delimiter (`<<EOF`) form (and quote `$GITHUB_OUTPUT`) when setting outputs.</comment>
<file context>
@@ -244,14 +309,17 @@ jobs:
echo "type=issue" >> $GITHUB_OUTPUT
echo "number=${ISSUE_NUM}" >> $GITHUB_OUTPUT
fi
+ echo "title=${TITLE}" >> $GITHUB_OUTPUT
fi
</file context>
| echo "title=${TITLE}" >> $GITHUB_OUTPUT | |
| { | |
| echo "title<<EOF" | |
| echo "$TITLE" | |
| echo "EOF" | |
| } >> "$GITHUB_OUTPUT" |
junhoyeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@starcomo thanks for opening the PR.
Right now the commit set in this PR is pretty messy and includes unrelated history (it looks like it pulled in existing work). Could you rebase this branch onto the latest master and force-push so the PR contains only your changes?
A good PR is one that introduces only the changes you’re actually making, with a clean and reviewable diff.
…king_block_order - Add error parameter to recoverToolResultMissing and extract tool_use IDs from error message as fallback - Add multiple index offsets in findMessageByIndexNeedingThinking to handle API/storage index mismatch - Add failedAssistantMsg.info.id as fallback in recoverThinkingBlockOrder Fixes code-yeongyu#483
aa1b8bd to
c6feac9
Compare
|
I have read the CLA Document and I hereby sign the CLA |
|
@sisyphus-dev-ai review ulw |
PR #486 Technical Review - ULTRAWORK MODESummaryThis PR adds critical fallback mechanisms for session recovery, addressing issue #483. The changes are focused and well-intentioned, but there are several concerns that need attention before merge. Critical Issues1.
|
Summary
recoverToolResultMissingand extract tool_use IDs from error message as fallbackfindMessageByIndexNeedingThinkingto handle API/storage index mismatchfailedAssistantMsg.info.idas fallback inrecoverThinkingBlockOrderProblem
Session recovery was failing in certain scenarios:
1. tool_result_missing error
When
failedAssistantMsg.partsis empty and storage also has no parts, recovery fails even though the tool_use ID is present in the error message itself.2. thinking_block_order error
API message index differs from storage index (due to system messages), causing
findMessageByIndexNeedingThinkingto return null.Solution
extractToolUseIdsFromError(error)as fallback to extract IDs directly from error messagefindMessageByIndexNeedingThinkingfailedAssistantMsg.info.iddirectly as additional fallbackTest plan
Fixes #483
Summary by cubic
Improves session recovery by adding robust fallbacks for tool_result_missing and thinking_block_order errors, so recovery works even with index mismatches or empty parts. This makes recovery more reliable across varied API/storage states.
Written for commit c6feac9. Summary will update on new commits.