Skip to content

fix(cron): implement true LRU cache eviction#39706

Open
liangruochong44-ui wants to merge 2 commits intoopenclaw:mainfrom
liangruochong44-ui:fix/cron-cache-lru
Open

fix(cron): implement true LRU cache eviction#39706
liangruochong44-ui wants to merge 2 commits intoopenclaw:mainfrom
liangruochong44-ui:fix/cron-cache-lru

Conversation

@liangruochong44-ui
Copy link
Copy Markdown

Summary

The cronEvalCache was using FIFO eviction (first inserted = first evicted), but Map.get() doesn't reorder entries. This caused frequently accessed entries to be prematurely evicted.

Changes

  • On cache hit, delete and re-set the entry to move it to the end of Map iteration order, implementing true LRU behavior.

Testing

  • Existing cron tests pass

Closes #39679

lrc added 2 commits March 8, 2026 18:10
- Add WAL journal mode to memory SQLite database (fixes openclaw#36035)
  WAL mode is crash-safe and survives SIGTERM/SIGKILL mid-write
- Add transient SQLite error handling (fixes openclaw#34678)
  Classify SQLITE_CANTOPEN, SQLITE_BUSY, SQLITE_LOCKED, SQLITE_IOERR as
  non-fatal errors to prevent LaunchAgent crash loops
The cronEvalCache was using FIFO eviction (first inserted = first
evicted), but Map.get() doesn't reorder entries. This caused frequently
accessed entries to be prematurely evicted.

Fix: On cache hit, delete and re-set the entry to move it to the end
of Map iteration order, implementing true LRU behavior.

Fixes openclaw#39679
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR bundles three distinct changes under the cron LRU cache fix title: (1) the stated LRU fix in schedule.ts, (2) a new isTransientSqliteError handler in unhandled-rejections.ts, and (3) enabling WAL journal mode in manager-sync-ops.ts.

  • src/cron/schedule.ts — The LRU fix is correct. Delete-then-re-insert on a cache hit moves the entry to the tail of Map iteration order, ensuring the eviction path always removes the least-recently-used entry.
  • src/infra/unhandled-rejections.tsSQLITE_BUSY and SQLITE_LOCKED are legitimately transient, but SQLITE_CANTOPEN (missing file / bad permissions that won't self-heal) and SQLITE_IOERR (a broad parent code covering disk-level failures) are not reliably transient. Treating them as such means the gateway can silently continue running while its database is inaccessible, masking data-loss or corruption rather than crashing fast. The message-matching path also skips the case-normalization (.toUpperCase()) used by the analogous network-error checker, leaving lowercase variants undetected.
  • src/memory/manager-sync-ops.ts — WAL mode is a good addition; it is idempotent and improves crash-safety. The return value of the pragma is not checked, so a silent fallback on unsupported filesystems would go unnoticed, but this is a low-impact edge case.

Confidence Score: 2/5

  • The SQLite transient-error classification in unhandled-rejections.ts can cause the gateway to silently continue after non-recoverable database failures; needs tightening before merge.
  • The cron LRU fix and WAL mode change are both correct and safe. However, classifying SQLITE_CANTOPEN and SQLITE_IOERR as transient in the unhandled-rejection handler is a meaningful logic issue: these codes often represent permanent conditions, and suppressing them turns a crash-fast failure into a silent, hard-to-diagnose broken state. This lowers confidence significantly.
  • src/infra/unhandled-rejections.ts — specifically the TRANSIENT_SQLITE_CODES set and the isTransientSqliteError message-matching logic.

Last reviewed commit: 0a9c952

Comment on lines +72 to +76
"SQLITE_CANTOPEN",
"SQLITE_BUSY",
"SQLITE_LOCKED",
"SQLITE_IOERR",
]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLITE_CANTOPEN and SQLITE_IOERR are not truly transient

SQLITE_CANTOPEN typically indicates a permanent condition — the database file or its parent directory doesn't exist, or the process lacks permissions that won't fix themselves. Classifying it as transient means the gateway will silently continue running while unable to access its database, causing silent data-loss or reads returning stale data instead of crashing-fast.

SQLITE_IOERR is similarly dangerous: it's the parent code for disk-level I/O failures (SQLITE_IOERR_READ, SQLITE_IOERR_WRITE, SQLITE_IOERR_FSYNC, etc.) that can indicate disk corruption or hardware failure — not temporary contention. Swallowing it prevents the operator from knowing the database is in a bad state.

Consider narrowing the set to only codes that are genuinely retryable:

Suggested change
"SQLITE_CANTOPEN",
"SQLITE_BUSY",
"SQLITE_LOCKED",
"SQLITE_IOERR",
]);
const TRANSIENT_SQLITE_CODES = new Set([
"SQLITE_BUSY",
"SQLITE_LOCKED",
]);

If SQLITE_CANTOPEN or SQLITE_IOERR must be handled at all, they should at minimum not be suppressed silently — they warrant a distinct code-path that either retries with backoff or shuts down with a clear error.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/unhandled-rejections.ts
Line: 72-76

Comment:
**`SQLITE_CANTOPEN` and `SQLITE_IOERR` are not truly transient**

`SQLITE_CANTOPEN` typically indicates a permanent condition — the database file or its parent directory doesn't exist, or the process lacks permissions that won't fix themselves. Classifying it as transient means the gateway will silently continue running while unable to access its database, causing silent data-loss or reads returning stale data instead of crashing-fast.

`SQLITE_IOERR` is similarly dangerous: it's the parent code for disk-level I/O failures (`SQLITE_IOERR_READ`, `SQLITE_IOERR_WRITE`, `SQLITE_IOERR_FSYNC`, etc.) that can indicate disk corruption or hardware failure — not temporary contention. Swallowing it prevents the operator from knowing the database is in a bad state.

Consider narrowing the set to only codes that are genuinely retryable:

```suggestion
const TRANSIENT_SQLITE_CODES = new Set([
  "SQLITE_BUSY",
  "SQLITE_LOCKED",
]);
```

If `SQLITE_CANTOPEN` or `SQLITE_IOERR` must be handled at all, they should at minimum not be suppressed silently — they warrant a distinct code-path that either retries with backoff or shuts down with a clear error.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +234 to +238
const rawMessage = (candidate as { message?: unknown }).message;
const message = typeof rawMessage === "string" ? rawMessage : "";
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
return true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message check skips case normalization

Unlike isTransientNetworkError (line 186), which calls .toLowerCase().trim() before matching, the message here is used as-is and matched against uppercase literals. If a SQLite binding ever surfaces a lowercase variant (e.g. "sqlite_busy") or mixed-case message, these checks would silently miss it.

Suggested change
const rawMessage = (candidate as { message?: unknown }).message;
const message = typeof rawMessage === "string" ? rawMessage : "";
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
return true;
const message = typeof rawMessage === "string" ? rawMessage.toUpperCase() : "";
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/unhandled-rejections.ts
Line: 234-238

Comment:
**Message check skips case normalization**

Unlike `isTransientNetworkError` (line 186), which calls `.toLowerCase().trim()` before matching, the message here is used as-is and matched against uppercase literals. If a SQLite binding ever surfaces a lowercase variant (e.g. `"sqlite_busy"`) or mixed-case message, these checks would silently miss it.

```suggestion
    const message = typeof rawMessage === "string" ? rawMessage.toUpperCase() : "";
    if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
        message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a9c952f7b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

const db = new DatabaseSync(dbPath, { allowExtension: this.settings.store.vector.enabled });
// WAL mode is crash-safe and survives SIGTERM/SIGKILL mid-write.
// It also allows concurrent reads during writes.
db.exec("PRAGMA journal_mode = WAL");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Set busy_timeout before enabling WAL mode

Running PRAGMA journal_mode = WAL before PRAGMA busy_timeout = 5000 makes the first lock-sensitive statement execute with SQLite's default 0ms timeout, so a brief concurrent writer can cause immediate database is locked failures during DB open/reopen. This regresses the retry behavior described in the nearby comment ("retry instead of failing immediately"), and can make startup/reindexing flaky under normal multi-process contention.

Useful? React with 👍 / 👎.

Comment on lines +236 to +237
if (message.includes("SQLITE_CANTOPEN") || message.includes("SQLITE_BUSY") ||
message.includes("SQLITE_LOCKED") || message.includes("SQLITE_IOERR")) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match transient SQLite errors using normalized messages

The new SQLite suppression path only matches uppercase substrings like SQLITE_BUSY/SQLITE_LOCKED, but common runtime forms in this codebase are normalized lowercase or plain text such as database is locked (for example, qmd-manager already checks those forms). As written, many transient SQLite lock rejections will not be recognized and still trigger process.exit(1), so the intended non-fatal handling is incomplete.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Cron schedule cache eviction is FIFO, not LRU — hot entries get evicted prematurely

1 participant