Skip to content

feat(plugins): add SQLite plugin state store#74190

Merged
amknight merged 6 commits intomainfrom
ak/sqlite-plugin-state-store
Apr 29, 2026
Merged

feat(plugins): add SQLite plugin state store#74190
amknight merged 6 commits intomainfrom
ak/sqlite-plugin-state-store

Conversation

@amknight
Copy link
Copy Markdown
Member

@amknight amknight commented Apr 29, 2026

Summary

  • Add a SQLite-backed plugin state store exposed as api.runtime.state.openKeyedStore for bundled plugins.
  • Add automatic plugin-id scoping, TTL expiry, namespace eviction, per-plugin live-row caps, JSON/value-size validation, diagnostics, and gateway shutdown cleanup.
  • Document the runtime API and add an agentic validation plan for follow-up E2E/load/cross-platform testing.
  • Scope boundary: this PR ships the storage/runtime foundation only; consumer migrations are planned as separate owner-scoped PRs.

Planned follow-up consumers

This PR ships the SQLite-backed runtime store foundation. Consumer migrations should be separate PRs so each plugin can preserve its current semantics.

Suggested order:

  1. Slack thread participation — first dogfood candidate. Related context: feat(slack): track thread participation for auto-reply without @mention #29165, fix(slack): persist thread participation cache to survive gateway restarts [claude, human developer oversight] #33845, feat(slack): persistent thread participation + bounded fallback #58731. Migrate the durable thread-participation cache to a namespace such as slack.thread-participation, keyed by ${accountId}:${channelId}:${threadTs}, storing { agentId, repliedAt } with the Slack TTL/cap policy. The main migration decision is that recordSlackThreadParticipation is currently synchronous, so the follow-up should either make that path async or keep the in-memory hot path and fire-and-forget the durable write. Keep persistence separate from [Bug]: Slack requireMention: true is bypassed for thread replies after prior bot participation #64277 / thread.requireExplicitMention policy semantics.
  2. Discord component and modal registries — related context: Discord: persist component registry to disk across gateway restarts #66241. Migrate the versioned JSON persistence, TTL pruning, caps, validation, atomic writes, and consume-on-resolve behavior into keyed-store namespaces such as discord.components and discord.modals. Preserve the current peek-vs-consume behavior and persist only the safe public registry shape.
  3. MS Teams sent-message cacheextensions/msteams/src/sent-message-cache.ts and extensions/msteams/src/monitor-handler/message-handler.ts currently keep a 24h process-global map of bot-sent message ids so replies can be classified as reply_to_bot. Restart drops that state, unlike Telegram's analogous file-backed sent-message cache. Migrate to a namespace such as msteams.sent-messages, keyed by ${conversationId}:${messageId}, with the same TTL.
  4. Matrix approval reaction targetsextensions/matrix/src/approval-reactions.ts, extensions/matrix/src/approval-handler.runtime.ts, and extensions/matrix/src/matrix/monitor/reaction-events.ts currently keep roomId:eventId -> { approvalId, allowedDecisions } in memory. Restart can leave a visible approval prompt whose reactions no longer resolve. Migrate the narrow reaction-target metadata to a Matrix approval namespace without broadening the stored approval payload.
  5. BlueBubbles inbound dedupe — related context: fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66816 and BlueBubbles/catchup: per-message retry cap for wedged messages (#66870) #67426. This is the same durable-keyed-state class and highlights lost-update race risk, but it should not be directly rewritten onto the v1 register / lookup / consume API because its dedupe contract has claim/commit/release and same-process memory semantics. Follow-up should either share this store's path/locking/atomic hardening or add a keyed-store-level atomic claim/check primitive first.
  6. Feishu persistent dedupe — related context: fix(feishu): persist dedup cache across gateway restarts via warmup #31605. Same restart-safe dedupe need as BlueBubbles, but direct migration should wait until warmup and memory-fast-path behavior can be preserved without weakening dedupe atomicity.

Validation

New unit and end to end tests. Full automated blitz, stress testing the solution - see findings in below comment.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime size: XL maintainer Maintainer-authored PR labels Apr 29, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs maintainer review before merge.

What this changes:

The PR branch adds a SQLite-backed keyed plugin state store exposed as api.runtime.state.openKeyedStore, wires gateway shutdown and maintenance cleanup, updates SDK docs/baselines/changelog, and adds plugin-state unit, E2E, and permission tests.

Maintainer follow-up before merge:

Keep this PR open for explicit maintainer and security/runtime-storage review. The best path is to decide whether the bundled-only keyed-store API is the right plugin boundary, resolve or consciously accept the SQLite event-loop and permission-hardening tradeoffs, verify generated SDK docs/baseline and CI proof, and keep consumer migrations in separate owner-scoped PRs.

Best possible solution:

Keep this PR open for explicit maintainer and security/runtime-storage review. The best path is to decide whether the bundled-only keyed-store API is the right plugin boundary, resolve or consciously accept the SQLite event-loop and permission-hardening tradeoffs, verify generated SDK docs/baseline and CI proof, and keep consumer migrations in separate owner-scoped PRs.

Acceptance criteria:

  • pnpm test src/plugin-state/plugin-state-store.test.ts src/plugin-state/plugin-state-store.e2e.test.ts src/plugin-state/plugin-state-store.permissions.test.ts
  • pnpm plugin-sdk:api:gen/check
  • pnpm config:docs:gen/check
  • pnpm check:changed in Testbox before merge

What I checked:

  • Protected PR metadata: Provided GitHub context shows this PR is open, unmerged, ready for review, and labeled maintainer, docs, gateway, and size: XL; protected maintainer labeling requires explicit maintainer handling rather than auto-close. (1f320e3e0aa5)
  • Current runtime type has no keyed store: Current PluginRuntimeCore.state exposes only resolveStateDir; there is no openKeyedStore member on main. (src/plugins/runtime/types-core.ts:228, 443ca4865d61)
  • Current runtime factory only wires state dir resolution: createPluginRuntime currently constructs state: { resolveStateDir }, so main does not create the proposed runtime keyed-store API. (src/plugins/runtime/index.ts:229, 443ca4865d61)
  • Current registry proxy has no state-store binding: The runtime proxy currently only special-cases subagent; it does not bind plugin ids or gate bundled-only access for state.openKeyedStore. (src/plugins/registry.ts:1955, 443ca4865d61)
  • Current docs document only state directory resolution: The public SDK runtime docs describe api.runtime.state as state directory resolution and show only api.runtime.state.resolveStateDir(). Public docs: docs/plugins/sdk-runtime.md. (docs/plugins/sdk-runtime.md:396, 443ca4865d61)
  • No plugin-state implementation on current main: Search found no openKeyedStore, PluginStateKeyedStore, PluginStateStore, plugin-state-store, sweepExpiredPluginStateEntries, closePluginStateSqliteStore, or src/plugin-state directory on current main. (443ca4865d61)

Likely related people:

  • steipete: Prior ClawSweeper context for this same PR traced deeper current-main blame for PluginRuntimeCore.state, createPluginRuntime, the registry runtime proxy region, and SDK runtime docs to Peter Steinberger; local history also shows recent plugin-startup policy work by Peter near the affected plugin startup surface. (role: recent current-main maintainer and likely reviewer; confidence: medium; commits: 3b10b8cf74e3, 390a7598c952; files: src/plugins/runtime/types-core.ts, src/plugins/runtime/index.ts, src/plugins/registry.ts)
  • vincentkoc: Prior ClawSweeper context linked Vincent Koc to recent adjacent work in src/plugins/registry.ts and gateway/plugin-loader paths; local recent history also shows plugin contract/auth-adjacent maintenance under this author. (role: recent adjacent registry and gateway maintainer; confidence: medium; commits: 1d61862adb0d, ad2516b1, 56d2749b; files: src/plugins/registry.ts, src/gateway/server-methods/agent.ts, src/plugins/loader.test.ts)

Remaining risk / open question:

  • The PR adds a new SQLite-backed runtime storage surface with WAL behavior, file permission hardening, TTL eviction, plugin isolation, shutdown cleanup, and synchronous filesystem/database calls on gateway paths; that needs explicit runtime-storage and security review.
  • The API is intentionally bundled-plugin-only, which intersects with the plugin boundary rule against private bundled-only backdoors unless maintainers deliberately accept that contract.
  • The PR body lists several consumer migrations as follow-ups; merging or closing this PR should not accidentally settle Slack, Discord, MS Teams, Matrix, BlueBubbles, or Feishu owner-specific behavior.
  • Current main does not implement the proposed API, so closing as implemented would be incorrect.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 443ca4865d61.

@amknight
Copy link
Copy Markdown
Member Author

Validation blitz — plugin state store E2E results

Five standalone scripts exercised the plugin state store end-to-end against real SQLite on real filesystem. Every script used an isolated OPENCLAW_STATE_DIR temp directory. 109 checks total, 109 passed, 0 failed.

1. CRUD + Persistence — 25/25 ✅

Area Checks Highlights
CRUD basics 11 register→lookup, upsert overwrites, delete true/false, clear scoped, entries ordered, consume-then-gone, double consume
Persistence across close/reopen 2 50 mixed-type entries survive close/reopen; 60 survive a second cycle
Multiple close/reopen cycles 3 Write→close→reopen→write→close→reopen all data intact; double close is idempotent
Value type fidelity 9 Nested objects, arrays, null fields, booleans, 0/negative/float, empty {}, empty [], empty string field — all roundtrip exactly

2. TTL + Sweep — 18/18 ✅

Area Checks Highlights
defaultTtlMs 3 Expires after TTL; entries() empty; no-TTL entry persists
Per-call ttlMs override 3 Short override beats long default; long override beats short default; no-default + per-call works
Sweep 5 Correct sweep count, entries empty after, idempotent re-sweep, mixed TTL/no-TTL partitions correctly
Consume + TTL 3 Expired consume returns undefined; live consume works; double-consume undefined
TTL + eviction interaction 1 Expired slots freed for new writes after pruning
Entries ordering with TTL 3 created_at order preserved; expired entries excluded

Note: sweepExpiredPluginStateEntries() is global across all plugins/namespaces — leftover expired rows from earlier test phases are included in sweep counts. Isolation requires either fresh state dirs or draining stale rows first.

3. Isolation + Multi-plugin — 24/24 ✅

Area Checks Highlights
Basic plugin segregation 4 Same namespace+key, different plugins — each sees only its own
Cross-plugin delete 3 Deleting plugin A's key leaves plugin B's intact
Clear scoped to plugin+namespace 2 Clear plugin A → plugin B's 10 entries untouched
Consume scoped to plugin 4 Consuming A's token leaves B's token intact
Namespace isolation within same plugin 4 Same key in ns-a vs ns-b — independent; clear ns-a leaves ns-b
20 plugins, same namespace/key 2 All 20 isolated; 5 deleted, remaining 15 intact
Per-plugin entries ordering 2 Each plugin's entries() returns its own created_at order
Per-plugin row cap independence 3 "heavy" plugin at 500 rows doesn't affect "light" plugin's writes

4. Limits + Eviction — 22/22 ✅

Area Checks Highlights
Value size boundary 6 65,534 chars (exactly 64KB JSON) accepted + roundtripped; 65,535 chars rejected; complex object boundary; tiny values
Namespace eviction 5 Oldest evicted on overflow; bulk overflow keeps newest N; eviction by created_at ASC, entry_key ASC; upsert refreshes created_at
TTL + eviction interaction 1 Expired entries pruned before counting — new writes succeed
Per-plugin 1000-row cap 4 1000 rows fills exactly; 1001 rejected; existing data intact after rejection; delete-one-then-write succeeds
Namespace eviction + plugin cap 3 At 1000: cross-namespace write rejected, same-namespace write with eviction succeeds
maxEntries=1 3 Each write evicts the previous; chain eviction works

5. Stress + WAL — 10/10 ✅

Check Result Performance
1000 rapid sequential writes 920 ops/s, all 1000 readable
1000 mixed operations 500 writes + 200 reads + 200 overwrites + 100 consumes, 0 errors, 717ms
10-plugin concurrent writes 500 entries, full isolation verified
TTL churn under volume 600 writes, sweep, no errors
WAL file size 3.93 MB (well under 128 MB threshold)
Close/reopen under load 3 cycles, 300 entries, all readable
Probe after stress All 6 probe steps ok
Filesystem permissions DB file 0o600, directory 0o700
Large entries (50 × ~58KB) All roundtripped
999-entry listing entries() returned in 0.83ms

Summary

Domain Checks Result
CRUD + Persistence 25
TTL + Sweep 18
Isolation + Multi-plugin 24
Limits + Eviction 22
Stress + WAL 10
Total 109 All pass

No bugs found. The store is solid across all tested dimensions — correctness, isolation, limits, persistence, TTL semantics, eviction ordering, error handling, and stress.

@amknight amknight marked this pull request as ready for review April 29, 2026 10:06
@amknight amknight force-pushed the ak/sqlite-plugin-state-store branch from aadee11 to 56adb0d Compare April 29, 2026 10:08
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR introduces a SQLite-backed keyed-state store for bundled plugins, exposed as api.runtime.state.openKeyedStore, with TTL expiry, per-namespace entry caps, a per-plugin hard cap of 1,000 live rows, JSON value validation, WAL maintenance, and gateway shutdown cleanup. The implementation is well-structured and the core transaction logic (BEGIN IMMEDIATE, rollback on error, namespace eviction before plugin-cap check) is correct.

Confidence Score: 4/5

Safe to merge; all findings are non-blocking style/consistency issues with no correctness impact.

Only P2-level findings: two write operations (delete, clear) silently skip the permission-hardening path that register/consume use, a redundant ensureSchema call in the probe function, and a minor allocation pattern in the Proxy state handler. No logic bugs or data-loss risks identified.

src/plugin-state/plugin-state-store.sqlite.ts — the pluginStateDelete and pluginStateClear functions should be reviewed for permission-hardening consistency.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/plugin-state/plugin-state-store.sqlite.ts
Line: 477-494

Comment:
**`delete` and `clear` skip permission hardening**

`pluginStateDelete` and `pluginStateClear` (line 518) call `openPluginStateDatabase` directly and issue writes without going through `runWriteTransaction`. This means they never call `ensurePluginStatePermissions`, which is the step that tightens file permissions on `0o600` after every write. `register` and `consume` always get permission hardening via `runWriteTransaction`; these two silently skip it. If a file ends up world-readable after an external process touches it, a `delete` or `clear` will leave it that way whereas a `register` would re-tighten it.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/plugin-state/plugin-state-store.sqlite.ts
Line: 596-600

Comment:
**Redundant `ensureSchema` call in probe**

`openPluginStateDatabase` already calls `ensureSchema` internally (line 319). Calling it again on line 599 inside `probePluginStateStore` is harmless because `CREATE TABLE IF NOT EXISTS` is idempotent, but it's dead work and may confuse readers into thinking the probe step does additional schema validation. Consider removing this second call or renaming the probe step to `"open-and-schema"` if you want a unified step name.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/plugins/registry.ts
Line: 1949-1975

Comment:
**`runtime.state` allocates a new object on every property access**

The Proxy's `get` trap for `"state"` spreads `baseState` and creates a fresh `openKeyedStore` closure each time `runtime.state` is accessed. The proxy itself is cached per `pluginId` in `pluginRuntimeById`, but the returned state object isn't. For plugins that access `runtime.state` in registration callbacks this is harmless, but it's easy to inadvertently call `openKeyedStore` in a hot path (e.g., inside a hook handler that accesses `api.runtime.state` on every event). Memoizing the state object inside the proxy closure — as is already done for `subagent` via `defineCachedValue` — would make this consistent with the rest of the runtime API.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(plugins): keep committed state write..." | Re-trigger Greptile

Comment on lines +477 to +494
export function pluginStateDelete(params: {
pluginId: string;
namespace: string;
key: string;
}): boolean {
try {
const { statements } = openPluginStateDatabase("delete");
const result = statements.deleteEntry.run(params.pluginId, params.namespace, params.key);
return result.changes > 0;
} catch (error) {
throw wrapPluginStateError(
error,
"delete",
"PLUGIN_STATE_WRITE_FAILED",
"Failed to delete plugin state entry.",
);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 delete and clear skip permission hardening

pluginStateDelete and pluginStateClear (line 518) call openPluginStateDatabase directly and issue writes without going through runWriteTransaction. This means they never call ensurePluginStatePermissions, which is the step that tightens file permissions on 0o600 after every write. register and consume always get permission hardening via runWriteTransaction; these two silently skip it. If a file ends up world-readable after an external process touches it, a delete or clear will leave it that way whereas a register would re-tighten it.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/plugin-state/plugin-state-store.sqlite.ts
Line: 477-494

Comment:
**`delete` and `clear` skip permission hardening**

`pluginStateDelete` and `pluginStateClear` (line 518) call `openPluginStateDatabase` directly and issue writes without going through `runWriteTransaction`. This means they never call `ensurePluginStatePermissions`, which is the step that tightens file permissions on `0o600` after every write. `register` and `consume` always get permission hardening via `runWriteTransaction`; these two silently skip it. If a file ends up world-readable after an external process touches it, a `delete` or `clear` will leave it that way whereas a `register` would re-tighten it.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +596 to +600
try {
const store = openPluginStateDatabase("probe");
pushOk("open");
ensureSchema(store.db, store.path);
pushOk("schema");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant ensureSchema call in probe

openPluginStateDatabase already calls ensureSchema internally (line 319). Calling it again on line 599 inside probePluginStateStore is harmless because CREATE TABLE IF NOT EXISTS is idempotent, but it's dead work and may confuse readers into thinking the probe step does additional schema validation. Consider removing this second call or renaming the probe step to "open-and-schema" if you want a unified step name.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/plugin-state/plugin-state-store.sqlite.ts
Line: 596-600

Comment:
**Redundant `ensureSchema` call in probe**

`openPluginStateDatabase` already calls `ensureSchema` internally (line 319). Calling it again on line 599 inside `probePluginStateStore` is harmless because `CREATE TABLE IF NOT EXISTS` is idempotent, but it's dead work and may confuse readers into thinking the probe step does additional schema validation. Consider removing this second call or renaming the probe step to `"open-and-schema"` if you want a unified step name.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread src/plugins/registry.ts
Comment on lines 1949 to +1975
@@ -1952,10 +1958,27 @@ export function createPluginRegistry(registryParams: PluginRegistryParams) {
}
const runtime = new Proxy(registryParams.runtime, {
get(target, prop, receiver) {
if (prop === "state") {
const baseState = Reflect.get(target, prop, receiver) as PluginRuntime["state"];
return {
...baseState,
openKeyedStore: <T>(options: OpenKeyedStoreOptions): PluginStateKeyedStore<T> => {
const record =
pluginRuntimeRecordById.get(pluginId) ??
registry.plugins.find((entry) => entry.id === pluginId);
if (record?.origin !== "bundled") {
throw new Error(
"openKeyedStore is only available for bundled plugins in this release.",
);
}
return createPluginStateKeyedStore<T>(pluginId, options);
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 runtime.state allocates a new object on every property access

The Proxy's get trap for "state" spreads baseState and creates a fresh openKeyedStore closure each time runtime.state is accessed. The proxy itself is cached per pluginId in pluginRuntimeById, but the returned state object isn't. For plugins that access runtime.state in registration callbacks this is harmless, but it's easy to inadvertently call openKeyedStore in a hot path (e.g., inside a hook handler that accesses api.runtime.state on every event). Memoizing the state object inside the proxy closure — as is already done for subagent via defineCachedValue — would make this consistent with the rest of the runtime API.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/plugins/registry.ts
Line: 1949-1975

Comment:
**`runtime.state` allocates a new object on every property access**

The Proxy's `get` trap for `"state"` spreads `baseState` and creates a fresh `openKeyedStore` closure each time `runtime.state` is accessed. The proxy itself is cached per `pluginId` in `pluginRuntimeById`, but the returned state object isn't. For plugins that access `runtime.state` in registration callbacks this is harmless, but it's easy to inadvertently call `openKeyedStore` in a hot path (e.g., inside a hook handler that accesses `api.runtime.state` on every event). Memoizing the state object inside the proxy closure — as is already done for `subagent` via `defineCachedValue` — would make this consistent with the rest of the runtime API.

How can I resolve this? If you propose a fix, please make it concise.

@amknight
Copy link
Copy Markdown
Member Author

Validation blitz — round 2: robustness, contention, memory, and proxy path

Five additional standalone scripts targeting gaps identified after the first 109-check round. All run against real SQLite on real filesystem with isolated state dirs. 121 additional checks, 121 passed, 0 failed.


1. Multi-process contention — 8/8 ✅

6 child processes, each doing 250 mixed ops (1,500 total) against the same SQLite database file simultaneously.

Check Result
All 6 children completed without crash
Zero SQLITE_BUSY escapes
Zero errors across 1,500 ops
All 1,500 ops completed
Data isolation (child self-report)
Parent cross-verification of isolation
PRAGMA integrity_check after concurrent writes
No duplicate keys across workers

WAL mode + busy_timeout=5000 handled all contention. No lock escapes under 6-process concurrent load.


2. Close during in-flight writes — 38/38 ✅

Pattern Cycles What it proves
Burst close (150 async writes, close after 50 dispatched) 3 No crash, clean errors on post-close writes, pre-close data intact, PRAGMA integrity_check ok
Immediate close + reopen 2 DB auto-reopens, both pre- and post-close entries fully readable
Rapid flap (write + close × 50) 50 Zero errors, integrity clean, all 50 entries survive
Timer-based close (setTimeout(0) between microtask batches) 1 No crash, integrity clean, entries readable
Global: zero unhandled rejections

3. Database integrity + WAL checkpoint — 30/30 ✅

Test What it proves
2,000 mixed ops (793 writes, 376 reads, 324 consumes, 341 deletes, 166 clears) PRAGMA integrity_check = ok while DB is open
WAL checkpoint on close WAL file absent after resetPluginStateStoreForTests() — TRUNCATE checkpoint ran
Reopen integrity + data correctness Integrity clean on reopened DB, all surviving entries structurally valid
Abrupt exit (no close) Subprocess writes 501 entries and calls process.exit(0) without closing. WAL file left behind (~4MB). Parent reopens → WAL replayed automatically → integrity clean → sentinel value readable → all 501 entries valid. Clean close removes WAL.
Heavy write burst (999 rapid writes) Integrity clean, all entries survive with correct values
5 close/reopen cycles Integrity and WAL cleanliness verified after each cycle, data from all cycles survives

4. RSS memory growth — 2/2 ✅

Metric Value
30,000 mixed ops
Initial RSS 92.50 MB
Steady state (after 2k warmup) 96.53 MB
Final RSS 107.91 MB
Growth factor 1.12× ✅ (threshold: < 2×)
100 distinct stores (plugins × namespaces)
Before RSS 107.89 MB
Final RSS 108.00 MB
Growth factor 1.0004× ✅ (essentially flat)

No memory leak detected. RSS growth is modest and sublinear across 30,000 operations and 100 distinct store instances.


5. Runtime proxy path + bundled-only gate — 15/15 ✅

Check What it proves
Bundled plugin: register, lookup, consume via openKeyedStore Real proxy path works end-to-end
Non-bundled (workspace) plugin: openKeyedStore throws Bundled-only gate enforced
Two bundled plugins, same namespace + key: isolated Proxy auto-binds plugin ID correctly
Delete plugin A's key → plugin B unaffected Cross-plugin isolation through proxy
Proxy passthrough: version, state.resolveStateDir, config.current, nodes.list, logging.shouldLogVerbose, agent.defaults Proxy doesn't interfere with non-state properties
Base runtime openKeyedStore throws sentinel Confirms proxy is the only valid access path

Combined totals (round 1 + round 2)

Domain Checks Result
CRUD + Persistence 25
TTL + Sweep 18
Isolation + Multi-plugin 24
Limits + Eviction 22
Single-process stress + WAL 10
Multi-process contention 8
Close during in-flight writes 38
Database integrity + WAL checkpoint 30
RSS memory growth 2
Runtime proxy path + bundled gate 15
Input validation + errors + edge cases 44
Total 236 All pass

No bugs found across 236 checks covering correctness, isolation, limits, persistence, TTL, eviction, proxy binding, multi-process contention, close safety, database integrity, WAL checkpointing, memory growth, and stress.

@amknight
Copy link
Copy Markdown
Member Author

Review feedback response

Addressed findings from Aisle, Greptile, and self-review. All changes are local — not yet pushed.

Fixed

Finding Fix
Aisle #1 🟡 Unbounded key/namespace length (CWE-400) Added MAX_KEY_BYTES = 512 and MAX_NAMESPACE_BYTES = 128 byte-length validation before writes.
Aisle #2 🔵 Unbounded JSON recursion depth (CWE-400) Added MAX_JSON_DEPTH = 64 to assertPlainJsonValue; rejects with PLUGIN_STATE_LIMIT_EXCEEDED.
ttlMs: 0 creates immediately-expired entry Validation now rejects ttlMs < 1 (was < 0). Zero TTL is semantically "no time to live" and the entry was invisible on the same tick — a trap for callers who might confuse it with "no TTL" (which is expressed by omitting ttlMs).
memory-testing-plan.md at repo root Moved to docs/reference/plugin-state-validation-plan.md, matching the existing application-modernization-plan.md convention.

Tests added for all three limit types (key bytes, namespace bytes, nesting depth). All 31 tests pass (17 unit + 13 e2e + 1 permissions).

Declined

Finding Reasoning
Aisle #3 🔵 Post-commit chmod silently swallowed Deliberate — this was the fix in aadee11. Pre-commit hardening runs strictly and throws. Post-commit is best-effort because the write already committed; throwing would make the caller think the write failed. Logging would be noisy on filesystems that don't support chmod (Docker bind mounts, FAT32).
Greptile delete/clear skip permission hardening DELETE operations don't create WAL sidecar files. The DB was opened with correct permissions, and the next runWriteTransaction re-hardens. Wrapping single-statement DELETEs in BEGIN IMMEDIATE/COMMIT adds overhead for zero benefit.
Greptile Redundant ensureSchema in probe Intentional. The probe is a diagnostic tool — having "schema" as a distinct step lets operators see it passed as its own line item. CREATE TABLE IF NOT EXISTS is essentially free.
Greptile Proxy state allocates on every access Not a hot path — runtime.state is accessed a handful of times during plugin init. The existing subagent property on the same Proxy handler follows the exact same pattern.
No pluginId validation The only production call site is registry.ts, which supplies pluginId from PluginRecord.id — already validated by the plugin loader (validatePluginId in install-paths.ts). Re-validating at the store boundary is redundant.
createdAt naming (vs updatedAt) Defensible as "last registration time" since register() is the only write path. No consumers exist yet. Can revisit before the API stabilises if it proves confusing.

@amknight amknight force-pushed the ak/sqlite-plugin-state-store branch from 9173ca3 to dfd3973 Compare April 29, 2026 10:24
@amknight
Copy link
Copy Markdown
Member Author

Validation blitz — round 3: full gateway E2E lifecycle

Standalone script exercising the plugin state store through a real gateway process — start, write state, stop, verify persistence, restart, verify again. 21 checks, 21 passed, 0 failed.

What this proves that rounds 1–2 didn't

Rounds 1–2 (236 checks) exercised the store API directly and through the registry proxy with a mock runtime. This round starts the actual gateway binary (dist/index.js gateway --port N --bind loopback), uses the same OPENCLAW_STATE_DIR, and exercises the full lifecycle:

gateway start → plugin loader → registry → state dir initialized
→ store writes (concurrent with running gateway)
→ SIGTERM → graceful shutdown (closePluginStateSqliteStore + WAL checkpoint)
→ state dir persists on disk
→ gateway restart → state still readable

Results — 21/21 ✅

Phase Checks What it proves
1. Gateway start 1 Real gateway binary boots, listens on port
2. Store through gateway's state dir 4 Write/read through shared SQLite while gateway is running (multi-process access), probe passes (6 steps), DB file lives in gateway's state dir
3. Graceful shutdown 3 SIGTERM → clean exit, WAL file empty after shutdown (TRUNCATE checkpoint ran), DB file persists
4. Persistence after shutdown 5 All 3 entries survive shutdown (boot-marker, persistence-test, unicode 🚀日本語), new writes succeed after reopen
5. Full restart cycle 5 Gateway restarts cleanly with existing state, data from boot 1 survives, inter-restart writes persist, probe green after full cycle
6. Log verification 2 No plugin-state errors in gateway output, clean shutdown

Key observations

  • WAL checkpoint confirmed: WAL file size = 0 after gateway shutdown — the shutdownStep("plugin-state-store", () => closePluginStateSqliteStore()) in server-close.ts runs the TRUNCATE checkpoint.
  • Multi-process SQLite access: The test process and the gateway process both had the database open simultaneously (WAL mode) with no lock conflicts.
  • DB location: $OPENCLAW_STATE_DIR/plugins/state.sqlite — exactly where the path resolver puts it.

Combined totals (all 3 rounds)

Domain Checks Result
CRUD + Persistence 25
TTL + Sweep 18
Isolation + Multi-plugin 24
Limits + Eviction 22
Single-process stress + WAL 10
Multi-process contention 8
Close during in-flight writes 38
Database integrity + WAL checkpoint 30
RSS memory growth 2
Runtime proxy path + bundled gate 15
Input validation + errors + edge cases 44
Gateway E2E lifecycle 21
Total 257 All pass

@amknight amknight requested a review from vincentkoc April 29, 2026 12:29
@amknight amknight force-pushed the ak/sqlite-plugin-state-store branch from ba799c3 to 1f320e3 Compare April 29, 2026 12:48
Add key byte-length (512B), namespace byte-length (128B), and JSON
nesting depth (64) limits to prevent resource exhaustion. Reject
ttlMs < 1 to eliminate immediately-expired entry trap. Move
validation plan to docs/reference/.
Move plugin state DB from shared ~/.openclaw/plugins/ to dedicated
~/.openclaw/plugin-state/ so permission hardening doesn't touch the
plugin install directory. Remove unnecessary type assertions flagged
by check-lint. Thread correct operation into invalidInput() so
validation errors from lookup/delete/consume/open report the actual
operation instead of hardcoding 'register'. Rename overclaiming E2E
test name.
comradecowboy9 pushed a commit to comradecowboy9/excited-llama that referenced this pull request Apr 30, 2026
* feat(plugins): add experimental sqlite plugin state store
lxe pushed a commit to lxe/openclaw that referenced this pull request May 6, 2026
* feat(plugins): add experimental sqlite plugin state store
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation gateway Gateway runtime maintainer Maintainer-authored PR size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant