feat: add native Cursor SQLite convo ingestion by saschabuehrle · Pull Request #287 · MemPalace/mempalace

saschabuehrle · 2026-04-08T19:45:31Z

Summary

This PR adds native Cursor SQLite conversation ingestion so users can point mempalace mine --mode convos directly at Cursor workspace storage and ingest composer chats without manual export.

What changed

normalize.py
- detect SQLite files by signature (SQLite format 3)
- read Cursor ItemTable rows where key = 'composer.composerData'
- parse allComposers payloads and normalize user/assistant turns into transcript format
- safely return empty content for non-conversation SQLite DBs (so they are skipped)
convo_miner.py
- include SQLite suffixes in convo scan (.vscdb, .sqlite, .db)
- add guardrail heuristics to avoid arbitrary DB ingestion
- detect Cursor workspace DB candidates (state.vscdb, Cursor workspaceStorage paths)
docs
- updated CLI/README examples to mention Cursor conversation ingestion
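
For readers who want a concrete picture of the two normalize.py steps listed above (signature detection and the ItemTable lookup), here is a minimal sketch. It is not the code from this PR: the function names `looks_like_sqlite` and `read_composer_payload` are illustrative, and the assumption that the stored value is JSON text (possibly as bytes) is mine. Only the `SQLite format 3` signature, the `ItemTable` table, and the `composer.composerData` key come from the description.

```python
import json
import sqlite3
from pathlib import Path
from typing import Optional

SQLITE_MAGIC = b"SQLite format 3\x00"  # 16-byte header of every SQLite 3 file


def looks_like_sqlite(path: Path) -> bool:
    """Return True if the file starts with the SQLite 3 header string."""
    try:
        with open(path, "rb") as fh:
            return fh.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
    except OSError:
        return False


def read_composer_payload(path: Path) -> Optional[dict]:
    """Return the decoded composer.composerData row, or None if unavailable."""
    uri = f"{path.resolve().as_uri()}?mode=ro"  # read-only, never writes to the DB
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return None
    try:
        row = conn.execute(
            "SELECT value FROM ItemTable WHERE key = ?",
            ("composer.composerData",),
        ).fetchone()
    except sqlite3.Error:  # e.g. a non-Cursor database without ItemTable
        return None
    finally:
        conn.close()
    if row is None:
        return None
    raw = row[0]
    if isinstance(raw, bytes):  # assumption: the value may be stored as a blob
        raw = raw.decode("utf-8", errors="replace")
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        return None
    return payload if isinstance(payload, dict) else None
```

Returning `None` for every failure mode lets the caller treat "not a Cursor database" and "unreadable database" the same way, which matches the skip behavior described in the list above.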

Why

Implements issue #274 by enabling local-first, verbatim extraction of Cursor composer history from state.vscdb.

Tests

Added:

tests/test_normalize_cursor.py
- verifies Cursor SQLite payload normalization from composer.composerData
- verifies non-Cursor SQLite DBs are skipped
tests/test_convo_scan.py
- verifies scan includes Cursor state.vscdb
- verifies scan skips unrelated .db files

Run locally:

ruff check mempalace/normalize.py mempalace/convo_miner.py mempalace/cli.py tests/test_normalize_cursor.py tests/test_convo_scan.py
pytest tests/test_normalize.py tests/test_normalize_cursor.py tests/test_convo_scan.py tests/test_convo_miner.py -q ✅

Note: the full pytest tests/ -v run currently fails on this machine in the existing search/mcp tests due to an upstream ONNX/CoreML runtime provider error, unrelated to these changes.

Fixes #274

Greetings, saschabuehrle

bgauryy · 2026-04-08T22:57:19Z

PR Review: feat: add native Cursor SQLite convo ingestion

Executive Summary

Aspect	Value
PR Goal	Allow `mempalace mine --mode convos` to ingest Cursor composer chats directly from `state.vscdb` SQLite databases without manual export
Files Changed	6 (2 new test files, 4 modified)
Risk Level	🟡 MEDIUM - new binary file ingestion path with recursive JSON walker
Review Effort	3/5 - moderate; most complexity is in `normalize.py` parser logic
Recommendation	💬 COMMENT - good overall, a few defensive improvements recommended

Affected Areas: mempalace/normalize.py (core parser), mempalace/convo_miner.py (file scanner), mempalace/cli.py (docstring), README.md, tests/test_normalize_cursor.py, tests/test_convo_scan.py

Business Impact: Users can now point mempalace mine at Cursor workspace storage directories and automatically ingest composer chat history — removing a manual export step that was previously required.

Flow Changes: normalize() now short-circuits on SQLite files before attempting text decode. scan_convos() now includes .vscdb/.sqlite/.db files but filters through _is_cursor_sqlite_candidate() heuristic to avoid picking up unrelated databases.

Ratings

Aspect	Score
Correctness	4/5
Security	5/5
Performance	4/5
Maintainability	4/5

PR Health

Has clear description
References ticket/issue (if applicable) — no issue linked
Appropriate size (240 additions is well-scoped)
Has relevant tests (2 new test files)

High Priority Issues

🐛 #1: Recursive `walk()` has no depth guard — could crash on corrupt/deeply nested DB payloads

Location: mempalace/normalize.py — _extract_cursor_messages.walk() | Confidence: ⚠️ MED

The walk() closure recursively descends into every dict value and list item without any depth limit. A corrupt or adversarial state.vscdb with deeply nested JSON (e.g. 1000+ levels) will hit Python's default recursion limit (usually 1000) and raise RecursionError, crashing the mining run for that file.

Since normalize() is called inside a try/except Exception: continue in convo_miner.py line 260, the crash is contained — but it means the entire file is silently skipped with no diagnostic output.

 def _extract_cursor_messages(payload) -> list:
     """Extract (role, text) pairs from Cursor composer payloads."""
     messages = []
+    MAX_DEPTH = 50

-    def walk(node):
+    def walk(node, depth=0):
+        if depth > MAX_DEPTH:
+            return
         if isinstance(node, dict):
             role = _cursor_role(node)
             text = _cursor_text(node)
             if role and text:
                 messages.append((role, text))
             for value in node.values():
-                walk(value)
+                walk(value, depth + 1)
         elif isinstance(node, list):
             for item in node:
-                walk(item)
+                walk(item, depth + 1)
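
A self-contained version of the guarded walk, for anyone who wants to try it against a synthetic payload. This is a sketch under assumptions: `_role` and `_text` are hypothetical stand-ins for `_cursor_role` and `_cursor_text` (whose bodies are not shown in this review), and the warning log is an addition meant to address the "no diagnostic output" concern, not something the PR contains.

```python
import logging

logger = logging.getLogger(__name__)
MAX_DEPTH = 50  # same cap as in the suggested diff


def _role(node: dict):
    """Hypothetical stand-in for _cursor_role: exact role match only."""
    role = node.get("role")
    return role if role in ("user", "assistant") else None


def _text(node: dict):
    """Hypothetical stand-in for _cursor_text: non-empty string under 'text'."""
    text = node.get("text")
    return text.strip() if isinstance(text, str) and text.strip() else None


def extract_messages(payload) -> list:
    """Collect (role, text) pairs, skipping branches deeper than MAX_DEPTH."""
    messages = []
    truncated = False

    def walk(node, depth=0):
        nonlocal truncated
        if depth > MAX_DEPTH:
            truncated = True  # remember that something was cut off
            return
        if isinstance(node, dict):
            role, text = _role(node), _text(node)
            if role and text:
                messages.append((role, text))
            for value in node.values():
                walk(value, depth + 1)
        elif isinstance(node, list):
            for item in node:
                walk(item, depth + 1)

    walk(payload)
    if truncated:
        logger.warning(
            "payload nested deeper than %d levels; some branches skipped", MAX_DEPTH
        )
    return messages
```

With the guard in place, a pathologically nested branch costs at most `MAX_DEPTH + 2` stack frames, and messages found in the shallow parts of the payload are still returned instead of the whole file being dropped.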

Medium Priority Issues

🐛 #2: `_cursor_role` substring matching is overly broad — "prompt" token matches non-user roles

Location: mempalace/normalize.py — _cursor_role() | Confidence: ⚠️ MED

The role detection checks if any(token in role for token in ("user", "human", "prompt", "request")). The token "prompt" will match any field value containing that substring — e.g. "type": "prompt_template", "type": "system_prompt", or "sender": "prompt_injection_filter". These would all be classified as "user" role, producing false positive messages.

Cursor's actual composer message schema uses "role": "user" or "role": "assistant" — the fuzzy matching is a reasonable fallback, but "prompt" is particularly overloaded in AI tooling contexts.

-        if any(token in role for token in ("user", "human", "prompt", "request")):
+        if any(token in role for token in ("user", "human")):
             return "user"

If keeping "prompt" is desired for coverage, consider exact match or prefix match instead of substring containment.
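
One possible shape for the stricter matching, as a sketch only. `classify_role`, `USER_ROLES`, and `ASSISTANT_ROLES` are illustrative names, and the assistant token set is an assumption since the review does not show which assistant tokens the PR uses. The idea is to compare the first word-like token of the field against a small allow-list rather than testing substring containment.

```python
import re
from typing import Optional

USER_ROLES = {"user", "human"}
ASSISTANT_ROLES = {"assistant", "ai"}  # assumption: not taken from the PR


def classify_role(raw) -> Optional[str]:
    """Map a raw role/type/sender value to 'user', 'assistant', or None."""
    if not isinstance(raw, str):
        return None
    # Split on anything that is not a letter or digit: "user_message" -> ["user", "message"]
    tokens = [t for t in re.split(r"[^a-z0-9]+", raw.strip().lower()) if t]
    if not tokens:
        return None
    head = tokens[0]
    if head in USER_ROLES:
        return "user"
    if head in ASSISTANT_ROLES:
        return "assistant"
    return None
```

Under this scheme `"user"` and `"User_Message"` classify as user, while `"system_prompt"`, `"prompt_template"`, and `"prompt_injection_filter"` classify as nothing, because their leading token is not in either set.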

🔄 #3: `normalize()` returns empty string for non-Cursor SQLite files — changes behavior for existing `.db` files

Location: mempalace/normalize.py — normalize() early return | Confidence: ✅ HIGH

When _looks_like_sqlite() returns True but _try_cursor_sqlite() returns None (e.g. a SQLite file that isn't Cursor), normalize() returns "". Previously, such files would have been read as text (with errors="replace") and passed through.

This is likely intentional (binary garbage avoidance), and the downstream convo_miner.py already checks if not content at line 263, so empty strings are safely skipped. But it's a behavior change worth documenting — any non-Cursor SQLite file in a convo directory will now be silently dropped rather than producing garbled text.

No code change needed — just noting the semantic change.

🎨 #4: `_is_cursor_sqlite_candidate` name check uses `startswith("state.vscdb")` — matches unintended filenames

Location: mempalace/convo_miner.py — _is_cursor_sqlite_candidate() | Confidence: ⚠️ MED

name.startswith("state.vscdb") will match state.vscdb but also hypothetical files like state.vscdb.backup, state.vscdb-wal, or state.vscdb.old. While the downstream normalize() will fail gracefully on non-SQLite files, using exact match is more precise:

-    if name.startswith("state.vscdb"):
+    if name == "state.vscdb":
         return True
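
For illustration, a standalone candidate check with the exact-name comparison could look like the sketch below. It is not the PR's implementation: the function name is shortened, and requiring both `cursor` and `workspacestorage` among the path parts is my assumption (the PR may accept either). The extension set is the one quoted elsewhere in this review.

```python
from pathlib import Path

SQLITE_CONVO_EXTENSIONS = {".vscdb", ".sqlite", ".db"}


def is_cursor_sqlite_candidate(path: Path) -> bool:
    """Heuristic gate: accept only files that plausibly are Cursor workspace DBs."""
    if path.suffix.lower() not in SQLITE_CONVO_EXTENSIONS:
        return False
    if path.name.lower() == "state.vscdb":  # exact match, not startswith
        return True
    # Assumption: both markers must appear as whole path components.
    parts = {part.lower() for part in path.parts}
    return "cursor" in parts and "workspacestorage" in parts
```

Note that in this sketch the suffix check already rejects names such as `state.vscdb.backup` (suffix `.backup`), so the exact-name comparison mainly documents intent. Whether that also holds in the PR depends on where its extension filter runs relative to the name check.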

Low Priority Issues

🎨 #5: Tests cover happy path well but miss edge cases

Location: tests/test_normalize_cursor.py | Confidence: ✅ HIGH

The test suite has good positive/negative coverage but would benefit from:

- Empty allComposers list (should return "")
- Single-message composer session (should return None due to len(deduped) >= 2 check)
- Non-string values in ItemTable (e.g. blob data)
- Payload with nested non-message dicts that have role-like keys (false positive test)

🔗 #6: `SQLITE_CONVO_EXTENSIONS` duplicates a subset of `CONVO_EXTENSIONS`

Location: mempalace/convo_miner.py | Confidence: ✅ HIGH

Both sets contain .vscdb, .sqlite, .db. The relationship could be made explicit:

 SQLITE_CONVO_EXTENSIONS = {".vscdb", ".sqlite", ".db"}
+# SQLite extensions are a subset of CONVO_EXTENSIONS — kept separate for
+# the _is_cursor_sqlite_candidate guard in scan_convos.

This is documentation-level — the current code is correct but the intent isn't obvious to future maintainers.
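
An alternative to the comment is to derive one set from the other so the subset relationship cannot drift. The sketch below is hypothetical: `TEXT_CONVO_EXTENSIONS` is a name introduced here, and its contents are taken from the "Before" flow in this review rather than from the actual constant, which may list more extensions.

```python
# Assumption: the text-based extensions are those shown in the "Before" flow.
TEXT_CONVO_EXTENSIONS = {".md", ".json", ".jsonl"}
SQLITE_CONVO_EXTENSIONS = {".vscdb", ".sqlite", ".db"}

# Building the union in code makes the relationship explicit to maintainers.
CONVO_EXTENSIONS = TEXT_CONVO_EXTENSIONS | SQLITE_CONVO_EXTENSIONS

# Cheap import-time invariant: every SQLite extension is also a convo extension.
assert SQLITE_CONVO_EXTENSIONS <= CONVO_EXTENSIONS
```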

Flow Impact Analysis

Before:
  scan_convos() → [.md, .json, .jsonl files] → normalize() → text decode → chunking

After:
  scan_convos() → [.md, .json, .jsonl, .vscdb/.sqlite/.db files]
                           │                    │
                           │                    └─ _is_cursor_sqlite_candidate() gate
                           │
                   normalize()
                     │
                     ├─ _looks_like_sqlite()? ──YES──→ _try_cursor_sqlite() → transcript or ""
                     │
                     └─ NO → existing text/JSON path (unchanged)

The SQLite detection is cleanly gated: binary signature check happens first, then Cursor-specific key lookup. Non-Cursor databases produce "" and are skipped by the downstream if not content guard. Existing text/JSON normalization is untouched.

Created by Octocode MCP https://octocode.ai

web3guru888

This is a really useful addition — Cursor's state.vscdb stores a ton of valuable conversation context that would otherwise require manual export. The implementation is thoughtful:

- SQLite signature detection (_looks_like_sqlite) avoids blindly opening every .db file as a conversation source
- Heuristic filtering in _is_cursor_sqlite_candidate (checking for "cursor" and "workspacestorage" in path parts) prevents accidental ingestion of unrelated SQLite databases
- Adjacent dedup in the extracted messages handles overlapping nested structures cleanly
- Read-only mode (?mode=ro) is the right call for not corrupting the workspace DB

The recursive walk() approach for extracting messages from the nested composer payload is robust against schema changes in Cursor's internal format.
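
To make the adjacent-dedup point concrete: when a recursive walk visits both a wrapper object and the message nested inside it, the same (role, text) pair can be emitted twice in a row. A minimal sketch of collapsing such repeats follows; `dedupe_adjacent` is an illustrative name, not necessarily what the PR calls it.

```python
def dedupe_adjacent(messages):
    """Drop a (role, text) pair when it equals the pair immediately before it."""
    deduped = []
    for pair in messages:
        if not deduped or deduped[-1] != pair:
            deduped.append(pair)
    return deduped
```

Only consecutive repeats are removed, so a user who sends the same text again later in the conversation keeps both turns.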

Nice work expanding the ingestion surface. Pairs well with #276 (your .rst PR).

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

bensig · 2026-04-11T05:28:04Z

Conflicts with main. Cursor SQLite ingestion is a good feature — would you be able to rebase against current main?

nanoscopic · 2026-04-16T09:16:12Z

Uh, @bensig ... you closed #232 saying this takes care of it. And immediately closed this also because it needs to be rebased?

Things that just need to be rebased shouldn't be closed. They should be left open until they are actually addressed.

By closing issues like this you are telling the people who did the work "Whatever I don't want help" even if that is not your intent.

Also, insisting people rebase everything so all you have to do is press merge is not conducive to people being willing to continue making PRs in a fast moving project.

Instead of closing the issues, why don't you or Milla or those with something to gain from the project succeeding tell AI to look at the things you do want to add and fix those PRs so they can merge cleanly?

You are asking people "would you rebase this?" That's common, but they could ask the same of you back. They made a PR that does something useful. Why not utilize it?

feat: add native Cursor SQLite convo ingestion

cc49ea6

web3guru888 reviewed Apr 10, 2026

bensig closed this Apr 11, 2026

bensig mentioned this pull request Apr 11, 2026

feat: add Cursor agent transcript JSONL normalizer #232

Closed

saschabuehrle wants to merge 1 commit into MemPalace:main from saschabuehrle:feat/cursor-sqlite-ingestion

5 participants
