
feat: add native Cursor SQLite convo ingestion #287

Closed
saschabuehrle wants to merge 1 commit into MemPalace:main from saschabuehrle:feat/cursor-sqlite-ingestion

Conversation

@saschabuehrle

Summary

This PR adds native Cursor SQLite conversation ingestion so users can point mempalace mine --mode convos directly at Cursor workspace storage and ingest composer chats without manual export.

What changed

  • normalize.py
    • detect SQLite files by signature (SQLite format 3)
    • read Cursor ItemTable rows where key = 'composer.composerData'
    • parse allComposers payloads and normalize user/assistant turns into transcript format
    • safely return empty content for non-conversation SQLite DBs (so they are skipped)
  • convo_miner.py
    • include SQLite suffixes in convo scan (.vscdb, .sqlite, .db)
    • add guardrail heuristics to avoid arbitrary DB ingestion
    • detect Cursor workspace DB candidates (state.vscdb, Cursor workspaceStorage paths)
  • docs
    • updated CLI/README examples to mention Cursor conversation ingestion
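
A minimal sketch of the detection and read steps listed above, not the PR's actual code. The helper names `looks_like_sqlite` and `read_composer_payloads` are hypothetical, and the `ItemTable(key, value)` layout is assumed from the description; the 16-byte header is the documented SQLite file signature.

```python
import json
import sqlite3
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def looks_like_sqlite(path):
    """Return True if the file starts with the SQLite 3 header."""
    try:
        with open(path, "rb") as fh:
            return fh.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
    except OSError:
        return False


def read_composer_payloads(path, key="composer.composerData"):
    """Return decoded JSON payloads stored under `key` in ItemTable.

    Assumes a table ItemTable(key, value); returns [] for any other database.
    """
    uri = Path(path).resolve().as_uri() + "?mode=ro"  # open read-only
    payloads = []
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return payloads
    try:
        rows = conn.execute(
            "SELECT value FROM ItemTable WHERE key = ?", (key,)
        ).fetchall()
    except sqlite3.Error:  # e.g. no ItemTable: not a Cursor workspace DB
        rows = []
    finally:
        conn.close()
    for (value,) in rows:
        if isinstance(value, bytes):
            value = value.decode("utf-8", errors="replace")
        if not isinstance(value, str):
            continue
        try:
            payloads.append(json.loads(value))
        except (json.JSONDecodeError, RecursionError):
            continue  # skip rows that are not usable JSON
    return payloads
```

Opening the database with `mode=ro` means the sketch never writes to a workspace file that Cursor may still have open.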

Why

Implements issue #274 by enabling local-first, verbatim extraction of Cursor composer history from state.vscdb.

Tests

Added:

  • tests/test_normalize_cursor.py
    • verifies Cursor SQLite payload normalization from composer.composerData
    • verifies non-Cursor SQLite DBs are skipped
  • tests/test_convo_scan.py
    • verifies scan includes Cursor state.vscdb
    • verifies scan skips unrelated .db files

Run locally:

  • ruff check mempalace/normalize.py mempalace/convo_miner.py mempalace/cli.py tests/test_normalize_cursor.py tests/test_convo_scan.py
  • pytest tests/test_normalize.py tests/test_normalize_cursor.py tests/test_convo_scan.py tests/test_convo_miner.py -q

Note: full pytest tests/ -v currently fails on this machine in existing search/mcp tests due to an upstream ONNX/CoreML runtime provider error, unrelated to these changes.

Fixes #274

Greetings, saschabuehrle

@bgauryy

bgauryy commented Apr 8, 2026

PR Review: feat: add native Cursor SQLite convo ingestion

Executive Summary

| Aspect | Value |
| --- | --- |
| PR Goal | Allow mempalace mine --mode convos to ingest Cursor composer chats directly from state.vscdb SQLite databases without manual export |
| Files Changed | 6 (2 new test files, 4 modified) |
| Risk Level | 🟡 MEDIUM - new binary file ingestion path with recursive JSON walker |
| Review Effort | 3/5 - moderate; most complexity is in normalize.py parser logic |
| Recommendation | 💬 COMMENT - good overall, a few defensive improvements recommended |

Affected Areas: mempalace/normalize.py (core parser), mempalace/convo_miner.py (file scanner), mempalace/cli.py (docstring), README.md, tests/test_normalize_cursor.py, tests/test_convo_scan.py

Business Impact: Users can now point mempalace mine at Cursor workspace storage directories and automatically ingest composer chat history — removing a manual export step that was previously required.

Flow Changes: normalize() now short-circuits on SQLite files before attempting text decode. scan_convos() now includes .vscdb/.sqlite/.db files but filters through _is_cursor_sqlite_candidate() heuristic to avoid picking up unrelated databases.

Ratings

| Aspect | Score |
| --- | --- |
| Correctness | 4/5 |
| Security | 5/5 |
| Performance | 4/5 |
| Maintainability | 4/5 |

PR Health

  • Has clear description
  • References ticket/issue (if applicable) — no issue linked
  • Appropriate size (240 additions is well-scoped)
  • Has relevant tests (2 new test files)

High Priority Issues

🐛 #1: Recursive walk() has no depth guard — could crash on corrupt/deeply nested DB payloads

Location: mempalace/normalize.py, _extract_cursor_messages.walk() | Confidence: ⚠️ MED

The walk() closure recursively descends into every dict value and list item without any depth limit. A corrupt or adversarial state.vscdb with deeply nested JSON (e.g. 1000+ levels) will hit Python's default recursion limit (usually 1000) and raise RecursionError, crashing the mining run for that file.

Since normalize() is called inside a try/except Exception: continue in convo_miner.py line 260, the crash is contained — but it means the entire file is silently skipped with no diagnostic output.

 def _extract_cursor_messages(payload) -> list:
     """Extract (role, text) pairs from Cursor composer payloads."""
     messages = []
+    MAX_DEPTH = 50

-    def walk(node):
+    def walk(node, depth=0):
+        if depth > MAX_DEPTH:
+            return
         if isinstance(node, dict):
             role = _cursor_role(node)
             text = _cursor_text(node)
             if role and text:
                 messages.append((role, text))
             for value in node.values():
-                walk(value)
+                walk(value, depth + 1)
         elif isinstance(node, list):
             for item in node:
-                walk(item)
+                walk(item, depth + 1)
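
To see the guard in action, here is a self-contained sketch of the patched walker. `role_of` and `text_of` are simplified stand-ins for `_cursor_role` and `_cursor_text` (their real logic is not shown in this review), and the payload shape is made up for illustration.

```python
MAX_DEPTH = 50


def role_of(node):
    """Simplified stand-in: accept only exact 'user' / 'assistant' roles."""
    role = node.get("role")
    return role if role in ("user", "assistant") else None


def text_of(node):
    """Simplified stand-in: take a non-empty string 'text' field."""
    text = node.get("text")
    return text.strip() if isinstance(text, str) and text.strip() else None


def extract_messages(payload, max_depth=MAX_DEPTH):
    """Collect (role, text) pairs, ignoring anything nested deeper than max_depth."""
    messages = []

    def walk(node, depth=0):
        if depth > max_depth:
            return
        if isinstance(node, dict):
            role, text = role_of(node), text_of(node)
            if role and text:
                messages.append((role, text))
            for value in node.values():
                walk(value, depth + 1)
        elif isinstance(node, list):
            for item in node:
                walk(item, depth + 1)

    walk(payload)
    return messages


# Build a pathological payload iteratively: 5000 nested lists around one message.
deep = {"role": "user", "text": "buried"}
for _ in range(5000):
    deep = [deep]

payload = {
    "allComposers": [
        {"role": "user", "text": "hi"},
        {"role": "assistant", "text": "hello"},
    ],
    "junk": deep,
}
print(extract_messages(payload))  # [('user', 'hi'), ('assistant', 'hello')]
```

The shallow messages are still extracted, while the message buried 5000 levels down is dropped instead of raising RecursionError. Whether silently dropping deep content is acceptable, or whether a warning should be logged, is a separate design choice.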

Medium Priority Issues

🐛 #2: _cursor_role substring matching is overly broad — "prompt" token matches non-user roles

Location: mempalace/normalize.py, _cursor_role() | Confidence: ⚠️ MED

The role detection checks if any(token in role for token in ("user", "human", "prompt", "request")). The token "prompt" will match any field value containing that substring — e.g. "type": "prompt_template", "type": "system_prompt", or "sender": "prompt_injection_filter". These would all be classified as "user" role, producing false positive messages.

Cursor's actual composer message schema uses "role": "user" or "role": "assistant" — the fuzzy matching is a reasonable fallback, but "prompt" is particularly overloaded in AI tooling contexts.

-        if any(token in role for token in ("user", "human", "prompt", "request")):
+        if any(token in role for token in ("user", "human")):
             return "user"

If keeping "prompt" is desired for coverage, consider exact match or prefix match instead of substring containment.
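
A rough sketch of the exact-match alternative, contrasted with substring containment. `classify_role` and `substring_is_user` are hypothetical names, and the assistant token set is an assumption made for illustration rather than something taken from the PR.

```python
USER_ROLES = {"user", "human"}
ASSISTANT_ROLES = {"assistant", "ai"}  # assumed tokens, for illustration only


def classify_role(raw):
    """Map a raw role value to 'user', 'assistant', or None using exact matches."""
    if not isinstance(raw, str):
        return None
    role = raw.strip().lower()
    if role in USER_ROLES:
        return "user"
    if role in ASSISTANT_ROLES:
        return "assistant"
    return None


def substring_is_user(raw):
    """The substring check quoted in the review, kept here for comparison."""
    role = raw.lower()
    return any(token in role for token in ("user", "human", "prompt", "request"))


for value in ("User", "system_prompt", "prompt_template"):
    print(value, substring_is_user(value), classify_role(value))
```

With exact matching, `system_prompt` and `prompt_template` are no longer classified as user turns, at the cost of missing role names the set does not list.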


🔄 #3: normalize() returns empty string for non-Cursor SQLite files — changes behavior for existing .db files

Location: mempalace/normalize.py, normalize() early return | Confidence: ✅ HIGH

When _looks_like_sqlite() returns True but _try_cursor_sqlite() returns None (e.g. a SQLite file that isn't Cursor), normalize() returns "". Previously, such files would have been read as text (with errors="replace") and passed through.

This is likely intentional (binary garbage avoidance), and the downstream convo_miner.py already checks if not content at line 263, so empty strings are safely skipped. But it's a behavior change worth documenting — any non-Cursor SQLite file in a convo directory will now be silently dropped rather than producing garbled text.

No code change needed — just noting the semantic change.
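
The behavior change can be illustrated with a small sketch. `normalize_sketch` is a hypothetical stand-in for `normalize()`, and the Cursor reader is passed in as a parameter so the example stays self-contained; the real function's signature and internals may differ.

```python
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # documented SQLite 3 file header


def normalize_sketch(path, try_cursor_sqlite):
    """Short-circuit on SQLite files; fall back to text decoding otherwise.

    `try_cursor_sqlite` is any callable returning a transcript string or None.
    """
    data = Path(path).read_bytes()
    if data.startswith(SQLITE_MAGIC):
        # New behavior: a SQLite file with no Cursor data yields "".
        return try_cursor_sqlite(path) or ""
    # Existing path: lenient text decode.
    return data.decode("utf-8", errors="replace")


def legacy_decode(path):
    """Roughly what happened before: binary content decoded as garbled text."""
    return Path(path).read_bytes().decode("utf-8", errors="replace")
```

Calling `normalize_sketch(db_path, lambda p: None)` on a non-Cursor database returns `""`, whereas `legacy_decode(db_path)` returns a non-empty string of header text and replacement characters.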


🎨 #4: _is_cursor_sqlite_candidate name check uses startswith("state.vscdb") — matches unintended filenames

Location: mempalace/convo_miner.py, _is_cursor_sqlite_candidate() | Confidence: ⚠️ MED

name.startswith("state.vscdb") will match state.vscdb but also hypothetical files like state.vscdb.backup, state.vscdb-wal, or state.vscdb.old. While the downstream normalize() will fail gracefully on non-SQLite files, using exact match is more precise:

-    if name.startswith("state.vscdb"):
+    if name == "state.vscdb":
         return True
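
For context, a sketch of how the exact-name check could sit alongside a path heuristic. `is_cursor_sqlite_candidate` here is a hypothetical reimplementation; requiring both "cursor" and "workspacestorage" among the path parts is an assumption about how the PR combines them.

```python
from pathlib import PurePath

SQLITE_CONVO_EXTENSIONS = {".vscdb", ".sqlite", ".db"}


def is_cursor_sqlite_candidate(path):
    """Heuristically decide whether a path is worth opening as a Cursor DB."""
    p = PurePath(path)
    if p.name.lower() == "state.vscdb":  # exact match, not startswith
        return True
    if p.suffix.lower() not in SQLITE_CONVO_EXTENSIONS:
        return False
    parts = {part.lower() for part in p.parts}
    # Assumed rule: both markers must appear somewhere in the path.
    return "cursor" in parts and "workspacestorage" in parts


for candidate in (
    "Cursor/User/workspaceStorage/abc/state.vscdb",
    "Cursor/User/workspaceStorage/abc/state.vscdb.backup",
    "Cursor/User/workspaceStorage/abc/state.vscdb-wal",
    "data/app/cache.db",
):
    print(candidate, is_cursor_sqlite_candidate(candidate))
```

With the exact match, the `.backup` and `-wal` siblings are rejected before any file is opened.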

Low Priority Issues

🎨 #5: Tests cover happy path well but miss edge cases

Location: tests/test_normalize_cursor.py | Confidence: ✅ HIGH

The test suite has good positive/negative coverage but would benefit from:

  • Empty allComposers list (should return "")
  • Single-message composer session (should return None due to len(deduped) >= 2 check)
  • Non-string values in ItemTable (e.g. blob data)
  • Payload with nested non-message dicts that have role-like keys (false positive test)

🔗 #6: SQLITE_CONVO_EXTENSIONS duplicates a subset of CONVO_EXTENSIONS

Location: mempalace/convo_miner.py | Confidence: ✅ HIGH

Both sets contain .vscdb, .sqlite, .db. The relationship could be made explicit:

 SQLITE_CONVO_EXTENSIONS = {".vscdb", ".sqlite", ".db"}
+# SQLite extensions are a subset of CONVO_EXTENSIONS — kept separate for
+# the _is_cursor_sqlite_candidate guard in scan_convos.

This is documentation-level — the current code is correct but the intent isn't obvious to future maintainers.
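
One way to make the subset relationship explicit in code rather than in a comment is to build the combined set from its parts. This is only a sketch: `TEXT_CONVO_EXTENSIONS` is a hypothetical name, and its contents are assumed from the flow diagram in this review rather than from the actual module.

```python
# Hypothetical split; the real CONVO_EXTENSIONS may contain other suffixes.
TEXT_CONVO_EXTENSIONS = {".md", ".json", ".jsonl"}
SQLITE_CONVO_EXTENSIONS = {".vscdb", ".sqlite", ".db"}

# The union guarantees SQLite suffixes are always a subset of the scan set.
CONVO_EXTENSIONS = TEXT_CONVO_EXTENSIONS | SQLITE_CONVO_EXTENSIONS


def needs_sqlite_gate(suffix):
    """Return True if a file suffix should pass through the Cursor candidate check."""
    return suffix.lower() in SQLITE_CONVO_EXTENSIONS
```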


Flow Impact Analysis

Before:
  scan_convos() → [.md, .json, .jsonl files] → normalize() → text decode → chunking

After:
  scan_convos() → [.md, .json, .jsonl, .vscdb/.sqlite/.db files]
                           │                    │
                           │                    └─ _is_cursor_sqlite_candidate() gate
                           │
                   normalize()
                     │
                     ├─ _looks_like_sqlite()? ──YES──→ _try_cursor_sqlite() → transcript or ""
                     │
                     └─ NO → existing text/JSON path (unchanged)

The SQLite detection is cleanly gated: binary signature check happens first, then Cursor-specific key lookup. Non-Cursor databases produce "" and are skipped by the downstream if not content guard. Existing text/JSON normalization is untouched.


Created by Octocode MCP https://octocode.ai


@web3guru888 web3guru888 left a comment


This is a really useful addition — Cursor's state.vscdb stores a ton of valuable conversation context that would otherwise require manual export. The implementation is thoughtful:

  • SQLite signature detection (_looks_like_sqlite) avoids blindly opening every .db file as a conversation source
  • Heuristic filtering in _is_cursor_sqlite_candidate (checking for "cursor" and "workspacestorage" in path parts) prevents accidental ingestion of unrelated SQLite databases
  • Adjacent dedup in the extracted messages handles overlapping nested structures cleanly
  • Read-only mode (?mode=ro) is the right call for not corrupting the workspace DB
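
As an illustration of the adjacent dedup step mentioned above, a minimal sketch follows. `dedupe_adjacent` is a hypothetical helper, not the PR's function; it collapses only consecutive duplicates, so a repeated message later in the conversation is kept.

```python
from itertools import groupby


def dedupe_adjacent(messages):
    """Collapse runs of identical consecutive (role, text) pairs into one."""
    return [pair for pair, _ in groupby(messages)]


turns = [
    ("user", "hi"),
    ("user", "hi"),          # duplicate picked up from a nested copy
    ("assistant", "hello"),
    ("user", "hi"),          # genuine repeat later on, kept
]
print(dedupe_adjacent(turns))
# [('user', 'hi'), ('assistant', 'hello'), ('user', 'hi')]
```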

The recursive walk() approach for extracting messages from the nested composer payload is robust against schema changes in Cursor's internal format.

Nice work expanding the ingestion surface. Pairs well with #276 (your .rst PR).

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@bensig
Collaborator

bensig commented Apr 11, 2026

Conflicts with main. Cursor SQLite ingestion is a good feature — would you be able to rebase against current main?

@nanoscopic

Uh, @bensig ... you closed #232 saying this takes care of it. And immediately closed this also because it needs to be rebased?

Things that just need to be rebased shouldn't be closed. They should be left open until they are actually addressed.

By closing issues like this you are telling the people who did the work "Whatever, I don't want help," even if that is not your intent.

Also, insisting people rebase everything so all you have to do is press merge is not conducive to people being willing to continue making PRs in a fast moving project.

Instead of closing the issues, why don't you or Milla or those with something to gain from the project succeeding tell AI to look at the things you do want to add and fix those PRs so they can merge cleanly?

You are asking people "would you rebase this?" That's common, but they could ask the same of you back. They made a PR that does something useful. Why not utilize it?



Development

Successfully merging this pull request may close these issues.

feat: Native Cursor SQLite Ingestion Support

5 participants