Skip to content

Disk-backed large dataset support #91

Description

@vmvarela

Part of #69
Depends on #90 (streaming spike — completed)

Description

Switch sql-pipe from an in-memory SQLite database to a file-backed temporary database when handling large inputs. This allows datasets larger than available RAM to be processed without crashing, across all supported input formats (CSV, TSV, NDJSON, JSON, XML).

Two changes are required (both are necessary — neither is sufficient alone):

  1. Open a file-backed temp DB instead of sqlite3_open(":memory:", &db) — this is what actually bounds data memory usage
  2. PRAGMA temp_store = FILE — ensures transient structures (ORDER BY sorts, GROUP BY indices) also spill to disk rather than RAM

All SQL operations continue to work correctly. No changes to query semantics or output.

Acceptance Criteria

  • --disk flag (or --large) switches the main database from :memory: to a file-backed temp DB
  • PRAGMA temp_store = FILE is set when --disk is active
  • Temp files are cleaned up after execution (even on error)
  • Queries that previously ran out of memory on 1GB+ files now complete successfully
  • Existing behavior unchanged when flag is not set (default stays :memory:)
  • Error message if temp directory is not writable
  • Works for all input formats: CSV, TSV, NDJSON, JSON, XML

Notes

  • Open the temp DB with an empty string path ("") or an explicit OS temp path — SQLite treats "" as a private temp file that is deleted automatically on close
  • Do NOT use PRAGMA temp_store_directory — it is deprecated; temp location is controlled via the SQLITE_TMPDIR / TMPDIR environment variable
  • The flag is binary (in-memory vs disk-backed), not a size threshold — there is no continuous SQLite API to "use disk above X MB"
  • JSON and XML loaders still buffer the raw input file before parsing — this is a separate concern tracked in future sub-issues; --disk reduces SQLite memory pressure but not parser-level buffering for these two formats
  • Follow spike recommendation: see Spike: evaluate streaming approaches for large datasets (all input formats) #90 comment for full trade-off analysis

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions