Part of #69
Description
Investigate approaches for handling large input files that don't fit in memory, across all supported input formats: CSV, TSV, JSON, NDJSON, and XML.
The tool currently uses an in-memory SQLite database for all formats. Each format has different memory characteristics:
| Format |
Current behavior |
Memory concern |
| CSV/TSV |
Streaming parser → row-by-row INSERT into :memory: |
Data accumulates in SQLite RAM |
| NDJSON |
Streaming line-by-line → row-by-row INSERT into :memory: |
Data accumulates in SQLite RAM |
| JSON |
Entire file read into RAM, then parsed and inserted |
Double buffering (raw bytes + parse tree) |
| XML |
Entire file read into RAM, then parsed and inserted |
Double buffering (raw bytes + parse tree) |
Two approaches are evaluated:
- SQLite virtual table — streaming input via a virtual table interface (read-only, CSV-specific; not directly applicable to JSON/XML)
- Disk-backed temp storage — switch from
:memory: to a file-backed SQLite database + PRAGMA temp_store = FILE, so both data and transient query structures (ORDER BY, GROUP BY) can spill to disk
Produce a written recommendation (implementation notes, trade-offs, estimated effort per format) to guide sub-issues 2–5.
Acceptance Criteria
Notes
- The disk-backed approach (file-backed DB +
PRAGMA temp_store = FILE) likely delivers 80% of the value with 20% of the complexity, and applies uniformly to all formats
- The virtual table approach is CSV/TSV-specific; JSON and XML have no equivalent streaming virtual table in SQLite
- JSON and XML loaders have an additional challenge: they buffer the entire raw input before parsing — fixing that requires format-level changes independent of the SQLite storage strategy
- NDJSON is already streaming at the parser level (line-by-line), same as CSV/TSV — its main bottleneck is the in-memory SQLite DB
- Timebox to 4 hours max
Part of #69
Description
Investigate approaches for handling large input files that don't fit in memory, across all supported input formats: CSV, TSV, JSON, NDJSON, and XML.
The tool currently uses an in-memory SQLite database for all formats. Each format has different memory characteristics:
:memory::memory:Two approaches are evaluated:
:memory:to a file-backed SQLite database +PRAGMA temp_store = FILE, so both data and transient query structures (ORDER BY, GROUP BY) can spill to diskProduce a written recommendation (implementation notes, trade-offs, estimated effort per format) to guide sub-issues 2–5.
Acceptance Criteria
Notes
PRAGMA temp_store = FILE) likely delivers 80% of the value with 20% of the complexity, and applies uniformly to all formats