Skip to content

Spike: evaluate streaming approaches for large datasets (all input formats) #90

Description

@vmvarela

Part of #69

Description

Investigate approaches for handling large input files that don't fit in memory, across all supported input formats: CSV, TSV, JSON, NDJSON, and XML.

The tool currently uses an in-memory SQLite database for all formats. Each format has different memory characteristics:

Format Current behavior Memory concern
CSV/TSV Streaming parser → row-by-row INSERT into :memory: Data accumulates in SQLite RAM
NDJSON Streaming line-by-line → row-by-row INSERT into :memory: Data accumulates in SQLite RAM
JSON Entire file read into RAM, then parsed and inserted Double buffering (raw bytes + parse tree)
XML Entire file read into RAM, then parsed and inserted Double buffering (raw bytes + parse tree)

Two approaches are evaluated:

  1. SQLite virtual table — streaming input via a virtual table interface (read-only, CSV-specific; not directly applicable to JSON/XML)
  2. Disk-backed temp storage — switch from :memory: to a file-backed SQLite database + PRAGMA temp_store = FILE, so both data and transient query structures (ORDER BY, GROUP BY) can spill to disk

Produce a written recommendation (implementation notes, trade-offs, estimated effort per format) to guide sub-issues 2–5.

Acceptance Criteria

  • Both approaches evaluated for each supported format (CSV/TSV, JSON, NDJSON, XML)
  • Recommendation written as a comment on this issue: which approach to implement first and why
  • Memory usage characteristics documented for each format × approach combination
  • Known limitations documented (e.g. which SQL operations won't work, which formats need special handling)

Notes

  • The disk-backed approach (file-backed DB + PRAGMA temp_store = FILE) likely delivers 80% of the value with 20% of the complexity, and applies uniformly to all formats
  • The virtual table approach is CSV/TSV-specific; JSON and XML have no equivalent streaming virtual table in SQLite
  • JSON and XML loaders have an additional challenge: they buffer the entire raw input before parsing — fixing that requires format-level changes independent of the SQLite storage strategy
  • NDJSON is already streaming at the parser level (line-by-line), same as CSV/TSV — its main bottleneck is the in-memory SQLite DB
  • Timebox to 4 hours max

Metadata

Metadata

Assignees

No one assigned

    Labels

    priority:mediumShould be done soonsize:sSmall — 1 to 4 hourstype:spikeResearch or investigation (timeboxed)

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions