Spike: evaluate streaming approaches for large datasets (all input formats)

Part of #69

## Description

Investigate approaches for handling large input files that don't fit in memory, across **all supported input formats**: CSV, TSV, JSON, NDJSON, and XML.

The tool currently uses an in-memory SQLite database for all formats. Each format has different memory characteristics:

| Format | Current behavior | Memory concern |
|--------|-----------------|----------------|
| CSV/TSV | Streaming parser → row-by-row INSERT into `:memory:` | Data accumulates in SQLite RAM |
| NDJSON | Streaming line-by-line → row-by-row INSERT into `:memory:` | Data accumulates in SQLite RAM |
| JSON | Entire file read into RAM, then parsed and inserted | Double buffering (raw bytes + parse tree) |
| XML | Entire file read into RAM, then parsed and inserted | Double buffering (raw bytes + parse tree) |

Two approaches are evaluated:

1. **SQLite virtual table** — streaming input via a virtual table interface (read-only, CSV-specific; not directly applicable to JSON/XML)
2. **Disk-backed temp storage** — switch from `:memory:` to a file-backed SQLite database + `PRAGMA temp_store = FILE`, so both data and transient query structures (ORDER BY, GROUP BY) can spill to disk

Produce a written recommendation (implementation notes, trade-offs, estimated effort per format) to guide sub-issues 2–5.

## Acceptance Criteria

- [ ] Both approaches evaluated for each supported format (CSV/TSV, JSON, NDJSON, XML)
- [ ] Recommendation written as a comment on this issue: which approach to implement first and why
- [ ] Memory usage characteristics documented for each format × approach combination
- [ ] Known limitations documented (e.g. which SQL operations won't work, which formats need special handling)

## Notes

- The disk-backed approach (file-backed DB + `PRAGMA temp_store = FILE`) likely delivers 80% of the value with 20% of the complexity, and applies uniformly to all formats
- The virtual table approach is CSV/TSV-specific; JSON and XML have no equivalent streaming virtual table in SQLite
- JSON and XML loaders have an additional challenge: they buffer the entire raw input before parsing — fixing that requires format-level changes independent of the SQLite storage strategy
- NDJSON is already streaming at the parser level (line-by-line), same as CSV/TSV — its main bottleneck is the in-memory SQLite DB
- Timebox to 4 hours max

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spike: evaluate streaming approaches for large datasets (all input formats) #90

Description

Acceptance Criteria

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Format	Current behavior	Memory concern
CSV/TSV	Streaming parser → row-by-row INSERT into `:memory:`	Data accumulates in SQLite RAM
NDJSON	Streaming line-by-line → row-by-row INSERT into `:memory:`	Data accumulates in SQLite RAM
JSON	Entire file read into RAM, then parsed and inserted	Double buffering (raw bytes + parse tree)
XML	Entire file read into RAM, then parsed and inserted	Double buffering (raw bytes + parse tree)

Spike: evaluate streaming approaches for large datasets (all input formats) #90

Description

Description

Acceptance Criteria

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions