Multi-format input/output support (JSON, TSV, Parquet, etc.) #68

Description

@vmvarela

Extend sql-pipe beyond CSV to support multiple input and output formats. Users would specify formats with --input-format / --output-format flags, with automatic detection when possible.

This transforms sql-pipe from a CSV-specific tool into a universal data transformation engine — competing with tools like jq, xsv, and mlr for diverse data sources.

Supported formats (proposed)

Format   Input  Output  Notes
CSV      yes    yes     Current default
TSV      yes    yes     Tab-separated, common in bioinformatics
JSON     yes    yes     Array-of-objects and newline-delimited (NDJSON)
Parquet  yes    yes     Columnar format, popular in data engineering
XML      yes    yes     Row-based XML with configurable element names

Example usage

# JSON input, CSV output (default)
cat data.json | sql-pipe --input-format json "SELECT name, age FROM stdin WHERE age > 30"

# CSV input, JSON output
cat users.csv | sql-pipe --output-format json "SELECT * FROM stdin LIMIT 5"

# Parquet to TSV
cat warehouse.parquet | sql-pipe --input-format parquet --output-format tsv "SELECT product, SUM(qty) FROM stdin GROUP BY product"

# Auto-detect input format from file extension
sql-pipe --input data.json "SELECT * FROM stdin"

Acceptance Criteria

  • --input-format / -I flag selects input parser (csv, tsv, json, ndjson, parquet, xml)
  • --output-format / -O flag selects output formatter (csv, tsv, json, ndjson, parquet, xml)
  • Auto-detection of input format when reading from files (by extension)
  • Default remains CSV→CSV for backward compatibility
  • JSON output produces an array of objects with column names as keys
  • NDJSON output produces one JSON object per line
  • Schema/type detection works correctly for all input formats
  • Error messages are clear when a format is unsupported or data is malformed
  • Documentation updated with format examples
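Two of the criteria above, extension-based detection and the difference between JSON and NDJSON output, can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not code from sql-pipe (which is a Zig project). The names `EXTENSION_FORMATS`, `detect_format`, `to_json`, and `to_ndjson` are hypothetical, and the extension table is an assumption, since the issue only says detection happens "by extension".

```python
import json
import os

# Hypothetical extension-to-format table (an assumption, not from the issue).
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".ndjson": "ndjson",
    ".parquet": "parquet",
    ".xml": "xml",
}


def detect_format(path, default="csv"):
    """Guess the input format from the file extension, falling back to CSV."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_FORMATS.get(ext, default)


def to_json(columns, rows):
    """Render a result set as a single JSON array of objects keyed by column name."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


def to_ndjson(columns, rows):
    """Render a result set as one JSON object per line."""
    return "\n".join(json.dumps(dict(zip(columns, row))) for row in rows)
```

Falling back to CSV for unknown extensions keeps the default CSV→CSV behaviour intact; whether an unknown extension should instead be an error is a design decision the issue leaves open.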

Recommended split plan

This is a size:xl epic. When scheduled for a sprint, split into these sub-issues:

  1. Format plugin architecture (size:s): refactor the current CSV reader/writer into a pluggable format interface so new formats can be added incrementally. Add --input-format / --output-format flags that default to csv.
  2. JSON/NDJSON input and output (size:m): add JSON array-of-objects and newline-delimited JSON support. Builds on #44 (JSON output format), which adds --json as a standalone flag. JSON input requires flattening nested structures. Proposed strategy: top-level keys only, with nested objects stored as JSON strings.
  3. TSV input support (size:xs): TSV output already exists via --tsv; add TSV as an input format option. This is a small change because TSV differs from CSV only in the delimiter.
  4. Parquet input and output (size:l): requires a Zig-compatible Parquet library or C bindings (e.g., Apache Arrow). Investigate feasibility before committing.
  5. XML input and output (size:m): row-based XML with configurable root/row element names.
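The pluggable format interface from sub-issue 1 could look roughly like the registry below. This is a Python sketch of the idea only, not the project's Zig design. `READERS`, `reader`, and `parse` are hypothetical names, and the error message wording is an assumption. It also shows why TSV input is cheap once the interface exists: it reuses the CSV parser with a different delimiter.

```python
import csv
import io

# Hypothetical registry: format name -> function(text) -> (columns, rows).
READERS = {}


def reader(name):
    """Decorator that registers a parser function under a format name."""
    def register(func):
        READERS[name] = func
        return func
    return register


def _read_delimited(text, delimiter):
    """Parse delimited text; the first record is treated as the header."""
    records = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not records:
        return [], []
    return records[0], records[1:]


@reader("csv")
def read_csv(text):
    return _read_delimited(text, ",")


@reader("tsv")
def read_tsv(text):
    # Same parser as CSV, only the delimiter changes.
    return _read_delimited(text, "\t")


def parse(text, input_format="csv"):
    """Dispatch to the registered parser, with a clear error for unknown formats."""
    if input_format not in READERS:
        supported = ", ".join(sorted(READERS))
        raise ValueError(
            f"unsupported input format '{input_format}' (supported: {supported})"
        )
    return READERS[input_format](text)
```

With this shape, adding JSON, Parquet, or XML means registering one more function, and the SQL engine only ever sees `(columns, rows)`.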

Dependencies

  • #44 (JSON output format, --json) should be completed first, since this issue generalizes --json into --output-format json
  • Sub-issue ordering: 1 → 2 → 3/4/5 (architecture first, then formats in parallel)

Notes

  • TSV is trivial (just a different delimiter) and could ship first
  • JSON/NDJSON requires flattening nested structures into tabular form — define a strategy (e.g., top-level keys only, or dot-notation for nested)
  • Parquet requires a Zig Parquet reader/writer library or C bindings
  • Keep the core SQL engine format-agnostic — only the input parser and output formatter change
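The two flattening strategies mentioned in the notes can be compared side by side. The following is a hedged Python sketch, not a proposal for the actual Zig implementation. `flatten_top_level` and `flatten_dotted` are hypothetical names, and serializing arrays as JSON strings in both variants is an assumption, since the issue only discusses nested objects.

```python
import json


def flatten_top_level(record):
    """Keep top-level keys; nested objects (and arrays) become JSON strings."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }


def flatten_dotted(record, prefix=""):
    """Expand nested objects into dotted column names; arrays stay JSON strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse so {"a": {"b": 1}} yields the column "a.b".
            flat.update(flatten_dotted(value, prefix=name + "."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat
```

Top-level-only keeps the column set predictable from the first record, while dot-notation makes nested fields directly queryable at the cost of a column set that can vary from row to row.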

Metadata

Labels

  • priority:medium (Should be done soon)
  • size:xl (Extra large, more than 2 days; split it)
  • status:ready (Refined and ready for sprint selection)
  • type:feature (New functionality)
