Parquet Support Tasks & TODO #6694
Open
Labels
New feature (Feature requests), Performance (Performance improvements)
Description
Is your feature request related to a problem?
Tracking issue for Parquet-related features and optimizations.
Decode Performance
- Optimize Parquet-to-native decoding for fixed-size columns (int, long, double, symbol, etc.) (perf(sql): optimize parquet decode rowgroup performance #6632)
- Cache decoded Parquet data in ASOF JOIN to avoid redundant decoding
- Late materialization for filtered aggregate queries (feat(core): optimize parquet partition read with late materialization, zero-copy page reading, and use raw array encoding #6675)
- Profile and optimize Array column decoding performance (see also Add projections for array access in parquet partitions/files #6065)
- Use an alternative varchar/string aux format internally to avoid unnecessary copies
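To make the late materialization item concrete, here is a minimal sketch of the general idea, not the approach taken in #6675. The function name `late_materialize` and the in-memory column layout are hypothetical and chosen only for illustration: the filter column is evaluated first, and the remaining columns are materialized only at the rows that passed the filter.

```python
from typing import Callable, Dict, List, Sequence


def late_materialize(
    columns: Dict[str, Sequence],
    filter_col: str,
    predicate: Callable[[object], bool],
    projected: List[str],
) -> Dict[str, list]:
    """Hypothetical helper: filter first, materialize other columns later."""
    # Step 1: scan only the filter column and remember the matching row indexes.
    selected = [i for i, value in enumerate(columns[filter_col]) if predicate(value)]
    # Step 2: touch the projected columns only at the selected rows.
    return {name: [columns[name][i] for i in selected] for name in projected}


# Illustrative data only: SUM(qty) WHERE price > 20
columns = {"price": [10, 25, 7, 40], "qty": [1, 2, 3, 4]}
result = late_materialize(columns, "price", lambda p: p > 20, ["qty"])
total_qty = sum(result["qty"])
```

In a real decoder the saving would come from skipping the decode of pages that contain no selected rows; this sketch only shows the two-phase access pattern.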
Write Support
- Decimal types (decimal8/16/32/64/128/256) write support
- Fix Parquet writer to produce spec-compliant files (exporting symbol columns to parquet can be incompatible with strict parquet readers #6692)
- See also Parquet tasks #4738 for the comprehensive Parquet implementation roadmap including:
- DDL operations (detach/attach, add/change/drop column)
- UPDATE/deduplication support (Update on parquet partition end up with suspended table #6335)
- Index
- Parquet partition size issue, how to "recompress" #6427
- Cache metadata from S3 Parquet files on local disk
read_parquet() Function
- Support timestamp-based row group filtering for QuestDB-written files (TimestampFinder) (Time intrinsics support for read_parquet() SQL function #6081)
- Decode dictionary-encoded columns as Symbol(Dynamic) instead of Varchar
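As a rough illustration of timestamp-based row group filtering, the sketch below assumes the file is sorted by timestamp, so each row group's minimum and maximum timestamps are non-decreasing across row groups. The function `row_groups_for_interval` is hypothetical and is not the `TimestampFinder` implementation; it only shows how a binary search over per-row-group bounds can narrow a time interval to a contiguous range of row groups.

```python
from bisect import bisect_left, bisect_right
from typing import List


def row_groups_for_interval(
    group_min_ts: List[int], group_max_ts: List[int], lo: int, hi: int
) -> range:
    """Return indexes of row groups that may contain timestamps in [lo, hi].

    Assumes both lists are non-decreasing, i.e. the file is sorted by timestamp.
    """
    # First row group whose max timestamp reaches lo.
    first = bisect_left(group_max_ts, lo)
    # One past the last row group whose min timestamp does not exceed hi.
    last = bisect_right(group_min_ts, hi)
    return range(first, max(first, last))


# Illustrative bounds for four row groups.
mins = [0, 100, 200, 300]
maxes = [99, 199, 299, 399]
hit = list(row_groups_for_interval(mins, maxes, 150, 250))
```

Because the candidate row groups form a contiguous range, the lookup costs two binary searches instead of a scan over every row group's metadata.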
Statistics-based Optimization
- Use Parquet min/max statistics to skip row groups / pages during filtered reads
- Use Parquet statistics for aggregate pushdown (COUNT, MIN, MAX)
- Use Parquet bloom filter to skip row groups / pages during filtered reads
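The min/max and aggregate pushdown items can be sketched as follows. This is a simplified model under stated assumptions: `RowGroupStats` is a hypothetical stand-in for the per-column statistics stored in Parquet row group metadata, values are integers, there are no nulls, and the aggregate has no filter.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's statistics in one row group."""
    num_rows: int
    min_value: int
    max_value: int


def row_groups_to_read(stats: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    # A row group can be skipped when its [min, max] range cannot overlap [lo, hi].
    return [i for i, s in enumerate(stats) if s.max_value >= lo and s.min_value <= hi]


def aggregate_from_stats(
    stats: List[RowGroupStats],
) -> Tuple[int, Optional[int], Optional[int]]:
    # COUNT(*), MIN and MAX of an unfiltered scan, answered from metadata alone.
    if not stats:
        return 0, None, None
    return (
        sum(s.num_rows for s in stats),
        min(s.min_value for s in stats),
        max(s.max_value for s in stats),
    )


# Illustrative statistics for three row groups.
stats = [RowGroupStats(1000, 1, 50), RowGroupStats(1000, 40, 90), RowGroupStats(500, 95, 120)]
to_read = row_groups_to_read(stats, 60, 100)
count, lowest, highest = aggregate_from_stats(stats)
```

Skipping is conservative: a row group that overlaps the range may still contain no matching rows, so the filter must still be applied to whatever is read. The pushdown is only valid when statistics are present and exact for every row group.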
Testing & Benchmarks
- Add more Parquet sqllogictest cases (ported from DuckDB)
- Run ClickBench on Parquet and compare with DuckDB
- Integration testing (Pandas, Polars, DuckDB, Spark) (fix(core): fix dictionary and bitpack encoding for symbol columns in parquet #6708)
Ease of Use
Full Name:
Victor
Affiliation:
QuestDB
Additional context
No response