Parquet tasks #4738
Labels: Core (Related to storage, data type, etc.), New feature (Feature requests)
Description
Tasks (moved from #4918)
Dev
- [must-have] Time interval intrinsics - @puzpuzpuz
- [must-have] `alter table <table_name> convert partition to native` - @eugenels
- [must-have] Implement `TableReader` columns/parquet fd refresh on parquet->native conversion (add a stress test for that) - @eugenels
- [must-have] Include metadata offset into `TableReader`'s Parquet partition metadata snapshots (currently it only has fds) and use the offset when reading from Parquet - @amunra
- [must-have] Integrate time interval intrinsics with the metadata offset patch and move per-row group fields from `PartitionFrame` to `PageFrame` - @puzpuzpuz
- [must-have] Consistent column index handling when querying/writing parquet partitions (e.g. dedup assumes matching column indexes between the table writer and parquet files)
- [must-have] Dedup support - @eugenels
- [must-have] DDLs and coltop support for parquet partitions, e.g. when a coltop column is created, insert the column into older Parquet partitions - @eugenels
- [must-have] Write symbol index files for parquet partitions (a lot of factories assume that we have the files) - @puzpuzpuz
- [must-have] UPDATE on parquet partitions
- [must-have] Rebuild index on UPDATE on parquet partitions
- [must-have] Compaction background job for Parquet partitions
- [must-have] Alter table column type support for parquet partitions
- Index repair CLI tool needs to support parquet partitions
- Extract QdbMeta to separate Java API
- CI for our modified `parquet2` lib
- When encoding symbols, only walk through the source column once. We should review the rest of the columns too for this. - @eugenels
- Refactoring and code dedup around the symbol column (TODOs left in code).
- Export KV metadata for `ARROW:schema` and (possibly) `pandas` so that symbols are read as categorical data.
- Track Rust memory alloc - @amunra
- Raise `CairoException` from Rust, not `RuntimeError`.
- Don't collect parquet stats for varchar/string/symbol/binary columns when encoding partitions (to save disk space)
- Display minTimestamp/maxTimestamp for parquet partitions in SHOW PARTITIONS
Profiling
... long list here :-)
Fuzz Testing
- Stress multiple row groups
- Stress adding and deleting columns
- Stress selecting individual columns
- Test `IN` and `BETWEEN` queries
- [must-have] Add parquet partition conversion into existing fuzz tests (to/from)
Plain old testing
- Concurrent read-write test to make sure that parquet offsets are correctly read and used
- Dropping partitions
- Detach + attach partitions
- Renaming columns
- Changing column types
- Column tops
- Test the `read_parquet` function against a parquet partition we've generated that contains symbols (it will probably break since `read_parquet` expects string column results, not i32 indices).
- Parallel algo queries (threading)
- Latest on
- Filter
- Group by
- Sample by
- Cover `TableReader` metadata change handling when converting to/from parquet/regular partitions
- Parameterize existing O3 tests to work with parquet
- Write integration tests to ensure our parquet files can be consumed by:
- Iceberg
- Pandas
- Polars
- PyArrow
- DuckDB
- Spark
Obsolete TODOs
- Write all column types into parquet format
- Encode QuestDB nulls as parquet nulls for all the data types
- Support parquet encodings: PLAIN for major data types, RleDictionary for SYMBOLs, DeltaBinaryPacked for TIMESTAMP
- Decide on the support of generic compression as a must for initial delivery (yes)
- Include column ids into written parquet file (@puzpuzpuz)
- Storage support converting existing partitions into parquet files (column tops) @mtopolnik
- SQL support to convert partitions to parquet format @eugenels
- Add SQL table function to query parquet files from anywhere in the local file system @ideoma
- Profile and optimize the SQL table function
- Integrate parquet file reading into existing `PageFrame` cursor factories using lazy Row Group decompression into memory buffers @puzpuzpuz
- `DataFrameRecordCursorFactory` with Parquet partitions support
- Parallel group by on mixed partitions
- Data frame factory with Parquet partitions support, i.e. read selected parquet columns using queries without filtering like `SELECT col1 FROM parquet_table`
- Designated timestamp search with Parquet partitions support (currently, for native partitions it's binary search-based), i.e. read parquet using queries with designated timestamp filters like `SELECT col1 FROM parquet_table WHERE timestamp in '2024-05-22'`
- Dictionary encoding for varchar and string, with a fallback to raw values when the dictionary grows too large
- Optimize encoding + compression for each column type
- Query tables with mixed parquet and non-parquet partitions
- Support queries with parquet partitions with ORDER BY clause
- Support queries with parquet partitions with GROUP BY clause
- Support queries with parquet partitions with LATEST BY clause
- Support other row factories with parquet storage
- Support push-down filters to skip row groups by using parquet statistics
- Support parquet data transfer pass through to the query clients without decompression
- ALTER TABLE to convert a Parquet partition back to native format
- Support appending new rows to parquet files
- Support O3 on partitions encoded in parquet