
Parquet tasks #4738

@puzpuzpuz

Description

Tasks (moved from #4918)

Dev

  • [must-have] Time interval intrinsics - @puzpuzpuz
  • [must-have] alter table <table_name> convert partition to native - @eugenels
  • [must-have] Implement TableReader columns/parquet fd refresh on parquet->native conversion (add a stress test for that) - @eugenels
  • [must-have] Include metadata offset into TableReader's Parquet partition metadata snapshots (currently it only has fds) and use the offset when reading from Parquet - @amunra
  • [must-have] Integrate time interval intrinsics with the metadata offset patch and move per-row group fields from PartitionFrame to PageFrame - @puzpuzpuz
  • [must-have] Consistent column index handling when querying/writing parquet partitions (e.g. dedup assumes matching column indexes between table writer and parquet files)
  • [must-have] Dedup support - @eugenels
  • [must-have] DDLs and coltop support for parquet partitions, e.g. when a coltop column is created, insert the column into older Parquet partitions - @eugenels
  • [must-have] Write symbol index files for parquet partitions (a lot of factories assume that we have the files) - @puzpuzpuz
  • [must-have] UPDATE on parquet partitions
  • [must-have] Rebuild index on UPDATE on parquet partitions
  • [must-have] Compaction background job for Parquet partitions
  • [must-have] Alter table column type support for parquet partitions
  • Index repair CLI tool needs to support parquet partitions
  • Extract QdbMeta to separate Java API
  • CI for our modified parquet2 lib
  • When encoding symbols, walk through the source column only once. We should review the remaining column types for the same single-pass optimization. - @eugenels
  • Refactoring and code dedup around symbol column (todos left in code).
  • Export KV metadata for ARROW:schema and (possibly) pandas so that symbols are read as categorical data.
  • Track Rust memory alloc - @amunra
  • Raise CairoException from Rust, not RuntimeError.
  • Don't collect parquet stats for varchar/string/symbol/binary when encoding partitions (to save disk space)
  • Display minTimestamp/maxTimestamp for parquet partitions in SHOW PARTITIONS
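
As a reference point for the time-interval-intrinsics task above, the sketch below illustrates the idea behind min/max-based row-group pruning: given per-row-group timestamp statistics, only row groups overlapping the query interval need to be decoded. All names here (`RowGroupStats`, `prune_row_groups`) are hypothetical, stdlib-only illustrations, not QuestDB or parquet2 API.

```python
# Hypothetical sketch of row-group pruning for a time-interval query.
from dataclasses import dataclass
from typing import List

@dataclass
class RowGroupStats:
    """Per-row-group min/max of the designated timestamp (e.g. epoch micros)."""
    min_ts: int
    max_ts: int

def prune_row_groups(stats: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    """Return indices of row groups that may contain rows in [lo, hi]."""
    return [i for i, s in enumerate(stats)
            if s.max_ts >= lo and s.min_ts <= hi]

# Three row groups covering consecutive hour-long ranges
groups = [RowGroupStats(0, 3_599), RowGroupStats(3_600, 7_199),
          RowGroupStats(7_200, 10_799)]
# An interval query that touches only the middle group
print(prune_row_groups(groups, 4_000, 5_000))  # [1]
```

The same overlap test is what Parquet readers apply against column-chunk statistics when pushing down timestamp filters.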

Profiling

... long list here :-)

Fuzz Testing

  • Stress multiple row groups
  • Stress adding and deleting columns
  • Stress selecting individual columns
  • Test IN and BETWEEN queries
  • [must-have] Add parquet partition conversion into existing fuzz tests (to/from)
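
A minimal illustration of the round-trip style these fuzz tests rely on: generate random rows, push them through a conversion step, and assert nothing is lost. The `encode`/`decode` pair below is a toy stand-in for the native-to-parquet-and-back conversion, not the real codec.

```python
# Toy round-trip fuzz loop: random (timestamp, value) rows survive
# an encode/decode cycle unchanged.
import random
import struct

def encode(rows):
    """Serialize rows as little-endian (int64, float64) pairs."""
    return b"".join(struct.pack("<qd", ts, val) for ts, val in rows)

def decode(buf):
    """Deserialize back into a list of (int, float) tuples."""
    return [struct.unpack_from("<qd", buf, off)
            for off in range(0, len(buf), 16)]

rng = random.Random(42)  # fixed seed for reproducible fuzz runs
for _ in range(100):
    rows = [(rng.randrange(10**12), rng.random())
            for _ in range(rng.randrange(1, 50))]
    assert decode(encode(rows)) == rows
```

Real fuzz runs would additionally randomize row-group sizes, column sets, and conversion direction, as the tasks above describe.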

Plain old testing

  • Concurrent read-write test to make sure that parquet offsets are correctly read and used
  • Dropping partitions
  • Detach + attach partitions
  • Renaming columns
  • Changing column types
  • Column tops
  • Test read_parquet function against a parquet partition we've generated that contains symbols (it will probably break since read_parquet expects string column results, not i32 indices).
  • Parallel algo queries (threading)
    • Latest on
    • Filter
    • Group by
    • Sample by
  • Cover TableReader metadata change handling when converting to/from parquet/regular partitions
  • Parameterize existing O3 tests to work with parquet
  • Write integration tests to ensure our parquet files can be consumed by:
    • Iceberg
    • Pandas
    • Polars
    • PyArrow
    • DuckDB
    • Spark
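
The `read_parquet`-against-symbols test case above hinges on dictionary encoding: the generated files store a symbol column as i32 indices into a per-file dictionary, so a reader expecting strings must resolve the indices. A toy decoder, where the function name and the use of a negative index for the null symbol are assumptions of this sketch:

```python
# Toy illustration of resolving dictionary-encoded symbol indices to strings.
from typing import List, Optional

def decode_symbols(indices: List[int],
                   dictionary: List[str]) -> List[Optional[str]]:
    """Map i32 dictionary indices to strings; negative index means null."""
    return [None if i < 0 else dictionary[i] for i in indices]

print(decode_symbols([0, 2, -1, 1], ["BTC", "ETH", "SOL"]))
# ['BTC', 'SOL', None, 'ETH']
```

Exporting `ARROW:schema` KV metadata (see the Dev list) would let readers like pandas perform this mapping automatically by treating the column as categorical data.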

Obsolete TODOs

  • Write all column types into parquet format
  • Encode QuestDB nulls as parquet nulls for all the data types
  • Support parquet encodings: PLAIN for major data types, RleDictionary for SYMBOLs, DeltaBinaryPacked for TIMESTAMP
  • Decide on the support of generic compression as a must for initial delivery (yes)
  • Include column ids into written parquet file (@puzpuzpuz)
  • Storage support converting existing partitions into parquet files (column tops) @mtopolnik
  • SQL support to convert partitions to parquet format @eugenels
  • Add SQL table function to query parquet files from anywhere in the local file system @ideoma
  • Profile and optimize the SQL table function
  • Integrate parquet file reading into existing PageFrame cursor factories using lazy Row Group decompression into memory buffers @puzpuzpuz
  • DataFrameRecordCursorFactory with Parquet partitions support
  • Parallel group by on mixed partitions
  • Data frame factory with Parquet partitions support, i.e. read selected parquet columns using queries without filtering like SELECT col1 FROM parquet_table
  • Designated timestamp search with Parquet partitions support (currently, for native partitions it's binary search-based), i.e. read parquet using queries with designated timestamp filters like SELECT col1 FROM parquet_table WHERE timestamp in '2024-05-22'
  • Dictionary encoding for varchar and string, with a fallback to raw values when the dictionary grows too large
  • Optimize encoding + compression for each column type
  • Query tables with mixed parquet and non-parquet partitions
  • Support queries with parquet partitions with ORDER BY clause
  • Support queries with parquet partitions with GROUP BY clause
  • Support queries with parquet partitions with LATEST BY clause
  • Support other row factories with parquet storage
  • Support push-down filters to skip row groups by using parquet statistics
  • Support parquet data transfer pass through to the query clients without decompression
  • ALTER TABLE to convert a Parquet partition back to native format
  • Support appending new rows to parquet files
  • Support O3 on partitions encoded in parquet
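
The designated-timestamp search item above carries over the binary-search idea from native partitions: because rows are sorted by designated timestamp, per-row-group min timestamps are non-decreasing, so the first candidate row group can be found with a binary search rather than a scan. A stdlib sketch under that assumption (names are illustrative):

```python
# Binary search over sorted per-row-group min timestamps to locate
# the first row group that may contain a given timestamp.
import bisect

def first_row_group(min_timestamps, query_ts):
    """Index of the last row group whose min timestamp is <= query_ts,
    i.e. the first group that may contain rows at query_ts."""
    pos = bisect.bisect_right(min_timestamps, query_ts) - 1
    return max(pos, 0)

mins = [0, 1_000, 2_000, 3_000]  # per-row-group min timestamps
print(first_row_group(mins, 2_500))  # 2
```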

Metadata

Labels

  • Core: related to storage, data types, etc.
  • New feature: feature requests
