Description
I hope to work on this after finishing up dictionary preservation (#1180) and async (#1154); creating this issue now for feedback and visibility.
Background
IOx is a columnar, time-series database that uses parquet for its persistence format. The data model consists of a number of relatively low cardinality tag columns, a timestamp column and a number of value columns. Data is stored in order of column cardinality, with the lowest cardinality columns first.
Almost all IOx queries contain highly selective predicates on tag columns, and therefore predicate pushdown is very important to IOx's performance story.
Currently the parquet crate only supports predicate pushdown in the form of row group pruning, that is, using statistics to skip reading entire row groups. Unfortunately, this is only effective if the file is large enough to contain multiple row groups and the predicate is on a column early in the sort order.
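For reference, row group pruning boils down to comparing predicate bounds against per-row-group min/max statistics. A minimal, illustrative sketch (the `RowGroupStats` struct and `may_contain` helper are hypothetical, not the parquet crate's actual API):

```rust
// Hypothetical sketch of row group pruning: skip a row group entirely
// when its min/max statistics prove no row in it can match the predicate.
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Returns true if the row group might contain a row with value == target.
fn may_contain(stats: &RowGroupStats, target: i64) -> bool {
    stats.min <= target && target <= stats.max
}

fn main() {
    let groups = vec![
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
    ];
    // A predicate `col == 150` can skip the first row group outright.
    let to_read: Vec<usize> = (0..groups.len())
        .filter(|&i| may_contain(&groups[i], 150))
        .collect();
    assert_eq!(to_read, vec![1]);
    println!("ok");
}
```

Note the limitation the text describes: if every row group's min/max range straddles the target (common for columns late in the sort order), nothing gets skipped.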
An obvious next step might be page pruning, that is, using page metadata to skip individual pages. However, the page metadata is typically stored inside the page header, so you would likely end up reading the entire page from the backing store and decompressing it, only to then throw it away, which is not ideal.
There is an optional extension to parquet called PageIndex that tries to address this. However, I'm not aware of any systems outside of Impala that support this functionality, and it is still constrained by the page granularity and the limitations of relying on aggregate statistics.
Proposal
Instead I would like to propose adding functionality to allow providing a row selection mask when scanning a column chunk. When provided, only values (null or otherwise) with a corresponding set bit in the selection mask will be returned in the output. For simplicity nested data will not be supported, at least initially.
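A minimal sketch of what applying such a selection mask to the decoded values of a column chunk might look like (`apply_selection` and the `Option`-based null representation here are illustrative, not the parquet crate's actual API):

```rust
/// Hypothetical sketch: apply a row selection mask to the decoded values
/// of a column chunk. A `true` bit means "keep this row". Nulls are
/// represented as `None` and are kept or dropped by the mask just like
/// any other value, matching the "null or otherwise" wording above.
fn apply_selection<T: Clone>(values: &[Option<T>], mask: &[bool]) -> Vec<Option<T>> {
    assert_eq!(values.len(), mask.len(), "mask must cover every row");
    values
        .iter()
        .zip(mask.iter())
        .filter(|&(_, &keep)| keep)
        .map(|(v, _)| v.clone())
        .collect()
}

fn main() {
    let values = vec![Some(1), None, Some(3), Some(4)];
    let mask = vec![true, true, false, true];
    // Rows with a set bit are returned, including the null in row 1.
    assert_eq!(apply_selection(&values, &mask), vec![Some(1), None, Some(4)]);
    println!("ok");
}
```

In a real decoder this filtering would ideally happen before decompression and decoding work is spent on unselected rows, which is the point of pushing the mask down rather than filtering the returned arrays.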
This would allow IOx, or potentially DataFusion depending on where the logic eventually sits, to push down predicates as follows, in addition to the current row group filtering:
For each column in an order determined by some heuristic:
- Stream out the data for the column with the current selection mask (initially all true)
- Evaluate the column predicate on the returned array
- Use this to refine the selection mask for the next column
Once all the predicates that can be evaluated in this way have been evaluated, a final scan can be performed with the final selection mask across the full set of projected columns.
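The loop above could be sketched roughly as follows. The in-memory columns, `fn`-pointer predicates, and `refine_selection` helper are all hypothetical stand-ins for whatever API eventually exposes this; a real implementation would stream values rather than hold whole columns in memory:

```rust
// Hypothetical sketch of the proposed predicate pushdown loop: scan each
// column under the current mask, evaluate its predicate, and refine the
// mask for the next column.
fn refine_selection(
    columns: &[Vec<i64>],
    predicates: &[fn(&i64) -> bool],
) -> Vec<bool> {
    let num_rows = columns[0].len();
    // The initial selection mask is all true.
    let mut mask = vec![true; num_rows];
    for (column, pred) in columns.iter().zip(predicates) {
        for (row, value) in column.iter().enumerate() {
            // Only rows still selected need the predicate evaluated;
            // a failing row is cleared from the mask for later columns.
            if mask[row] && !pred(value) {
                mask[row] = false;
            }
        }
    }
    mask
}

fn main() {
    let columns = vec![vec![1, 2, 3, 4], vec![10, 20, 30, 40]];
    let predicates: Vec<fn(&i64) -> bool> = vec![|v| *v > 1, |v| *v < 40];
    // Row 0 fails the first predicate, row 3 fails the second.
    assert_eq!(refine_selection(&columns, &predicates), vec![false, true, true, false]);
    println!("ok");
}
```

The final mask would then drive a single scan of the full projected column set.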
Potential Problems
I'm currently aware of the following potential problems with this approach, but let me know if I've missed something:
- Columns may be scanned twice. My hope is that with dictionary preservation (#1180) this won't be too problematic
- For heavily compressed column chunks with many rows, the selection mask may be prohibitively large
- Would add non-trivial additional complexity to the decoder implementations
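On the mask-size concern: a naive `Vec<bool>` costs one byte per row, but packing the mask into 64-bit words (as Arrow's validity bitmaps do) reduces that eightfold. A minimal, illustrative sketch, not a proposed API:

```rust
/// Hypothetical packed selection mask: one bit per row instead of one
/// byte, stored in 64-bit words (similar to Arrow validity bitmaps).
struct PackedMask {
    words: Vec<u64>,
    len: usize,
}

impl PackedMask {
    /// Create a mask of `len` rows with every row selected.
    fn new_all_set(len: usize) -> Self {
        PackedMask { words: vec![u64::MAX; (len + 63) / 64], len }
    }

    /// Deselect a single row.
    fn clear(&mut self, row: usize) {
        self.words[row / 64] &= !(1u64 << (row % 64));
    }

    /// Is the row still selected?
    fn is_set(&self, row: usize) -> bool {
        (self.words[row / 64] >> (row % 64)) & 1 == 1
    }

    /// Bytes of storage used, versus `len` bytes for a Vec<bool>.
    fn size_bytes(&self) -> usize {
        self.words.len() * 8
    }
}

fn main() {
    // A 1-million-row column chunk: ~1 MB as Vec<bool>, ~122 KiB packed.
    let mut mask = PackedMask::new_all_set(1_000_000);
    mask.clear(42);
    assert!(!mask.is_set(42));
    assert!(mask.is_set(43));
    assert_eq!(mask.size_bytes(), 125_000);
    println!("ok");
}
```

Even packed, a mask for a very large, heavily compressed chunk is non-trivial, so this mitigates rather than eliminates the concern; a run-length representation of selected ranges could shrink it further when selections are contiguous.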
Despite these concerns, I am optimistic about the feasibility of this approach, as it is very similar to the one @e-dard has successfully applied to IOx's read buffer - an in-memory, read-only, queryable data structure.