Skip to content

Parquet Scan Filter #1191

@tustvold

Description

@tustvold

I hope to work on this after finishing up dictionary preservation (#1180) and async (#1154), creating now for feedback and visibility

Background

IOx is a columnar, time-series database that uses parquet for its persistence format. The data model consists of a number of relatively low cardinality tag columns, a timestamp column and a number of value columns. Data is stored in order of column cardinality, with the lowest cardinality columns first.

Almost all IOx queries contain highly selective predicates on tag columns, and therefore predicate pushdown is very important to IOx's performance story.

Currently the parquet crate only supports predicate pushdown in the form of row group pruning, that is using statistics to skip reading entire row groups. Unfortunately this is only very effective if a file is both large enough to contain multiple row groups, and the predicate is on a column early in the sort order.

An obvious next step might be to perform page-pruning, that is using page metadata to skip pages. However, the page metadata is typically stored inside the page header, and so you likely end up reading the entire page from the backing store and decompressing it only to then throw it away, which is not ideal.

There is an optional extension to parquet called PageIndex that tries to address this. However, I'm not aware of any systems outside of Impala that support this functionality, and it is still constrained by the page granularity and the limitations of relying on aggregate statistics.

Proposal

Instead I would like to propose adding functionality to allow providing a row selection mask when scanning a column chunk. When provided, only values (null or otherwise) with a corresponding set bit in the selection mask will be returned in the output. For simplicity nested data will not be supported, at least initially.

This would allow IOx, or potentially DataFusion depending on where the logic for this eventually sits, to do the following for pushing down predicates, in addition to the current row group filtering:

For each column in an order determined by some heuristic:

  • Stream out the data for the column with the current selection mask (initially all true)
  • Evaluate the column predicate on the returned array
  • Use this to refine the selection mask for the next column

Once all the predicates that can be evaluated in this way have been evaluated, a final scan can be performed with the final selection mask across the full set of projected columns.

Potential Problems

I'm currently aware of the following potential problems with this approach, but let me know if I've missed something:

Despite this I am optimistic about the feasibility of this approach, as it is very similar to the one @e-dard has successfully applied to IOx's read buffer - an in-memory, read-only, queryable data structure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions