Parallel read_parquet() SQL function execution #5250

@puzpuzpuz

Description

Is your feature request related to a problem?

We recently added the read_parquet() SQL function, which reads external Apache Parquet files:
https://questdb.io/docs/reference/function/parquet/#read_parquet

The limitation is that the function is backed by the single-threaded ReadParquetRecordCursorFactory, whereas we want read_parquet() queries to run in parallel.

To make this happen, we can rewrite ReadParquetRecordCursorFactory to implement page frame cursors (see PageFrameRecordCursorFactory). The rest of the query engine will then treat it as a "normal" table with Parquet partitions (sans time order and, hence, time intrinsics), and all parallel factories, such as the filter and group by ones, will kick in automatically. As a result, we'll get parallel read_parquet() execution with everything the query engine already offers.
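
For illustration only, here is a minimal sketch of why exposing frames instead of a row-by-row cursor enables parallelism. It is not QuestDB code: PageFrameSketch, Frame, frames, sumWhereSequential and sumWhereParallel are hypothetical names, a frame is assumed to map to something like a Parquet row group, and Java parallel streams stand in for the engine's worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.LongStream;

// Hypothetical illustration; none of these types exist in QuestDB.
final class PageFrameSketch {

    // A frame is a contiguous, independently readable row range [rowLo, rowHi).
    static final class Frame {
        final int rowLo;
        final int rowHi;

        Frame(int rowLo, int rowHi) {
            this.rowLo = rowLo;
            this.rowHi = rowHi;
        }
    }

    // Split rowCount rows into frames of at most frameSize rows each.
    static List<Frame> frames(int rowCount, int frameSize) {
        if (frameSize <= 0) {
            throw new IllegalArgumentException("frameSize must be positive");
        }
        List<Frame> out = new ArrayList<>();
        for (int lo = 0; lo < rowCount; lo += frameSize) {
            out.add(new Frame(lo, Math.min(lo + frameSize, rowCount)));
        }
        return out;
    }

    // Row-by-row baseline: a single thread walks the whole column.
    static long sumWhereSequential(long[] column, long threshold) {
        long sum = 0;
        for (long v : column) {
            if (v > threshold) {
                sum += v;
            }
        }
        return sum;
    }

    // Frame-based variant: each frame is filtered and partially aggregated
    // independently, then the partial results are merged.
    static long sumWhereParallel(long[] column, long threshold, int frameSize) {
        return frames(column.length, frameSize)
                .parallelStream()
                .mapToLong(f -> {
                    long partial = 0;
                    for (int i = f.rowLo; i < f.rowHi; i++) {
                        if (column[i] > threshold) {
                            partial += column[i];
                        }
                    }
                    return partial;
                })
                .sum();
    }

    public static void main(String[] args) {
        long[] column = LongStream.range(0, 1_000).toArray();
        // Both calls compute the same filtered sum; only the execution strategy differs.
        System.out.println(sumWhereSequential(column, 499));
        System.out.println(sumWhereParallel(column, 499, 128));
    }
}
```

The point of the sketch is that the filter and aggregation logic never changes; once the source hands out independent frames, the same operators can be scheduled per frame and their partial results merged.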

Describe the solution you'd like.

No response

Describe alternatives you've considered.

No response

Full Name:

Andrei Pechkurov

Affiliation:

QuestDB

Additional context

No response

Metadata

Assignees

No one assigned

Labels

Enhancement (Enhance existing functionality), Performance (Performance improvements), SQL (Issues or changes relating to SQL execution)
