parquet::column::reader::GenericColumnReader::skip_records still decompresses most data #6454

@samuelcolvin

Description
Describe the bug

I noticed this while investigating apache/datafusion#7845 (comment).

The suggestion from @jayzhan211 and @alamb was that setting `datafusion.execution.parquet.pushdown_filters` to `true` should improve performance of queries like this, but it seems to make them slower.

I think the reason is that data is being decompressed twice (or data is being decompressed that shouldn't be). Here's a screenshot from samply running on this code:

(screenshot: samply flamegraph)

(You can view this flamegraph properly here)

You can see that there are two blocks of decompression work. The second one is associated with `parquet::column::reader::GenericColumnReader::skip_records` and happens after the first decompression chunk, once running the query has completed.

In particular, you can see that there's a `read_new_page()` call in `parquet::column::reader::GenericColumnReader::skip_records` (line 335) that's taking a lot of time:

(screenshot: samply flamegraph, zoomed to `skip_records`)

My question is: could this second round of decompression be avoided?
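For context on why `skip_records` can still decompress data, here is a minimal, self-contained sketch (hypothetical types, not the actual `parquet` crate API) of how page-level skipping interacts with decompression: a whole page can be skipped using only its header's value count, but a skip that ends mid-page has to decompress that page so the decoders can advance past the skipped values. The question above is whether the skip path is decompressing more than this minimum requires.

```rust
// Hypothetical model of a column reader's skip path (illustration only,
// not the parquet crate's real types or logic).

struct Page {
    num_values: usize,    // value count available from the page header
    _compressed: Vec<u8>, // payload; only needed if the page is decoded
}

struct ColumnReader {
    pages: Vec<Page>,
    page_idx: usize,
    offset_in_page: usize,  // values already consumed in the current page
    decompress_count: usize, // instrumentation for this sketch
}

impl ColumnReader {
    fn new(pages: Vec<Page>) -> Self {
        Self { pages, page_idx: 0, offset_in_page: 0, decompress_count: 0 }
    }

    /// Skip `n` records, decompressing only when a skip ends mid-page.
    fn skip_records(&mut self, mut n: usize) {
        while n > 0 && self.page_idx < self.pages.len() {
            let remaining = self.pages[self.page_idx].num_values - self.offset_in_page;
            if n >= remaining {
                // Rest of the page is skipped: header metadata alone suffices,
                // no decompression required.
                n -= remaining;
                self.page_idx += 1;
                self.offset_in_page = 0;
            } else {
                // Skip lands mid-page: the page must be decompressed so the
                // value decoders can be advanced past the skipped records.
                self.decompress_count += 1;
                self.offset_in_page += n;
                n = 0;
            }
        }
    }
}

fn main() {
    // Four pages of 100 values each; skipping 250 records crosses two full
    // pages (no decompression) and lands 50 values into the third.
    let pages: Vec<Page> = (0..4)
        .map(|_| Page { num_values: 100, _compressed: vec![0u8; 10] })
        .collect();
    let mut reader = ColumnReader::new(pages);
    reader.skip_records(250);
    println!("decompressions: {}", reader.decompress_count); // prints 1
}
```

Under this model, the flamegraph above would only be expected to show decompression in `skip_records` for the pages where a skip ends mid-page, not a second full pass over the data.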

To Reproduce

Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (`cargo build --profile profiling`), then run under samply: `samply record ./target/profiling/batson-perf`.

Expected behavior

I would expect `datafusion.execution.parquet.pushdown_filters = true` to be faster; I think the reason it is not is that the data is decompressed twice.

Additional context

apache/datafusion#7845 (comment)

Labels: enhancement (any new improvement worthy of an entry in the changelog)