parquet::column::reader::GenericColumnReader::skip_records still decompresses most data #6454

@samuelcolvin

Description
Describe the bug

I noticed this while investigating apache/datafusion#7845 (comment).

The suggestion from @jayzhan211 and @alamb was that setting `datafusion.execution.parquet.pushdown_filters` to `true` should improve performance of queries like this, but it seems to make them slower.

I think the reason is that data is being decompressed twice (or data is being decompressed that shouldn't be). Here's a screenshot from samply running on this code:

(screenshot: samply flamegraph)

(You can view this flamegraph properly here)

You can see that there are two blocks of decompression work. The second one is associated with `parquet::column::reader::GenericColumnReader::skip_records` and happens after the first decompression chunk, once running the query has completed.

In particular, you can see that there's a `read_new_page()` call in `parquet::column::reader::GenericColumnReader::skip_records` (line 335) that's taking a lot of time:

(screenshot: samply flamegraph, zoomed to `skip_records`)

My question is: could this second round of decompression be avoided?
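For context on why `skip_records` can still decompress data, here is a minimal, self-contained sketch (hypothetical types, not the actual `parquet` crate API) of how page-level skipping interacts with decompression: a whole page can be skipped using only its header's value count, but a skip that ends mid-page has to decompress that page so the decoders can advance past the skipped values. The question above is whether the skip path is decompressing more than this minimum requires.

```rust
// Hypothetical model of a column reader's skip path (illustration only,
// not the parquet crate's real types or logic).

struct Page {
    num_values: usize,    // value count available from the page header
    _compressed: Vec<u8>, // payload; only needed if the page is decoded
}

struct ColumnReader {
    pages: Vec<Page>,
    page_idx: usize,
    offset_in_page: usize,  // values already consumed in the current page
    decompress_count: usize, // instrumentation for this sketch
}

impl ColumnReader {
    fn new(pages: Vec<Page>) -> Self {
        Self { pages, page_idx: 0, offset_in_page: 0, decompress_count: 0 }
    }

    /// Skip `n` records, decompressing only when a skip ends mid-page.
    fn skip_records(&mut self, mut n: usize) {
        while n > 0 && self.page_idx < self.pages.len() {
            let remaining = self.pages[self.page_idx].num_values - self.offset_in_page;
            if n >= remaining {
                // Rest of the page is skipped: header metadata alone suffices,
                // no decompression required.
                n -= remaining;
                self.page_idx += 1;
                self.offset_in_page = 0;
            } else {
                // Skip lands mid-page: the page must be decompressed so the
                // value decoders can be advanced past the skipped records.
                self.decompress_count += 1;
                self.offset_in_page += n;
                n = 0;
            }
        }
    }
}

fn main() {
    // Four pages of 100 values each; skipping 250 records crosses two full
    // pages (no decompression) and lands 50 values into the third.
    let pages: Vec<Page> = (0..4)
        .map(|_| Page { num_values: 100, _compressed: vec![0u8; 10] })
        .collect();
    let mut reader = ColumnReader::new(pages);
    reader.skip_records(250);
    println!("decompressions: {}", reader.decompress_count); // prints 1
}
```

Under this model, the flamegraph above would only be expected to show decompression in `skip_records` for the pages where a skip ends mid-page, not a second full pass over the data.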

To Reproduce

Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (`cargo build --profile profiling`), then run under samply: `samply record ./target/profiling/batson-perf`.

Expected behavior

I would expect `datafusion.execution.parquet.pushdown_filters = true` to be faster; I think the reason it is not is that the data is decompressed twice.

Additional context

apache/datafusion#7845 (comment)

Labels: enhancement (any new improvement worthy of an entry in the changelog)