Description
Describe the bug
I noticed this while investigating apache/datafusion#7845 (comment).
The suggestion from @jayzhan211 and @alamb was that setting `datafusion.execution.parquet.pushdown_filters` to `true` should improve performance of queries like this, but it seems to make them slower.
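For reference, here is how the setting was enabled; this is a minimal sketch using DataFusion's standard `SET` syntax (e.g. in `datafusion-cli`), not the exact invocation from the linked benchmark:

```sql
-- Enable filter pushdown into the Parquet scan for this session
SET datafusion.execution.parquet.pushdown_filters = true;
```

The same option can also be set programmatically via the session configuration when embedding DataFusion.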
I think the reason is that data is being decompressed twice (or data that shouldn't be is being decompressed). Here's a screenshot from samply running on this code:
(You can view this flamegraph properly here)
You can see that there are two blocks of decompression work; the second is associated with `parquet::column::reader::GenericColumnReader::skip_records` and happens after the first decompression chunk, once the query itself has completed.
In particular, you can see that there's a `read_new_page()` call in `parquet::column::reader::GenericColumnReader::skip_records` (line 335) that's taking a lot of time:
My question is: could this second round of decompression be avoided?
To Reproduce
Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (`cargo build --profile profiling`), then run under samply: `samply record ./target/profiling/batson-perf`.
Expected behavior
I would expect `datafusion.execution.parquet.pushdown_filters true` to be faster; I think the reason it isn't is that the data is decompressed twice.
Additional context