Conversation
This is an automated comment for commit 72dc186 with a description of existing statuses. It is updated for the latest CI run.
I'd mark this as closing #23297.
I found in my reimplemented Parquet reader that filters need to be pushed down to the column decoder, especially for dictionary-encoded string columns.
@al13n321 ClickHouse's Arrow version lags two major versions behind upstream; do you have a plan to upgrade it?
I don't understand, can you elaborate? This PR just looks at the column chunk statistics in the FileMetaData struct and skips row groups accordingly. No deeper integration with the decoder seems necessary. Do you mean one of these?:
Btw, how's the reimplementation going? I'm looking forward to it!
Just mentioning. It doesn't mean that this feature is bad.
IIUC, then PREWHERE would not take effect via Parquet pushdown, and would have to be implemented manually with nested subqueries, in the current form of this PR?
Yes.
Can we add a performance test? (in
which now takes 3 ms instead of 2 ms. It's plausible that ~1 ms is really how long the new filtering code takes (on a file with 133 columns and 1 row group), and I guess that's ok.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Parquet filter pushdown. I.e. when reading Parquet files, row groups (chunks of the file) are skipped based on the WHERE condition and the min/max values in each column. In particular, if the file is roughly sorted by some column, queries that filter by a short range of that column will be much faster.
Closes #23297