Add parquet integration tests for explicitly smaller page sizes, page pruning #4131
Conversation
438eac0 to c572688
```rust
#[cfg(not(target_family = "windows"))]
#[tokio::test]
async fn single_file_small_data_pages() {
```
Here is the new test -- I verified manually that the layout was good (typically 6 data pages for each column chunk).
```rust
// page 5: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 1970-01-01T00:00:00.000000000, max: 1970-01-01T00:00:00.005330944, num_nulls not defined] CRC:[none] SZ:12601 VC:7739
TestCase::new(&test_parquet_file)
    .with_name("selective")
    // predicagte is chosen carefully to prune pages
```
```diff
- // predicagte is chosen carefully to prune pages
+ // predicate is chosen carefully to prune pages
```
```rust
/// given properties
pub fn try_new(
    path: PathBuf,
    props: WriterProperties,
```
This change passes the writer props in directly, to remove a level of indirection in how the file is created and make the code easier to read.
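A sketch of how the refactored constructor might be called (a hypothetical call site; the file path and the specific builder options are assumptions, only the `try_new(path, props)` signature comes from the diff above):

```rust
use std::path::PathBuf;
use parquet::file::properties::WriterProperties;

// Hypothetical call site: the caller builds the writer properties up
// front and hands them to the test-file constructor, instead of the
// constructor deciding how the file is written.
let props = WriterProperties::builder()
    .set_data_pagesize_limit(4 * 1024) // small pages to exercise page pruning
    .build();
let test_file = TestParquetFile::try_new(PathBuf::from("data.parquet"), props)?;
```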
```rust
#[cfg(not(target_family = "windows"))]
#[tokio::test]
async fn single_file_small_data_pages() {

    if let Some(s) = opt.page_size {
        props_builder = props_builder
            .set_data_pagesize_limit(s)
            .set_write_batch_size(s);
```
This is looking good! 👍 When I tried to create pages with fewer rows than DEFAULT_WRITE_BATCH_SIZE, forgetting to set write_batch_size puzzled me for a while 😂
Maybe we can add this info in arrow-rs 🤔
> Maybe we can add this info in arrow-rs 🤔

@Ted-Jiang I am not sure what you mean. Do you mean improving the documentation here?
@alamb I mean here: https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_data_page_row_count_limit
If a user only sets set_data_page_row_count_limit to 100 (without also setting set_write_batch_size to 100), since 100 is less than the default DEFAULT_WRITE_BATCH_SIZE (1024), I think it will still cut one page per 1024 rows 🤔.
Test shows: first I set, in single_file_small_data_pages:

```rust
let props = WriterProperties::builder()
    .set_data_page_row_count_limit(100)
    .build();
```

Then I ran `parquet-tools column-index -c service /Users/yangjiang/data_8311.parquet`:
```text
offset index for column service:
          offset  compressed size  first row index
page-0        29               48                0
page-1        77               48             1024
page-2       125               48             2048
page-3       173               48             3072
page-4       221               48             4096
page-5       269               48             5120
page-6       317               48             6144
page-7       365               48             7168
page-8       413               48             8192
page-9       461               48             9216
page-10      509               48            10240
page-11      557               48            11264
page-12      605               48            12288
page-13      653               48            13312
page-14      701               48            14336
page-15      749               48            15360
page-16      797               48            16384
page-17      845               48            17408
page-18      893               48            18432
page-19      941               48
```
So it only works when write_batch_size is less than (or equal to) data_page_row_count_limit 😂
I think that's because arrow-rs uses a vectored writer that writes one batch at a time 🤔
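The workaround discussed above can be sketched as follows (a hedged example against the parquet crate's `WriterPropertiesBuilder`; the 100-row figure comes from the experiment above, not from any recommended setting):

```rust
use parquet::file::properties::WriterProperties;

// Sketch: data_page_row_count_limit is a best-effort limit that is only
// checked between write batches, so to actually get ~100-row pages both
// knobs must be lowered together.
let props = WriterProperties::builder()
    .set_data_page_row_count_limit(100)
    // Without this, rows are still flushed in batches of
    // DEFAULT_WRITE_BATCH_SIZE (1024), so each page ends up with 1024 rows.
    .set_write_batch_size(100)
    .build();
```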
Oh, it's already here 😭, I should read the docs more carefully 😩:

> Note: this is a best effort limit based on the write batch size
👍 I filed apache/arrow-rs#3068 to try and make the comments somewhat clearer
Benchmark runs are scheduled for baseline = 47f0e5a and contender = 175adbd. 175adbd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Draft as it builds on #4062. Closes #4087.
Rationale for this change
Some of the parquet filter pushdown features do not really help unless there are multiple data pages. I wanted to cover these with tests
As it turns out I found (rediscovered) a gap #3833 while writing these tests 🎉
What changes are included in this PR?
Are there any user-facing changes?
No