ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions #8007

alamb wants to merge 2 commits into apache:master
Conversation
```rust
// NB can be 0 if at end of page
let records_read_once = self.record_reader.read_records(records_to_read)?;
if records_read_once == 0 {
```
The case of 0 rows being read is handled in the `if records_read_once < records_to_read` clause below -- namely, in this case the code needs to try to get the next page of data from the page reader.
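A minimal sketch of the control flow this comment describes -- the function name, the closure-based page reader, and the mock page layout below are all invented for illustration, not the actual parquet-rs API: a zero-length read falls through to the short-read branch, which asks the page reader for the next page before concluding the column is exhausted.

```rust
// Illustrative sketch only -- `read_records_loop`, the closures, and the
// mock page layout are made up for this example, not the parquet-rs API.
fn read_records_loop(
    mut read_records: impl FnMut(usize) -> usize,
    mut next_page: impl FnMut() -> bool,
    records_to_read: usize,
) -> usize {
    let mut records_read = 0;
    while records_read < records_to_read {
        // NB can be 0 if at end of page
        let records_read_once = read_records(records_to_read - records_read);
        records_read += records_read_once;
        if records_read < records_to_read {
            // Short (possibly zero-length) read: ask the page reader for
            // the next page before concluding the column is exhausted.
            if !next_page() {
                break;
            }
        }
    }
    records_read
}

// Drive the loop over a mock two-page column (3 records per page).
fn demo_read(records_to_read: usize) -> usize {
    use std::cell::RefCell;
    let pages = RefCell::new(vec![3usize, 3]); // records left per page
    let read = |n: usize| {
        let mut p = pages.borrow_mut();
        match p.first_mut() {
            Some(rem) => {
                let r = n.min(*rem);
                *rem -= r;
                r
            }
            None => 0,
        }
    };
    let next = || {
        let mut p = pages.borrow_mut();
        if p.first() == Some(&0) {
            p.remove(0); // current page exhausted, move on
        }
        !p.is_empty()
    };
    read_records_loop(read, next, records_to_read)
}

fn main() {
    // A 4-record request crosses the page boundary instead of stopping at 3.
    assert_eq!(demo_read(4), 4);
    // Requesting more than the column holds reads everything, then stops.
    assert_eq!(demo_read(10), 6);
    println!("ok");
}
```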
```rust
>(2, 100, 2, message_type, 15, 50, converter);
}

#[test]
```
Both these tests fail without the changes in this PR.
I don't like the copy/paste nature of these tests, and I plan a minor PR building on this one proposing how to remove the duplication and make the tests easier to read.
As the code I removed in this PR was added to support empty pages in 8a61570 (but didn't seem to have any tests), I was worried that this deletion would cause some sort of regression in behavior. To ensure it didn't, I added a test for empty pages in commit cfdea83. To check that the test actually covers the empty-pages case, I ran the new test on commit ed1f771 (the one right before 8a61570) and the test fails, leading me to conclude that the code removed in this PR is unnecessary and zero-page parquet files are still supported.

Log of what I did: get the commit right before support for empty pages; apply the new test for no pages; then I had to edit the test a little; and then I ran the test, and it fails.

FYI @houqp and @paddyhoran
@sunchao Could you review as well?
```rust
}

#[test]
fn test_primitive_array_reader_empty_pages() {
```
Thanks for pinging! Will do today.
I added two test cases in #8007, which increased coverage. However, upon further review, I noticed that the choice of parameters used to hit edge conditions didn't cover the string data types. Rather than adding a bunch more copies of basically the same test with new parameters, I instead propose using the same set of parameters for all data types and driving the tests from a table in this PR. In my opinion it makes the test logic slightly harder to follow, but it does increase coverage.

Closes #8009 from alamb/alamb/ARROW-9790-test-consolidation

Authored-by: alamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
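A rough sketch of the table-driven shape described above -- the `Case` struct and its field names are invented for illustration and are not the actual test code in #8009; the parameter values echo the `(2, 100, ... 15, 50, ...)` arguments visible in the quoted hunk:

```rust
// Illustrative sketch of a table-driven test -- the `Case` struct and its
// fields are invented for this example, not the actual code in #8009.
struct Case {
    num_row_groups: usize,
    rows_per_row_group: usize,
    batch_size: usize,
}

/// Run the same check once per entry of a single shared parameter table,
/// so every data type is exercised against the same edge conditions.
fn for_each_case(mut check: impl FnMut(&Case)) {
    let cases = [
        // batch size divides the row group evenly (a batch ends on the boundary)
        Case { num_row_groups: 2, rows_per_row_group: 100, batch_size: 50 },
        // batch size that leaves a remainder
        Case { num_row_groups: 2, rows_per_row_group: 100, batch_size: 15 },
    ];
    for case in &cases {
        check(case);
    }
}

fn main() {
    let mut seen = 0;
    for_each_case(|case| {
        // a real test would build a reader per data type here
        assert!(case.batch_size <= case.num_row_groups * case.rows_per_row_group);
        seen += 1;
    });
    assert_eq!(seen, 2);
}
```

The payoff of this style is that adding a new edge condition (or a new data type) means adding one table row (or one driver call) rather than another copy/pasted test function.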
When I was reading a parquet file into `RecordBatch`es using `ParquetFileArrowReader` that had row groups 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully I started seeing this error.

Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and a batch happens to end exactly at the end of a row group for an Int or Float column, no subsequent row groups are read. Visually:

I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading 0 rows from a page reader as being at the end of the column. This bug appears not to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw) -- but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled.
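The failure mode can be shown with a simplified mock (this is not the actual `PrimitiveArrayReader` code; the struct and function names are invented for illustration): when the loop treats a 0-record read as end-of-column, a batch that ends exactly on a row-group/page boundary makes the very next read return 0, and everything after that boundary is silently skipped.

```rust
// Simplified mock, not the actual PrimitiveArrayReader code: the page
// reader legitimately returns 0 records at the end of a page even though
// more pages (row groups) remain.
struct MockPages {
    remaining: Vec<usize>, // records left in each page
}

impl MockPages {
    /// Read up to `n` records from the current page; returns 0 when the
    /// current page is exhausted, even if later pages still hold data.
    fn read_records(&mut self, n: usize) -> usize {
        match self.remaining.first_mut() {
            Some(rem) if *rem > 0 => {
                let r = n.min(*rem);
                *rem -= r;
                r
            }
            _ => 0,
        }
    }
}

/// The buggy shape of the loop: a 0-record read is taken to mean the end
/// of the whole column instead of just the end of the current page.
fn read_batch_buggy(pages: &mut MockPages, batch: usize) -> usize {
    let mut total = 0;
    while total < batch {
        let n = pages.read_records(batch - total);
        if n == 0 {
            break; // wrong: should try the next page first
        }
        total += n;
    }
    total
}

fn main() {
    // Two "row groups" of 100 records each, read in batches of 50.
    let mut pages = MockPages { remaining: vec![100, 100] };
    assert_eq!(read_batch_buggy(&mut pages, 50), 50);
    assert_eq!(read_batch_buggy(&mut pages, 50), 50); // this batch ends exactly on the boundary
    // The next read returns 0, so the second row group is silently skipped.
    assert_eq!(read_batch_buggy(&mut pages, 50), 0);
    println!("reproduced: 0 rows read after a row-group boundary");
}
```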