ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions #8007

alamb wants to merge 2 commits into apache:master
Conversation
```rust
// NB can be 0 if at end of page
let records_read_once = self.record_reader.read_records(records_to_read)?;
if records_read_once == 0 {
```
The case of 0 rows being read is handled in the `if records_read_once < records_to_read` clause below -- namely, in this case the code needs to try to get the next page of data from the page reader.
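A minimal sketch of the control flow this comment describes -- the function name, the closure-based page reader, and the mock page layout below are all invented for illustration, not the actual parquet-rs API: a zero-length read falls through to the short-read branch, which asks the page reader for the next page before concluding the column is exhausted.

```rust
// Illustrative sketch only -- `read_records_loop`, the closures, and the
// mock page layout are made up for this example, not the parquet-rs API.
fn read_records_loop(
    mut read_records: impl FnMut(usize) -> usize,
    mut next_page: impl FnMut() -> bool,
    records_to_read: usize,
) -> usize {
    let mut records_read = 0;
    while records_read < records_to_read {
        // NB can be 0 if at end of page
        let records_read_once = read_records(records_to_read - records_read);
        records_read += records_read_once;
        if records_read < records_to_read {
            // Short (possibly zero-length) read: ask the page reader for
            // the next page before concluding the column is exhausted.
            if !next_page() {
                break;
            }
        }
    }
    records_read
}

// Drive the loop over a mock two-page column (3 records per page).
fn demo_read(records_to_read: usize) -> usize {
    use std::cell::RefCell;
    let pages = RefCell::new(vec![3usize, 3]); // records left per page
    let read = |n: usize| {
        let mut p = pages.borrow_mut();
        match p.first_mut() {
            Some(rem) => {
                let r = n.min(*rem);
                *rem -= r;
                r
            }
            None => 0,
        }
    };
    let next = || {
        let mut p = pages.borrow_mut();
        if p.first() == Some(&0) {
            p.remove(0); // current page exhausted, move on
        }
        !p.is_empty()
    };
    read_records_loop(read, next, records_to_read)
}

fn main() {
    // A 4-record request crosses the page boundary instead of stopping at 3.
    assert_eq!(demo_read(4), 4);
    // Requesting more than the column holds reads everything, then stops.
    assert_eq!(demo_read(10), 6);
    println!("ok");
}
```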
```rust
>(2, 100, 2, message_type, 15, 50, converter);
}

#[test]
```
Both these tests fail without the changes in this PR.
I don't like the copy/paste nature of these tests, and I plan a minor PR building on this one proposing how to remove the duplication and make the tests easier to read.
As the code I removed in this PR was added to support empty pages in 8a61570 (but didn't seem to have any tests), I was worried that this deletion would cause some sort of regression in behavior. To ensure it didn't, I added a test for empty pages in commit cfdea83. To check that the test actually covers the empty-pages case, I ran the new test on commit ed1f771 (the one right before 8a61570) and the test fails, leading me to conclude that the code removed in this PR is unnecessary and zero-page parquet files are still supported.

Log of what I did: get the commit right before support for empty pages; apply the new test for no pages; then I had to edit the test a little; and then I ran the test, and it fails.

FYI @houqp and @paddyhoran
@sunchao Could you review as well?
```rust
}

#[test]
fn test_primitive_array_reader_empty_pages() {
```
Thanks for pinging! Will do today.
I added two test cases in #8007, which increased coverage. However, upon further review, I noticed that the choice of parameters used to hit edge conditions didn't cover the string data types. Rather than adding a bunch more copies of basically the same test with new parameters, I instead propose using the same set of parameters for all data types and driving the tests from a table in this PR. In my opinion it makes the test logic slightly harder to follow, but it does increase coverage.

Closes #8009 from alamb/alamb/ARROW-9790-test-consolidation

Authored-by: alamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
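A rough sketch of the table-driven shape described above -- the `Case` struct and its field names are invented for illustration and are not the actual test code in #8009; the parameter values echo the `(2, 100, ... 15, 50, ...)` arguments visible in the quoted hunk:

```rust
// Illustrative sketch of a table-driven test -- the `Case` struct and its
// fields are invented for this example, not the actual code in #8009.
struct Case {
    num_row_groups: usize,
    rows_per_row_group: usize,
    batch_size: usize,
}

/// Run the same check once per entry of a single shared parameter table,
/// so every data type is exercised against the same edge conditions.
fn for_each_case(mut check: impl FnMut(&Case)) {
    let cases = [
        // batch size divides the row group evenly (a batch ends on the boundary)
        Case { num_row_groups: 2, rows_per_row_group: 100, batch_size: 50 },
        // batch size that leaves a remainder
        Case { num_row_groups: 2, rows_per_row_group: 100, batch_size: 15 },
    ];
    for case in &cases {
        check(case);
    }
}

fn main() {
    let mut seen = 0;
    for_each_case(|case| {
        // a real test would build a reader per data type here
        assert!(case.batch_size <= case.num_row_groups * case.rows_per_row_group);
        seen += 1;
    });
    assert_eq!(seen, 2);
}
```

The payoff of this style is that adding a new edge condition (or a new data type) means adding one table row (or one driver call) rather than another copy/pasted test function.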
When I was reading a parquet file into `RecordBatch`es using `ParquetFileArrowReader` that had row groups 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully I started seeing this error.

Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and a batch happens to end exactly at the end of a row group for an Int or Float column, no subsequent row groups are read. Visually:

I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading 0 rows from a page reader as being at the end of the column. This bug appears not to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw) -- but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled.
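The failure mode can be shown with a simplified mock (this is not the actual `PrimitiveArrayReader` code; the struct and function names are invented for illustration): when the loop treats a 0-record read as end-of-column, a batch that ends exactly on a row-group/page boundary makes the very next read return 0, and everything after that boundary is silently skipped.

```rust
// Simplified mock, not the actual PrimitiveArrayReader code: the page
// reader legitimately returns 0 records at the end of a page even though
// more pages (row groups) remain.
struct MockPages {
    remaining: Vec<usize>, // records left in each page
}

impl MockPages {
    /// Read up to `n` records from the current page; returns 0 when the
    /// current page is exhausted, even if later pages still hold data.
    fn read_records(&mut self, n: usize) -> usize {
        match self.remaining.first_mut() {
            Some(rem) if *rem > 0 => {
                let r = n.min(*rem);
                *rem -= r;
                r
            }
            _ => 0,
        }
    }
}

/// The buggy shape of the loop: a 0-record read is taken to mean the end
/// of the whole column instead of just the end of the current page.
fn read_batch_buggy(pages: &mut MockPages, batch: usize) -> usize {
    let mut total = 0;
    while total < batch {
        let n = pages.read_records(batch - total);
        if n == 0 {
            break; // wrong: should try the next page first
        }
        total += n;
    }
    total
}

fn main() {
    // Two "row groups" of 100 records each, read in batches of 50.
    let mut pages = MockPages { remaining: vec![100, 100] };
    assert_eq!(read_batch_buggy(&mut pages, 50), 50);
    assert_eq!(read_batch_buggy(&mut pages, 50), 50); // this batch ends exactly on the boundary
    // The next read returns 0, so the second row group is silently skipped.
    assert_eq!(read_batch_buggy(&mut pages, 50), 0);
    println!("reproduced: 0 rows read after a row-group boundary");
}
```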