ARROW-8751: [Rust] support empty parquet file in arrow array reader #7140
Closed
houqp wants to merge 1 commit into apache:master
nevi-me approved these changes on May 14, 2020
sunchao pushed a commit that referenced this pull request on Aug 20, 2020
While reading a parquet file into `RecordBatches` using `ParquetFileArrowReader`, with row groups of 100,000 rows each and a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:
```
ParquetError("Parquet error: Not all children array length are the same!")
```
Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups and a batch happens to end exactly at the end of a row group for an Int or Float column, no subsequent row groups are read.
Visually:
```
+-----+
| RG1 |
| |
+-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
| |
+-----+
```
I traced the issue to a bug in `PrimitiveArrayReader`, which mistakenly interprets reading `0` rows from a page reader as having reached the end of the column.
This bug does *not* appear to be present in the initial implementation (#5378), but was introduced in #7140. FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw). I will do some more investigating to ensure the test case described in that ticket is handled.
Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries
Authored-by: alamb <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
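For readers who want to see the failure mode in isolation, here is a minimal toy model of it. This is a sketch only, not the actual `PrimitiveArrayReader` code: `ToyColumn`, `read_records`, `load_next_group`, `next_batch`, `total_rows`, and the `stop_on_zero` flag are all hypothetical names invented for illustration, and row groups are modelled as plain row counts.

```rust
/// Toy stand-in for one column of a file (hypothetical, not the parquet crate API).
struct ToyColumn {
    row_groups: Vec<usize>, // number of rows in each row group
    next_group: usize,      // index of the next row group to load
    remaining: usize,       // unread rows in the currently loaded row group
}

impl ToyColumn {
    fn new(row_groups: Vec<usize>) -> Self {
        let mut col = ToyColumn { row_groups, next_group: 0, remaining: 0 };
        col.load_next_group(); // does nothing if there are no row groups
        col
    }

    /// Read up to `n` rows from the loaded row group; returns 0 once it is exhausted.
    fn read_records(&mut self, n: usize) -> usize {
        let got = n.min(self.remaining);
        self.remaining -= got;
        got
    }

    /// Load the next row group; returns false when none are left.
    fn load_next_group(&mut self) -> bool {
        match self.row_groups.get(self.next_group) {
            Some(&rows) => {
                self.remaining = rows;
                self.next_group += 1;
                true
            }
            None => false,
        }
    }
}

/// Fill one batch. With `stop_on_zero = true` this models the bug: a read of
/// 0 rows is taken to mean "end of column", even though it may only mean
/// that the current row group is exhausted.
fn next_batch(col: &mut ToyColumn, batch_size: usize, stop_on_zero: bool) -> usize {
    let mut read = 0;
    while read < batch_size {
        let wanted = batch_size - read;
        let got = col.read_records(wanted);
        read += got;
        if stop_on_zero && got == 0 {
            break; // modelled bug: gives up without trying the next row group
        }
        // A short read means the current row group is done: move to the next
        // one, and stop only when there really are no more row groups.
        if got < wanted && !col.load_next_group() {
            break;
        }
    }
    read
}

/// Read batches until an empty batch comes back; returns the total rows seen.
fn total_rows(row_groups: Vec<usize>, batch_size: usize, stop_on_zero: bool) -> usize {
    let mut col = ToyColumn::new(row_groups);
    let mut total = 0;
    loop {
        let n = next_batch(&mut col, batch_size, stop_on_zero);
        if n == 0 {
            break;
        }
        total += n;
    }
    total
}
```

With four row groups of 100,000 rows and a batch size of 60,000, the fifth batch ends exactly at the end of the third row group. In this model `total_rows(vec![100_000; 4], 60_000, true)` stops at 300,000 rows, while passing `false` reads all 400,000.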
alamb added a commit to apache/arrow-rs that referenced this pull request on Apr 20, 2021
Sometimes Spark writes out parquet files with zero row groups, which results in an error when they are read using `ParquetFileArrowReader`.
It would be more convenient if `ParquetFileArrowReader` supported this edge case out of the box.
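
To make the requested behaviour concrete, here is a rough sketch of the difference between a reader that insists on a first row group and one that tolerates an empty file. It is an illustration under assumptions, not the code in this PR: `first_group_eager` and `batch_sizes` are hypothetical helpers, row groups are modelled as plain row counts, and the eager variant is only one plausible shape for how the original error could arise.

```rust
/// Hypothetical eager setup: demands a first row group up front, so a file
/// with zero row groups is reported as an error.
fn first_group_eager(row_groups: &[usize]) -> Result<usize, String> {
    row_groups
        .first()
        .copied()
        .ok_or_else(|| "file has no row groups".to_string())
}

/// Hypothetical tolerant variant: computes the size of every batch that would
/// be produced. A file with zero row groups simply yields no batches.
fn batch_sizes(row_groups: &[usize], batch_size: usize) -> Vec<usize> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut remaining: usize = row_groups.iter().sum();
    let mut batches = Vec::new();
    while remaining > 0 {
        let n = remaining.min(batch_size);
        batches.push(n);
        remaining -= n;
    }
    batches
}
```

Under this model, `first_group_eager(&[])` returns an error, whereas `batch_sizes(&[], 1024)` returns an empty list, which is the "no rows, no error" outcome the description asks for.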