ARROW-4218: [Rust][Parquet] Initial support for array reader.#5378
ARROW-4218: [Rust][Parquet] Initial support for array reader.#5378liurenjie1024 wants to merge 5 commits intoapache:masterfrom
Conversation
|
@sunchao @andygrove @nevi-me @paddyhoran Please help to review. |
|
@sunchao @nevi-me @paddyhoran @andygrove Can you take a look at this when you are available? We are close to initial support of parquet reader. This is the last but one PR to finish this goal. |
|
I noticed that this had not gotten a review last night. I'm not too familiar with Parquet in general (although I do want to learn) so I was deferring to the others. I'll try and take a look today but it will need additional review from someone else also I think. |
|
@liurenjie1024 I have not forgotten about this, sorry it is taking so long. It's doubtful this will make it into 0.15 though so I will take a look once 0.15 is released. |
|
@liurenjie1024 Sorry for the delay. I will start reviewing this tomorrow. |
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| use std::cmp::min; |
There was a problem hiding this comment.
Could we get a rustdoc comment explaining what is in in this module?
There was a problem hiding this comment.
This is used by another component of arrow reader and is not a public api. I'll add doc after the last component is ready.
andygrove
left a comment
There was a problem hiding this comment.
This is looking good to me. There a quite a few unwrap calls ... can these be handled in a safer way?
|
@paddyhoran I don't know the schedule of 0.15 release, but we have only one PR(exclude this one) left before initial support of arrow reader getting ready. And that PR will not be a large one. |
|
I think 0.15 is being cut this morning. We should keep moving though as it might be delayed. I don't have time now, but might have some later today. |
andygrove
left a comment
There was a problem hiding this comment.
Thanks @liurenjie1024 .. LGTM!
|
@paddyhoran OK if I merge this one? |
paddyhoran
left a comment
There was a problem hiding this comment.
Sorry, I thought I had already approved earlier today. Thanks @liurenjie1024
When I was reading a parquet file into `RecordBatches` using `ParquetFileArrowReader` that had row groups that were 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully, I started seeing this error
```
ParquetError("Parquet error: Not all children array length are the same!")
```
Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and if a batch happens to end at the end of a row group for Int or Float, no subsequent row groups are read
Visually:
```
+-----+
| RG1 |
| |
+-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
| |
+-----+
```
I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.
This bug appears *not* to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw), but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled
Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries
Authored-by: alamb <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
Initial support for array reader. List and map support will come later.