ARROW-4365: [Rust] [Parquet] Implement arrow record reader.#4292
ARROW-4365: [Rust] [Parquet] Implement arrow record reader.#4292liurenjie1024 wants to merge 10 commits intoapache:masterfrom
Conversation
|
@sunchao @nevi-me @andygrove Please help to review this. |
andygrove
left a comment
There was a problem hiding this comment.
This looks good to me from a coding point of view but I'm not familiar enough with parquet format internals to know how correct this is.
nevi-me
left a comment
There was a problem hiding this comment.
Hi @liurenjie1024, I'm also not very familiar with the parquet code base (planning on spending time on it in the coming weeks). I just have style changes and 1 comment.
There was a problem hiding this comment.
Thanks @liurenjie1024 for the PR. Still reading through it but I'll need more time to think about the overall design and how we can align this with other parts of the project.
Seems this references the C++ version of RecordReader implementation which allows us to handle nested records with repetition level > 0 (although seems C++ still doesn't handle list type at the moment). In the Rust implementation though, we've already have record/reader.rs which can handle nested structures and is built on top of the column readers. Thus, I'm wondering if is possible to utilize the existing code (esp. triplet.rs) and implement the Arrow record reader on top of it. Just some food for thought :)
Also cc @sadikovi who originally wrote this part and might be interested.
|
Also we should change the title of this PR: Implement Arrow record reader. |
|
@sunchao Thanks for the review. Yes you are right, this is heavily inspired by c++ implementation. Currently c++ implementation can handle List<List<List> type, I'll extend this to handle more cases. Also I took a look at triple.rs. Though it's similar to record reader, they serve different purposes. The main goal of record reader is to help arrow array reader to find the boundaries of nested records and buffer them, while triple.rs is much simpler. |
|
@liurenjie1024 Ouch. The list collection is actually done in I looked at the code, could you add more documentation around it, otherwise it is quite difficult to understand what is going on? Also would be good to add more tests around different complex schemas including deeply nested nullable lists and maps. Thanks! |
|
@sadikovi |
|
@sunchao @sadikovi The following code is from primitive array reader (WIP: read parquet column into arrow primitive array), it demonstrates usage of record reader. impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
fn next_batch(&mut self, batch_size: usize) -> Result<Arc<Array>> {
self.record_reader.reset()?;
let mut records_read = 0usize;
while records_read < batch_size {
let records_to_read = batch_size - records_read;
let records_read_once = self.record_reader.read_records(records_to_read)?;
records_read = records_read + records_read_once;
// Record reader exhausted
if records_read_once < records_to_read {
if let Some(page_reader) = self.pages.next() {
// Read from new page reader
self.record_reader.set_page_reader(page_reader?)?;
} else {
// Page reader also exhausted
break;
}
}
}
// Convert buffer data in record_reader to array
let record_data = self.record_reader.consume_record_data();
let null_bitmap = self.consume_bitmap();
.....
}
} |
|
Sorry for late review @liurenjie1024 !. I'll take another thorough look on this in the weekend. |
sunchao
left a comment
There was a problem hiding this comment.
Thanks @liurenjie1024 and sorry again for the late review. I guess that it's hard to reuse triplet.rs in this case.
Does the current record reader only handles the top-level record? i.e., the outer most List in a List<List<..>> structure. Curious how this can be used to assemble the inner lists in a multi-level list structure.
Left some comments - mostly minor and about naming/style stuff.
|
@sunchao Thanks for the review. |
|
@sunchao @sadikovi |
b77a4ba to
3969a46
Compare
|
Nice @liurenjie1024 ! will take a look at this. |
Codecov Report
@@ Coverage Diff @@
## master #4292 +/- ##
=========================================
Coverage ? 82.73%
=========================================
Files ? 87
Lines ? 25430
Branches ? 0
=========================================
Hits ? 21040
Misses ? 4390
Partials ? 0
Continue to review full report at Codecov.
|
|
@liurenjie1024 thanks again for the POC (sorry for being late as I was on vacation). I think I have a much better understanding on this now. Could you address the comments I had and update this PR? |
|
@sunchao Sure, will fix it soon |
ed180da to
85fe336
Compare
ee1ea5d to
e279a00
Compare
|
@sunchao Fixed comments. Please take a look when you got a chance. |
There was a problem hiding this comment.
Thanks @liurenjie1024 ! this overall looks good to me - just left some comments & questions. Also could you check the CI failure?
There was a problem hiding this comment.
In the current code, if data_read is 0 you may still enter the loop right? and same thing will happen?
There was a problem hiding this comment.
hmm, you mean some latter processing relies on the fact that we should default value as garbage value?
|
@sunchao Thanks for the review. Fixed comments. Will take a look at the ci error later. |
|
Thanks @liurenjie1024 . The PR looks good to me except one minor thing: can you rename |
|
@sunchao CI passed. I think the c++ build failure is unrelated. |
RecordReader reads logical records into memory, this is the prerequisite for ColumnReader.