feat: support reading and writingStringView and BinaryView in parquet (part 2)#5557
feat: support reading and writingStringView and BinaryView in parquet (part 2)#5557ariesdevil wants to merge 5 commits intoapache:masterfrom
StringView and BinaryView in parquet (part 2)#5557Conversation
17695d3 to
1c7260e
Compare
|
The parquet build for wasm32 check always failed, but it works on my machine... |
|
Thanks @ariesdevil -- I think @tustvold is away for a few days. I will try and give this PR a look over the next day or two. Very exciting! |
d2f1a30 to
fdd2e48
Compare
StringView and BinaryView in parquet
alamb
left a comment
There was a problem hiding this comment.
Thank you @ariesdevil -- I think this is looking close to something we can merge 🙏
To merge this, I think we need:
- Benchmarks for reading/writing StringView / BinaryViews
- Cover a few more cases for round tripping
I think there is quite a bit more performance to be had as well by avoiding string copies. This PR will copy the data I think twice (once out of parquet buffer and once out of the offset buffer).
It would be nice to avoid at least one of those copies prior to merge, but I also think once we have the test coverage it would be ok to merge this PR and then do additional optimizations as follow on PRs
We can add benchmarks here:
- reading
arrow-rs/parquet/benches/arrow_reader.rs
Lines 867 to 868 in 7e134f4
- writing:
arrow-rs/parquet/benches/arrow_writer.rs
Lines 382 to 392 in 7e134f4
End to end round trip tests:
|
This is really nice step forward -- thank you @ariesdevil |
f37c3dd to
f65807f
Compare
alamb
left a comment
There was a problem hiding this comment.
Thank you @ariesdevil -- I think this PR looks like a really nice incremental step.
As I think we have identified, it can be significantly improved performance wise but having the basic tests and benchmarks in place we are in a good space to optimize it.
I left some other smaller style / documentation comments that might be nice to cleanup, but I don't think they are strictly required
Let's wait another day for @tustvold to see if he wants to review (i think he is still not back yet) but otherwise we'll merge this PR and add additional features as a follow on PR
a22ee14 to
d286521
Compare
|
I plan to give this another review today |
|
The integration test https://github.com/apache/arrow-rs/actions/runs/8553472637/job/23436722309?pr=5557 seems to have failed for reasons unrelated to this PR. I have restarted it |
|
I am sorry @ariesdevil -- I need more time and a fresh set of eyes to review the new implementation in this PR. I will find time over the next few days. |
|
Hi @alamb ,the new implementation has indeed changed a lot. It should have been implemented in another PR, but the original performance is really poor. Thanks for your patience and have a nice weekend. |
I wrote new tests in #5639 |
fef3c9f to
cbf08f4
Compare
|
I am on vacation this week but I hope to find some time later today or tomorrow to give this another look. Thanks @ariesdevil |
|
No need to rush, enjoy your vacation, just so I can have time to rethink how to continue optimization 😃 |
|
I see some new updates -- I plan to look at this PR later this week (probably won't be until Wednesday) |
|
Hi @ariesdevil, thanks for the great work! I'm looking at this PR these days and trying to push it forward. However, I do feel the code change is too large for a single PR that it is quite difficult to understand every detail of the implementation. I suggest we break the PR into even smaller pieces and incrementally add the features/optimizations so it poses less mental burdens on the reviewers, what do you think? I'll try to consolidate a few low-hanging fruit changes and try to get them merged first; let me know your thoughts! |
|
Hi @XiangpengHao, thanks for your reply. I don't think this PR is very large, the previous reviewers are already very familiar with this PR, this PR has been submitted for almost three months, and the reviewer has had enough time to get familiar with it. |
Indeed -- I am sorry this PR is stalled @ariesdevil -- it has been very hard for me to find enough contiguous time to get familar enough with this PR and really understand it (and ensure the test coverage is adequate, etc). @ariesdevil do you think we should attempt to merge this entire PR at once? I do think it would help speed up review if we could break it into some smaller PRs |
Thank you for understanding -- I feel bad whenever PRs get stuck.
That is awesome. THank you. I will make every effort to help get this over the line and now that @XiangpengHao is helping I am optimistic we can get it in shortly |
|
#5877 is merged now! |
This reverts commit 71f51aa10079cdadfe7a0f5026563eedd7f76bd9.
5730aa0 to
7fe2847
Compare
|
closed by #5877 |
Which issue does this PR close?
Closes #5530
Rationale for this change
Use view_buffer instead of offset_buffer for view type parquet reader
What changes are included in this PR?
Use view_buffer instead of offset_buffer for view type parquet reader
Are there any user-facing changes?
Yes