fix: Fix range out of index error with a temporary workaround#584
fix: Fix range out of index error with a temporary workaround#584andygrove merged 8 commits intoapache:mainfrom
Conversation
|
I will add a custom Java importer implementation to use the fixed |
|
Looks like I need to copy more Arrow Java classes to make it work... |
Instead of a real fix which requires adding I re-ran the query and verified this workaround fixes the range out of index error. |
| // HACK: For the issue https://github.com/apache/datafusion-comet/issues/540 | ||
| // As Arrow Java doesn't support `offset` in C Data interface, we cannot correctly import | ||
| // a slice of string from arrow-rs to Java Arrow and then export it to arrow-rs again. | ||
| // So we add this hack to always take full length of data buffer by assuming the first offset | ||
| // is always 0 which is true for Arrow Java and arrow-rs. | ||
| final int len = end; |
There was a problem hiding this comment.
This is the workaround added in Arrow Java side.
There was a problem hiding this comment.
A similar workaround is also added in the custom arrow-rs branch.
There was a problem hiding this comment.
This workaround assumes that the first element in offset buffer is always 0. This is true for arrow-rs and Arrow Java although Arrow spec just recommends it but not enforces it. As we only use arrow-rs and Arrow Java, this workaround should work for us without correctness issue.
There was a problem hiding this comment.
Are the changes in the custom arrow-rs branch different to the fix that went into arrow-rs master branch recently?
There was a problem hiding this comment.
The fix into arrow-rs repo is real fix that correctly reflecting offset of offset buffer into offset field in C Data interface. But only the fix is not enough, we still need the offset support in Arrow Java.
So the workaround here is different but just hacky solution. Instead filing correct offset in C Data interface, we calculate correct data buffer length in a hacky way by assuming the first offset value in offset buffer is always 0 (this is held in both arrow-rs and Arrow Java). So we get correct data buffer length even offset buffer is a slice (i.e., offset-ed in arrow-rs).
|
cc @andygrove This uses a temporary workaround to fix the issue. I'm not sure how long it will take to have the Arrow Java fix, so I think a workaround but not a real fix might be good for us to move forward. |
I also ran the benchmarks and can confirm that this fixes the issue. |
|
I proposed the workaround fix to arrow-rs at apache/arrow-rs#5958. If it could be merged in the upstream, we can just use official arrow-rs and DataFusion repos/releases. |
|
@andygrove Could you verify again if the latest change still resolves the issue? I changed it to use the commit proposed for arrow-rs apache/arrow-rs#5958 behind a feature flag. Thanks. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #584 +/- ##
============================================
+ Coverage 34.13% 34.31% +0.17%
Complexity 809 809
============================================
Files 106 108 +2
Lines 38586 38742 +156
Branches 8566 8568 +2
============================================
+ Hits 13172 13294 +122
- Misses 22674 22708 +34
Partials 2740 2740 ☔ View full report in Codecov by Sentry. |
|
I changed the arrow-rs repo to use the change at apache/arrow-rs#5964. See if everything in CI can pass. |
|
@viirya I confirmed that this fixes the issue when I run the benchmarks locally. |
Thank you, @andygrove. Then we can merge this and wait for new arrow-rs and DataFusion releases. |
andygrove
left a comment
There was a problem hiding this comment.
LGTM. Hopefully we only need to use the forks for 2-3 weeks.
core/Cargo.toml
Outdated
| arrow-string = { version = "52.0.0" } | ||
| parquet = { version = "52.0.0", default-features = false, features = ["experimental"] } | ||
| arrow = { git = "https://github.com/tustvold/arrow-rs.git", rev = "78b6139", features = ["prettyprint", "ffi", "chrono-tz"] } | ||
| arrow-array = { git = "https://github.com/tustvold/arrow-rs.git", rev = "78b6139" } |
There was a problem hiding this comment.
Now that the PR is merged, can we switch this to a revision of the official arrow-rs repo?
There was a problem hiding this comment.
Yes. We can point to arrow-rs repo.
|
@andygrove This uses official arrow-rs repo now. |
|
Thanks @andygrove |
…#584) * fix: Fix range out of index error by using custom arrow-rs repo * Add custom Java Arrow classes * Add a hack * Update * Update * Update to use apache/arrow-rs#5958 * Use tustvold's branch * Use official arrow-rs repo
Which issue does this PR close?
Closes #540.
Rationale for this change
What changes are included in this PR?
How are these changes tested?