Add full data validation for ArrayData::try_new()#921
Conversation
fda948d to
f968803
Compare
Codecov Report
@@ Coverage Diff @@
## master #921 +/- ##
==========================================
+ Coverage 82.29% 82.31% +0.01%
==========================================
Files 168 168
Lines 48761 49031 +270
==========================================
+ Hits 40129 40358 +229
- Misses 8632 8673 +41
Continue to review full report at Codecov.
|
30710a1 to
74b74dd
Compare
|
Update: I have a bunch of tests written, so that is good 🎉 I think there is some more testing needed here as when I tried to apply |
c416ad1 to
9d97916
Compare
6180179 to
6e6e410
Compare
|
This PR (that has full ArrayData validate) is now ready for review . I hope it allows us to close the currently open RUSTSEC entries for arrow-rs: https://github.com/rustsec/advisory-db/tree/main/crates/arrow cc @jhorstmann @bjchambers @jimexist @jorgecarleitao I know it is a large ask to review this PR, so thank you if you have time to do so. As for testing, if you have a project that uses |
Co-authored-by: Jörn Horstmann <[email protected]>
…ta_construction_validation
|
Thanks again everyone for the help. I'll backport this PR and hopefully we can get it shipped in 6.4.0 |
* Add full data validation for ArrayData::try_new() * Only look at offset+len indexes Co-authored-by: Jörn Horstmann <[email protected]> * fix test * fmt * test for array indexes Co-authored-by: Jörn Horstmann <[email protected]>
* Add full data validation for ArrayData::try_new() * Only look at offset+len indexes Co-authored-by: Jörn Horstmann <[email protected]> * fix test * fmt * test for array indexes Co-authored-by: Jörn Horstmann <[email protected]>
* Add full data validation for ArrayData::try_new() (#921) * Add full data validation for ArrayData::try_new() * Only look at offset+len indexes Co-authored-by: Jörn Horstmann <[email protected]> * fix test * fmt * test for array indexes Co-authored-by: Jörn Horstmann <[email protected]> * Fix: clippy Co-authored-by: Jörn Horstmann <[email protected]>
…ike arrays # Rationale The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: apache/arrow-rs#1071 and the existing [docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) seem to be somewhat vague on this issue. I looked at three implementations of arrow, and they all seem to assume / validate the offsets are monotonic: * C++ implementation (I think) also also ensures the offsets are monotonic without first checking the validity array https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592 * arrow-rs after apache/arrow-rs#921 (based on the C++) will refuse to create arrays where the array offsets are non monotonic * arrow2 also ensures that offsets are always monotonic. https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119 # Changes Thus I propose updating the format docs to make the monotonic offsets explicit. # Background I think @jorgecarleitao's description on apache/arrow-rs#1071 (comment), explains the reason why having monotonic offsets is a good idea > I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array is always initialized, and the offsets being valid and in-bounds even in null cases. > > The rational for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and & the bitmaps instead. Closes #12019 from alamb/alamb/clarify_offsets Lead-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Which issue does this PR close?
Closes #817
Rationale for this change
Improve the security arrow-rs by validating untrusted user input, and resolves several outstanding potential security issues by doing deep checks of the data passed to
ArrayData::try_new()What changes are included in this PR?
ArrayData::try_new()based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrouAre there any user-facing changes?
ArrayData::try_new()may return anErrnow on some (invalid) inputs rather than succeeding and will take more time.pub ArrayData::validate_full()that performs fullArrayDatavalidationTesting
arrow-rstests pass (other than some different error messages) when I temporarily changed the code to callvalidate_full()always as in (DEMO) tests pass whenvalidate_fullis forced #987 (review)datafusiontests pass with these changes (using https://github.com/alamb/arrow-rs/tree/alamb/full_array_data_construction_validation_df which has the changes in this PR cherry-picked)Follow On:
Unionin Update Union Array to addUnionMode, match latest Arrow Spec, and renamenew->unsafe new_unchecked()#885 (not backwards compatible)