Validate arguments to ArrayData::new and null bit buffer and buffers #810
Conversation
arrow/src/array/data.rs
This is the core validation check.
arrow/src/array/array_primitive.rs
These tests need to be updated because they were creating invalid Buffers (this one, for example, had a len of 5 and an offset of 2, but only passed in an array 5 elements long).
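The problem those tests hit can be reduced to a simple bounds invariant: `offset + len` must fit within the underlying buffer. A minimal sketch of such a check (a hypothetical standalone helper, not the actual arrow-rs code) looks like this:

```rust
// Hypothetical sketch of the bounds check: a view described by `offset`
// and `len` must fit entirely within a buffer of `buffer_len` elements.
fn check_bounds(buffer_len: usize, offset: usize, len: usize) -> Result<(), String> {
    // Use checked arithmetic: offset + len could overflow usize.
    let required = offset
        .checked_add(len)
        .ok_or_else(|| "offset + len overflows usize".to_string())?;
    if required > buffer_len {
        Err(format!(
            "offset ({}) + len ({}) = {} exceeds buffer length {}",
            offset, len, required, buffer_len
        ))
    } else {
        Ok(())
    }
}
```

With these numbers, `check_bounds(5, 2, 5)` fails because the view would need 7 elements from a 5-element buffer.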
@jorgecarleitao / @nevi-me / @jhorstmann do you have any concerns with this approach before I spend more time filling in the details for nested / compound structures? For some of the structures, the actual data will likely need to be inspected (to ensure, for example, that the values in the offsets buffer of utf8 arrays are valid). I am not sure about the performance impact of doing these validations -- if it turns out to be too great, I can imagine having a "trusted" version of
Codecov Report

```diff
@@           Coverage Diff            @@
##           master     #810    +/-   ##
========================================
+ Coverage   82.29%   82.31%   +0.01%
========================================
  Files         168      168
  Lines       48028    48409     +381
========================================
+ Hits        39527    39847     +320
- Misses       8501     8562      +61
```

Continue to review the full report at Codecov.
@pitrou suggests looking at https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.h
The C++ validation looks extensive and includes offsets and dictionary keys, so basing the logic on that makes sense. These offsets and dictionary keys are actually my main performance concern; I think we want at least an

Sounds good @jhorstmann -- the C++ version includes two validation methods: one that basically checks buffer sizes and one that checks the offsets, etc. I will attempt to mirror the same. Will ping you when I have this PR ready for review.
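The "check the offsets" tier mentioned above can be sketched as a standalone helper (hypothetical names, not the arrow-rs implementation): for variable-length types such as utf8, offsets must be non-decreasing and the last offset must stay within the values buffer.

```rust
// Hypothetical sketch of an offsets-buffer check for variable-length
// arrays: offsets must be monotonically non-decreasing, and the final
// offset must not point past the end of the values buffer.
fn validate_offsets(offsets: &[i32], values_len: usize) -> Result<(), String> {
    for pair in offsets.windows(2) {
        if pair[1] < pair[0] {
            return Err(format!("offsets decrease: {} -> {}", pair[0], pair[1]));
        }
    }
    match offsets.last() {
        // A negative last offset would also fail here: the cast to usize wraps
        // to a huge value, which exceeds any realistic values_len.
        Some(&last) if (last as usize) > values_len => Err(format!(
            "last offset {} exceeds values length {}",
            last, values_len
        )),
        _ => Ok(()),
    }
}
```

This is the part that costs O(n) per array, which is why a cheaper size-only tier plus an `unsafe` escape hatch is being discussed.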
This PR is not ready for review yet; when it is, I will mark it as no longer a draft.
Update: all the existing tests are now passing. Still TODO:
arrow/src/array/array_binary.rs
The array data in this case is only 4 elements long, so using an offset of 1 with len 4 is incorrect (it goes off the end of the array data). The same applies to the test below.
arrow/src/array/array_boolean.rs
build() now fails the earlier validation check -- so to keep the test exercising the internal BooleanArray checks, we need to use unsafe.
Note that it might be best to eventually remove all Array-specific checks (which I think will be redundant) in favor of consolidated checks in ArrayData, but I don't want to consider doing that until the validation checks are complete.
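The safe/unsafe constructor pair this relies on can be illustrated with a toy type (the `Checked` struct below is hypothetical; in arrow-rs the pair is `ArrayData::try_new` / `ArrayData::new_unchecked` and the builder's `build` / `build_unchecked`):

```rust
// Hypothetical sketch of the safe/unsafe constructor pattern: the safe
// path validates its invariants, while the unsafe path skips validation
// so tests can still exercise downstream checks with invalid inputs.
struct Checked {
    len: usize,
    buffer_len: usize,
}

impl Checked {
    /// Safe constructor: rejects inputs that violate the invariant.
    fn try_new(len: usize, buffer_len: usize) -> Result<Self, String> {
        if len > buffer_len {
            return Err(format!("len {} exceeds buffer length {}", len, buffer_len));
        }
        Ok(Self { len, buffer_len })
    }

    /// # Safety
    /// The caller must guarantee `len <= buffer_len`.
    unsafe fn new_unchecked(len: usize, buffer_len: usize) -> Self {
        Self { len, buffer_len }
    }
}
```

A test that wants to trigger a later, type-specific error constructs the value via `unsafe { Checked::new_unchecked(...) }` to bypass the early validation.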
arrow/src/array/array_list.rs
This is a similar invalid test that the new checks identified -- value_data has only three items in it, so using offset = 1 and len = 3 can potentially read off the end. This change corrects the len to be within bounds.
There is a very similar (and likewise fixed) test in array_list.rs and one in array_map.rs.
arrow/src/array/array_list.rs
This test checks the ListArray-specific error, but since the new ArrayData validation checks now catch the problem, we need to switch to using build_unchecked() to exercise the same code path.
arrow/src/array/data.rs
I finally settled on leaving UnionArray unvalidated for this PR so that I can backport it to the 6.x release line (it is backwards compatible); in a backwards-incompatible follow-on PR I will fix up the UnionArray implementation (and its validation).
arrow/src/array/array_union.rs
As part of #85 I will clean up the union array validation
@jhorstmann, @paddyhoran, @nevi-me or @jorgecarleitao, might I trouble one of you for a review of this PR? I know it is large, but I think it is important and fairly mechanical in terms of validating the creation of ArrayData.
FYI the MIRI failure is likely the same as #879 -- I'll plan to look at that shortly if no one else gets to it |
```rust
if self.null_count > self.len {
```
I'm wondering what could happen if the null_count actually gives a different number than the validity buffer, and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range. I would suggest removing the null_count parameter from try_new to avoid inconsistencies entirely.
Kernels that want to avoid the overhead of counting bits could use the unsafe new_unchecked method. I see it is currently set by the cast_array_data function. To make that one safe while avoiding bit counting, we could just validate that the from and to layouts are equal.
> I'm wondering what could happen if the null_count actually gives a different number than the validity buffer and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range.
I suspect (but cannot prove) that if null_count is inconsistent with the validity buffer then there may be incorrect answers, but not undefined behavior (in the Rust sense).
My plan will be:
- In a separate PR (as it will not be backwards compatible), remove the null_count from try_new(): Remove null_count from ArrayData::try_new() #911
- In the PR where I add the full data checking (validate_full()), ensure that the declared null count matches the validity bitmap
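Deriving the null count from the validity bitmap, rather than trusting a caller-supplied number, can be sketched as follows (a hypothetical helper, assuming Arrow's LSB-first bit order where a 0 bit marks a null):

```rust
// Hypothetical sketch: compute the null count directly from the validity
// bitmap so it can never disagree with the buffer. Arrow packs validity
// bits LSB-first within each byte; a 0 bit means the slot is null.
fn null_count_from_bitmap(bitmap: &[u8], len: usize) -> usize {
    (0..len)
        .filter(|&i| {
            let byte = bitmap[i / 8];
            let bit = (byte >> (i % 8)) & 1;
            bit == 0 // a cleared bit marks a null value
        })
        .count()
}
```

A derived count like this is guaranteed in range, which is the inconsistency the review comment wants to rule out; production code would use word-at-a-time popcounts rather than this bit-by-bit loop.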
arrow/src/datatypes/datatype.rs
```rust
/// Returns true if this type is integral: (UInt*, Int*).
pub fn is_integer(t: &DataType) -> bool {
```
This seems to be used only for validating dictionary key types; maybe rename it to is_dictionary_key_type and link the ArrowDictionaryKeyType trait in the comment.
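The suggested rename amounts to a predicate over the type enum that accepts only integral types. A minimal sketch, using a toy enum in place of arrow's DataType (the variant list here is abbreviated and hypothetical):

```rust
// Toy stand-in for arrow's DataType enum, abbreviated for illustration.
#[allow(dead_code)]
#[derive(Debug)]
enum DataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Utf8, Float64,
}

// Hypothetical sketch of the renamed predicate: dictionary keys must be
// integral types, matching the types that implement ArrowDictionaryKeyType.
fn is_dictionary_key_type(t: &DataType) -> bool {
    use DataType::*;
    matches!(
        t,
        Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64
    )
}
```

With the name tied to its single use site, a reader no longer has to guess why a generic-sounding `is_integer` check exists in the datatype module.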
I think this looks good. Might be worthwhile to convert some of the current
At the moment, almost all of the tests in arrow use … The remaining uses of … What I plan as part of implementing …
Thanks @jhorstmann for the review. I think this PR is now ready to go, and I plan to merge it early next week unless anyone objects or would like more time to review.
…810) * Validate arguments to ArrayData::new: null bit buffer and buffers * Rename is_int_type to is_dictionary_key_type() * Correctly handle self.offset in offsets buffer * Consolidate checks * Fix test output
Which issue does this PR close?
Part of #817
Rationale for this change
This is a step towards improving the security posture of arrow-rs and resolving RUSTSEC vulnerabilities by validating arguments to ArrayData::try_new(). This particular PR adds basic size and offset sanity checking.
I apologize for the massive PR, but the checks and tests are fairly repetitive.
See discussion on #817 for more details
What changes are included in this PR?
Validation in ArrayData::try_new() based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrou

Planned for follow-on PRs:
- validate_full: will also check the values of any offsets buffers (Add full data validation for ArrayData::try_new() #921)
- Fix up the UnionArray implementation / validate it properly (Implement Union validity bitmap changes from ARROW-9222 #85)

Are there any user-facing changes?
ArrayData::try_new() may now return an Err on some (invalid) inputs rather than succeeding