Skip to content

Validate arguments to ArrayData::try_new() #817

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This ticket lists a high level plan to address one of the main sources of security issues in arrow-rs
such as #772 and likely several others on https://github.com/apache/arrow-rs/issues?q=is%3Aissue+is%3Aopen+label%3Asecurity

As demonstrated in https://github.com/jorgecarleitao/arrow2#why, and almost all of the examples in https://github.com/apache/arrow-rs/issues?q=is%3Aissue+is%3Aopen+label%3Asecurity, creating ArrayData::new with invalid arguments can lead to undefined behavior.

See also the discussion with @jhorstmann and others on https://lists.apache.org/thread.html/r3f12f3352ca36264622d4103fcb6c7c71544dcaf0f0a7e842f00c3a0%40%3Cdev.arrow.apache.org%3E

Describe the solution you'd like
I propose to follow the C++ implementation (kudos to @pitrou) in https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L388-L392

Add two new functions:

  1. ArrayData::validate() -- checks offsets / buffer sizes, relatively inexpensive
  2. ArrayData::validate_full -- which callsvalidate() AND checks all variable length data structures for consistency (e.g. ensures the offsets of a StringArray are within the size of the base array

Then, change ArrayData to have two constructors:

  1. unsafe ArrayData::new_unchecked() - Behaves like ArrayData::new() does today -- namely has no validation
  2. ArrayData::try_new() will be safe in the Rust sense -- can not cause undefined behavior and thus will call ArrayData::validate_full

This design will follow the Rust philosophy of "safe by default" but offer an alternative (unsafe) mechanism to bypass checking for known good inputs. This unsafe mechanism has been prototyped by @jhorstmann in #813

Describe alternatives you've considered
Could wait for arrow2 convergence, if that happens, but since the timeline on that ETA is still unknown, safety for the arrow-rs implementation seems to justify spending time here

** Progress **

Metadata

Metadata

Assignees

Labels

arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelogsecurity

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions