Add with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder#7120
Add with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder#7120alamb merged 10 commits intoapache:mainfrom
with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder#7120Conversation
d4102d6 to
59b3033
Compare
7b85e5e to
77d3de5
Compare
|
|
||
| mod private { | ||
| /// A boolean flag that cannot be mutated outside of unsafe code. | ||
| /// A boolean flag that cannot be mutated outside of unsafe code. |
There was a problem hiding this comment.
I propose to make this UnsafeFlag public (and added examples and more docs) so I could use it across the two crates. However, I can also make a private copy of it in arrow-ipc if reviewers feel it would be better to avoid a new API
arrow-ipc/benches/ipc_reader.rs
Outdated
| writer.write(&batch).unwrap(); | ||
| } | ||
| writer.finish().unwrap(); | ||
| let buffer = ipc_stream(); |
There was a problem hiding this comment.
I added new versions of each benchmark that work with disabled validation
| StructArray::try_new(struct_fields.clone(), struct_arrays, None)? | ||
| }; | ||
| Ok(Arc::new(struct_array)) | ||
| self.create_struct_array(struct_node, null_buffer, struct_fields, struct_arrays) |
There was a problem hiding this comment.
I refactored this code into its own function so it was eaiser to call StructArray::new_unchecked when validation was disabled
arrow-ipc/src/reader.rs
Outdated
| /// | ||
| /// Relies on the caller only passing a flag with `true` value if they are | ||
| /// certain that the data is valid | ||
| pub fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self { |
There was a problem hiding this comment.
Note that RecordBatchDecoder is not a public API:
https://docs.rs/arrow-ipc/54.1.0/arrow_ipc/reader/struct.StreamReader.html?search=RecordBatchDecoder
| self | ||
| } | ||
|
|
||
| /// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
this is a new public API and follows the same pattern as ArrayData::skip_validation
| &mut self.reader | ||
| } | ||
|
|
||
| /// Specifies if validation should be skipped when reading data (defaults to `false`) |
| &mut self.reader | ||
| } | ||
|
|
||
| /// Specifies if validation should be skipped when reading data (defaults to `false`) |
| } | ||
|
|
||
| #[test] | ||
| fn test_invalid_nested_array_ipc_read_errors() { |
There was a problem hiding this comment.
I added some additional coverage to make sure the flag got passed when decoding multiple nesting levels
| } | ||
|
|
||
| #[test] | ||
| fn test_validation_of_invalid_union_array() { |
There was a problem hiding this comment.
Turns out UnionArray had its own path / handling so new coverage added
|
|
||
| // IPC Stream format | ||
| let buf = write_stream(&rb); // write is ok | ||
| read_stream_skip_validation(&buf).unwrap(); |
There was a problem hiding this comment.
Now all the validation tests also verify they can read the batch back without error if validation is disabled
tustvold
left a comment
There was a problem hiding this comment.
Mostly just some minor nits
arrow-ipc/benches/ipc_reader.rs
Outdated
| } | ||
|
|
||
| /// Return an IPC stream with ZSTD compression with 10 record batches | ||
| fn ipc_stream_zstd() -> Vec<u8> { |
There was a problem hiding this comment.
Minor nit, but I wonder if we could have an ipc_stream_options that takes a IpcWriteOptions and then have ipc_stream just call through to this
There was a problem hiding this comment.
this was a good suggestion -- I did it in 7d3f020 🧹
Co-authored-by: Raphael Taylor-Davies <[email protected]>
…-rs into alamb/ipc_disable_validation
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
…-rs into alamb/ipc_disable_validation
|
Thank you very much for the review @tustvold |
Which issue does this PR close?
with_skip_validationflag to IPC readers/writers #7093Rationale for this change
Forcing
Arrayvalidation while reading arrow IPC trusted data is inefficient. Users should be able to avoid doing so if they wantWhat changes are included in this PR?
This PR builds on this PR from @totoroyyb
Are there any user-facing changes?
with_disable_validationAPIs onStreamReader,FileReaderandFileDecoderBenchmark results Mac M3
Observations
mmapis super fast and thus the validation is a much larger portion of the time, resulting in a corresponding 8-9x faster without validationwith_skip_validationStreamReaderStreamReader(zstd)FileReaderFileDecoder+mmapDetails for Mac M3
Benchmarks: GCP
On a GCP
c2-standard-16 (16 vCPUs, 64 GB Memory)with_skip_validationStreamReaderStreamReader(zstd)FileReaderFileDecoder+mmapRaw Results (gpc)