Zero-copy Vec conversion (#3516) (#1176)#3756
Conversation
|
|
||
| /// Create a [`Buffer`] from the provided `Vec` without copying | ||
| #[inline] | ||
| pub fn from_vec<T: ArrowNativeType>(vec: Vec<T>) -> Self { |
There was a problem hiding this comment.
We can't implement From as this would conflict with the blanket From<AsRef<[u8]>>
| } | ||
|
|
||
| #[test] | ||
| fn test_vec_interop() { |
There was a problem hiding this comment.
Adding Vec conversion at the Buffer level does allow for transmuting types provided they have the same layout, this is perhaps not ideal but isn't harmful.
Long-term I hope to deprecate and remove Buffer and MutableBuffer, just leaving ScalarBuffer which statically prevents this sort of type-conversion, but doing it this way allows for a gradual migration.
There was a problem hiding this comment.
this is perhaps not ideal but isn't harmful.
Is it a problem for something like converting integers to (invalid) floats 🤔
There was a problem hiding this comment.
Nope, there is a good justification for why this is perfectly fine here. The only place where this becomes problematic is where types have bit sequences that illegal, e.g. NonZeroU32 or bool, all ArrowNativeType are valid for all bit sequences.
arrow-buffer/src/buffer/immutable.rs
Outdated
| } | ||
|
|
||
| /// Returns `Vec` for mutating the buffer if this buffer is not offset and was | ||
| /// allocated with the correct layout for `Vec<T>` |
There was a problem hiding this comment.
I think it would be good to explicitly say here an error is returned if the buffer can't be converted to a Vec (and ideally hint how to get a Vec out of it anyways (perhaps by copying)
arrow-buffer/src/alloc/mod.rs
Outdated
| Arrow(usize), | ||
| /// An allocation using [`std::alloc`] | ||
| Standard(Layout), | ||
| /// An allocation from an external source like the FFI interface or a Rust Vec. |
There was a problem hiding this comment.
I wonder if the comment for Custom should be updated to say like "external Rust Vec" or maybe remove the reference to Rust Vec entirely
| pub fn from_vec<T: ArrowNativeType>(vec: Vec<T>) -> Self { | ||
| // Safety | ||
| // Vec::as_ptr guaranteed to not be null and ArrowNativeType are trivially transmutable | ||
| let ptr = unsafe { NonNull::new_unchecked(vec.as_ptr() as _) }; |
There was a problem hiding this comment.
It seems like into_raw_parts would be ideal here, but sadly it appears to not yet be stabalized
I note that the docs say
After calling this function, the caller is responsible for the memory previously managed by the Vec. The only way to do this is to convert the raw pointer, length, and capacity back into a Vec with the from_raw_parts function, allowing the destructor to perform the cleanup.
However, this code uses Deallocation::Standard(layout) to deallocate memory, which seems ok, I just wanted to point out it doesn't match the docs (though maybe this is fine)
There was a problem hiding this comment.
The docs for dealloc are authoritative here
ptr must denote a block of memory currently allocated via this allocator
layout must be the same layout that was used to allocate that block of memory
The layout of arrays, vecs, etc... is fixed and defined by Layout::array. As such we are fine provided we obey the same logic.
There was a problem hiding this comment.
Do we plan to store/create from layouts other than Vec? Otherwise we can just create from Vec, forget the owned Vector.
On drop we recreate the Vec so that the drop sequence is executed.
This would defer all this logic to the implementation in std.
There was a problem hiding this comment.
Do we plan to store/create from layouts other than Vec? Otherwise we can just create from Vec, forget the owned Vector.
Eventually we may deprecate and remove support for other layouts, but at least for a period we need to support aligned layouts such as those created by MutableBuffer so that we can avoid stop-the-world changes
There was a problem hiding this comment.
Right, maybe we could add a comment that once we remove that support we could remove the unsafe layout related code and do something similar to what we do here: https://github.com/DataEngineeringLabs/foreign_vec/blob/0d38968facee8a81748ec380fad78379d806fe1d/src/lib.rs#L25
There was a problem hiding this comment.
unsafe layout related code and do something similar to what we do here
What would be the advantage of this? Does polars depend on the foreign vec interface? It doesn't seem to be doing anything materially different?
Edit: In fact I'm fairly sure this line is technically unsound - https://github.com/DataEngineeringLabs/foreign_vec/blob/0d38968facee8a81748ec380fad78379d806fe1d/src/lib.rs#L49, there are a lot of safety constraints on from_raw_parts that appear to be not being checked or documented?
| let len = vec.len() * std::mem::size_of::<T>(); | ||
| // Safety | ||
| // Layout guaranteed to be valid | ||
| let layout = unsafe { Layout::array::<T>(vec.capacity()).unwrap_unchecked() }; |
There was a problem hiding this comment.
Is there a reason to use unwrap_unchecked here? Maybe it would be better to panic if the layout was messed up somehow?
Looks like it errors on overflow https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.array
Like maybe the size overflows because someone passed in a giant Vec or something?
🤔
There was a problem hiding this comment.
Is there a reason to use unwrap_unchecked here
This is a clone of Vec::current_memory, I will add a link.
The TLDR is that if the Vec is valid, it must have a valid layout (otherwise among other issues Vec wouldn't be able to deallocate itself).
| } | ||
|
|
||
| #[test] | ||
| fn test_vec_interop() { |
There was a problem hiding this comment.
this is perhaps not ideal but isn't harmful.
Is it a problem for something like converting integers to (invalid) floats 🤔
| let mut b: Vec<i128> = Vec::with_capacity(4); | ||
| b.extend_from_slice(&[1, 2, 3]); | ||
| let b = Buffer::from_vec(b); | ||
| let back = b.into_vec::<i256>().unwrap(); |
There was a problem hiding this comment.
I think it is also worth a test when the capacity doesn't equally divide equally into the transmuted size -- like maybe change this test to have capacity 5 with i128 and verify the i256 output Vec only has capacity 2
There was a problem hiding this comment.
A capacity of 5 is tested above and results in an error, as the layout of the underlying allocation is invalid
| let slice = b.slice_with_length(0, 34); | ||
| drop(b); | ||
|
|
||
| let back = slice.into_vec::<i128>().unwrap(); |
There was a problem hiding this comment.
Is slicing with a non zero offset also covered? Maybe slice(2) below covers that 🤔
There was a problem hiding this comment.
Yeah slice(2) covers this
| @@ -95,12 +102,14 @@ impl MutableBuffer { | |||
|
|
|||
| /// Allocates a new [MutableBuffer] from given `Bytes`. | |||
There was a problem hiding this comment.
maybe it is worth updating these docs to explain when an Err is returned
| } | ||
| } | ||
|
|
||
| impl<T: ArrowNativeType> From<Vec<T>> for ScalarBuffer<T> { |
|
It might be worth it to have @viirya give this PR a once over before merge. Not sure if @ritchie46 or @jorgecarleitao are interested or want to comment either. |
|
Integration failure is unrelated - apache/arrow#34367 |
|
Unless I hear anything further, I plan to merge this tomorrow morning. Please let me know if you need more time to review |
|
Benchmark runs are scheduled for baseline = 5edc954 and contender = d440c24. d440c24 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3516
Relates to #1176
Rationale for this change
This starts the process of migrating away from custom mutable buffer abstractions, in favour of
Vec. This will allow eventually deprecating and removingMutableBufferandBufferBuilder, and eventuallyBufferitself.What changes are included in this PR?
Are there any user-facing changes?