Conversation
arrow-array/src/heap_size.rs
Outdated
| impl HeapSize for StringArray { | ||
| fn heap_size(&self) -> usize { | ||
| self.get_buffer_memory_size() | ||
| } | ||
| } |
There was a problem hiding this comment.
This could probably be a blanket implementation:
impl<T> HeapSize for T
where
T: Array|
I am not sure we need a new crate just for the HeapSize trait in a downstream crate (DataFUsion) What about making a trait like this in DataFusion common: pub trait DFHeapSize {
...
}And then providing a blanket implementation for impl <T: HeapSize> for DFHeapSize {
...
}That way we can use the arrow implementations, but we don't have to make a whole new crate etc |
| arrow-data = { version = "57.2.0", path = "./arrow-data" } | ||
| arrow-ipc = { version = "57.2.0", path = "./arrow-ipc" } | ||
| arrow-json = { version = "57.2.0", path = "./arrow-json" } | ||
| arrow-memory-size = { version = "57.2.0", path = "./arrow-memory-size" } |
There was a problem hiding this comment.
If we are going to do this I would suggest we put it in arrow-buffer or something
There was a problem hiding this comment.
I'm happy to just move it there instead of a new crate, certainly easier
|
I agree we can move this into |
|
What do we think about just adding the new crate in DataFusion? Or is there more to this PR that I am missing? |
|
I haven't had a chance to look through it (2k lines is a lot!) |
alchemist51
left a comment
There was a problem hiding this comment.
Pretty neat PR overall @adriangb ! Few minor comments on the PR.
arrow-array/src/heap_size.rs
Outdated
| fn test_primitive_array_heap_size() { | ||
| let array = Int32Array::from(vec![1, 2, 3, 4, 5]); | ||
| let size = array.heap_size(); | ||
| assert!(size >= 5 * std::mem::size_of::<i32>()); |
There was a problem hiding this comment.
Can it not be accurately compared?
There was a problem hiding this comment.
Changed to be exact comparisons
| //! | ||
| //! #[derive(HeapSize)] | ||
| //! struct MyStruct { | ||
| //! #[heap_size(size_fn = custom_size)] |
There was a problem hiding this comment.
Loved the interface to add custom size for custom objects!
|
|
||
| Key motivations: | ||
|
|
||
| - **Minimal API**: Just two methods (`heap_size()` and `total_size()`) |
There was a problem hiding this comment.
I didn't see the total_size() being exposed, can you point me to that implementation? In case we are targeting for total_size. One can't use the ignore type in FieldAttr right? Since the memory allocation has to belong to either stack or heap? Should we keep that type?
There was a problem hiding this comment.
The ignore option is useful e.g. for caches, shared references where you know / want to semantically treat them as owned by someone else. E.g. Vec<Arc<T>> should still count the Vec and pointers, but not the T.
This PR introduces two new crates: - `arrow-memory-size`: Contains the `HeapSize` trait for estimating heap memory usage, with implementations for std types, arrow-buffer types, and all arrow-array types. - `arrow-memory-size-derive`: Provides `#[derive(HeapSize)]` proc macro for automatically implementing HeapSize on structs and enums. The HeapSize infrastructure was previously in the parquet crate. Moving it to a standalone crate allows users who don't use parquet to depend on memory size estimation utilities. **Breaking Change**: The `#[derive(HeapSize)]` macro has been removed from `parquet_derive`. Users should switch to `arrow-memory-size-derive`. The parquet crate re-exports `HeapSize` from `arrow_memory_size` for backward compatibility. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Move HeapSize implementations from arrow-memory-size to their respective crates (arrow-buffer, arrow-array) - arrow-memory-size now contains only the trait and std impls - Re-export HeapSize and HeapSizeDerive from arrow::util - Add derive macro attributes: #[heap_size(ignore)], #[heap_size(size = N)], #[heap_size(size_fn = path)] - Add type coverage: tuples (up to 12), arrays [T; N], Mutex, RwLock - Add README files with comparison to deepsize and get-size2 crates This allows users who don't use arrow to depend on just arrow-memory-size for the trait, while arrow crates implement HeapSize for their own types. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove prescriptive recommendations about which crates to use - Explain motivations: minimal API, customizable Arc/Rc semantics, Arrow integration - Simplify comparison section to just explain our approach Co-Authored-By: Claude Opus 4.5 <[email protected]>
Tests for the HeapSize derive macro are not parquet-specific, so move them from parquet_derive_test to arrow-memory-size-derive/tests/. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Proc-macro crates cannot resolve intra-doc links to external crates, causing rustdoc errors for the arrow_memory_size::HeapSize references. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace individual HeapSize implementations with a macro to reduce code duplication. A blanket impl<T: Array> is not possible due to Rust's orphan rules (E0210). Co-Authored-By: Claude Opus 4.5 <[email protected]>
aa2d756 to
7245a2a
Compare
|
|
||
| /// Return the total size of this object including heap allocations | ||
| /// and the size of the object itself. | ||
| fn total_size(&self) -> usize { |
There was a problem hiding this comment.
Here is total_size()
The rationale is that we want to use it in DataFusion but don't want to force a dependency on Parquet to use this trait.