[c++] Implement experimental Memory Pool for ManagedQuery read operation#4299
[c++] Implement experimental Memory Pool for ManagedQuery read operation#4299XanthosXanthopoulos merged 13 commits intomainfrom
ManagedQuery read operation#4299Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4299 +/- ##
==========================================
- Coverage 76.14% 75.74% -0.41%
==========================================
Files 233 233
Lines 32591 32767 +176
Branches 1236 1259 +23
==========================================
+ Hits 24818 24820 +2
- Misses 7349 7520 +171
- Partials 424 427 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
ManagedQuery read operationManagedQuery read operation
jp-dark
left a comment
There was a problem hiding this comment.
If this works like I think it will, it should be a big improvement, although I still need to dig a little deeper into some of the nuance. It also might be worth flipping it on by default and running the unit tests against it fully enabled.
| data_.reserve(num_bytes); | ||
| if (is_var_) { | ||
| offsets_.reserve(num_cells + 1); // extra offset for arrow | ||
| } | ||
| if (is_nullable_) { | ||
| validity_.reserve(num_cells); | ||
| } |
There was a problem hiding this comment.
It would be best to address these before another release. If we don't fix them here, we should file a follow-up linear issue.
| tiledb::impl::type_size(attr.type())) * | ||
| memory_budget_unit; | ||
|
|
||
| size_t num_cells = memory_budget_unit; |
There was a problem hiding this comment.
isn't this calculating num bytes, rather than num cells? (apologies if I'm misreading the code - but the units look like bytes)
There was a problem hiding this comment.
In this case memory_budget_unit is also the number of cells due to how memory per column is allocated because for any given column the allocated budget is memory_budget_unit * (element_size + is_var_size * 8 + is_nullable * 1)
| return buffers_.at(names_.front())->size(); | ||
| } | ||
|
|
||
| static bool use_memory_pool(const std::shared_ptr<tiledb::Array>& array); |
|
Question: do all of the unit tests pass if |
bkmartinjr
left a comment
There was a problem hiding this comment.
Overall this looks good (two very small nits noted inline). I'd like to confirm that our unit tests run with soma.read.use_memory_pool enabled.
Other than that, I'm happy with it landing under this feature flag. Lets coordinate on a run of the benchmark suite once it is in main.
bkmartinjr
left a comment
There was a problem hiding this comment.
I belatedly noticed this is lacking an entry in HISTORY.md and the equivalent R file
…ers for reading data
…actor for variable size columns
58c0c09 to
ac15bc8
Compare
Issue and/or context:
Changes:
This PR introduces a new experimental memory allocation mechanism for read operations. Instead of a per column memory budget, the new mechanism uses a global memory budget which is then split to each column based on the column byte size.
The new mechanism allows for more predictable memory usage which is independent of the underlying array schemas.
Additionally it changes how arrow arrays are constructed regardless of the memory allocation method. Each arrow table allocates its own buffers and copies data from the
ManagedQueryread buffers which are reused for all incomplete reads. Additionally dictionary arrays are now allocated similarly to ordinary arrays and can be reused across different columns if the use the sameEnumerationNotes for Reviewer:
The new mechanism also changes the way memory is allocated (using
resizeinstead ofreserveto avoid UB) and requires lower limits. To enable enable the new mechanism you need to setsoma.read.use_memory_pooltotrueandsoma.read.memory_budget(defaults to 256MB)