ARROW-530: [C++/Python] Provide subpools for better memory allocation …#2057
ARROW-530: [C++/Python] Provide subpools for better memory allocation …#2057rok wants to merge 9 commits intoapache:masterfrom
Conversation
xhochy
left a comment
There was a problem hiding this comment.
Implementation-wise this looks good. Also for me it seems as this pool will already tracks only the allocations made for it. To complete this PR, please add some C++ unit tests that check the implementation.
Also have a look at Travis which reports some linting errors.
cpp/src/arrow/memory_pool.cc
Outdated
| Status ProxyMemoryPool::Allocate(int64_t size, uint8_t** out) { | ||
| proxy_bytes_allocated_ += size; | ||
| Status s = pool_->Allocate(size, out); | ||
| std::cout << "Allocate: size = " << size << std::endl; |
There was a problem hiding this comment.
Please remove the calls to std::cout. If the user wants this, he should explicitly chose the LoggingMemoryPool.
|
@rok Code and tests looks good. Could you fix the linting errors mentioned in Travis? |
|
@xhochy - will do, traveling at the moment, better availability from Monday on. |
Codecov Report
@@ Coverage Diff @@
## master #2057 +/- ##
==========================================
- Coverage 87.44% 86.35% -1.09%
==========================================
Files 189 230 +41
Lines 29376 40513 +11137
==========================================
+ Hits 25687 34987 +9300
- Misses 3689 5526 +1837
Continue to review full report at Codecov.
|
cpp/src/arrow/memory_pool.cc
Outdated
| proxy_bytes_allocated_ -= size; | ||
| } | ||
|
|
||
| int64_t ProxyMemoryPool::bytes_allocated() const { |
There was a problem hiding this comment.
Just realised that bytes_allocated refers to the parent Pool. I would have expected that it would return only the allocation of the currrent scope (i.e. proxy_bytes_allocated_).
|
@xhochy - agreed that is a more intuitive interface. Now the ProxyMemoryPool object should return bytes allocated through it with bytes_allocated() method. |
cpp/src/arrow/memory_pool.cc
Outdated
|
|
||
| int64_t ProxyMemoryPool::bytes_allocated() const { | ||
| int64_t nb_bytes = bytes_allocated_; | ||
| return nb_bytes; |
There was a problem hiding this comment.
Use num or number here instead of nb
cpp/src/arrow/memory_pool.cc
Outdated
|
|
||
| int64_t ProxyMemoryPool::max_memory() const { | ||
| int64_t mem = pool_->max_memory(); | ||
| return mem; |
|
|
||
| Status ProxyMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { | ||
| Status s = pool_->Reallocate(old_size, new_size, ptr); | ||
| bytes_allocated_ += new_size - old_size; |
There was a problem hiding this comment.
If pool_->Reallocate fails, bytes_allocated_ will be incorrect
|
|
||
| Status ProxyMemoryPool::Allocate(int64_t size, uint8_t** out) { | ||
| Status s = pool_->Allocate(size, out); | ||
| bytes_allocated_ += size; |
There was a problem hiding this comment.
If pool_->Allocate fails, bytes_allocated_ will be incorrect
| MemoryPool* pool_; | ||
| }; | ||
|
|
||
| class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { |
There was a problem hiding this comment.
Can we consider some alternative names than "Proxy"? Can you add a doxygen comment describing what this class is?
There was a problem hiding this comment.
How about SubMemoryPool? MemorySubPool? PartialMemoryPool?
cpp/src/arrow/memory_pool.h
Outdated
| class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { | ||
| public: | ||
| explicit ProxyMemoryPool(MemoryPool* pool); | ||
| ~ProxyMemoryPool() override = default; |
cpp/src/arrow/memory_pool.h
Outdated
|
|
||
| private: | ||
| MemoryPool* pool_; | ||
| int64_t bytes_allocated_ = 0; |
| self.init(self.logging_pool.get()) | ||
|
|
||
|
|
||
| cdef class ProxyMemoryPool(MemoryPool): |
There was a problem hiding this comment.
Other classes are not documented in memory.pxi. Is there another location for documentation I'm missing?
There was a problem hiding this comment.
No, this is the main location where we should document things. We were not so good with adding docstrings in the past. I'll probably make a pass over the old classes soon and add some documentation but we should start adding documentation to new classes once we add them.
There was a problem hiding this comment.
Got it. Would be happy to help with the old classes as well.
There was a problem hiding this comment.
Would be happy to help with the old classes as well.
That would be very much appreciated!
python/pyarrow/memory.pxi
Outdated
| cdef LoggingMemoryPool _logging_memory_pool = ( | ||
| LoggingMemoryPool(_default_memory_pool)) | ||
| cdef ProxyMemoryPool _proxy_memory_pool = ( | ||
| ProxyMemoryPool(_default_memory_pool)) |
Cannot think of anything better at the moment, I think it's ok. |
|
Looks good except for a small formatting issue: https://travis-ci.org/apache/arrow/jobs/387368172#L1227 Would merge on a green build. |
|
It seems tests in Travis failed due to timeouts: https://travis-ci.org/apache/arrow/jobs/388699104#L4333, https://travis-ci.org/apache/arrow/jobs/388699106#L523. |
xhochy
left a comment
There was a problem hiding this comment.
+1, LGTM
Travis tests passed after I retriggered them.
|
Thanks @rok . I assigned the jira to |
|
@xhochy - yeah, that's me, thanks! :) |
…tracking
Currently we can only track the amount of bytes allocated by the main memory pool or the alternative jemalloc implementation. To better understand certain situation, we should provide a MemoryPool proxy implementation that tracks only the amount of memory that was made through its direct calls but delegates the actual allocation to an underlying pool.
Authors: Rok Mihevc [email protected] and Alex Hagerman [email protected]
Usage example:
import pyarrow as pa
def report(pool, proxy_pool):
print("Total: ", pa.total_allocated_bytes())
print("Default pool: ", pool.bytes_allocated())
print("Proxy allocated: ", proxy_pool.proxy_bytes_allocated())
pool = pa.default_memory_pool()
proxy_pool = pa.ProxyMemoryPool(pool)
report(pool, proxy_pool)
a1 = pa.array([0]*1000, memory_pool=pool)
report(pool, proxy_pool)
a2 = pa.array([0]*1, memory_pool=proxy_pool)
report(pool, proxy_pool)
a3 = pa.array([0]*1000, memory_pool=proxy_pool)
report(pool, proxy_pool)
a3 = pa.array([0]*1000, memory_pool=proxy_pool)
report(pool, proxy_pool)
Result:
Total: 0
Default pool: 0
bytes_allocated: 0
Proxy allocated: 0
Total: 8128
Default pool: 8128
bytes_allocated: 0
Proxy allocated: 0
Allocate: size = 64
Allocate: size = 256
Reallocate: old_size = 256 - new_size = 64 - proxy_allocated - 128
Total: 8256
Default pool: 8256
bytes_allocated: 128
Proxy allocated: 128
Allocate: size = 128
Allocate: size = 8192
Reallocate: old_size = 8192 - new_size = 8000 - proxy_allocated - 8256
Total: 16384
Default pool: 16384
bytes_allocated: 8256
Proxy allocated: 8256
Allocate: size = 128
Allocate: size = 8192
Reallocate: old_size = 8192 - new_size = 8000 - proxy_allocated - 16384
Free: size = 128
Free: size = 8000
Total: 16384
Default pool: 16384
bytes_allocated: 8256
Proxy allocated: 8256