
No attempt at garbage collection when memory allocation fails in thrust/cublas/... #2794

@cschreib-ibex

Description


In our software we run a large number of calculations using ArrayFire, using a large fraction of the memory on the GPU. These calculations can be repeated an indefinite number of times by our customers, and because of ArrayFire's garbage collection policy, memory usage increases continuously until out-of-memory. For best performance, we never call the garbage collector manually: we just let ArrayFire run it whenever it needs to, i.e., when it has no room to allocate a new array. I believe this is how GCs are meant to be used.

This works fine in most cases, but not all the time. A typical case where this fails: we are running close to the memory limit of our GPU, the GC has not been triggered yet, and we call an AF function that uses third-party CUDA code (thrust, cublas, etc.). If that third-party code makes CUDA allocations internally, and the amount of memory required exceeds the available memory, the code rightfully throws, because it is not aware of AF's garbage collection model.

The issue is that this exception is currently propagated straight back to us.

As suggested in #2478 (comment) (for a different but similar issue), we could in principle catch af::exception in our code, check for AF_ERR_NO_MEM, call the GC ourselves, and repeat the operation that failed. However, there are hundreds of AF operations we would have to wrap this way, which is not maintainable.

The other alternative is to manually trigger the GC after each "major" calculation step. However, there is no way for us to reliably address the issue this way: how often we need to call the GC depends on the data volume we have to process (larger images trigger the problem sooner). The only way to be sure we never run into it would be to call the GC after every AF operation, which is of course a no-go. Furthermore, even if we do find an acceptable set of locations at which to call the GC, this incurs a performance penalty, since we would be running the GC more often than strictly needed. Given some of our customers' constraints, this is something we may not be able to afford.

The ideal solution would be for ArrayFire to handle this internally, since it knows when calls to third-party libraries are made, and it knows exactly which operations need to be repeated if an internal memory allocation fails, with minimal performance and maintenance overhead for the user.

For reference, the out-of-memory exception we received came from af::regions, and adding an af::deviceGC() just before that call got rid of the exception. This is with ArrayFire 3.6.4, since 3.7.0 appears to trigger the issue more often in our tests.
