
No attempt at garbage collection when memory allocation fails in thrust/cublas/... #2794

@cschreib-ibex

Description


In our software we run a large number of calculations using ArrayFire, using a large fraction of the memory on the GPU. These calculations can be repeated an indefinite number of times by our customers, and because of ArrayFire's garbage collection policy, memory usage increases continuously until out-of-memory. For best performance, we never call the garbage collector manually: we just let ArrayFire run it whenever it needs to, i.e., when it has no room to allocate a new array. I believe this is how GCs are meant to be used.

This works fine in most cases, but not all the time. A typical case where this fails: we are running close to the memory limit of our GPU, the GC has not been triggered yet, and we call an AF function that uses third-party CUDA code (thrust, cublas, etc.). If that third-party code makes CUDA allocations internally, and the amount of memory required exceeds the available memory, the code rightfully throws, because it is not aware of AF's garbage collection model.

The issue is that this exception is currently propagated straight back to us.

As suggested in #2478 (comment) (for a different but similar issue), we could in principle catch af::exception in our code, check for AF_ERR_NO_MEM, call the GC ourselves, and repeat the operation that failed. However, there are hundreds of AF operations we would have to wrap this way, which is not maintainable.

The other alternative is to manually trigger the GC after each "major" calculation step. However, there is no way for us to reliably address the issue this way: how often we need to call the GC depends on the data volume we have to process (larger images trigger the problem sooner). The only way to be sure we never run into it would be to call the GC after every AF operation, which is of course a no-go. Furthermore, even if we do find an acceptable set of locations at which to call the GC, this incurs a performance penalty, since we would be running the GC more often than strictly needed. Given some of our customers' constraints, this is something we may not be able to afford.

The ideal solution would be for ArrayFire to handle this internally, since it knows when calls to third-party libraries are made, and it knows exactly which operations need to be repeated if an internal memory allocation fails, with minimal performance and maintenance overhead for the user.

For reference, the out-of-memory exception we received came from af::regions, and adding an af::deviceGC() just before that call got rid of the exception. This is with ArrayFire 3.6.4, since 3.7.0 appears to trigger the issue more often in our tests.
