
[feature request] Caching allocator diagnostics and memory allocation tracing/visualization #1529

@vadimkantorov


It would be cool to be able to peek into (and control) the state of the caching allocator, with things like:

  • Total cached memory

  • Total currently used memory, referenced by Tensors

  • Forced free of unused segments

  • Tracing of memory allocations (along with some measure of fragmentation) and deallocations (both logical and physical). This would be useful for custom analysis scripts and for understanding the reason for an OOM (fragmentation or an actual lack of memory)

  • Stats about currently existing tensors (if possible; otherwise, with a full trace, one could implement this post hoc): type, sizes, GPU device. If there were a way to dump the allocation timestamp, that would be cool too (it would allow tracking memory leaks fairly reliably)

  • Dump information about all existing tensors / storages with their refcounts, so that a simple visualization of fragmentation can be built (hopefully annotated with what required them for backward)

  • Built-in minimal tool for visualization of used memory (cached memory / used memory) into an SVG/HTML string

  • Arena allocators (freeing all memory in one go, optionally preallocating memory in one go with an upper memory limit, configurable block sizes, per-allocator stats on existing tensors referencing the memory, optional support for CUDA Unified Memory)
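
To make the "fragmentation or actual lack of memory" distinction concrete, here is a minimal sketch of the kind of post-hoc analysis a trace would enable. It is plain Python and does not use any PyTorch API. The function name `diagnose_oom`, its inputs (a list of free block sizes in bytes), and the fragmentation measure (one minus the largest free block over total free memory) are all assumptions made for illustration, not an existing interface.

```python
def diagnose_oom(free_blocks, request):
    """Classify why a request of `request` bytes would fail (hypothetical helper).

    free_blocks: sizes in bytes of the currently free, non-contiguous blocks.
    Returns (reason, fragmentation), where fragmentation is in [0, 1].
    """
    total_free = sum(free_blocks)
    largest = max(free_blocks, default=0)
    # Assumed measure: 0.0 when all free memory is one block,
    # approaching 1.0 when free memory is split into many small blocks.
    fragmentation = 0.0 if total_free == 0 else 1.0 - largest / total_free

    if request <= largest:
        reason = "fits"               # some single free block is big enough
    elif request <= total_free:
        reason = "fragmentation"      # enough bytes overall, but no block fits
    else:
        reason = "out_of_memory"      # not enough free bytes at all
    return reason, fragmentation


# 1024 bytes are free in total, but the largest block is only 512 bytes.
reason, frag = diagnose_oom([256, 256, 512], request=768)
print(reason, frag)
```

With a real allocation trace, the `free_blocks` list would be reconstructed from the allocator's state at the moment of the failed allocation.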

With the caching allocator it's sometimes hard to understand what's happening with memory: after some big allocations / deallocations, the memory reported by nvidia-smi stays high and doesn't reflect actual usage.
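
The gap between "cached" and "used" memory can be illustrated with a toy model. This is only a sketch under simplifying assumptions (no block splitting or merging, sizes in bytes), and `ToyCachingAllocator` with all of its methods is hypothetical. It is not how the real allocator is implemented. It only shows the counters and trace events asked for above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCachingAllocator:
    """Hypothetical stand-in for a caching allocator; sizes are in bytes."""
    cached: list = field(default_factory=list)  # free blocks kept for reuse
    in_use: dict = field(default_factory=dict)  # handle -> block size
    trace: list = field(default_factory=list)   # (event, size) records
    _next_handle: int = 0

    def malloc(self, size):
        # Reuse the smallest cached block that fits, else allocate "physically".
        fits = [b for b in self.cached if b >= size]
        if fits:
            block = min(fits)
            self.cached.remove(block)
            self.trace.append(("reuse", block))
        else:
            block = size
            self.trace.append(("physical_alloc", block))
        handle = self._next_handle
        self._next_handle += 1
        self.in_use[handle] = block
        return handle

    def free(self, handle):
        # Logical free: the block goes back to the cache, not to the device.
        block = self.in_use.pop(handle)
        self.cached.append(block)
        self.trace.append(("logical_free", block))

    def empty_cache(self):
        # Physical free of every unused block; returns the bytes released.
        released = sum(self.cached)
        self.trace.extend(("physical_free", b) for b in self.cached)
        self.cached.clear()
        return released

    def stats(self):
        used = sum(self.in_use.values())
        return {"used": used, "reserved": used + sum(self.cached)}


alloc = ToyCachingAllocator()
a = alloc.malloc(1024)
b = alloc.malloc(512)
alloc.free(a)
before = alloc.stats()          # "reserved" stays high although `a` was freed
released = alloc.empty_cache()  # forced free of unused blocks
after = alloc.stats()
print(before, released, after)
```

In this model, "reserved" plays the role of the number an external tool such as nvidia-smi would roughly see, while "used" is what live tensors actually reference.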

Labels: feature, module: memory usage, triaged