Allocator unification, attempt three! #9358

ezyang · 2018-07-11T20:58:43Z

Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator

See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Signed-off-by: Edward Z. Yang <[email protected]>

facebook-github-bot

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

aten/src/ATen/Allocator.cpp

aten/src/ATen/Allocator.h

aten/src/THC/THCStorage.hpp

aten/src/ATen/Allocator.h

ezyang · 2018-07-13T16:53:52Z

The Lunch Design Committee(TM) has ordained DevicePtr shall be renamed as DataPtr.

aten/src/ATen/detail/UniqueVoidPtr.h

aten/src/ATen/Allocator.h

See Note [Supervisor deleter] for how SupervisedPtr works. This design is not the obvious one, but there were a lot of constraints feeding into it: - It must support the reallocation usage-pattern, where, given an existing Storage, we allocate a new region of memory, copy the existing data to it, and then deallocate the old region of memory. - Creation of a deleter for memory MUST avoid dynamic allocations in the common case. We've done some benchmarking in Caffe2 where dynamic allocation for deleters is ruinously expensive, and it's really hard to avoid these performance tarpits in very general function wrappers like std::function or folly::Function (while benchmarking this, we discovered that folly::Function's move constructor was way more expensive than it should be). - We need to be able to deallocate data that comes from external sources, e.g., dlpack and numpy tensors. Most notably, you often cannot deallocate these with merely the void* data pointer; you need some extra, out-of-band information (e.g., the managing struct) to deallocate it. Sometimes, you may even want to resize data living in an external source! - The "core" allocators need to support being wrapped in a Thrust allocator, so you need to be implement the following two functions: char* allocate(size_t); void deallocate(char*, size_t); - We need to support tensors which contain non-POD, non-trivially copyable data; specifically tensors of std::string. This is an upcoming requirement from Caffe2. It's dirty AF, but it's really useful. - It should use C++ standard library types like std::unique_ptr (which is hugely problematic because std::unique_ptr doesn't call the deleter when the pointer is null.) Here is the billing of changes: - Built-in support for realloc() has been DROPPED ENTIRELY. Instead, you're expected to allocate and then copy from the old memory to the new memory if you want to do a reallocation. This is what you'd generally have expected to occur; and axing realloc() from the design lets us avoid some tricky correctness issues with std::realloc(), namely the fact that we must refuse the realloc if the type of the elements are not trivially copyeable. If it really matters, we can add this back, but there really needs to be a good explanation WHY you need fast resizing reallocations (by in large, people don't resize their storages, and it should be acceptable to have a performance degradation when they do). - TH_STORAGE_FREEMEM is no more; instead, if you want a storage which doesn't free its result, you just give it an empty deleter. - What we used to call an "allocator" (really, a combined object for allocating/deleting) has been split into two concepts, an allocator, and a smart pointer (SupervisedPtr) which knows how to delete data. - Unlike previously, where THAllocator/THCDeviceAllocator could have a per-tensor context storing extra information (e.g., a pointer to the metadata you need to actually free the tensor), there is no context in the allocator or the deleter of the smart pointer; instead, the smart pointer directly holds an owning reference to the metadata necessary to free the data. This metadata is *freshly manufactured* upon every allocation, which permits us to resize tensors even in the absence of built-in support for realloc(). - By default, allocators don't support "raw" allocations and deallocations with raw pointers. This is because some allocations may return a different context every time, in which case you need to reconstruct the context at delete time (because all you got was a void*, not a unique_ptr that carries the deleter). - The diff between at::Allocator and THCDeviceAllocator is a bit larger: - It used to return a cudaError_t. Now, allocators are expected to check the error status immediately and throw an exception if there was an error. It turns out that this is what was immediately done after all occurrences of allocate/release, so it wasn't a big deal (although some subsidiary interfaces had to themselves be converted to not return cudaError_t). There is one notable exception to this, and it is how we handle CUDA OOM: if this occurs, we attempt to return unused memory to the system and try again. This is now handled by a catch-all try-catch block. The cost of catching the exception is probably the least of your worries if you're about to OOM. - It used to take the CUDA stream to perform the allocation on as an argument. However, it turned out that all call sites, this stream was the stream for the current device. So we can push this into the allocator (and the choice, in the future, could be made explicitly by twiddling thread local state.) - It held two extra methods, emptyCache and cacheInfo, specifically for interacting with some state in THCCachingAllocator. But this "generality" was a lie, since THCCachingAllocator was the only allocator that actually implemented these methods, and there is actually a bunch of code in THC which assumes that it is the caching allocator that is the underlying allocator for CUDA allocations. So I folded these two methods into this interface as THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo. - It held its context directly inside the THCDeviceAllocator struct. This context has been moved out into whatever is holding the at::Allocator*. - The APIs for getting at allocators/deleters is now a little different. - Previously there were a bunch of static variables you could get the address of (e.g., &THDefaultAllocator); now there is a function getTHDefaultAllocator(). - Some "allocators" didn't actually know how to allocate (e.g., the IPC "allocator"). These have been deleted; instead, you can wrap the produced pointers into SupervisedPtr using an appropriate makeSupervisedPtr() static method. - Storage sharing was a lot of work to wrangle, but I think I've tamed the beast. - THMapAllocator and its "subclasses" have been refactored to be proper, honest to goodness C++ classes. I used the enum argument trick to get "named" constructors. We use inheritance to add refcounting and management (in libshm). What we previously called the "Context" class (Context has been dropped from the name) is now the supervisor for the data. - Sometimes, we need to pull out the file descriptor from a tensor. Previously, it was pulled out of the allocator context. Now, we pull it out of the supervisor of the SupervisorPtr, using the static method fromSupervisedPtr(), which uses the deleter as the typeid, and refines the type if it matches. - I renamed the std::function deleter into InefficientStdFunctionSupervisor, to emphasize the fact that it does a dynamic allocation to save the std::function deleter. TODO: - Windows libshm is in shambles and needs to be fixed. Perhaps for the future: - newFromFd is now unconditionally calling cudaPointerGetAttributes even though this is unnecessary, because we know what the device is from higher up in the callstack. We can fix this by making newWithDataAndAllocator also take an explicit device argument. - Consider statically distinguishing between allocators that support raw_allocate/raw_deallocate, and those which don't. The Thrust constraint applies only to the CUDA device allocator; you never need to allocate CPU memory this way - Really want to get rid of storage views. Ugh. Nontrivial bugs I noticed when preparing this patch: - I forgot to placement-new unique pointers and attempted to assign them directly on uninitialized memory; very bad! Sam Gross has encouraged me to replace this with a proper constructor but I keep putting it off, because once everything goes in StorageImpl there really will be a proper constructor. - I rewrote a number of APIs to use newWithDataAndAllocator instead of newWithAllocator, calling the allocator at the call site (because they required "allocation context" which we no longer give to "allocators"). When I did this, I forgot to insert the multiplication with sizeof(real) to scale from numels to number of bytes. - The implementation of swap on storages was missing it for scalarType and backend. It was benign (because the only case we call swap is when these are the same), but I fixed it anyway. - I accidentally returned a nullptr unique_ptr with no deleter, even though there was a legitimate one. This matters, because some code still shoves its hands in the deleter context to get extra metadata about the function. - I used std::move() on a unique_ptr, and then did a boolean test on the pointer aftewards (always false!) Signed-off-by: Edward Z. Yang <[email protected]>

Signed-off-by: Edward Z. Yang <[email protected]>

ezyang · 2018-07-14T18:23:14Z

@pytorchbot retest this please

…9358) Summary: See Note [Supervisor deleter] for how SupervisedPtr works. This design is not the obvious one, but there were a lot of constraints feeding into it: - It must support the reallocation usage-pattern, where, given an existing Storage, we allocate a new region of memory, copy the existing data to it, and then deallocate the old region of memory. - Creation of a deleter for memory MUST avoid dynamic allocations in the common case. We've done some benchmarking in Caffe2 where dynamic allocation for deleters is ruinously expensive, and it's really hard to avoid these performance tarpits in very general function wrappers like std::function or folly::Function (while benchmarking this, we discovered that folly::Function's move constructor was way more expensive than it should be). - We need to be able to deallocate data that comes from external sources, e.g., dlpack and numpy tensors. Most notably, you often cannot deallocate these with merely the void* data pointer; you need some extra, out-of-band information (e.g., the managing struct) to deallocate it. Sometimes, you may even want to resize data living in an external source! - The "core" allocators need to support being wrapped in a Thrust allocator, so you need to be implement the following two functions: char* allocate(size_t); void deallocate(char*, size_t); - We need to support tensors which contain non-POD, non-trivially copyable data; specifically tensors of std::string. This is an upcoming requirement from Caffe2. It's dirty AF, but it's really useful. - It should use C++ standard library types like std::unique_ptr (which is hugely problematic because std::unique_ptr doesn't call the deleter when the pointer is null.) Here is the billing of changes: - Built-in support for realloc() has been DROPPED ENTIRELY. Instead, you're expected to allocate and then copy from the old memory to the new memory if you want to do a reallocation. This is what you'd generally have expected to occur; and axing realloc() from the design lets us avoid some tricky correctness issues with std::realloc(), namely the fact that we must refuse the realloc if the type of the elements are not trivially copyeable. If it really matters, we can add this back, but there really needs to be a good explanation WHY you need fast resizing reallocations (by in large, people don't resize their storages, and it should be acceptable to have a performance degradation when they do). - TH_STORAGE_FREEMEM is no more; instead, if you want a storage which doesn't free its result, you just give it an empty deleter. - What we used to call an "allocator" (really, a combined object for allocating/deleting) has been split into two concepts, an allocator, and a smart pointer (SupervisedPtr) which knows how to delete data. - Unlike previously, where THAllocator/THCDeviceAllocator could have a per-tensor context storing extra information (e.g., a pointer to the metadata you need to actually free the tensor), there is no context in the allocator or the deleter of the smart pointer; instead, the smart pointer directly holds an owning reference to the metadata necessary to free the data. This metadata is *freshly manufactured* upon every allocation, which permits us to resize tensors even in the absence of built-in support for realloc(). - By default, allocators don't support "raw" allocations and deallocations with raw pointers. This is because some allocations may return a different context every time, in which case you need to reconstruct the context at delete time (because all you got was a void*, not a unique_ptr that carries the deleter). - The diff between at::Allocator and THCDeviceAllocator is a bit larger: - It used to return a cudaError_t. Now, allocators are expected to check the error status immediately and throw an exception if there was an error. It turns out that this is what was immediately done after all occurrences of allocate/release, so it wasn't a big deal (although some subsidiary interfaces had to themselves be converted to not return cudaError_t). There is one notable exception to this, and it is how we handle CUDA OOM: if this occurs, we attempt to return unused memory to the system and try again. This is now handled by a catch-all try-catch block. The cost of catching the exception is probably the least of your worries if you're about to OOM. - It used to take the CUDA stream to perform the allocation on as an argument. However, it turned out that all call sites, this stream was the stream for the current device. So we can push this into the allocator (and the choice, in the future, could be made explicitly by twiddling thread local state.) - It held two extra methods, emptyCache and cacheInfo, specifically for interacting with some state in THCCachingAllocator. But this "generality" was a lie, since THCCachingAllocator was the only allocator that actually implemented these methods, and there is actually a bunch of code in THC which assumes that it is the caching allocator that is the underlying allocator for CUDA allocations. So I folded these two methods into this interface as THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo. - It held its context directly inside the THCDeviceAllocator struct. This context has been moved out into whatever is holding the at::Allocator*. - The APIs for getting at allocators/deleters is now a little different. - Previously there were a bunch of static variables you could get the address of (e.g., &THDefaultAllocator); now there is a function getTHDefaultAllocator(). - Some "allocators" didn't actually know how to allocate (e.g., the IPC "allocator"). These have been deleted; instead, you can wrap the produced pointers into SupervisedPtr using an appropriate makeSupervisedPtr() static method. - Storage sharing was a lot of work to wrangle, but I think I've tamed the beast. - THMapAllocator and its "subclasses" have been refactored to be proper, honest to goodness C++ classes. I used the enum argument trick to get "named" constructors. We use inheritance to add refcounting and management (in libshm). What we previously called the "Context" class (Context has been dropped from the name) is now the supervisor for the data. - Sometimes, we need to pull out the file descriptor from a tensor. Previously, it was pulled out of the allocator context. Now, we pull it out of the supervisor of the SupervisorPtr, using the static method fromSupervisedPtr(), which uses the deleter as the typeid, and refines the type if it matches. - I renamed the std::function deleter into InefficientStdFunctionSupervisor, to emphasize the fact that it does a dynamic allocation to save the std::function deleter. TODO: - Windows libshm is in shambles and needs to be fixed. Perhaps for the future: - newFromFd is now unconditionally calling cudaPointerGetAttributes even though this is unnecessary, because we know what the device is from higher up in the callstack. We can fix this by making newWithDataAndAllocator also take an explicit device argument. - Consider statically distinguishing between allocators that support raw_allocate/raw_deallocate, and those which don't. The Thrust constraint applies only to the CUDA device allocator; you never need to allocate CPU memory this way - Really want to get rid of storage views. Ugh. Nontrivial bugs I noticed when preparing this patch: - I forgot to placement-new unique pointers and attempted to assign them directly on uninitialized memory; very bad! Sam Gross has encouraged me to replace this with a proper constructor but I keep putting it off, because once everything goes in StorageImpl there really will be a proper constructor. - I rewrote a number of APIs to use newWithDataAndAllocator instead of newWithAllocator, calling the allocator at the call site (because they required "allocation context" which we no longer give to "allocators"). When I did this, I forgot to insert the multiplication with sizeof(real) to scale from numels to number of bytes. - The implementation of swap on storages was missing it for scalarType and backend. It was benign (because the only case we call swap is when these are the same), but I fixed it anyway. - I accidentally returned a nullptr unique_ptr with no deleter, even though there was a legitimate one. This matters, because some code still shoves its hands in the deleter context to get extra metadata about the function. - I used std::move() on a unique_ptr, and then did a boolean test on the pointer aftewards (always false!) Pull Request resolved: pytorch/pytorch#9358 Reviewed By: SsnL Differential Revision: D8811822 Pulled By: ezyang fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75

* upstream/master: (24 commits) Implement tensor weak references (pytorch#9363) Nuke TestCollectEnv (pytorch#9459) Add test case for segmentation fault fix in grad_fn (pytorch#9457) Add peephole optimization for type_as operators. (pytorch#9316) Fix out-of-range error for test_neg (pytorch#9431) add depthwise conv support for mkldnn (pytorch#8782) Refactor `_log_sum_exp` (pytorch#9173) Add ModuleDict and ParameterDict containers (pytorch#8463) Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (pytorch#9358) Introducing IsInf (pytorch#9169) add device to CUDAEvent (pytorch#9415) Make localScalar error message more intuitive (pytorch#9443) Only accept continguous tensors in TopK for cuda (pytorch#9441) Add support for .norm() pytorch onnx export and ReduceL1/ReduceL2 caffe2 operators (pytorch#9299) Only view() rhs of index_put if we need to (pytorch#9424) Add BatchBucketizeOp in caffe2 (pytorch#9385) Implementation of Wngrad optimizer caffe2 python wrapper and unit test on least square regression (pytorch#9001) Implementation and operator test for Wngrad optimizer (pytorch#8999) Fix segmentation fault in grad_fn (pytorch#9292) update docs (pytorch#9423) ...

…ytorch#9358) Summary: See Note [Supervisor deleter] for how SupervisedPtr works. This design is not the obvious one, but there were a lot of constraints feeding into it: - It must support the reallocation usage-pattern, where, given an existing Storage, we allocate a new region of memory, copy the existing data to it, and then deallocate the old region of memory. - Creation of a deleter for memory MUST avoid dynamic allocations in the common case. We've done some benchmarking in Caffe2 where dynamic allocation for deleters is ruinously expensive, and it's really hard to avoid these performance tarpits in very general function wrappers like std::function or folly::Function (while benchmarking this, we discovered that folly::Function's move constructor was way more expensive than it should be). - We need to be able to deallocate data that comes from external sources, e.g., dlpack and numpy tensors. Most notably, you often cannot deallocate these with merely the void* data pointer; you need some extra, out-of-band information (e.g., the managing struct) to deallocate it. Sometimes, you may even want to resize data living in an external source! - The "core" allocators need to support being wrapped in a Thrust allocator, so you need to be implement the following two functions: char* allocate(size_t); void deallocate(char*, size_t); - We need to support tensors which contain non-POD, non-trivially copyable data; specifically tensors of std::string. This is an upcoming requirement from Caffe2. It's dirty AF, but it's really useful. - It should use C++ standard library types like std::unique_ptr (which is hugely problematic because std::unique_ptr doesn't call the deleter when the pointer is null.) Here is the billing of changes: - Built-in support for realloc() has been DROPPED ENTIRELY. Instead, you're expected to allocate and then copy from the old memory to the new memory if you want to do a reallocation. This is what you'd generally have expected to occur; and axing realloc() from the design lets us avoid some tricky correctness issues with std::realloc(), namely the fact that we must refuse the realloc if the type of the elements are not trivially copyeable. If it really matters, we can add this back, but there really needs to be a good explanation WHY you need fast resizing reallocations (by in large, people don't resize their storages, and it should be acceptable to have a performance degradation when they do). - TH_STORAGE_FREEMEM is no more; instead, if you want a storage which doesn't free its result, you just give it an empty deleter. - What we used to call an "allocator" (really, a combined object for allocating/deleting) has been split into two concepts, an allocator, and a smart pointer (SupervisedPtr) which knows how to delete data. - Unlike previously, where THAllocator/THCDeviceAllocator could have a per-tensor context storing extra information (e.g., a pointer to the metadata you need to actually free the tensor), there is no context in the allocator or the deleter of the smart pointer; instead, the smart pointer directly holds an owning reference to the metadata necessary to free the data. This metadata is *freshly manufactured* upon every allocation, which permits us to resize tensors even in the absence of built-in support for realloc(). - By default, allocators don't support "raw" allocations and deallocations with raw pointers. This is because some allocations may return a different context every time, in which case you need to reconstruct the context at delete time (because all you got was a void*, not a unique_ptr that carries the deleter). - The diff between at::Allocator and THCDeviceAllocator is a bit larger: - It used to return a cudaError_t. Now, allocators are expected to check the error status immediately and throw an exception if there was an error. It turns out that this is what was immediately done after all occurrences of allocate/release, so it wasn't a big deal (although some subsidiary interfaces had to themselves be converted to not return cudaError_t). There is one notable exception to this, and it is how we handle CUDA OOM: if this occurs, we attempt to return unused memory to the system and try again. This is now handled by a catch-all try-catch block. The cost of catching the exception is probably the least of your worries if you're about to OOM. - It used to take the CUDA stream to perform the allocation on as an argument. However, it turned out that all call sites, this stream was the stream for the current device. So we can push this into the allocator (and the choice, in the future, could be made explicitly by twiddling thread local state.) - It held two extra methods, emptyCache and cacheInfo, specifically for interacting with some state in THCCachingAllocator. But this "generality" was a lie, since THCCachingAllocator was the only allocator that actually implemented these methods, and there is actually a bunch of code in THC which assumes that it is the caching allocator that is the underlying allocator for CUDA allocations. So I folded these two methods into this interface as THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo. - It held its context directly inside the THCDeviceAllocator struct. This context has been moved out into whatever is holding the at::Allocator*. - The APIs for getting at allocators/deleters is now a little different. - Previously there were a bunch of static variables you could get the address of (e.g., &THDefaultAllocator); now there is a function getTHDefaultAllocator(). - Some "allocators" didn't actually know how to allocate (e.g., the IPC "allocator"). These have been deleted; instead, you can wrap the produced pointers into SupervisedPtr using an appropriate makeSupervisedPtr() static method. - Storage sharing was a lot of work to wrangle, but I think I've tamed the beast. - THMapAllocator and its "subclasses" have been refactored to be proper, honest to goodness C++ classes. I used the enum argument trick to get "named" constructors. We use inheritance to add refcounting and management (in libshm). What we previously called the "Context" class (Context has been dropped from the name) is now the supervisor for the data. - Sometimes, we need to pull out the file descriptor from a tensor. Previously, it was pulled out of the allocator context. Now, we pull it out of the supervisor of the SupervisorPtr, using the static method fromSupervisedPtr(), which uses the deleter as the typeid, and refines the type if it matches. - I renamed the std::function deleter into InefficientStdFunctionSupervisor, to emphasize the fact that it does a dynamic allocation to save the std::function deleter. TODO: - Windows libshm is in shambles and needs to be fixed. Perhaps for the future: - newFromFd is now unconditionally calling cudaPointerGetAttributes even though this is unnecessary, because we know what the device is from higher up in the callstack. We can fix this by making newWithDataAndAllocator also take an explicit device argument. - Consider statically distinguishing between allocators that support raw_allocate/raw_deallocate, and those which don't. The Thrust constraint applies only to the CUDA device allocator; you never need to allocate CPU memory this way - Really want to get rid of storage views. Ugh. Nontrivial bugs I noticed when preparing this patch: - I forgot to placement-new unique pointers and attempted to assign them directly on uninitialized memory; very bad! Sam Gross has encouraged me to replace this with a proper constructor but I keep putting it off, because once everything goes in StorageImpl there really will be a proper constructor. - I rewrote a number of APIs to use newWithDataAndAllocator instead of newWithAllocator, calling the allocator at the call site (because they required "allocation context" which we no longer give to "allocators"). When I did this, I forgot to insert the multiplication with sizeof(real) to scale from numels to number of bytes. - The implementation of swap on storages was missing it for scalarType and backend. It was benign (because the only case we call swap is when these are the same), but I fixed it anyway. - I accidentally returned a nullptr unique_ptr with no deleter, even though there was a legitimate one. This matters, because some code still shoves its hands in the deleter context to get extra metadata about the function. - I used std::move() on a unique_ptr, and then did a boolean test on the pointer aftewards (always false!) Pull Request resolved: pytorch#9358 Reviewed By: SsnL Differential Revision: D8811822 Pulled By: ezyang fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75

ezyang requested review from apaszke, colesbury, gchanan, soumith and zdevito as code owners July 11, 2018 20:58

facebook-github-bot reviewed Jul 11, 2018

View reviewed changes

ezyang force-pushed the pr/allocator-deleter2 branch from 82d33a2 to 86a7e75 Compare July 11, 2018 22:02

facebook-github-bot reviewed Jul 11, 2018

View reviewed changes

facebook-github-bot reviewed Jul 12, 2018

View reviewed changes

ezyang changed the title ~~[TESTING] New allocator PR~~ Allocator unification, attempt three! Jul 12, 2018

gchanan reviewed Jul 12, 2018

View reviewed changes

aten/src/ATen/Allocator.h Outdated

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

This was referenced Jul 13, 2018

Unify THAllocator and THCDeviceAllocator with at::Allocator/Deleter #9278

Closed

Introduce libtorch to setup.py build #8792

Closed

cpuhrsch reviewed Jul 13, 2018

View reviewed changes

aten/src/ATen/Allocator.h Outdated

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

gchanan reviewed Jul 13, 2018

View reviewed changes

aten/src/ATen/Allocator.h Outdated

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

gchanan approved these changes Jul 13, 2018

View reviewed changes

gchanan reviewed Jul 13, 2018

View reviewed changes

aten/src/ATen/Allocator.h Outdated

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

ezyang added 9 commits July 13, 2018 14:28

Get rid of exception throwing in destructor if possible.

0a2cec8

Signed-off-by: Edward Z. Yang <[email protected]>

Fix nullptr issue

56d4378

Signed-off-by: Edward Z. Yang <[email protected]>

A stab at Windows

04e5712

Signed-off-by: Edward Z. Yang <[email protected]>

Move more exceptions out of destructor.

b1f6f8e

Signed-off-by: Edward Z. Yang <[email protected]>

Hoist Windows include.

7357d26

Signed-off-by: Edward Z. Yang <[email protected]>

winfix

a551a7a

Signed-off-by: Edward Z. Yang <[email protected]>

fix vcall from destructor bug

49643fb

Signed-off-by: Edward Z. Yang <[email protected]>

Windows maybe

b2b81b7

Signed-off-by: Edward Z. Yang <[email protected]>

ezyang added 13 commits July 13, 2018 14:28

atest fix

58650c3

Signed-off-by: Edward Z. Yang <[email protected]>

winfix

c1da28d

Signed-off-by: Edward Z. Yang <[email protected]>

atest fix

df87f29

Signed-off-by: Edward Z. Yang <[email protected]>

Move device into SupervisorPTr

6d8b370

Signed-off-by: Edward Z. Yang <[email protected]>

Bugfixes for misclassified allocators.

3561bdd

Signed-off-by: Edward Z. Yang <[email protected]>

commentfix

b078535

Signed-off-by: Edward Z. Yang <[email protected]>

Split out DevicePtr from SupervisedPtr

07262cd

Signed-off-by: Edward Z. Yang <[email protected]>

Expunge newWithData references.

5268318

Signed-off-by: Edward Z. Yang <[email protected]>

Rename to DataPtr, and rejigger abstraction boundary.

f143084

Signed-off-by: Edward Z. Yang <[email protected]>

Add missing file

44689bd

Signed-off-by: Edward Z. Yang <[email protected]>

winfix

b892128

Signed-off-by: Edward Z. Yang <[email protected]>

buildfix

de6a08f

Signed-off-by: Edward Z. Yang <[email protected]>

CR comments

ebbe790

Signed-off-by: Edward Z. Yang <[email protected]>

ezyang force-pushed the pr/allocator-deleter2 branch from 220cf25 to ebbe790 Compare July 14, 2018 16:55

facebook-github-bot closed this in cffca29 Jul 15, 2018

ezyang added the merged label Jun 26, 2019

Allocator unification, attempt three! #9358

Allocator unification, attempt three! #9358

Uh oh!

Conversation

ezyang commented Jul 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

ezyang commented Jul 13, 2018

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

ezyang commented Jul 11, 2018 •

edited

Loading