Skip to content

Conversation

@ezyang
Copy link
Contributor

@ezyang ezyang commented Jul 11, 2018

Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator

See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Signed-off-by: Edward Z. Yang <[email protected]>

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang ezyang force-pushed the pr/allocator-deleter2 branch from 82d33a2 to 86a7e75 Compare July 11, 2018 22:02
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang ezyang changed the title [TESTING] New allocator PR Allocator unification, attempt three! Jul 12, 2018

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@ezyang
Copy link
Contributor Author

ezyang commented Jul 13, 2018

The Lunch Design Committee(TM) has ordained DevicePtr shall be renamed as DataPtr.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

ezyang added 9 commits July 13, 2018 14:28
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
ezyang added 13 commits July 13, 2018 14:28
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
@ezyang ezyang force-pushed the pr/allocator-deleter2 branch from 220cf25 to ebbe790 Compare July 14, 2018 16:55
@ezyang
Copy link
Contributor Author

ezyang commented Jul 14, 2018

@pytorchbot retest this please

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 15, 2018
…9358)

Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Pull Request resolved: pytorch/pytorch#9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 16, 2018
…9358)

Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Pull Request resolved: pytorch/pytorch#9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
petrex pushed a commit to petrex/pytorch that referenced this pull request Jul 16, 2018
* upstream/master: (24 commits)
  Implement tensor weak references (pytorch#9363)
  Nuke TestCollectEnv (pytorch#9459)
  Add test case for segmentation fault fix in grad_fn (pytorch#9457)
  Add peephole optimization for type_as operators. (pytorch#9316)
  Fix out-of-range error for test_neg (pytorch#9431)
  add depthwise conv support for mkldnn (pytorch#8782)
  Refactor `_log_sum_exp` (pytorch#9173)
  Add ModuleDict and ParameterDict containers (pytorch#8463)
  Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (pytorch#9358)
  Introducing IsInf (pytorch#9169)
  add device to CUDAEvent (pytorch#9415)
  Make localScalar error message more intuitive (pytorch#9443)
  Only accept continguous tensors in TopK for cuda (pytorch#9441)
  Add support for .norm() pytorch onnx export and ReduceL1/ReduceL2 caffe2 operators (pytorch#9299)
  Only view() rhs of index_put if we need to (pytorch#9424)
  Add BatchBucketizeOp in caffe2 (pytorch#9385)
  Implementation of Wngrad optimizer caffe2 python wrapper and unit test on least square regression (pytorch#9001)
  Implementation and operator test for Wngrad optimizer (pytorch#8999)
  Fix segmentation fault in grad_fn (pytorch#9292)
  update docs (pytorch#9423)
  ...
goldsborough pushed a commit to goldsborough/pytorch that referenced this pull request Jul 20, 2018
…ytorch#9358)

Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Pull Request resolved: pytorch#9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
jramseyer pushed a commit to jramseyer/pytorch that referenced this pull request Jul 30, 2018
…ytorch#9358)

Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Pull Request resolved: pytorch#9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
…ytorch#9358)

Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to be implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyeable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisorPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer aftewards (always false!)

Pull Request resolved: pytorch#9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants