CuPy Transforms #3294
Description
Is your feature request related to a problem? Please describe.
Using CuPy in transforms introduces a host of advantages and issues to consider. CuPy provides a number of CUDA-accelerated facilities not provided by PyTorch. One important aspect is that it replicates the NumPy API, so it's possible to define/rewrite code to use one library or the other based on which one the inputs come from. This does, however, add CuPy as a new dependency, either as an optional one that is difficult to integrate tightly or as a hard requirement.
Describe the solution you'd like
Transforms can be defined with:
- NumPy: CPU only, default array interface, many libraries use this directly
- CuPy: GPU only, can interoperate with array interface but must copy to/from VRAM, cannot be substituted for NumPy in every instance or in existing NumPy based libraries
- PyTorch: CPU or GPU, missing some features NumPy/CuPy provide through SciPy and other routines; some of its GPU operations are faster than the equivalents CuPy provides
- Numba: CPU and GPU, provides facilities for compiling Python functions into either CPU code or CUDA, often using the same definitions in both places, not necessarily optimal code
We should investigate how to define transforms that use the best combination of these libraries. Since CuPy is meant to be a drop-in replacement for NumPy, this may allow us to write code only once and select which library to use at runtime, but the details will probably make this less straightforward than anticipated. We will always need NumPy to support users' custom transforms built on libraries based on it, so we can't state a hard requirement that transforms accept and return only CuPy arrays or tensors without causing copying inefficiencies for these transforms.
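The write-once idea can be sketched with a small dispatch helper that picks the matching array module for each input (CuPy itself offers `cupy.get_array_module` for this; the sketch below avoids a hard CuPy import, and `gaussian_normalize` is a hypothetical transform used only for illustration):

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CuPy stays optional; NumPy path always works

def get_namespace(x):
    """Return the array module (numpy or cupy) matching the input."""
    if cp is not None and isinstance(x, cp.ndarray):
        return cp
    return np

def gaussian_normalize(img):
    # Written once against the NumPy API; runs on CPU or GPU
    # depending on the type of `img`, with no copies either way.
    xp = get_namespace(img)
    return (img - xp.mean(img)) / xp.std(img)
```

The same function body then serves both array types, which is the pattern most CuPy-aware libraries use rather than maintaining two implementations.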
CuPy doesn't cover the whole NumPy/SciPy API; there are missing routines and submodules, such as scipy.interpolate. There are also features in PyTorch that CuPy may provide as well (or have a better version of), so which library offers the better GPU implementation will vary from operation to operation.
CuPy has a comprehensive interface for defining custom kernels in CUDA without having to deal with the compiler toolchain directly. CUDA code can be provided as snippets to a kernel template routine, or translated from Python code using function decorators in a similar way to Numba's CUDA support. This can be really useful for providing fast versions of operations which can be expressed in NumPy/CuPy directly but would benefit from being compiled into a single faster kernel, which is the whole rationale behind Numba.
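As a sketch of the snippet-based interface, `cupy.ElementwiseKernel` compiles a one-line CUDA expression and handles compilation, launch configuration, and broadcasting itself. The thresholding operation here is a made-up example; a NumPy reference is included so the behaviour is checkable without a GPU:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

if cp is not None:
    # The CUDA snippet is compiled on first call; CuPy handles the
    # toolchain, launch configuration, and broadcasting.
    threshold_gpu = cp.ElementwiseKernel(
        "float32 x, float32 t",  # input arguments
        "float32 y",             # output argument
        "y = x > t ? x : 0.0f",  # per-element CUDA expression
        "threshold")

def threshold_cpu(x, t):
    """NumPy reference for the same per-element operation."""
    return np.where(x > t, x, np.float32(0.0))
```

This is much less ceremony than writing, compiling, and launching a raw CUDA kernel by hand, which is the main attraction of CuPy (and Numba) for one-off custom operations.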
Describe alternatives you've considered
With all these pros and cons, the question is where CuPy fits in with transforms. It's technically redundant in that it provides GPU operations which we can also get from PyTorch. For CPU transforms we can use NumPy or PyTorch, but we will often want NumPy for interoperation or other features not in PyTorch, so we'll never go the PyTorch-only route. I had discussed Numba before as a way to define CUDA code in a similar way to CuPy, but with the advantage of compiling CPU code as well, often reusing the same definitions.
If we want to re-engineer the transforms to better support GPU computation, we can do it with PyTorch in conjunction with NumPy for CPU-only work, so is there a need for CuPy (or Numba)? Does one or the other provide enough advantage to be made a hard requirement and tightly integrated with the transform definitions?
Additional context
One comparison I've made is with resizing an image. In NumPy or SciPy there are options for interpolation or zoom, but these are slow. The Resize transform uses torch.nn.functional.interpolate. In CuPy there's cupyx.scipy.ndimage.zoom, which can achieve the same thing but is still much slower than interpolate; however, it can do tricubic interpolation, which interpolate cannot. Resizing images is important for the way I was defining smooth fields for various applications, so there's a definite need for a fast solution.
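For reference, the two APIs being compared look roughly like this (a sketch only; cupyx.scipy.ndimage.zoom takes the same arguments as the SciPy call shown here but operates on CuPy arrays, and the PyTorch path is guarded since it needs a separate install):

```python
import numpy as np
from scipy.ndimage import zoom  # CPU; cupyx.scipy.ndimage.zoom mirrors this

img = np.arange(16, dtype=np.float32).reshape(4, 4)

# Cubic spline zoom (order=3): slow on CPU, but the CuPy version
# supports this tricubic-style mode, which interpolate does not.
out_scipy = zoom(img, 2.0, order=3)

try:
    import torch
    import torch.nn.functional as F
    # interpolate expects NCHW input; bicubic is the closest mode
    t = torch.from_numpy(img)[None, None]
    out_torch = F.interpolate(t, scale_factor=2.0, mode="bicubic",
                              align_corners=False)[0, 0].numpy()
except ImportError:
    out_torch = None
```

The interfaces are close enough that a resize transform could select between them per-input, which is the kind of mixing discussed below.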
My limited experience in playing with CuPy transforms is that it can be easily mixed with PyTorch code, using whichever library has the faster implementation of each component. Some of our existing transforms can accept CuPy arrays as input now and operate using only its routines, but more study needs to be done to ensure unnecessary copies and other inefficiencies are minimized.
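The mixing typically happens zero-copy through DLPack, which both libraries support. The sketch below is hypothetical (the choice of median_filter is arbitrary, as an example of a routine PyTorch lacks) and only runs on a CUDA device:

```python
try:
    import cupy as cp
    import cupyx.scipy.ndimage as cpx_ndi
    import torch
    import torch.nn.functional as F
    HAVE_GPU = torch.cuda.is_available()
except ImportError:
    HAVE_GPU = False

def mixed_transform(t):
    """Hypothetical transform mixing libraries: PyTorch for resizing
    (fast interpolate), CuPy for a routine PyTorch lacks. Requires a
    CUDA tensor; the DLPack exchange is zero-copy in both directions."""
    t = F.interpolate(t[None, None], scale_factor=2.0,
                      mode="bilinear", align_corners=False)[0, 0]
    a = cp.from_dlpack(t)                 # Torch -> CuPy, no copy
    a = cpx_ndi.median_filter(a, size=3)  # not available in PyTorch
    return torch.from_dlpack(a)           # CuPy -> Torch, no copy
```

Because both directions share the same device memory, the main cost to audit is any implicit synchronization between the two libraries' CUDA streams, not copying.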