
Conversation

@VitalyFedyunin
Contributor

This PR has the combined changes of the #23403 subtasks.

VitalyFedyunin and others added 30 commits August 5, 2019 14:20
Added cudnn nhwc support for:
1. batch norm
2. convolution
3. convolution_transpose
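
A minimal sketch (not part of this PR; shapes are illustrative) of how these three cudnn NHWC paths could be exercised from Python on a CUDA build with cuDNN, checking that the outputs keep channels_last strides:

```python
# Sketch: run conv, batch norm, and conv_transpose on channels_last inputs.
import torch
import torch.nn as nn

x = torch.randn(8, 32, 56, 56, device="cuda").to(memory_format=torch.channels_last)

conv   = nn.Conv2d(32, 64, 3, padding=1).cuda().to(memory_format=torch.channels_last)
deconv = nn.ConvTranspose2d(64, 32, 2, stride=2).cuda().to(memory_format=torch.channels_last)
bn     = nn.BatchNorm2d(64).cuda()

y = bn(conv(x))
z = deconv(y)

# If the NHWC kernels are hit, the outputs should keep channels_last strides.
print(y.is_contiguous(memory_format=torch.channels_last))
print(z.is_contiguous(memory_format=torch.channels_last))
```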
suggest_memory_format has an ambiguous meaning in two cases:
1. a tensor with NCHW sizes where C == 1:
   we could use the stride of C as a hint to infer the intended memory format.
2. a tensor with NCHW sizes where H == W == 1:
   there's no way to identify the intended memory format from the strides.

Currently we fall back to NCHW whenever we see a contiguous tensor, which avoids
the ambiguity for some of these special cases.
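
For illustration (a sketch, not from the PR), both special cases can be reproduced from Python: a contiguous tensor whose sizes hit either case satisfies the contiguity check for both memory formats, so the strides alone cannot reveal the intended layout:

```python
# Sketch: strides of these contiguous tensors are consistent with NCHW and NHWC.
import torch

a = torch.randn(2, 1, 4, 4)   # NCHW sizes with C == 1
b = torch.randn(2, 3, 1, 1)   # NCHW sizes with H == W == 1

for t in (a, b):
    print(t.stride(),
          t.is_contiguous(),                                   # NCHW-contiguous
          t.is_contiguous(memory_format=torch.channels_last))  # also NHWC-contiguous
```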
The old implementation assumed `is_channels_last_contiguous_` to be mutually
exclusive with `is_contiguous_`, which is not true.
Properly set the flag by checking strides.
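
As a quick illustration of the non-exclusivity (a sketch, not from the PR): a tensor laid out as channels_last can simultaneously satisfy the default contiguity check, so both flags have to be derived from the strides rather than from each other:

```python
# Sketch: one tensor for which both contiguity flags are legitimately true.
import torch

x = torch.randn(4, 3, 1, 1).contiguous(memory_format=torch.channels_last)

print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```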
Added initial kernel support for optimized NHWC tensors.

TODO: currently the backward kernel emits a tensor with NHWC strides.
Unfortunately autograd restores the grad to a contiguous tensor (in either the
copy or the add). This makes real perf tuning annoying to do, since I cannot
easily measure end-to-end time in my Python script.

My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since I avoided atomicAdd. I'll finish perf tuning after we merge a future PR
expanding NHWC support in the core.
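
For reference, one way to get a rough end-to-end forward+backward number from Python (a sketch assuming a CUDA build; module, sizes, and iteration counts are arbitrary) is to time the whole loop with CUDA events:

```python
# Rough end-to-end timing sketch: time forward+backward with CUDA events.
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64).cuda()
x = (torch.randn(32, 64, 56, 56, device="cuda")
        .to(memory_format=torch.channels_last)
        .requires_grad_())

for _ in range(10):                              # warm-up
    bn(x).sum().backward()
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    bn(x).sum().backward()
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end) / 100, "ms per fwd+bwd")
```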
@pytorchbot added the module: cuda, module: cudnn, module: internals, module: nn, module: operators, and module: tests labels on Aug 23, 2019.
@VitalyFedyunin
Contributor Author

@jjsjann123 @ifedan FYI

@jjsjann123
Collaborator

BTW,

if (!grad.defined()) {
  // under following condition, we can avoid clone()
  if (!GradMode::is_enabled()
      && !new_grad.is_sparse()
      && new_grad.is_contiguous()
      && new_grad.use_count() <= 1 + !post_hooks().empty()) {
    // first check it is in first-order grad only mode
    // then check not sparse before is_contiguous
    // then check contiguous, otherwise later in place accumulation may fail
    // and lastly, check it is the last reference before we grab it.
    // If the function has post hooks (for example, a DDP allreduce hook),
    // call_function in Engine.cpp will temporarily bump the refcount by one,
    // hence the addition of !post_hooks().empty().
    variable.grad() = new_grad.detach();
  } else {
    variable.grad() = new_grad.clone();
  }
} else if (!GradMode::is_enabled()) {

This guy clones the grad if it's not contiguous. In the benchmarking for my PR, that permutation kills the perf gain from NHWC :/

Not sure how it works out now that our clone() supports NHWC, but we may want to double-check the kernels with a toy example to make sure there's no wasted permutation.
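
A possible toy check along those lines (a sketch, not from the PR; layer and shapes are arbitrary): run forward+backward with channels_last inputs and inspect whether the accumulated grads kept NHWC strides or were cloned back to NCHW-contiguous:

```python
# Sketch: check the memory format of the grads that AccumulateGrad produced.
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, 3, padding=1).cuda().to(memory_format=torch.channels_last)
x = (torch.randn(4, 16, 32, 32, device="cuda")
        .to(memory_format=torch.channels_last)
        .requires_grad_())

conv(x).sum().backward()

print(x.grad.stride())
print(x.grad.is_contiguous(memory_format=torch.channels_last))           # want: True
print(conv.weight.grad.is_contiguous(memory_format=torch.channels_last))
```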

VitalyFedyunin and others added 8 commits August 29, 2019 11:34
The previous kernel does not stride on the channel dimension, and it uses shared
memory to store temporary results (to break data dependencies -> more parallelism).

This resulted in requesting more resources than are available.

Fix:
added striding on C to reduce shared-memory usage per CTA.
Updated the cudnn API usage for batchnorm, enabling the Extended API, which
provides a semi-persistent batchnorm kernel with better performance on the NHWC
layout.

TODO: I made adjustments to the API as well as to BN in the JIT IR, but I
haven't fully tested the JIT part yet. I should verify that in the final PR.
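
A hedged sketch of how the NHWC batchnorm path (and the scripted BN, since the commit also touches the JIT IR) could be smoke-tested from Python; whether the semi-persistent cudnn kernel is actually selected is not visible from this level:

```python
# Sketch: training-mode batch norm on channels_last input, eager and scripted.
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(128).cuda()
x = torch.randn(16, 128, 28, 28, device="cuda").to(memory_format=torch.channels_last)

bn.train()
y = bn(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # NHWC preserved?

scripted = torch.jit.script(bn)
print(scripted(x).is_contiguous(memory_format=torch.channels_last))
```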