Description
We define a 4D tensor as stored in channels last memory format when the dimension order is NCHW and C-strides < W-strides < H-strides < N-strides (if the size of any dimension equals 1, that dimension's strides are not taken into account).
A channels last contiguous tensor is a channels last tensor that occupies a contiguous memory block, so x.is_contiguous(memory_format=torch.channels_last) checks whether a tensor is channels last contiguous.
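For illustration, a minimal stride check of the definition above (a sketch assuming a PyTorch build where `torch.channels_last` is available; the tensor sizes are arbitrary):

```python
import torch

# Default (torch.contiguous_format) NCHW tensor: strides decrease from N to W.
x = torch.randn(2, 3, 4, 4)
print(x.stride())                                           # (48, 16, 4, 1)
print(x.is_contiguous(memory_format=torch.channels_last))   # False

# Channels last: same NCHW shape, but C-strides < W-strides < H-strides < N-strides.
y = x.contiguous(memory_format=torch.channels_last)
print(y.stride())                                           # (48, 1, 12, 3)
print(y.is_contiguous(memory_format=torch.channels_last))   # True
```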
The goal of the experiment is to use channels last memory format in all operators of the ResNet model (https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) and to measure the performance gains on Volta devices with the cuDNN library available.
This experiment requires:
- Updating operator kernels to follow this rule: if any of an operator's inputs is a channels last tensor, all outputs should also be in channels last memory format.
- For a better performance gain, updating the DataLoader to output channels last tensors (see the sketch after this list).
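Until the DataLoader change lands, a similar effect can be approximated by converting each batch inside the training loop. A minimal sketch (the dataset, sizes and model here are placeholders, not part of the proposal):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the real image pipeline.
dataset = TensorDataset(torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,)))
loader = DataLoader(dataset, batch_size=4)
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

for images, labels in loader:
    # Convert the NCHW batch to channels last before it reaches the model.
    images = images.contiguous(memory_format=torch.channels_last)
    out = model(images)
    # With memory format aware kernels, the output is expected to stay channels last.
    print(out.is_contiguous(memory_format=torch.channels_last))
```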
To avoid changing the model itself and, more importantly, to bring this optimization to existing saved models, we need to introduce the following changes (a quick Python spot-check of the intended behavior follows the list):
- `to` operator should preserve memory format, and `copy_device_to_device` should be memory format aware (`empty_like`, `to`, `resize_as_` and `clone` now preserve memory format #23899).
- `empty_like` operator should preserve memory format by default (`empty_like`, `to`, `resize_as_` and `clone` now preserve memory format #23899).
- `resize_as_` operator should be memory format aware (`empty_like`, `to`, `resize_as_` and `clone` now preserve memory format #23899).
- `clone` operator should preserve memory format (`empty_like`, `to`, `resize_as_` and `clone` now preserve memory format #23899).
- `scatter` and `gather` functions should be memory format aware ([WIP] Scatter gather memory format #24121).
- `TensorIterator` based point-wise operators should preserve memory format ([WIP] Add Tensor Iterator and some cuda functions memory propagation #24038).
- `adaptive_avg_pool2d_cuda` and `adaptive_avg_pool2d_backward_cuda` should have channels last optimized kernels (nhwc support for adaptive_avg_pool2d & adaptive_avg_pool2d_backward #24396).
- `max_pool2d_with_indices_cuda` and `max_pool2d_with_indices_backward_cuda` should have channels last optimized kernels (max_pool2d cuda should have channel last optimized kernels [Performance improvement] #24872).
- `cudnn_batch_norm` and `cudnn_batch_norm_backward` should support channels last memory format (cudnn nhwc support #23861).
- `cudnn_convolution_forward` and `cudnn_convolution_backward` should support channels last memory format (cudnn nhwc support #23861).
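As a spot-check of the intended behavior from Python (which operators actually preserve the format depends on the PRs above having landed):

```python
import torch

x = torch.randn(8, 3, 32, 32).contiguous(memory_format=torch.channels_last)

candidates = {
    "empty_like": torch.empty_like(x),  # should preserve memory format by default
    "clone":      x.clone(),            # should preserve memory format
    "to":         x.to(torch.float64),  # should preserve memory format
    "pointwise":  x * 2 + 1,            # TensorIterator based op should preserve it
}
for name, y in candidates.items():
    print(name, y.is_contiguous(memory_format=torch.channels_last))
```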
Writing memory format aware operators requires the special functions introduced in #23391:
```cpp
auto memory_format = input_tensor.suggest_memory_format();
auto output_tensor = at::empty(output_shape, memory_format);

switch (memory_format) {
  case MemoryFormat::ChannelsLast: {
    // If the kernel requires a memory contiguous tensor.
    auto input_cl_contiguous =
        input_tensor.contiguous(MemoryFormat::ChannelsLast);
    // .... kernel code
    break;
  }
  case MemoryFormat::Contiguous: {
    // .... standard kernel
    break;
  }
  default:
    TORCH_CHECK(
        false,
        "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
```

Notes
- ResNet calls `x = x.reshape(x.size(0), -1)` before the linear layer; we are going to update the `reshape` and `view` code to convert the tensor's memory format to `torch.contiguous_format` at this step (illustrated in the sketch after this list).
- Making old models/libraries run faster also requires BC sacrifices: `empty_like` (and all `_like` operators) will return a channels last tensor if the input is channels last, and the same will apply to `to`, `clone` and `resize_as_`. We are thinking about adding the ability to control the `suggest_memory_format` behavior via a global variable.
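To illustrate the reshape note (a sketch: the 512-channel shape matches ResNet-18/34's output before the fully connected layer, and the exact post-change behavior of `reshape` on channels last inputs is an assumption here):

```python
import torch

# Output of the last pooling stage in ResNet-18/34: N x 512 x 1 x 1, channels last.
x = torch.randn(8, 512, 1, 1).contiguous(memory_format=torch.channels_last)
fc = torch.nn.Linear(512, 1000)

# The flattening step before `self.fc`; the flattened 2D tensor is handled
# as torch.contiguous_format from here on.
flat = x.reshape(x.size(0), -1)
print(flat.is_contiguous())   # True
print(fc(flat).shape)         # torch.Size([8, 1000])
```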