
Conversation

@ifedan (Contributor) commented Oct 30, 2019

When an NHWC tensor has a height or width larger than the maximum CUDA grid size, max_pool fails with error code 0.

A reproducing example is in #28714.

This change limits the grid size to the CUDA maximum and chunks the input so that it can still be processed.

@ifedan changed the title from "28714" to "[WIP] Fix the issue when NHWC Tensor has height or width larger then max cuda grid" Oct 30, 2019
maxThreadsDim[0], std::min<int>(lastPow2(nInputPlane), max_threads / block_y / block_z));
const dim3 block(block_x, block_y, block_z);
int grid_x = nbatch;
int grid_y = cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputWidth), block_y*BLOCK_STRIDE);
Collaborator:

We shouldn't need multiple kernel launches, since the kernel already strides along width/height.

I would discard all these changes and simply change this line (and grid_z as well):

int grid_y = std::min<int>(
    at::cuda::getCurrentDeviceProperties()->maxGridSize[1],
    cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputWidth), block_y*BLOCK_STRIDE));

Please do NOT do the same thing for grid_x, as the kernel does not loop over the batch dimension. The fix you have here could be used for that.
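
For reference, the matching grid_z change (a sketch following the same pattern; block_z and outputHeight are the names used in the surrounding launch code, and maxGridSize[2] is the z-dimension limit):

// Sketch of the analogous grid_z clamp; the kernel's stride loop covers
// whatever rows a clamped grid cannot reach in a single pass.
int grid_z = std::min<int>(
    at::cuda::getCurrentDeviceProperties()->maxGridSize[2],
    cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputHeight), block_z*BLOCK_STRIDE));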

Contributor Author:

I updated the grid and removed the slicing, but I still see a performance degradation in the backward function.

@jjsjann123 (Collaborator) commented Nov 1, 2019:

Hmm, a perf regression doesn't really make sense here; it's basically the same kernel. Not sure if it's related to the typo I pointed out in the other comment.

Changing the grid size could affect occupancy and maybe the cache hit rate? I need to think about it further before I can make any decisive call. But let's fix the typo and hope the problem goes away.
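
(Why clamping the grid is safe at all: these kernels use a grid-stride pattern, so each thread keeps jumping ahead by the full grid extent until the tensor is covered. A minimal CUDA sketch of the idea, with hypothetical names rather than the actual ATen kernel:)

// Hypothetical grid-stride sketch: the y dimension covers output width and
// the z dimension covers output height; because each thread advances by
// gridDim * blockDim, clamping gridDim.y / gridDim.z to the device limit
// never skips elements, it just means more iterations per thread.
__global__ void pool_like_kernel(float* out, int outputWidth, int outputHeight) {
  for (int ow = blockIdx.y * blockDim.y + threadIdx.y; ow < outputWidth;
       ow += gridDim.y * blockDim.y) {
    for (int oh = blockIdx.z * blockDim.z + threadIdx.z; oh < outputHeight;
         oh += gridDim.z * blockDim.z) {
      out[oh * outputWidth + ow] = 0.f;  // placeholder for the pooling work
    }
  }
}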

maxThreadsDim[0], std::min<int>(lastPow2(nInputPlane), max_threads / block_y / block_z));
const dim3 block(block_x, block_y, block_z);
int grid_x = nbatch;
int grid_y = cuda::ATenCeilDiv(safe_downcast<int, int64_t>(inputWidth), block_y*BLOCK_STRIDE);
Collaborator:

Same thing as I commented on the forward pass.

Contributor Author:

Updated

cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputWidth), block_y*BLOCK_STRIDE));
int grid_z = std::min<int>(
at::cuda::getCurrentDeviceProperties()->maxGridSize[2],
cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputHeight), block_y*BLOCK_STRIDE));
Collaborator:

Typo here! block_y should be block_z.
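
(i.e., with the typo fixed, the last line above would presumably read:)

cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputHeight), block_z*BLOCK_STRIDE));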

Contributor Author:

changed

cuda::ATenCeilDiv(safe_downcast<int, int64_t>(inputWidth), block_y*BLOCK_STRIDE));
int grid_z = std::min<int>(
at::cuda::getCurrentDeviceProperties()->maxGridSize[2],
cuda::ATenCeilDiv(safe_downcast<int, int64_t>(inputHeight), block_y*BLOCK_STRIDE));
Collaborator:

block_y -> block_z

Contributor Author:

changed

@ifedan (Contributor, Author) commented Nov 5, 2019

[two attached images]

@ifedan marked this pull request as ready for review November 5, 2019 20:16
@ifedan requested a review from VitalyFedyunin November 5, 2019 20:16
@ifedan changed the title from "[WIP] Fix the issue when NHWC Tensor has height or width larger then max cuda grid" to "Fix the issue when NHWC Tensor has height or width larger then max cuda grid" Nov 5, 2019
@ifedan requested a review from jjsjann123 November 6, 2019 16:30
@jjsjann123 (Collaborator):

LGTM.

Did that few-line change actually cause a performance hit, or is there just high variance between runs?

@facebook-github-bot (Contributor) left a comment:

@ifedan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 7, 2019:
Fix the issue when NHWC Tensor has height or width larger then max cuda grid (#28931)

Summary:
When an NHWC tensor has a height or width larger than the maximum CUDA grid size, max_pool fails with error code 0.

A reproducing example is in pytorch/pytorch#28714

This change limits the grid size to the CUDA maximum and chunks the input so that it can still be processed.
Pull Request resolved: pytorch/pytorch#28931

Differential Revision: D18358892

Pulled By: ifedan

fbshipit-source-id: 2fd65448bd644f1588a0e208edaaea5bcb6a7d52
@facebook-github-bot (Contributor):

@ifedan merged this pull request in 5d70b11.
