[fixing cuda launch config failure on UpSampleNearest] #29016
Conversation
Adding a limitation on the launch config for grid sizes as well; test added in test_cuda.
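For context, a minimal CUDA sketch of the general pattern behind this kind of fix: clamp the requested grid size to the device's maxGridSize and let a grid-stride loop cover the remaining elements. The kernel and function names below are made up for illustration; this is not the PR's actual diff.

```cpp
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical 1-D nearest-neighbor upsampling by an integer scale factor.
// Each thread strides over the flattened output, so the result stays correct
// even when the grid is clamped below "one block per output chunk".
__global__ void upsample_nearest1d_sketch(const float* in, float* out,
                                          int64_t out_numel, int scale) {
  for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
       i < out_numel;
       i += (int64_t)gridDim.x * blockDim.x) {
    out[i] = in[i / scale];  // nearest neighbor: output index i maps to input index i/scale
  }
}

void launch_upsample_sketch(const float* in, float* out,
                            int64_t out_numel, int scale) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  const int threads = 256;
  const int64_t blocks_wanted = (out_numel + threads - 1) / threads;
  // Clamp the grid dimension to the device limit instead of failing the launch;
  // the grid-stride loop in the kernel covers whatever the grid cannot.
  const unsigned int blocks = (unsigned int)std::min<int64_t>(
      blocks_wanted, (int64_t)prop.maxGridSize[0]);
  upsample_nearest1d_sketch<<<blocks, threads>>>(in, out, out_numel, scale);
}
```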
ngimel left a comment:
Looks good, just a couple small fixes.
TORCH_CHECK(
    gdim.x <= at::cuda::getCurrentDeviceProperties()->maxGridSize[0],
    "input tensor has spatial dimension larger than the kernel capacity");
tensors are indexed with ints in the kernel, so just check that the number of elements fits, and don't do this check.
It's fair to assume that maxGridSize is not going to change in the future.
I'll update it for the places where it's the same kernel launch.
TORCH_CHECK(
    gdim.x <= at::cuda::getCurrentDeviceProperties()->maxGridSize[0],
    "input tensor has spatial dimension larger than the kernel capacity");
Same here.
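A minimal sketch of the alternative check suggested above, assuming the kernel indexes tensors with 32-bit ints: verify that the element count fits in an int instead of comparing grid dimensions against maxGridSize. The helper name and error message wording are illustrative, not the PR's final code.

```cpp
#include <cstdint>
#include <limits>
#include <ATen/ATen.h>
#include <c10/util/Exception.h>  // TORCH_CHECK

// Illustrative helper: since the kernel indexes with 32-bit ints, checking that
// the total number of elements is addressable is sufficient, and the explicit
// comparison against maxGridSize can be dropped.
void check_int32_indexable(const at::Tensor& input) {
  TORCH_CHECK(
      input.numel() <= std::numeric_limits<int32_t>::max(),
      "input tensor has too many elements to be indexed with a 32-bit int");
}
```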
with torch.backends.cudnn.flags(enabled=False):
    self._test_rnn_retain_variables(device, dtype)

@onlyCUDA
How much memory do those tests require? Regular CI machines are M60s with 6GB of memory IIRC, so if you need more, use the large-memory decorator.
Looking at the size of input + output, it takes 0.625GB. Given the way input/output are allocated in this code path, it should cache 0.75GB with 0.625GB actually allocated.
Running the test independently, I'm seeing 2.5GB of cached memory in total at the end of the test, so we should be safe here.
It took me a while wondering how that expected failure was passing the test; then I realized it's ROCm...
…nel and change the check to use a numerics check (assuming the launch limitation to be 2**31-1 implicitly)
I can't get anything useful out of the failing CI. Should I be concerned?

Failures are unrelated.
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Adding limitation on launch config for grid size. Test added in test_cuda.
Pull Request resolved: pytorch/pytorch#29016
Differential Revision: D18293788
Pulled By: ngimel
fbshipit-source-id: 44de308b05a4fe44bfffc2f3713fd9fa67ef74fa
When is this getting released? For now I've had to implement a DataLoader for inference as a workaround.

It is in the nightly releases.