
CUDA 12.4 CI Inductor Issues #126692

@nWEIdia

Description


🐛 Describe the bug

Note: this issue tracks failures that are present only with CUDA 12.4 (i.e. regressions introduced by CUDA 12.4).
While enabling CUDA 12.4 in CI, the CUDA 12.4 inductor job hit a few unexpected errors. Details below (compiled from https://hud.pytorch.org/pytorch/pytorch/pull/121956, "suppress deprecation cusparse warnings v3: Linux only" (c8c7dd)):

1. [cuda12.4-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)](https://ossci-raw-job-status.s3.amazonaws.com/log/25153237387)

```
2024-05-19T18:23:37.4020115Z + python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference --export-aot-inductor --only nanogpt --output /var/lib/jenkins/workspace/test/test-reports/inductor_inference_smoketest.csv
2024-05-19T18:23:40.7767437Z
2024-05-19T18:23:43.6908140Z loading model: 0it [00:00, ?it/s]number of parameters: 123.69M
2024-05-19T18:23:44.1011678Z num decayed parameter tensors: 50, with 124,354,560 parameters
2024-05-19T18:23:44.1012893Z num non-decayed parameter tensors: 98, with 121,344 parameters
2024-05-19T18:23:44.1016400Z using fused AdamW: True
2024-05-19T18:23:44.6099974Z
2024-05-19T18:23:44.6101030Z loading model: 0it [00:03, ?it/s]
2024-05-19T18:23:44.6137242Z cuda eval nanogpt
2024-05-19T18:24:23.3324973Z
2024-05-19T18:24:23.4389192Z running benchmark: 0% 0/30 [00:00<?, ?it/s]
2024-05-19T18:24:23.5433039Z running benchmark: 33% 10/30 [00:00<00:00, 92.93it/s]
2024-05-19T18:24:23.6293708Z running benchmark: 70% 21/30 [00:00<00:00, 99.92it/s]
2024-05-19T18:24:23.6299885Z running benchmark: 100% 30/30 [00:00<00:00, 100.56it/s]
2024-05-19T18:24:23.6317077Z 4.783x
2024-05-19T18:24:25.1040739Z + python benchmarks/dynamo/check_perf_csv.py -f /var/lib/jenkins/workspace/test/test-reports/inductor_inference_smoketest.csv -t 4.9
2024-05-19T18:24:25.5607672Z nanogpt 4.783073
2024-05-19T18:24:25.5608107Z
2024-05-19T18:24:25.5608293Z Error 1 models performance regressed
2024-05-19T18:24:25.5608761Z nanogpt
```

Speedup 4.783 < threshold 4.9
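
The smoke test runs the torchbench.py command shown above and then gates the measured speedup against a hard threshold via `check_perf_csv.py -t 4.9`; with CUDA 12.4 the nanogpt speedup came in at 4.783x and tripped the gate. Below is a minimal sketch of that kind of threshold gate, assuming a result CSV with `name` and `speedup` columns; the column names and argument handling are illustrative assumptions, not the actual `benchmarks/dynamo/check_perf_csv.py`.

```python
import argparse
import csv
import sys


def check_speedups(csv_path: str, threshold: float) -> int:
    """Flag any model whose measured speedup falls below the threshold."""
    regressed = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumption: the benchmark CSV exposes "name" and "speedup" columns.
            speedup = float(row["speedup"])
            print(f"{row['name']} {speedup:.6f}")
            if speedup < threshold:
                regressed.append(row["name"])
    if regressed:
        print(f"Error {len(regressed)} models performance regressed")
        for name in regressed:
            print(name)
        return 1
    return 0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True)
    parser.add_argument("-t", "--threshold", type=float, default=4.9)
    args = parser.parse_args()
    sys.exit(check_speedups(args.file, args.threshold))
```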

2. [cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)](https://ossci-raw-job-status.s3.amazonaws.com/log/25153197487)

beit_base_patch16_224 FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T18:41:08.1561377Z loading model: 0it [00:00, ?it/s]
2024-05-19T18:41:08.1561964Z loading model: 0it [00:01, ?it/s]
2024-05-19T18:41:08.1562566Z cuda train beit_base_patch16_224
2024-05-19T18:42:06.9397643Z skipping cudagraphs due to deterministic index put. Found from :
2024-05-19T18:42:06.9399524Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/timm_models.py", line 365, in torch_dynamo_resume_in_forward_and_backward_pass_at_363
2024-05-19T18:42:06.9400513Z pred = mod(*cloned_inputs)
2024-05-19T18:42:06.9401662Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9402737Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9403764Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 427, in forward
2024-05-19T18:42:06.9404745Z x = self.forward_features(x)
2024-05-19T18:42:06.9405905Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 415, in forward_features
2024-05-19T18:42:06.9407000Z x = blk(x, shared_rel_pos_bias=rel_pos_bias)
2024-05-19T18:42:06.9408189Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9409020Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9409841Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 241, in forward
2024-05-19T18:42:06.9410912Z x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), shared_rel_pos_bias=shared_rel_pos_bias))
2024-05-19T18:42:06.9412109Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9412935Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9414002Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 149, in forward
2024-05-19T18:42:06.9414757Z rel_pos_bias = self._get_rel_pos_bias()
2024-05-19T18:42:06.9415656Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 131, in _get_rel_pos_bias
2024-05-19T18:42:06.9418869Z relative_position_bias = self.relative_position_bias_table[
2024-05-19T18:42:06.9419467Z
2024-05-19T18:42:09.2269343Z W0519 18:42:09.226000 140529272689280 torch/_logging/_internal.py:1024] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-19T18:42:57.8653667Z E0519 18:42:57.862000 140529272689280 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-19T18:42:57.8655363Z E0519 18:42:57.862000 140529272689280 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-19T18:42:57.8663498Z fail_accuracy
2024-05-19T18:42:57.9158669Z TIMING: entire_frame_compile:100.00842 code_gen:27.30379 inductor_compile:54.77576 backend_compile:85.65148
2024-05-19T18:42:57.9160141Z STATS: call_* op count: 1054 | FakeTensor.__torch_dispatch__:15454 | FakeTensorMode.__torch_dispatch__:108695 | attempt fast:2534 | fast is_contiguous:2534 | ProxyTorchDispatchMode.__torch_dispatch__:21953
2024-05-19T18:42:57.9161453Z Dynamo produced 3 graphs covering 1054 ops with 7 graph breaks (5 unique)
```

Accuracy failure: RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
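
For context on what `fail_accuracy` means here: the benchmark harness compares the compiled result and the eager result against a float64 reference, and the compiled run is only allowed a small multiple of the eager run's own error. The sketch below is an approximate reconstruction of that rule, consistent with the numbers printed in the log; the real check lives in `torch/_dynamo/utils.py` and handles many more cases, so the exact tolerance arithmetic here is an assumption.

```python
import torch


def rmse(a: torch.Tensor, b: torch.Tensor) -> float:
    """Root-mean-square error between two tensors, computed in float64."""
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2)).item()


def passes_accuracy(res, ref, fp64_ref, multiplier: float = 3.0, tol: float = 1e-2) -> bool:
    """Illustrative approximation of the RMSE gate behind `fail_accuracy`.

    `res` is the compiled output, `ref` the eager output, `fp64_ref` the eager
    output recomputed in float64; the compiled error may exceed the eager error
    by at most `multiplier`, plus a small noise floor derived from `tol`.
    """
    res_error = rmse(res, fp64_ref)  # "RMSE (res-fp64)" in the log
    ref_error = rmse(ref, fp64_ref)  # "RMSE (ref-fp64)" in the log
    return res_error <= multiplier * ref_error + tol / 10.0


# Plugging in the beit_base_patch16_224 numbers from the log:
# 0.01333 <= 3.0 * 0.00256 + 0.001  ->  0.01333 <= 0.00868  ->  False  ->  fail_accuracy
```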

3. cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

phlippe_resnet FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T19:55:39.9818821Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:55:39.9819368Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:55:39.9819833Z cuda train phlippe_resnet
2024-05-19T19:55:59.0636307Z E0519 19:55:59.062000 139763991364224 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-19T19:55:59.0647538Z fail_accuracy
2024-05-19T19:55:59.0648209Z TIMING: entire_frame_compile:16.81323 code_gen:3.27386 inductor_compile:8.07473 backend_compile:14.78056
2024-05-19T19:55:59.0649650Z STATS: call_* op count: 75 | FakeTensor.__torch_dispatch__:2555 | FakeTensorMode.__torch_dispatch__:18639 | attempt fast:586 | fast is_contiguous:586 | ProxyTorchDispatchMode.__torch_dispatch__:4304
2024-05-19T19:55:59.0650951Z Dynamo produced 2 graphs covering 75 ops with 6 graph breaks (5 unique)
```

Accuracy failure: RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000

4. cuda12.4-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)

beit_base_patch16_224 FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T18:36:45.5837030Z loading model: 0it [00:00, ?it/s]
2024-05-19T18:36:45.5837605Z loading model: 0it [00:01, ?it/s]
2024-05-19T18:36:45.5838178Z cuda train beit_base_patch16_224
2024-05-19T18:37:17.9066455Z skipping cudagraphs due to deterministic index put. Found from :
2024-05-19T18:37:17.9067751Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/timm_models.py", line 365, in torch_dynamo_resume_in_forward_and_backward_pass_at_363
2024-05-19T18:37:17.9068822Z pred = mod(*cloned_inputs)
2024-05-19T18:37:17.9069817Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9070642Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9071509Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 427, in forward
2024-05-19T18:37:17.9075389Z x = self.forward_features(x)
2024-05-19T18:37:17.9076536Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 415, in forward_features
2024-05-19T18:37:17.9077463Z x = blk(x, shared_rel_pos_bias=rel_pos_bias)
2024-05-19T18:37:17.9078641Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9079788Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9080644Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 241, in forward
2024-05-19T18:37:17.9081701Z x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), shared_rel_pos_bias=shared_rel_pos_bias))
2024-05-19T18:37:17.9082886Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9083725Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9084896Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 149, in forward
2024-05-19T18:37:17.9085681Z rel_pos_bias = self._get_rel_pos_bias()
2024-05-19T18:37:17.9086643Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 131, in _get_rel_pos_bias
2024-05-19T18:37:17.9087542Z relative_position_bias = self.relative_position_bias_table[
2024-05-19T18:37:17.9088149Z
2024-05-19T18:37:17.9997031Z W0519 18:37:17.998000 140609516171904 torch/_logging/_internal.py:1024] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-19T18:38:05.6029062Z E0519 18:38:05.601000 140609516171904 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-19T18:38:05.6031753Z E0519 18:38:05.602000 140609516171904 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-19T18:38:05.6053020Z fail_accuracy
2024-05-19T18:38:05.6551358Z TIMING: entire_frame_compile:63.06661 code_gen:24.76042 inductor_compile:42.87834 backend_compile:52.21182
2024-05-19T18:38:05.6552640Z STATS: call_* op count: 1028 | FakeTensor.__torch_dispatch__:15453 | FakeTensorMode.__torch_dispatch__:94224 | ProxyTorchDispatchMode.__torch_dispatch__:21953
2024-05-19T18:38:05.6553843Z Dynamo produced 3 graphs covering 1028 ops with 7 graph breaks (5 unique)

```
Accuracy failure: RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000

5. [cuda12.4-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9148947323/job/25153197408)

phlippe_resnet FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T19:59:30.8771470Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:59:30.8771987Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:59:30.8772519Z cuda train phlippe_resnet
2024-05-19T19:59:42.6845686Z E0519 19:59:42.683000 140475690025600 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-19T19:59:42.6858154Z fail_accuracy
2024-05-19T19:59:42.6858869Z TIMING: entire_frame_compile:6.85443 code_gen:3.07611 inductor_compile:5.21695 backend_compile:6.00724
2024-05-19T19:59:42.6860223Z STATS: call_* op count: 75 | FakeTensor.__torch_dispatch__:2555 | FakeTensorMode.__torch_dispatch__:15743 | ProxyTorchDispatchMode.__torch_dispatch__:4304
2024-05-19T19:59:42.6861358Z Dynamo produced 2 graphs covering 75 ops with 6 graph breaks (5 unique)
```

Accuracy failure: RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000

Versions

See the GitHub workflow results for #121956.

### Tasks
- [x] Add back the disabled shards once all of these issues are fixed. Until then, keep the shards in the test matrix but disable only the affected models (see the sketch after this list). (https://github.com/pytorch/pytorch/pull/127150)
- [x] Fix the perf smoke test regression [fix unknown]
- [x] Fix the accuracy regression for beit_base_patch16_224 (2 instances) [fix unknown; "fixed" in the sense that cu121 and cu124 now behave the same (both fail); why it regressed is a separate issue]
- [x] Fix the accuracy regression for phlippe_resnet (2 instances) (fix PR: https://github.com/pytorch/pytorch/pull/123475)
- [x] Fix the gluon_inception_v3 accuracy failure or mark it as flaky (related: #127672)
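
For the first task, one way to keep the shards running while the regressions are investigated is to downgrade the expected status of just the affected models in the per-suite expected-results files, rather than removing the shards. The sketch below assumes CSVs with `name,accuracy,graph_breaks` columns under `benchmarks/dynamo/ci_expected_accuracy/`; the exact file names and column layout are assumptions for illustration, not a description of what #127150 actually changed.

```python
import csv
from pathlib import Path

# Hypothetical expected-status files for the affected suites (assumed paths).
EXPECTED_CSVS = [
    Path("benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"),
    Path("benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_training.csv"),
]

# Models hit by the CUDA 12.4 regressions tracked in this issue.
AFFECTED = {"beit_base_patch16_224", "phlippe_resnet"}


def expect_failure(csv_path: Path) -> None:
    """Rewrite an expected-status CSV so the affected models are expected to fail."""
    with csv_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or ["name", "accuracy", "graph_breaks"]
        rows = list(reader)
    for row in rows:
        if row["name"] in AFFECTED:
            # Downgrade the expectation instead of disabling the whole shard.
            row["accuracy"] = "fail_accuracy"
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    for path in EXPECTED_CSVS:
        if path.exists():
            expect_failure(path)
```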

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @ColinPeppler @amjames @desertfire

cc @eqy @Fuzzkatt @atalman @malfet @ptrblck


Labels: high priority, module: inductor, oncall: pt2, triaged
