
CUDA 12.4 CI Inductor Issues #126692

@nWEIdia

Description


🐛 Describe the bug

Note: this issue tracks failures that are present only with CUDA 12.4 (i.e. regressions introduced by CUDA 12.4).
While enabling CUDA 12.4 in CI, the CUDA 12.4 inductor job hit a few unexpected errors. Details below (compiled from https://hud.pytorch.org/pytorch/pytorch/pull/121956, "suppress deprecation cusparse warnings v3: Linux only" (c8c7dd)):

1. [cuda12.4-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)](https://ossci-raw-job-status.s3.amazonaws.com/log/25153237387)

```
2024-05-19T18:23:37.4020115Z + python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference --export-aot-inductor --only nanogpt --output /var/lib/jenkins/workspace/test/test-reports/inductor_inference_smoketest.csv
2024-05-19T18:23:40.7767437Z
2024-05-19T18:23:43.6908140Z loading model: 0it [00:00, ?it/s]number of parameters: 123.69M
2024-05-19T18:23:44.1011678Z num decayed parameter tensors: 50, with 124,354,560 parameters
2024-05-19T18:23:44.1012893Z num non-decayed parameter tensors: 98, with 121,344 parameters
2024-05-19T18:23:44.1016400Z using fused AdamW: True
2024-05-19T18:23:44.6099974Z
2024-05-19T18:23:44.6101030Z loading model: 0it [00:03, ?it/s]
2024-05-19T18:23:44.6137242Z cuda eval nanogpt
2024-05-19T18:24:23.3324973Z
2024-05-19T18:24:23.4389192Z running benchmark: 0% 0/30 [00:00<?, ?it/s]
2024-05-19T18:24:23.5433039Z running benchmark: 33% 10/30 [00:00<00:00, 92.93it/s]
2024-05-19T18:24:23.6293708Z running benchmark: 70% 21/30 [00:00<00:00, 99.92it/s]
2024-05-19T18:24:23.6299885Z running benchmark: 100% 30/30 [00:00<00:00, 100.56it/s]
2024-05-19T18:24:23.6317077Z 4.783x
2024-05-19T18:24:25.1040739Z + python benchmarks/dynamo/check_perf_csv.py -f /var/lib/jenkins/workspace/test/test-reports/inductor_inference_smoketest.csv -t 4.9
2024-05-19T18:24:25.5607672Z nanogpt 4.783073
2024-05-19T18:24:25.5608107Z
2024-05-19T18:24:25.5608293Z Error 1 models performance regressed
2024-05-19T18:24:25.5608761Z nanogpt
```

Speedup 4.783 < threshold 4.9
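
The smoke test runs the torchbench.py command shown above and then gates the measured speedup against a hard threshold via `check_perf_csv.py -t 4.9`; with CUDA 12.4 the nanogpt speedup came in at 4.783x and tripped the gate. Below is a minimal sketch of that kind of threshold gate, assuming a result CSV with `name` and `speedup` columns; the column names and argument handling are illustrative assumptions, not the actual `benchmarks/dynamo/check_perf_csv.py`.

```python
import argparse
import csv
import sys


def check_speedups(csv_path: str, threshold: float) -> int:
    """Flag any model whose measured speedup falls below the threshold."""
    regressed = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumption: the benchmark CSV exposes "name" and "speedup" columns.
            speedup = float(row["speedup"])
            print(f"{row['name']} {speedup:.6f}")
            if speedup < threshold:
                regressed.append(row["name"])
    if regressed:
        print(f"Error {len(regressed)} models performance regressed")
        for name in regressed:
            print(name)
        return 1
    return 0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True)
    parser.add_argument("-t", "--threshold", type=float, default=4.9)
    args = parser.parse_args()
    sys.exit(check_speedups(args.file, args.threshold))
```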

2. [cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)](https://ossci-raw-job-status.s3.amazonaws.com/log/25153197487)

beit_base_patch16_224 FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T18:41:08.1561377Z loading model: 0it [00:00, ?it/s]
2024-05-19T18:41:08.1561964Z loading model: 0it [00:01, ?it/s]
2024-05-19T18:41:08.1562566Z cuda train beit_base_patch16_224
2024-05-19T18:42:06.9397643Z skipping cudagraphs due to deterministic index put. Found from :
2024-05-19T18:42:06.9399524Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/timm_models.py", line 365, in torch_dynamo_resume_in_forward_and_backward_pass_at_363
2024-05-19T18:42:06.9400513Z pred = mod(*cloned_inputs)
2024-05-19T18:42:06.9401662Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9402737Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9403764Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 427, in forward
2024-05-19T18:42:06.9404745Z x = self.forward_features(x)
2024-05-19T18:42:06.9405905Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 415, in forward_features
2024-05-19T18:42:06.9407000Z x = blk(x, shared_rel_pos_bias=rel_pos_bias)
2024-05-19T18:42:06.9408189Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9409020Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9409841Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 241, in forward
2024-05-19T18:42:06.9410912Z x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), shared_rel_pos_bias=shared_rel_pos_bias))
2024-05-19T18:42:06.9412109Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9412935Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9414002Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 149, in forward
2024-05-19T18:42:06.9414757Z rel_pos_bias = self._get_rel_pos_bias()
2024-05-19T18:42:06.9415656Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 131, in _get_rel_pos_bias
2024-05-19T18:42:06.9418869Z relative_position_bias = self.relative_position_bias_table[
2024-05-19T18:42:06.9419467Z
2024-05-19T18:42:09.2269343Z W0519 18:42:09.226000 140529272689280 torch/_logging/_internal.py:1024] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-19T18:42:57.8653667Z E0519 18:42:57.862000 140529272689280 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-19T18:42:57.8655363Z E0519 18:42:57.862000 140529272689280 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-19T18:42:57.8663498Z fail_accuracy
2024-05-19T18:42:57.9158669Z TIMING: entire_frame_compile:100.00842 code_gen:27.30379 inductor_compile:54.77576 backend_compile:85.65148
2024-05-19T18:42:57.9160141Z STATS: call_* op count: 1054 | FakeTensor.__torch_dispatch__:15454 | FakeTensorMode.__torch_dispatch__:108695 | attempt fast:2534 | fast is_contiguous:2534 | ProxyTorchDispatchMode.__torch_dispatch__:21953
2024-05-19T18:42:57.9161453Z Dynamo produced 3 graphs covering 1054 ops with 7 graph breaks (5 unique)
```

Accuracy failure: RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
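
For context on what `fail_accuracy` means here: the benchmark harness compares the compiled result and the eager result against a float64 reference, and the compiled run is only allowed a small multiple of the eager run's own error. The sketch below is an approximate reconstruction of that rule, consistent with the numbers printed in the log; the real check lives in `torch/_dynamo/utils.py` and handles many more cases, so the exact tolerance arithmetic here is an assumption.

```python
import torch


def rmse(a: torch.Tensor, b: torch.Tensor) -> float:
    """Root-mean-square error between two tensors, computed in float64."""
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2)).item()


def passes_accuracy(res, ref, fp64_ref, multiplier: float = 3.0, tol: float = 1e-2) -> bool:
    """Illustrative approximation of the RMSE gate behind `fail_accuracy`.

    `res` is the compiled output, `ref` the eager output, `fp64_ref` the eager
    output recomputed in float64; the compiled error may exceed the eager error
    by at most `multiplier`, plus a small noise floor derived from `tol`.
    """
    res_error = rmse(res, fp64_ref)  # "RMSE (res-fp64)" in the log
    ref_error = rmse(ref, fp64_ref)  # "RMSE (ref-fp64)" in the log
    return res_error <= multiplier * ref_error + tol / 10.0


# Plugging in the beit_base_patch16_224 numbers from the log:
# 0.01333 <= 3.0 * 0.00256 + 0.001  ->  0.01333 <= 0.00868  ->  False  ->  fail_accuracy
```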

3. cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

phlippe_resnet FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T19:55:39.9818821Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:55:39.9819368Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:55:39.9819833Z cuda train phlippe_resnet
2024-05-19T19:55:59.0636307Z E0519 19:55:59.062000 139763991364224 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-19T19:55:59.0647538Z fail_accuracy
2024-05-19T19:55:59.0648209Z TIMING: entire_frame_compile:16.81323 code_gen:3.27386 inductor_compile:8.07473 backend_compile:14.78056
2024-05-19T19:55:59.0649650Z STATS: call_* op count: 75 | FakeTensor.__torch_dispatch__:2555 | FakeTensorMode.__torch_dispatch__:18639 | attempt fast:586 | fast is_contiguous:586 | ProxyTorchDispatchMode.__torch_dispatch__:4304
2024-05-19T19:55:59.0650951Z Dynamo produced 2 graphs covering 75 ops with 6 graph breaks (5 unique)
```

Accuracy failure: RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000

4. cuda12.4-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)

beit_base_patch16_224 FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T18:36:45.5837030Z loading model: 0it [00:00, ?it/s]
2024-05-19T18:36:45.5837605Z loading model: 0it [00:01, ?it/s]
2024-05-19T18:36:45.5838178Z cuda train beit_base_patch16_224
2024-05-19T18:37:17.9066455Z skipping cudagraphs due to deterministic index put. Found from :
2024-05-19T18:37:17.9067751Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/timm_models.py", line 365, in torch_dynamo_resume_in_forward_and_backward_pass_at_363
2024-05-19T18:37:17.9068822Z pred = mod(*cloned_inputs)
2024-05-19T18:37:17.9069817Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9070642Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9071509Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 427, in forward
2024-05-19T18:37:17.9075389Z x = self.forward_features(x)
2024-05-19T18:37:17.9076536Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 415, in forward_features
2024-05-19T18:37:17.9077463Z x = blk(x, shared_rel_pos_bias=rel_pos_bias)
2024-05-19T18:37:17.9078641Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9079788Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9080644Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 241, in forward
2024-05-19T18:37:17.9081701Z x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), shared_rel_pos_bias=shared_rel_pos_bias))
2024-05-19T18:37:17.9082886Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9083725Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9084896Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 149, in forward
2024-05-19T18:37:17.9085681Z rel_pos_bias = self._get_rel_pos_bias()
2024-05-19T18:37:17.9086643Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 131, in _get_rel_pos_bias
2024-05-19T18:37:17.9087542Z relative_position_bias = self.relative_position_bias_table[
2024-05-19T18:37:17.9088149Z
2024-05-19T18:37:17.9997031Z W0519 18:37:17.998000 140609516171904 torch/_logging/_internal.py:1024] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-19T18:38:05.6029062Z E0519 18:38:05.601000 140609516171904 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-19T18:38:05.6031753Z E0519 18:38:05.602000 140609516171904 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-19T18:38:05.6053020Z fail_accuracy
2024-05-19T18:38:05.6551358Z TIMING: entire_frame_compile:63.06661 code_gen:24.76042 inductor_compile:42.87834 backend_compile:52.21182
2024-05-19T18:38:05.6552640Z STATS: call_* op count: 1028 | FakeTensor.__torch_dispatch__:15453 | FakeTensorMode.__torch_dispatch__:94224 | ProxyTorchDispatchMode.__torch_dispatch__:21953
2024-05-19T18:38:05.6553843Z Dynamo produced 3 graphs covering 1028 ops with 7 graph breaks (5 unique)

```
Accuracy failure: RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000

5. [cuda12.4-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9148947323/job/25153197408)

phlippe_resnet FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T19:59:30.8771470Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:59:30.8771987Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:59:30.8772519Z cuda train phlippe_resnet
2024-05-19T19:59:42.6845686Z E0519 19:59:42.683000 140475690025600 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-19T19:59:42.6858154Z fail_accuracy
2024-05-19T19:59:42.6858869Z TIMING: entire_frame_compile:6.85443 code_gen:3.07611 inductor_compile:5.21695 backend_compile:6.00724
2024-05-19T19:59:42.6860223Z STATS: call_* op count: 75 | FakeTensor.__torch_dispatch__:2555 | FakeTensorMode.__torch_dispatch__:15743 | ProxyTorchDispatchMode.__torch_dispatch__:4304
2024-05-19T19:59:42.6861358Z Dynamo produced 2 graphs covering 75 ops with 6 graph breaks (5 unique)
```

Accuracy failure: RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000

Versions

See the GitHub workflow results for #121956.

### Tasks
- [x] Add back the disabled shards once all of these issues are fixed. Until then, keep the shards in the test matrix but disable only the affected models (see the sketch after this list). (https://github.com/pytorch/pytorch/pull/127150)
- [x] Fix the perf smoke test regression [fix unknown]
- [x] Fix the accuracy regression for beit_base_patch16_224 (2 instances) [fix unknown; "fixed" in the sense that cu121 and cu124 now behave the same (both fail); why it regressed is a separate issue]
- [x] Fix the accuracy regression for phlippe_resnet (2 instances) (fix PR: https://github.com/pytorch/pytorch/pull/123475)
- [x] Fix the gluon_inception_v3 accuracy failure or mark it as flaky (related: #127672)
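
For the first task, one way to keep the shards running while the regressions are investigated is to downgrade the expected status of just the affected models in the per-suite expected-results files, rather than removing the shards. The sketch below assumes CSVs with `name,accuracy,graph_breaks` columns under `benchmarks/dynamo/ci_expected_accuracy/`; the exact file names and column layout are assumptions for illustration, not a description of what #127150 actually changed.

```python
import csv
from pathlib import Path

# Hypothetical expected-status files for the affected suites (assumed paths).
EXPECTED_CSVS = [
    Path("benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"),
    Path("benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_training.csv"),
]

# Models hit by the CUDA 12.4 regressions tracked in this issue.
AFFECTED = {"beit_base_patch16_224", "phlippe_resnet"}


def expect_failure(csv_path: Path) -> None:
    """Rewrite an expected-status CSV so the affected models are expected to fail."""
    with csv_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or ["name", "accuracy", "graph_breaks"]
        rows = list(reader)
    for row in rows:
        if row["name"] in AFFECTED:
            # Downgrade the expectation instead of disabling the whole shard.
            row["accuracy"] = "fail_accuracy"
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    for path in EXPECTED_CSVS:
        if path.exists():
            expect_failure(path)
```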

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @ColinPeppler @amjames @desertfire

cc @eqy @Fuzzkatt @atalman @malfet @ptrblck


Labels: high priority, module: inductor, oncall: pt2, triaged
