
Conversation

@Chillee (Collaborator) commented May 15, 2024

Stack from ghstack (oldest at bottom):

[image]

@pytorch-bot bot commented May 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126320

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (17 Unrelated Failures)

As of commit 7784349 with merge base 907cb28:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/inductor and release notes: fx labels May 15, 2024
Chillee added a commit that referenced this pull request May 15, 2024
ghstack-source-id: 00cb973
Pull Request resolved: #126320
@github-actions bot requested a review from ezyang May 15, 2024 20:08
@Chillee (Collaborator, Author) commented Jun 7, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@Chillee (Collaborator, Author) commented Jun 8, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 17 checks: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu), pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu, unstable), trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.24xl.spr-metal, unstable), inductor / cuda12.4-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.4-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_huggingface, 1, 1, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)


@Chillee (Collaborator, Author) commented Jun 8, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the same 17 checks listed in the previous merge attempt.


@Chillee (Collaborator, Author) commented Jun 8, 2024

@pytorchbot merge -f "failures unrelated"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.


huydhn added a commit to pytorch/test-infra that referenced this pull request Jun 11, 2024
I have seen recent cases where Dr.CI returns inconsistent results: a failure was treated as a new failure while others that looked exactly the same were classified as either unstable or flaky. This happens because 1) the search for similar flaky failures can miss some, and 2) the oncall didn't open unstable issues for all cases. So, the fix here is to add a post-processing step ensuring that a failure is not reported as new if it matches one that has already been classified as flaky or unstable.

I skip the generic GHA error to avoid false positives.
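
A minimal Python sketch of that post-processing idea (the actual Dr.CI implementation lives in pytorch/test-infra and is not reproduced here; the data shape, function name, and the generic-error string below are illustrative assumptions):

```python
# Illustrative sketch only: downgrade a "new" failure when its failure line
# matches a job already classified as flaky or unstable, skipping generic
# GHA errors so they don't cause false-positive matches.

# Hypothetical example of a generic GHA error string (assumption, not Dr.CI's list).
GENERIC_GHA_ERRORS = {"Process completed with exit code 1."}

def reclassify_new_failures(jobs):
    """jobs: list of dicts with 'name', 'failure_line', and 'classification'."""
    known_benign = {
        job["failure_line"]
        for job in jobs
        if job["classification"] in ("FLAKY", "UNSTABLE")
        and job["failure_line"] not in GENERIC_GHA_ERRORS
    }
    for job in jobs:
        if job["classification"] == "NEW" and job["failure_line"] in known_benign:
            # Matches an already-known benign failure, so don't report it as new.
            job["classification"] = "UNSTABLE"
    return jobs

jobs = [
    {"name": "manywheel-py3_8-cuda11_8-test", "failure_line": "ImportError: libcudnn.so.8", "classification": "NEW"},
    {"name": "manywheel-py3_8-cuda12_1-test", "failure_line": "ImportError: libcudnn.so.8", "classification": "UNSTABLE"},
]
print(reclassify_new_failures(jobs))  # the first job is downgraded to UNSTABLE
```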

For example,

pytorch/pytorch#128005

## ❌ 1 New Failure, 4 Unrelated Failures
As of commit a80088bd7f862e2f28836eedfc4f88af576897aa with merge base
b054470db22a6c8ecba31c44ce54b9ca48159cdd (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1717564884?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details open><summary><b>NEW FAILURE</b> - The following job has
failed:</summary><p>

* [linux-binary-manywheel / manywheel-py3_8-cuda11_8-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25871506583)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393196399/job/25871506583))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
</p></details>
<details ><summary><b>BROKEN TRUNK</b> - The following job failed but was present on the merge base:</summary><p>👉 <b>Rebase onto the `viable/strict` branch to avoid these failures</b></p><p>

* [pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5,
linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25825892375)
([gh](https://github.com/pytorch/pytorch/actions/runs/9379469649/job/25825892375))
([trunk
failure](https://hud.pytorch.org/pytorch/pytorch/commit/b054470db22a6c8ecba31c44ce54b9ca48159cdd#25823095706))
    `FAILED`
</p></details>
<details ><summary><b>UNSTABLE</b> - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:</summary><p>

* [linux-binary-manywheel / manywheel-py3_8-cuda12_1-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25871273727)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393196399/job/25871273727))
([#127288](https://hud.pytorch.org/pytorch/pytorch/issues/127288))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_4-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25871125249)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393196399/job/25871125249))
([#127289](https://hud.pytorch.org/pytorch/pytorch/issues/127289))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 5, 5,
linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25825958264)
([gh](https://github.com/pytorch/pytorch/actions/runs/9379469649/job/25825958264))
()
    `FAILED`
</p></details>

pytorch/pytorch#126320

## ❌ 1 New Failure, 8 Unrelated Failures
As of commit ef180dc28d26feb93972c19ef2386ffb79000b7c with merge base
907cb28f676a6d3f44d6f3a2503c56888ebecc93 (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1717542403?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details open><summary><b>NEW FAILURE</b> - The following job has
failed:</summary><p>

* [pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5,
linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854353839)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388111130/job/25854353839))

`inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda`
</p></details>
<details ><summary><b>FLAKY</b> - The following jobs failed but were
likely due to flakiness present on trunk:</summary><p>

* [linux-binary-manywheel / manywheel-py3_8-cuda11_8-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25873256459)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973448/job/25873256459))
([similar
failure](https://hud.pytorch.org/pytorch/pytorch/commit/96d5590c616b8249dbf21f2b0807dacad02aa793#25849670354))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [trunk / macos-py3-arm64 / test (default, 1, 3,
macos-m1-stable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25871052640)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973490/job/25871052640))
([similar
failure](https://hud.pytorch.org/pytorch/pytorch/commit/72c0a66585937de6e9fba6265c0d6dc5cb0a7889#25818105571))

`'test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_nested_list_out_dynamic_shapes'`
</p></details>
<details ><summary><b>UNSTABLE</b> - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:</summary><p>

* [inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm,
1, 2, linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854443831)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388112520/job/25854443831))
([#127438](https://hud.pytorch.org/pytorch/pytorch/issues/127438))
    `beit_base_patch16_224`
* [inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2,
linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854442295)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388112520/job/25854442295))
([#126884](https://hud.pytorch.org/pytorch/pytorch/issues/126884))
    `beit_base_patch16_224`
* [inductor / cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm,
1, 2, linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854507171)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388112520/job/25854507171))
([#127680](https://hud.pytorch.org/pytorch/pytorch/issues/127680))
    `cspdarknet53`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_1-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25873101404)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973448/job/25873101404))
([#127288](https://hud.pytorch.org/pytorch/pytorch/issues/127288))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_4-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25872980718)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973448/job/25872980718))
([#127289](https://hud.pytorch.org/pytorch/pytorch/issues/127289))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5,
linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854427220)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388111130/job/25854427220))
()

`inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda`
</p></details>

### Testing

pytorch/pytorch#128005

## ✅ You can merge normally! (5 Unrelated Failures)
As of commit a80088bd7f862e2f28836eedfc4f88af576897aa with merge base
b054470db22a6c8ecba31c44ce54b9ca48159cdd (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1717564884?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details ><summary><b>BROKEN TRUNK</b> - The following job failed but was present on the merge base:</summary><p>👉 <b>Rebase onto the `viable/strict` branch to avoid these failures</b></p><p>

* [pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5,
linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25825892375)
([gh](https://github.com/pytorch/pytorch/actions/runs/9379469649/job/25825892375))
([trunk
failure](https://hud.pytorch.org/pytorch/pytorch/commit/b054470db22a6c8ecba31c44ce54b9ca48159cdd#25823095706))
    `FAILED`
</p></details>
<details ><summary><b>UNSTABLE</b> - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:</summary><p>

* [linux-binary-manywheel / manywheel-py3_8-cuda11_8-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25871506583)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393196399/job/25871506583))
([related
job](https://hud.pytorch.org/pytorch/pytorch/commit/a80088bd7f862e2f28836eedfc4f88af576897aa#25871273727))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_1-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25871273727)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393196399/job/25871273727))
([#127288](https://hud.pytorch.org/pytorch/pytorch/issues/127288))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_4-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25871125249)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393196399/job/25871125249))
([#127289](https://hud.pytorch.org/pytorch/pytorch/issues/127289))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 5, 5,
linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/128005#25825958264)
([gh](https://github.com/pytorch/pytorch/actions/runs/9379469649/job/25825958264))
()
    `FAILED`
</p></details>

pytorch/pytorch#126320

## ✅ You can merge normally! (9 Unrelated Failures)
As of commit ef180dc28d26feb93972c19ef2386ffb79000b7c with merge base
907cb28f676a6d3f44d6f3a2503c56888ebecc93 (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1717542403?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details ><summary><b>FLAKY</b> - The following jobs failed but were
likely due to flakiness present on trunk:</summary><p>

* [linux-binary-manywheel / manywheel-py3_8-cuda11_8-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25873256459)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973448/job/25873256459))
([similar
failure](https://hud.pytorch.org/pytorch/pytorch/commit/96d5590c616b8249dbf21f2b0807dacad02aa793#25849670354))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [trunk / macos-py3-arm64 / test (default, 1, 3,
macos-m1-stable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25871052640)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973490/job/25871052640))
([similar
failure](https://hud.pytorch.org/pytorch/pytorch/commit/72c0a66585937de6e9fba6265c0d6dc5cb0a7889#25818105571))

`'test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_nested_list_out_dynamic_shapes'`
</p></details>
<details ><summary><b>UNSTABLE</b> - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:</summary><p>

* [inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm,
1, 2, linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854443831)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388112520/job/25854443831))
([#127438](https://hud.pytorch.org/pytorch/pytorch/issues/127438))
    `beit_base_patch16_224`
* [inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2,
linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854442295)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388112520/job/25854442295))
([#126884](https://hud.pytorch.org/pytorch/pytorch/issues/126884))
    `beit_base_patch16_224`
* [inductor / cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm,
1, 2, linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854507171)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388112520/job/25854507171))
([#127680](https://hud.pytorch.org/pytorch/pytorch/issues/127680))
    `cspdarknet53`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_1-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25873101404)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973448/job/25873101404))
([#127288](https://hud.pytorch.org/pytorch/pytorch/issues/127288))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [linux-binary-manywheel / manywheel-py3_8-cuda12_4-test /
test](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25872980718)
([gh](https://github.com/pytorch/pytorch/actions/runs/9393973448/job/25872980718))
([#127289](https://hud.pytorch.org/pytorch/pytorch/issues/127289))
`ImportError: libcudnn.so.8: cannot open shared object file: No such
file or directory`
* [pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5,
linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854353839)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388111130/job/25854353839))
([related
job](https://hud.pytorch.org/pytorch/pytorch/commit/ef180dc28d26feb93972c19ef2386ffb79000b7c#25854427220))

`inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda`
* [pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5,
linux.g5.4xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/126320#25854427220)
([gh](https://github.com/pytorch/pytorch/actions/runs/9388111130/job/25854427220))
()

`inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda`
</p></details>
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024
@github-actions bot deleted the gh/chillee/291/head branch August 8, 2024 01:58
lw added a commit that referenced this pull request Mar 6, 2025
This was added in #126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range and thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points). I also add the rank to the filename and store the output in a user-defined directory.

ghstack-source-id: 4deba4a
Pull Request resolved: #148678
pytorchmergebot pushed a commit that referenced this pull request Mar 7, 2025
This was added in #126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range and thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points). I also add the rank to the filename and store the output in a user-defined directory.

Pull Request resolved: #148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
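
The recursive threshold search mentioned in that commit message can be sketched roughly as follows. This is an illustrative Python sketch under my own assumptions (a monotone, piecewise-constant memory-vs-budget function and made-up names), not the actual partitioner code from #148678:

```python
def find_thresholds(f, lo=0.0, hi=1.0, resolution=1e-3):
    """Locate every budget value where f changes, to within `resolution`.

    Assumes f is piecewise constant and monotone in the budget, so equal
    values at both interval endpoints imply no threshold inside.
    """
    thresholds = []

    def recurse(lo, hi, f_lo, f_hi):
        if f_lo == f_hi:            # no change across this interval
            return
        if hi - lo <= resolution:   # narrow enough: record the threshold location
            thresholds.append(hi)
            return
        mid = (lo + hi) / 2
        f_mid = f(mid)
        recurse(lo, mid, f_lo, f_mid)
        recurse(mid, hi, f_mid, f_hi)

    recurse(lo, hi, f(lo), f(hi))
    return thresholds

# Toy stand-in for "memory used at a given budget": a step function with four jumps.
steps = [0.12, 0.34, 0.345, 0.8]
memory_at = lambda budget: sum(1 for s in steps if budget >= s)
print(find_thresholds(memory_at))   # four values, each within 1e-3 of a jump
```

Because the bisection only recurses into intervals whose endpoint values differ, the number of evaluations scales with the number of thresholds and the chosen resolution rather than with a fixed sample count, which is what lets it catch thresholds a fixed 21-point sampling would miss.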

Labels

ciflow/inductor, ciflow/rocm, ciflow/trunk, Merged, release notes: fx, Reverted
