@y-sq (Contributor) commented Dec 10, 2024

Summary:
Re-land the PR. The previous one was reverted because of a test failure on SM89; the fix is simply removing `xfailIfSM89`.

```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```

(Since I am trying a different solution for #141082, I moved the test case fixes out of that PR into this separate PR to land first.)


Testing the float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The fp8 test case (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing: https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0
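
For context, enabling that mode programmatically looks roughly like the sketch below. The fp8 pattern here is a hypothetical stand-in for the actual test case, and the env var corresponds to the `loop_ordering_after_fusion` inductor config flag:

```python
# Hypothetical repro sketch (assumes a CUDA build with fp8 support); the
# env var TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 maps to this flag.
import torch
import torch._inductor.config as inductor_config

inductor_config.loop_ordering_after_fusion = True


def fp8_dynamic_scale(x: torch.Tensor) -> torch.Tensor:
    # Dynamic scaling: derive the scale from the tensor's amax, then cast
    # to float8. This mirrors the shape of the pattern under test, not the
    # exact kernel sequence.
    scale = x.abs().amax().float() / torch.finfo(torch.float8_e4m3fn).max
    return (x / scale).to(torch.float8_e4m3fn)


compiled = torch.compile(fp8_dynamic_scale)
out = compiled(torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16))
```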


The main change here is to modify the condition for calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> no loop_reorder -> can't be fused because shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (since shared_data_score < config.score_fusion_memory_threshold) -> larger shared_data_score -> fused`
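
In code, the change amounts to roughly the following. This is a simplified sketch with hypothetical names; the real logic lives in the inductor scheduler's fusion pass:

```python
# Simplified sketch of the changed condition; SCORE_FUSION_MEMORY_THRESHOLD
# stands in for config.score_fusion_memory_threshold.
SCORE_FUSION_MEMORY_THRESHOLD = 10


def should_try_loop_reordering(shared_data_score: int) -> bool:
    # Before: reorder loops only when the two nodes share no data at all.
    #   return shared_data_score == 0
    # After: also reorder when some data is shared but the score is still
    # below the fusion threshold -- reordering can raise the score enough
    # for the pair to pass the threshold and fuse.
    return shared_data_score < SCORE_FUSION_MEMORY_THRESHOLD
```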


This is the same issue that was fixed in #136782, but the condition for calling `loop_reorder` may change again later, which would make the test case fail again.

Test Plan:

```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```

Also ran a float8 dynamic scaling training script to verify the change end to end.

Differential Revision: D67012816

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot (bot) commented Dec 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142474

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit a19bee6 with merge base e885225:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D67012816

@y-sq (Contributor, Author) commented Dec 10, 2024

@pytorchbot label "topic: not user facing"

pytorch-bot added the `topic: not user facing` label Dec 10, 2024
y-sq added a commit to y-sq/pytorch that referenced this pull request Dec 10, 2024
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D67012816

@huydhn (Contributor) commented Dec 11, 2024

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased export-D67012816 onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via `git checkout export-D67012816 && git pull --rebase`)

pytorchmergebot pushed a commit to y-sq/pytorch that referenced this pull request Dec 11, 2024
@huydhn (Contributor) commented Dec 11, 2024

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased export-D67012816 onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via `git checkout export-D67012816 && git pull --rebase`)

pytorchmergebot pushed a commit to y-sq/pytorch that referenced this pull request Dec 11, 2024
pytorch-bot added the `ciflow/trunk` label (trigger trunk jobs on your pull request) Dec 13, 2024
@shunting314 (Contributor)

It looks like the earlier fix #136782 got dropped during recent code refactorings. It's odd that the test failure did not show up in CI on the refactoring PRs.

…torch#142474)

Reviewed By: shunting314, sijiac

Differential Revision: D67012816
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D67012816

@facebook-github-bot (Contributor)

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-rocm / rocm6.2-py3.10-inductor / test (inductor, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team. Raised by workflow job.

@huydhn (Contributor) commented Dec 17, 2024

@pytorchbot merge -f 'ROCm failures are not related, merge this as the diff has been landed internally'

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (`-f`) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` as a last resort and instead consider `-i`/`--ignore-current` to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
