Conversation

@shunting314 (Contributor) commented Oct 23, 2024

Stack from ghstack (oldest at bottom):

When an fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces XBLOCK to be scaled up even when xnumel is small (e.g. 1), and Inductor then reports an error when it sanity-checks the resulting triton config.

The simple fix here is to not let XBLOCK grow larger than xnumel.
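For illustration only, here is a minimal sketch of the capping idea. The function `pick_xblock`, its parameters, and the scaling formula are hypothetical simplifications, not Inductor's actual heuristics or the diff in this PR:

```python
# Hypothetical, simplified model of the behavior described above --
# not Inductor's actual code.

def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (n >= 1)."""
    return 1 << (n - 1).bit_length()

def pick_xblock(xnumel: int, min_elem_per_thread: int,
                num_warps: int = 4, threads_per_warp: int = 32) -> int:
    threads = num_warps * threads_per_warp
    # fp8 kernels may request a minimum number of elements per thread,
    # which scales XBLOCK up...
    xblock = next_power_of_2(max(1, min_elem_per_thread) * threads)
    # ...but capping at xnumel keeps the config valid even when the
    # iteration space is tiny (e.g. xnumel == 1 would otherwise yield
    # XBLOCK > xnumel and fail the config sanity check).
    return min(xblock, next_power_of_2(xnumel))

assert pick_xblock(xnumel=1, min_elem_per_thread=2) == 1
assert pick_xblock(xnumel=4096, min_elem_per_thread=2) == 256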

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot bot commented Oct 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138730

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (15 Unrelated Failures)

As of commit 7df49a6 with merge base c6609ec:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Oct 23, 2024
ghstack-source-id: 32b61c2
Pull Request resolved: #138730
@Chillee (Collaborator) left a comment

Honestly, I don't really understand all of our scaling up/down code, and I've run into a couple of bugs with it before.

Is this the right way to solve it or just a patch?

@shunting314 (Contributor, Author)

> Is this the right way to solve it or just a patch?

The stacked PR fails some internal tests: https://www.internalfb.com/diff/D64796587. Those tests exist in OSS, but we don't run them in OSS CI since we don't have H100s...

This PR is a patch to quickly fix those test failures.

A fundamental fix may require auditing all the scaling rules for BLOCK sizes and every kind of override (like min_elem_per_thread).

@Chillee (Collaborator) left a comment

Can you open an issue to track refactoring our scaling logic here? I think it's pretty messy.

@shunting314 (Contributor, Author)

Sure, here is the issue: #138743. @Chillee

@shunting314 added the topic: not user facing label Oct 23, 2024
@shunting314 (Contributor, Author)

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label Oct 23, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@shunting314 (Contributor, Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 15 checks:

  • inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_freezing_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_amp_freezing_torchbench, 2, 2, linux.16xlarge.spr)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_aot_inductor_freezing_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_aot_inductor_amp_freezing_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_aot_inductor_freezing_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_aot_inductor_amp_freezing_torchbench, 2, 2, linux.12xlarge)
  • inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_freezing_avx2_torchbench, 2, 2, linux.10xlarge.avx2)
  • inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@shunting314 (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@shunting314 (Contributor, Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the same 15 checks listed above.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@shunting314 (Contributor, Author)

@pytorchbot merge -f '..'

@pytorch-bot bot commented Oct 24, 2024

You need to provide a reason for using force merge, in the format @pytorchbot merge -f 'Explanation'.
The explanation needs to be clear on why this is needed. Here are some good examples:

  • Bypass checks due to unrelated upstream failures from ...
  • This is a minor fix to ..., which shouldn't break anything
  • This is pre-tested in a previous CI run
  • Bypass flaky ... check

@shunting314 (Contributor, Author)

@pytorchbot merge -f "merge -i does not work"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
