
Conversation

@jerrymannil
Collaborator

@jerrymannil jerrymannil commented Dec 15, 2024

  • Make the io_size calculation the minimum of the input size and the output size, rather than the summation of all sizes.
    • E.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4.
    • But elems_per_thread = 8 works better for half dtypes on AMD GPUs.
  • Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively (a hedged sketch of both points follows this list).
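The sketch below is a minimal, self-contained illustration of the arithmetic behind both bullets, not the actual ATen change: ElementwiseOp, io_size_sum() and io_size_min() are hypothetical stand-ins introduced only for this example, while calc_io_size(), elems_per_thread, the dtype sizes and the dword widths come from the description above.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative stand-in for the per-element byte sizes an elementwise op
// touches; the real calc_io_size() derives these from the functor's traits.
struct ElementwiseOp {
  int num_inputs;         // e.g. 2 for torch.add
  int input_elem_bytes;   // bytes per input element (2 for bfloat16/float16)
  int output_elem_bytes;  // bytes per output element
};

// Previous behaviour, per the description: sum every input/output element
// size. For torch.add on bfloat16 this gives 2 + 2 + 2 = 6, which maps to
// elems_per_thread = 4.
int io_size_sum(const ElementwiseOp& op) {
  return op.num_inputs * op.input_elem_bytes + op.output_elem_bytes;
}

// New behaviour: take the minimum of the total input size and the output
// size, so half-dtype ops report a smaller io_size and can be assigned
// elems_per_thread = 8, which works better on AMD GPUs.
int io_size_min(const ElementwiseOp& op) {
  return std::min(op.num_inputs * op.input_elem_bytes, op.output_elem_bytes);
}

int main() {
  const ElementwiseOp add_bf16{/*num_inputs=*/2, /*input_elem_bytes=*/2,
                               /*output_elem_bytes=*/2};
  std::printf("io_size: old = %d, new = %d\n",
              io_size_sum(add_bf16), io_size_min(add_bf16));  // old = 6

  // Second bullet: with a vector size of 8, each thread moves 8 * 2 bytes
  // = 16 bytes of 16-bit data, and with a vector size of 16 it moves
  // 16 * 1 byte = 16 bytes of 8-bit data. 16 bytes is exactly four dwords,
  // which is what allows the compiler to emit *_load_dwordx4 instructions.
  return 0;
}
```

The real elems_per_thread / vec_size selection lives in the ATen CUDA loop headers this PR touches (e.g. CUDAJitLoops.cuh); the sketch only shows why the old summation over-reports io_size for half dtypes.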

Co-author: @akadutta

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Dec 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143269

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit b779f2b with merge base 95b41d2:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pruthvistony added the topic: not user facing, ciflow/periodic, rocm, ciflow/unstable, ciflow/rocm, and ciflow/inductor-rocm labels on Dec 15, 2024
@facebook-github-bot
Contributor

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pruthvistony
Collaborator

@Mellonta,
Can you please update us on the status of the internal build?

@Mellonta
Contributor

@Mellonta, Can you please update us on the status of the internal build?

Our internal builds include all tests in OSS CI. Could you please fix them?

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed: the command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/10)
Rebasing (2/10)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12540729651

@jerrymannil
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed: the command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/15)
Rebasing (2/15)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12754592663

@jerrymannil
Collaborator Author

@Mellonta
Can you import the updated PR?

@jerrymannil
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed: the command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/10)
Rebasing (2/10)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12760808972

@jerrymannil
Collaborator Author

@Mellonta
The CI is passing now, except for 2 flaky tests.
Can you import the PR again?

@facebook-github-bot
Contributor

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerrymannil
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pruthvistony added a commit to ROCm/pytorch that referenced this pull request Jan 31, 2025
… (#1874)

* Make the io_size calculation the minimum of the input size and the output size, rather than the summation of all sizes
   * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better for half dtypes on AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Feb 7, 2025
*  Make the io_size calculation the minimum of the input size and the output size, rather than the summation of all sizes
   * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better for half dtypes on AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>
pruthvistony added a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
… (#1924)

* Make the io_size calculation the minimum of the input size and the output size, rather than the summation of all sizes
   * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better for half dtypes on AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 24, 2025
… (#1874)

* Make the io_size calculation the minimum of the input size and the output size, rather than the summation of all sizes
   * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better for half dtypes on AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>
(cherry picked from commit 4686828)
