-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. #119639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119639
Note: Links to docs will display an error until the docs builds have been completed. ❌ 19 New FailuresAs of commit 0bdcec4 with merge base bcb4f7c ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
34daa3b to
f3ad2cf
Compare
43d4275 to
d31fa92
Compare
This comment was marked as resolved.
This comment was marked as resolved.
d31fa92 to
c273159
Compare
albanD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds pretty cool!
|
@tringwald We need to search and replace all occurrences of std::numerical_limits<c10::DeviceIndex>::max() with C10_COMPILE_TIME_MAX_GPUS |
|
Maybe we should just replace all |
It's better to use a single macro, and it is even better to change it into a constexpr variable to avoid possible upgrading issues in the future. |
Something like this maybe? // c10/core/Device.h
namespace c10 {
constexpr DeviceIndex MAX_NUM_DEVICES = 512;
struct C10_API Device final {
// ...
}
}or // c10/core/Device.h
namespace c10 {
struct C10_API Device final {
static constexpr DeviceIndex MAX_NUM_DEVICES = 512;
}
} |
|
IMO, the first case is better |
|
@pytorchbot rebase |
|
@pytorchbot label ciflow/binaries |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
|
Hey @tringwald @cyyever @albanD @kit1980 @huydhn Is there a plan for when this change can be merged? The 120 GPU limit is restrictive. cc: @jeffdaily @jithunnair-amd I see the dependent PR is mostly approved #122527 but I assume there is still internal testing ongoing for this? And do we actually need the full removal of Caffe2 to move with the int16_t change? |
albanD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ho that one dropped yes.
Only a small nit but sounds good otherwise I think
|
@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
albanD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few of the APIs being removed are used, so this needs manual fixing before it can land.
Are you referring to Meta-internal code? I think @kit1980 has done some work on the Meta side. |
I could not figure out one item and then caffe2 desync caused additional issues. |
2dc74b5 to
9ab9ba7
Compare
…ex. Changed some core JIT structures to accommodate the new 16 bit DeviceIndex. Added tests. Updated bounds checks.
…rning for unexpected behavior.
… device affiliation map from uint8_t to DeviceIndex.
…e a problem anymore when the internal code base also uses int16_t.
It was removed before.
9ab9ba7 to
0bdcec4
Compare
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
# Movitation refer to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](#119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch. # Solution Use `TORCH_CHECK` to check if the number of XPU devices exceeds `c10::Device::MAX_NUM_DEVICES` when enum XPU devices. Pull Request resolved: #120768 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
# Movitation refer to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](pytorch#119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch. # Solution Use `TORCH_CHECK` to check if the number of XPU devices exceeds `c10::Device::MAX_NUM_DEVICES` when enum XPU devices. Pull Request resolved: pytorch#120768 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
# Movitation refer to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](pytorch#119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch. # Solution Use `TORCH_CHECK` to check if the number of XPU devices exceeds `c10::Device::MAX_NUM_DEVICES` when enum XPU devices. Pull Request resolved: pytorch#120768 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
Fixes #115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer single nodes with a large device count. Until now,
DeviceIndexwas anint8_t, thus multiple changes were necessary:DeviceIndexchanged toint16_t. Updated consumers that assume it to be anint8_t.torch.device()in the Python frontend. Right now, we allow funny things liketorch.device('cpu', 200).index == -56, which is undefined behavior. I inserted some checks to only allow values between 0 andc10::Device::MAX_NUM_DEVICES - 1.ArgumentInfostruct as it hardcodes the device index as 8 bit field 1. Might be a breaking change, not sure if users rely on this.c10::Device::MAX_NUM_DEVICESas a replacement for the oldC10_COMPILE_TIME_MAX_GPUScc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @penguinwu @tianyu-l @yf225
Footnotes
This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the
ArgumentInfostruct. When I switched theDeviceIndextoint16_t, it actually stayed 255 after unpacking fromArgumentInfoagain, as theDeviceIndexwas now wide enough that it didn't wrap back to -1. ↩