
Conversation

@tringwald
Collaborator

@tringwald tringwald commented Feb 10, 2024

Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer single nodes with a large device count. Until now, DeviceIndex was an int8_t, thus multiple changes were necessary:

  • DeviceIndex changed to int16_t. Updated consumers that assume it to be an int8_t.
  • Updated bounds checking for torch.device() in the Python frontend. Right now, we allow funny things like torch.device('cpu', 200).index == -56, which is undefined behavior. I inserted some checks to only allow values between 0 and c10::Device::MAX_NUM_DEVICES - 1.
  • Updated the ArgumentInfo struct, as it hardcodes the device index as an 8-bit field.¹ This might be a breaking change; not sure if users rely on this.
  • Introduced c10::Device::MAX_NUM_DEVICES as a replacement for the old C10_COMPILE_TIME_MAX_GPUS.
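The bounds checking described above can be sketched as a standalone snippet. Note that `checked_device_index` and the local `MAX_NUM_DEVICES` are illustrative stand-ins, not the exact code from this PR:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative stand-ins for c10::DeviceIndex (now int16_t) and
// c10::Device::MAX_NUM_DEVICES (512) as introduced in this PR.
using DeviceIndex = int16_t;
constexpr DeviceIndex MAX_NUM_DEVICES = 512;

// Validate a user-supplied device index; -1 is the "unset" sentinel.
// Anything outside [-1, MAX_NUM_DEVICES - 1] is rejected instead of
// being silently wrapped the way the old int8_t index allowed.
DeviceIndex checked_device_index(int64_t index) {
    if (index < -1 || index >= MAX_NUM_DEVICES) {
        throw std::out_of_range(
            "device index must be in [0, " +
            std::to_string(MAX_NUM_DEVICES - 1) + "], got " +
            std::to_string(index));
    }
    return static_cast<DeviceIndex>(index);
}
```

With a check like this, `torch.device('cpu', 200)` would either round-trip the index exactly or fail loudly, rather than wrapping to a negative value.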

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @penguinwu @tianyu-l @yf225

Footnotes

  1. This field was unsigned, so I guess this has also been undefined behavior the whole time? Our default device index is -1, so it always wrapped around to 255 when written to the ArgumentInfo struct. When I switched DeviceIndex to int16_t, it actually stayed 255 after unpacking from ArgumentInfo, as the DeviceIndex was now wide enough that it didn't wrap back to -1.
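The wrap-around described in this footnote can be reproduced with a tiny standalone struct (a simplified stand-in for the packed ArgumentInfo layout, not the real one):

```cpp
#include <cstdint>

// Simplified stand-in for the packed ArgumentInfo layout: the device
// index lives in an unsigned 8-bit bitfield.
struct PackedArg {
    unsigned device : 8;
};

// Writing the -1 sentinel into the unsigned field stores 255. Reading it
// back through the old int8_t narrows 255 to -1 again "by accident";
// through the new int16_t it stays 255, so the sentinel has to be
// handled explicitly when packing/unpacking.
int16_t unpack_wide(const PackedArg& a) {
    return static_cast<int16_t>(a.device);   // stays 255
}

int8_t unpack_narrow(const PackedArg& a) {
    return static_cast<int8_t>(a.device);    // wraps back to -1
}
```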

@pytorch-bot

pytorch-bot bot commented Feb 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119639

Note: Links to docs will display an error until the docs builds have been completed.

❌ 19 New Failures

As of commit 0bdcec4 with merge base bcb4f7c:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tringwald tringwald force-pushed the increase-max-gpu-count branch 2 times, most recently from 34daa3b to f3ad2cf Compare February 15, 2024 12:51
@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Feb 15, 2024
@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Feb 15, 2024
@tringwald tringwald added the release notes: python_frontend python frontend release notes category label Feb 15, 2024
@tringwald tringwald force-pushed the increase-max-gpu-count branch from 43d4275 to d31fa92 Compare February 15, 2024 18:38
@tringwald tringwald changed the title from "[WIP] Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex." to "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex." Feb 15, 2024
@tringwald tringwald marked this pull request as ready for review February 15, 2024 18:38
@tringwald tringwald requested a review from albanD February 15, 2024 21:03
@tringwald

This comment was marked as resolved.

@tringwald tringwald force-pushed the increase-max-gpu-count branch from d31fa92 to c273159 Compare February 16, 2024 11:42
Collaborator

@albanD albanD left a comment


Sounds pretty cool!

@cyyever
Collaborator

cyyever commented Feb 17, 2024

@tringwald We need to search and replace all occurrences of `std::numeric_limits<c10::DeviceIndex>::max()` with `C10_COMPILE_TIME_MAX_GPUS`.

@tringwald
Collaborator Author

Maybe we should just replace all `C10_COMPILE_TIME_MAX_GPUS` with the new `C10_MAX_NUM_DEVICES` if we want to keep the device count consistent for all device types.

@cyyever
Collaborator

cyyever commented Feb 17, 2024

> Maybe we should just replace all C10_COMPILE_TIME_MAX_GPUS with the new C10_MAX_NUM_DEVICES if we want to keep the device count consistent for all device types.

It's better to use a single macro, and it is even better to change it into a constexpr variable to avoid possible upgrading issues in the future.

@tringwald
Collaborator Author

> Maybe we should just replace all C10_COMPILE_TIME_MAX_GPUS with the new C10_MAX_NUM_DEVICES if we want to keep the device count consistent for all device types.

> It's better to use a single macro, and it is even better to change it into a constexpr variable to avoid possible upgrade issues in the future.

Something like this maybe?

```cpp
// c10/core/Device.h
namespace c10 {

constexpr DeviceIndex MAX_NUM_DEVICES = 512;

struct C10_API Device final {
    // ...
};

} // namespace c10
```

or

```cpp
// c10/core/Device.h
namespace c10 {

struct C10_API Device final {
    static constexpr DeviceIndex MAX_NUM_DEVICES = 512;
};

} // namespace c10
```

@cyyever
Collaborator

cyyever commented Feb 17, 2024

IMO, the first case is better

@cyyever
Collaborator

cyyever commented Feb 18, 2024

@pytorchbot rebase

@cyyever
Collaborator

cyyever commented Feb 18, 2024

@pytorchbot label ciflow/binaries

@pytorch-bot pytorch-bot bot added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Feb 18, 2024
@cyyever cyyever self-requested a review February 18, 2024 00:34
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased increase-max-gpu-count onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout increase-max-gpu-count && git pull --rebase)

@jataylo
Collaborator

jataylo commented Jul 22, 2024

Hey @tringwald @cyyever @albanD @kit1980 @huydhn

Is there a plan for when this change can be merged? The 120 GPU limit is restrictive. cc: @jeffdaily @jithunnair-amd

I see the dependent PR #122527 is mostly approved, but I assume internal testing is still ongoing for this? And do we actually need the full removal of Caffe2 to move forward with the int16_t change?

Collaborator

@albanD albanD left a comment


Ho that one dropped yes.
Only a small nit but sounds good otherwise I think

@facebook-github-bot
Contributor

@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

@albanD albanD left a comment


A few of the APIs being removed are used, so this needs manual fixing before it can land.

@tringwald
Collaborator Author

> A few of the APIs being removed are used, so this needs manual fixing before it can land.

Are you referring to Meta-internal code? I think @kit1980 has done some work on the Meta side.

@kit1980
Contributor

kit1980 commented Jul 22, 2024

> I think @kit1980 has done some work on the Meta side.

I could not figure out one item, and then the caffe2 desync caused additional issues.

@tringwald tringwald force-pushed the increase-max-gpu-count branch from 2dc74b5 to 9ab9ba7 Compare August 1, 2024 21:17
@tringwald tringwald requested a review from syed-ahmed as a code owner August 1, 2024 21:17
tringwald and others added 7 commits August 2, 2024 21:01
…ex. Changed some core JIT structures to accommodate the new 16 bit DeviceIndex. Added tests. Updated bounds checks.
… device affiliation map from uint8_t to DeviceIndex.
…e a problem anymore when the internal code base also uses int16_t.
@tringwald tringwald force-pushed the increase-max-gpu-count branch from 9ab9ba7 to 0bdcec4 Compare August 2, 2024 19:04
@github-actions
Contributor

github-actions bot commented Oct 2, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Oct 2, 2024
@github-actions github-actions bot closed this Nov 1, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 12, 2024
# Motivation
Referring to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](#119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch.

# Solution
Use `TORCH_CHECK` to verify that the number of XPU devices does not exceed `c10::Device::MAX_NUM_DEVICES` when enumerating XPU devices.
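Such an enumeration guard might look roughly like the following standalone sketch. `TORCH_CHECK_SKETCH` and `checked_device_count` are stand-ins for the real `TORCH_CHECK` macro and the actual XPU enumeration code, which are not reproduced here:

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>

// Minimal stand-in for c10's TORCH_CHECK so the sketch compiles on its
// own; the real macro raises c10::Error with source-location info.
#define TORCH_CHECK_SKETCH(cond, msg)            \
    do {                                         \
        if (!(cond)) {                           \
            std::ostringstream oss;              \
            oss << msg;                          \
            throw std::runtime_error(oss.str()); \
        }                                        \
    } while (0)

constexpr int16_t MAX_NUM_DEVICES = 512;  // c10::Device::MAX_NUM_DEVICES in the PR

// Illustrative guard: `raw_count` stands for whatever the runtime
// (e.g. SYCL for XPU) reports during device enumeration.
int16_t checked_device_count(int64_t raw_count) {
    TORCH_CHECK_SKETCH(
        raw_count >= 0 && raw_count <= MAX_NUM_DEVICES,
        "The number of XPU devices (" << raw_count
            << ") exceeds the maximum supported (" << MAX_NUM_DEVICES << ")");
    return static_cast<int16_t>(raw_count);
}
```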

Pull Request resolved: #120768
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
zero000064 pushed a commit to zero000064/pytorch that referenced this pull request Nov 14, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Development

Successfully merging this pull request may close these issues.

Please increase "Number of CUDA devices" recognized by pytorch.