
Conversation

@tringwald
Collaborator

@tringwald tringwald commented Feb 10, 2024

Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer single nodes with a large device count. Until now, DeviceIndex was an int8_t, thus multiple changes were necessary:

  • DeviceIndex changed to int16_t. Updated consumers that assume it to be an int8_t.
  • Updated bounds checking for torch.device() in the Python frontend. Right now, we allow funny things like torch.device('cpu', 200).index == -56, which is undefined behavior. I inserted some checks to only allow values between 0 and c10::Device::MAX_NUM_DEVICES - 1.
  • Updated the ArgumentInfo struct, as it hardcodes the device index as an 8-bit field.¹ This might be a breaking change; not sure if users rely on this.
  • Introduced c10::Device::MAX_NUM_DEVICES as a replacement for the old C10_COMPILE_TIME_MAX_GPUS.
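The bounds checking described above can be sketched as a standalone snippet. Note that `checked_device_index` and the local `MAX_NUM_DEVICES` are illustrative stand-ins, not the exact code from this PR:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative stand-ins for c10::DeviceIndex (now int16_t) and
// c10::Device::MAX_NUM_DEVICES (512) as introduced in this PR.
using DeviceIndex = int16_t;
constexpr DeviceIndex MAX_NUM_DEVICES = 512;

// Validate a user-supplied device index; -1 is the "unset" sentinel.
// Anything outside [-1, MAX_NUM_DEVICES - 1] is rejected instead of
// being silently wrapped the way the old int8_t index allowed.
DeviceIndex checked_device_index(int64_t index) {
    if (index < -1 || index >= MAX_NUM_DEVICES) {
        throw std::out_of_range(
            "device index must be in [0, " +
            std::to_string(MAX_NUM_DEVICES - 1) + "], got " +
            std::to_string(index));
    }
    return static_cast<DeviceIndex>(index);
}
```

With a check like this, `torch.device('cpu', 200)` would either round-trip the index exactly or fail loudly, rather than wrapping to a negative value.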

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @penguinwu @tianyu-l @yf225

Footnotes

  1. This field was unsigned, so I guess this has also been undefined behavior the whole time? Our default device index is -1, so it always wrapped around to 255 when written to the ArgumentInfo struct. When I switched DeviceIndex to int16_t, it actually stayed 255 after unpacking from ArgumentInfo, as the DeviceIndex was now wide enough that it didn't wrap back to -1.
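The wrap-around described in this footnote can be reproduced with a tiny standalone struct (a simplified stand-in for the packed ArgumentInfo layout, not the real one):

```cpp
#include <cstdint>

// Simplified stand-in for the packed ArgumentInfo layout: the device
// index lives in an unsigned 8-bit bitfield.
struct PackedArg {
    unsigned device : 8;
};

// Writing the -1 sentinel into the unsigned field stores 255. Reading it
// back through the old int8_t narrows 255 to -1 again "by accident";
// through the new int16_t it stays 255, so the sentinel has to be
// handled explicitly when packing/unpacking.
int16_t unpack_wide(const PackedArg& a) {
    return static_cast<int16_t>(a.device);   // stays 255
}

int8_t unpack_narrow(const PackedArg& a) {
    return static_cast<int8_t>(a.device);    // wraps back to -1
}
```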

@pytorch-bot

pytorch-bot bot commented Feb 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119639

Note: Links to docs will display an error until the docs builds have been completed.

❌ 19 New Failures

As of commit 0bdcec4 with merge base bcb4f7c:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tringwald tringwald force-pushed the increase-max-gpu-count branch 2 times, most recently from 34daa3b to f3ad2cf Compare February 15, 2024 12:51
@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Feb 15, 2024
@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Feb 15, 2024
@tringwald tringwald added the release notes: python_frontend python frontend release notes category label Feb 15, 2024
@tringwald tringwald force-pushed the increase-max-gpu-count branch from 43d4275 to d31fa92 Compare February 15, 2024 18:38
@tringwald tringwald changed the title from "[WIP] Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex." to "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex." Feb 15, 2024
@tringwald tringwald marked this pull request as ready for review February 15, 2024 18:38
@tringwald tringwald requested a review from albanD February 15, 2024 21:03
@tringwald

This comment was marked as resolved.

@tringwald tringwald force-pushed the increase-max-gpu-count branch from d31fa92 to c273159 Compare February 16, 2024 11:42
Collaborator

@albanD albanD left a comment


Sounds pretty cool!

@cyyever
Collaborator

cyyever commented Feb 17, 2024

@tringwald We need to search and replace all occurrences of `std::numeric_limits<c10::DeviceIndex>::max()` with `C10_COMPILE_TIME_MAX_GPUS`.

@tringwald
Collaborator Author

Maybe we should just replace all `C10_COMPILE_TIME_MAX_GPUS` with the new `C10_MAX_NUM_DEVICES` if we want to keep the device count consistent for all device types.

@cyyever
Collaborator

cyyever commented Feb 17, 2024

> Maybe we should just replace all C10_COMPILE_TIME_MAX_GPUS with the new C10_MAX_NUM_DEVICES if we want to keep the device count consistent for all device types.

It's better to use a single macro, and it is even better to change it into a constexpr variable to avoid possible upgrading issues in the future.

@tringwald
Collaborator Author

> Maybe we should just replace all C10_COMPILE_TIME_MAX_GPUS with the new C10_MAX_NUM_DEVICES if we want to keep the device count consistent for all device types.

> It's better to use a single macro, and it is even better to change it into a constexpr variable to avoid possible upgrade issues in the future.

Something like this maybe?

```cpp
// c10/core/Device.h
namespace c10 {

constexpr DeviceIndex MAX_NUM_DEVICES = 512;

struct C10_API Device final {
    // ...
};

} // namespace c10
```

or

```cpp
// c10/core/Device.h
namespace c10 {

struct C10_API Device final {
    static constexpr DeviceIndex MAX_NUM_DEVICES = 512;
};

} // namespace c10
```

@cyyever
Collaborator

cyyever commented Feb 17, 2024

IMO, the first case is better

@cyyever
Collaborator

cyyever commented Feb 18, 2024

@pytorchbot rebase

@cyyever
Collaborator

cyyever commented Feb 18, 2024

@pytorchbot label ciflow/binaries

@pytorch-bot pytorch-bot bot added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Feb 18, 2024
@cyyever cyyever self-requested a review February 18, 2024 00:34
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased increase-max-gpu-count onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout increase-max-gpu-count && git pull --rebase)

@jataylo
Collaborator

jataylo commented Jul 22, 2024

Hey @tringwald @cyyever @albanD @kit1980 @huydhn

Is there a plan for when this change can be merged? The 120 GPU limit is restrictive. cc: @jeffdaily @jithunnair-amd

I see the dependent PR #122527 is mostly approved, but I assume internal testing is still ongoing for this? And do we actually need the full removal of Caffe2 to move forward with the int16_t change?

Collaborator

@albanD albanD left a comment


Ho that one dropped yes.
Only a small nit but sounds good otherwise I think

@facebook-github-bot
Contributor

@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

@albanD albanD left a comment


A few of the APIs being removed are used, so this needs manual fixing before it can land.

@tringwald
Collaborator Author

> A few of the APIs being removed are used, so this needs manual fixing before it can land.

Are you referring to Meta-internal code? I think @kit1980 has done some work on the Meta side.

@kit1980
Contributor

kit1980 commented Jul 22, 2024

> I think @kit1980 has done some work on the Meta side.

I could not figure out one item, and then the caffe2 desync caused additional issues.

@tringwald tringwald force-pushed the increase-max-gpu-count branch from 2dc74b5 to 9ab9ba7 Compare August 1, 2024 21:17
@tringwald tringwald requested a review from syed-ahmed as a code owner August 1, 2024 21:17
tringwald and others added 7 commits August 2, 2024 21:01
…ex. Changed some core JIT structures to accommodate the new 16 bit DeviceIndex. Added tests. Updated bounds checks.
… device affiliation map from uint8_t to DeviceIndex.
…e a problem anymore when the internal code base also uses int16_t.
@tringwald tringwald force-pushed the increase-max-gpu-count branch from 9ab9ba7 to 0bdcec4 Compare August 2, 2024 19:04
@github-actions
Contributor

github-actions bot commented Oct 2, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Oct 2, 2024
@github-actions github-actions bot closed this Nov 1, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 12, 2024
# Motivation
Referring to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](#119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch.

# Solution
Use `TORCH_CHECK` to verify that the number of XPU devices does not exceed `c10::Device::MAX_NUM_DEVICES` when enumerating XPU devices.
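Such an enumeration guard might look roughly like the following standalone sketch. `TORCH_CHECK_SKETCH` and `checked_device_count` are stand-ins for the real `TORCH_CHECK` macro and the actual XPU enumeration code, which are not reproduced here:

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>

// Minimal stand-in for c10's TORCH_CHECK so the sketch compiles on its
// own; the real macro raises c10::Error with source-location info.
#define TORCH_CHECK_SKETCH(cond, msg)            \
    do {                                         \
        if (!(cond)) {                           \
            std::ostringstream oss;              \
            oss << msg;                          \
            throw std::runtime_error(oss.str()); \
        }                                        \
    } while (0)

constexpr int16_t MAX_NUM_DEVICES = 512;  // c10::Device::MAX_NUM_DEVICES in the PR

// Illustrative guard: `raw_count` stands for whatever the runtime
// (e.g. SYCL for XPU) reports during device enumeration.
int16_t checked_device_count(int64_t raw_count) {
    TORCH_CHECK_SKETCH(
        raw_count >= 0 && raw_count <= MAX_NUM_DEVICES,
        "The number of XPU devices (" << raw_count
            << ") exceeds the maximum supported (" << MAX_NUM_DEVICES << ")");
    return static_cast<int16_t>(raw_count);
}
```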

Pull Request resolved: #120768
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
zero000064 pushed a commit to zero000064/pytorch that referenced this pull request Nov 14, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Development

Successfully merging this pull request may close these issues.

Please increase "Number of CUDA devices" recognized by pytorch.