Skip to content

Change nccl get_unique_id to return a bytes string#9438

Merged
asi1024 merged 2 commits into
cupy:mainfrom
seberg:nccl-ident
Oct 29, 2025
Merged

Change nccl get_unique_id to return a bytes string#9438
asi1024 merged 2 commits into
cupy:mainfrom
seberg:nccl-ident

Conversation

@seberg
Copy link
Copy Markdown
Member

@seberg seberg commented Oct 22, 2025

This simplifies things slightly, but mostly avoids any sensitivity to systems with different char signedness.
I.e. casting the char to an integer tuple means we get negative values on many platforms (e.g. intel/AMD CPUs) while ARM system use unsigned char and would give signed values.

Using a bytes string, means that we can exchange that directly in the communicator and don't need to do the (incorrect) +128 (rather than a cast to unsigned int).

Since CuPy v14 is a major version and since nothing should change for most use-cases this seems sensible to try to just change to me.

Closes gh-9435, gh-9430

(I should probably to confirm this on an ARM system, although if anything the systems would fail probably and Cython generates memcpy and FromBytesAndSize, so it should be safe.)

@seberg seberg requested a review from a team as a code owner October 22, 2025 19:07
@seberg seberg added cat:bug Bugs no-compat Disrupts backward compatibility labels Oct 23, 2025
@kmaehashi kmaehashi added this to the v14.0.0rc1 milestone Oct 27, 2025
Copy link
Copy Markdown
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change it to return bytes +1, just need to be more careful 🙂

Comment thread cupy_backends/cuda/libs/nccl.pyx Outdated
Comment thread cupyx/distributed/_nccl_comm.py
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Oct 28, 2025

This pull request is now in conflicts. Could you fix it @seberg? 🙏

This simplifies things slightly, but mostly avoids any sensitivity
to systems with different char signedness.
I.e. casting the `char` to an integer tuple means we get negative
values on many platforms (e.g. intel/AMD CPUs) while ARM system use
unsigned char and would give signed values.

Using a bytes string, means that we can exchange that directly in the
communicator and don't need to do the (incorrect) `+128` (rather than
a cast to unsigned int).

Since CuPy v14 is a major version and since nothing should change for
most use-cases this seems sensible to try to just change to me.
@seberg
Copy link
Copy Markdown
Member Author

seberg commented Oct 28, 2025

/test mini

Copy link
Copy Markdown
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want to mention this in docs/source/upgrade.rst (#9427).

@seberg
Copy link
Copy Markdown
Member Author

seberg commented Oct 28, 2025

We will want to mention this in docs/source/upgrade.rst

Do prefer I add it here or we just gather it later since it's not much and avoids merge conflicts? (I had tagged it as no-compat.)

@leofang
Copy link
Copy Markdown
Member

leofang commented Oct 28, 2025

No opinion 🙂 But let's give @asi1024 time to take a second look before merging (since this is a no-compat).

Copy link
Copy Markdown
Member

@asi1024 asi1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this change! Thank you for your pull request!

@asi1024 asi1024 merged commit ff65c3c into cupy:main Oct 29, 2025
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cat:bug Bugs no-compat Disrupts backward compatibility

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cupy.cuda.nccl.get_unique_id() returns signed ints on x86 but unsigned on arm64

4 participants