Conversation

@Isalia20
Collaborator

@Isalia20 Isalia20 commented Oct 20, 2024

Fixes #138385.

Currently contains fixes for CPU and CUDA. Will add fixes for MPS soon as well, if my Mac can build it from source. (Had some issues with building it on my Linux PC due to limited memory.)

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Oct 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138421

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 00f402a with merge base f4ee5a2:

NEW FAILURE - The following job has failed:

  • linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test (gh)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 20, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 20, 2024
@pytorch-bot pytorch-bot bot added the release notes: mps Release notes category label Oct 20, 2024
@Isalia20
Collaborator Author

Isalia20 commented Oct 20, 2024

Fixed it for mps device as well

@Isalia20
Collaborator Author

Added a test to check that NaN inputs produce NaN outputs for softshrink. Would be happy to receive some feedback on this. It's my first time contributing, so any feedback is welcome.

@Isalia20 Isalia20 marked this pull request as ready for review October 20, 2024 13:32
@bdhirsh
Contributor

bdhirsh commented Oct 23, 2024

cc @mikaylagawarecki do you know who would be a good reviewer?

@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 23, 2024
@Isalia20
Collaborator Author

Any updates on this?

@cyyever
Collaborator

cyyever commented Oct 28, 2024

Could you format with clang-format? The indentation is wrong compared to before, which makes it hard to identify the real changes.

@Isalia20
Collaborator Author

I couldn't run clang-format. Got:

error: unknown key 'Macros'
Macros:
^~~~~~
Error reading /home/isalia/Desktop/pytorch/.clang-format: Invalid argument

But I fixed the indentation manually. If you could point me to how I can get clang-format working (which version is needed, or whether I'm missing something), I can run it.

@cyyever
Collaborator

cyyever commented Oct 28, 2024

Use clang-format 17

@mikaylagawarecki mikaylagawarecki self-requested a review October 28, 2024 15:32
Collaborator

Why multiply?

Collaborator Author

nan * 0 -> nan. Otherwise 0
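A quick plain-Python illustration of the IEEE 754 behavior this relies on (not part of the PR itself):

```python
import math

# IEEE 754: arithmetic involving NaN yields NaN, so multiplying by zero
# preserves NaN inputs while mapping every finite input to zero.
print(math.nan * 0)                   # nan
print(5.0 * 0)                        # 0.0
print(math.isnan(float("inf") * 0))   # True (inf * 0 is also NaN)
```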

@Isalia20
Collaborator Author

Use clang-format 17

I managed to run it, and only ran it on aten/src/ATen/native/cpu/Activation.cpp, but the whole file was changed. Not sure if that's intended. I just ran it with:
clang-format-17 aten/src/ATen/native/cpu/Activation.cpp > aten/src/ATen/native/cpu/Activation.cpp
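As an aside, that redirection pattern is likely why the whole file appeared changed: the shell truncates the output file before clang-format ever reads it, so `tool FILE > FILE` empties FILE. A plain-Python sketch of the same truncate-before-read effect (illustrative only; the usual fix is clang-format's in-place flag, e.g. `clang-format-17 -i path/to/file`, or redirecting to a temporary file first):

```python
import os
import tempfile

# Simulate `tool FILE > FILE`: opening the output for writing truncates it
# before the "tool" gets a chance to read the original contents.
path = os.path.join(tempfile.mkdtemp(), "demo.cpp")
with open(path, "w") as f:
    f.write("int main() { return 0; }\n")

out = open(path, "w")        # like the shell's ">": truncates immediately
data = open(path).read()     # the tool now reads an already-empty file
out.write(data)              # ...and writes nothing back
out.close()

print(repr(open(path).read()))  # '' -- the original contents are gone
```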

@Isalia20
Collaborator Author

Isalia20 commented Nov 4, 2024

any updates?

@cyyever
Collaborator

cyyever commented Nov 4, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@Isalia20
Collaborator Author

Isalia20 commented Nov 6, 2024

@cyyever Can you run the workflow?

@Isalia20
Collaborator Author

bump

@Isalia20
Collaborator Author

I guess we can merge? Anything else needed from me?

- self_val_t0 = (self_val > lambdVec) & (self_val - lambdVec);
- self_val_t1 = (self_val < -lambd_val) & (self_val + lambdVec);
+ self_val_t0 = ((self_val > lambdVec) | (self_val.isnan())) & (self_val - lambdVec);
+ self_val_t1 = ((self_val < -lambd_val) | (self_val.isnan())) & (self_val + lambdVec);
Collaborator

I don't think these changes can propagate nan

Collaborator Author

The code will propagate NaN values correctly. Previously, the comparison self_val > lambdVec always returned False when the input was NaN, because any comparison with NaN evaluated to False. This meant self_val - lambdVec wasn't propagating NaN values and instead defaulted to 0. The mask now will properly detect NaN inputs allowing self_val - lambdVec to return NaN (since + or - op with NaN results in NaN).
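The ordered-comparison behavior can be sketched in a scalar Python analogue of the vectorized masks (illustrative only; the real kernel works on SIMD lanes with bitwise blends rather than branches):

```python
import math

LAMBD = 0.5

def softshrink_old(x):
    # Old masks: every ordered comparison with NaN is False,
    # so a NaN input falls through to the zero branch.
    if x > LAMBD:
        return x - LAMBD
    if x < -LAMBD:
        return x + LAMBD
    return 0.0

def softshrink_fixed(x):
    # New masks: OR in an explicit NaN check so NaN selects the
    # subtraction branch, and NaN - lambd then propagates NaN.
    if x > LAMBD or math.isnan(x):
        return x - LAMBD
    if x < -LAMBD:
        return x + LAMBD
    return 0.0

print(softshrink_old(float("nan")))    # 0.0  (NaN silently lost)
print(softshrink_fixed(float("nan")))  # nan  (NaN propagated)
```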

Contributor

Your explanation makes sense, do you know where the changes to the vectorized path are tested?

Collaborator Author

There are tests which compare the output of the compiled softshrink with the functional version (built from just basic ops). The functional version is here, which I also changed to use multiplication so that NaN values get propagated:

@aten.softshrink.default.py_impl(DispatchKey.Autograd)
@register_decomposition(aten.softshrink)
@out_wrapper()
def softshrink(a: TensorLikeType, lambd: float = 0.5):
    # Formula for reference,
    # softshrink(x) = x - lambd if x > lambd
    #              = x + lambd if x < -lambd
    #              = 0 otherwise
    torch._check(
        lambd >= 0,
        lambda: f"lambda must be greater or equal to 0, but found to be {lambd}",
    )
    # We implement this in one torch.where to generate better code in the backward
    # see https://github.com/pytorch/pytorch/pull/107052#discussion_r1293748211
    # If none of the expressions pass we multiply by 0 for dealing with nans and infs
    return torch.where(torch.abs(a) > lambd, a - torch.sign(a) * lambd, a * 0)
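For illustration, a scalar Python analogue of that torch.where formula (math.copysign stands in for torch.sign here, which agrees with it whenever the first branch is taken, since then x is nonzero):

```python
import math

def softshrink_ref(x: float, lambd: float = 0.5) -> float:
    # Element-wise analogue of:
    #   torch.where(|a| > lambd, a - sign(a) * lambd, a * 0)
    if abs(x) > lambd:
        return x - math.copysign(lambd, x)
    # Multiplying by 0 (instead of returning a literal 0) lets NaN propagate:
    # NaN * 0 is NaN, while any x in [-lambd, lambd] gives 0.
    return x * 0

print(softshrink_ref(1.0))                       # 0.5
print(softshrink_ref(-1.0))                      # -0.5
print(softshrink_ref(0.2))                       # 0.0
print(math.isnan(softshrink_ref(float("nan"))))  # True
```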

From what I understood from reading the code, the functional version is run with different input shapes as well as different input devices and dtypes (selected by PYTORCH_OPINFO_SAMPLE_INPUT_INDEX). Shapes are defined in core.py of opinfo:

shapes = (
    # tensors with no elements
    (0,),
    (1, 0, 3),
    # zero dim (scalar) tensor
    (),
    # small 1D tensor
    (20,),
    # medium 1D tensor
    (812,),
    # large 2D tensor
    (1029, 917),
)

One such test is:

PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=14 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_ops.py TestCommonCPU.test_python_ref_torch_fallback__refs_nn_functional_softshrink_cpu_float32

or for cuda and float16(just another example):

PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=14 python test/test_ops.py TestCommonCUDA.test_python_ref_torch_fallback__refs_nn_functional_softshrink_cuda_float16

Contributor

Amazing, thank you for clarifying, and to double check, these sample inputs have nans in them?

Collaborator Author

Yes, the UnaryUfuncInfo class has an argument handles_complex_extremal_values, which is set to True by default. Extremal values are (from the comment on that same line):
# whether the op correctly handles extremal values (like nan/inf)

https://github.com/pytorch/pytorch/blob/a84779040049377aec7b62e37becb7327950541e/torch/testing/_internal/common_methods_invocations.py#L16833-L16841

I added printing of the inputs while running this test for example:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=14 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_ops.py TestCommonCPU.test_python_ref__refs_nn_functional_softshrink_cpu_float16

[screenshot: the sample inputs printed while running the test]

@Isalia20
Collaborator Author

Any other comments apart from the one above?

@cyyever cyyever requested a review from ezyang November 19, 2024 01:40

Contributor

@mikaylagawarecki mikaylagawarecki left a comment

Thanks!

@mikaylagawarecki mikaylagawarecki added the release notes: nn, topic: improvements, and topic: bug fixes labels and removed the release notes: mps and topic: improvements labels Nov 21, 2024
@mikaylagawarecki
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test

Details for Dev Infra team Raised by workflow job

@Isalia20
Collaborator Author

Hmm, not sure what to do with this failing check:

RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

Don't think it was introduced with this PR 🤔

@mikaylagawarecki
Contributor

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Fixes pytorch#138385 .

Currently contains fixes for CPU and CUDA. Will add fixes for MPS soon as well, if my Mac can build it from source. (Had some issues with building it on my Linux PC due to limited memory.)

Pull Request resolved: pytorch#138421
Approved by: https://github.com/mikaylagawarecki

Labels

ciflow/trunk Trigger trunk jobs on your pull request
Merged
module: cpu CPU specific problem (e.g., perf, algorithm)
open source
release notes: nn release notes category
topic: bug fixes topic category
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


Development

Successfully merging this pull request may close these issues.

torch.nn.functional.softshrink returns 0 on NaN input.

6 participants