Conversation

@Isalia20
Collaborator

@Isalia20 Isalia20 commented Oct 20, 2024

Fixes #138385.

Currently contains fixes for CPU and CUDA. Will add fixes for MPS soon as well, if my Mac can build it from source. (Had some issues with building it on my Linux PC due to limited memory.)

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Oct 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138421

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 00f402a with merge base f4ee5a2:

NEW FAILURE - The following job has failed:

  • linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test (gh)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 20, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 20, 2024
@pytorch-bot pytorch-bot bot added the release notes: mps Release notes category label Oct 20, 2024
@Isalia20
Collaborator Author

Isalia20 commented Oct 20, 2024

Fixed it for mps device as well

@Isalia20
Collaborator Author

Added a test to check that NaN inputs produce NaN outputs for softshrink. Would be happy to receive some feedback on this. It's my first time contributing, so any feedback is welcome.

@Isalia20 Isalia20 marked this pull request as ready for review October 20, 2024 13:32
@bdhirsh
Contributor

bdhirsh commented Oct 23, 2024

cc @mikaylagawarecki do you know who would be a good reviewer?

@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 23, 2024
@Isalia20
Collaborator Author

Any updates on this?

@cyyever
Collaborator

cyyever commented Oct 28, 2024

Could you format with clang-format? The indentation is wrong compared to before, which makes it hard to identify the real changes.

@Isalia20
Collaborator Author

I couldn't run clang-format. Got:

error: unknown key 'Macros'
Macros:
^~~~~~
Error reading /home/isalia/Desktop/pytorch/.clang-format: Invalid argument

But I fixed the indentation manually. If you could point me to how I can get clang-format working (which version is needed, or whether I'm missing something), I can run it.

@cyyever
Collaborator

cyyever commented Oct 28, 2024

Use clang-format 17

@mikaylagawarecki mikaylagawarecki self-requested a review October 28, 2024 15:32
Collaborator

Why multiply?

Collaborator Author

nan * 0 -> nan. Otherwise 0
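A quick plain-Python illustration of the IEEE 754 behavior this relies on (not part of the PR itself):

```python
import math

# IEEE 754: arithmetic involving NaN yields NaN, so multiplying by zero
# preserves NaN inputs while mapping every finite input to zero.
print(math.nan * 0)                   # nan
print(5.0 * 0)                        # 0.0
print(math.isnan(float("inf") * 0))   # True (inf * 0 is also NaN)
```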

@Isalia20
Collaborator Author

Use clang-format 17

I managed to run it, and only ran it on aten/src/ATen/native/cpu/Activation.cpp, but the whole file was changed. Not sure if that's intended. I just ran it with:
clang-format-17 aten/src/ATen/native/cpu/Activation.cpp > aten/src/ATen/native/cpu/Activation.cpp
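As an aside, that redirection pattern is likely why the whole file appeared changed: the shell truncates the output file before clang-format ever reads it, so `tool FILE > FILE` empties FILE. A plain-Python sketch of the same truncate-before-read effect (illustrative only; the usual fix is clang-format's in-place flag, e.g. `clang-format-17 -i path/to/file`, or redirecting to a temporary file first):

```python
import os
import tempfile

# Simulate `tool FILE > FILE`: opening the output for writing truncates it
# before the "tool" gets a chance to read the original contents.
path = os.path.join(tempfile.mkdtemp(), "demo.cpp")
with open(path, "w") as f:
    f.write("int main() { return 0; }\n")

out = open(path, "w")        # like the shell's ">": truncates immediately
data = open(path).read()     # the tool now reads an already-empty file
out.write(data)              # ...and writes nothing back
out.close()

print(repr(open(path).read()))  # '' -- the original contents are gone
```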

@Isalia20
Collaborator Author

Isalia20 commented Nov 4, 2024

any updates?

@cyyever
Collaborator

cyyever commented Nov 4, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@Isalia20
Collaborator Author

Isalia20 commented Nov 6, 2024

@cyyever Can you run the workflow?

@Isalia20
Collaborator Author

bump

@Isalia20
Collaborator Author

I guess we can merge? Anything else needed from me?

- self_val_t0 = (self_val > lambdVec) & (self_val - lambdVec);
- self_val_t1 = (self_val < -lambd_val) & (self_val + lambdVec);
+ self_val_t0 = ((self_val > lambdVec) | (self_val.isnan())) & (self_val - lambdVec);
+ self_val_t1 = ((self_val < -lambd_val) | (self_val.isnan())) & (self_val + lambdVec);
Collaborator

I don't think these changes can propagate nan

Collaborator Author

The code will propagate NaN values correctly. Previously, the comparison self_val > lambdVec always returned False when the input was NaN, because any comparison with NaN evaluated to False. This meant self_val - lambdVec wasn't propagating NaN values and instead defaulted to 0. The mask now will properly detect NaN inputs allowing self_val - lambdVec to return NaN (since + or - op with NaN results in NaN).
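The ordered-comparison behavior can be sketched in a scalar Python analogue of the vectorized masks (illustrative only; the real kernel works on SIMD lanes with bitwise blends rather than branches):

```python
import math

LAMBD = 0.5

def softshrink_old(x):
    # Old masks: every ordered comparison with NaN is False,
    # so a NaN input falls through to the zero branch.
    if x > LAMBD:
        return x - LAMBD
    if x < -LAMBD:
        return x + LAMBD
    return 0.0

def softshrink_fixed(x):
    # New masks: OR in an explicit NaN check so NaN selects the
    # subtraction branch, and NaN - lambd then propagates NaN.
    if x > LAMBD or math.isnan(x):
        return x - LAMBD
    if x < -LAMBD:
        return x + LAMBD
    return 0.0

print(softshrink_old(float("nan")))    # 0.0  (NaN silently lost)
print(softshrink_fixed(float("nan")))  # nan  (NaN propagated)
```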

Contributor

Your explanation makes sense, do you know where the changes to the vectorized path are tested?

Collaborator Author

There are tests which compare the output of the compiled softshrink with the functional version (built from just basic ops). The functional version is here, which I also changed to use multiplication so that NaN values get propagated:

@aten.softshrink.default.py_impl(DispatchKey.Autograd)
@register_decomposition(aten.softshrink)
@out_wrapper()
def softshrink(a: TensorLikeType, lambd: float = 0.5):
    # Formula for reference,
    # softshrink(x) = x - lambd if x > lambd
    #              = x + lambd if x < -lambd
    #              = 0 otherwise
    torch._check(
        lambd >= 0,
        lambda: f"lambda must be greater or equal to 0, but found to be {lambd}",
    )
    # We implement this in one torch.where to generate better code in the backward
    # see https://github.com/pytorch/pytorch/pull/107052#discussion_r1293748211
    # If none of the expressions pass we multiply by 0 for dealing with nans and infs
    return torch.where(torch.abs(a) > lambd, a - torch.sign(a) * lambd, a * 0)
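For illustration, a scalar Python analogue of that torch.where formula (math.copysign stands in for torch.sign here, which agrees with it whenever the first branch is taken, since then x is nonzero):

```python
import math

def softshrink_ref(x: float, lambd: float = 0.5) -> float:
    # Element-wise analogue of:
    #   torch.where(|a| > lambd, a - sign(a) * lambd, a * 0)
    if abs(x) > lambd:
        return x - math.copysign(lambd, x)
    # Multiplying by 0 (instead of returning a literal 0) lets NaN propagate:
    # NaN * 0 is NaN, while any x in [-lambd, lambd] gives 0.
    return x * 0

print(softshrink_ref(1.0))                       # 0.5
print(softshrink_ref(-1.0))                      # -0.5
print(softshrink_ref(0.2))                       # 0.0
print(math.isnan(softshrink_ref(float("nan"))))  # True
```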

From what I understood from reading the code, the functional version is run with different input shapes as well as different input devices and dtypes (selected by PYTORCH_OPINFO_SAMPLE_INPUT_INDEX). Shapes are defined in core.py of opinfo:

shapes = (
    # tensors with no elements
    (0,),
    (1, 0, 3),
    # zero dim (scalar) tensor
    (),
    # small 1D tensor
    (20,),
    # medium 1D tensor
    (812,),
    # large 2D tensor
    (1029, 917),
)

One such test is:

PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=14 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_ops.py TestCommonCPU.test_python_ref_torch_fallback__refs_nn_functional_softshrink_cpu_float32

or for cuda and float16(just another example):

PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=14 python test/test_ops.py TestCommonCUDA.test_python_ref_torch_fallback__refs_nn_functional_softshrink_cuda_float16

Contributor

Amazing, thank you for clarifying, and to double check, these sample inputs have nans in them?

Collaborator Author

Yes, the UnaryUfuncInfo class has an argument handles_complex_extremal_values, which is set to True by default. Extremal values are (from the comment on that same line):
# whether the op correctly handles extremal values (like nan/inf)

https://github.com/pytorch/pytorch/blob/a84779040049377aec7b62e37becb7327950541e/torch/testing/_internal/common_methods_invocations.py#L16833-L16841

I added printing of the inputs while running this test for example:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=14 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_ops.py TestCommonCPU.test_python_ref__refs_nn_functional_softshrink_cpu_float16

[screenshot: the sample inputs printed while running the test]

@Isalia20
Collaborator Author

Any other comments apart from the one above?

@cyyever cyyever requested a review from ezyang November 19, 2024 01:40

Contributor

@mikaylagawarecki mikaylagawarecki left a comment

Thanks!

@mikaylagawarecki mikaylagawarecki added the release notes: nn, topic: improvements, and topic: bug fixes labels and removed the release notes: mps and topic: improvements labels Nov 21, 2024
@mikaylagawarecki
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test

Details for Dev Infra team Raised by workflow job

@Isalia20
Collaborator Author

Hmm, not sure what to do with this failing check:

RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

Don't think it was introduced with this PR 🤔

@mikaylagawarecki
Contributor

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Fixes pytorch#138385 .

Currently contains fixes for CPU and CUDA. Will add fixes for MPS soon as well, if my Mac can build it from source. (Had some issues with building it on my Linux PC due to limited memory.)

Pull Request resolved: pytorch#138421
Approved by: https://github.com/mikaylagawarecki

Labels

ciflow/trunk Trigger trunk jobs on your pull request
Merged
module: cpu CPU specific problem (e.g., perf, algorithm)
open source
release notes: nn release notes category
topic: bug fixes topic category
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


Development

Successfully merging this pull request may close these issues.

torch.nn.functional.softshrink returns 0 on NaN input.

6 participants