Conversation

@vfdev-5 could you take a look at the failed tests? For the RL examples, I cannot reproduce the failures on my own machine.

Thanks for the PR, @louis-she!

Yes, I think it is related to the version. I upgraded to the latest version of gym.

Did you check with Python 3.7?

I'm using … Can we remove the `-qq`?

Yes, let's remove `-qq`. I propose creating a separate PR for the CI fix. For the Neptune logger fix we can ping one of their folks for review.

OK, then I'll create another PR to remove the `-qq` flag.
ignite/engine/deterministic.py (outdated)

```python
# according to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```
I'm not a fan of doing that in ignite. If this call is necessary, it should be done by pytorch...
Reading the docs:

> set the debug environment variable CUBLAS_WORKSPACE_CONFIG to ":16:8" (may limit overall performance) or ":4096:8" (will increase library footprint in GPU memory by approximately 24MiB).

I do not think that we want to set a debug env variable.
@louis-she let's remove that.

EDIT: I checked the pytorch docs https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html?highlight=use_deterministic_algorithms#torch.use_deterministic_algorithms and see this suggestion:

> If one of these environment variable configurations is not set, a RuntimeError will be raised from these operations when called with CUDA tensors:
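For reference, a minimal sketch of what those docs describe (not part of the PR; the env var must be set before the first cuBLAS call, and `:4096:8` is the value quoted above):

```python
import os

# Per the cuBLAS reproducibility docs quoted above: ":16:8" may limit
# performance, ":4096:8" costs roughly 24 MiB of extra GPU memory.
# Must be set before the first cuBLAS call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

try:
    import torch
    # Per the pytorch docs: with the env var unset, deterministic mode
    # makes cuBLAS-backed ops raise a RuntimeError on CUDA tensors.
    torch.use_deterministic_algorithms(True)
    enabled = torch.are_deterministic_algorithms_enabled()
except ImportError:
    enabled = None  # torch not installed; the env var is still set
```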
ignite/engine/deterministic.py (outdated)

```python
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
```
Maybe we should set `warn_only=True` so that we do not break existing code but only raise warnings about non-deterministic implementations.
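A sketch of that suggestion, assuming a PyTorch version where `torch.use_deterministic_algorithms` accepts `warn_only`:

```python
try:
    import torch
except ImportError:
    torch = None  # sketch still loads without torch installed

if torch is not None:
    # warn_only=True emits a UserWarning instead of raising a
    # RuntimeError for non-deterministic ops, so existing user code
    # keeps running.
    torch.use_deterministic_algorithms(True, warn_only=True)
```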
Force-pushed from 69b29fd to ce1d363 (compare)
vfdev-5 left a comment:

LGTM, thanks @louis-she

@louis-she you were right, GPU tests are failing for the deterministic engine:
I do not quite understand why it fails with a RuntimeError when we asked for a warning.

Looks like it was related to multiple GPU nodes. Let me look into this.

I'm not sure if this is a bug of `torch`:

```python
import torch

torch.use_deterministic_algorithms(True, warn_only=True)
assert torch.is_deterministic_algorithms_warn_only_enabled()
torch.nn.Linear(10, 10, device="cuda")(torch.rand(1, 10, device="cuda"))
```

The last line will raise. The error will not be raised with v1.12.1 (https://github.com/pytorch/pytorch/blob/v1.12.1/aten/src/ATen/Context.cpp#L126). For v1.12.1, there is no …

@louis-she does your code sample show a warning with 1.13.0 and CUDA 11.6?

Hmm, seems like NVIDIA makes …

```python
import torch
import torchvision

warn_only = False
torch.use_deterministic_algorithms(True, warn_only=warn_only)
if warn_only:
    assert torch.is_deterministic_algorithms_warn_only_enabled()
model = torchvision.models.swin_s().cuda()
model(torch.rand(2, 3, 224, 224, device="cuda"))
```

Here are some experiment results:

#2754
The `neptune-client` has moved some APIs to their legacy package, see neptune-ai/neptune-client#1039