Conversation

@vfdev-5 could you take a look at the failed tests? For the RL examples, I cannot reproduce the failures on my own machine.

Thanks for the PR, @louis-she!

Yes, I think it is related to the version. I upgraded to the latest version of gym.

Did you check with Python 3.7?

I'm using … Can we remove the `-qq`?

Yes, let's remove `-qq`. I propose creating a separate PR for the CI fix. For the Neptune logger fix we can ping one of their folks for review.

OK, then I'll create another PR to remove the `-qq` flag.
ignite/engine/deterministic.py (outdated)

```python
# according to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```
I'm not a fan of doing that in ignite. If this call is necessary, it should be done by pytorch...
Reading the docs:

> set the debug environment variable CUBLAS_WORKSPACE_CONFIG to ":16:8" (may limit overall performance) or ":4096:8" (will increase library footprint in GPU memory by approximately 24MiB).

I do not think that we want to set a debug env variable.
@louis-she let's remove that.

EDIT: I checked the pytorch docs https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html?highlight=use_deterministic_algorithms#torch.use_deterministic_algorithms and see this suggestion:

> If one of these environment variable configurations is not set, a RuntimeError will be raised from these operations when called with CUDA tensors:
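For reference, a minimal sketch of what those docs describe (not part of the PR; the env var must be set before the first cuBLAS call, and `:4096:8` is the value quoted above):

```python
import os

# Per the cuBLAS reproducibility docs quoted above: ":16:8" may limit
# performance, ":4096:8" costs roughly 24 MiB of extra GPU memory.
# Must be set before the first cuBLAS call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

try:
    import torch
    # Per the pytorch docs: with the env var unset, deterministic mode
    # makes cuBLAS-backed ops raise a RuntimeError on CUDA tensors.
    torch.use_deterministic_algorithms(True)
    enabled = torch.are_deterministic_algorithms_enabled()
except ImportError:
    enabled = None  # torch not installed; the env var is still set
```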
ignite/engine/deterministic.py (outdated)

```python
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
```
Maybe we should set `warn_only=True` so that we do not break existing code but only raise warnings about non-deterministic implementations.
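A sketch of that suggestion, assuming a PyTorch version where `torch.use_deterministic_algorithms` accepts `warn_only`:

```python
try:
    import torch
except ImportError:
    torch = None  # sketch still loads without torch installed

if torch is not None:
    # warn_only=True emits a UserWarning instead of raising a
    # RuntimeError for non-deterministic ops, so existing user code
    # keeps running.
    torch.use_deterministic_algorithms(True, warn_only=True)
```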
Force-pushed from 69b29fd to ce1d363 (compare)
vfdev-5 left a comment:

LGTM, thanks @louis-she

@louis-she you were right, GPU tests are failing for the deterministic engine:
I do not quite understand why it fails with a RuntimeError when we asked for a warning.

Looks like it was related to multiple GPU nodes. Let me look into this.

I'm not sure if this is a bug of `torch`:

```python
import torch

torch.use_deterministic_algorithms(True, warn_only=True)
assert torch.is_deterministic_algorithms_warn_only_enabled()
torch.nn.Linear(10, 10, device="cuda")(torch.rand(1, 10, device="cuda"))
```

The last line will raise. The error will not be raised with v1.12.1 (https://github.com/pytorch/pytorch/blob/v1.12.1/aten/src/ATen/Context.cpp#L126). For v1.12.1, there is no …

@louis-she does your code sample show a warning with 1.13.0 and CUDA 11.6?

Hmm, seems like NVIDIA makes …

```python
import torch
import torchvision

warn_only = False
torch.use_deterministic_algorithms(True, warn_only=warn_only)
if warn_only:
    assert torch.is_deterministic_algorithms_warn_only_enabled()
model = torchvision.models.swin_s().cuda()
model(torch.rand(2, 3, 224, 224, device="cuda"))
```

Here are some experiment results:

#2754
The `neptune-client` has moved some APIs to their legacy package, see neptune-ai/neptune-client#1039