"Unknown: Failed to get convolution algorithm." error and how to solve it

I spent yesterday afternoon installing n2v on a new machine (Ryzen 5, RTX 2060, Ubuntu 20.04, conda) and ran into TensorFlow-related issues.
The error occurs when training is started in any of the example notebooks.
The error message was cuDNN-related (see the full traceback below), so I suspected a library version problem.
I tried several versions of tensorflow-gpu, including 1.14 and 1.15, installed with pip or with conda from the anaconda and conda-forge channels. I also tried various versions of the CUDA toolkit, as well as Python 3.6 and 3.7. None of this helped.

Although enough GPU memory was available, the problem turned out to be related to how TensorFlow manages GPU memory.
Setting the following environment variable
```
export TF_FORCE_GPU_ALLOW_GROWTH=true
```
fixed the issue. This is not specific to `n2v`; in fact, I found the answer in a thread related to DeepLabCut (https://forum.image.sc/t/could-not-create-cudnn-handle/24276/17).
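
For anyone who starts Jupyter from a launcher and cannot easily `export` the variable in the shell, the same setting can probably be applied from inside the notebook. The sketch below is not something I tested on this machine; it assumes the variable only needs to be in the process environment before TensorFlow initializes the GPU, so the safe choice is to run it in the first cell, before `n2v` or `tensorflow` is imported. The helper name `allow_gpu_memory_growth` is made up for illustration. As far as I understand, TensorFlow by default reserves nearly all GPU memory up front, and this variable makes it allocate memory on demand instead.

```python
import os
import sys


def allow_gpu_memory_growth():
    """Set TF_FORCE_GPU_ALLOW_GROWTH for the current Python process.

    Hypothetical helper, not part of n2v or TensorFlow.
    Returns True if TensorFlow had not been imported yet when the
    variable was set, and False otherwise.
    """
    tf_already_imported = "tensorflow" in sys.modules
    # Same effect as `export TF_FORCE_GPU_ALLOW_GROWTH=true` in the shell,
    # but limited to this process and the processes it starts.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return not tf_already_imported


set_early = allow_gpu_memory_growth()
if not set_early:
    # Assumption: if the GPU was already initialized, the setting
    # may come too late to have any effect.
    print("TensorFlow was already imported; restart the kernel to be safe.")

# Import n2v / TensorFlow only after this point.
```

If the message is printed, restarting the kernel and running this cell first is the simplest way to make sure the setting is picked up.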

I am putting this here so that others who run into the issue can find it. I am not sure how common it is (I did not encounter the issue when installing n2v on Windows) and whether it warrants mentioning in the README.md file.


```
Preparing validation data:   0%|          | 0/544 [00:00<?, ?it/s]

8 blind-spots will be generated per training patch of size (64, 64).

Preparing validation data: 100%|██████████| 544/544 [00:00<00:00, 1512.18it/s]

WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py:1033: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py:1020: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/csbdeep-0.5.2-py3.7.egg/csbdeep/utils/tf.py:245: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.

WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/csbdeep-0.5.2-py3.7.egg/csbdeep/utils/tf.py:273: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/csbdeep-0.5.2-py3.7.egg/csbdeep/utils/tf.py:280: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Epoch 1/10

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-10-147763b6fb69> in <module>
      1 # We are ready to start training now.
----> 2 history = model.train(X, X_val)

~/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v-0.2.1-py3.7.egg/n2v/models/n2v_standard.py in train(self, X, validation_X, epochs, steps_per_epoch)
    238         history = self.keras_model.fit_generator(generator=training_data, validation_data=(validation_X, validation_Y),
    239                                                  epochs=epochs, steps_per_epoch=steps_per_epoch,
--> 240                                                  callbacks=self.callbacks, verbose=1)
    241 
    242         if self.basedir is not None:

~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1656             use_multiprocessing=use_multiprocessing,
   1657             shuffle=shuffle,
-> 1658             initial_epoch=initial_epoch)
   1659 
   1660     @interfaces.legacy_generator_methods_support

~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    213                 outs = model.train_on_batch(x, y,
    214                                             sample_weight=sample_weight,
--> 215                                             class_weight=class_weight)
    216 
    217                 outs = to_list(outs)

~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1447             ins = x + y + sample_weights
   1448         self._make_train_function()
-> 1449         outputs = self.train_function(ins)
   1450         return unpack_singleton(outputs)
   1451 

~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2977                     return self._legacy_call(inputs)
   2978 
-> 2979             return self._call(inputs)
   2980         else:
   2981             if py_any(is_tensor(x) for x in inputs):

~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2935             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2936         else:
-> 2937             fetched = self._callable_fn(*array_vals)
   2938         return fetched[:len(self.outputs)]
   2939 

~/miniconda3/envs/n2v/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                                self._handle, args,
-> 1472                                                run_metadata_ptr)
   1473         if run_metadata:
   1474           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node channel_0down_level_0_no_0/convolution}}]]
	 [[metrics/n2v_mse/Mean/_499]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node channel_0down_level_0_no_0/convolution}}]]
0 successful operations.
0 derived errors ignored.
```

