Unable to work with TensorFlow in Docker on DGX Spark

I understand that NVIDIA Optimized TensorFlow containers will no longer be released after 25.02, so I need to build a custom Docker image on the DGX Spark. I've tried several base images, but none have worked.

Here are my files to reproduce my problem:

# Dockerfile
FROM nvcr.io/nvidia/pytorch:25.10-py3

SHELL ["/bin/bash", "-c"]
ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /workspace

# Install NVIDIA's TensorFlow wheel (with Horovod) from the NGC PyPI index
RUN python3 -m pip install nvidia-tensorflow[horovod] --extra-index-url=https://pypi.ngc.nvidia.com/

# Smoke test at container start: list the GPUs TensorFlow can see
ENTRYPOINT ["python3", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]

# docker-compose.yml
name: tensorflow-test
services:
  tensorflow-test:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    ipc: host
    restart: "no"
    runtime: nvidia
    ulimits:
      memlock: -1
      stack: 67108864

Place the two files above in the same directory, then run:

docker compose up

And here is the end of the output:

[+] Running 3/3
 ✔ tensorflow-test-tensorflow-test              Built                                                                                                                                0.0s
 ✔ Network tensorflow-test_default              Created                                                                                                                              0.0s
 ✔ Container tensorflow-test-tensorflow-test-1  Created                                                                                                                              0.1s
Attaching to tensorflow-test-1
tensorflow-test-1  | 2025-11-06 11:41:05.770923: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
tensorflow-test-1  | 2025-11-06 11:41:05.786955: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8473] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
tensorflow-test-1  | 2025-11-06 11:41:05.794465: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1471] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
tensorflow-test-1  | 2025-11-06 11:41:06.306927: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT: INTERNAL: Cannot dlopen all TensorRT libraries: FAILED_PRECONDITION: Could not load dynamic library 'libnvinfer.so.10.8.0'; dlerror: libnvinfer.so.10.8.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
tensorflow-test-1  | WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
tensorflow-test-1  | I0000 00:00:1762429267.139781       1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1  | I0000 00:00:1762429267.223135       1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1  | I0000 00:00:1762429267.225943       1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1  | [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
tensorflow-test-1 exited with code 0

It seems like TensorFlow inside the container did find the GPU, but given the errors above I'm not convinced it actually works: the cuFFT, cuDNN, and cuBLAS factories all fail to register, and TensorRT can't be loaded at all.
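That said, the messages literally say each factory "has already been registered", so this may be a double-registration issue rather than missing CUDA support. To check whether the GPU actually executes work, I could pin an op to it and log device placement; a minimal sketch of mine, assuming TF 2.x inside the container (gpu_check.py is just a name I made up):

# gpu_check.py
import tensorflow as tf

# Log which device each op runs on, so GPU placement is explicit.
tf.debugging.set_log_device_placement(True)

print("Visible GPUs:", tf.config.list_physical_devices('GPU'))

# Pin a small matmul to the first GPU. If no usable GPU kernel exists,
# this raises an error instead of silently falling back to the CPU.
with tf.device('/GPU:0'):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    c = tf.matmul(a, b)

print("Result computed on:", c.device)

If the matmul reports a /device:GPU:0 placement and no error, the registration notices are probably cosmetic.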

I don't see any threads discussing this issue, so I'm not sure whether it's a problem unique to my setup or whether no one has tried running TensorFlow in Docker here. Since TensorFlow is listed in section 2.3.4.2 of the DGX Spark User Guide, I suppose it should work.


We will discuss this with engineering and clarify our documentation as needed. But yes, NVIDIA-optimized TF containers have been EOL since 25.02.

I am able to run my TensorFlow scripts on my DGX Spark using this Docker container:

docker run --gpus all -it --rm --ipc=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -u $(id -u):$(id -g) nvcr.io/nvidia/tensorflow:25.02-tf2-py3

The iGPU version, 25.02-tf2-py3-igpu, fails to run with a cuFFT compilation error that I haven't been able to resolve.

I assume the TensorFlow scripts they're able to run are limited to the CPU, since no CUDA backend gets registered.
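One way to settle that would be to check build-time CUDA support separately from runtime GPU visibility; a tiny sketch of mine, assuming TF 2.x (cuda_check.py is a hypothetical name):

# cuda_check.py
import tensorflow as tf

# Whether this TensorFlow build was compiled with CUDA support at all.
print("Built with CUDA:", tf.test.is_built_with_cuda())

# Whether any GPU is actually visible to the runtime in this container.
print("Runtime GPUs:", tf.config.list_physical_devices('GPU'))

If both report positively, the scripts are most likely not CPU-only despite the registration errors.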

I’ve modified the Dockerfile for testing with nvcr.io/nvidia/tensorflow:25.02-tf2-py3:

FROM nvcr.io/nvidia/tensorflow:25.02-tf2-py3

SHELL ["/bin/bash", "-c"]
ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /workspace

ENTRYPOINT ["python3", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]

The result is similar:

 ✔ tensorflow-test-tensorflow-test  Built                                                                                                                                      0.0s
Attaching to tensorflow-test-1
tensorflow-test-1  | 2025-11-14 08:58:27.241191: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
tensorflow-test-1  | 2025-11-14 08:58:27.247346: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8473] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
tensorflow-test-1  | 2025-11-14 08:58:27.250185: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1471] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
tensorflow-test-1  | WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
tensorflow-test-1  | I0000 00:00:1763110708.167110       1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1  | I0000 00:00:1763110708.205785       1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1  | I0000 00:00:1763110708.208566       1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1  | [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
tensorflow-test-1 exited with code 0
tensorflow-test-1 exited with code 0

I'm not sure those "unable to register" notices are significant. I see them as well when kicking off my training scripts inside the Docker container I describe above, and while training the DGX Dashboard shows the GPU utilized at ~80%, so it appears to work despite the notices. When I attempt to use 25.02-tf2-py3-igpu instead, my scripts fail inside Model.fit() with a graph execution error: "Failed to create cuFFT batched plan with scratch allocator." Since the GB10 is an iGPU device, I'm wondering if I'm missing out on some capability or speed by sticking with the non-iGPU image.
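One thing I intend to try for the cuFFT scratch-allocator failure (purely a guess on my part, not something I've verified on the GB10) is enabling memory growth before any GPU op runs, so TensorFlow allocates GPU memory on demand instead of reserving most of it up front and possibly starving cuFFT of scratch space:

# memory_growth.py (a speculative workaround; must run before any GPU op)
import tensorflow as tf

# Allocate GPU memory incrementally instead of grabbing the whole pool at
# startup; on a unified-memory device this leaves headroom for cuFFT's
# scratch buffers.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)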

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.