I understand that NVIDIA Optimised TensorFlow containers will no longer be released after 25.02. Therefore, I need to build a custom Docker image on DGX Spark. I’ve tried several base images, but none have worked.
Here are the files that reproduce the problem:
# Dockerfile
FROM nvcr.io/nvidia/pytorch:25.10-py3
SHELL ["/bin/bash", "-c"]
ARG DEBIAN_FRONTEND=noninteractive
WORKDIR /workspace
RUN python3 -m pip install nvidia-tensorflow[horovod] --extra-index-url=https://pypi.ngc.nvidia.com/
ENTRYPOINT ["python3", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]
# docker-compose.yml
name: tensorflow-test
services:
  tensorflow-test:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    ipc: host
    restart: no
    runtime: nvidia
    ulimits:
      memlock: -1
      stack: 67108864
I placed the two files above in the same directory and ran:
docker compose up
Here is the end of the output:
[+] Running 3/3
✔ tensorflow-test-tensorflow-test Built 0.0s
✔ Network tensorflow-test_default Created 0.0s
✔ Container tensorflow-test-tensorflow-test-1 Created 0.1s
Attaching to tensorflow-test-1
tensorflow-test-1 | 2025-11-06 11:41:05.770923: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
tensorflow-test-1 | 2025-11-06 11:41:05.786955: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8473] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
tensorflow-test-1 | 2025-11-06 11:41:05.794465: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1471] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
tensorflow-test-1 | 2025-11-06 11:41:06.306927: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT: INTERNAL: Cannot dlopen all TensorRT libraries: FAILED_PRECONDITION: Could not load dynamic library 'libnvinfer.so.10.8.0'; dlerror: libnvinfer.so.10.8.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
tensorflow-test-1 | WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
tensorflow-test-1 | I0000 00:00:1762429267.139781 1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1 | I0000 00:00:1762429267.223135 1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1 | I0000 00:00:1762429267.225943 1 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
tensorflow-test-1 | [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
tensorflow-test-1 exited with code 0
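To make the output above easier to read, the lines can be grouped by severity. This is a minimal sketch, assuming the severity letter (I, W, E, or F) follows the timestamp in the TensorFlow-style lines and leads the absl-style lines, as it does in this log. The `severity` and `tally` helper names are hypothetical and not part of any library:

```python
import re
from collections import Counter

# Prefix that docker compose adds to every line, e.g. "tensorflow-test-1 | ".
COMPOSE_PREFIX = re.compile(r"^\S+\s+\|\s+")
# TensorFlow-style line: "2025-11-06 11:41:05.770923: E file.cc:485] ..."
TF_STYLE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) ")
# absl-style line: "I0000 00:00:1762429267.139781 1 file.cc:1015] ..."
ABSL_STYLE = re.compile(r"^([IWEF])\d{4} ")


def severity(line):
    """Return the severity letter of one log line, or None if it has none."""
    body = COMPOSE_PREFIX.sub("", line, count=1)
    for pattern in (TF_STYLE, ABSL_STYLE):
        match = pattern.match(body)
        if match:
            return match.group(1)
    return None


def tally(lines):
    """Count log lines per severity letter (None = not a log line)."""
    return Counter(severity(line) for line in lines)


if __name__ == "__main__":
    import sys

    # Usage (assumed): docker compose up 2>&1 | python3 tally_log.py
    for level, count in tally(sys.stdin).items():
        print(level, count)
```

Lines that carry no severity letter, such as the final `PhysicalDevice` list, are counted under `None`.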
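It seems that TensorFlow found the GPU inside the container, but it doesn’t appear to work properly: the log reports that the cuFFT, cuDNN, and cuBLAS factories could not be registered, and that the TensorRT library (libnvinfer.so.10.8.0) could not be loaded.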
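Listing the device does not show whether it can actually run anything, so one further check would be to execute a small operation on it. This is a minimal sketch, assuming TensorFlow is importable in the container and that `/GPU:0` is the device reported in the log. The `gpu_smoke_test` name is hypothetical:

```python
def gpu_smoke_test():
    """Run a small matmul on the first GPU.

    Returns None if TensorFlow is not installed, False if no GPU is
    visible, and True if the result tensor was placed on a GPU device.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return False

    # Pin the computation to the first GPU and force it to execute.
    with tf.device("/GPU:0"):
        a = tf.random.uniform((256, 256))
        b = tf.matmul(a, a)
        _ = tf.reduce_sum(b).numpy()  # pulls the value back to the host

    # In eager mode, the tensor's device string shows where it was computed.
    return "GPU" in b.device


if __name__ == "__main__":
    print("GPU smoke test result:", gpu_smoke_test())
```

If the matmul fails on the device, TensorFlow should raise an exception here instead of returning, which would be a clearer signal than the registration messages alone.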
I don’t see any threads discussing this issue, so I’m not sure whether it’s unique to my setup or whether nobody has run TensorFlow in Docker on DGX Spark yet. Since TensorFlow is listed in section 2.3.4.2 of the DGX Spark User Guide, I assume it should work.