Has anyone come up with a better or more efficient way to get the DGX Spark to do GPU training with PyTorch? I had a lot of trouble getting a version of PyTorch or NVRTC to work when trying to use the GPUs for training specifically. I'm open to suggestions if someone has a better way to make the system function as a training solution. For context, I was trying to train a YOLO model on images. Thanks for any advice you can provide.
The best I was able to put together was:
| Component | Version | Path / Location |
|---|---|---|
| Ubuntu | 22.04.4 LTS | / |
| NVIDIA Driver | 555.xx+ | /usr/lib/modules/<kernel>/kernel/drivers/video/nvidia.ko |
| CUDA Toolkit | 12.4.1 | /usr/local/cuda-12.4/ |
| cuDNN | 9.x (ships with CUDA 12.4) | /usr/local/cuda-12.4/lib64/libcudnn* |
| Python (venv) | 3.12.4 | ~/vllm-env/ |
| PyTorch | 2.5.1+cu124 | ~/vllm-env/lib/python3.12/site-packages/torch/ |
| TorchVision | 0.20.1+cu124 | same site-packages |
| YOLO / Ultralytics | 8.2.85 | ~/vllm-env/bin/yolo |
| vLLM / Transformers | 0.5.3 / 4.44.2 | site-packages |
| TensorRT (optional) | 10.3.0 | /usr/lib/x86_64-linux-gnu/ |
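For reference, here's a quick way to double-check that a live box actually matches that table (this assumes the venv path from the table, ~/vllm-env; adjust to your setup):

```bash
# Print the versions the table claims, straight from the installed system
source ~/vllm-env/bin/activate
nvcc --version | grep release
python -c "import torch, torchvision; print('torch', torch.__version__, '| torchvision', torchvision.__version__)"
python -c "import ultralytics; print('ultralytics', ultralytics.__version__)"
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
```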
```bash
# Clean out any old drivers or CUDA bits
sudo apt purge 'nvidia-*' -y
sudo apt update && sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
sudo reboot
```
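After the reboot, it's worth confirming the driver actually loaded before moving on to CUDA:

```bash
# Confirm the NVIDIA kernel module is loaded and the driver answers
lsmod | grep -i nvidia
nvidia-smi   # should list the GPU and the new driver version
```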
```bash
# CUDA 12.4 Toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-repo-ubuntu2204_12.4.1-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204_12.4.1-1_amd64.deb
sudo apt update && sudo apt install -y cuda cuda-toolkit-12-4
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version
```
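On top of nvcc, I also check that the runtime libraries are actually where the table says and that the linker can resolve them (cuDNN is distributed separately from the toolkit itself, so if libcudnn doesn't show up here it needs its own install):

```bash
# Verify the CUDA runtime and cuDNN libraries exist and are resolvable
ls /usr/local/cuda-12.4/lib64/libcudart.so* /usr/local/cuda-12.4/lib64/libcudnn.so* 2>/dev/null
ldconfig -p | grep -E 'libcudart|libcudnn'
```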
```bash
# Python 3.12 virtual environment
sudo apt install -y python3.12-venv
python3.12 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip wheel setuptools
```
```bash
# GPU-enabled PyTorch stack
pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 \
  -f https://download.pytorch.org/whl/torch_stable.html

python - <<'EOF'
import torch
print("Torch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))
EOF
```
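Because my original failures were NVRTC/kernel errors at runtime rather than CUDA simply being invisible, I now also run a smoke test that does real GPU work (a matmul plus a backward pass) instead of only checking torch.cuda.is_available(). A minimal sketch:

```bash
python - <<'EOF'
# Smoke test: launch real kernels and a backward pass on the GPU so that
# kernel/NVRTC problems show up here rather than mid-way through YOLO training.
import torch

assert torch.cuda.is_available(), "CUDA not visible to PyTorch"
x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
w = torch.randn(1024, 1024, device="cuda", requires_grad=True)
loss = (x @ w).relu().mean()
loss.backward()           # exercises autograd kernels on the GPU
torch.cuda.synchronize()  # force any lazy launches to complete
print("GPU smoke test passed, loss =", loss.item())
EOF
```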
```bash
# YOLOv8 training
pip install ultralytics==8.2.85
yolo checks
yolo train model=yolov8m.pt data=wildfire.yaml imgsz=640 epochs=100 device=0
```
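The same training run can also be kicked off from Python instead of the yolo CLI, which makes it easier to script around; wildfire.yaml is just my dataset config, swap in your own:

```bash
python - <<'EOF'
# Python-API equivalent of the yolo CLI train command above
from ultralytics import YOLO

model = YOLO("yolov8m.pt")        # same pretrained checkpoint as the CLI call
model.train(
    data="wildfire.yaml",         # dataset config, replace with yours
    imgsz=640,
    epochs=100,
    device=0,                     # GPU index, matching device=0 above
)
EOF
```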
```bash
# Optional: vLLM / TensorRT
pip install vllm==0.5.3 transformers==4.44.2 accelerate
pip install tensorrt==10.3.0
```
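If you bother with the optional vLLM install, a tiny generation run confirms it can see the GPU too; facebook/opt-125m is just a small public model used here as a stand-in to keep the download quick:

```bash
python - <<'EOF'
# Minimal vLLM sanity check (small model, one short prompt)
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=32)
outputs = llm.generate(["DGX GPU check:"], params)
print(outputs[0].outputs[0].text)
EOF
```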
```bash
nvidia-smi
# Shows python/yolo using the GPU at 90–100% during training
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Prints the detected GPU name
```
Training logs display:

```
GPU_mem: 10.4G / 24G
Speed: 2.1ms preprocess, 3.0ms inference
```
Save the script below as dgx_gpu_setup.sh, make it executable, and run it as your normal user (it calls sudo where needed, so the venv and ~/.bashrc changes land in your own home directory):

```bash
chmod +x dgx_gpu_setup.sh
./dgx_gpu_setup.sh
```
```bash
#!/bin/bash
# === DGX Spark Full GPU Training Provisioner ===
# Tested on Ubuntu 22.04 LTS | NVIDIA driver 555+ | CUDA 12.4
set -e

echo "[1/8] Updating system..."
sudo apt update && sudo apt install -y ubuntu-drivers-common curl wget python3.12-venv

echo "[2/8] Installing NVIDIA driver..."
sudo ubuntu-drivers install

echo "[3/8] Installing CUDA 12.4 toolkit..."
wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-repo-ubuntu2204_12.4.1-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204_12.4.1-1_amd64.deb
sudo apt update && sudo apt install -y cuda cuda-toolkit-12-4
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
# Sourcing ~/.bashrc does nothing in a non-interactive script, so export directly
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

echo "[4/8] Creating Python environment..."
python3.12 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip wheel setuptools

echo "[5/8] Installing GPU-enabled PyTorch stack..."
pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 \
  -f https://download.pytorch.org/whl/torch_stable.html

echo "[6/8] Installing YOLOv8 + extras..."
pip install ultralytics==8.2.85 vllm==0.5.3 transformers==4.44.2 accelerate

echo "[7/8] Testing GPU access..."
python - <<'EOF'
import torch
print("Torch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
EOF

echo "[8/8] Complete! Reboot recommended."
```