@jvstme jvstme commented Sep 13, 2025

Currently, the aws, gcp, azure, and oci backends use our custom OS images. These images use version 535 of the CUDA driver, which doesn't support newer generations of GPUs such as the NVIDIA B200.

This PR updates the scripts that build these custom OS images to bump the CUDA driver version from 535 to 570.

This PR is required for #3100 (issue #3088).

Scope:

  • Update from the 535 to the 570 family.
  • Update to Ubuntu 24.04, since Ubuntu 22.04 does not have the gcc version required for building the 570 driver.
  • Switch from proprietary to open kernel modules.
  • Since pre-Turing GPUs aren't supported by NVIDIA open kernel modules, conditionally choose between old and new dstack OS images based on the GPU name.
  • Adjust the handling of apt race conditions - the existing hack did not work on OCI's Ubuntu 24.04.
  • Install ufw when building the image - it is missing in OCI's Ubuntu 24.04.
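
The conditional image choice described in the scope list could look roughly like the sketch below. It is only an illustration, not the code from this PR: the `PRE_TURING_GPUS` set, the `get_image_name` helper, the `legacy_base_image` parameter, and the `"legacy"` tag are all hypothetical names, and falling back to the new image when the GPU name is unknown is an assumption.

```python
from typing import Optional

# Hypothetical list of pre-Turing GPU names (Kepler, Maxwell, Pascal, Volta).
# NVIDIA open kernel modules require Turing or newer, so these GPUs would
# need the old image with proprietary modules. The real list may differ.
PRE_TURING_GPUS = frozenset({"K80", "M60", "P4", "P40", "P100", "V100"})


def get_image_name(
    cuda: bool,
    gpu_name: Optional[str],
    base_image: str = "0.11rc2",        # new image version mentioned in this PR
    legacy_base_image: str = "legacy",  # placeholder tag, not a real version
) -> str:
    """Pick an OS image name based on CUDA usage and the GPU name."""
    tag = base_image
    # Assumption: an unknown GPU name (None) falls back to the new image.
    if cuda and gpu_name is not None and gpu_name.upper() in PRE_TURING_GPUS:
        tag = legacy_base_image
    prefix = "dstack-cuda" if cuda else "dstack"
    return f"{prefix}-{tag}"


if __name__ == "__main__":
    for gpu in ("T4", "V100", None):
        print(gpu, "->", get_image_name(cuda=True, gpu_name=gpu))
```

Keeping the pre-Turing names in one set means the old image can be retired later by emptying that set, without touching the naming logic.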

Notes:

Before/upon merging:

  • Build the 0.11rc2 images
  • Bump base_image in version.py
  • Test the relevant backends, especially Azure, which was only partially tested due to limitations of its staging images (see below)

Update the driver to support NVIDIA B200.


peterschmidt85 commented Sep 22, 2025

image_name = (
    f"dstack-{version.base_image}" if not cuda else f"dstack-cuda-{version.base_image}"
)
if gpu_name is None:

@peterschmidt85 peterschmidt85 Sep 23, 2025

Regarding AWS, just to confirm: the new image is only required for the very few GPU types not covered by AWS DLAMI (e.g. T4), right?


@peterschmidt85 peterschmidt85 left a comment

Regarding Azure:

Not that it's a problem, but for the Grid image (e.g., A10:4GB), we are still hard-coding the old CUDA driver version (550):

https://download.microsoft.com/download/c5319e92-672e-4067-8d85-ab66a7a64db3/NVIDIA-Linux-x86_64-550.144.06-grid-azure.run


@peterschmidt85 peterschmidt85 left a comment

After building 0.11rc2, I've tested:

  1. AWS (T4) - all works
  2. GCP (T4, L4) - all works
  3. OCI (A10) - all works
  4. Azure (T4, A10:4GB) - all works

I also tested AWS (L4, L40S), but those instances used the AWS DLAMI.

@peterschmidt85 peterschmidt85 merged commit b29c55e into master Sep 23, 2025
28 checks passed
@peterschmidt85 peterschmidt85 deleted the update_nvidia_driver_in_base_images branch September 23, 2025 15:59