Update NVIDIA driver in dstack OS images #3099
Update the driver to support NVIDIA B200.
- Update from the 535 to the 570 family.
- Update to Ubuntu 24.04, since Ubuntu 22.04 does not have the gcc version required for building the 570 driver.
- Switch from proprietary to open kernel modules.
- Since pre-Turing GPUs aren't supported by NVIDIA open kernel modules, conditionally choose between old and new dstack OS images based on the GPU name.
- Adjust the handling of `apt` race conditions - the existing hack did not work on OCI's Ubuntu 24.04.
- Install `ufw` when building the image - it is missing in OCI's Ubuntu 24.04.
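The conditional image choice could look roughly like the sketch below. It is not the code from this PR. `PRE_TURING_GPUS`, `LEGACY_BASE_IMAGE`, `requires_legacy_image`, and `get_image_name` are hypothetical names, the GPU list is an illustrative subset, and `0.10` is a placeholder for whatever the old image version is. Only the `dstack-` / `dstack-cuda-` naming scheme and the `0.11rc2` value come from this PR. Falling back to the new image when no GPU name is given is also an assumption.

```python
from typing import Optional

# Assumption: an illustrative subset of pre-Turing data center GPUs
# (Kepler, Maxwell, Pascal, Volta). NVIDIA open kernel modules need Turing or newer.
PRE_TURING_GPUS = frozenset({"K80", "M60", "P4", "P40", "P100", "V100"})

NEW_BASE_IMAGE = "0.11rc2"  # value mentioned in this PR
LEGACY_BASE_IMAGE = "0.10"  # hypothetical placeholder for the old (535 driver) image


def requires_legacy_image(gpu_name: Optional[str]) -> bool:
    """Return True if the GPU cannot use the open kernel modules."""
    if gpu_name is None:
        # Assumption: with no GPU name, default to the new image.
        return False
    return gpu_name.upper() in PRE_TURING_GPUS


def get_image_name(cuda: bool, gpu_name: Optional[str] = None) -> str:
    """Build the OS image name, choosing the old or new base image by GPU."""
    base_image = LEGACY_BASE_IMAGE if requires_legacy_image(gpu_name) else NEW_BASE_IMAGE
    prefix = "dstack-cuda" if cuda else "dstack"
    return f"{prefix}-{base_image}"
```

With this sketch, a V100 request would resolve to the legacy image, while T4 (Turing) and newer GPUs would get the new one.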
image_name = (
    f"dstack-{version.base_image}" if not cuda else f"dstack-cuda-{version.base_image}"
)
if gpu_name is None:
Regarding AWS, just to confirm: the new image is only required for the few GPU types not covered by AWS DLAMI (e.g. T4), right?
peterschmidt85
left a comment
Regarding Azure:
Not that it's a problem, but for the Grid image (e.g., A10:4GB), we are still hard-coding the old driver version (550):
https://download.microsoft.com/download/c5319e92-672e-4067-8d85-ab66a7a64db3/NVIDIA-Linux-x86_64-550.144.06-grid-azure.run
After building 0.11rc2, I've tested:
- AWS (T4) - all works
- GCP (T4, L4) - all works
- OCI (A10) - all works
- Azure (T4, A10:4GB) - all works
Also tested AWS (L4, L40S), but those runs used the AWS DLAMI rather than the new image.
Bumped `base_image` to `0.11rc2`
Updated tests
Currently, the `aws`, `gcp`, `azure`, and `oci` backends use our custom OS images. These images use the 535 version of the NVIDIA driver, which doesn't support newer generations of GPUs such as the NVIDIA B200.
This PR updates the scripts that build these custom OS images to update the driver version from 535 to 570.
This PR is required for #3100 (issue #3088).
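Scope:
- Update from the 535 to the 570 family.
- Update to Ubuntu 24.04, since Ubuntu 22.04 does not have the `gcc` version required for building the 570 driver.
- Switch from proprietary to open kernel modules.
- Since pre-Turing GPUs aren't supported by NVIDIA open kernel modules, conditionally choose between old and new dstack OS images based on the GPU name.
- Adjust the handling of `apt` race conditions - the existing hack did not work on OCI's Ubuntu 24.04.
- Install `ufw` when building the image - it is missing in OCI's Ubuntu 24.04.
Notes:
- `azure`'s Grid drivers (used for A10).
- `gcp`'s A3 OS image script
Before/upon merging:
- `base_image` in `version.py`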