Add example nccl test script for slurm on gke #4960
LAVEEN merged 1 commit into GoogleCloudPlatform:develop
Summary of Changes
Hello @ACW101, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request adds a new example to the
Code Review
This pull request adds helpful example scripts for running NCCL tests on Slurm on GKE. The scripts and documentation are a good starting point, but there are several inconsistencies and errors that need to be addressed. Most importantly, as per the repository style guide (line 33), new examples must be added to the index in examples/README.md. This change is missing and is critical for discoverability. Additionally, there is a critical syntax error in run-nccl-tests-tcpxo.sh due to a missing closing quote. I've left detailed comments on the README and shell scripts to correct these issues, as well as benchmark names, script references, container version inconsistencies, and other items to improve clarity and ensure the examples work as expected.
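A missing closing quote of the kind flagged above can usually be caught before a job is submitted by running the script through the shell's parse-only mode. The following is a minimal sketch, not taken from the PR: the `check_syntax` helper and the demo file names are hypothetical, and it assumes `bash` and `mktemp` are available on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parse a script without executing it.
# `bash -n` reads the commands and reports syntax errors only.
check_syntax() {
  if bash -n "$1" 2>/dev/null; then
    echo "ok"
  else
    echo "syntax error"
  fi
}

# Demo files (illustrative only, not the PR's scripts).
tmpdir="$(mktemp -d)"
printf 'echo "balanced quotes"\n' > "$tmpdir/good.sh"
printf 'echo "missing closing quote\n' > "$tmpdir/bad.sh"

good_result="$(check_syntax "$tmpdir/good.sh")"
bad_result="$(check_syntax "$tmpdir/bad.sh")"
echo "good.sh: $good_result"
echo "bad.sh: $bad_result"

rm -rf "$tmpdir"
```

The same check could be pointed at `run-nccl-tests-tcpxo.sh` directly; it only validates syntax and does not run any of the NCCL tests.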
Force-pushed from b823081 to 76baf7f.
Submission Checklist
Add an example NCCL test script specifically for running Slurm on GKE. It is based on the previous work in https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/machine-learning/a3-megagpu-8g/nccl-tests and https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/machine-learning/a3-ultragpu-8g.
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.