Add a post-deploy test specific to TPUs #4969
agrawalkhushi18 merged 1 commit into GoogleCloudPlatform:develop
Summary of Changes
Hello @agrawalkhushi18, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a dedicated post-deployment test for Google Kubernetes Engine (GKE) environments that use Tensor Processing Units (TPUs). The primary goal is to automatically validate that JAX applications deployed on these clusters correctly identify and report the expected number of TPU cores. This improves the reliability of TPU deployments by adding automated verification and by surfacing debugging information when the check fails.
Code Review
This pull request introduces a new post-deployment test for GKE TPUs, which involves a new Ansible playbook and updates to the build and test configurations. The changes are logical and address the goal of verifying the JAX device count. My review focuses on improving the robustness of the new Ansible playbook (test-gke-tpu.yml) by making the job and pod selection more specific. The current implementation has a risk of selecting incorrect resources if multiple jobs are present, which could lead to flaky tests. The suggested changes will make the tests more reliable.
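The selection concern raised in this review can be illustrated outside Ansible. The following is a minimal Python sketch, not the playbook's actual logic: it assumes the test has the parsed output of `kubectl get pods -o json` and that pods carry the `job-name` label that Kubernetes adds to pods created by a Job. The function name `pick_job_pod` and the job names in the usage note are hypothetical.

```python
from typing import Optional


def pick_job_pod(pod_list: dict, job_name: str) -> Optional[str]:
    """Return the name of the newest pod belonging to job_name, or None.

    pod_list is assumed to be the parsed JSON from `kubectl get pods -o json`.
    Filtering on the job's own label avoids picking up pods from unrelated
    jobs that happen to run in the same namespace.
    """
    candidates = [
        item
        for item in pod_list.get("items", [])
        if item.get("metadata", {}).get("labels", {}).get("job-name") == job_name
    ]
    if not candidates:
        return None
    # creationTimestamp is an RFC 3339 UTC string, so string order is time order.
    newest = max(candidates, key=lambda item: item["metadata"]["creationTimestamp"])
    return newest["metadata"]["name"]
```

With two jobs in the namespace, for example `pick_job_pod(pods, "tpu-test-job")`, only pods labelled for that job are considered, which is the property the review asks the playbook to guarantee.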
Three review comments on tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-tpu.yml (outdated, resolved)
Force-pushed from 72c344c to c05a06f
Force-pushed from b4e834b to e683137
Force-pushed from b0faeb6 to e3de76d
Force-pushed from e3de76d to 68ae6e0
Merged commit 5da389b into GoogleCloudPlatform:develop
The PR tests for gke-tpu-v6e and gke-tpu-7x passed successfully.
This PR adds the new test-gke-tpu.yml workflow, intended to run as a post-deployment test. It verifies and outputs the pod logs for the JAX device count.
These changes address a follow-up action from PR #4906.
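As a rough illustration of the check described here, the sketch below parses pod logs for a reported device count and compares it with the expected value. It is written in Python rather than Ansible and is not taken from test-gke-tpu.yml. The log line format (`device count: N`) and the function names are assumptions made for the example; the real workload may print the count differently.

```python
import re
from typing import Optional

# Assumed log format, e.g. "Global device count: 8". Adjust to the real output.
DEVICE_COUNT_PATTERN = re.compile(r"device count:\s*(\d+)", re.IGNORECASE)


def reported_device_count(pod_logs: str) -> Optional[int]:
    """Return the last device count found in the logs, or None if absent."""
    matches = DEVICE_COUNT_PATTERN.findall(pod_logs)
    return int(matches[-1]) if matches else None


def check_device_count(pod_logs: str, expected: int) -> None:
    """Raise AssertionError, including the logs, when the count is wrong."""
    actual = reported_device_count(pod_logs)
    if actual is None:
        raise AssertionError(f"No device count found in pod logs:\n{pod_logs}")
    if actual != expected:
        raise AssertionError(
            f"Expected {expected} devices, logs report {actual}:\n{pod_logs}"
        )
```

Including the full logs in the failure message mirrors the PR's intent of outputting the pod logs so that a failed run can be debugged from the test output alone.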
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.