
Update wait flag and resolving helm_release deadlock destruction error#5147

Merged
agrawalkhushi18 merged 10 commits into GoogleCloudPlatform:develop from
agrawalkhushi18:helm-destroy
Feb 12, 2026

Conversation

@agrawalkhushi18
Contributor

@agrawalkhushi18 agrawalkhushi18 commented Jan 28, 2026

This PR addresses transient kueue-webhook build failures caused by asynchronous Helm installs (wait=false) by enabling wait=true. To prevent the resulting destruction-time race condition where node pools are deleted before Helm resources, it introduces an explicit dependency on the system_node_pool to ensure a clean, sequential teardown.

Changes:

  • gke-cluster/outputs.tf: Added system_node_pool_id output to expose the node pool's unique identifier for dependency tracking.

  • kubectl-apply/variables.tf: Introduced system_node_pool_id variable to receive the dependency anchor from the cluster module.

  • kubectl-apply/main.tf: Forced wait = true for Kueue installation and passed the system_node_pool_id into the install_kueue module.

  • helm_install/main.tf: Implemented a tf_dependency_anchor using a Helm set block to force Terraform to honor sequential resource ordering during destruction.

  • GKE blueprints: Removed the explicit setting of the wait flag, since the module now forces it.
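Taken together, the changes above can be sketched roughly as follows. This is an illustrative sketch, not the exact code in the repository: the Google provider resource name and the module source path are assumptions.

```hcl
# gke-cluster/outputs.tf: expose the system node pool's ID for dependency tracking.
output "system_node_pool_id" {
  description = "Unique ID of the system node pool, used as a destroy-ordering anchor."
  value       = google_container_node_pool.system_node_pool.id # resource name assumed
}

# kubectl-apply/variables.tf: receive the dependency anchor from the cluster module.
variable "system_node_pool_id" {
  description = "System node pool ID; threads a destroy-ordering dependency into the Helm release."
  type        = string
  default     = null
}

# kubectl-apply/main.tf: force a synchronous install and pass the anchor through.
module "install_kueue" {
  source                    = "./helm_install" # path assumed
  wait                      = true
  deployment_dependency_ids = [var.system_node_pool_id]
}
```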

To clean up the orphaned resources created by the Ansible post-deploy test, a cleanup step has been added for Helm and Terraform so that gcluster destroy is handled gracefully in the build environment.

@agrawalkhushi18 agrawalkhushi18 added release-improvements Added to release notes under the "Improvements" heading. release-bugfix Added to release notes under the "Bug fixes" heading. labels Jan 28, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @agrawalkhushi18, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and reliability of Kubernetes cluster deployments, particularly concerning Helm chart installations and resource teardown. By enforcing synchronous Helm installations for Kueue and introducing explicit dependency tracking using the system node pool ID, it resolves issues related to transient build failures and prevents deadlock errors during the destruction of Helm releases. The changes ensure a more predictable and robust lifecycle management for cluster components.

Highlights

  • Kueue Helm Installation Reliability: The wait flag for Kueue Helm installations has been explicitly set to true. This change addresses transient build failures by ensuring that all resources are fully ready before the installation is considered complete, preventing issues caused by asynchronous Helm installs.
  • Helm Release Destruction Order: A mechanism has been introduced to enforce explicit Terraform dependency tracking for Helm releases. This prevents destruction-time race conditions, specifically addressing scenarios where node pools might be deleted before dependent Helm resources, which previously led to deadlock errors.
  • System Node Pool ID Exposure: The system_node_pool_id is now exposed as an output from the GKE cluster module. This ID is crucial for establishing dependencies and ensuring proper resource teardown order.
  • Dependency Anchoring in Helm: A tf_dependency_anchor Helm set block has been added within the helm_release resource. This block uses the deployment_dependency_ids variable to create an implicit dependency, forcing Terraform to respect the desired sequential ordering during resource destruction.
  • GKE Blueprint Updates: Various GKE blueprints have been updated to pass the newly exposed system_node_pool_id from the cluster module to the kubectl-apply module, integrating the new dependency tracking across the infrastructure definitions.
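The dependency-anchor mechanism described above can be illustrated with a minimal helm_release sketch. Everything here other than the tf_dependency_anchor key and the deployment_dependency_ids variable (which the review names) is an assumption about how the module is wired:

```hcl
resource "helm_release" "helm_install" {
  name  = var.release_name # variable names assumed
  chart = var.chart_name
  wait  = var.wait

  # Referencing the upstream resource IDs inside a chart value makes Terraform
  # treat the node pool as a dependency of this release, so on destroy the
  # release is torn down first and the node pool only afterwards.
  set {
    name  = "tf_dependency_anchor"
    value = join(",", var.deployment_dependency_ids)
  }
}
```

The value itself is never read by the chart; it exists purely so that Terraform's dependency graph orders the destroy correctly.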
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a Helm release deadlock during destruction by enforcing wait=true for Kueue installations and creating an explicit dependency on the system node pool. The use of a tf_dependency_anchor is a solid approach to manage this race condition in Terraform. The changes are well-propagated through the various modules and examples. I've included a couple of suggestions to improve the robustness of a new output and to address an inconsistency introduced by hardcoding the wait flag, which will enhance the long-term maintainability of the module.

Comment thread modules/management/kubectl-apply/main.tf
Comment thread modules/scheduler/gke-cluster/outputs.tf
@agrawalkhushi18 agrawalkhushi18 changed the title Updating wait flag and resolving helm_release deadlock destruction error Update wait flag and resolving helm_release deadlock destruction error Jan 28, 2026
@SwarnaBharathiMantena SwarnaBharathiMantena added the release-breaking-changes Prevents "smooth" re-deploy across versions label Jan 29, 2026
@SwarnaBharathiMantena
Contributor

Added release-breaking-changes label as the default value of wait is being updated.
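The breaking change flagged here can be pictured as a one-line default flip on the module's wait variable. This is a hypothetical sketch of the shape of the change, not the repository's actual variable block:

```hcl
variable "wait" {
  description = "Wait until all chart resources are ready before marking the release successful."
  type        = bool
  default     = true # previously false; existing deployments may behave differently on re-deploy
}
```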

@agrawalkhushi18 agrawalkhushi18 marked this pull request as ready for review January 30, 2026 07:04
@agrawalkhushi18 agrawalkhushi18 requested review from a team and samskillman as code owners January 30, 2026 07:04
@agrawalkhushi18 agrawalkhushi18 marked this pull request as draft January 30, 2026 08:02
@agrawalkhushi18 agrawalkhushi18 marked this pull request as ready for review February 11, 2026 07:05
Comment thread modules/management/kubectl-apply/variables.tf
Comment thread modules/management/kubectl-apply/helm_install/variables.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf Outdated
@agrawalkhushi18 agrawalkhushi18 merged commit b50773b into GoogleCloudPlatform:develop Feb 12, 2026
16 of 85 checks passed
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request Feb 17, 2026
Update wait flag and resolving helm_release deadlock destruction error (GoogleCloudPlatform#5147)

All the relevant tests passed successfully on running babysit.

Labels

release-breaking-changes Prevents "smooth" re-deploy across versions release-bugfix Added to release notes under the "Bug fixes" heading. release-improvements Added to release notes under the "Improvements" heading.

3 participants