Skip to content

fix(ci): actually tear down AWS UAT cluster (destroy → apply)#1213

Merged
mchmarny merged 5 commits into
NVIDIA:mainfrom
njhensley:fix/aws-uat-resource-leak
Jun 8, 2026
Merged

fix(ci): actually tear down AWS UAT cluster (destroy → apply)#1213
mchmarny merged 5 commits into
NVIDIA:mainfrom
njhensley:fix/aws-uat-resource-leak

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Fix the AWS UAT teardown, which was a silent no-op leaking the p5.48xlarge GPU node and holding the H100 capacity reservation (cr-0cbe491320188dfa6) across every run.

Motivation / Context

The actuator image was bumped v0.2.6 → v0.4.23. Bringup was migrated to the new apply interface, but the Destroy Cluster step still called the subcommand destroy, which no longer exists in v0.4.23. The unknown command hit urfave/cli's help path and exited 0, so the if docker run … destroy; then echo "destroyed successfully" branch matched and the loop broke — nothing was ever torn down. Destruction in this image is apply with .deployment.destroy=true (the actuator's own help: "apply — Deploy or destroy infrastructure via Terraform"). The GCP UAT sibling already does it this way.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Other: CI — .github/workflows/uat-aws.yaml (UAT teardown)

Implementation Notes

  • Set .deployment.destroy = true and invoke apply instead of the removed destroy subcommand, mirroring uat-gcp.yaml.
  • Added a destroyed flag + post-loop guard so the step fails (and surfaces a ::error::) when all 3 attempts fail. docker run inside an if condition does not trip set -e, so without this a genuine destroy failure (e.g. orphaned ENIs/SGs blocking VPC deletion) would still exit 0 and re-leak silently.
  • Behavior is gated by the existing skip_delete input and steps.infra.outcome != 'skipped', unchanged.

Testing

yamllint .github/workflows/uat-aws.yaml   # clean

Workflow-only change (no Go code touched). Functional validation is the next scheduled/dispatched UAT-AWS run actually destroying its cluster and releasing the reservation; the new post-loop guard means a failed teardown now turns the step red instead of passing silently.

Risk Assessment

  • Low — Isolated change to one workflow's teardown step, mirrors the proven GCP path, easy to revert.

Rollout notes: Pre-existing leaked aicr-uat-* clusters from prior runs are not cleaned up by this change and must be torn down manually (apply with .deployment.destroy=true per leaked deployment.id).

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes; yamllint clean
  • Linter passes (make lint) — yamllint clean
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (CI workflow)
  • I updated docs if user-facing behavior changed — N/A
  • Changes follow existing patterns in the codebase (mirrors uat-gcp.yaml)
  • Commits are cryptographically signed (git commit -S)

@njhensley njhensley requested a review from a team as a code owner June 8, 2026 15:58
@njhensley njhensley added the bug label Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 41467623-248f-4eff-aa8d-3b59eb49f6ba

📥 Commits

Reviewing files that changed from the base of the PR and between 031e4bc and 6a0a7f7.

📒 Files selected for processing (1)
  • .github/workflows/uat-aws.yaml

📝 Walkthrough

Walkthrough

The PR modifies the "Destroy Cluster" step in the AWS UAT workflow to change how the EKS actuator teardown is invoked. The step now sets deployment.destroy=true in the cluster configuration and runs the actuator using the apply command instead of a destroy subcommand. A destroyed sentinel flag is initialized before the retry loop and set to true only upon successful actuator execution. After the retry loop completes, the step explicitly checks the sentinel flag and exits with status 1 if destruction never succeeded, ensuring the workflow fails rather than silently succeeding when teardown fails.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: fixing AWS UAT teardown by switching from the removed 'destroy' subcommand to the 'apply' command with destroy flag.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the bug, motivation, implementation details, testing, and risk assessment for the AWS UAT workflow teardown fix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@mchmarny mchmarny enabled auto-merge (squash) June 8, 2026 16:04
@mchmarny mchmarny merged commit 5f51fe8 into NVIDIA:main Jun 8, 2026
30 checks passed
@njhensley njhensley deleted the fix/aws-uat-resource-leak branch June 8, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants