fix(ci): actually tear down AWS UAT cluster (destroy → apply) #1213
Walkthrough
The PR modifies the "Destroy Cluster" step in the AWS UAT workflow to change how the EKS actuator teardown is invoked. The step now sets …
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 4 passed
Summary
Fix the AWS UAT teardown, which was a silent no-op leaking the `p5.48xlarge` GPU node and holding the H100 capacity reservation (`cr-0cbe491320188dfa6`) across every run.
Motivation / Context
The actuator image was bumped `v0.2.6` → `v0.4.23`. Bringup was migrated to the new `apply` interface, but the `Destroy Cluster` step still called the `destroy` subcommand, which no longer exists in `v0.4.23`. The unknown command hit urfave/cli's help path and exited `0`, so the `if docker run … destroy; then echo "destroyed successfully"` branch matched and the loop broke; nothing was ever torn down. Destruction in this image is `apply` with `.deployment.destroy=true` (the actuator's own help: "apply — Deploy or destroy infrastructure via Terraform"). The GCP UAT sibling already does it this way.
Fixes: N/A
Related: N/A
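The failure mode described above can be reproduced without Docker. The following is a minimal sketch, assuming a stand-in function `fake_actuator` (hypothetical, not the real actuator CLI) that mimics the reported behavior: an unknown subcommand prints help text and exits `0`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the actuator CLI: any unknown subcommand
# prints help text and still exits 0, as described in the motivation.
fake_actuator() {
  case "${1:-}" in
    apply) echo "applying" ;;
    *)     echo "usage: actuator apply"; return 0 ;;
  esac
}

# The old check only looked at the exit status, so the removed
# subcommand was reported as a success.
if fake_actuator destroy >/dev/null; then
  result="destroyed successfully"
else
  result="destroy failed"
fi
echo "$result"
```

Because the condition only inspects the exit status, the success branch is taken even though no teardown happened.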
Type of Change
Component(s) Affected
`.github/workflows/uat-aws.yaml` (UAT teardown)
Implementation Notes
- Set `.deployment.destroy = true` and invoke `apply` instead of the removed `destroy` subcommand, mirroring `uat-gcp.yaml`.
- Add a `destroyed` flag + post-loop guard so the step fails (and surfaces a `::error::`) when all 3 attempts fail. A failing `docker run` inside an `if` condition does not trip `set -e`, so without this a genuine destroy failure (e.g. orphaned ENIs/SGs blocking VPC deletion) would still exit 0 and re-leak silently.
- Still gated on the `skip_delete` input and `steps.infra.outcome != 'skipped'`, unchanged.
Testing
`yamllint .github/workflows/uat-aws.yaml`  # clean
Workflow-only change (no Go code touched). Functional validation is the next scheduled/dispatched UAT-AWS run actually destroying its cluster and releasing the reservation; the new post-loop guard means a failed teardown now turns the step red instead of passing silently.
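The retry loop and post-loop guard from the implementation notes could look roughly like the sketch below. It is not the workflow's actual step: `run_teardown` is a hypothetical stand-in for the `docker run … apply` invocation (hard-coded to fail so the guard path runs), and the error message text is illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for `docker run … apply` with destroy enabled;
# it always fails here so the guard path is exercised.
run_teardown() {
  return 1
}

teardown_with_retries() {
  local destroyed=false
  local attempt
  for attempt in 1 2 3; do
    # A failing command in an `if` condition does not trigger `set -e`,
    # so the outcome has to be recorded explicitly.
    if run_teardown; then
      echo "destroyed successfully"
      destroyed=true
      break
    fi
    echo "attempt ${attempt} failed"
  done

  # Post-loop guard: fail the step when no attempt succeeded.
  if [ "$destroyed" != true ]; then
    echo "::error::cluster teardown failed after 3 attempts"
    return 1
  fi
}

status=0
teardown_with_retries || status=$?
echo "exit status: ${status}"
```

Without the guard, the function would fall off the end of the loop and return the status of the last `echo`, which is `0`, so the step would pass despite three failed attempts.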
Risk Assessment
Rollout notes: Pre-existing leaked `aicr-uat-*` clusters from prior runs are not cleaned up by this change and must be torn down manually (`apply` with `.deployment.destroy=true` per leaked `deployment.id`).
Checklist
- … (`make test` with `-race`): N/A, no Go changes; `yamllint` clean
- … (`make lint`): `yamllint` clean
- … (`uat-gcp.yaml`)
- … (`git commit -S`)