fix: delete corresponding podgroup created by rbg when gang-schedulin…#57
Conversation
…g is disabled Signed-off-by: dingxy <[email protected]> Co-authored-by: dingxy <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Thank you for identifying this issue and providing a possible solution. The specific background is that when we use PodGroup for gang scheduling, we inject relevant gang scheduling information into the pod’s labels or annotations (https://github.com/sgl-project/rbg/blob/main/pkg/scheduler/podgroup_manager.go#L52). If changes to this field are allowed, it could lead to modifications in the workload spec, which in turn would trigger the underlying pods to be recreated/updated. |
|
Since RBG was originally intended to support automatically cleaning up the PodGroup object when its definition is deleted, I think we can go ahead and merge this PR first. Later, we can consider whether to make this field immutable. |
|
/lgtm |
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a bug where podgroups created by RoleBasedGroup resources were not being properly deleted when gang-scheduling was disabled or removed. The fix consolidates CRD existence checking and resource watching logic into the Reconcile method and adds better ownership validation when deleting podgroups.
Key changes:
- Moved CRD existence checking and controller setup logic from separate function into the main
Reconcilemethod - Added ownership validation to ensure only podgroups owned by the RoleBasedGroup are deleted
- Updated test cases to use consistent variable names and added proper test setup
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/scheduler/podgroup_manager.go | Refactored reconcile logic to handle CRD checking inline and added ownership validation for deletions |
| pkg/scheduler/podgroup_manager_test.go | Enhanced test cases with proper setup, consistent naming, and additional test scenarios |
| internal/controller/workloads/rolebasedgroup_controller.go | Simplified controller logic by removing separate CRD checking step |
| test/wrappers/rbg_wrapper.go | Removed unused method and added TypeMeta to test wrapper |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| UID: rbg.UID, | ||
| Controller: ptr.To[bool](true), | ||
| }, | ||
| *metav1.NewControllerRef(rbg, rbg.GroupVersionKind()), |
There was a problem hiding this comment.
Using metav1.NewControllerRef is cleaner than manually constructing the OwnerReference, but the original manual construction included BlockOwnerDeletion: ptr.To[bool](true) which NewControllerRef doesn't set by default. This could affect garbage collection behavior.
| assert.Equal(t, tt.rbg.Name, pg.OwnerReferences[0].Name) | ||
| assert.Equal(t, "RoleBasedGroup", pg.OwnerReferences[0].Kind) | ||
| switch pg := obj.(type) { | ||
| case *volcanoschedulingv1beta1.PodGroup: |
There was a problem hiding this comment.
The type switch handles the volcanoschedulingv1beta1.PodGroup case but doesn't perform any assertions on it, while the schedv1alpha1.PodGroup case has multiple assertions. This creates inconsistent test coverage between the two scheduler types.
| case *volcanoschedulingv1beta1.PodGroup: | |
| case *volcanoschedulingv1beta1.PodGroup: | |
| assert.Equal(t, tt.rbg.Name, pg.Name) | |
| assert.Equal(t, tt.rbg.Namespace, pg.Namespace) | |
| assert.Equal(t, int32(tt.rbg.GetGroupSize()), pg.Spec.MinMember) | |
| // Check owner reference | |
| assert.Len(t, pg.OwnerReferences, 1) | |
| assert.Equal(t, tt.rbg.Name, pg.OwnerReferences[0].Name) | |
| assert.Equal(t, "RoleBasedGroup", pg.OwnerReferences[0].Kind) |
|
@ShirleyDing Thanks for your fix! Look forward to your next PR! |
Ⅰ. Motivation
bugfix
Ⅱ. Modifications
podgroup_manager and rolebasedgroup_controller reconcile logic
Ⅲ. Does this pull request fix one issue?
fixes #56
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.