Skip to content

fix: delete corresponding podgroup created by rbg when gang-schedulin…#57

Merged
cheyang merged 1 commit intosgl-project:mainfrom
ShirleyDing:fix/deletePodgroup
Oct 21, 2025
Merged

fix: delete corresponding podgroup created by rbg when gang-schedulin…#57
cheyang merged 1 commit intosgl-project:mainfrom
ShirleyDing:fix/deletePodgroup

Conversation

@ShirleyDing
Copy link
Copy Markdown
Contributor

@ShirleyDing ShirleyDing commented Oct 15, 2025

Ⅰ. Motivation

bugfix

Ⅱ. Modifications

podgroup_manager and rolebasedgroup_controller reconcile logic

Ⅲ. Does this pull request fix one issue?

fixes #56

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

  1. create a RoleBasedGroup with gang-scheduling enabled
  2. update the RoleBasedGroup and remove spec.podGroupPolicy
  3. check if the podGroup created by the RoleBasedGroup is deleted

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@bcfre
Copy link
Copy Markdown
Collaborator

bcfre commented Oct 15, 2025

Thank you for identifying this issue and providing a possible solution.
I’d like to add another angle to consider: Should we support this field being mutable? @Syspretor @cheyang @ShirleyDing

The specific background is that when we use PodGroup for gang scheduling, we inject relevant gang scheduling information into the pod’s labels or annotations (https://github.com/sgl-project/rbg/blob/main/pkg/scheduler/podgroup_manager.go#L52).

If changes to this field are allowed, it could lead to modifications in the workload spec, which in turn would trigger the underlying pods to be recreated/updated.
We should consider whether pod recreation caused solely by enabling or disabling gang scheduling behavior is actually expected.

@bcfre
Copy link
Copy Markdown
Collaborator

bcfre commented Oct 20, 2025

Since RBG was originally intended to support automatically cleaning up the PodGroup object when its definition is deleted, I think we can go ahead and merge this PR first. Later, we can consider whether to make this field immutable.
https://github.com/sgl-project/rbg/blob/main/pkg/scheduler/podgroup_manager.go#L48

@bcfre
Copy link
Copy Markdown
Collaborator

bcfre commented Oct 20, 2025

/lgtm

@BSWANG BSWANG requested a review from Copilot October 20, 2025 12:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where podgroups created by RoleBasedGroup resources were not being properly deleted when gang-scheduling was disabled or removed. The fix consolidates CRD existence checking and resource watching logic into the Reconcile method and adds better ownership validation when deleting podgroups.

Key changes:

  • Moved CRD existence checking and controller setup logic from separate function into the main Reconcile method
  • Added ownership validation to ensure only podgroups owned by the RoleBasedGroup are deleted
  • Updated test cases to use consistent variable names and added proper test setup

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
pkg/scheduler/podgroup_manager.go Refactored reconcile logic to handle CRD checking inline and added ownership validation for deletions
pkg/scheduler/podgroup_manager_test.go Enhanced test cases with proper setup, consistent naming, and additional test scenarios
internal/controller/workloads/rolebasedgroup_controller.go Simplified controller logic by removing separate CRD checking step
test/wrappers/rbg_wrapper.go Removed unused method and added TypeMeta to test wrapper

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

UID: rbg.UID,
Controller: ptr.To[bool](true),
},
*metav1.NewControllerRef(rbg, rbg.GroupVersionKind()),
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using metav1.NewControllerRef is cleaner than manually constructing the OwnerReference, but the original manual construction included BlockOwnerDeletion: ptr.To[bool](true) which NewControllerRef doesn't set by default. This could affect garbage collection behavior.

Copilot uses AI. Check for mistakes.
assert.Equal(t, tt.rbg.Name, pg.OwnerReferences[0].Name)
assert.Equal(t, "RoleBasedGroup", pg.OwnerReferences[0].Kind)
switch pg := obj.(type) {
case *volcanoschedulingv1beta1.PodGroup:
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type switch handles the volcanoschedulingv1beta1.PodGroup case but doesn't perform any assertions on it, while the schedv1alpha1.PodGroup case has multiple assertions. This creates inconsistent test coverage between the two scheduler types.

Suggested change
case *volcanoschedulingv1beta1.PodGroup:
case *volcanoschedulingv1beta1.PodGroup:
assert.Equal(t, tt.rbg.Name, pg.Name)
assert.Equal(t, tt.rbg.Namespace, pg.Namespace)
assert.Equal(t, int32(tt.rbg.GetGroupSize()), pg.Spec.MinMember)
// Check owner reference
assert.Len(t, pg.OwnerReferences, 1)
assert.Equal(t, tt.rbg.Name, pg.OwnerReferences[0].Name)
assert.Equal(t, "RoleBasedGroup", pg.OwnerReferences[0].Kind)

Copilot uses AI. Check for mistakes.
@cheyang
Copy link
Copy Markdown
Collaborator

cheyang commented Oct 21, 2025

@ShirleyDing Thanks for your fix! Look forward to your next PR!

@cheyang cheyang merged commit 7d6c5bb into sgl-project:main Oct 21, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Update rbg and remove podGroupPolicy, but existing podgroup remains

5 participants