
Conversation

@waiterQ (Contributor) commented Feb 7, 2023

Modification Motivation

Volcano can schedule normal pods, but under some conditions the podGroup ends up in the wrong phase: when scheduling a k8s Job, the podGroup stays in the Running phase even after its pods have completed, and changing a Deployment's resource requests leaves the old podGroup stuck in the Inqueue phase.
So I picked up and finished #2589 (Feature/add replicaset gc pg), and added a Completed phase for the podGroups of normal pods to improve podGroup handling.
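
For context, here is a minimal, hedged sketch of the phase transition this change is about (the real check appears in the diff hunk quoted near the end of this thread); the type and constant names below are simplified stand-ins rather than Volcano's actual API:

```go
// Simplified stand-ins; not Volcano's real scheduling types.
package sketch

type Phase string

const (
	PhaseRunning   Phase = "Running"
	PhaseCompleted Phase = "Completed"
)

// nextPhase sketches the behaviour described above for a podGroup whose
// member pods have already been allocated: it stays Running while tasks are
// still in flight and becomes Completed once every allocated task has
// succeeded, so the podGroup of a finished k8s Job no longer lingers in
// Running.
func nextPhase(allocated, succeeded int) Phase {
	if succeeded == allocated {
		return PhaseCompleted
	}
	return PhaseRunning
}
```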

Test Result

While the deployment rolling update is in progress, two podGroups exist.
[screenshot]

After the rolling update finishes, only one podGroup is left.
[screenshot]

When the k8s Job has completed:
[screenshot]

the podGroup is Completed:
[screenshot]

CI e2e result:
[screenshot]

@volcano-sh-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) on Feb 7, 2023
@waiterQ force-pushed the add-pg-completed branch 2 times, most recently from 9d31199 to 3a6c9e4, on February 7, 2023 09:19
@waiterQ changed the title from "Add pg completed" to "Add podGroup completed phase" on Feb 7, 2023
Gaizhi and others added 2 commits February 9, 2023 11:18
pg.queue.Add(req)
}

func (pg *pgcontroller) addReplicaSet(obj interface{}) {
Contributor:

@waiterQ As far as I know, when a ReplicaSet is created it always starts with 0 replicas and is scaled up to the defined replica count afterwards. So why delete the podGroup in both addReplicaSet and updateReplicaSet, and not only in updateReplicaSet?

Contributor (author):

Yes, you're right. In the normal flow within a single version, Volcano only needs updateReplicaSet. But if you consider upgrading from one version to another, there would be nothing to clean up the podGroups already left in the cluster: addReplicaSet handles those already-existing podGroups, while updateReplicaSet handles the upcoming ones.
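
To make that division of labour concrete, here is a minimal sketch of the cleanup path being discussed; it is not the PR's actual code, and deletePodGroupFor is a hypothetical placeholder for the controller's real deletion logic:

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// deletePodGroupFor stands in for whatever the controller uses to remove the
// podGroup that was created for this ReplicaSet's pods (hypothetical helper).
func deletePodGroupFor(rs *appsv1.ReplicaSet) { /* ... */ }

// addReplicaSetSketch mirrors the shape of the handler under review: when a
// ReplicaSet is first observed with zero replicas (an old revision left by a
// rolling update, or a leftover from before a Volcano upgrade), it has no pods
// to schedule, so its stale podGroup can be garbage-collected right away.
func addReplicaSetSketch(obj interface{}) {
	rs, ok := obj.(*appsv1.ReplicaSet)
	if !ok {
		return
	}
	if rs.Spec.Replicas != nil && *rs.Spec.Replicas == 0 {
		deletePodGroupFor(rs)
	}
}
```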

return
}

if *rs.Spec.Replicas == 0 {
There is a chance that two ReplicaSets have non-zero replicas at the same time during a rolling upgrade, which means two podGroups exist. Does this matter?

Contributor (author):

I think this comes down to the Deployment's rollingUpdate strategy: while pods are being rolled, two kinds of pods definitely exist at the same time. I think that's normal, not a problem.

@william-wang (Member)

Please add the test results to the PR, thanks.

@waiterQ (Contributor, author) commented Feb 13, 2023

> Please add the test results to the PR, thanks.

OK, done.

@william-wang (Member) left a comment

/lgtm

@@ -0,0 +1,189 @@
/*
Copyright 2021 The Volcano Authors.
Member:

The copyright year is not correct.

Expect(len(pgs.Items)).To(Equal(1), "only one podGroup should be exists")
})

It("k8s Job", func() {
Member:

Please use a formal and complete description for this func.

fmt.Sprintf("expectPod %d, q1ScheduledPod %d, q2ScheduledPod %d", expectPod, q1ScheduledPod, q2ScheduledPod))
})

It("changeable Deployment's PodGroup", func() {
Member:

Please give a formal and complete description for this func.

@@ -0,0 +1,189 @@
/*
Member:

This is a util package, so why is the file named deployment.go?

@volcano-sh-bot added the lgtm label (Indicates that a PR is ready to be merged) on Feb 13, 2023
@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Feb 13, 2023
@volcano-sh-bot merged commit 0254501 into volcano-sh:master on Feb 13, 2023
@waiterQ deleted the add-pg-completed branch on March 24, 2023 07:40
if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
status.Phase = scheduling.PodGroupRunning
// If all allocated tasks is succeeded, it's completed
if len(jobInfo.TaskStatusIndex[api.Succeeded]) == allocated {
@zhoushuke commented Dec 5, 2023

For a native batch/v1 Job that uses .spec.completions and .spec.parallelism: suppose 10 pods have succeeded while the queue is full, so the other 10 pods stay Pending. In that case len(jobInfo.TaskStatusIndex[api.Succeeded]) == allocated is true, so the podGroup status becomes Completed even though the Job has not finished. Could this happen?
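
As a hedged illustration of that concern, with made-up numbers (completions=20, parallelism=10) and following the commenter's premise that only the already-succeeded pods count as allocated while the queue is full:

```go
package sketch

import "fmt"

func illustrate() {
	completions := 20 // .spec.completions of the batch/v1 Job
	succeeded := 10   // pods that already finished successfully
	allocated := 10   // tasks the scheduler has allocated (queue is full, so no more)
	pending := 10     // remaining pods still waiting for quota

	pgCompleted := succeeded == allocated // mirrors the check quoted above
	jobFinished := succeeded == completions

	fmt.Printf("pgCompleted=%v jobFinished=%v pending=%d\n", pgCompleted, jobFinished, pending)
	// Prints: pgCompleted=true jobFinished=false pending=10
}
```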
