
Conversation

@archlitchi
Contributor

We implement a common interface for shareable devices (GPU, NPU, FPGA, ...) called sharedDevicePool, and use it to reimplement the current gpu-share mechanism. The goal is to make device sharing easy to implement and better organized. If you wish to give vc-scheduler the ability to share another device, all you need to do is implement the sharedDevicePool methods and place your logic under pkg/scheduler/api/devices. No other hacking is needed.
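As a sketch of what implementing the interface for a new device might look like (the FPGADevices type and its internals are hypothetical; the real methods take *v1.Pod and kubernetes.Interface, simplified here to keep the example self-contained):

```go
package main

import "fmt"

// Pod stands in for *v1.Pod; the kubeClient parameter is elided.
type Pod struct{ Name string }

// SharedDevicePool mirrors the interface added in this PR (simplified types).
type SharedDevicePool interface {
	AddResource(pod *Pod)
	SubResource(pod *Pod)
	RequestInPod(pod *Pod) bool
	FitInPod(pod *Pod) (bool, error)
	AllocateToPod(pod *Pod) error
	ReleaseFromPod(pod *Pod) error
	MonitorStatus() string
}

// FPGADevices is a hypothetical implementation for FPGA sharing,
// tracking only a free-device counter.
type FPGADevices struct{ free int }

// AddResource accounts an already-scheduled pod's usage on the node.
func (f *FPGADevices) AddResource(pod *Pod) { f.free-- }

// SubResource releases a pod's usage from the node.
func (f *FPGADevices) SubResource(pod *Pod) { f.free++ }

// RequestInPod reports whether the pod asks for this device (always true here).
func (f *FPGADevices) RequestInPod(pod *Pod) bool { return true }

// FitInPod checks whether the node still has a free device.
func (f *FPGADevices) FitInPod(pod *Pod) (bool, error) {
	if f.free > 0 {
		return true, nil
	}
	return false, fmt.Errorf("no free FPGA")
}

// AllocateToPod binds one device to the pod.
func (f *FPGADevices) AllocateToPod(pod *Pod) error { f.free--; return nil }

// ReleaseFromPod returns the pod's device.
func (f *FPGADevices) ReleaseFromPod(pod *Pod) error { f.free++; return nil }

// MonitorStatus reports device state for debugging.
func (f *FPGADevices) MonitorStatus() string {
	return fmt.Sprintf("free FPGAs: %d", f.free)
}

func main() {
	var pool SharedDevicePool = &FPGADevices{free: 2}
	ok, _ := pool.FitInPod(&Pod{Name: "demo"})
	fmt.Println(ok, pool.MonitorStatus())
}
```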

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 9, 2023
Member

@hwdef hwdef left a comment


Please make CI happy first.
And squash your commits.

}

func (gs *GPUDevices) AllocateToPod(kubeClient kubernetes.Interface, pod *v1.Pod) error {
klog.V(4).Infoln("DeviceSharing:Into AllocateToPod", pod)
Contributor


I'm not sure the whole pod object should be printed in the log message here.

Contributor Author


Thanks, you are right. I'm now using pod.Name instead.

@zishen
Contributor

zishen commented Jan 14, 2023

This rework is really good. Could you make it a plugin, so that adding a device is a plugin?

Others map[string]interface{}
GPUDevices map[int]*GPUDevice
Others map[string]interface{}
SharedDevices map[string]SharedDevicePool
Member


Others is used to store custom resources. We can reuse it.

nodeInfo.Capability = NewResource(node.Status.Capacity).Add(nodeInfo.OversubscriptionResource)
}
nodeInfo.setNodeGPUInfo(node)
nodeInfo.SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(nodeInfo.Name, node)
Member


Suggest using a setNodeOthersResource method to encapsulate SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(nodeInfo.Name, node).
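A minimal sketch of this encapsulation suggestion (the method name follows the review comment; types are trimmed stand-ins for the real NodeInfo and node objects):

```go
package main

import "fmt"

// Node and GPUDevices are simplified stand-ins for the real types.
type Node struct{ Name string }

type GPUDevices struct{ nodeName string }

func NewGPUDevices(name string, node *Node) *GPUDevices {
	return &GPUDevices{nodeName: name}
}

const GPUSharingDevice = "GpuShare"

type NodeInfo struct {
	Name          string
	SharedDevices map[string]*GPUDevices
}

// setNodeOthersResource gathers shared-device initialization in one place,
// instead of inlining each assignment at the call site.
func (ni *NodeInfo) setNodeOthersResource(node *Node) {
	ni.SharedDevices[GPUSharingDevice] = NewGPUDevices(ni.Name, node)
}

func main() {
	ni := &NodeInfo{Name: "node-1", SharedDevices: map[string]*GPUDevices{}}
	ni.setNodeOthersResource(&Node{Name: "node-1"})
	fmt.Println(len(ni.SharedDevices))
}
```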

ni.setOversubscription(node)
ni.setNodeGPUInfo(node)
ni.SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(ni.Name, node)
ni.setRevocableZone(node)
Member


same as above

ni.Releasing.Add(ti.Resreq)
ni.Used.Add(ti.Resreq)
ni.AddGPUResource(ti.Pod)
ni.SharedDevices[GPUSharingDevice].AddResource(ti.Pod)
Member


Suggest retaining AddResource and moving SharedDevices[GPUSharingDevice].AddResource(ti.Pod) inside it.
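The suggestion above could look like this sketch: NodeInfo.AddResource stays the single entry point and forwards to every registered device pool (types are trimmed; only the names visible in the snippets are taken from the PR):

```go
package main

import "fmt"

type Pod struct{ Name string }

// SharedDevicePool is reduced to the one method relevant here.
type SharedDevicePool interface{ AddResource(pod *Pod) }

// gpuShare is a toy pool that counts accounted pods.
type gpuShare struct{ used int }

func (g *gpuShare) AddResource(pod *Pod) { g.used++ }

// NodeInfo is trimmed to the fields relevant here.
type NodeInfo struct {
	SharedDevices map[string]SharedDevicePool
}

// AddResource forwards device accounting to every registered pool,
// so callers never touch SharedDevices directly.
func (ni *NodeInfo) AddResource(pod *Pod) {
	for _, dev := range ni.SharedDevices {
		dev.AddResource(pod)
	}
}

func main() {
	g := &gpuShare{}
	ni := &NodeInfo{SharedDevices: map[string]SharedDevicePool{"GpuShare": g}}
	ni.AddResource(&Pod{Name: "p"})
	fmt.Println(g.used)
}
```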

@@ -0,0 +1,25 @@
package api
Member


Please add the copyright header and rename this file to device_interface.go.

GPUSharingDevice = "GpuShare"
)

type SharedDevicePool interface {
Member


The interface covers the GPU number case as well, and NPU in the future. It's better to rename SharedDevicePool to Devices.


type SharedDevicePool interface {
//following two functions used in node_info
AddResource(pod *v1.Pod)
Member


Please add a description for each interface method.

SubResource(pod *v1.Pod)

//following four functions used in predicate
RequestInPod(pod *v1.Pod) bool
Member


Suggest renaming RequestInPod to HasDeviceRequest, or some other name that is easier to understand.


//following four functions used in predicate
RequestInPod(pod *v1.Pod) bool
FitInPod(pod *v1.Pod) (bool, error)
Member


Suggest renaming FitInPod to Filter or FilterNode.

//following four functions used in predicate
RequestInPod(pod *v1.Pod) bool
FitInPod(pod *v1.Pod) (bool, error)
AllocateToPod(kubeClient kubernetes.Interface, pod *v1.Pod) error
Member


change AllocateToPod and ReleaseFromPod to Allocate and Release

ReleaseFromPod(kubeClient kubernetes.Interface, pod *v1.Pod) error

//used for debug and monitor
MonitorStatus() string
Member


change MonitorStatus to GetStatus or CheckStatus
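Taken together, the renaming and documentation suggestions in this review amount to an interface like the following sketch (method names are the reviewers' proposals, not the merged code; types are simplified, and noopDevices is an illustrative stub):

```go
package main

import "fmt"

// Pod stands in for *v1.Pod.
type Pod struct{ Name string }

// Devices is SharedDevicePool under the name suggested in review,
// with a doc comment per method as requested.
type Devices interface {
	// AddResource accounts the pod's device usage on the node.
	AddResource(pod *Pod)
	// SubResource releases the pod's device usage from the node.
	SubResource(pod *Pod)
	// HasDeviceRequest reports whether the pod requests this device.
	HasDeviceRequest(pod *Pod) bool
	// FilterNode checks whether the node can satisfy the pod's request.
	FilterNode(pod *Pod) (bool, error)
	// Allocate binds device resources to the pod.
	Allocate(pod *Pod) error
	// Release returns the pod's device resources.
	Release(pod *Pod) error
	// GetStatus reports device state for debugging and monitoring.
	GetStatus() string
}

// noopDevices is a do-nothing stub to show the interface compiles.
type noopDevices struct{}

func (noopDevices) AddResource(pod *Pod)              {}
func (noopDevices) SubResource(pod *Pod)              {}
func (noopDevices) HasDeviceRequest(pod *Pod) bool    { return false }
func (noopDevices) FilterNode(pod *Pod) (bool, error) { return true, nil }
func (noopDevices) Allocate(pod *Pod) error           { return nil }
func (noopDevices) Release(pod *Pod) error            { return nil }
func (noopDevices) GetStatus() string                 { return "ok" }

func main() {
	var d Devices = noopDevices{}
	fmt.Println(d.GetStatus())
}
```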

@archlitchi
Contributor Author

This rework so good, and can you make it a plugin to add a device?

I'd like to, but even with this PR merged you still need to hack into scheduler/api (i.e., you must put your device-initialization logic in node_info.go for the scheduler to recognize your device), so it's hard to do that from a plugin.

@william-wang william-wang added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jan 30, 2023
@william-wang
Member

/lgtm
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2023
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
