initial KEP for QoS of storage v0.1 #1907
Conversation
Welcome @pacoxu!
Hi @pacoxu. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@pacoxu: Reiterating the mentions to trigger a notification. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: pacoxu
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: pacoxu <[email protected]>
/sig storage
Can someone give comments? Or should we discuss this in SIG Storage?
/reassign
/retest
> ### Risks and Mitigations

I think iops limiting would be a great idea, but I'm not sure the current cgroups implementation will implement it effectively. For example, with non-direct device access, writes are buffered in the kernel, and something like 80% of them will not be accounted to a cgroup (instead they're all aggregated together). I did a bit of experimentation here: https://gitlab.com/mattcary/blkio_cgroups/-/blob/master/data/blkio_cgroup.md (sorry that the writeup is not very polished).

That's actually the problem with limiting IOPS: with the current Linux kernel it's not possible to limit it reliably, which is the reason we don't have it in Kubernetes yet.

As docker supports it, it seems to be a general requirement for storage, especially block volumes. As I mentioned, the disk device that docker runs on is so important that we need to protect it from being overused by some pods:
```
--device-read-bps value     Limit read rate (bytes per second) from a device (default [])
--device-read-iops value    Limit read rate (IO per second) from a device (default [])
--device-write-bps value    Limit write rate (bytes per second) to a device (default [])
--device-write-iops value   Limit write rate (IO per second) to a device (default [])
```
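For concreteness, here is how those flags might be combined in a `docker run` invocation, per the docker run reference; the device path `/dev/sda` and the limit values are illustrative:

```
# Throttle one container's access to /dev/sda (example device and limits).
docker run -it \
  --device-read-iops /dev/sda:2000 \
  --device-write-iops /dev/sda:1000 \
  --device-read-bps /dev/sda:1mb \
  --device-write-bps /dev/sda:1mb \
  ubuntu /bin/bash
```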
Since dockershim is deprecated, I would look into other CRIs like containerd. Is this supported there?
It looks like there is some level of cgroups v2 support in containerd? containerd/containerd#3726
containerd is well supported; it's available on GKE and is becoming the default.
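For what it's worth, the OCI runtime spec that containerd consumes already models these throttles under `linux.resources.blockIO`; a sketch of the relevant `config.json` fragment follows (the `8:0` device numbers and rate values are illustrative):

```
{
  "linux": {
    "resources": {
      "blockIO": {
        "throttleReadIOPSDevice":  [{ "major": 8, "minor": 0, "rate": 2000 }],
        "throttleWriteIOPSDevice": [{ "major": 8, "minor": 0, "rate": 1000 }],
        "throttleReadBpsDevice":   [{ "major": 8, "minor": 0, "rate": 1000000 }],
        "throttleWriteBpsDevice":  [{ "major": 8, "minor": 0, "rate": 1000000 }]
      }
    }
  }
}
```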
```
annotations:
  qos.volume.storage.daocloud.io: >-
    {"pvc": "snap-03", "iops": {"read": 2000, "write": 1000}, "bps":
    {"read": 1000000, "write": 1000000}}
```
We don't use alpha annotations in Kubernetes; we directly declare fields in structs (as a beta proposal).

Will only include the final design later.
```
iops:
  read: 2000
  write: 1000
bps:
  read: 1000000
  write: 1000000
```
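For illustration, a hypothetical PVC manifest showing where fields like these might sit; the `qos` stanza and its placement are purely illustrative, not an accepted API shape:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: snap-03
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  resources:
    requests:
      storage: 100Gi
  qos:            # hypothetical field, sketching the proposal above
    iops:
      read: 2000
      write: 1000
    bps:
      read: 1000000
      write: 1000000
```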
- This assumes that the cluster admin can trust users to create pods with the right IOPS requirements. A rogue user can claim all or most of the IOPS for itself. In Kubernetes we don't trust users (much), and we have quotas and similar concepts to limit them. I don't know how to apply IOPS there; perhaps via a StorageClass with the IOPS defined there? Then the cluster admin can define a quota of volumes for each user/namespace.
- How does it work when a user sets different IOPS for volumes backed by the same device (e.g. each Secret / EmptyDir / HostPath volume having different IOPS)?
- Not all volume types support IOPS, e.g. NFS / GlusterFS / CephFS / tmpfs. Will the IOPS setting be silently ignored for them?
- To limit users, we may add it to the LimitRange of a namespace (see the sketch after this list). IOPS can be applied to a StorageClass as well, but this may be a limitation from the storage side; it may be a physical limitation.
- The IOPS limitation may only apply to volumes that are block devices. cgroups can be applied only to a device ID. If ConfigMaps and Secrets are mounted on the rootfs of the host, the rootfs iops limitation will take effect.
- That's the limitation; just ignore those volume types in kubelet. kubelet can get the mount type.
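For illustration, a LimitRange-based constraint might look roughly like this; today LimitRange only constrains PVC storage capacity, so the commented-out iops keys are purely hypothetical:

```
apiVersion: v1
kind: LimitRange
metadata:
  name: storage-qos
  namespace: team-a
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 1Ti        # supported today
      # hypothetical extension, not an existing resource name:
      # iops-read: "2000"
      # iops-write: "1000"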
Would it make sense, instead of adding absolute IOPS limits, to add more of a QoS class to the StorageClass? E.g.:

- unlimited: no restriction on IOPS, which is what we currently do.
- fair: all pods using the PVC/volume split IOPS fairly. This does require a pretty complicated controller to track PV usage and update all the nodes with how usage should split, especially if we want to base throttling on actual usage (e.g., pods that aren't hammering the device don't get the same share as a more active pod), but this may be what users actually want.
- priority: pods with some sort of annotation get unlimited IOPS; all other pods get a fixed share.

OK, I'll admit this is all very vague and not well thought out, but a direction like this has a few advantages:

- The current state of the art means we're not going to be able to hit all IOPS limits anyway, AFAIU.
- This gives leeway for implementation differences and means we wouldn't have to change the API when cgroups v2 comes out, or do all sorts of complicated docker vs. containerd differences.
- Managing IOPS at a pod level is not going to scale for large deployments, but it's (probably) only in large deployments where this sort of thing is really needed. So a higher-level API is maybe ultimately much more useful.

#27000 was already discussing along these lines.
Maybe another way to see this is that QoS is really a volume characteristic and not a pod characteristic. That is, we are concerned about guaranteeing a certain performance for all users of a volume more than we are concerned about setting what performance a pod sees. So setting limits at some level higher than the pod is probably the right way to go about this.
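For illustration, such a class could ride on StorageClass parameters; a sketch, where the `qosClass` parameter name and its values are hypothetical:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-fair
provisioner: kubernetes.io/no-provisioner
parameters:
  qosClass: fair   # hypothetical values: unlimited | fair | priority
```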
> - The limit is a runtime limitation for the block device when it is mounted to the pod. The limit is not a volume limitation from the IaaS's perspective. If the device is used by multiple pods, each pod should limit the iops by itself.
> - The volume capability of iops is a physical limitation on the device (PV), and this is not the same as the QoS of the PVC in this proposal.
> - As the cgroup implementation has its limitations, we will not address kernel buffered-write issues with cgroup limits here. This should be fixed or optimized on the kernel side. Details will be mentioned in the `risks and limitations` below.
Given this limitation, how useful is this feature for node stability? In my experience, most writes are buffered, so this would only limit some pods.
> The proposal would benefit the two scenarios below.
> - Local storage (local disk devices), better speed:
>   For instance, I want to use local storage (a local device) to gain local storage speed.
How does this feature improve speed? It seems to only address limiting speed.
It is an advantage of local volumes. I may have put it in the wrong place, thanks.
> - The limit is a runtime limitation for the block device when it is mounted to the pod.
> - Only block devices will be limited, with a specified volume device ID.
> - This should be implemented with cgroups, so it will only work within cgroup capabilities.
Windows has been GA for a while now. Given that you are proposing a user-facing API, we should at minimum evaluate the feasibility of implementing this feature on Windows.
> Then kubelet can get the mount point of the PVC and the device ID. Then we use the cgroup to limit the iops and bps of the pod.
> We can just edit the cgroup limit files under the pod: `/sys/fs/cgroup/blkio/kubepods/pod/<Container_ID>/...`
nit: This is the container cgroup, but you say below we aren't managing containers. Can you remove <Container_ID> from the path?
```
echo "<block_device_maj:min> <value>" > /sys/fs/cgroup/blkio/kubepods/pod<UID>/blkio.throttle.read_iops_device
```
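For illustration, the full set of writes for one device might look like this, assuming cgroup v1 and the pod-level cgroup path above; `/dev/sda`, the `<UID>` placeholder, and the limit values are illustrative:

```
# Resolve the block device's major:minor, then write all four throttle files.
DEV=$(lsblk -dno MAJ:MIN /dev/sda)   # e.g. "8:0"
CG=/sys/fs/cgroup/blkio/kubepods/pod<UID>
echo "$DEV 2000"    > $CG/blkio.throttle.read_iops_device
echo "$DEV 1000"    > $CG/blkio.throttle.write_iops_device
echo "$DEV 1000000" > $CG/blkio.throttle.read_bps_device
echo "$DEV 1000000" > $CG/blkio.throttle.write_bps_device
```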
> Managing the QoS of individual containers can be supported in the future if there is more than one container in the pod; we may then add different iops limits for each container.
We have discussed having both container-level and pod-level control for other resources in the past, but it can be quite complex: #1592. I would recommend assuming we will only get either pod or container-level limiting.
> Better resource management for the iops or bps of block devices in PVCs.
> In some scenarios, we need to limit a PV's iops or a Pod's iops on a device or volume at runtime.
> For the container runtime, dockerd provides blkio-related params in `docker run` to limit a device's iops and bps. [docker reference #block-io-bandwidth-blkio-constraint](https://docs.docker.com/engine/reference/run/#block-io-bandwidth-blkio-constraint)
Does containerd have something similar?
I see blkio-based stats but am not familiar in general with it. https://sourcegraph.com/github.com/containerd/containerd@a615a6fe5dd3a6a6689ccb119b0a9632100c7a50/-/blob/metrics/cgroups/blkio.go
> - The limit is a runtime limitation for the block device when it is mounted to the pod.
> - Only block devices will be limited, with a specified volume device ID.
> - This should be implemented with cgroups, so it will only work within cgroup capabilities.
> - An extended feature that can be provided later is limiting the iops of the rootfs for the container runtime. This will reduce the interference between pods running on the same host.
This is already a key problem; I think we need to design something for that here.
E.g., with GCP PD, IOPS are pooled across all devices attached to a node, so limiting IOPS on a single device may not work as expected (e.g., accessing the rootfs may end up throttling an attached PV).
- cgroups can limit iops for local block volumes and the rootfs.

I may change the goal to this. QoS for all types of volumes would be a much bigger topic.
> - The limit is a runtime limitation for the block device when it is mounted to the pod. The limit is not a volume limitation from the IaaS's perspective. If the device is used by multiple pods, each pod should limit the iops by itself.
> - The volume capability of iops is a physical limitation on the device (PV), and this is not the same as the QoS of the PVC in this proposal.
> - As the cgroup implementation has its limitations, we will not address kernel buffered-write issues with cgroup limits here. This should be fixed or optimized on the kernel side. Details will be mentioned in the `risks and limitations` below.
Will cgroups v2 change any of this?
If so we should mention that and link in (or start...) related work in docker/containerd.
I think if cgroups is fundamentally limited we want to be careful about adding in a feature that won't work very well but will get baked into deployments and require long-term support.
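For reference, cgroup v2 collapses these knobs into a single `io.max` file per cgroup; a sketch, assuming device `8:0` and the systemd cgroup driver's pod slice layout:

```
# One line per device carries all four limits in cgroup v2.
echo "8:0 rbps=1000000 wbps=1000000 riops=2000 wiops=1000" \
  > /sys/fs/cgroup/kubepods.slice/kubepods-pod<UID>.slice/io.max
```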
```
annotations:
  qos.volume.storage.daocloud.io: >-
    {"pvc": "snap-03", "iops": {"read": 2000, "write": 1000}, "bps":
    {"read": 1000000, "write": 1000000}}
```
If this is to be volume-based, why not use the pod volume name rather than the PVC? That would incidentally allow for rootfs throttling.
This might not be effective for all kinds of volumes depending on how they're mounted, I suppose, but since the throttling has to be best-effort anyway that might not be a problem in practice.
Signed-off-by: pacoxu <[email protected]>
@pacoxu: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@pacoxu any updates about this KEP?
I prefer to wait for cgroup v2 and discuss it later. For cgroup v2, refer to #2254 and the 1.22 KEP prioritization.
+1, it's good to see that cgroups v2 has gotten some momentum. It would be a bit of a mess to do storage QoS consistently in both cgroups v1 and v2.
This just popped up on my radar. One thing I'd mention is that a lot of storage solutions have their own implementations of QoS; it might make sense to consider the implementation and API separately (perhaps there's no need to wait on cgroups v2, if the API can be finalized and CSI drivers can begin implementing it how they see fit). Or is the only supported means of QoS intended to be at the node, in cgroups? In this proposal, is QoS being considered for anything that is not a PVC (emptyDir, etc.)? I've heard of use cases where it's the node's FS that needs to be guarded.
@mlsorensen correct. This proposal only focuses on cgroup limitation for QoS.
To set the QoS of a PVC, it has to be a block device. An emptyDir may be limited via the rootfs of the node?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its staleness rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
See kubernetes/kubernetes#92287 (comment) for a related support update in cri-o and containerd.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its staleness rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its staleness rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@pacoxu this PR is using an old version of the KEP template; please update it, and also open a tracking issue, which is required for all KEPs :)
/close
I opened #3520 for future tracking of this feature design.
@pacoxu: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
See the new proposal in #3004.
Based on kubernetes/kubernetes#92287, proposal 2.
The initial discussion and user requirements can be found in kubernetes/kubernetes#27000.
Fixes #3520
@kubernetes/sig-storage-feature-requests
/sig storage