fix a cgroup cpu quota/period order issue.#2044
fix a cgroup cpu quota/period order issue.#2044zq-david-wang wants to merge 1 commit intoopencontainers:masterfrom
Conversation
libcontainer/cgroups/fs/cpu.go
Outdated
| } | ||
| if cgroup.Resources.CpuQuota != 0 { | ||
| // if quota is already set, don't set it again. | ||
| if (!qs) && cgroup.Resources.CpuQuota != 0 { |
865c0e8 to
cd766b0
Compare
|
Just realize that the code only address the situation for changing leaf node of a cgroup tree. When a parent node is changed, it should make it larger first.. And worse, if one want to change a node in the middle of the cgroup tree, it becoms a mission impossible... Say we have three cgroup nodes a, b, c, a is b's parent, and b is c's parent |
|
My understanding is that runc is mostly used to manage leaf node of cgroup tree, right? If so, this fix should work "most of the time"... |
|
Yes, runc only ever manages the leaf of tree. If the admin decides to misconfigure the cgroup tree to make runc crash there isn't much we can (or should) do about it. |
cd766b0 to
173d424
Compare
|
I have make changes to use a simpler strategy: set period first, if failed than retry if after setting quota. |
|
c44aec9 |
libcontainer/cgroups/fs/cpu.go
Outdated
| } | ||
| // The order of setting cfs_quota_us and cfs_period_us is significant, since cgroup child node | ||
| // should not have a higher quota/period ratio than its parent. | ||
| // Use period->quota-->period sequence to make sure no matther parent or child node is changed, |
|
One question, why not use re-order the logic on what is set first instead of this re-order bool? |
|
@crosbymichael Thanks for the comments. I can put the first version of code changes in another PR if necessary. |
|
@crosbymichael @cyphar
Need your opinion on this. |
Signed-off-by: zq-david-wang <[email protected]>
|
Please add tests |
When setting/updating child cpu cgroup cfs_quota_us and cfs_period_us, the order is significant because the child node should not have a higher quota/period ratio than its parent. This means if always setting period first would cause kernel error.
For example, say parent node has quota/period set to 10000/10000, and we want to change child to 1000/1000, if set period first, than after the period change and before the quota change, the ration is 10000/1000, and this setting make child node have larger quota ratio than its parent and rejected by kernel.
This patch would get current child value first, and make sure the first setting always make the quota/period ratio lower than current (valid) ratio.
EDIT:
A simpler strategy is to use a period->quota---->period sequence, if setting period failed, retry setting it after seting quota.
And there is an issue reported in kubernetes project
kubernetes/kubernetes#76704
EDIT:
To trigger the issue, we need
And when using systemd cgroup driver with kubernetes/runc, both conditions are met...
kubelet would create a parent node for pod, and set its quota according to resource limit, and when create child node via systemd cgroup driver, systemd would set peirod to 100ms and quota to a calculated value other than -1. Then further change to quota/period would be rejected by liinux kernel if child node have higher quota/period ratio after setting period and before setting quota.