Skip to content

Update runc to 1.0.0#102508

Merged
k8s-ci-robot merged 2 commits intokubernetes:masterfrom
kolyshkin:runc-1.0
Jul 1, 2021
Merged

Update runc to 1.0.0#102508
k8s-ci-robot merged 2 commits intokubernetes:masterfrom
kolyshkin:runc-1.0

Conversation

@kolyshkin
Copy link
Copy Markdown
Contributor

@kolyshkin kolyshkin commented Jun 1, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes: #102676

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

none

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 1, 2021
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @kolyshkin. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 1, 2021
@k8s-ci-robot k8s-ci-robot requested review from a team, logicalhan and odinuge June 1, 2021 21:57
@k8s-ci-robot k8s-ci-robot added area/dependency Issues or PRs related to dependency changes area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 1, 2021
@BenTheElder
Copy link
Copy Markdown
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2021
@BenTheElder
Copy link
Copy Markdown
Member

@odinuge
Copy link
Copy Markdown
Member

odinuge commented Jun 3, 2021

/test pull-kubernetes-node-crio-cgrpv2-e2e

@odinuge
Copy link
Copy Markdown
Member

odinuge commented Jun 3, 2021

Test failure in pull-kubernetes-node-crio-cgrpv2-e2e is expected. Looks like this works ok on cgroup v2 as well.

@odinuge
Copy link
Copy Markdown
Member

odinuge commented Jun 4, 2021

Is there a reason why we don't want to split the dependency bump and the device changes @kolyshkin?

@kolyshkin
Copy link
Copy Markdown
Contributor Author

Is there a reason why we don't want to split the dependency bump and the device changes @kolyshkin?

runc 1.0.0 includes the fix for SkipDevices to work properly with the systemd driver (opencontainers/runc#2958). This is why the update and the removal of "allow all" device rule is tied together. One thing this PR does is making sure this PR really fixes the issue.

Surely we can do it in a sequential fashion -- update to 1.0 first, then remove the rules. I don't see the reason why though.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/apiserver and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 7, 2021
@kolyshkin
Copy link
Copy Markdown
Contributor Author

if we intend to take this back to 1.21

I believe we do

I'd prefer we bump the dependency before making other changes we don't intend to take back to 1.21

Do you mean to split this PR into two? I would say it's not needed, as the other change is a cleanup coupled to runc-1.0.0 bump, but sure, I have separated the second commit into #103163

@kolyshkin
Copy link
Copy Markdown
Contributor Author

@odinuge @dims @liggitt this needs to be re-LGTM'ed (the only changes are second commit is removed)

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jun 24, 2021

Do you mean to split this PR into two?

no, I was talking about merge order between #92863 and this... if #92863 is not intended to go to 1.21, we should merge this PR first

@kolyshkin
Copy link
Copy Markdown
Contributor Author

no, I was talking about merge order between #92863 and this... if #92863 is not intended to go to 1.21, we should merge this PR first

Ah, OK, I reverted the split, so both commits are back here.

Yes, I think this should go in first, and #92863 is not intended to be backported to 1.21.

@kolyshkin
Copy link
Copy Markdown
Contributor Author

/assign @liggitt

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jun 30, 2021

/approve
dep mechanics lgtm, node approval at #102508 (review)

needs a final update-vendor.sh run... looks like there's a stale go.sum file

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, kolyshkin, liggitt, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

k8s-ci-robot commented Jun 30, 2021

@kolyshkin: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-node-kubelet-serial-crio-cgroupv2 3246ca7554fa69e3358c0ff0b0324d4da1447053 link /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
pull-kubernetes-node-kubelet-serial-crio-cgroupv1 3246ca7554fa69e3358c0ff0b0324d4da1447053 link /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
pull-kubernetes-node-kubelet-serial 3246ca7554fa69e3358c0ff0b0324d4da1447053 link /test pull-kubernetes-node-kubelet-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@BenTheElder
Copy link
Copy Markdown
Member

should we be upgrading the runc binary in the CI jobs? AFAIK this has been pretty adhoc

@dims
Copy link
Copy Markdown
Member

dims commented Jun 30, 2021

@BenTheElder we update the runc in CI when we update containerd (goes as a pair)

@BenTheElder
Copy link
Copy Markdown
Member

@BenTheElder we update the runc in CI when we update containerd (goes as a pair)

Yes, but do we do that after this PR? containerd has already updated AFAIK.

Also the project has blocking CI on GCE with:

  • docker
  • containerd
  • cri-o *

Plus KIND with it's own version of containerd.

So if we should be updating it, we have some work to coordinate.

This is to check if runc 1.0.0 (to be released shortly) works with k8s.

The commands used were (roughly):

	hack/pin-dependency.sh github.com/opencontainers/runc v1.0.0
	hack/lint-dependencies.sh
	# Follow its recommendations.
	hack/pin-dependency.sh github.com/cilium/ebpf v0.6.1
	hack/pin-dependency.sh github.com/opencontainers/selinux v1.8.2
	hack/pin-dependency.sh github.com/sirupsen/logrus v1.8.1
	# Recheck.
	hack/lint-dependencies.sh
	GO111MODULE=on go mod edit -dropreplace github.com/willf/bitset
	hack/update-vendor.sh
	# Recheck.
	hack/lint-dependencies.sh
	hack/update-internal-modules.sh
	# Recheck.
	hack/lint-dependencies.sh

[v2: rebased, updated runc 3a0234e1fe2e82 -> 2f8e8e9d977500]
[v3: testing master + runc pr 3019]
[v4: updated to 93a01cd4d0b7a0f08a]
[v5: updated to f093cca13d3cf8a484]
[v6: rebased]
[v7: updated to runc v1.0.0]
[v8: rebased]

Signed-off-by: Kir Kolyshkin <[email protected]>
Since runc 1.0.0 it is now sufficient to have SkipDevices: true.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Copy link
Copy Markdown
Contributor Author

needs a final update-vendor.sh run... looks like there's a stale go.sum file

Rebased, re-run update-vendor.sh, fixed now. The cause, I guess, is a recently merged #103099 (commit 6f9011a)

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jun 30, 2021

/lgtm

@kolyshkin
Copy link
Copy Markdown
Contributor Author

1.21 backport: #103392

@hakman
Copy link
Copy Markdown
Member

hakman commented Jul 3, 2021

@kolyshkin The kOps project has some periodic e2e conformance tests that run against the ci/latest k8s marker. This week we noticed that these tests stopped working. Symptoms were that the static pods used to bring up the cluster failed to start randomly, tracked under kubernetes/kops#11911. Together with @justinsb @rifelpet and @olemarkus we tried to find clues, but found nothing conclusive.

The conformance tests results can be found here: https://testgrid.k8s.io/kops-misc#kops-aws-misc-amd64-conformance.

After some testing using various kubernetes versions, I think I tracked the moment that the issue appeared to this change:
https://storage.googleapis.com/kubernetes-release-dev/ci/v1.22.0-beta.0.307+c206af0367f76f - works as before
https://storage.googleapis.com/kubernetes-release-dev/ci/v1.22.0-beta.0.310+4748bb04b61f87 - static pods fail to start

Would you mind checking if this is in any way related with failures we are seeing in kOps tests?

@BenTheElder
Copy link
Copy Markdown
Member

For some clarity: c206af0...4748bb0 (based on commits in links above)

@hakman
Copy link
Copy Markdown
Member

hakman commented Jul 4, 2021

We managed to find more info about the issue we are seeing in kOps. By switching the cgroup driver from systemd to cgroupfs, clusters are starting again and can pass the conformance tests. I hope this helps.
kubernetes/kops#11919
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kops-e2e-k8s-ci

@odinuge
Copy link
Copy Markdown
Member

odinuge commented Jul 5, 2021

Thanks for the report @hakman! And thanks for all the details! I have done some digging, and it turns out opencontainers/runc#3019 (I am pretty sure, but have not bisected) cause some strange behavior where various cgroups are freezed without being thawed (more info here: https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt). We still lack a great deal of CI coverage for the systemd cgroup driver....

The result of this freezing is that everything just stops, and stuff just times out in places where that shouldn't be possible.

I vote for reverting this PR, and awaits a proper fix, since this essentially breaks kubelet systemd cgroup driver support :/

The regression in v1.21 could also be fixed by #102250 as well, so we should look into that in order to fix that release (or revert the previous runc bump there as well).

Output after kubeadm init (that just times out):

odin@xps ~ % grep -R ""  /sys/fs/cgroup/freezer/kubepods.slice/**/freezer.state 
/sys/fs/cgroup/freezer/kubepods.slice/freezer.state:THAWED
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-besteffort.slice/freezer.state:THAWED
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6ca0ea8a_ed75_4049_8ca1_40a79c10a2b8.slice/freezer.state:THAWED
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1519ed09d4a881bb0d1512a9b0b37628.slice/cri-containerd-2d5af28874c4f84c788ed9d6b73f97ce25d1475285bf505bbbe871ebc6dbec9b.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1519ed09d4a881bb0d1512a9b0b37628.slice/cri-containerd-535fded4daa260ba8d41d3ffc7db4edecc14912cdfd3f883347ce4c68469fb13.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1519ed09d4a881bb0d1512a9b0b37628.slice/cri-containerd-f1459055dd2ce055b03b14d56f32f8d1a9219b218c0b41afa29d05956c82c2e3.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1519ed09d4a881bb0d1512a9b0b37628.slice/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1e851413_9e5c_476a_a807_55f54700db56.slice/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2169c1802eafeb63cc8e3e07d6294e16.slice/cri-containerd-9cf394c64876814e6d90c382ae0ba54e7198e39abd108021404b4c17278ce01f.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2169c1802eafeb63cc8e3e07d6294e16.slice/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7083cd00f13e827a6dfc9f3d4d9510b3.slice/cri-containerd-101c8e3864d28bbcb1d1c1b40eaa2129a65c86120f38eb4cc5b635561e914c5e.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7083cd00f13e827a6dfc9f3d4d9510b3.slice/cri-containerd-eefa3339f42c5f861b3b0304bf7425d652250de727e11b849ef951df06e89739.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7083cd00f13e827a6dfc9f3d4d9510b3.slice/cri-containerd-fcd69f24e18b31cec69be0127b9097bc63a60a8ce838fb603eb588c55fc6707c.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7083cd00f13e827a6dfc9f3d4d9510b3.slice/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod78aa97a0670eff5cbcc2e5f9d4370fe1.slice/cri-containerd-c6dab44a2943cd7c4146893d2577a1495946b5bb12b8884065159095a2960324.scope/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod78aa97a0670eff5cbcc2e5f9d4370fe1.slice/freezer.state:FROZEN
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod87c9b6ea_8287_4946_9b34_e77647c249cb.slice/freezer.state:FROZEN

@odinuge
Copy link
Copy Markdown
Member

odinuge commented Jul 5, 2021

Opened a revert here: #103483

/cc @dims @liggitt @kolyshkin @hakman

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@odinuge: GitHub didn't allow me to request PR reviews from the following users: kolyshkin.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

Opened a revert here: #103483

/cc @dims @liggitt @kolyshkin @hakman

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkumatag
Copy link
Copy Markdown
Member

mkumatag commented Jul 6, 2021

we are also seeing issues on ppc64le platform, hitting with the following error while starting the containers,

Jul 05 08:40:12 config1-1625472016-master.power-iaas.cloud.ibm.com kubelet[25341]: E0705 08:40:12.449311   25341 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to start sandbox container for pod \"etcd-config1-1625472016-master.power-iaas.cloud.ibm.com\": operation timeout: context deadline exceeded"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/cloudprovider area/code-generation area/dependency Issues or PRs related to dependency changes area/kubectl area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Pod stuck in ContainerCreating: Unit ...slice already exists