kubelet consistently timing out on attempting to Destroy cgroups #92766

@haircommander

Description

What happened:
The CRI-O CI tests have been in bad shape recently. While debugging, I found that the kubelet logs are filled with:

Timed out while waiting for StopUnit(kubepods-besteffort-pod867fd309_03ba_4715_a044_29393f495cea.slice) completion signal from dbus. Continuing...
grep 'Timed out' /tmp/kubelet.log | wc -l
352562
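
To see whether the timeouts are spread across many pods or concentrated on a few units, the log can be grouped by unit name. The following is a minimal sketch, not part of the original debugging: the regular expression is derived only from the log line quoted above, and the function name `count_stopunit_timeouts` is hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the single log line quoted in this report;
# other kubelet versions may word the message differently (assumption).
TIMEOUT_RE = re.compile(
    r"Timed out while waiting for StopUnit\((?P<unit>[^)]+)\) "
    r"completion signal from dbus"
)


def count_stopunit_timeouts(lines):
    """Return a Counter mapping systemd unit name -> number of timeout lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("unit")] += 1
    return counts


# Example usage against the log file from this report:
#
#     with open("/tmp/kubelet.log", errors="replace") as log:
#         counts = count_stopunit_timeouts(log)
#     for unit, n in counts.most_common(10):
#         print(n, unit)
```

A small number of units with very high counts would suggest the kubelet keeps retrying the same pod cgroups, whereas a flat distribution would suggest every pod teardown is affected.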

AFAICT, this comes from the combination of bumping to Go 1.14.4 and commit a03db63: I ran two separate PRs, each dropping one of these changes, and neither showed similar problems.

I am fairly certain this is NOT a problem with Kubernetes directly, but rather some odd interaction between Go 1.14 and libcontainer, go-systemd, or godbus. Still, I figured it could be opened here to start the conversation.

What you expected to happen:
StopUnit should not time out

How to reproduce it (as minimally and precisely as possible):
Run a node with cgroup v1.

Build hyperkube with Go 1.14.4 (as is now required).

Run hack/local-up.sh

Create and remove a pod, then observe that its cgroup fails to be torn down.
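
One way to check the last step is to look for pod slices that remain in the cgroup hierarchy after the pod has been deleted. The sketch below is an illustration under stated assumptions: it assumes the systemd cgroup driver, it assumes pod slices follow the naming seen in the log line above (`kubepods-<qos>-pod<uid>.slice`, with underscores in place of dashes in the UID), and the root path in the usage comment and the helper name `find_pod_slices` are hypothetical.

```python
import os
import re

# Matches pod slice directory names such as
# kubepods-besteffort-pod867fd309_03ba_4715_a044_29393f495cea.slice.
# The QoS segment is optional (assumption: some pods have no QoS segment).
POD_SLICE_RE = re.compile(r"^kubepods(-[a-z]+)?-pod(?P<uid>[0-9a-f_]+)\.slice$")


def find_pod_slices(cgroup_root):
    """Walk cgroup_root and return {pod_uid: [slice directory paths]}."""
    found = {}
    for dirpath, dirnames, _filenames in os.walk(cgroup_root):
        for name in dirnames:
            match = POD_SLICE_RE.match(name)
            if match:
                # Convert the escaped UID back to its usual dashed form.
                uid = match.group("uid").replace("_", "-")
                found.setdefault(uid, []).append(os.path.join(dirpath, name))
    return found


# Example usage on a cgroup v1 node (path is an assumption):
#
#     for uid, paths in find_pod_slices("/sys/fs/cgroup/systemd").items():
#         print(uid, paths)
```

Comparing the returned UIDs with the UIDs of pods that still exist (for example from `kubectl get pods -o jsonpath='{.items[*].metadata.uid}'`) shows which slices were left behind.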

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    master
  • Cloud provider or hardware configuration:
    aws
  • OS (e.g: cat /etc/os-release):
ID=fedora
VERSION_ID=30
VERSION_CODENAME=""
PLATFORM_ID="platform:f30"
PRETTY_NAME="Fedora 30 (Cloud Edition)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:30"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=30
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=30
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Cloud Edition"
VARIANT_ID=cloud

though this also happens on our RHEL 7 boxes

  • Kernel (e.g. uname -a):
uname -a
Linux ip-172-18-11-215.ec2.internal 5.6.13-100.fc30.x86_64 #1 SMP Fri May 15 00:36:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    build locally
  • Network plugin and version (if this is a network-related bug):
  • Others:

Labels:

  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.