Large kubemark performance tests failing with timeout during ns deletion #53327
I'm seeing multiple such error lines in the hollow-kubelet logs of the currently live cluster:
@kubernetes/sig-node-bugs @yujuhong @Random-Liu - Any ideas?
Seems like the kubelet's ImageGCPeriod is 5m and I'm seeing such logs every 5 minutes - so it's pretty much failing all the time. FYI - I'm also seeing one such error log once at the start:
[Edit]: Ignore this error. It's only seen initially, when node info is not available yet. Later it goes fine and reaches the next stage where root cgroup stats are fetched, but that too fails (as I think the hollow-kubelet doesn't expose stats). But these errors are unrelated to our problem, IIUC.
Ok.. So here's what I found out from the live cluster:
Smells like a bug in GC. It also doesn't look right that we can have a pod in such a 'forever undeletable' state. @gmarek Any idea about this? Or whom should I be cc'ing here?
I think it's desired behavior if the Kubelet is unresponsive. We stopped issuing 'hard deletes' from the control plane because of StatefulSets (cc @smarterclayton). The control plane now only marks the Pod for deletion and waits for the kubelet to clean up resources and ack the delete by sending a second delete request.
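For illustration, here is a minimal client-go sketch of that two-phase flow (hypothetical namespace and pod name, and assuming a recent client-go where Delete takes a context): the first, graceful delete only sets deletionTimestamp on the pod; the object is actually removed only after a delete with grace period 0, which is normally issued by the kubelet once it has cleaned up.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (illustrative only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	pods := client.CoreV1().Pods("default") // hypothetical namespace

	// Step 1: graceful delete. The API server only sets deletionTimestamp
	// and deletionGracePeriodSeconds; the pod object stays around.
	grace := int64(30)
	if err := pods.Delete(context.TODO(), "example-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
		fmt.Println("graceful delete failed:", err)
	}

	// Step 2: normally the kubelet, after reclaiming the pod's resources,
	// acks by deleting with grace period 0, which removes the object.
	zero := int64(0)
	if err := pods.Delete(context.TODO(), "example-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &zero}); err != nil {
		fmt.Println("final delete failed:", err)
	}
}
```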
FYI - the corresponding hollow-kubelet was alive when I checked.
Then we need logs and to involve @dchen1107, @yujuhong or someone else from the Node team.
Let me turn on hollow-node logs for 5k with --v=2 for the kubelet (would that be sufficient @yujuhong?)
My guess is that it's hard to tell up front ;) Certainly won't hurt to have more logs.
Are we sure that the nodes are not CPU-starved? I think we've seen something like that in the past in such a scenario.
I'm seeing the following logs throughout the test from the real kubelet:
The hollow-kubelet resource usages are not directly available from the logs, but I took a look at the live cluster now and they seem normal so far (I'll look again during deletion).
I dug deeper into this today and here's what I found:
Will look at the other smaller-scale kubemark jobs too, to check if we're seeing a similar issue there.
Ok.. It seems like there's also a bump in time for the kubemark-500 job (from 47m -> 1h) across runs 8256 and 8257 (https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce?before=8275). And this was around the same time we saw the bump for kubemark-5000. This is great, we now have a much smaller diff.
We're also seeing a small difference for kubemark-100 (from 40m -> 45m) across runs 4369 and 4370 (https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-kubemark-100-gce?before=4385), but the diff is the same. From the diff:
@shyamjvs - since this is visible in kubemark-500, can you please locally revert the PR you suspect and verify if it's that?
Yup.. Doing that now.
Some updates:
Unable to reproduce the difference locally. Also can't see any obvious test-infra changes across those runs.
As discussed offline, this is clearly related to the scheduler:
It seems to me that this is a QPS-related issue. BTW - looking into kubemark-scale, it seems there might be yet another regression in the scheduler. @kubernetes/sig-scheduling-bugs
@dashpole From the above, I believe the problem is not that killContainer() failed with an error for that container. Instead, the container was not even seen as part of the pod (because the container ID field was empty), which probably explains why you couldn't see those logs: the for loop you referred to was iterating over an empty list (one which didn't include the container at all). Does that sound reasonable to you?
It's a bug in the fake docker client filtering. Thanks to @Random-Liu for helping debug this with me.
Ok... I think I've found the cause. Before deleting a pod, the kubelet checks whether all of its containers have been deleted (this check was introduced by @dashpole in his PR #50350). However, the way we check that all containers have been deleted is by checking that the length of the pod's container-status list (as reported by the runtime) is zero. This is a bug IMO. I'll send out a fix for this. It should most likely also solve this regression.
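To make the failure mode concrete, here is an illustrative sketch (hypothetical names, not the actual kubelet code) of a check of that shape: it treats an empty container-status list from the runtime as proof that all containers are gone, so a container that gets filtered out of the list is silently treated as deleted.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration.
type containerStatus struct {
	ID    string
	State string // e.g. "created", "running", "exited"
}

type podRuntimeStatus struct {
	ContainerStatuses []containerStatus
}

// containersAreDeleted mimics the kind of check described above: an empty
// status list is taken to mean every container has been removed. If the
// runtime (or a buggy mock) drops a container from the list, this returns
// true even though the container still exists and will never be cleaned up.
func containersAreDeleted(status podRuntimeStatus) bool {
	return len(status.ContainerStatuses) == 0
}

func main() {
	// A container still exists, but it was filtered out of the reported statuses.
	reported := podRuntimeStatus{ContainerStatuses: nil}
	fmt.Println(containersAreDeleted(reported)) // true, despite the leftover container
}
```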
Sorry.. Just saw your above comment after posting mine. What did you find out?
Where do you see that non-existent containers have an empty containerID?
Oh, I see.
So the key is that we are not looking at the status in the status manager. We were looking at the containerStatuses from the status provided by the runtime.
The other important thing to note here is that the PodResourcesAreReclaimed function and the HandlePodCleanups function use different caches. The podCache (used by PodResourcesAreReclaimed) is just based on the most recent PLEG events, whereas the runtimeCache (used by HandlePodCleanups) is populated by querying the runtime. Hence the bug in filtering means we do not see the container in HandlePodCleanups, but we do see it in PodResourcesAreReclaimed.
Discussed offline with @dashpole and he updated his PR to fix the bug in the docker client mock. Turns out my guess in #53327 (comment) was right that there are issues on the mock side :) Also, I closed my PR as it's not needed. @dashpole Let's get your change in ASAP, so we can have some results by tomorrow. Hoping for a green run.
Automatic merge from submit-queue (batch tested with PRs 55893, 55906, 55026). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

[Test Fix] Mock docker network dependencies and fix filtering bug

This PR only affects the mocked docker runtime, and has no impact on the kubelet. Issue #53327

When kubernetes creates a pod using the docker shim, it creates a container which contains the pod's network namespace, and then creates containers which specify that namespace. The current mocked docker does not mock this interaction, and thus allows a container to be created even when the container whose network it is joining does not exist. This allows the mocked kubelet to end up in a state where the pod does not exist, but a container in the pod does, and this breaks pod deletion.

This fixes the above by only allowing containers to be started if the container whose network they are trying to join is running. Additionally, this PR fixes a filtering bug where we were incorrectly comparing docker container statuses.

/assign @shyamjvs can you test this to see if it fixes the issue?
/assign @Random-Liu for approval after @shyamjvs confirms this works.
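Roughly, the guard this PR describes could look like the following sketch (hypothetical fake-client types, not the real dockershim test code): starting a container that joins another container's network namespace fails unless that network container exists and is running.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical fake docker client, for illustration only.
type fakeContainer struct {
	ID          string
	Running     bool
	NetworkMode string // e.g. "container:<id>" when joining another container's netns
}

type fakeDockerClient struct {
	containers map[string]*fakeContainer
}

// StartContainer sketches the dependency the PR adds to the mock: a container
// that joins another container's network namespace can only start if that
// network container exists and is running.
func (c *fakeDockerClient) StartContainer(id string) error {
	ctr, ok := c.containers[id]
	if !ok {
		return fmt.Errorf("no such container: %s", id)
	}
	if strings.HasPrefix(ctr.NetworkMode, "container:") {
		netID := strings.TrimPrefix(ctr.NetworkMode, "container:")
		netCtr, ok := c.containers[netID]
		if !ok || !netCtr.Running {
			return fmt.Errorf("cannot join network of container %q: not running", netID)
		}
	}
	ctr.Running = true
	return nil
}

func main() {
	c := &fakeDockerClient{containers: map[string]*fakeContainer{
		"app": {ID: "app", NetworkMode: "container:sandbox"},
	}}
	fmt.Println(c.StartContainer("app")) // fails: the sandbox container does not exist
}
```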
Closing this as we finally have the kubemark-5000 job green - https://k8s-testgrid.appspot.com/sig-scalability-kubemark#kubemark-5000 (run 737) /close
Seems like this issue is being seen again in the following run - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/738 cc @dashpole
My strong feeling is that your fix is working almost as desired, except that sometimes, by chance, it is failing to delete certain containers. I suspect if your
Sadly, we don't have the hollow-kubelet logs for kubemark-5000 as we're not collecting them due to their explosive volume.
The function is based on toDockerContainerStatus and toRuntimeAPIContainerState. It should be correct. It is possible there is another bug somewhere.
@dashpole I managed to get one of the hollow-kubelet logs from the currently running kubemark-5000 cluster where we're seeing a forever-undeleted pod 'load-medium-1010-gh8nw'. Attaching it here - And the logs concerning that particular pod are:
Clearly we're no longer seeing those 'pod is terminated, but some containers have not been cleaned up' messages forever (like we were earlier) - which seems to be because of your fix. However, the pod still remains undeleted. My suspicion is something around the network-related hack you added in the hollow-kubelet as part of the fix. Could you PTAL?
Looking...
This looks like the culprit log line:
Ref: #56821
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

[Test Fix] Fake docker client can remove containers which have not been started

**What this PR does / why we need it**: During kubemark-5000 scalability tests, which use the fake docker client, we encountered a bug where containers belonging to pods that were deleted before the container was started could not be deleted. This is because we only remove containers from the `ExitedContainers` list. Containers are only added to this list when they have been created, started, and then stopped. However, containers that have only been created, but not started, cannot be deleted. This PR fixes this issue by allowing containers with `State.Running=false` to be deleted.

**Which issue(s) this PR fixes**: Ref #53327

**Release note**:
```release-note
NONE
```

/sig node
/kind bug
/priority critical-urgent
/assign @Random-Liu @dchen1107 @shyamjvs
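The remove-side fix described above, sketched with the same kind of hypothetical fake-client types (not the real test code): any container that is not currently running may be removed, including one that was created but never started, instead of only containers that made it onto the exited list.

```go
package main

import "fmt"

// Hypothetical fake docker client, for illustration only.
type fakeContainer struct {
	ID      string
	Running bool
}

type fakeDockerClient struct {
	containers map[string]*fakeContainer
}

// RemoveContainer sketches the fixed behavior: deletion is keyed off
// Running being false, so containers that were created but never started
// can also be removed (pre-fix, only containers on the exited list could
// be deleted).
func (c *fakeDockerClient) RemoveContainer(id string) error {
	ctr, ok := c.containers[id]
	if !ok {
		return fmt.Errorf("no such container: %s", id)
	}
	if ctr.Running {
		return fmt.Errorf("cannot remove running container: %s", id)
	}
	delete(c.containers, id)
	return nil
}

func main() {
	c := &fakeDockerClient{containers: map[string]*fakeContainer{
		"never-started": {ID: "never-started", Running: false},
	}}
	fmt.Println(c.RemoveContainer("never-started")) // <nil>: removable even though it never ran
}
```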
The job is now green - https://k8s-testgrid.appspot.com/sig-scalability-kubemark#kubemark-5000 Closing this issue. Thanks!
Yes, moving to -b is my plan. I kept it in -a so far to limit the number of things changing at the same time.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Optimize random string generator to avoid multiple locks & use bit-masking

Ref kubernetes/kubernetes#53327

We recently started seeing a 50% decrease in scheduling throughput (e.g. in the kubemark-500 scale job) and it turns out kubernetes/kubernetes#53135 introduced it. The reason is [this call](https://github.com/jsafrane/kubernetes/blob/2caae38d323720a96be34606589f41488ba82b87/plugin/pkg/scheduler/algorithm/predicates/predicates.go#L272) to create a random 32-length string. From the code of the `rand` utility (which is heavily used throughout the system for randomizing object names), I noticed the following performance issues:

- To create an n-length string, we make n calls to `rand.Intn()`, each of which does a lock+unlock operation on the RNG, while just 1 lock+unlock operation is enough for all of them.
- We choose one character (from an alphabet of 27 chars) per random integer, while we can select 10 characters using a single int63 (by masking and bit-shifting), as 1 character uses just 5 bits of randomness.
- The character set is defined as a global slice (mutable), so the compiler needs to fetch the length of the slice on each invocation of `len()` (we're making n of those), while we can just use a const string (immutable), which makes len directly available as a cached constant (yes, Go does it!).

This PR makes the above fixes. I'll try to add some benchmarking to measure the difference (as @wojtek-t suggested). See the sketch after this comment for the general shape of the bit-masking approach.

/cc @kubernetes/sig-scalability-misc @kubernetes/sig-scheduling-bugs @kubernetes/sig-api-machinery-misc @wojtek-t @smarterclayton

Kubernetes-commit: f1d9962fec2ee2d52f41bcbc14b920a3ab66d9a9
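A generic sketch of the bit-masking technique this PR describes (not a copy of the Kubernetes `rand` utility; the alphabet and helper names are assumptions for illustration): draw one 63-bit random value, then peel off 5-bit indices into a constant alphabet, rejecting indices that fall outside it. With a shared, locked source, the whole string can be generated under roughly one lock acquisition per 12 characters instead of one per character.

```go
package main

import (
	"fmt"
	"math/rand"
)

// A const (immutable) string lets the compiler treat len(alphabet) as a constant.
// Example 27-character alphabet, matching the count mentioned in the PR.
const alphabet = "bcdfghjklmnpqrstvwxz2456789"

const (
	idxBits = 5              // 5 bits can index up to 32 characters
	idxMask = 1<<idxBits - 1 // mask to extract the low idxBits bits
	idxMax  = 63 / idxBits   // indices obtainable from a single Int63()
)

// randString draws several characters per Int63() call instead of one call
// per character. Indices >= len(alphabet) are rejected to keep the
// distribution uniform.
func randString(n int, src rand.Source) string {
	b := make([]byte, n)
	for i, cache, remain := 0, src.Int63(), idxMax; i < n; {
		if remain == 0 {
			cache, remain = src.Int63(), idxMax
		}
		if idx := int(cache & idxMask); idx < len(alphabet) {
			b[i] = alphabet[idx]
			i++
		}
		cache >>= idxBits
		remain--
	}
	return string(b)
}

func main() {
	fmt.Println(randString(32, rand.NewSource(42)))
}
```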
We've been seeing these failures continuously in kubemark-5000 for quite some time now - https://k8s-testgrid.appspot.com/sig-scalability#kubemark-5000
Even in kubemark-500 we're occasionally seeing flakes - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/8806
At first look, I'm seeing that hollow-nodes are going pending during test namespace deletion:
Digging into it now. Sorry for the delay, I've been busy with release scalability validation.
cc @kubernetes/sig-scalability-bugs @wojtek-t @gmarek