Give e2e_node serial tests more memory #21828
Conversation
Increase machine size to get more memory, since sshd, and therefore e2e_node.test, get OOM-killed:

```
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/sshd.service,task=e2e_node.test,pid=1856,uid=0
kernel: Out of memory: Killed process 1856 (e2e_node.test) total-vm:12342304kB, anon-rss:435388kB, file-rss:0kB, shmem-rss:91832kB, UID:0 pgtables:1340kB oom_score_adj:0
kernel: oom_reaper: reaped process 1856 (e2e_node.test), now anon-rss:0kB, file-rss:0kB, shmem-rss:3736kB
```

It also looks like there is a memory leak in e2e_node.test as well, but increase the machine size to see if it works.

```
kernel: Tasks state (memory values in pages):
kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
kernel: [    116]     0   116    59391      197   479232        0             0 systemd-journal
kernel: [    131]     0   131     3962      258    69632        0         -1000 systemd-udevd
kernel: [    216]   274   216     5900      212    65536        0             0 systemd-network
kernel: [    261]   275   261     4580      114    81920        0             0 systemd-resolve
kernel: [    268]     0   268     3183      143    65536        0             0 systemd-logind
kernel: [    274]   201   274     3476      138    61440        0          -900 dbus-daemon
kernel: [    303]     0   303    29690     1256   106496        0             0 google_osconfig
kernel: [    311]     0   311   622185    17311  1220608        0          -999 containerd
kernel: [    312]     0   312     2088       27    53248        0             0 agetty
kernel: [    316]     0   316    21477       38    69632        0             0 chronyd
kernel: [    330]     0   330     4618      150    81920        0         -1000 sshd
kernel: [    343]     0   343   103441     1109   131072        0             0 device_policy_m
kernel: [    376]     0   376     8855     3237   110592        0             0 google_network_
kernel: [    379]     0   379     8890     3277   110592        0          -999 google_accounts
kernel: [    380]     0   380     8831     3197   110592        0             0 google_clock_sk
kernel: [    947]     0   947     4731      198    81920        0             0 sshd
kernel: [    950] 20141   950     4731      213    81920        0             0 sshd
kernel: [    951] 20141   951     4720      148    77824        0             0 sudo
kernel: [    953]     0   953     1314       38    45056        0             0 sh
kernel: [    954]     0   954     2385       37    57344        0             0 timeout
kernel: [    955]     0   955   289381     1492   159744        0             0 ginkgo
kernel: [    961]     0   961   384197    49041   688128        0             0 e2e_node.test
kernel: [   1856]     0  1856  3085576   131805  1372160        0             0 e2e_node.test
[...]
```
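A note on reading the dump: the `rss` column in the kernel's tasks-state table is counted in pages (4 KiB on x86_64), while the `Out of memory` line reports kB, so the two views can be cross-checked with a little arithmetic. A minimal sketch for the killed e2e_node.test entry:

```shell
# rss in the "Tasks state" table is in 4 KiB pages; convert the
# e2e_node.test entry (pid 1856, rss 131805 pages) to MiB.
rss_pages=131805
echo "$(( rss_pages * 4096 / 1024 / 1024 )) MiB"   # ~515 MiB resident
```

That lines up with the reported `anon-rss:435388kB` plus `shmem-rss:91832kB` (~515 MB combined) at kill time.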
7060135 to e77e774
/cc SergeyKanzhelev dims

/lgtm thank you!

thanks for this! I was looking in this area and this helps a lot. We need to monitor if some tests are leaking indeed.

/lgtm
A quick data point I can provide: the cpumanager tests alone routinely take ~450MB RSS, but that is at steady state. The e2e test rebuilds binaries, and that stage can eat a lot of memory. I need to check what exactly happens on CI, though, even if I believe we run the very same command there - hence we do indeed recompile stuff here.
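One low-tech way to collect such data points on a node is to sample a process's resident set from `/proc`. A hedged sketch (it reads the current shell's own pid so it is self-contained; in practice you would point `pid` at the running e2e_node.test process instead):

```shell
# Sample resident memory of a process via /proc/<pid>/status.
# VmRSS is reported by the kernel in KiB. $$ (this shell) is used
# here only so the snippet runs standalone.
pid=$$
awk -v pid="$pid" '/^VmRSS:/ { printf "pid %s rss: %d KiB\n", pid, $2 }' "/proc/$pid/status"
```

Polling this in a loop during a serial run would show whether RSS grows monotonically (a leak) or spikes only during the rebuild stage.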
/lgtm
Looking at a recent serial log (Serial Log): there are quite a few containerd / containerd-shim processes around the time of that OOM kill.
Yeah, that was essentially what I found as well. Since a few tests in e2e serial spin up 100+ pods, ~3.75GiB is not much. Imo, the node e2e serial tests are some of the most important ones we have, since the regressions and bugs we can find with them are extremely hard to find when running either a dev environment or a huge prod cluster.
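To put a rough number on that, summing the `rss` column (4 KiB pages) for the biggest entries in the tasks-state dump above gives a lower bound on what the test workload holds resident. A small sketch with values copied from the dump (the selection of processes is just illustrative):

```shell
# Sum resident pages for the top consumers from the tasks-state dump
# above (rss is in 4 KiB pages), then convert to MiB.
awk '{ pages += $2 } END { printf "%d MiB\n", pages * 4 / 1024 }' <<'EOF'
e2e_node.test 131805
e2e_node.test 49041
containerd 17311
EOF
```

These three processes alone account for roughly 774 MiB resident, before counting the shims and the 100+ pods themselves.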
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, mrunalp, odinuge. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment.
Increase machine size to get more memory, since sshd, and therefore
e2e_node.test, get OOM-killed.
The test has been failing for ages: https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial
It also looks like there is a memory leak in e2e_node.test as well (or the tests really do use that much memory),
but increase the machine size first to see if it works.