Testsuite: attempt to find / avoid valgrind warnings of killed processes#9679

Merged
oranagra merged 1 commit into redis:unstable from oranagra:valgrind_killed_processes
Oct 26, 2021

@oranagra
Member

I recently started seeing many empty valgrind reports in the daily CI,
i.e. output showing the valgrind header but no leak report, which causes the tests to fail:
https://github.com/redis/redis/runs/3991335416?check_suite_focus=true

This commit changes two things:

  • First, since valgrind is simply slow, we used to give processes a 60-second timeout on shutdown instead of the 10 seconds we give normally. This commit raises that to 120 seconds.
  • Second, when we reach the timeout, we first send SIGSEGV, hoping to get a stack trace showing where redis is hung, and we resort to SIGKILL only once double that time has passed.
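
The escalation described in the two points above can be sketched roughly as follows. This is only an illustration in Python, not the test suite's actual Tcl code; the names `shutdown_timeout` and `escalation_signal` are hypothetical, and the sketch assumes a POSIX system where `signal.SIGKILL` is available.

```python
import signal

NORMAL_TIMEOUT = 10     # seconds, the usual shutdown timeout
VALGRIND_TIMEOUT = 120  # seconds, raised from 60 by this commit


def shutdown_timeout(valgrind: bool) -> int:
    """Pick the shutdown timeout depending on whether valgrind is in use."""
    return VALGRIND_TIMEOUT if valgrind else NORMAL_TIMEOUT


def escalation_signal(elapsed: float, timeout: float):
    """Return the signal to send to a process that is still alive after
    `elapsed` seconds, or None if we should keep waiting.

    SIGSEGV is tried first once the timeout is reached, in the hope of
    getting a stack trace; SIGKILL is used only after double the timeout.
    """
    if elapsed >= 2 * timeout:
        return signal.SIGKILL
    if elapsed >= timeout:
        return signal.SIGSEGV
    return None
```

For example, with valgrind enabled a process still alive after 120 seconds would get SIGSEGV, and SIGKILL only after 240 seconds.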

Note that if there are indeed hung processes, we would previously not have seen them in the non-valgrind runs, since the tests did not detect any failure in that case. Now they will, since crashlog_from_file is run after kill_server.
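
To illustrate why checking the log after the kill matters, here is a minimal Python sketch of the idea: once the process has been sent SIGSEGV, its log can be scanned for a crash report, and a non-empty result can fail the test. This is not the suite's `crashlog_from_file` helper; the function name `extract_crash_report` and the default marker string are assumptions made for the example.

```python
def extract_crash_report(log_text: str,
                         marker: str = "REDIS BUG REPORT START") -> str:
    """Return the log from the first line containing `marker` to the end,
    or an empty string if no such line exists.

    The marker is an assumption for this sketch; pass whatever string
    delimits a crash report in the logs being checked.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if marker in line:
            return "\n".join(lines[i:])
    return ""
```

A test harness could call this on the server's log right after killing it and report a failure whenever the returned string is non-empty.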

oranagra requested a review from yossigo October 25, 2021 11:27
oranagra merged commit 665e428 into redis:unstable Oct 26, 2021
oranagra deleted the valgrind_killed_processes branch October 26, 2021 05:34
oranagra added a commit that referenced this pull request Oct 28, 2021
When stopping an instance in the cluster tests, disable appendonly first, so that SIGTERM won't be ignored.

Recently in #9679 I changed the test infra to use SIGSEGV to kill servers that refuse
the SIGTERM, rather than going straight to SIGKILL.

This surfaced an issue that I introduced in #7725, which changed SIGKILL to SIGTERM (to resolve valgrind issues).
So for the past months, servers sometimes refused the SIGTERM and waited
10 seconds for the SIGKILL; this commit resolves that (faster termination).
oranagra added a commit that referenced this pull request Oct 31, 2021
Fix failures introduced by #9695, which was an attempt to solve failures introduced by #9679.
This is an alternative to #9703 (I didn't like the extra argument to kill_instance).

Reverting #9695.
Instead of stopping AOF on all terminations, stop it only on the two that need it.
Do it as part of the test rather than the infra (it was odd that kill_instance used `R`
to communicate with the instance).

Note that the original purpose of these tests was to trigger a crash, but that upsets
valgrind, so in redis 6.2 I changed them to use SIGTERM. I am now renaming the tests
(removing "kill" and "crash").

Also add some colors to failures, and the word "FAILED" so that it's searchable.

Also solve a semi-related race condition in 14-consistency-check.tcl.