test: kill process group to avoid dangling processes when using `--failfast` #22249

S3RK · 2021-06-15T07:09:36Z

This is an alternative to #19281

This PR fixes a problem when after test failure with --failfast option there could be dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures.

If there are any dangling nodes left at the end of the test run we kill the whole process group.
Pros: the operations is immediate and won't lead to CI timeout
Cons: the test_runner process is also killed and exit code is 137

Example output:

...
Early exiting after test failure

TEST                           | STATUS    | DURATION

rpc_decodescript.py            | ✓ Passed  | 2 s
rpc_deprecated.py              | ✓ Passed  | 2 s
rpc_deriveaddresses.py         | ✓ Passed  | 2 s
rpc_dumptxoutset.py            | ✖ Failed  | 2 s

ALL                            | ✖ Failed  | 8 s (accumulated)
Runtime: 4 s

Killed: 9
> echo $?
137

maflcko

This means --failfast no longer works on Windows, but who uses that anyway 🤷‍♂️

test/functional/test_runner.py

S3RK · 2021-06-15T07:39:48Z

I can add different code paths for Win and Unix if desirable

maflcko · 2021-06-15T09:56:34Z

review ACK 451b96f

mind pushing a temporary commit with a logic error in one of the tests, so that we can observe the CI fails properly?

practicalswift · 2021-06-15T10:02:21Z

Concept ACK on addressing the annoying dangling bitcoind issue :)

DrahtBot · 2021-06-15T11:49:42Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

test: fix dangling bitcoind in functional tests #19281 (test: fix dangling bitcoind in functional tests by S3RK)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

promag

It's not clear to me what is wrong with the kill+wait approach, why does it result in dangling processes.

S3RK · 2021-06-15T19:17:02Z

It's not clear to me what is wrong with the kill+wait approach, why does it result in dangling processes.

So, we have three layers of processes, something like

test_runner.py[100] -> test_1.py[101] -> bitcoind[102]
                    -> test_2.py[103] -> bitcoind[104]

Currently test_runner.py[100] sends SIGKILL to all test_x.py which leads to its immediate termination and orphaned bitcoind processes being linked to init. E.g. test_1.py[101] got killed and bitcoind[102] is dangling. Actually, I'm not sure why do we use wait there, I don't think it changes anything.

To avoid that we can do SIGTERM and wait, but that could lead to CI timeouts. I've tried that first in #19281

Another alternative is to setup signal handler in child process (bitcoind) to listen when parent process exits, but that's feels backward.

mind pushing a temporary commit with a logic error in one of the tests, so that we can observe the CI fails properly?

Sure, will do

aitorjs · 2021-06-15T23:24:09Z

ACK 451b96f. Manual testing with and without --failfast.

Failing an assert with --failfast, passes through reviewed code. When fails, exits gracefully stopping of processes.

Log

$ test/functional/test_runner.py --failfast rpc_decodescript.py rpc_deprecated.py rpc_deriveaddresses.py rpc_dumptxoutset.py

Temporary test directory at /tmp/test_runner_₿_🏃_20210616_004909 Running Unit Tests for Test Framework Modules .......... ---------------------------------------------------------------------- Ran 10 tests in 0.715s

OK
Remaining jobs: [rpc_decodescript.py, rpc_deprecated.py, rpc_deriveaddresses.py, rpc_dumptxoutset.py]
1/4 - rpc_decodescript.py failed, Duration: 1 s

stdout:
2021-06-15T22:49:12.613000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3
2021-06-15T22:49:13.132000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/test_framework.py", line 128, in main
self.run_test()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 230, in run_test
self.decodescript_script_pub_key()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 67, in decodescript_script_pub_key
assert_equal('1 ' + public_key_hash, rpc_result['segwit']['asm'])
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/util.py", line 51, in assert_equal
raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
AssertionError: not(1 5dd1d3a048119c27b28293056724d9522f26d945 == 0 5dd1d3a048119c27b28293056724d9522f26d945)
2021-06-15T22:49:13.183000Z TestFramework (INFO): Stopping nodes
2021-06-15T22:49:13.285000Z TestFramework (WARNING): Not cleaning up dir /tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3
2021-06-15T22:49:13.285000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3/test_framework.log
2021-06-15T22:49:13.285000Z TestFramework (ERROR):
2021-06-15T22:49:13.286000Z TestFramework (ERROR): Hint: Call /home/bitcoin/blockchain/bitcoinPR/22249/test/functional/combine_logs.py '/tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3' to consolidate all logs
2021-06-15T22:49:13.286000Z TestFramework (ERROR):
2021-06-15T22:49:13.286000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
2021-06-15T22:49:13.286000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
2021-06-15T22:49:13.286000Z TestFramework (ERROR):

stderr:

Early exiting after test failure

TEST | STATUS | DURATION

rpc_decodescript.py | ✖ Failed | 1 s

ALL | ✖ Failed | 1 s (accumulated)
Runtime: 1 s

dangling processes 
Terminado (killed)

Failing an assert without --failfast, dont passes through reviewed code and all tests are run.

Log

$ test/functional/test_runner.py rpc_decodescript.py rpc_deprecated.py rpc_deriveaddresses.py rpc_dumptxoutset.py

Temporary test directory at /tmp/test_runner_₿_🏃_20210616_005950 Running Unit Tests for Test Framework Modules .......... ---------------------------------------------------------------------- Ran 10 tests in 0.680s

OK
Remaining jobs: [rpc_decodescript.py, rpc_deprecated.py, rpc_deriveaddresses.py, rpc_dumptxoutset.py]
1/4 - rpc_decodescript.py failed, Duration: 1 s

stdout:
2021-06-15T22:59:53.470000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3
2021-06-15T22:59:53.735000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/test_framework.py", line 128, in main
self.run_test()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 230, in run_test
self.decodescript_script_pub_key()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 67, in decodescript_script_pub_key
assert_equal('1 ' + public_key_hash, rpc_result['segwit']['asm'])
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/util.py", line 51, in assert_equal
raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
AssertionError: not(1 5dd1d3a048119c27b28293056724d9522f26d945 == 0 5dd1d3a048119c27b28293056724d9522f26d945)
2021-06-15T22:59:53.786000Z TestFramework (INFO): Stopping nodes
2021-06-15T22:59:53.888000Z TestFramework (WARNING): Not cleaning up dir /tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3
2021-06-15T22:59:53.888000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3/test_framework.log
2021-06-15T22:59:53.888000Z TestFramework (ERROR):
2021-06-15T22:59:53.889000Z TestFramework (ERROR): Hint: Call /home/bitcoin/blockchain/bitcoinPR/22249/test/functional/combine_logs.py '/tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3' to consolidate all logs
2021-06-15T22:59:53.889000Z TestFramework (ERROR):
2021-06-15T22:59:53.889000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
2021-06-15T22:59:53.889000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
2021-06-15T22:59:53.889000Z TestFramework (ERROR):

stderr:

Remaining jobs: [rpc_deprecated.py, rpc_deriveaddresses.py, rpc_dumptxoutset.py]
2/4 - rpc_deprecated.py passed, Duration: 1 s
Remaining jobs: [rpc_deriveaddresses.py, rpc_dumptxoutset.py]
3/4 - rpc_deriveaddresses.py passed, Duration: 2 s
Remaining jobs: [rpc_dumptxoutset.py]
4/4 - rpc_dumptxoutset.py passed, Duration: 2 s

TEST | STATUS | DURATION

ALL | ✖ Failed | 6 s (accumulated)
Runtime: 2 s

maflcko · 2021-06-17T07:02:36Z

Example failure is here and looks good: https://cirrus-ci.com/build/5678106947092480

Commit can be removed again.

S3RK · 2021-06-17T19:45:36Z

Done

…sses when using `--failfast` 451b96f test: kill process group to avoid dangling processes (S3RK) Pull request description: This is an alternative to bitcoin#19281 This PR fixes a problem when after test failure with `--failfast` option there could be dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures. If there are any dangling nodes left at the end of the test run we kill the whole process group. Pros: the operations is immediate and won't lead to CI timeout Cons: the test_runner process is also killed and exit code is 137 Example output: ``` ... Early exiting after test failure TEST | STATUS | DURATION rpc_decodescript.py | ✓ Passed | 2 s rpc_deprecated.py | ✓ Passed | 2 s rpc_deriveaddresses.py | ✓ Passed | 2 s rpc_dumptxoutset.py | ✖ Failed | 2 s ALL | ✖ Failed | 8 s (accumulated) Runtime: 4 s Killed: 9 > echo $? 137 ``` ACKs for top commit: MarcoFalke: review ACK 451b96f aitorjs: ACK 451b96f. Manual testing with and without **--failfast**. Tree-SHA512: 87e510a1411b9e7571e63cf7ffc8b9a8935daf9112ffc0f069d6c406ba87743ec439808181f7e13cb97bb200fad528589786c47f0b43cf3a2ef0d06a23cb86dd

fanquake added the Tests label Jun 15, 2021

maflcko approved these changes Jun 15, 2021

View reviewed changes

test/functional/test_runner.py Outdated Show resolved Hide resolved

test: kill process group to avoid dangling processes

451b96f

S3RK force-pushed the test_kill_process_group branch from 5012bf0 to 451b96f Compare June 15, 2021 07:38

promag reviewed Jun 15, 2021

View reviewed changes

DrahtBot mentioned this pull request Jun 15, 2021

test: fix dangling bitcoind in functional tests #19281

Closed

S3RK force-pushed the test_kill_process_group branch from 0a91c85 to 451b96f Compare June 17, 2021 19:43

maflcko merged commit 0844084 into bitcoin:master Jun 18, 2021

This was referenced Sep 10, 2021

test: don't try and use os.killpg() on Windows #22936

Closed

qa: Ignore --failfast option of test_runner.py on Windows #22980

Closed

gwillen pushed a commit to ElementsProject/elements that referenced this pull request Jun 1, 2022

Merge 0844084 into merged_master (Bitcoin PR bitcoin/bitcoin#22249)

16c93fc

bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: kill process group to avoid dangling processes when using `--failfast` #22249

test: kill process group to avoid dangling processes when using `--failfast` #22249

Uh oh!

S3RK commented Jun 15, 2021

Uh oh!

maflcko left a comment

Uh oh!

Uh oh!

S3RK commented Jun 15, 2021

Uh oh!

maflcko commented Jun 15, 2021

Uh oh!

practicalswift commented Jun 15, 2021

Uh oh!

DrahtBot commented Jun 15, 2021

Uh oh!

promag left a comment

Uh oh!

S3RK commented Jun 15, 2021

Uh oh!

aitorjs commented Jun 15, 2021

Uh oh!

maflcko commented Jun 17, 2021

Uh oh!

S3RK commented Jun 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

test: kill process group to avoid dangling processes when using --failfast #22249

test: kill process group to avoid dangling processes when using --failfast #22249

Uh oh!

Conversation

S3RK commented Jun 15, 2021

Uh oh!

maflcko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

S3RK commented Jun 15, 2021

Uh oh!

maflcko commented Jun 15, 2021

Uh oh!

practicalswift commented Jun 15, 2021

Uh oh!

DrahtBot commented Jun 15, 2021

Conflicts

Uh oh!

promag left a comment

Choose a reason for hiding this comment

Uh oh!

S3RK commented Jun 15, 2021

Uh oh!

aitorjs commented Jun 15, 2021

Uh oh!

maflcko commented Jun 17, 2021

Uh oh!

S3RK commented Jun 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

test: kill process group to avoid dangling processes when using `--failfast` #22249

test: kill process group to avoid dangling processes when using `--failfast` #22249