Skip to content

Conversation

@S3RK
Copy link
Contributor

@S3RK S3RK commented Jun 15, 2021

This is an alternative to #19281

This PR fixes a problem when after test failure with --failfast option there could be dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures.

If there are any dangling nodes left at the end of the test run we kill the whole process group.
Pros: the operations is immediate and won't lead to CI timeout
Cons: the test_runner process is also killed and exit code is 137

Example output:

...
Early exiting after test failure

TEST                           | STATUS    | DURATION

rpc_decodescript.py            | ✓ Passed  | 2 s
rpc_deprecated.py              | ✓ Passed  | 2 s
rpc_deriveaddresses.py         | ✓ Passed  | 2 s
rpc_dumptxoutset.py            | ✖ Failed  | 2 s

ALL                            | ✖ Failed  | 8 s (accumulated)
Runtime: 4 s

Killed: 9
> echo $?
137

@fanquake fanquake added the Tests label Jun 15, 2021
Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means --failfast no longer works on Windows, but who uses that anyway 🤷‍♂️

@S3RK S3RK force-pushed the test_kill_process_group branch from 5012bf0 to 451b96f Compare June 15, 2021 07:38
@S3RK
Copy link
Contributor Author

S3RK commented Jun 15, 2021

I can add different code paths for Win and Unix if desirable

@maflcko
Copy link
Member

maflcko commented Jun 15, 2021

review ACK 451b96f

mind pushing a temporary commit with a logic error in one of the tests, so that we can observe the CI fails properly?

@practicalswift
Copy link
Contributor

Concept ACK on addressing the annoying dangling bitcoind issue :)

@DrahtBot
Copy link
Contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Contributor

@promag promag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what is wrong with the kill+wait approach, why does it result in dangling processes.

@S3RK
Copy link
Contributor Author

S3RK commented Jun 15, 2021

It's not clear to me what is wrong with the kill+wait approach, why does it result in dangling processes.

So, we have three layers of processes, something like

test_runner.py[100] -> test_1.py[101] -> bitcoind[102]
                    -> test_2.py[103] -> bitcoind[104] 

Currently test_runner.py[100] sends SIGKILL to all test_x.py which leads to its immediate termination and orphaned bitcoind processes being linked to init. E.g. test_1.py[101] got killed and bitcoind[102] is dangling. Actually, I'm not sure why do we use wait there, I don't think it changes anything.

To avoid that we can do SIGTERM and wait, but that could lead to CI timeouts. I've tried that first in #19281

Another alternative is to setup signal handler in child process (bitcoind) to listen when parent process exits, but that's feels backward.

mind pushing a temporary commit with a logic error in one of the tests, so that we can observe the CI fails properly?

Sure, will do

@aitorjs
Copy link
Contributor

aitorjs commented Jun 15, 2021

ACK 451b96f. Manual testing with and without --failfast.

Failing an assert with --failfast, passes through reviewed code. When fails, exits gracefully stopping of processes.

Log

$ test/functional/test_runner.py --failfast rpc_decodescript.py rpc_deprecated.py rpc_deriveaddresses.py rpc_dumptxoutset.py

Temporary test directory at /tmp/test_runner_₿_🏃_20210616_004909 Running Unit Tests for Test Framework Modules .......... ---------------------------------------------------------------------- Ran 10 tests in 0.715s

OK
Remaining jobs: [rpc_decodescript.py, rpc_deprecated.py, rpc_deriveaddresses.py, rpc_dumptxoutset.py]
1/4 - rpc_decodescript.py failed, Duration: 1 s

stdout:
2021-06-15T22:49:12.613000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3
2021-06-15T22:49:13.132000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/test_framework.py", line 128, in main
self.run_test()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 230, in run_test
self.decodescript_script_pub_key()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 67, in decodescript_script_pub_key
assert_equal('1 ' + public_key_hash, rpc_result['segwit']['asm'])
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/util.py", line 51, in assert_equal
raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
AssertionError: not(1 5dd1d3a048119c27b28293056724d9522f26d945 == 0 5dd1d3a048119c27b28293056724d9522f26d945)
2021-06-15T22:49:13.183000Z TestFramework (INFO): Stopping nodes
2021-06-15T22:49:13.285000Z TestFramework (WARNING): Not cleaning up dir /tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3
2021-06-15T22:49:13.285000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3/test_framework.log
2021-06-15T22:49:13.285000Z TestFramework (ERROR):
2021-06-15T22:49:13.286000Z TestFramework (ERROR): Hint: Call /home/bitcoin/blockchain/bitcoinPR/22249/test/functional/combine_logs.py '/tmp/test_runner_₿_🏃_20210616_004909/rpc_decodescript_3' to consolidate all logs
2021-06-15T22:49:13.286000Z TestFramework (ERROR):
2021-06-15T22:49:13.286000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
2021-06-15T22:49:13.286000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
2021-06-15T22:49:13.286000Z TestFramework (ERROR):

stderr:

Early exiting after test failure

TEST | STATUS | DURATION

rpc_decodescript.py | ✖ Failed | 1 s

ALL | ✖ Failed | 1 s (accumulated)
Runtime: 1 s

dangling processes <!-- Print I write to know if it goes through the added part in test_runner.py --->
Terminado (killed)

Failing an assert without --failfast, dont passes through reviewed code and all tests are run.

Log
$ test/functional/test_runner.py rpc_decodescript.py rpc_deprecated.py rpc_deriveaddresses.py rpc_dumptxoutset.py

Temporary test directory at /tmp/test_runner_₿_🏃_20210616_005950 Running Unit Tests for Test Framework Modules .......... ---------------------------------------------------------------------- Ran 10 tests in 0.680s

OK
Remaining jobs: [rpc_decodescript.py, rpc_deprecated.py, rpc_deriveaddresses.py, rpc_dumptxoutset.py]
1/4 - rpc_decodescript.py failed, Duration: 1 s

stdout:
2021-06-15T22:59:53.470000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3
2021-06-15T22:59:53.735000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/test_framework.py", line 128, in main
self.run_test()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 230, in run_test
self.decodescript_script_pub_key()
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/rpc_decodescript.py", line 67, in decodescript_script_pub_key
assert_equal('1 ' + public_key_hash, rpc_result['segwit']['asm'])
File "/home/bitcoin/blockchain/bitcoinPR/22249/test/functional/test_framework/util.py", line 51, in assert_equal
raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
AssertionError: not(1 5dd1d3a048119c27b28293056724d9522f26d945 == 0 5dd1d3a048119c27b28293056724d9522f26d945)
2021-06-15T22:59:53.786000Z TestFramework (INFO): Stopping nodes
2021-06-15T22:59:53.888000Z TestFramework (WARNING): Not cleaning up dir /tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3
2021-06-15T22:59:53.888000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3/test_framework.log
2021-06-15T22:59:53.888000Z TestFramework (ERROR):
2021-06-15T22:59:53.889000Z TestFramework (ERROR): Hint: Call /home/bitcoin/blockchain/bitcoinPR/22249/test/functional/combine_logs.py '/tmp/test_runner_₿_🏃_20210616_005950/rpc_decodescript_3' to consolidate all logs
2021-06-15T22:59:53.889000Z TestFramework (ERROR):
2021-06-15T22:59:53.889000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
2021-06-15T22:59:53.889000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
2021-06-15T22:59:53.889000Z TestFramework (ERROR):

stderr:

Remaining jobs: [rpc_deprecated.py, rpc_deriveaddresses.py, rpc_dumptxoutset.py]
2/4 - rpc_deprecated.py passed, Duration: 1 s
Remaining jobs: [rpc_deriveaddresses.py, rpc_dumptxoutset.py]
3/4 - rpc_deriveaddresses.py passed, Duration: 2 s
Remaining jobs: [rpc_dumptxoutset.py]
4/4 - rpc_dumptxoutset.py passed, Duration: 2 s

TEST | STATUS | DURATION

rpc_deprecated.py | ✓ Passed | 1 s
rpc_deriveaddresses.py | ✓ Passed | 2 s
rpc_dumptxoutset.py | ✓ Passed | 2 s
rpc_decodescript.py | ✖ Failed | 1 s

ALL | ✖ Failed | 6 s (accumulated)
Runtime: 2 s

@maflcko
Copy link
Member

maflcko commented Jun 17, 2021

Example failure is here and looks good: https://cirrus-ci.com/build/5678106947092480

Commit can be removed again.

@S3RK S3RK force-pushed the test_kill_process_group branch from 0a91c85 to 451b96f Compare June 17, 2021 19:43
@S3RK
Copy link
Contributor Author

S3RK commented Jun 17, 2021

Done

@maflcko maflcko merged commit 0844084 into bitcoin:master Jun 18, 2021
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jun 18, 2021
…sses when using `--failfast`

451b96f test: kill process group to avoid dangling processes (S3RK)

Pull request description:

  This is an alternative to bitcoin#19281

  This PR fixes a problem when after test failure with `--failfast` option there could be dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures.

  If there are any dangling nodes left at the end of the test run we kill the whole process group.
  Pros: the operations is immediate and won't lead to CI timeout
  Cons: the test_runner process is also killed and exit code is 137

  Example output:
  ```
  ...
  Early exiting after test failure

  TEST                           | STATUS    | DURATION

  rpc_decodescript.py            | ✓ Passed  | 2 s
  rpc_deprecated.py              | ✓ Passed  | 2 s
  rpc_deriveaddresses.py         | ✓ Passed  | 2 s
  rpc_dumptxoutset.py            | ✖ Failed  | 2 s

  ALL                            | ✖ Failed  | 8 s (accumulated)
  Runtime: 4 s

  Killed: 9
  > echo $?
  137
  ```

ACKs for top commit:
  MarcoFalke:
    review ACK 451b96f
  aitorjs:
    ACK 451b96f. Manual testing with and without **--failfast**.

Tree-SHA512: 87e510a1411b9e7571e63cf7ffc8b9a8935daf9112ffc0f069d6c406ba87743ec439808181f7e13cb97bb200fad528589786c47f0b43cf3a2ef0d06a23cb86dd
gwillen pushed a commit to ElementsProject/elements that referenced this pull request Jun 1, 2022
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants