Skip to content

Conversation

@S3RK
Copy link
Contributor

@S3RK S3RK commented Jun 15, 2020

This PR fixes a problem when after failure of a test with --failfast option there are dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures.

The fix: Introduce new test status — cancelled and register a signal handler that will raise TestCancelled exception to properly handle termination. Test runner now sends SIGTERM instead of SIGKILL.

@fanquake fanquake added the Tests label Jun 15, 2020
Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. A couple nits below, but no need to change anything until other reviewers agree with this.

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. A couple nits below, but no need to change anything until other reviewers agree with this.

@S3RK S3RK changed the title test: gracefully terminate parallel tests test: fix dangling bitcoind in functional tests Jun 23, 2020
laanwj added a commit that referenced this pull request Jul 1, 2020
…acOS

3a7e794 test: retry when write to a socket fails on macOS (Ivan Metlushko)
8cf9d15 test: use pgrep for better compatibility (Ivan Metlushko)

Pull request description:

  Rationale: a few minor changes to make experience of running tests on macOS a bit better
  1.`pidof` is not available on BSD/macOS, while `pgrep` is present on BSD, Linux and macOS
  2. Add retry as a workaround for a weird behavior when writing to a socket (https://bugs.python.org/issue33450). Stacktrace attached

  Man pages:
  https://www.freebsd.org/cgi/man.cgi?query=pgrep&apropos=0&sektion=1&manpath=FreeBSD+6.0-RELEASE&arch=default&format=html
  https://man7.org/linux/man-pages/man1/pgrep.1.html

  Related to #19281

  Stacktrace example:
  ```
  ...
  33/161 - feature_abortnode.py failed, Duration: 63 s

  stdout:
  2020-06-11T10:46:43.947000Z TestFramework (INFO): Initializing test directory /var/folders/2q/d5w9zh614r7g5c8r74ln3g400000gq/T/test_runner_₿_🏃_20200611_174102/feature_abortnode_128
  2020-06-11T10:46:45.199000Z TestFramework (INFO): Waiting for crash
  2020-06-11T10:47:15.921000Z TestFramework (INFO): Node crashed - now verifying restart fails
  2020-06-11T10:47:47.068000Z TestFramework (INFO): Stopping nodes
  [node 1] Cleaning up leftover process

  stderr:
  Traceback (most recent call last):
    File "/Users/xxx/Projects/bitcoin/test/functional/feature_abortnode.py", line 50, in <module>
      AbortNodeTest().main()
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/test_framework.py", line 142, in main
      exit_code = self.shutdown()
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/test_framework.py", line 266, in shutdown
      self.stop_nodes()
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/test_framework.py", line 515, in stop_nodes
      node.stop_node(wait=wait)
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/test_node.py", line 318, in stop_node
      self.stop(wait=wait)
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
      return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/authproxy.py", line 142, in __call__
      response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
    File "/Users/xxx/Projects/bitcoin/test/functional/test_framework/authproxy.py", line 107, in _request
      self.__conn.request(method, path, postdata, headers)
    File "/Users/xxx/.pyenv/versions/3.5.6/lib/python3.5/http/client.py", line 1107, in request
      self._send_request(method, url, body, headers)
    File "/Users/xxx/.pyenv/versions/3.5.6/lib/python3.5/http/client.py", line 1152, in _send_request
      self.endheaders(body)
    File "/Users/xxx/.pyenv/versions/3.5.6/lib/python3.5/http/client.py", line 1103, in endheaders
      self._send_output(message_body)
    File "/Users/xxx/.pyenv/versions/3.5.6/lib/python3.5/http/client.py", line 936, in _send_output
      self.send(message_body)
    File "/Users/xxx/.pyenv/versions/3.5.6/lib/python3.5/http/client.py", line 908, in send
      self.sock.sendall(data)
  OSError: [Errno 41] Protocol wrong type for socket
  ```

ACKs for top commit:
  laanwj:
    ACK 3a7e794

Tree-SHA512: fefbe40ce94ab29f18bbbed2a434194b1384ffa5279b1d04db7a3708e3dd422bd9e450f1db3f95a1a851fac5a626ab533c6ebcfd7ede96f8ccae9e6f3e9fff92
Test runner sends SIGTERM instead of SIGKILL to a running test
to allow proper shutdown of nodes.
@S3RK S3RK force-pushed the test_terminate_gracefully branch from 2709209 to bb44fd9 Compare July 6, 2020 11:34
@S3RK
Copy link
Contributor Author

S3RK commented Jul 6, 2020

Fixed nits

@laanwj
Copy link
Member

laanwj commented Jul 15, 2020

Concept ACK

@S3RK S3RK requested a review from jonatack July 23, 2020 01:19
@NelsonGaldeman
Copy link
Contributor

ACK bb44fd9

  • I've ran it against 2059d32 with --failfast and forcing tests to fail. It properly triggered SIGTERM and processes were stopped gracefully as expected.
  • I've also tested sending an external SIGTERM to a running test and it was properly handled showing Test cancelled message.

Tested on OSX 11.1

@dunxen
Copy link
Contributor

dunxen commented Jun 4, 2021

ACK bb44fd9 (rebased on 3ac5209).

  • Forced fail in one test and ran with --failfast. SIGTERM triggered and also saw graceful stopping of processes. No dangling bitcoind processes seen.
  • Can also confirm that sending a SIGTERM to one of the running tests also triggers a fast fail with graceful process termination.
Screenshot with details of external SIGTERM

Screenshot 2021-06-04 at 15 14 20

@maflcko
Copy link
Member

maflcko commented Jun 4, 2021

I think --failfast is only supported by CI, which will throw away the whole VM, so dangling process is not an issue. However, timeout instead of bug might be more of an issue

@S3RK
Copy link
Contributor Author

S3RK commented Jun 8, 2021

@MarcoFalke would you concept ack this if I add a timeout to proc.wait()? In this case we don't need to restrict --failfast to CI only, and in case a process is stuck CI will not timeout

@maflcko
Copy link
Member

maflcko commented Jun 14, 2021

There shouldn't be any timeout. The whole process tree should be killed immediately. If there was a simple way in python to do that, I'd prefer that.

@DrahtBot
Copy link
Contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

maflcko pushed a commit that referenced this pull request Jun 18, 2021
…en using `--failfast`

451b96f test: kill process group to avoid dangling processes (S3RK)

Pull request description:

  This is an alternative to #19281

  This PR fixes a problem when after test failure with `--failfast` option there could be dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures.

  If there are any dangling nodes left at the end of the test run we kill the whole process group.
  Pros: the operations is immediate and won't lead to CI timeout
  Cons: the test_runner process is also killed and exit code is 137

  Example output:
  ```
  ...
  Early exiting after test failure

  TEST                           | STATUS    | DURATION

  rpc_decodescript.py            | ✓ Passed  | 2 s
  rpc_deprecated.py              | ✓ Passed  | 2 s
  rpc_deriveaddresses.py         | ✓ Passed  | 2 s
  rpc_dumptxoutset.py            | ✖ Failed  | 2 s

  ALL                            | ✖ Failed  | 8 s (accumulated)
  Runtime: 4 s

  Killed: 9
  > echo $?
  137
  ```

ACKs for top commit:
  MarcoFalke:
    review ACK 451b96f
  aitorjs:
    ACK 451b96f. Manual testing with and without **--failfast**.

Tree-SHA512: 87e510a1411b9e7571e63cf7ffc8b9a8935daf9112ffc0f069d6c406ba87743ec439808181f7e13cb97bb200fad528589786c47f0b43cf3a2ef0d06a23cb86dd
@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".

@maflcko maflcko closed this Jun 18, 2021
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jun 18, 2021
…sses when using `--failfast`

451b96f test: kill process group to avoid dangling processes (S3RK)

Pull request description:

  This is an alternative to bitcoin#19281

  This PR fixes a problem when after test failure with `--failfast` option there could be dangling nodes. The nodes will continue to occupy rpc/p2p ports on the machine and will cause further test failures.

  If there are any dangling nodes left at the end of the test run we kill the whole process group.
  Pros: the operations is immediate and won't lead to CI timeout
  Cons: the test_runner process is also killed and exit code is 137

  Example output:
  ```
  ...
  Early exiting after test failure

  TEST                           | STATUS    | DURATION

  rpc_decodescript.py            | ✓ Passed  | 2 s
  rpc_deprecated.py              | ✓ Passed  | 2 s
  rpc_deriveaddresses.py         | ✓ Passed  | 2 s
  rpc_dumptxoutset.py            | ✖ Failed  | 2 s

  ALL                            | ✖ Failed  | 8 s (accumulated)
  Runtime: 4 s

  Killed: 9
  > echo $?
  137
  ```

ACKs for top commit:
  MarcoFalke:
    review ACK 451b96f
  aitorjs:
    ACK 451b96f. Manual testing with and without **--failfast**.

Tree-SHA512: 87e510a1411b9e7571e63cf7ffc8b9a8935daf9112ffc0f069d6c406ba87743ec439808181f7e13cb97bb200fad528589786c47f0b43cf3a2ef0d06a23cb86dd
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants