Skip to content

Conversation

@sdaftuar
Copy link
Member

@sdaftuar sdaftuar commented Apr 6, 2018

When a node is restarted during a test after an unclean shutdown (such
as with -dbcrashratio), it's possible an old cookie file was left
behind. This can cause a race condition when restarting the node, where
the test framework might try to connect using credentials from the
old cookie file, just as the node will generate new credentials and
overwrite the old file.

Delete any such cookie file if present prior to startup.

@fanquake fanquake added the Tests label Apr 6, 2018
@sdaftuar
Copy link
Member Author

sdaftuar commented Apr 6, 2018

I triggered this issue while running the tests on the 0.16 branch:

�[1mfeature_dbcrash.py�[0m failed, Duration: 1185 s

�[1mstdout:
�[0m2018-04-05 17:05:30.803000 TestFramework (INFO): Initializing test directory /tmp/bitcoin_test_runner_20180405_130431/feature_dbcrash_170
2018-04-05 17:06:08.451000 TestFramework (INFO): Prepped 5001 utxo entries
2018-04-05 17:06:59.908000 TestFramework (INFO): Iteration 0, generating 2500 transactions [0, 0, 0]
2018-04-05 17:07:17.195000 TestFramework (INFO): Iteration 1, generating 2500 transactions [0, 0, 0]
2018-04-05 17:07:33.347000 TestFramework (INFO): Iteration 2, generating 2500 transactions [0, 0, 0]
2018-04-05 17:07:49.595000 TestFramework (INFO): Iteration 3, generating 2500 transactions [0, 0, 0]
2018-04-05 17:08:09.162000 TestFramework (INFO): Iteration 4, generating 2500 transactions [0, 0, 0]
2018-04-05 17:08:26.442000 TestFramework (INFO): Iteration 5, generating 2500 transactions [0, 0, 0]
2018-04-05 17:08:47.728000 TestFramework (INFO): Iteration 6, generating 2500 transactions [0, 0, 0]
2018-04-05 17:09:13.273000 TestFramework (INFO): Iteration 7, generating 2500 transactions [1, 0, 0]
2018-04-05 17:09:34.107000 TestFramework (INFO): Iteration 8, generating 2500 transactions [2, 0, 0]
2018-04-05 17:10:23.227000 TestFramework (INFO): Iteration 9, generating 2500 transactions [3, 1, 0]
2018-04-05 17:10:48.601000 TestFramework (INFO): Iteration 10, generating 2500 transactions [3, 1, 0]
2018-04-05 17:11:14.825000 TestFramework (INFO): Iteration 11, generating 2500 transactions [4, 1, 0]
2018-04-05 17:11:41.301000 TestFramework (INFO): Iteration 12, generating 2500 transactions [4, 2, 0]
2018-04-05 17:12:04.667000 TestFramework (INFO): Iteration 13, generating 2500 transactions [5, 2, 0]
2018-04-05 17:12:29.121000 TestFramework (INFO): Iteration 14, generating 2500 transactions [5, 2, 0]
2018-04-05 17:12:56.188000 TestFramework (INFO): Iteration 15, generating 2500 transactions [5, 3, 0]
2018-04-05 17:13:24.200000 TestFramework (INFO): Iteration 16, generating 2500 transactions [6, 3, 0]
2018-04-05 17:14:22.805000 TestFramework (INFO): Iteration 17, generating 2500 transactions [7, 4, 0]
2018-04-05 17:15:35.174000 TestFramework (INFO): Iteration 18, generating 2500 transactions [8, 4, 1]
2018-04-05 17:16:01.286000 TestFramework (INFO): Iteration 19, generating 2500 transactions [8, 4, 1]
2018-04-05 17:16:30.793000 TestFramework (INFO): Iteration 20, generating 2500 transactions [8, 4, 1]
2018-04-05 17:16:56.134000 TestFramework (INFO): Iteration 21, generating 2500 transactions [8, 4, 1]
2018-04-05 17:17:24.077000 TestFramework (INFO): Iteration 22, generating 2500 transactions [8, 4, 1]
2018-04-05 17:18:05.554000 TestFramework (INFO): Iteration 23, generating 2500 transactions [9, 5, 2]
2018-04-05 17:18:39.609000 TestFramework (INFO): Iteration 24, generating 2500 transactions [10, 5, 2]
2018-04-05 17:19:25.984000 TestFramework (INFO): Iteration 25, generating 2500 transactions [10, 5, 2]
2018-04-05 17:20:07.936000 TestFramework (INFO): Iteration 26, generating 2500 transactions [11, 5, 2]
2018-04-05 17:20:47.133000 TestFramework (INFO): Iteration 27, generating 2500 transactions [11, 5, 2]
2018-04-05 17:21:31.789000 TestFramework (INFO): Iteration 28, generating 2500 transactions [11, 5, 3]
2018-04-05 17:22:11.479000 TestFramework (INFO): Iteration 29, generating 2500 transactions [11, 5, 3]
2018-04-05 17:22:55.978000 TestFramework (INFO): Iteration 30, generating 2500 transactions [12, 5, 3]
2018-04-05 17:23:42.142000 TestFramework (INFO): Iteration 31, generating 2500 transactions [12, 5, 3]
2018-04-05 17:25:15.294000 TestFramework (ERROR): Unexpected exception caught during testing
Traceback (most recent call last):
  File "/home/prod/jenkins/bitcoin/test/functional/feature_dbcrash.py", line 81, in restart_node
    self.start_node(node_index)
  File "/home/prod/jenkins/bitcoin/test/functional/test_framework/test_framework.py", line 229, in start_node
    node.wait_for_rpc_connection()
  File "/home/prod/jenkins/bitcoin/test/functional/test_framework/test_node.py", line 104, in wait_for_rpc_connection
    self.rpc.getblockcount()
  File "/home/prod/jenkins/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/prod/jenkins/bitcoin/test/functional/test_framework/authproxy.py", line 136, in __call__
    response = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/prod/jenkins/bitcoin/test/functional/test_framework/authproxy.py", line 107, in _request
    return self._get_response()
  File "/home/prod/jenkins/bitcoin/test/functional/test_framework/authproxy.py", line 168, in _get_response
    'code': -342, 'message': 'non-JSON HTTP response with \'%i %s\' from server' % (http_response.status, http_response.reason)})
test_framework.authproxy.JSONRPCException: non-JSON HTTP response with '401 Unauthorized' from server (-342)

But this is also straightforward to reproduce on master; I was able to do so by adding a small sleep in wait_for_rpc_connection in test_node.py:

diff --git a/test/functional/test_framework/test_node.py b/test/functional/test_framework/test_node.py
index 291ac3ee46..44627cba7c 100755
--- a/test/functional/test_framework/test_node.py
+++ b/test/functional/test_framework/test_node.py
@@ -111,6 +111,7 @@ class TestNode():
                 raise FailedToStartError('bitcoind exited with status {} during initialization'.format(self.process.returncode))
             try:
                 self.rpc = get_rpc_proxy(rpc_url(self.datadir, self.index, self.rpchost), self.index, timeout=self.rpc_timeout, coveragedir=self.coverage_dir)
+                time.sleep(0.5)
                 self.rpc.getblockcount()
                 # If the call to getblockcount() succeeds then the RPC connection is up
                 self.rpc_connected = True

@maflcko
Copy link
Member

maflcko commented Apr 6, 2018

Considering that we usually clean up the datadir when the test was successful, wouldn't it make more sense to delete the cookie when it was left behind due to a crash? That would also limit the changes to feature_dbcrash, the only place where the fix would be needed, it seems... Just a thought, no strong opinion.

Concept ACK

@sdaftuar
Copy link
Member Author

sdaftuar commented Apr 6, 2018

@MarcoFalke I did consider that as well, also I considered changing the way we use the test framework so that if a cookie file is already present we instruct bitcoind to use it... None of these options seems all that great to me as this is pretty messy, but I figure that changing wait_for_rpc_connection is more robust (ie if other tests end up getting written that need something like this), versus special-casing cookie file deletion in a specific test (as that knowledge would likely be forgotten when a new test that needs it gets written).

Edit: I should add -- normally the cookie file is deleted by bitcoind when it shuts down cleanly, not the test framework (I think).

Copy link
Contributor

@jnewbery jnewbery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. If there really is an authorization failure, then wait_for_rpc_connection() will eventually timeout anyway.

# using an old cookie file which is overwritten when the
# node starts up, so treat this the same as "No RPC
# credentials" and do nothing.
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps log, since this is an unexpected condition?

@laanwj
Copy link
Member

laanwj commented Apr 9, 2018

Considering that we usually clean up the datadir when the test was successful, wouldn't it make more sense to delete the cookie when it was left behind due to a crash?

That was my initial response to this as well.
If the test framework makes sure that no cookie file is left behind, the race will not trigger in the first place, and we don't have to special-case it.

Edit: I should add -- normally the cookie file is deleted by bitcoind when it shuts down cleanly, not the test framework (I think).

Correct.
This is done here: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/server.cpp#L323

If a cookie file exists in a datadir prior to node startup, it must have
been leftover from a prior unclean shutdown. As bitcoind will overwrite
it anyway, delete it before starting up to prevent the test framework
from inadvertently trying to connect using stale credentials.
@sdaftuar sdaftuar force-pushed the 2018-04-improve-dbcrash-restarts branch from 02b3c4a to 75d0e4c Compare April 9, 2018 18:36
@sdaftuar
Copy link
Member Author

sdaftuar commented Apr 9, 2018

@laanwj @MarcoFalke I reworked this PR to delete cookie files, if present, as suggested. I avoided putting this logic directly in feature_dbcrash.py, since the authentication logic lives in the test framework (and the test is unaware of cookie files or RPC authentication in the first place), but let me know if you think I should move it.

Original version of this PR is here for comparison: sdaftuar@02b3c4a

Also updated OP to account for new behavior.

@maflcko
Copy link
Member

maflcko commented Apr 9, 2018

utACK 75d0e4c

@jnewbery
Copy link
Contributor

jnewbery commented Apr 9, 2018

Tested ACK 75d0e4c.

I'd be tempted to put the new functionality directly into TestNode.start(), since that's the only place it gets called, but this is also fine.

@sdaftuar
Copy link
Member Author

sdaftuar commented Apr 9, 2018

I'd be tempted to put the new functionality directly into TestNode.start(), since that's the only place it gets called, but this is also fine.

No strong opinion, but it looked like util.py is the only place in the test framework that currently knows where cookie files live exactly, so I figured might as well continue to restrict that knowledge to one area of the codebase.

@maflcko maflcko merged commit 75d0e4c into bitcoin:master Apr 9, 2018
maflcko pushed a commit that referenced this pull request Apr 9, 2018
75d0e4c [qa] Delete cookie file before starting node (Suhas Daftuar)

Pull request description:

  When a node is restarted during a test after an unclean shutdown (such
  as with -dbcrashratio), it's possible an old cookie file was left
  behind. This can cause a race condition when restarting the node, where
  the test framework might try to connect using credentials from the
  old cookie file, just as the node will generate new credentials and
  overwrite the old file.

  Delete any such cookie file if present prior to startup.

Tree-SHA512: ae1e8bf8fd20e07c32b0715025693bb28b0e3dd34f328cae4346abf579b0c97b5db1c02782e1c46b7a3b6058d268b6d46b668e847658a6eed0be857ffb0d65dc
@laanwj
Copy link
Member

laanwj commented Apr 10, 2018

post-utACK 75d0e4c

@sdaftuar sdaftuar mentioned this pull request Apr 16, 2018
maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Apr 20, 2018
If a cookie file exists in a datadir prior to node startup, it must have
been leftover from a prior unclean shutdown. As bitcoind will overwrite
it anyway, delete it before starting up to prevent the test framework
from inadvertently trying to connect using stale credentials.

Github-Pull: bitcoin#12902
Rebased-From: 75d0e4c
HashUnlimited pushed a commit to chaincoin/chaincoin that referenced this pull request May 13, 2018
If a cookie file exists in a datadir prior to node startup, it must have
been leftover from a prior unclean shutdown. As bitcoind will overwrite
it anyway, delete it before starting up to prevent the test framework
from inadvertently trying to connect using stale credentials.

Github-Pull: bitcoin#12902
Rebased-From: 75d0e4c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Mar 24, 2020
…node

75d0e4c [qa] Delete cookie file before starting node (Suhas Daftuar)

Pull request description:

  When a node is restarted during a test after an unclean shutdown (such
  as with -dbcrashratio), it's possible an old cookie file was left
  behind. This can cause a race condition when restarting the node, where
  the test framework might try to connect using credentials from the
  old cookie file, just as the node will generate new credentials and
  overwrite the old file.

  Delete any such cookie file if present prior to startup.

Tree-SHA512: ae1e8bf8fd20e07c32b0715025693bb28b0e3dd34f328cae4346abf579b0c97b5db1c02782e1c46b7a3b6058d268b6d46b668e847658a6eed0be857ffb0d65dc
gades pushed a commit to cosanta/cosanta-core that referenced this pull request Jun 30, 2021
…node

75d0e4c [qa] Delete cookie file before starting node (Suhas Daftuar)

Pull request description:

  When a node is restarted during a test after an unclean shutdown (such
  as with -dbcrashratio), it's possible an old cookie file was left
  behind. This can cause a race condition when restarting the node, where
  the test framework might try to connect using credentials from the
  old cookie file, just as the node will generate new credentials and
  overwrite the old file.

  Delete any such cookie file if present prior to startup.

Tree-SHA512: ae1e8bf8fd20e07c32b0715025693bb28b0e3dd34f328cae4346abf579b0c97b5db1c02782e1c46b7a3b6058d268b6d46b668e847658a6eed0be857ffb0d65dc
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants