Commit 01f7ddc
Temporarily mitigate intermittent test failures
On a fresh BOSH Director upgrade, when recreating an existing ZooKeeper
deployment, we have been seeing the following failure approximately 80%
of the time (8 failures out of 10 runs):
```
Task 5 | 19:21:39 | Preparing deployment: Preparing deployment (00:02:15)
L Error: zookeeper/10830558-3899-4127-bb4c-1b9766638ab4: Timed out sending 'get_state' to instance: 'zookeeper/10830558-3899-4127-bb4c-1b9766638ab4', agent-id: 'a872813b-1504-49c5-8dfe-dd8e2c70ad97' after 45 seconds
```
We have mitigated the failure by inserting a 3-minute pause between the
task which re-deploys the BOSH Director and the task which re-deploys
the ZooKeeper deployment. We speculate that this delay fixes the failure
by giving the BOSH Agents on the ZooKeeper VMs enough time to establish
their NATS connection with the new BOSH Director.
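As a rough illustration of the mitigation, here is a minimal Python sketch of running two tasks with a fixed pause in between. The function name `run_with_pause`, its parameters, and the task callables are hypothetical and are not taken from the pipeline file changed in this commit; only the 3-minute (180-second) pause comes from the description above.

```python
import time


def run_with_pause(first_task, second_task, pause_seconds=180, sleep=time.sleep):
    """Run first_task, wait pause_seconds, then run second_task.

    Hypothetical helper: first_task stands in for the Director re-deploy and
    second_task for the ZooKeeper re-deploy. The sleep function is injectable
    so the pause can be replaced by a fake in tests.
    """
    first_result = first_task()
    # Give the agents time to reconnect before the second task starts.
    sleep(pause_seconds)
    second_result = second_task()
    return first_result, second_result
```

Passing the sleep function as a parameter keeps the ordering testable without actually waiting three minutes.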
The upstream cause of the "Timed out sending 'get_state'" failure on the
Director can be traced back to the BOSH Agents on the ZooKeeper VMs,
which emit the following error before restarting:
```
App run Running agent: Sending Heartbeat: nats: connection closed
```
We feel this is a timing issue, and note the following:
- The BOSH Director will fail a deployment if it takes longer than 45
seconds to 'get_state' from an instance (via NATS)
- Re-deploying a BOSH Director will close all previous NATS connections
- It takes 18 seconds after the BOSH Agent restarts to re-establish the
NATS connection to the Director and send the first heartbeat
- The upgraded BOSH Director begins its re-deploy of ZooKeeper 15
seconds after it comes up
- BOSH Agent sends a successful heartbeat every 30 seconds
- BOSH Agent can wait 99 seconds or longer for a heartbeat to be
successfully delivered. While it's waiting, it does not send
additional heartbeats. This occurs, for example, when a BOSH Director
is being re-deployed and the NATS server is unavailable

1 parent bc99b31 commit 01f7ddc
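The timing notes in the list above can be lined up in a small Python sketch. The constants are the figures quoted in the notes; everything else is an assumption made for illustration: time zero is the moment the new Director comes up, the agent restarts immediately once its heartbeat wait ends, and the pause simply adds to the observed 15-second gap. The function names are hypothetical.

```python
GET_STATE_TIMEOUT_S = 45    # Director fails the deployment after this long
DEPLOY_START_DELAY_S = 15   # Director starts the ZooKeeper re-deploy this long after coming up
AGENT_RECONNECT_S = 18      # agent restart -> NATS connection and first heartbeat
HEARTBEAT_WAIT_S = 99       # observed wait for a heartbeat to be delivered (can be longer)


def agent_ready_at(heartbeat_wait_started_s, heartbeat_wait_s=HEARTBEAT_WAIT_S):
    """Time (seconds after the Director comes up) at which the agent is reachable.

    Assumes the agent restarts as soon as its heartbeat wait ends.
    """
    return heartbeat_wait_started_s + heartbeat_wait_s + AGENT_RECONNECT_S


def get_state_deadline(pause_s=0):
    """Latest time at which the agent may answer 'get_state'.

    Assumes the pause is added on top of the observed 15-second gap.
    """
    return pause_s + DEPLOY_START_DELAY_S + GET_STATE_TIMEOUT_S


def deploy_succeeds(heartbeat_wait_started_s, pause_s=0):
    """True if the agent is reachable before the 'get_state' timeout expires."""
    return agent_ready_at(heartbeat_wait_started_s) <= get_state_deadline(pause_s)
```

In the hypothetical case where an agent begins a 99-second heartbeat wait just as the Director comes up, it is reachable at 117 seconds. That is past the 60-second deadline without a pause, but well inside the 240-second deadline with the 180-second pause.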
1 file changed (+11, -1): 10 lines added at new lines 615-624; original line 1114 replaced at new line 1124.