Commit 01f7ddc

Temporarily mitigate intermittent test failures
On a fresh BOSH Director upgrade, when recreating an existing ZooKeeper deployment, we have been seeing the following failure approximately 80% of the time (8 failures out of 10 runs):

```
Task 5 | 19:21:39 | Preparing deployment: Preparing deployment (00:02:15)
                   L Error: zookeeper/10830558-3899-4127-bb4c-1b9766638ab4: Timed out sending 'get_state' to instance: 'zookeeper/10830558-3899-4127-bb4c-1b9766638ab4', agent-id: 'a872813b-1504-49c5-8dfe-dd8e2c70ad97' after 45 seconds
```

We have mitigated the failure by inserting a 3-minute pause between the task which re-deploys the BOSH Director and the task which re-deploys the ZooKeeper deployment. We speculate that this delay fixes the failure by giving the BOSH Agents on the ZooKeeper VMs enough time to establish their NATS connection with the new BOSH Director.

The "Timed out sending 'get_state'" failure on the Director can be traced back to the BOSH Agents on the ZooKeeper VMs, which emit the following error before restarting:

```
App run Running agent: Sending Heartbeat: nats: connection closed
```

We feel this is a timing issue, and note the following:

- The BOSH Director will fail a deployment if it takes longer than 45 seconds to 'get_state' from an instance (via NATS)
- Re-deploying a BOSH Director will close all previous NATS connections
- It takes 18 seconds after the BOSH Agent restarts to re-establish the NATS connection to the Director and send the first heartbeat
- The upgraded BOSH Director begins its re-deploy of ZooKeeper 15 seconds after it comes up
- The BOSH Agent sends a successful heartbeat every 30 seconds
- The BOSH Agent can wait as long as 99 seconds (or longer) for a heartbeat to be successfully delivered. While it is waiting, it does not send additional heartbeats. This occurs, for example, when a BOSH Director is being re-deployed and the NATS server is unavailable
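
The figures above can be combined into a rough timeline. The sketch below is illustrative arithmetic only, not a measurement. It assumes that the agent's stalled heartbeat begins at about the moment the new Director comes up, that the stall and the reconnect happen one after the other, and that the pause simply delays the start of the ZooKeeper re-deploy. The variable names are made up for this example.

```shell
#!/bin/sh
# Timing figures quoted in the commit message, in seconds.
GET_STATE_TIMEOUT=45        # Director fails the deploy if 'get_state' takes longer
REDEPLOY_START=15           # Director begins the ZooKeeper re-deploy after coming up
HEARTBEAT_STALL=99          # an agent can block this long (or longer) on one heartbeat
RECONNECT_AFTER_RESTART=18  # agent restart until NATS is reconnected
PAUSE=180                   # pause added by this commit

# Assumption: the stall starts when the Director comes up, then the agent
# restarts and reconnects. Since the stall can exceed 99s, this is a lower bound.
agent_ready=$((HEARTBEAT_STALL + RECONNECT_AFTER_RESTART))

# Latest moment 'get_state' can still succeed, without and with the pause.
director_deadline=$((REDEPLOY_START + GET_STATE_TIMEOUT))
deadline_with_pause=$((PAUSE + GET_STATE_TIMEOUT))

echo "agent reachable again after ~${agent_ready}s"
echo "get_state deadline without pause: ${director_deadline}s"
echo "get_state deadline with pause:    ${deadline_with_pause}s"
```

Under these assumptions the agent may not be reachable until roughly 117 seconds, well past the 60-second deadline without the pause and comfortably inside the 225-second deadline with it. This is consistent with the speculation above, but does not prove it.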
1 parent bc99b31 commit 01f7ddc

1 file changed: 11 additions, 1 deletion

ci/pipeline.yml

@@ -612,6 +612,16 @@ jobs:
       AWS_SSH_PRIVATE_KEY: ((aws_ssh_private_key))
       DEPLOY_ARGS: |
         -o bosh-deployment/external-ip-not-recommended.yml
+  - task: sleep-180-seconds
+    image: main-ruby-go-image
+    config:
+      platform: linux
+      run:
+        path: /bin/sh
+        args:
+        - -exc
+        - |
+          sleep 180
   - task: recreate-zookeeper
     image: main-ruby-go-image
     file: bosh-src/ci/tasks/deploy-zookeeper.yml
@@ -1111,7 +1121,7 @@ resources:
     tag: ((branch_name))
     username: ((docker.username))
     password: ((docker.password))
-
+
 - name: main-postgres-10-image
   type: docker-image
   source:
