Add startup resilience for bosh create-env race conditions#2689
Merged
When the director starts, it now waits for the database to become available instead of failing immediately. This fixes race conditions during bosh create-env when postgres is co-located and the role/database hasn't been created yet.
- Add wait_for_db_connection method that retries on DatabaseConnectionError and DatabaseError with informative logging
- Add director.db.connection_wait_timeout property (default: 30 seconds)
- Add unit tests for the retry behavior
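The waiting behavior described in this commit can be sketched roughly as follows. This is not the code from the PR: the two error classes are local stand-ins, and the `interval` parameter and the log wording are assumptions made for illustration. Only the method name and the 30-second default come from the commit message.

```ruby
require 'logger'

# Hypothetical stand-ins for the error classes named in the commit message;
# the real classes come from the database library and are not reproduced here.
class DatabaseConnectionError < StandardError; end
class DatabaseError < StandardError; end

# Runs the block until it succeeds or `timeout` seconds have elapsed.
# `interval` is an assumed polling delay, not a value taken from the PR.
def wait_for_db_connection(timeout: 30, interval: 1, logger: Logger.new($stdout))
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  attempt = 0
  begin
    attempt += 1
    yield
  rescue DatabaseConnectionError, DatabaseError => e
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise if remaining <= 0 # out of time: surface the last error
    logger.info("Database not ready (attempt #{attempt}): #{e.message}; retrying")
    sleep([interval, remaining].min)
    retry
  end
end
```

A caller would wrap whatever opens the connection in the block, so that any error other than the two listed still fails immediately.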
During bosh create-env, the health monitor may start before nats_sync has propagated the HM credentials to NATS. This causes NATS connection failures (AuthError/ConnectError) that result in the health monitor crashing and being restarted by BPM/Monit. This change adds retry logic using Bosh::Common.retryable to wait for NATS to become available, similar to the database connection retry logic added for the director.
Changes:
- Add bosh-common dependency to bosh-monitor gem
- Add wait_for_nats_connection method with configurable timeout
- Add hm.nats.connection_wait_timeout property (default: 60s)
- Update health_monitor package to include bosh-common gem
- Add unit tests for NATS connection retry behavior
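As a rough illustration of the attempt-based retry this commit describes, the sketch below uses a simplified local helper named `retryable`. It is a stand-in written for this example and does not reproduce the Bosh::Common implementation or its options; the error classes and the `pause` parameter are likewise assumptions.

```ruby
# Hypothetical stand-ins for the NATS client errors named in the commit message.
class AuthError < StandardError; end
class ConnectError < StandardError; end

# Simplified stand-in for a retryable helper: runs the block up to `tries`
# times, sleeps `pause` seconds between attempts, and retries only on the
# listed error classes. Any other error propagates immediately.
def retryable(tries:, pause:, on:)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *on
    raise if attempt >= tries
    sleep(pause)
    retry
  end
end

# Converts a wait timeout into a number of attempts at a fixed pause.
# `pause` must be greater than zero.
def wait_for_nats_connection(timeout: 60, pause: 1, &connect)
  tries = [(timeout / pause.to_f).ceil, 1].max
  retryable(tries: tries, pause: pause, on: [AuthError, ConnectError], &connect)
end
```

Expressing the timeout as a try count keeps the helper's interface small, at the cost of ignoring how long each connection attempt itself takes.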
During bosh create-env, nats-sync may start before the director API is available, causing connection refused errors. This adds retry logic using Bosh::Common.retryable to wait for the director to become available before attempting to sync users.
Changes:
- Add bosh-common dependency to bosh-nats-sync for retryable
- Add wait_for_director_connection method with configurable timeout
- Add nats-sync.director.connection_wait_timeout property (default: 60s)
- Handle network errors: ECONNREFUSED, ECONNRESET, ETIMEDOUT, etc.
- Update nats package to build all gemspecs like health_monitor does
- Add unit tests for director connection retry scenarios
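One way to decide which failures are worth retrying is a small predicate over Ruby's Errno classes, sketched below. The constant and method names are hypothetical, and the list contains only the three errors the commit message names explicitly; whatever the "etc." covers in the PR is not guessed at here.

```ruby
# Only the errors named in the commit message; the PR's full list is not
# known from this page.
TRANSIENT_NETWORK_ERRORS = [
  Errno::ECONNREFUSED, # nothing is listening on the director port yet
  Errno::ECONNRESET,   # the connection was dropped mid-request
  Errno::ETIMEDOUT     # the connection attempt timed out
].freeze

# Returns true when the error is one that a later attempt might not hit.
def transient_network_error?(error)
  TRANSIENT_NETWORK_ERRORS.any? { |klass| error.is_a?(klass) }
end
```

A retry loop can then rescue broadly, re-raise when this returns false, and sleep before the next attempt otherwise.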
During bosh create-env, the scheduler process may start before the director has completed running migrations, causing 'relation does not exist' errors when the scheduler tries to query the director_attributes table. This adds migration waiting to the scheduler binary, following the same pattern used by sync_dns_scheduler and metrics_collector.
The scheduler now:
1. Loads only config initially (not full director with models)
2. Calls DBMigrator.ensure_migrated! to wait for migrations
3. Then loads full director and starts the scheduler
Also adds scheduler_logger method to Config for consistent logging.
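The three-step startup order above can be sketched with injected callables, as below. Every name here (`start_scheduler` and its keyword arguments) is hypothetical and none of it is the director's API; the 25-second default mirrors the wait mentioned in the PR summary, and the polling `interval` is an assumption.

```ruby
# Hypothetical sketch of the startup ordering: config first, then wait for
# migrations, and only then load the full director and run the scheduler.
def start_scheduler(load_config:, migrated:, load_director:, run:, timeout: 25, interval: 0.5)
  config = load_config.call # step 1: config only, no models loaded yet

  # step 2: poll until migrations are reported complete or time runs out
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until migrated.call(config)
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "migrations not complete after #{timeout}s"
    end
    sleep(interval)
  end

  # step 3: safe to load models now that the schema exists
  run.call(load_director.call(config))
end
```

Deferring the model load until after the wait is what avoids the 'relation does not exist' failure, since nothing queries the schema before it is in place.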
Update template spec fixtures to include the new connection_wait_timeout properties that were added for startup race condition resilience:
- director.db.connection_wait_timeout in director.yml.erb_spec.rb
- hm.nats.connection_wait_timeout in health_monitor_templates_spec.rb
- nats-sync.director.connection_wait_timeout in nats_templates_spec.rb
aramprice reviewed Mar 12, 2026
aramprice previously approved these changes Mar 12, 2026
aramprice (Member) left a comment:
I like the approach overall!
I found one rescue/guard block that may be unneeded.
The begin/rescue LoadError block became a no-op after require 'fiber' was removed in a previous commit. Since Ruby 3.x has Fiber built-in, this guard is no longer needed.
aramprice approved these changes Mar 12, 2026
Summary
This PR adds retry and wait logic to handle race conditions during bosh create-env when BOSH components start before their dependencies are fully ready.
Changes
1. Director: Database connection retry logic
- director.db.connection_wait_timeout (default: 60s)
2. Health Monitor: NATS connection retry logic
- hm.nats.connection_wait_timeout (default: 60s)
3. NATS Sync: Director API connection retry logic
- nats-sync.director.connection_wait_timeout (default: 60s)
4. Scheduler: Wait for migrations before starting
- DBMigrator.ensure_migrated! pattern (waits up to 25s)
Implementation Details
- Bosh::Common.retryable for consistency
- bosh-common dependency to bosh-monitor and bosh-nats-sync
Testing
Commits
- 68514e0d98 - Add database connection retry logic during director startup
- 65c201aa89 - Add NATS connection retry logic during health monitor startup
- 6cded16a7b - Add director API connection retry logic during nats-sync startup
- c28a2a887f - Wait for migrations before starting scheduler