Add startup resilience for bosh create-env race conditions by rkoster · Pull Request #2689 · cloudfoundry/bosh

rkoster · 2026-03-12T14:58:32Z

Summary

This PR adds retry and wait logic to handle race conditions during bosh create-env when BOSH components start before their dependencies are fully ready.

Changes

1. Director: Database connection retry logic

- Adds retry logic when connecting to PostgreSQL during director startup
- Handles the race condition where the postgres job is still creating the database role
- Configurable via `director.db.connection_wait_timeout` (default: 60s)

2. Health Monitor: NATS connection retry logic

- Adds retry logic when connecting to NATS during health monitor startup
- Handles the race condition where the NATS server isn't ready yet
- Configurable via `hm.nats.connection_wait_timeout` (default: 60s)

3. NATS Sync: Director API connection retry logic

- Adds retry logic when connecting to the director API during nats-sync startup
- Handles network errors: ECONNREFUSED, ECONNRESET, ETIMEDOUT, etc.
- Configurable via `nats-sync.director.connection_wait_timeout` (default: 60s)

4. Scheduler: Wait for migrations before starting

- Scheduler now waits for director migrations to complete before starting
- Fixes "relation does not exist" errors when the scheduler queries the DB before migrations run
- Uses the existing `DBMigrator.ensure_migrated!` pattern (waits up to 25s)

Implementation Details

- All retry logic uses `Bosh::Common.retryable` for consistency
- Added bosh-common dependency to bosh-monitor and bosh-nats-sync
- Updated packaging scripts to build all gemspecs (following the director pattern)
- Added comprehensive unit tests for all retry scenarios
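
As a rough illustration of the retry shape described above, the sketch below shows a bounded retry helper with a fixed sleep and a list of retryable error classes. The name `retry_with_limit` and its keyword arguments are assumptions made for this example; it is a stand-in and not the `Bosh::Common.retryable` implementation or signature.

```ruby
# Illustrative stand-in only; NOT the bosh-common API.
# Runs the block up to `tries` times, sleeping between attempts,
# and retries only on the error classes listed in `on`.
def retry_with_limit(tries:, sleep_seconds:, on:)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *on
    raise if attempt >= tries # out of attempts: surface the last error
    sleep(sleep_seconds)
    retry
  end
end

# Simulated dependency that becomes ready on the third attempt.
result = retry_with_limit(tries: 5, sleep_seconds: 0.01, on: [IOError]) do |attempt|
  raise IOError, 'dependency not ready' if attempt < 3
  "ready after #{attempt} attempts"
end
puts result
```

Errors outside the `on` list are not rescued, so a genuine bug in the block fails immediately rather than being retried.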

Testing

- All unit tests pass
- Verified on an actual bosh create-env deployment
- Confirmed no stack traces in logs by running the following on the BOSH director VM after deployment:
```
grep -r "/var/vcap/packages" /var/vcap/sys/log/*
```

Commits

68514e0d98 - Add database connection retry logic during director startup
65c201aa89 - Add NATS connection retry logic during health monitor startup
6cded16a7b - Add director API connection retry logic during nats-sync startup
c28a2a887f - Wait for migrations before starting scheduler

When the director starts, it now waits for the database to become available instead of failing immediately. This fixes race conditions during bosh create-env when postgres is co-located and the role/database hasn't been created yet.

- Add wait_for_db_connection method that retries on DatabaseConnectionError and DatabaseError with informative logging
- Add director.db.connection_wait_timeout property (default: 30 seconds)
- Add unit tests for the retry behavior
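
The waiting behavior described in this commit could look roughly like the sketch below: retry the connection until it succeeds or a timeout elapses, logging each failed attempt. The error classes here are local stand-ins for the ones named in the commit message, and the method signature (`timeout:`, `interval:`, `log:`) is an assumption for illustration rather than the director's actual code.

```ruby
# Local stand-ins for the error classes named in the commit message.
class DatabaseError < StandardError; end
class DatabaseConnectionError < DatabaseError; end

# Hypothetical helper: retries the block until it succeeds or
# `timeout` seconds have elapsed, logging every failed attempt.
def wait_for_db_connection(timeout:, interval: 1, log: ->(msg) { puts msg })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    yield
  rescue DatabaseConnectionError, DatabaseError => e
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise if remaining <= 0 # timed out: re-raise the last error
    log.call("Database not ready (#{e.message}); retrying")
    sleep([interval, remaining].min)
    retry
  end
end

# Simulated database that accepts connections on the third attempt.
attempts = 0
connection = wait_for_db_connection(timeout: 2, interval: 0.01) do
  attempts += 1
  raise DatabaseConnectionError, 'role does not exist yet' if attempts < 3
  :connected
end
puts "#{connection} after #{attempts} attempts"
```

Using a monotonic clock for the deadline keeps the wait unaffected by wall-clock adjustments during VM boot.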

During bosh create-env, the health monitor may start before nats_sync has propagated the HM credentials to NATS. This causes NATS connection failures (AuthError/ConnectError) that result in the health monitor crashing and being restarted by BPM/Monit.

This change adds retry logic using Bosh::Common.retryable to wait for NATS to become available, similar to the database connection retry logic added for the director.

Changes:
- Add bosh-common dependency to bosh-monitor gem
- Add wait_for_nats_connection method with configurable timeout
- Add hm.nats.connection_wait_timeout property (default: 60s)
- Update health_monitor package to include bosh-common gem
- Add unit tests for NATS connection retry behavior

During bosh create-env, nats-sync may start before the director API is available, causing connection refused errors. This adds retry logic using Bosh::Common.retryable to wait for the director to become available before attempting to sync users.

Changes:
- Add bosh-common dependency to bosh-nats-sync for retryable
- Add wait_for_director_connection method with configurable timeout
- Add nats-sync.director.connection_wait_timeout property (default: 60s)
- Handle network errors: ECONNREFUSED, ECONNRESET, ETIMEDOUT, etc.
- Update nats package to build all gemspecs like health_monitor does
- Add unit tests for director connection retry scenarios
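
To illustrate retrying only on network-level failures, the sketch below retries on the three `Errno` classes named in this commit and lets every other error propagate at once. The constant, the method names, and the `max_attempts:`/`delay:` parameters are assumptions for this example; the full error list behind "etc." is not shown in the PR and is not guessed at here.

```ruby
# Only the three errors named in the commit message are listed.
RETRYABLE_NETWORK_ERRORS = [
  Errno::ECONNREFUSED,
  Errno::ECONNRESET,
  Errno::ETIMEDOUT
].freeze

def retryable_network_error?(error)
  RETRYABLE_NETWORK_ERRORS.any? { |klass| error.is_a?(klass) }
end

# Hypothetical wrapper: retries the block on network errors only,
# up to `max_attempts` times, sleeping `delay` seconds in between.
def wait_for_director(max_attempts:, delay:)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    raise unless retryable_network_error?(e) && attempts < max_attempts
    sleep(delay)
    retry
  end
end

# Simulated director API that refuses the first two connections.
calls = 0
status = wait_for_director(max_attempts: 5, delay: 0.01) do
  calls += 1
  raise Errno::ECONNREFUSED if calls < 3
  'director reachable'
end
puts "#{status} after #{calls} calls"
```

Keeping the retryable set explicit means that, for example, a bad-credentials or parsing error surfaces on the first attempt instead of being masked by the wait loop.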

During bosh create-env, the scheduler process may start before the director has completed running migrations, causing 'relation does not exist' errors when the scheduler tries to query the director_attributes table.

This adds migration waiting to the scheduler binary, following the same pattern used by sync_dns_scheduler and metrics_collector. The scheduler now:
1. Loads only config initially (not full director with models)
2. Calls DBMigrator.ensure_migrated! to wait for migrations
3. Then loads full director and starts the scheduler

Also adds scheduler_logger method to Config for consistent logging.
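
The three-step startup order above can be sketched as follows. `FakeMigrator`, `wait_until_migrated`, and `start_scheduler` are hypothetical names invented for this example; the real `DBMigrator` interface is not shown in this PR, so the sketch only models the ordering (config first, then wait, then full load).

```ruby
# Hypothetical stand-in for a migration check; not the real DBMigrator.
class FakeMigrator
  def initialize(ready_after_checks)
    @remaining = ready_after_checks
  end

  # Reports true once the simulated migrations have finished.
  def migrated?
    @remaining -= 1
    @remaining <= 0
  end

  # Polls `migrated?` up to `max_checks` times, then gives up.
  def wait_until_migrated(max_checks:, delay:)
    max_checks.times do
      return true if migrated?
      sleep(delay)
    end
    raise 'migrations did not complete in time'
  end
end

# Records each startup step so the ordering is visible.
def start_scheduler(migrator, steps)
  steps << :load_config                                   # 1. config only, no models
  migrator.wait_until_migrated(max_checks: 10, delay: 0.01) # 2. block until migrated
  steps << :load_director                                 # 3. full director load
  steps << :start_scheduler
  steps
end

p start_scheduler(FakeMigrator.new(3), [])
```

If the wait fails, the later steps never run, so no model code queries tables that do not exist yet.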

Update template spec fixtures to include the new connection_wait_timeout properties that were added for startup race condition resilience:

- director.db.connection_wait_timeout in director.yml.erb_spec.rb
- hm.nats.connection_wait_timeout in health_monitor_templates_spec.rb
- nats-sync.director.connection_wait_timeout in nats_templates_spec.rb

src/bosh-monitor/lib/bosh/monitor.rb

aramprice

I like the approach overall!

Found one possibly unneeded rescue/guard block.

The begin/rescue LoadError block became a no-op after require 'fiber' was removed in a previous commit. Since Ruby 3.x has Fiber built-in, this guard is no longer needed.

rkoster added 5 commits March 12, 2026 12:54

rkoster requested review from aramprice and ystros and removed request for aramprice March 12, 2026 15:14

rkoster mentioned this pull request Mar 12, 2026

Add wait script to postgres jobs for create-env reliability #2687

Closed

rkoster requested review from a team and aramprice March 12, 2026 15:17

cf-foundation-community-automation bot added this to Foundational Infrastructure Working Group Mar 12, 2026

cf-foundation-community-automation bot moved this to Inbox in Foundational Infrastructure Working Group Mar 12, 2026

aramprice reviewed Mar 12, 2026

src/bosh-monitor/lib/bosh/monitor.rb

aramprice previously approved these changes Mar 12, 2026


github-project-automation bot moved this from Inbox to Pending Merge | Prioritized in Foundational Infrastructure Working Group Mar 12, 2026

Remove obsolete Fiber LoadError guard

99b204b

The begin/rescue LoadError block became a no-op after require 'fiber' was removed in a previous commit. Since Ruby 3.x has Fiber built-in, this guard is no longer needed.

rkoster dismissed aramprice’s stale review via 99b204b March 12, 2026 15:41

aramprice approved these changes Mar 12, 2026


aramprice merged commit 924757e into main Mar 12, 2026
19 checks passed

github-project-automation bot moved this from Pending Merge | Prioritized to Done in Foundational Infrastructure Working Group Mar 12, 2026

aramprice deleted the add-database-connection-retry-logic branch March 12, 2026 15:50

aramprice merged 6 commits into main from add-database-connection-retry-logic
