
Add startup resilience for bosh create-env race conditions #2689

Merged
aramprice merged 6 commits into main from
add-database-connection-retry-logic
Mar 12, 2026

@rkoster (Contributor) commented Mar 12, 2026

Summary

This PR adds retry and wait logic to handle race conditions during bosh create-env when BOSH components start before their dependencies are fully ready.

Changes

1. Director: Database connection retry logic

  • Adds retry logic when connecting to PostgreSQL during director startup
  • Handles the race condition where the postgres job is still creating the database role
  • Configurable via director.db.connection_wait_timeout (default: 60s)

2. Health Monitor: NATS connection retry logic

  • Adds retry logic when connecting to NATS during health monitor startup
  • Handles the race condition where the NATS server isn't ready yet
  • Configurable via hm.nats.connection_wait_timeout (default: 60s)

3. NATS Sync: Director API connection retry logic

  • Adds retry logic when connecting to the director API during nats-sync startup
  • Handles network errors: ECONNREFUSED, ECONNRESET, ETIMEDOUT, etc.
  • Configurable via nats-sync.director.connection_wait_timeout (default: 60s)

4. Scheduler: Wait for migrations before starting

  • Scheduler now waits for director migrations to complete before starting
  • Fixes "relation does not exist" errors when scheduler queries DB before migrations run
  • Uses existing DBMigrator.ensure_migrated! pattern (waits up to 25s)

Implementation Details

  • All retry logic uses Bosh::Common.retryable for consistency
  • Added bosh-common dependency to bosh-monitor and bosh-nats-sync
  • Updated packaging scripts to build all gemspecs (following director pattern)
  • Added comprehensive unit tests for all retry scenarios

Testing

  • All unit tests pass
  • Verified on an actual bosh create-env deployment
  • Confirmed no stack traces in logs by running the following on the BOSH director VM after deployment:
    grep -r "/var/vcap/packages" /var/vcap/sys/log/*

Commits

  1. 68514e0d98 - Add database connection retry logic during director startup
  2. 65c201aa89 - Add NATS connection retry logic during health monitor startup
  3. 6cded16a7b - Add director API connection retry logic during nats-sync startup
  4. c28a2a887f - Wait for migrations before starting scheduler

rkoster added 5 commits March 12, 2026 12:54

68514e0d98 - Add database connection retry logic during director startup

When the director starts, it now waits for the database to become
available instead of failing immediately. This fixes race conditions
during bosh create-env when postgres is co-located and the role/database
hasn't been created yet.

- Add wait_for_db_connection method that retries on DatabaseConnectionError
  and DatabaseError with informative logging
- Add director.db.connection_wait_timeout property (default: 30 seconds)
- Add unit tests for the retry behavior
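The retry behavior described in this commit can be sketched roughly as follows. This is not the code from the PR (which uses Bosh::Common.retryable): the error classes are local stand-ins for Sequel's, and the `connect` callable, the `interval` and `log` parameters, and the monotonic-clock deadline are assumptions made for illustration.

```ruby
# Local stand-ins for Sequel::DatabaseError and Sequel::DatabaseConnectionError,
# defined here so the sketch runs without the sequel gem.
class DatabaseError < StandardError; end
class DatabaseConnectionError < DatabaseError; end

# Hypothetical sketch: keep calling `connect` until it succeeds or until
# `timeout` seconds have elapsed on the monotonic clock.
def wait_for_db_connection(connect, timeout: 30, interval: 1, log: ->(msg) { warn msg })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  attempt = 0
  begin
    attempt += 1
    connect.call
  rescue DatabaseConnectionError, DatabaseError => e
    # Once the deadline has passed, give up and surface the last error.
    raise if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    log.call("Database not ready (attempt #{attempt}): #{e.message}; retrying in #{interval}s")
    sleep(interval)
    retry
  end
end
```

A caller would pass whatever actually opens the connection as `connect`; any error other than the two listed classes propagates immediately instead of being retried.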

65c201aa89 - Add NATS connection retry logic during health monitor startup

During bosh create-env, the health monitor may start before nats_sync has
propagated the HM credentials to NATS. This causes NATS connection failures
(AuthError/ConnectError) that result in the health monitor crashing and
being restarted by BPM/Monit.

This change adds retry logic using Bosh::Common.retryable to wait for NATS
to become available, similar to the database connection retry logic added
for the director.

Changes:
- Add bosh-common dependency to bosh-monitor gem
- Add wait_for_nats_connection method with configurable timeout
- Add hm.nats.connection_wait_timeout property (default: 60s)
- Update health_monitor package to include bosh-common gem
- Add unit tests for NATS connection retry behavior
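As a rough illustration of retrying only on authentication and connection failures, the sketch below uses a small hand-written helper. Everything here is hypothetical: `retry_on` is a stand-in written for this example (it is not Bosh::Common.retryable and does not claim to match its signature), and `NatsAuthError` / `NatsConnectError` are placeholder classes for the NATS client's AuthError and ConnectError.

```ruby
# Placeholder error classes standing in for the NATS client's errors.
class NatsAuthError < StandardError; end
class NatsConnectError < StandardError; end

# Hypothetical helper: run the block, retrying only when one of `errors`
# is raised, for at most `tries` attempts with `wait` seconds in between.
def retry_on(errors, tries:, wait: 1)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *errors
    raise if attempt >= tries # out of attempts: re-raise the last error

    sleep(wait)
    retry
  end
end

# Converts a timeout in seconds into a number of attempts, then retries
# the (assumed) `connect` callable on auth/connect errors only.
def wait_for_nats_connection(connect, timeout: 60, wait: 1)
  tries = [(timeout / wait.to_f).ceil, 1].max
  retry_on([NatsAuthError, NatsConnectError], tries: tries, wait: wait) do |attempt|
    connect.call(attempt)
  end
end
```

Restricting the rescue to the two expected classes means a genuine bug (for example an ArgumentError) still fails fast rather than being masked by the wait loop.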

6cded16a7b - Add director API connection retry logic during nats-sync startup

During bosh create-env, nats-sync may start before the director API is
available, causing connection refused errors. This adds retry logic
using Bosh::Common.retryable to wait for the director to become
available before attempting to sync users.

Changes:
- Add bosh-common dependency to bosh-nats-sync for retryable
- Add wait_for_director_connection method with configurable timeout
- Add nats-sync.director.connection_wait_timeout property (default: 60s)
- Handle network errors: ECONNREFUSED, ECONNRESET, ETIMEDOUT, etc.
- Update nats package to build all gemspecs like health_monitor does
- Add unit tests for director connection retry scenarios
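The handling of network errors listed above might look something like the following sketch. The three Errno classes come from the commit message; the remaining entries in the list are guesses at what "etc." covers, and the `request` callable, `interval`, and `log` parameters are assumptions for illustration rather than the PR's actual interface.

```ruby
require 'socket'  # defines SocketError
require 'timeout' # defines Timeout::Error

# Errors treated as "director not up yet". The first three are named in the
# commit message; the rest are assumptions about what "etc." might include.
RETRYABLE_NETWORK_ERRORS = [
  Errno::ECONNREFUSED,
  Errno::ECONNRESET,
  Errno::ETIMEDOUT,
  Errno::EHOSTUNREACH,
  SocketError,
  Timeout::Error
].freeze

# Hypothetical sketch: call `request` until it succeeds, retrying only on
# the network errors above, for roughly `timeout` seconds in total.
def wait_for_director_connection(request, timeout: 60, interval: 1, log: ->(msg) { warn msg })
  tries = [(timeout / interval.to_f).ceil, 1].max
  1.upto(tries) do |attempt|
    begin
      return request.call
    rescue *RETRYABLE_NETWORK_ERRORS => e
      raise if attempt == tries # last attempt: let the error surface

      log.call("Director API not reachable (#{e.class}), attempt #{attempt}/#{tries}")
      sleep(interval)
    end
  end
end
```

Keeping the retryable classes in one frozen constant makes the list easy to review and extend without touching the loop.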

c28a2a887f - Wait for migrations before starting scheduler

During bosh create-env, the scheduler process may start before the
director has completed running migrations, causing 'relation does not
exist' errors when the scheduler tries to query the director_attributes
table.

This adds migration waiting to the scheduler binary, following the same
pattern used by sync_dns_scheduler and metrics_collector. The scheduler
now:
1. Loads only config initially (not full director with models)
2. Calls DBMigrator.ensure_migrated! to wait for migrations
3. Then loads full director and starts the scheduler
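The three-step startup order above can be sketched as follows. This is only an illustration of the ordering: `wait_until_migrated!` is a hypothetical stand-in for DBMigrator.ensure_migrated! (whose real implementation is not shown in this PR page), and the callables passed to `start_scheduler` are placeholders for the actual config loading, director loading, and scheduler start.

```ruby
# Hypothetical error raised when migrations never complete.
class MigrationsPending < StandardError; end

# Stand-in for the migration wait: poll `migrated` (a callable returning
# true or false) up to `tries` times, sleeping `interval` seconds between checks.
def wait_until_migrated!(migrated, tries: 25, interval: 1)
  tries.times do |i|
    return true if migrated.call

    sleep(interval) if i < tries - 1
  end
  raise MigrationsPending, "migrations not complete after #{tries} checks"
end

# Startup order from the commit message; all four callables are placeholders.
def start_scheduler(load_config:, migrated:, load_director:, run:, tries: 25, interval: 1)
  config = load_config.call                                         # 1. config only, no models
  wait_until_migrated!(migrated, tries: tries, interval: interval)  # 2. wait for migrations
  load_director.call(config)                                        # 3. full director, then start
  run.call
end
```

Loading models only after the wait is the point of the ordering: nothing queries a table until the migrations that create it have run.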

Also adds a scheduler_logger method to Config for consistent logging.

Update template spec fixtures to include the new connection_wait_timeout
properties that were added for startup race condition resilience:
- director.db.connection_wait_timeout in director.yml.erb_spec.rb
- hm.nats.connection_wait_timeout in health_monitor_templates_spec.rb
- nats-sync.director.connection_wait_timeout in nats_templates_spec.rb

@aramprice (Member) previously approved these changes Mar 12, 2026 and left a comment:

I like the approach overall!

Found one, maybe unneeded rescue / guard block.

@github-project-automation github-project-automation bot moved this from Inbox to Pending Merge | Prioritized in Foundational Infrastructure Working Group Mar 12, 2026

The begin/rescue LoadError block became a no-op after require 'fiber'
was removed in a previous commit. Since Ruby 3.x has Fiber built-in,
this guard is no longer needed.
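As a quick check of the claim that no guard is needed, the snippet below creates and resumes a fiber without any `require`. It is an independent illustration, not code from this PR; the `resume_twice` helper name is made up for the example.

```ruby
# No `require 'fiber'` anywhere in this file: Fiber is a core class.
def resume_twice
  fiber = Fiber.new do
    Fiber.yield :first # hand control back to the caller with a value
    :second            # value returned by the final resume
  end
  [fiber.resume, fiber.resume]
end

resume_twice # => [:first, :second]
```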
@aramprice aramprice merged commit 924757e into main Mar 12, 2026
19 checks passed
@github-project-automation github-project-automation bot moved this from Pending Merge | Prioritized to Done in Foundational Infrastructure Working Group Mar 12, 2026
@aramprice aramprice deleted the add-database-connection-retry-logic branch March 12, 2026 15:50