
Conversation

Collaborator

@jvstme jvstme commented Jul 24, 2025

Allow configuring HTTP probes for services and use probe results to avoid downtime during
deployments.

type: service
port: 8000
replicas: 2
commands:
- mkdir /www
- cd /www
- touch health-0
- touch health-1
- python -m http.server
probes:
- type: http
  url: /health-0
  interval: 5s
  ready_after: 2
- type: http
  url: /health-1
  interval: 5s
  timeout: 1
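The probe semantics in the config above can be sketched roughly as follows. This is an illustrative Python sketch, not dstack's implementation; `run_probe` and the injected `check` callable are hypothetical names. A probe runs every `interval`, a single failure resets the success streak, and the replica counts as ready once the streak reaches `ready_after`.

```python
import time
from typing import Callable

# Illustrative sketch (not dstack's implementation): run a probe until it
# has succeeded `ready_after` times in a row, checking every `interval`
# seconds. A single failure resets the success streak.
def run_probe(check: Callable[[], bool], interval: float, ready_after: int,
              max_checks: int = 100) -> bool:
    streak = 0
    for _ in range(max_checks):
        streak = streak + 1 if check() else 0
        if streak >= ready_after:
            return True
        time.sleep(interval)
    return False
```

With the first probe above, a replica becomes ready only after two consecutive successful responses from /health-0.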
Demo video attached: deployment-sped-up-5x.mp4

#2181

@r4victor
Collaborator

I recently analyzed the dstack server db queries and found a substantial load on the db and slow API endpoints caused by inefficient/redundant queries. I improved this in #2928 and #2929. From now on, I recommend loading only the required attributes and relationships when it makes sense. Please adjust this for probes.

Comment on lines 92 to 106
async with get_session_ctx() as session:
    async with get_locker(get_db().dialect_name).lock_ctx(
        ProbeModel.__tablename__, [probe.id]
    ):
        assert isinstance(probe_config.interval, int), (
            "Expected probe interval to be parsed by validator"
        )
        await session.execute(
            update(ProbeModel)
            .where(ProbeModel.id == probe.id)
            .values(
                success_streak=0 if not success else ProbeModel.success_streak + 1,
                due=get_current_datetime() + timedelta(seconds=probe_config.interval),
            )
        )
Collaborator

I think this is problematic: you have batch_size=100, so there can be that many _process_probe_async calls. If every one of them produces a write transaction, there can be too many of them and you no longer control the write rate, which can lead to "database is locked" errors on SQLite and is not good overall. I think we should batch probe execution and updates as well. An unbounded probe_config.timeout is a problem here. Perhaps we should put some restrictions on it to make batch updates work.

Collaborator

At the same time, I think batch updates also complicate things, since processing different resources becomes dependent (at least time-wise). Per-resource updates may be OK and performant enough for Postgres. We should check how much of a problem it would be for SQLite and what else we can do differently there.

Collaborator

A good solution for SQLite (while still doing an update for every probe processed) may be an app-level lock that serializes all probe updates. SQLite serializes concurrent writes anyway, and its file-based locking is much less efficient than our app-level lock. There is per-resource locking now, but it doesn't serve that purpose.

Collaborator Author

A good solution for sqlite (while doing an update for every probe processed) may be an app-level lock

We can consider this separately, since it may require more testing. The current solution works on SQLite without errors with 50 parallel probe executions. With 100 executions, database-is-locked errors do happen from time to time, but I don't think 100 parallel probes is realistic for SQLite setups.

BATCH_SIZE = 100
SSH_CONNECT_TIMEOUT = timedelta(seconds=10)
PROCESSING_OVERHEAD_TIMEOUT = timedelta(minutes=1)
PROBES_SCHEDULER = AsyncIOScheduler()
Collaborator

Why introduce a new scheduler and add _process_probe_async there? I think process_probes should take probes for processing, commit, and then process them (with asyncio.gather if multiple need to be processed concurrently), waiting for processing to complete. There can be multiple instances of process_probes, controlled by the background scheduler's max_instances=. The current approach puts no upper bound on active processing transactions.
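The suggested flow could look roughly like this. This is a sketch with hypothetical names, not the actual dstack code: claim a batch of due probes, "commit" the claim, then await the whole batch before returning, so the scheduler's max_instances= caps concurrent processing.

```python
import asyncio

BATCH_SIZE = 3  # illustrative; the PR uses 100

# Sketch of the suggested flow: claim a batch of due probes, "commit" the
# claim, then await all of them before returning. With this shape, the
# background scheduler's max_instances= bounds concurrent processing.
async def process_probes(queue: list[int], results: list[int]) -> None:
    batch, queue[:] = queue[:BATCH_SIZE], queue[BATCH_SIZE:]  # claim + commit

    async def process_one(probe_id: int) -> None:
        await asyncio.sleep(0)      # stands in for the HTTP/SSH check
        results.append(probe_id)    # stands in for the UPDATE

    await asyncio.gather(*(process_one(p) for p in batch))

async def demo() -> list[int]:
    queue, results = [1, 2, 3, 4, 5], []
    while queue:
        await process_probes(queue, results)
    return results
```

Each process_probes run stays alive until its whole batch finishes, which is exactly what makes max_instances= an effective bound.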

Collaborator

We'll also migrate the rest of the background tasks to a similar approach, and I don't think we want to introduce a separate scheduler for each one.

Collaborator Author

Why introducing a new scheduler

The scheduler is just a way to run fire-and-forget tasks. Similar behavior can be achieved with asyncio.create_task, but it requires boilerplate code to store references to tasks, so I decided to use a scheduler instead.
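For reference, the asyncio.create_task boilerplate mentioned here looks roughly like the sketch below (names are illustrative). asyncio keeps only weak references to tasks, so a fire-and-forget task can be garbage-collected mid-flight unless a strong reference is held until it finishes.

```python
import asyncio

# asyncio only holds weak references to tasks, so a fire-and-forget task can
# be garbage-collected mid-flight unless a strong reference is kept. The
# usual workaround (names illustrative):
_background_tasks: set[asyncio.Task] = set()

def fire_and_forget(coro) -> asyncio.Task:
    task = asyncio.create_task(coro)
    _background_tasks.add(task)                        # keep a strong reference
    task.add_done_callback(_background_tasks.discard)  # drop it when done
    return task

async def demo() -> int:
    async def probe() -> None:
        await asyncio.sleep(0)

    for _ in range(3):
        fire_and_forget(probe())
    await asyncio.sleep(0.01)      # let the tasks run to completion
    return len(_background_tasks)  # all references released again
```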

We'll also migrate the rest of background tasks to a similar approach, and I don't think we want to introduce a separate scheduler for every one.

We can introduce a global scheduler for fire-and-forget tasks when needed.

I think process_probes should take probes for processing, commit, and then process probes (asyncio.gather if need process multiple concurrently), waiting for processing to complete

That would result in process_probes being blocked until all probes are executed. If there are probes that take a long time to execute, such as LLM prompts, the max_instances of process_probes tasks will be quickly exhausted and shorter probes will not be executed at the user-configured interval.

The current approach puts no upper bound on active processing transactions.

There will typically be at most one instance of the processing task (process_probes), since this task exits quickly after starting _process_probe_async tasks. As for _process_probe_async — yes, there can be as many of these tasks as there are active probes. This may be acceptable, because these tasks happen outside of a db transaction and spend most of the time waiting for I/O from the service or from SSH. Each task can trigger a db transaction, but it should be short since it only consists of one UPDATE query.

Setting bounds on the number of _process_probe_async tasks will result in probes not being executed at the user-configured interval.

If the number of db write transactions is a problem, even though these transactions are short, we can try to limit them by having a separate worker write the results of probe executions in batches instead of writing them for each probe in _process_probe_async.
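The separate-worker idea could be sketched as follows (hypothetical names, not the merged implementation): probe tasks only append results to an in-memory buffer, and a single writer flushes it periodically, so the number of write transactions is bounded regardless of how many probes run.

```python
import asyncio

# Sketch of batched result writes: _process_probe_async tasks only record
# results in memory; one writer periodically flushes everything accumulated
# in a single transaction. Names are illustrative.
class ResultBuffer:
    def __init__(self) -> None:
        self._pending: list[tuple[int, bool]] = []

    def record(self, probe_id: int, success: bool) -> None:
        self._pending.append((probe_id, success))

    def drain(self) -> list[tuple[int, bool]]:
        batch, self._pending = self._pending, []
        return batch

async def flush_worker(buf: ResultBuffer, write_batch, rounds: int) -> None:
    for _ in range(rounds):
        batch = buf.drain()
        if batch:
            await write_batch(batch)  # one UPDATE/transaction per flush
        await asyncio.sleep(0)        # would be the flush interval

async def demo() -> list[int]:
    buf = ResultBuffer()
    transaction_sizes: list[int] = []

    async def write_batch(batch: list[tuple[int, bool]]) -> None:
        transaction_sizes.append(len(batch))

    for i in range(3):
        buf.record(i, success=True)
    await flush_worker(buf, write_batch, rounds=2)
    return transaction_sizes  # three results land in a single transaction
```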

If the overall number of concurrent probe execution tasks is a problem, we can try to solve it by moving probe execution to dstack-runner, but this requires further research.

Collaborator

If the number of db write transactions is a problem

It is a problem for SQLite, but I think we can mitigate it with an app-level lock.

Collaborator

Agree overall but

there can be as many of these tasks as there are active probes. This may be acceptable, because these tasks happen outside of a db transaction and spend most of the time waiting for I/O from the service or from SSH.

there should be a limit on the probe timeout and the number of probes; otherwise the server can be overloaded with many never-ending probes.

jvstme added 2 commits August 1, 2025 08:18
- Docs
- Compatibility
- More tests
- Optimized DB queries
- Limit max probes and probe timeout
- Defaults out of `ProbeConfig`
- Stricter validation
- Migration rebase
- Locking fix
@jvstme jvstme marked this pull request as ready for review August 5, 2025 02:01
@jvstme jvstme requested a review from r4victor August 5, 2025 02:02
@jvstme jvstme merged commit 62e5b87 into master Aug 5, 2025
26 checks passed
@jvstme jvstme deleted the issue_2181_probes branch August 5, 2025 07:27