
Conversation

Collaborator

@jvstme jvstme commented Jul 24, 2025

Allow configuring HTTP probes for services and use probe results to avoid downtime during
deployments.

type: service
port: 8000
replicas: 2
commands:
- mkdir /www
- cd /www
- touch health-0
- touch health-1
- python -m http.server
probes:
- type: http
  url: /health-0
  interval: 5s
  ready_after: 2
- type: http
  url: /health-1
  interval: 5s
  timeout: 1
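The probe semantics in the config above can be sketched roughly as follows. This is an illustrative Python sketch, not dstack's implementation; `run_probe` and the injected `check` callable are hypothetical names. A probe runs every `interval`, a single failure resets the success streak, and the replica counts as ready once the streak reaches `ready_after`.

```python
import time
from typing import Callable

# Illustrative sketch (not dstack's implementation): run a probe until it
# has succeeded `ready_after` times in a row, checking every `interval`
# seconds. A single failure resets the success streak.
def run_probe(check: Callable[[], bool], interval: float, ready_after: int,
              max_checks: int = 100) -> bool:
    streak = 0
    for _ in range(max_checks):
        streak = streak + 1 if check() else 0
        if streak >= ready_after:
            return True
        time.sleep(interval)
    return False
```

With the first probe above, a replica becomes ready only after two consecutive successful responses from /health-0.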
Demo video attached: deployment-sped-up-5x.mp4

#2181

@r4victor
Collaborator

I recently analyzed the dstack server db queries and found a substantial load on the db and slow API endpoints caused by inefficient/redundant queries. I improved this in #2928 and #2929. From now on, I recommend loading only the required attributes and relationships when it makes sense. Please adjust this for probes.

Comment on lines 92 to 106
async with get_session_ctx() as session:
    async with get_locker(get_db().dialect_name).lock_ctx(
        ProbeModel.__tablename__, [probe.id]
    ):
        assert isinstance(probe_config.interval, int), (
            "Expected probe interval to be parsed by validator"
        )
        await session.execute(
            update(ProbeModel)
            .where(ProbeModel.id == probe.id)
            .values(
                success_streak=0 if not success else ProbeModel.success_streak + 1,
                due=get_current_datetime() + timedelta(seconds=probe_config.interval),
            )
        )
Collaborator

I think this is problematic: you have batch_size=100, so there can be that many _process_probe_async calls. If every one of them produces a write transaction, there can be too many of them and you no longer control the write rate, which can lead to "database is locked" errors on SQLite and is not good overall. I think we should batch probe execution and updates as well. An unbounded probe_config.timeout is a problem here. Perhaps we should put some restrictions on it to make batch updates work.

Collaborator

At the same time, I think batch updates also complicate things, since processing different resources becomes dependent (at least time-wise). Per-resource updates may be OK and performant enough for Postgres. We should check how much of a problem it would be for SQLite and what else we can do differently there.

Collaborator

A good solution for SQLite (while still doing an update for every probe processed) may be an app-level lock that serializes all probe updates. SQLite serializes concurrent writes anyway, and its file-based locking is much less efficient than our app-level lock. There is per-resource locking now, but it doesn't serve that purpose.

Collaborator Author

A good solution for sqlite (while doing an update for every probe processed) may be an app-level lock

We can consider this separately, since it may require more testing. The current solution works on SQLite without errors with 50 parallel probe executions. With 100 executions, database-is-locked errors do happen from time to time, but I don't think 100 parallel probes is realistic for SQLite setups.

BATCH_SIZE = 100
SSH_CONNECT_TIMEOUT = timedelta(seconds=10)
PROCESSING_OVERHEAD_TIMEOUT = timedelta(minutes=1)
PROBES_SCHEDULER = AsyncIOScheduler()
Collaborator

Why introduce a new scheduler and add _process_probe_async there? I think process_probes should take probes for processing, commit, and then process them (with asyncio.gather if multiple need to be processed concurrently), waiting for processing to complete. There can be multiple instances of process_probes, controlled by the background scheduler's max_instances=. The current approach puts no upper bound on active processing transactions.
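The suggested flow could look roughly like this. This is a sketch with hypothetical names, not the actual dstack code: claim a batch of due probes, "commit" the claim, then await the whole batch before returning, so the scheduler's max_instances= caps concurrent processing.

```python
import asyncio

BATCH_SIZE = 3  # illustrative; the PR uses 100

# Sketch of the suggested flow: claim a batch of due probes, "commit" the
# claim, then await all of them before returning. With this shape, the
# background scheduler's max_instances= bounds concurrent processing.
async def process_probes(queue: list[int], results: list[int]) -> None:
    batch, queue[:] = queue[:BATCH_SIZE], queue[BATCH_SIZE:]  # claim + commit

    async def process_one(probe_id: int) -> None:
        await asyncio.sleep(0)      # stands in for the HTTP/SSH check
        results.append(probe_id)    # stands in for the UPDATE

    await asyncio.gather(*(process_one(p) for p in batch))

async def demo() -> list[int]:
    queue, results = [1, 2, 3, 4, 5], []
    while queue:
        await process_probes(queue, results)
    return results
```

Each process_probes run stays alive until its whole batch finishes, which is exactly what makes max_instances= an effective bound.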

Collaborator

We'll also migrate the rest of the background tasks to a similar approach, and I don't think we want to introduce a separate scheduler for each one.

Collaborator Author

Why introducing a new scheduler

The scheduler is just a way to run fire-and-forget tasks. Similar behavior can be achieved with asyncio.create_task, but it requires boilerplate code to store references to tasks, so I decided to use a scheduler instead.
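For reference, the asyncio.create_task boilerplate mentioned here looks roughly like the sketch below (names are illustrative). asyncio keeps only weak references to tasks, so a fire-and-forget task can be garbage-collected mid-flight unless a strong reference is held until it finishes.

```python
import asyncio

# asyncio only holds weak references to tasks, so a fire-and-forget task can
# be garbage-collected mid-flight unless a strong reference is kept. The
# usual workaround (names illustrative):
_background_tasks: set[asyncio.Task] = set()

def fire_and_forget(coro) -> asyncio.Task:
    task = asyncio.create_task(coro)
    _background_tasks.add(task)                        # keep a strong reference
    task.add_done_callback(_background_tasks.discard)  # drop it when done
    return task

async def demo() -> int:
    async def probe() -> None:
        await asyncio.sleep(0)

    for _ in range(3):
        fire_and_forget(probe())
    await asyncio.sleep(0.01)      # let the tasks run to completion
    return len(_background_tasks)  # all references released again
```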

We'll also migrate the rest of background tasks to a similar approach, and I don't think we want to introduce a separate scheduler for every one.

We can introduce a global scheduler for fire-and-forget tasks when needed.

I think process_probes should take probes for processing, commit, and then process probes (asyncio.gather if need process multiple concurrently), waiting for processing to complete

That would result in process_probes being blocked until all probes are executed. If there are probes that take a long time to execute, such as LLM prompts, the max_instances of process_probes tasks will be quickly exhausted and shorter probes will not be executed at the user-configured interval.

The current approach puts no upper bound on active processing transactions.

There will typically be at most one instance of the processing task (process_probes), since this task exits quickly after starting _process_probe_async tasks. As for _process_probe_async — yes, there can be as many of these tasks as there are active probes. This may be acceptable, because these tasks happen outside of a db transaction and spend most of the time waiting for I/O from the service or from SSH. Each task can trigger a db transaction, but it should be short since it only consists of one UPDATE query.

Setting bounds on the number of _process_probe_async tasks will result in probes not being executed at the user-configured interval.

If the number of db write transactions is a problem, even though these transactions are short, we can try to limit them by having a separate worker write the results of probe executions in batches instead of writing them for each probe in _process_probe_async.
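The separate-worker idea could be sketched as follows (hypothetical names, not the merged implementation): probe tasks only append results to an in-memory buffer, and a single writer flushes it periodically, so the number of write transactions is bounded regardless of how many probes run.

```python
import asyncio

# Sketch of batched result writes: _process_probe_async tasks only record
# results in memory; one writer periodically flushes everything accumulated
# in a single transaction. Names are illustrative.
class ResultBuffer:
    def __init__(self) -> None:
        self._pending: list[tuple[int, bool]] = []

    def record(self, probe_id: int, success: bool) -> None:
        self._pending.append((probe_id, success))

    def drain(self) -> list[tuple[int, bool]]:
        batch, self._pending = self._pending, []
        return batch

async def flush_worker(buf: ResultBuffer, write_batch, rounds: int) -> None:
    for _ in range(rounds):
        batch = buf.drain()
        if batch:
            await write_batch(batch)  # one UPDATE/transaction per flush
        await asyncio.sleep(0)        # would be the flush interval

async def demo() -> list[int]:
    buf = ResultBuffer()
    transaction_sizes: list[int] = []

    async def write_batch(batch: list[tuple[int, bool]]) -> None:
        transaction_sizes.append(len(batch))

    for i in range(3):
        buf.record(i, success=True)
    await flush_worker(buf, write_batch, rounds=2)
    return transaction_sizes  # three results land in a single transaction
```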

If the overall number of concurrent probe execution tasks is a problem, we can try to solve it by moving probe execution to dstack-runner, but this requires further research.

Collaborator

If the number of db write transactions is a problem

It is a problem for SQLite, but I think we can mitigate it with an app-level lock.

Collaborator

Agree overall but

there can be as many of these tasks as there are active probes. This may be acceptable, because these tasks happen outside of a db transaction and spend most of the time waiting for I/O from the service or from SSH.

there should be a limit on the probe timeout and the number of probes; otherwise the server can be overloaded with many never-ending probes.

jvstme added 2 commits August 1, 2025 08:18
- Docs
- Compatibility
- More tests
- Optimized DB queries
- Limit max probes and probe timeout
- Defaults out of `ProbeConfig`
- Stricter validation
- Migration rebase
- Locking fix
@jvstme jvstme marked this pull request as ready for review August 5, 2025 02:01
@jvstme jvstme requested a review from r4victor August 5, 2025 02:02
@jvstme jvstme merged commit 62e5b87 into master Aug 5, 2025
26 checks passed
@jvstme jvstme deleted the issue_2181_probes branch August 5, 2025 07:27