Service probes #2927
Conversation
Allow configuring HTTP probes for services and use probe results to avoid downtime during deployments.
I recently made an analysis of dstack server db queries that showed a substantial load on the db and slow API endpoints caused by inefficient/redundant queries. Improved this in #2928 and #2929. From now on, I recommend loading only the required attributes and relationships where it makes sense. Please adjust this for probes.
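For example (a sketch, not code from this PR), narrowing the loaded columns with SQLAlchemy loader options could look like this; `ProbeModel.run` is an assumed relationship used only for illustration:

```python
from sqlalchemy import select
from sqlalchemy.orm import joinedload, load_only

# Select only the columns probe processing actually touches; join the one
# relationship needed instead of loading everything lazily.
res = await session.execute(
    select(ProbeModel).options(
        load_only(ProbeModel.id, ProbeModel.success_streak, ProbeModel.due),
        joinedload(ProbeModel.run),  # hypothetical relationship
    )
)
probes = res.unique().scalars().all()
```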
```python
async with get_session_ctx() as session:
    async with get_locker(get_db().dialect_name).lock_ctx(
        ProbeModel.__tablename__, [probe.id]
    ):
        assert isinstance(probe_config.interval, int), (
            "Expected probe interval to be parsed by validator"
        )
        await session.execute(
            update(ProbeModel)
            .where(ProbeModel.id == probe.id)
            .values(
                success_streak=0 if not success else ProbeModel.success_streak + 1,
                due=get_current_datetime() + timedelta(seconds=probe_config.interval),
            )
        )
```
I think this is problematic: you have `batch_size=100`, so there can be that many `_process_probe_async` tasks. If every one of them produces a write transaction, there can be too many of them and you no longer control the write rate, which can lead to "database is locked" errors on SQLite and is generally not good. I think we should batch probe execution and updates as well. Unbounded `probe_config.timeout` is a problem here; perhaps we should put some restrictions on it to make batch updates work.
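A rough sketch of what batched updates could look like, assuming probe results are collected elsewhere and flushed in one transaction; `flush_probe_results` is a hypothetical helper built on the models and helpers already used in this PR:

```python
from datetime import timedelta
from uuid import UUID

from sqlalchemy import update

async def flush_probe_results(results: list[tuple[UUID, bool, int]]) -> None:
    # One short write transaction for the whole batch instead of one per probe,
    # so the write rate is bounded by the batch interval, not the probe count.
    async with get_session_ctx() as session:
        for probe_id, success, interval in results:
            await session.execute(
                update(ProbeModel)
                .where(ProbeModel.id == probe_id)
                .values(
                    success_streak=0 if not success else ProbeModel.success_streak + 1,
                    due=get_current_datetime() + timedelta(seconds=interval),
                )
            )
        await session.commit()
```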
At the same time, I think batch updates also complicate things, since processing different resources becomes interdependent (at least time-wise). Per-resource updates may be OK and performant enough for Postgres. We should check how much of a problem they would be for SQLite and what else we can do there differently.
A good solution for SQLite (while still doing an update for every probe processed) may be an app-level lock that serializes all probe updates. SQLite serializes concurrent writes anyway, and its file-based locking is much less efficient than our app-level lock. There is per-resource locking now, but it doesn't serve that purpose.
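A minimal sketch of such an app-level lock, assuming a module-level `asyncio.Lock`; `_probe_write_lock` and `_write_probe_result` are illustrative names, not part of this PR:

```python
import asyncio

from sqlalchemy import update

# One lock shared by all probe writers in this process.
_probe_write_lock = asyncio.Lock()

async def _write_probe_result(probe_id, values: dict) -> None:
    # Serializing writes in the app means SQLite rarely sees two concurrent
    # write transactions, greatly reducing "database is locked" errors.
    async with _probe_write_lock:
        async with get_session_ctx() as session:
            await session.execute(
                update(ProbeModel).where(ProbeModel.id == probe_id).values(**values)
            )
            await session.commit()
```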
> A good solution for SQLite (while still doing an update for every probe processed) may be an app-level lock

We can consider this separately, since it may require more testing. The current solution works on SQLite without errors with 50 parallel probe executions. With 100 executions, database-is-locked errors do happen from time to time, but I don't think 100 parallel probes are realistic for SQLite setups.
```python
BATCH_SIZE = 100
SSH_CONNECT_TIMEOUT = timedelta(seconds=10)
PROCESSING_OVERHEAD_TIMEOUT = timedelta(minutes=1)
PROBES_SCHEDULER = AsyncIOScheduler()
```
Why introduce a new scheduler and add `_process_probe_async` there? I think `process_probes` should take probes for processing, commit, and then process the probes (with `asyncio.gather` if multiple need to be processed concurrently), waiting for processing to complete. There can be multiple instances of `process_probes`, controlled by `max_instances=` on the background scheduler. The current approach puts no upper bound on active processing transactions.
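A sketch of the suggested flow, under the assumption that probes can be claimed in one short transaction; `_claim_due_probes` and `_process_probe` are hypothetical helpers:

```python
import asyncio

async def process_probes() -> None:
    # Claim due probes in one short transaction, commit, then execute them
    # concurrently and wait for completion. Overall concurrency is bounded
    # by max_instances= on the background scheduler.
    async with get_session_ctx() as session:
        probes = await _claim_due_probes(session, limit=BATCH_SIZE)
        await session.commit()
    await asyncio.gather(*(_process_probe(probe) for probe in probes))
```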
We'll also migrate the rest of the background tasks to a similar approach, and I don't think we want to introduce a separate scheduler for each one.
> Why introduce a new scheduler

The scheduler is just a way to run fire-and-forget tasks. Similar behavior can be achieved with `asyncio.create_task`, but that requires boilerplate code to store references to the tasks, so I decided to use a scheduler instead.
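For reference, this is the `create_task` boilerplate in question (the standard pattern from the asyncio docs; `fire_and_forget` is an illustrative name):

```python
import asyncio

# Keep a strong reference to each fire-and-forget task so it isn't
# garbage-collected while it is still running.
_background_tasks: set[asyncio.Task] = set()

def fire_and_forget(coro) -> None:
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
```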
> We'll also migrate the rest of the background tasks to a similar approach, and I don't think we want to introduce a separate scheduler for each one.

We can introduce a global scheduler for fire-and-forget tasks when needed.
> I think process_probes should take probes for processing, commit, and then process the probes (with asyncio.gather if multiple need to be processed concurrently), waiting for processing to complete

That would result in `process_probes` being blocked until all probes are executed. If there are probes that take a long time to execute, such as LLM prompts, the `max_instances` of `process_probes` tasks will quickly be exhausted, and shorter probes will not be executed at the user-configured interval.
> The current approach puts no upper bound on active processing transactions.

There will typically be at most one instance of the processing task (`process_probes`), since it exits quickly after starting the `_process_probe_async` tasks. As for `_process_probe_async`: yes, there can be as many of these tasks as there are active probes. This may be acceptable, because they run outside of a db transaction and spend most of their time waiting for I/O from the service or from SSH. Each task can trigger a db transaction, but it should be short, since it only consists of one UPDATE query.

Setting bounds on the number of `_process_probe_async` tasks would result in probes not being executed at the user-configured interval.
If the number of db write transactions is a problem, even though these transactions are short, we can try to limit them by having a separate worker write the results of probe executions in batches instead of writing them for each probe in `_process_probe_async`.
If the overall number of concurrent probe execution tasks is a problem, we can try to solve it by moving probe execution to dstack-runner, but this requires further research.
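A minimal sketch of the batching-worker idea above, assuming probe tasks enqueue their results and a single writer drains the queue; `_results_queue` and `_results_writer` are illustrative names, and `flush_probe_results` refers to the batch-flush helper sketched earlier:

```python
import asyncio

# Probe tasks put (probe_id, success, interval) tuples here instead of
# writing to the db themselves.
_results_queue: asyncio.Queue = asyncio.Queue()

async def _results_writer() -> None:
    while True:
        # Block for the first result, then drain whatever else has queued up.
        batch = [await _results_queue.get()]
        while not _results_queue.empty():
            batch.append(_results_queue.get_nowait())
        # One transaction per batch (see flush_probe_results above).
        await flush_probe_results(batch)
        await asyncio.sleep(1)  # caps the write rate at about one transaction per second
```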
> If the number of db write transactions is a problem

It is a problem for SQLite, but I think we can mitigate it with an app-level lock.
Agree overall, but

> there can be as many of these tasks as there are active probes. This may be acceptable, because they run outside of a db transaction and spend most of their time waiting for I/O from the service or from SSH.

there should be a limit on the probe timeout and the number of probes; otherwise the server can be overloaded with many never-ending probes.
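One way to enforce both bounds, sketched under assumed limits; `MAX_PROBE_TIMEOUT`, `_probe_semaphore`, and `_execute_probe` are all illustrative, not values or helpers from this PR:

```python
import asyncio
from datetime import timedelta

MAX_PROBE_TIMEOUT = timedelta(minutes=5)   # illustrative cap on probe_config.timeout
_probe_semaphore = asyncio.Semaphore(100)  # illustrative bound on concurrent probes

async def _run_probe_bounded(probe) -> bool:
    async with _probe_semaphore:
        timeout = min(probe.config.timeout, MAX_PROBE_TIMEOUT.total_seconds())
        try:
            return await asyncio.wait_for(_execute_probe(probe), timeout=timeout)
        except asyncio.TimeoutError:
            return False  # a probe that times out counts as a failure
```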
- Docs
- Compatibility
- More tests
- Optimized DB queries
- Limit max probes and probe timeout
- Defaults out of `ProbeConfig`
- Stricter validation
- Migration rebase
- Locking fix
Allow configuring HTTP probes for services and use probe results to avoid downtime during deployments.
deployment-sped-up-5x.mp4
#2181