Improve concurrency support for SQLite3 #50370
fractaledmind wants to merge 1 commit into rails:main
Conversation
The SQLite3 `busy_timeout` C API will hold the global interpreter lock (GIL) for the duration of the retry process, disallowing any other Ruby threads from running while this thread is waiting for a database connection to become available. In order to allow concurrent threads to naturally coordinate their order of execution, the `busy_timeout` C API has been replaced with the `busy_handler` C API. This allows the Ruby thread to be put to sleep while waiting for a database connection to become available, allowing other Ruby threads to run in the meantime, while also respecting the `timeout` option.
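A rough sketch of the described approach, assuming a `SQLite3::Database`-like object whose `#busy_handler` takes a block. The helper name and the default values here are illustrative, not the adapter's actual code:

```ruby
# Illustrative sketch: a busy handler that sleeps in Ruby (releasing the
# GVL) instead of blocking inside SQLite's busy_timeout C function.
# `install_busy_handler` is a hypothetical helper name.
def install_busy_handler(db, timeout_seconds: 5.0, retry_interval: 6e-5)
  db.busy_handler do |count|
    # Give up once the estimated elapsed time exceeds the timeout;
    # returning false makes SQLite raise SQLite3::BusyException.
    next false if count * retry_interval > timeout_seconds

    # Kernel#sleep releases the GVL, so other Ruby threads can run.
    sleep(retry_interval)
    true # any truthy value tells SQLite to retry the locked operation
  end
end
```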
Also, I added a deprecation for the `retries` config option.
byroot left a comment
> This is prone to happen when running your Rails app with a multi-threaded server,
If the assumption is that the other writer is likely to be another thread in the same process, it seems to me that it would be vastly more efficient to first acquire an in-process lock before attempting to acquire a write lock on the database.
Also, 60 microseconds doesn't seem like much; it's almost busy looping, and it's not clear what kind of impact it may have on other threads.
But more importantly, your code seems to assume that the callback is invoked every 60 usec (hence the `count * retry_interval`), but to exit the sleep the thread has to re-acquire the GVL, so this can take way longer (there is really no limit).
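One way to address that concern (sketched here, not taken from the PR) is to check a monotonic clock instead of estimating elapsed time from the invocation count. The function name and values below are illustrative:

```ruby
# Illustrative: a deadline based on a monotonic clock, so the timeout
# holds even if the callback fires less often than every retry_interval.
def timed_busy_handler(timeout_seconds: 5.0, retry_interval: 6e-5)
  deadline = nil
  proc do |count|
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    # A new locking event starts with count == 0; reset the deadline then.
    deadline = now + timeout_seconds if count.zero? || deadline.nil?

    if now >= deadline
      false # give up; SQLite raises SQLite3::BusyException
    else
      sleep(retry_interval) # releases the GVL between retries
      true
    end
  end
end
```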
```ruby
raise TypeError, "Timeout must be an integer" unless timeout.is_a?(Integer)

timeout_seconds = timeout.fdiv(1000)
retry_interval = 6e-5 # 60 microseconds
```
Loosely and quickly 😅
I wanted a value that would allow for sub-millisecond resolution, since SQLite can perform various operations in microseconds (https://www.sqlite.org/fasterthanfs.html#approx, https://github.com/chrisdavies/dbench) but also wouldn't create too much thrashing (busy looping, as you say). 60 seconds in a minute, 60 minutes in an hour: 60 just seemed like a place to start, and when running the code I went from constant errors to no errors, so I've kept it.
I opened the PR looking to get wiser insight from people like you, as while I have a clear picture of the shape of the problem and how I can generally fix it, I am in no way certain that this is the perfect or even one of the better solutions. I'm very much open to finding a better solution to the problem.
```ruby
if (count % 100).zero?
  # fail if we exceed the timeout value (captured from the timeout config option, converted to seconds)
  timed_out = (count * retry_interval) > timeout_seconds
```
Sounds really weird to me. You say it's to "minimize the Ruby execution", but in the end, with the extra modulo etc., I'm pretty sure you are doing as much work as if you'd just precomputed the max_count and compared it.
But even without precomputing, this amount of code is absolutely trivial.
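The precomputed comparison byroot describes could be sketched like this (the variable names and values are illustrative):

```ruby
# Illustrative: compare against a precomputed max_count instead of
# multiplying count * retry_interval on every callback invocation.
timeout_seconds = 5.0
retry_interval  = 6e-5 # 60 microseconds
max_count = (timeout_seconds / retry_interval).floor

handler = proc do |count|
  next false if count > max_count # timeout exceeded: give up
  sleep(retry_interval)           # releases the GVL between retries
  true
end
```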
My understanding of what you mean by "precomputing the max_count and comparing it" would be the retries mechanism recently added. And this is precisely the reason I added this option. It has the benefits of doing the bare minimum amount of work in Ruby land, but doing work in Ruby land allows for the GVL to be released across retrying connections.
After getting some community feedback though, I realized that it does have the major downside of providing no real intuition for what number to put as the max_count/retries. One example required 1M+ retries to get stable performance.
I think the semantics and intuition of timeout are vastly superior to retries. I also think that having some cap so that completely broken connections do eventually error is necessary. But, I also want "normal" connections, in "normal" multi-threaded contexts to not error at all. So, the goal here is to find the middle ground that achieves these goals. Whatever ways we can find to improve this busy_handler callback while still achieving these goals I am all in for.
This article explores how using an "in-language" mutex performs relative to using SQLite's mutex across a collection of different operation scenarios. When only reading, there was essentially no difference; when only writing, there was a notable difference past 64 concurrent connections; when both reading and writing, "the performance is generally not quite as high without the [in-language] mutex compared to with the mutex. But also, the worst performance isn’t as bad without the mutex compared to with the mutex. I suspect this has something to do with how the RWMutex blocks readers when writing, which SQLite itself doesn’t do when writing."

Given that result, the author then also considered using two database connection pools: one for reads (with unlimited connections) and one for writes (with just one allowed connection, so effectively a mutex). A mix of reading and writing was tested again in this situation, and "the results are quite similar to the no-mutex version above, except the worst-case performance seems more consistent across different degrees of parallelism." In the end, he concluded: "Using mutexes and different connections pools has its benefits and drawbacks. If in doubt, I would probably use neither and just stick with one big connection pool for all connections."

I am open to exploring alternatives, but I do want to ensure that concurrent reads are kept. This is the major benefit of WAL mode, and the biggest and most important change that we made with 7.1.0.
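The in-process lock idea from earlier in the thread could be sketched like this in Ruby, where `execute_write` and the connection object are hypothetical names, not adapter code:

```ruby
# Illustrative: serialize writers inside the process with a Ruby Mutex,
# so threads queue GVL-cooperatively before ever contending for SQLite's
# write lock. Cross-process contention still falls back to the busy handler.
WRITE_MUTEX = Mutex.new

def execute_write(connection, sql)
  WRITE_MUTEX.synchronize { connection.execute(sql) }
end
```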
I agree. I wanted to start with something just above busy looping to allow for the finest-grained queue resolution possible. I am still trying to come up with a simple benchmark to allow us to experiment with different `retry_interval` values.
My goal was not to assume any precise rhythm, only to ensure that the timeout is eventually respected.

I know my answers are all a bit vague. Hopefully, though, you can see the high-level constraints that I am trying to balance. I readily admit that this might not be the right solution to the problem that balances these constraints. I opened the PR because it is a solution that balances these constraints, and the problem absolutely needs to be solved at the Rails level. And I knew that by opening the PR, I could start conversations with top-flight Rails developers like yourself to find the best possible solution together. So, the vagueness comes from the fact that I do not hold any aspect of this solution as sacrosanct. I only hold the problem as requiring a solution.
@byroot: After traveling some, I found some time to work up a benchmarking script to better explore the performance differences between different implementations of a `busy_handler`.

I experimented by varying 3 details:
I benchmarked every combination of these variations. I ran the benchmark 50 times, randomizing the order of the implementations.

When looking at the mean, we see roughly what we would expect; that is, having a finer-grained retry interval allows for slightly faster execution times. However, to my mind, the overall picture demonstrates that the perf differences simply aren't that meaningful, so we should probably optimize for code readability, maintainability, and predictability.
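A minimal harness along these lines could compare handler variants. The variants and iteration counts below are placeholders, not the actual benchmark script from this comment:

```ruby
require "benchmark"

# Placeholder comparison of two retry intervals; the real script varied
# more dimensions and ran 50 randomized repetitions.
handlers = {
  "1ms interval"  => proc { |_count| sleep(1e-3) },
  "60us interval" => proc { |_count| sleep(6e-5) },
}

results = handlers.to_h do |name, handler|
  elapsed = Benchmark.realtime do
    100.times { |count| handler.call(count) } # stand-in for contended retries
  end
  [name, elapsed]
end

results.each { |name, secs| puts format("%-14s %.4fs", name, secs) }
```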
As you can see, the delta between the fastest and slowest mean is only 3.113 milliseconds. This is close enough to be considered equal in my mind. Given that, I say we go with the simplest implementation:

```ruby
retry_interval = 1e-3 # 1 millisecond
timeout_seconds = 5
busy_handler do |count|
  next false if (count * retry_interval) > timeout_seconds
  sleep(retry_interval)
end
```
That benchmark assumes the program is single-threaded, hence no other thread competes for the GVL.
But anyway, I'm not convinced at all that it's Rails' job to deal with this. Such logic would have a much better place in the `sqlite3` gem.
Hmm. @byroot, if you mean the gem should implement a better default busy timeout (one which would allow the release of the GVL), I would buy that. The current implementation just calls the default SQLite busy timeout handler, which is a C function.
That's the idea. But note that it's a double-edged sword, because if the gem does:
Step 3 here may take a while (e.g. a 100ms thread quantum, and perhaps worse if the running thread is stuck in C-land). So you might be holding the writer lock for a long time while you wait for the GVL, degrading the overall latency. So I'm not certain this is really that good of a solution. I'd need to spend time thinking and trying things, but it's not really a priority right now.
@byroot Could you add a bit more thought around what worst-case scenario any solution (including a lower-level one in the `sqlite3` gem) would need to handle?

Even in your updated benchmark, we still see that there is effectively no difference between any of the implementations.

And the central problem of regular and frequent busy exceptions is legitimately the number one most common reason I hear from people on why they don't try SQLite. But it is a solvable problem.
Also, the reason I made a PR to Rails and not sqlite3-ruby is that the handler is for handling a timeout, which is defined in, and for, your Rails app, not the lower-level SQLite wrapper. Plus, I have a working solution used via the enhanced adapter gem in production apps now. And I have seen hesitation to consider making low-level changes to the sqlite3-ruby gem and its relationship to the GVL and SQLite's C interface.
Yes. Rule number one of resilient software is "fail fast". If you retry too aggressively and wait too long, you can create a situation where requests pile up, causing the whole service to grind to a halt, instead of just having endpoints that need to do writes return a 500 or something. In other words, trading latency for throughput isn't always a good idea.
I would. Because down the line I likely have an overall request timeout, so if my DB query timeout isn't reliable I can't properly set my overall request timeout.
It can be mitigated, but the root cause of this problem is that SQLite only supports a single concurrent writer, and there is no solving that.
The underlying wrapper is in a better place to provide a simple, GVL-releasing timeout.
That is indeed helpful clarification, thank you. To clarify: if @flavorjones and I get something like a GVL-releasing timeout into the `sqlite3` gem, would that resolve your concerns?

My primary goal is to make the default, out-of-the-box experience for a Rails developer using SQLite resilient and solid. Since Rails generates a default `database.yml` with a `timeout` option, that behavior should flow from the existing configuration.

Given the minuscule performance differences, I completely support using the most precise implementation of the timeout handling.
Yes.
Ok. Perfect. I'll close this PR then and start working in the `sqlite3` gem.

I really appreciate your thoughtful guidance here, and your patience with me as I tried to keep up with your points and perspective. You have been very helpful.
Motivation / Background
As SQLite's popularity grows as a production database engine for Rails applications, so does the need for robust and resilient default configuration. One of the most common issues faced when using SQLite in a Rails application is the occasional `ActiveRecord::StatementInvalid (SQLite3::BusyException: database is locked)` exception. These occur when a connection cannot acquire SQLite's database lock within the allotted timeout.

Detail
This is prone to happen when running your Rails app with a multi-threaded server, like Puma in clustered mode, because the SQLite `busy_timeout` C function does not release the Ruby global interpreter lock (GIL). So, while one connection is attempting to acquire the SQLite database lock, no other Ruby threads can perform work or progress. In order to improve concurrency support for Active Record's SQLite3 adapter, we must implement a custom Ruby `busy_handler` function that still respects the timeout, releases the GIL, and doesn't overly thrash.

This implementation achieves these goals by
- sleeping while waiting to retry again, thus releasing the GIL
- sleeping 60 microseconds to ensure other threads have an opportunity to acquire the GIL and progress

Additional information
I have written a more in-depth exploration of this problem on my blog: https://fractaledmind.github.io/2023/12/11/sqlite-on-rails-improving-concurrency/.
Together with #50371, this PR is one-half of the story of stabilizing the SQLite3 adapter's ability to handle concurrency.