Improve concurrency support for SQLite3 #50370

Closed

fractaledmind wants to merge 1 commit into rails:main from fractaledmind:ar-busy-handler-timeout

Conversation

@fractaledmind (Contributor) commented Dec 16, 2023

Motivation / Background

As SQLite's popularity grows as a production database engine for Rails applications, so does the need for robust and resilient default configuration. One of the most common issues faced when using SQLite in a Rails application is the occasional ActiveRecord::StatementInvalid (SQLite3::BusyException: database is locked) exception. These occur when a connection cannot acquire SQLite's database lock within the allotted timeout.

Detail

This is prone to happen when running your Rails application with a multi-threaded server, like Puma in clustered mode, because the SQLite busy_timeout C function does not release the Ruby global interpreter lock (GIL). So, while one connection is attempting to acquire the SQLite database lock, no other Ruby threads can perform work or make progress. In order to improve concurrency support for ActiveRecord's SQLite3 adapter, we must implement a custom Ruby busy_handler function that still respects the timeout, releases the GIL, and doesn't overly thrash.

This implementation achieves these goals by:

  1. sleeping between retries, thus releasing the GIL
  2. checking whether the timeout has elapsed only once every 100 retries, to minimize Ruby execution
  3. sleeping for 60 microseconds, to ensure other threads have an opportunity to acquire the GIL and make progress
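The three points above can be sketched as a plain Ruby callback. This is a minimal illustration, not the PR's actual diff: the `timeout_seconds` value is assumed from the generated `timeout: 5000` (milliseconds), and `busy_handler` is written here as a bare lambda rather than the sqlite3 gem's `Database#busy_handler` method.

```ruby
# Sketch of the handler described above (illustration only, not the PR's
# exact code). SQLite calls the handler with the number of times it has
# been invoked for the current lock attempt; returning false gives up
# (raising SQLite3::BusyException), while a truthy value asks SQLite to retry.
timeout_seconds = 5.0  # assumed from the generated `timeout: 5000` (ms)
retry_interval  = 6e-5 # 60 microseconds

busy_handler = lambda do |count|
  if (count % 100).zero?
    # Only on every 100th invocation: estimate elapsed time from the retry
    # count and give up once the timeout budget is exceeded.
    next false if count * retry_interval > timeout_seconds
  end
  sleep(retry_interval) # puts this thread to sleep, releasing the GIL
  true
end

# With the sqlite3 gem, this would be installed via:
#   db.busy_handler(&busy_handler)
```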

Additional information

I have written a more in-depth exploration of this problem on my blog: https://fractaledmind.github.io/2023/12/11/sqlite-on-rails-improving-concurrency/.

Together with #50371, this PR is one half of the story of stabilizing the SQLite3 adapter's ability to handle concurrency.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

The SQLite3 `busy_timeout` C API will hold the global interpreter lock (GIL) for the duration of the retry process, disallowing any other Ruby threads from running while this thread is waiting for a database connection to become available. In order to allow concurrent threads to naturally coordinate their order of execution, the `busy_timeout` C API has been replaced with the `busy_handler` C API. This allows the Ruby thread to be put to sleep while waiting for a database connection to become available, allowing other Ruby threads to run in the meantime, while also respecting the `timeout` option.
@fractaledmind (Contributor Author)

Also, I added a deprecation for the retries option as the new busy_handler implementation for the timeout option is superior in every way. The retries option was a (failed) initial attempt at dealing with the concurrency issue. Now that I am confident in the new solution, I believe we should remove this option altogether.

@byroot (Member) left a comment

This is prone to happen when running your Rails application with a multi-threaded server,

If the assumption is that the other writer is likely to be in another thread in the same process, it seems to me that it would be vastly more efficient to first acquire an in-process lock before attempting to acquire a write lock on the database.

Also, 60 microseconds doesn't seem like much; it's almost the same as busy looping, and it's not clear what kind of impact it may have on other threads.

But more importantly, your code seems to assume that the callback is invoked every 60 usec (hence the `count * retry_interval`), but to exit the sleep the thread has to re-acquire the GVL, so this can take way longer (there is really no limit).

raise TypeError, "Timeout must be an integer" unless timeout.is_a?(Integer)

timeout_seconds = timeout.fdiv(1000)
retry_interval = 6e-5 # 60 microseconds
@byroot (Member):

How was this value selected?

@fractaledmind (Contributor Author):

Loosely and quickly 😅

I wanted a value that would allow for sub-millisecond resolution, since SQLite can perform various operations in microseconds (https://www.sqlite.org/fasterthanfs.html#approx, https://github.com/chrisdavies/dbench) but also wouldn't create too much thrashing (busy looping, as you say). 60 seconds in a minute, 60 minutes in an hour. 60 just seemed like a place to start, and when running the code, I went from seeing errors all the time to never seeing errors, so I've kept it.

I opened the PR looking to get wiser insight from people like you, as while I have a clear picture of the shape of the problem and how I can generally fix it, I am in no way certain that this is the perfect or even one of the better solutions. I'm very much open to finding a better solution to the problem.

Comment on lines +762 to +764
if (count % 100).zero?
# fail if we exceed the timeout value (captured from the timeout config option, converted to seconds)
timed_out = (count * retry_interval) > timeout_seconds
@byroot (Member):

Sounds really weird to me. You say it's to "minimize the Ruby execution", but in the end, with the extra modulo etc., I'm pretty sure you are doing as much work as if you'd just precomputed the max_count and compared it.

But even without precomputing, this amount of code is absolutely trivial.
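For comparison, the "precompute the max_count" variant suggested here could look like the following sketch (my illustration under the same assumed timeout and interval values, not code from the PR):

```ruby
# Sketch of the precomputed-max_count alternative: derive the retry budget
# once, then compare the counter directly, with no modulo or multiplication
# on each invocation. Values are assumed for illustration.
timeout_seconds = 5.0
retry_interval  = 6e-5 # 60 microseconds
max_count       = (timeout_seconds / retry_interval).ceil

busy_handler = lambda do |count|
  next false if count > max_count # budget exhausted: let SQLite raise
  sleep(retry_interval)           # release the GVL between retries
  true
end
```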

@fractaledmind (Contributor Author):

My understanding of what you mean by "precomputing the max_count and comparing it" is the retries mechanism recently added. And this is precisely the reason I added that option. It has the benefit of doing the bare minimum amount of work in Ruby land, while still allowing the GVL to be released across retrying connections.

After getting some community feedback though, I realized that it does have the major downside of providing no real intuition for what number to put as the max_count/retries. One example required 1M+ retries to get stable performance.

I think the semantics and intuition of timeout are vastly superior to retries. I also think that having some cap so that completely broken connections do eventually error is necessary. But, I also want "normal" connections, in "normal" multi-threaded contexts to not error at all. So, the goal here is to find the middle ground that achieves these goals. Whatever ways we can find to improve this busy_handler callback while still achieving these goals I am all in for.

@fractaledmind (Contributor Author)

If the assumption is that the other writer is likely to be in another thread in the same process, it seems to me that it would be vastly more efficient to first acquire an in-process lock before attempting to acquire a write lock on the database.

This article explores how using an "in-language" mutex performs relative to using SQLite's mutex across a collection of different operation scenarios. When only reading, there was essentially no difference¹; when only writing, there was a notable difference past 64 concurrent connections²; when both reading and writing, "the performance is generally not quite as high without the [in-language] mutex compared to with the mutex. But also, the worst performance isn’t as bad without the mutex compared to with the mutex. I suspect this has something to do with how the RWMutex blocks readers when writing, which SQLite itself doesn’t do when writing."

Given that result, he then also considered using two database connection pools, one for reads (with unlimited connections) and one for writes (with just one allowed connection, so effectively a mutex). A mix of reading and writing were tested again in this situation, and "the results are quite similar to the no-mutex version above, except the worst-case performance seems more consistent across different degrees of parallelism."

In the end, he concluded — "Using mutexes and different connections pools has its benefits and drawbacks. If in doubt, I would probably use neither and just stick with one big connection pool for all connections."

I am open to exploring alternatives, but I do want to ensure that concurrent reads are kept. This is the major benefit of WAL mode, and the biggest and most important change that we made with 7.1.0.

Also, 60 microseconds doesn't seem like much; it's almost the same as busy looping, and it's not clear what kind of impact it may have on other threads.

I agree. I wanted to start with something just above busy looping to allow for the finest-grained queue resolution possible. I am still trying to come up with a simple benchmark to allow us to experiment with different busy_handlers in different operational scenarios to see which approach is best. At the moment, the benchmark I have used goes through the entire Rails stack (https://github.com/fractaledmind/rubyconftw/blob/main/app/controllers/benchmarking_controller.rb) and has too much randomness. I am trying to simplify this approach to focus exclusively on SQLite3::Database connections and Threads outside of the context of Rails, with a completely stable operational load, to allow the differences between different possible busy_handler implementations to shine through clearly.

But more importantly, your code seems to assume that the callback is invoked every 60 usec (hence the `count * retry_interval`), but to exit the sleep the thread has to re-acquire the GVL, so this can take way longer (there is really no limit).

My goal was not to assume any precise rhythm, only to ensure that the timeout value is respected within reason, where my intuition is that "within reason" would look like: don't throw a busy exception earlier than timeout and don't throw a busy exception later than timeout + 500 milliseconds. I did not hard code the latter rule into the handler as I didn't want to add the computational overhead. My sense was that it was reasonable to presume that re-acquiring the GVL wouldn't take longer than 500 milliseconds (- 60us).


I know my answers are all a bit vague. Hopefully, though, you can see the high-level constraints that I am trying to balance. I readily admit that this might not be the right solution to the problem that balances these constraints. I opened the PR because it is a solution that balances these constraints, and the problem absolutely needs to be solved at the Rails level. And I knew by opening the PR, I could start conversations with top-flight Rails developers like yourself to find the best possible solution together. So, the vagueness comes from the fact that I do not hold any aspect of this solution as sacrosanct. I only hold the problem as requiring a solution.

Footnotes

  1. https://www.golang.dk/articles/benchmarking-sqlite-performance-in-go#DB_ReadPost

  2. https://www.golang.dk/articles/benchmarking-sqlite-performance-in-go#DB_WritePost

@fractaledmind (Contributor Author)

@byroot: After traveling some, I found some time to work up a benchmarking script to better explore the performance differences between different implementations of a busy_handler. You can find my benchmarking script here if you'd like to run it yourself. Here are the details from my runs though:

I experimented by varying 3 details:

  1. what retry interval does the busy_handler use
  2. does the handler use a modulo to minimize timeout checks
  3. does the handler do a more precise timeout check

I benchmarked every combination of these variations. I ran the benchmark 50 times, randomizing the order of implementations and using Benchmark.bmbm to minimize perf noise and strengthen the overall signal of the reported results. I aggregated the benchmarking runs and looked at both the mean and the median to see which implementations are more performant and by what degree.¹

When looking at the mean, we see roughly what we would expect; that is, having a finer-grained retry interval allows for slightly faster execution times. However, to my mind, the overall picture demonstrates that the perf differences simply aren't really that meaningful, so we should probably optimize for code readability, maintainability, and predictability.

| interval | precise? | modulo? | mean     | median   |
|----------|----------|---------|----------|----------|
| 60us     | true     | true    | 2.011489 | 2.012199 |
| 60us     | true     | false   | 2.011776 | 2.012169 |
| 60us     | false    | false   | 2.011876 | 2.012005 |
| 60us     | false    | true    | 2.012015 | 2.012695 |
| 1ms      | false    | false   | 2.013657 | 2.013681 |
| 1ms      | true     | false   | 2.014069 | 2.013985 |
| 1ms      | false    | true    | 2.014216 | 2.014409 |
| 1ms      | true     | true    | 2.014602 | 2.014305 |

As you can see, the delta between the fastest and slowest mean is only 3.113 milliseconds. This is close enough to be considered equal in my mind.

Given that, I say we go with the simplest busy_handler with the simplest interval:

retry_interval = 1e-3 # 1 millisecond
timeout_seconds = 5
busy_handler do |count|
  # `next`, not `return`: returning from a block would raise LocalJumpError
  next false if (count * retry_interval) > timeout_seconds

  sleep(retry_interval)
end

Footnotes

  1. I also did some checks to ensure that the number of invocations of the busy_handler essentially matched expectations (e.g. 1000/60), and they did; that is, when using a 60 microsecond retry interval, the busy_handler is called ~16× more often.

@casperisfine (Contributor)

That benchmark assumes the program is single-threaded, hence no other thread competes for the GVL.

See: https://gist.github.com/casperisfine/53874924989b0520d22b31375b274b27

Rehearsal ------------------------------------------
~01ms%   2.936606   0.005398   2.942004 (  2.949760)
^01ms-   2.953363   0.004228   2.957591 (  2.963247)
~60us%   2.941404   0.005308   2.946712 (  2.956542)
~60us-   2.955617   0.005538   2.961155 (  2.966896)
^60us%   2.951074   0.004302   2.955376 (  2.961478)
^01ms%   2.948821   0.004708   2.953529 (  2.959705)
~01ms-   2.916062   0.005160   2.921222 (  2.976912)
^60us-   2.952263   0.004406   2.956669 (  2.962022)
-------------------------------- total: 23.594258sec

             user     system      total        real
~01ms%   2.923195   0.008168   2.931363 (  2.953626)
^01ms-   2.947695   0.004403   2.952098 (  2.958378)
~60us%   2.943108   0.004328   2.947436 (  2.954118)
~60us-   2.944694   0.004686   2.949380 (  2.955218)
^60us%   2.944198   0.003781   2.947979 (  2.953552)
^01ms%   2.947019   0.004482   2.951501 (  2.958295)
~01ms-   2.943193   0.004268   2.947461 (  2.955124)
^60us-   2.948041   0.004003   2.952044 (  2.957150)
Rehearsal ------------------------------------------
~01ms-   2.947294   0.004244   2.951538 (  2.956414)
^60us%   2.950053   0.004201   2.954254 (  2.960218)
^01ms-   2.943538   0.004287   2.947825 (  2.957053)
^01ms%   2.939451   0.004797   2.944248 (  2.951419)
~60us-   2.949422   0.004496   2.953918 (  2.959852)
^60us-   2.950441   0.004146   2.954587 (  2.961246)
~01ms%   2.945014   0.004545   2.949559 (  2.955216)
~60us%   2.947575   0.004428   2.952003 (  2.957626)
-------------------------------- total: 23.607932sec

             user     system      total        real
~01ms-   2.918656   0.008690   2.927346 (  2.955671)
^60us%   2.946097   0.003956   2.950053 (  2.955129)
^01ms-   2.945925   0.004541   2.950466 (  2.957782)
^01ms%   2.923187   0.008146   2.931333 (  2.952630)
~60us-   2.939946   0.004225   2.944171 (  2.953618)
^60us-   2.949804   0.004199   2.954003 (  2.959855)
~01ms%   2.942872   0.004414   2.947286 (  2.955766)
~60us%   2.949557   0.004290   2.953847 (  2.960483)

sleep(duration) is absolutely not guaranteed to resume as soon as duration has elapsed; it's just guaranteed not to resume sooner. Hence counting iterations is not precise at all; you'd need to check the process clock.
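A clock-based check along these lines might look like the following sketch (my illustration of the suggestion, not code from the thread): it reads the monotonic clock on each invocation instead of inferring elapsed time from the retry count, so oversleeping no longer skews the timeout.

```ruby
# Sketch of a busy_handler that checks the process clock rather than
# counting iterations. SQLite resets the invocation count to 0 for each
# new lock attempt, which is used here to (re)arm the deadline.
timeout_seconds = 5.0  # assumed value for illustration
retry_interval  = 6e-5 # 60 microseconds
deadline        = nil

busy_handler = lambda do |count|
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  deadline = now + timeout_seconds if count.zero? # new lock attempt
  next false if now >= deadline # real elapsed time exceeded the budget
  sleep(retry_interval)
  true
end
```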

But anyway, I'm not convinced at all that it's Rails' job to deal with this. Such logic would have a much better place in the sqlite3 gem, to be seen with @flavorjones. I won't have time to look at that this year anyway.

@flavorjones (Member)

Hmm. @byroot, if you mean the gem should implement a better default busy timeout (one which would allow the release of the GVL), I would buy that. The current implementation just calls the default SQLite busy timeout handler, which is a C function.

@byroot (Member) commented Dec 20, 2023

if you mean the gem should implement a better default busy timeout (one which would allow the release of the GVL) I would buy that.

That's the idea. But note that it's a double-edged sword, because if the gem does:

  • release the GVL
  • acquire the SQLite writer lock
  • acquire the GVL

Step 3 here may take a while (e.g. a 100ms thread quantum, and perhaps worse if the running thread is stuck in C-land). So you might be holding the writer lock for a long time while you wait for the GVL, degrading the overall latency.

So I'm not certain this is really that good of a solution. I'd need to spend time thinking and trying things, but not really a priority right now.

@fractaledmind (Contributor Author)

@byroot Could you expand a bit on the worst-case scenario in which any solution (whether a lower-level one in the sqlite3-ruby gem's interface with the C API or a Ruby busy_handler that releases the GVL) would provide a worse end-user experience than the busy exceptions that are thrown regularly and frequently when a Rails app using SQLite runs, for example, Puma in clustered mode?

Even in your updated benchmark, we still see that there is effectively no difference between any of the busy_handler implementations. And if the worst-case scenario is that your app sets a timeout of 5 seconds, but a busy exception isn't thrown until 6 seconds after the first connection attempt was made, I don't think an end-user would report that as a bug. I personally have always interpreted the timeout as a minimum, not an exact value.

And the central problem of regular and frequent busy exceptions is legitimately the number one most common reason I hear from people on why they don't try SQLite. But, it is a solvable problem.

@fractaledmind (Contributor Author)

Also, the reason I made a PR to Rails and not SQLite3 is that the handler is for handling a timeout, which is defined in and for your Rails app, not the lower-level SQLite wrapper.

Plus, I have a working solution used via the enhanced adapter gem in production apps now. And I have seen hesitation to consider making low level changes to the sqlite3-ruby gem and its relationship to the GVL and SQLite's C interface.

@byroot (Member) commented Dec 21, 2023

a bit more thought around under what worst-case scenario any solution [..] would provide a worse end-user experience than the busy exceptions that are thrown regularly and frequently

Yes. Rule number one of resilient software is "fail fast". If you retry too aggressively and wait too long, you can create a situation where requests pile up, causing the whole service to grind to a halt, instead of just having the endpoints that need to do writes return a 500 or something.

In other words, trading latency for throughput isn't always a good idea.

your app sets a timeout of 5 seconds, but a busy exception isn't thrown until 6 seconds after the first connection attempt was made, I don't think an end-user would report that as a bug.

I would. Because down the line I likely have an overall request timeout, so if my DB query timeout isn't reliable I can't properly set my overall request timeout.

But, it is a solvable problem.

It can be mitigated, but the root cause of this problem is that SQLite only supports a single concurrent writer, and there is no solving that.

Also, the reason I made a PR to Rails and not SQLite3 is that the handler is for handling a timeout, which is defined in and for your Rails app, not the lower level SQLite wrapper.

The underlying wrapper is in a better place to provide a simple lock_wait_timeout configuration.

@fractaledmind (Contributor Author)

That is indeed helpful clarification, thank you.

To clarify, if @flavorjones and I get something like a lock_wait_timeout added to the sqlite3-ruby gem, would you support mapping the timeout option in the database.yml file to this implementation instead of the concurrency-blocking busy_timeout function?

My primary goal is to make the default, out-of-the-box experience for a Rails developer using SQLite resilient and solid. Since Rails generates a database.yml file with a timeout: 5000 value, I want the implementation to be concurrency-friendly.

Given the minuscule performance differences, I completely support using the most precise implementation of the busy_handler to ensure that the timeout is reliable. I am also on board to drive the code for this implementation into the sqlite3-ruby gem. If those things are true and in place, would you support the timeout option for the SQLite adapter using such a reliable lock_wait_timeout implementation of busy_handler?

@byroot (Member) commented Dec 21, 2023

would you support mapping the timeout option in the database.yml file to this implementation instead of the concurrency-blocking busy_timeout function?

Yes.

@fractaledmind (Contributor Author)

Ok. Perfect. I'll close this PR then and start working in the sqlite3-ruby repo. I'll open a new PR here later once everything over there is completed and released.

I really appreciate your thoughtful guidance here, and in being patient with me as I tried to keep up with your points and perspective. You have been very helpful.
