Fix pool thread safety issues (option 2)#893
Conversation
The pool implementation had an unsound `unsafe impl Sync` on `PoolInner` containing `Vec<PoolPlugin>` where `PoolPlugin` was `Rc<RefCell<Plugin>>`. This caused intermittent "RefCell already borrowed" panics in multi-threaded usage (issue extism#891). The problem with the original design was that it seemed to assumed external locking (the pool's `Mutex`) would be sufficient to make `Rc<RefCell<>>` safe to share across threads. However, there were several issues that made the design unsound: 1. `PoolPlugin` had no `Drop` implementation to synchronize refcount decrements. When dropped on different threads, unsynchronized `Rc` refcount modifications could lead to data races. 2. The pool used `Rc::strong_count == 1` to track availability, but had no control over when clones were created or destroyed. Callers could clone and keep `PoolPlugin` instances, breaking the availability tracking. 3. The `PoolPlugin` type didn't leverage the type system to enforce safe usage patterns; it relied entirely on external locking that the type system couldn't verify. This replaces the unsafe implementation with proper thread-safe primitives and move semantics inspired by the deadpool crate. Changes: - Replace shared `Rc<RefCell<Plugin>>` with move semantics: plugins are moved out of a `VecDeque` when checked out and moved back on drop - `PoolPlugin` now owns the plugin and implements `Drop` to automatically return it, making checkout/return atomic and type-safe. - Add `Send` bound to `UserDataHandle::Rust` (`dyn Any` → `dyn Any + Send`). `UserData<T>` already had the `Send` bound, but when converted to `UserDataHandle::Rust` it was being removed during type erasure. This was necessary to make `PluginBuilder` with custom functions `Send`, enabling pools with custom host functions - Replace busy-wait polling with `Condvar` for efficient waiting - `PoolPlugin` implements `Deref`/`DerefMut` for ergonomic access - Use `Weak` reference to pool so checked-out plugins don't keep pool alive - Remove all `unsafe impl Send/Sync`. Soundness is now enforced by the type system. - Fix `function_exists` to handle timeout gracefully instead of panicking. - Add test for pools with captured `PluginBuilder` containing custom functions. Breaking API changes: - `PoolPlugin` no longer has a `.plugin()` method - it now implements `Deref`/`DerefMut`, so you can use it directly where you previously called `.plugin()`. Example: `pool.get()?.call(...)` instead of `pool.get()?.plugin().call(...)` - `PoolPlugin::call()` now takes `&mut self` instead of `&self` (direct ownership instead of interior mutability via `RefCell`) - `UserData` types must be `Send` (previously claimed via `unsafe impl` but not enforced through type erasure to `dyn Any`) Fixes extism#891
…e and Function This alternative approach trusts that the existing 'unsafe impl Send/Sync for UserData<T>' is sound (T is never extracted from the Arc<Mutex<>>), and extends the unsafe chain to UserDataHandle and Function rather than adding explicit T: Send bounds. Changes: - Add 'unsafe impl Send/Sync' for UserDataHandle. This trusts that the `UserData<T>`s unsafe Send/Sync impls being sound. - Keep 'unsafe impl Send/Sync for Plugin' - No T: Send bounds on UserData::new() or with_function() - Pool works correctly with move semantics and Arc<Mutex<Plugin>> Trade-offs vs the other approach: + No breaking API changes. Users can continue use non-Send types in single-threaded contexts. - Less safe Fixes extism#891
|
Thanks @bgmerrell! I think both this and #892 are solid. It does seem like #892 may be a slightly safer implementation, and if we're rolling out a fix for that reason maybe we go with that option. Does that work for you? Let me know if you've leaned more in one direction or the other since opening these PRs. |
|
@nilslice #892 is safer from a code safety perspective for sure because it doesn't use So I created this PR which follows this pattern. Either PR works for me, and I've been running my unit and integration tests for my project with a patched Extism with this PR here for a couple of weeks. |
Yes, thanks for calling this out. I think originally, we planned for mostly non-Rust embedders via FFI, in which case there was no safety enforcement possible. Rust has gained a lot of usage since then though so it's worth tightening up places that we didn't give as much attention to. I think #892 is the way to go then. I'll let that thought simmer for another day though, and then close this PR and merge that one. Will ping you with any more questions :) |
|
@nilslice just a polite little bump on this ;) I'd like to ditch my extism fork as soon as possible |
|
Thank you @bgmerrell! |
|
Release v1.20.0 is on its way out |
|
ah.. my bad. I took your comment here as indicating this was your expected version to merge. I'll probably revert and release #892 Doing too much at once today 😢 |
|
@nilslice was thinking about this more this evening.. it's still a net positive that this one landed. It fixes a real bug and it's consistent with the other Extism code. It also let's me get rid of my fork (thanks for the release!). It's also what I ended up using that last several weeks in my local testing/development because I thought it would be the more likely to be merged... turns out I was right! Just for an unexpected reason! ;) |
|
@bgmerrell - well that's good to hear! since you've exercised the implementation all this time, I am confident in it and if it's all the same to you, will just keep it as-is. we have the alternative in the event things should change and can always revisit the decision then. |
This is an alternative to #892
The pool implementation had an unsound
unsafe impl SynconPoolInnercontainingVec<PoolPlugin>wherePoolPluginwasRc<RefCell<Plugin>>. This caused intermittent "RefCell already borrowed" panics in multi-threaded usage (issue #891).The problem with the original design was that it assumed external locking (the pool's
Mutex) would be sufficient to makeRc<RefCell<>>safe to share across threads. However, there were several issues that made the design unsound:PoolPluginhad noDropimplementation to synchronize refcount decrements. When dropped on different threads, unsynchronizedRcrefcount modifications could lead to data races.Rc::strong_count == 1to track availability, but had no control over when clones were created or destroyed. Callers could clone and keepPoolPlugininstances, breaking the availability tracking.PoolPlugintype didn't leverage the type system to enforce safe usage patterns; it relied entirely on external locking that the type system couldn't verify.This replaces the unsafe implementation with proper thread-safe primitives and move semantics inspired by the deadpool crate.
Changes:
Rc<RefCell<Plugin>>with move semantics: plugins are moved out of aVecDequewhen checked out and moved back on dropPoolPluginnow owns the plugin and implementsDropto automatically return it, making checkout/return atomic and type-safeunsafe impl Send/SyncforUserDataHandleto complete the chain of unsafe assertions. This trusts that the existingunsafe impl<T> Send for UserData<T>is sound (verified:UserData::get()only returnsArc<Mutex<T>>, never extractsT, so it cannot be moved across threads unsafely)Condvarfor efficient waitingPoolPluginimplementsDeref/DerefMutfor ergonomic accessWeakreference to pool so checked-out plugins don't keep pool alivefunction_existsto handle timeout gracefully instead of panickingPluginBuildercontaining custom functionsBreaking API changes:
PoolPluginno longer has a.plugin()method - it now implementsDeref/DerefMut, so you can use it directly where you previouslycalled
.plugin(). Example:pool.get()?.call(...)instead ofpool.get()?.plugin().call(...)PoolPlugin::call()now takes&mut selfinstead of&self(directownership instead of interior mutability via
RefCell)Fixes #891