Handle early finalization of TSFN by Node.js#744
Handle early finalization of TSFN by Node.js#744kjvalencik merged 1 commit intoneon-bindings:mainfrom
Conversation
| /// function for scheduling tasks to execute on a JavaScript thread. | ||
| pub struct ThreadsafeFunction<T> { | ||
| tsfn: Tsfn, | ||
| is_finalized: Arc<Mutex<bool>>, |
There was a problem hiding this comment.
Can this be an AtomicBool instead of a Mutex<bool>? I'm also not clear why it needs to be an Arc.
There was a problem hiding this comment.
Arc provides a nice way of passing the pointer to the C function, because unlike Box you can clone it before turning into a raw pointer, and then decrement the reference count when turning it back from the raw pointer to the Arc instance and dropping it.
I think it could be AtomicBool. Let me try that in a moment.
There was a problem hiding this comment.
Moved to AtomicBool. @jrose-signal must be really happy to see this happen 😉
There was a problem hiding this comment.
Yeah, it has to be something indirect because ThreadsafeFunction is passed around by value. Box is probably safe because the finalizer callback is guaranteed to be invoked before Drop::drop completes (either Node invokes it on shutdown or we invoke it in drop), but it's a lot easier to mess that up somehow.
|
I think you're right about the race in #739 (comment). :-( A 99% multithreading fix can be worse than no fix, because it makes it harder to track down further issues. |
|
There more I think about it the more I agree that this is a serious bug in Node.js that has to be fixed. Another race condition that I just discovered is with the |
|
Nevermind, I'm a fool. ThreadsafeFunction can only be destroyed from the main loop because it might hold a reference to a JS function. The documentation explicitly says:
Now, I've checked all code paths and it cannot crash as I previously thought. To prevent the race condition in this patch, however, we'll have to hold the mutex for the duration of the |
|
Alright, pushed an update. Sadly
In both of the cases we get the desired result. This PR is ready for a review again. |
|
Had to add a mutex to the
This is quite inconvenient for Neon, but would likely be needed until we'll get some resolution in nodejs/node#38775 (which would be an documentation change, but hopefully not ABI change) |
Thread-safe functions can get released early by Node.js itself during the teardown of the Environment that holds them. When this happens - `finalize_cb` of the thread-safe function is invoked to notify us. This change makes us remember if the thread-safe function was finalized before we got into the `Drop` implementation for it so that we don't deallocate it twice. Additionally, we should not invoke `call_threadsafe_function` after either `finalize_cb` call or `napi_closing` return value from previous `call_threadsafe_function`. Handle this by holding a mutex around the call, which doesn't introduce any additional locking because `call_threadsafe_function` itself holds another mutex while active.
|
Force pushed PR with a rebase over main branch and all changes squashed. |
| // Node.js | ||
| if *is_finalized { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Don't you need to drop the lock before calling napi::release_threadsafe_function?
There was a problem hiding this comment.
Nope, actually I don't the idea here is that we want finalize_cb to block until release_threadsafe_function completes, because otherwise finalize_cb might finish first and delete the tsfn in C++.
There was a problem hiding this comment.
Mutex isn't re-entrant, though, so in the non-early-exit case you'll lock, call release_threadsafe_function, and fail when it's time to lock again.
There was a problem hiding this comment.
release_threadsafe_function doesn't call finalize_cb directly. It schedules an event on the loop thread and executes finalize_cb in another tick.
|
See: napi-rs/napi-rs#522 . We are not the first to run into it! napi-rs took a milder approach of just leaking the function and letting Node.js reclaim it 😂 |
|
After fixing a leak in #750, this started happening every time with the global queue. I think I fully understand this fix now and appreciate it. Thanks! |
|
No worries! Let's land the thing! 😉 |
|
Thanks so much for this fix! I'm going to release it along with this PR: #750 |
Thread-safe functions can get released early by Node.js itself during
the teardown of the Environment that holds them. When this happens -
finalize_cbof the thread-safe function is invoked to notify us. Thischange makes us remember if the thread-safe function was finalized
before we got into the
Dropimplementation for it so that we don'tdeallocate it twice.
Additionally, we should not invoke
call_threadsafe_functionaftereither
finalize_cbcall ornapi_closingreturn value from previouscall_threadsafe_function. Handle this by holding a mutex around thecall, which doesn't introduce any additional locking because
call_threadsafe_functionitself holds another mutex while active.See: #739 (comment)