Skip to content

Revert #16288#25827

Merged
yashykt merged 1 commit intogrpc:masterfrom
yashykt:revert16288
Apr 7, 2021
Merged

Revert #16288#25827
yashykt merged 1 commit intogrpc:masterfrom
yashykt:revert16288

Conversation

@yashykt
Copy link
Copy Markdown
Member

@yashykt yashykt commented Mar 28, 2021

#16288 added ExecCtx::Flush() to CallCombiner sites where SetNotifyOnCancel() was invoked but did not add it to places where Cancel() was invoked. This leads to a bug where when RPCs are cancelled, we invoke the call combiner cancellation closure but we do not flush ExecCtx, which can cause SEGFAULTs when the cancellation closure is actually invoked.

To fix this we could also add ExecCtx::Flush to sites where we do a Cancel() but sprinkling ExecCtx::Flush() across the codebase and expecting future modifications to remember to flush it seems like a recipe for future bugs. A better way to do this would be to simply change CallCombiner to use Closure::Run instead of ExecCtx::Run. Reverting #16288 first

Internal context - b/179905578

@yashykt yashykt requested a review from markdroth as a code owner March 28, 2021 18:40
@yashykt yashykt added release notes: yes Indicates if PR needs to be in release notes lang/core labels Mar 28, 2021
@apolcyn
Copy link
Copy Markdown
Contributor

apolcyn commented Apr 6, 2021

Friendly ping - this still needs to be merged, right?

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.

This looks good. In practice, it seems like most callers of SetNotifyOnCancel() need to hold a ref to the call stack for their cancellation objects anyway; I suspect that the auth filters are actually the exception here, not the rule.

We should brainstorm about next steps here. Independently of this, I've been starting to think that the call combiner cancellation mechanism may be too heavy-weight. Most callers of SetNotifyOnCancel() wind up needing a mutex for synchronization anyway, but the whole point of the call combiner was to prevent individual filters from needing mutexes, so it's clearly not completely succeeding in its goal here.

One possible alternative would be to move the cancel_stream op out of grpc_transport_stream_op_batch and replace it with a new cancel_stream() method on the filter API, which would basically work the same way as start_transport_stream_op_batch() except that it can be invoked without holding the call combiner. That way, filters may still need mutexes to handle cancellation, but they would not need any synchronization for registering a cancel callback with the call combiner itself.

A more radical approach might be to reconsider whether the call combiner as a whole really still makes sense. Part of the original idea of it was that we thought that there were going to be a lot more filters added over time, so it made sense to amortize the synchronization cost by doing it at a central point instead of having each individual filter do it. But since we are now expecting most plugins to be done as interceptors at the wrapped language level instead of core filters, we should reevaluate whether this trade-off still makes sense.

Anyway, we should talk further about this to figure out next steps, but this PR is a good first step.

@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Apr 6, 2021

One possible alternative would be to move the cancel_stream op out of grpc_transport_stream_op_batch and replace it with a new cancel_stream() method on the filter API, which would basically work the same way as start_transport_stream_op_batch() except that it can be invoked without holding the call combiner. That way, filters may still need mutexes to handle cancellation, but they would not need any synchronization for registering a cancel callback with the call combiner itself.

Yes, I think this is doable, and I might make the code easier to understand.

@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Apr 6, 2021

looks like this has exposed a bug in the fault injection filter :)

@markdroth
Copy link
Copy Markdown
Member

One possible alternative would be to move the cancel_stream op out of grpc_transport_stream_op_batch and replace it with a new cancel_stream() method on the filter API, which would basically work the same way as start_transport_stream_op_batch() except that it can be invoked without holding the call combiner. That way, filters may still need mutexes to handle cancellation, but they would not need any synchronization for registering a cancel callback with the call combiner itself.

Yes, I think this is doable, and I might make the code easier to understand.

Actually, thinking about this further, we may want to do both -- i.e., keep the cancel_stream op in grpc_transport_stream_op_batch and add the new cancel_stream() method to the filter API. That way, filters can minimize the lock contention by having their cancel_stream() method only cancel long-running operations (the type that they currently use the call combiner cancellation callback for), and then still handle the call cancellation via the cancel_stream op in start_transport_stream_op_batch. Otherwise, if the actual call cancellation happens in the cancel_stream() method, then filters will probably have to guard a lot more data members with the mutex, which will hurt performance.

But anyway, we can talk about this more when you're ready to implement it.

@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Apr 7, 2021

Sure! Thanks for reviewing!

@yashykt yashykt merged commit 1784e6c into grpc:master Apr 7, 2021
yashykt added a commit to yashykt/grpc that referenced this pull request Apr 13, 2021
@yashykt yashykt mentioned this pull request Apr 13, 2021
yashykt added a commit to yashykt/grpc that referenced this pull request Apr 13, 2021
@yashykt yashykt mentioned this pull request Apr 13, 2021
@yashykt yashykt deleted the revert16288 branch May 18, 2023 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lang/core release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants