Add a null check for the endpoint on shutdown #15727

Merged
kpayson64 merged 1 commit into grpc:master from kpayson64:null_endpoint
Jun 14, 2018

@kpayson64
Contributor

@kpayson64 kpayson64 commented Jun 12, 2018

We are seeing a null pointer dereference internally on the call to grpc_endpoint_shutdown() a few lines below.

Hypothetical race condition:

1: A handshaker completes successfully
2: on_handshake_done is scheduled on the ExecCtx (with no error)
3: grpc_handshake_manager_shutdown() is invoked on some other thread.
4: The handshaker's shutdown function shuts down the endpoint and sets it to null (see rpc-switch)
5: ExecCtx::Flush() gets invoked on the original thread; on_handshake_done then runs with GRPC_ERROR_NONE and mgr->shutdown=true, but the endpoint has already been nulled out.
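
The five steps above can be replayed deterministically in a small model. This is only a sketch: `ToyExecCtx`, `ToyHandshakeManager`, `Endpoint`, and `ReplayRace` are hypothetical stand-ins, not gRPC types, and the real race involves two threads rather than the fixed ordering used here. The callback includes the null check so that the replay completes instead of crashing.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration; none of these are real gRPC types.
struct Endpoint {
  bool shut_down = false;
};

// Plays the role of ExecCtx: callbacks are queued, then run on Flush().
struct ToyExecCtx {
  std::vector<std::function<void()>> pending;
  void Schedule(std::function<void()> cb) { pending.push_back(std::move(cb)); }
  void Flush() {
    for (auto& cb : pending) cb();
    pending.clear();
  }
};

struct ToyHandshakeManager {
  bool shutdown = false;
  Endpoint* endpoint = nullptr;
  std::string reported_error;  // empty string models GRPC_ERROR_NONE
  bool done = false;

  // Steps 3-4: shutting down the manager shuts down the handshaker, which
  // shuts down the endpoint and nulls out the pointer.
  void Shutdown() {
    shutdown = true;
    if (endpoint != nullptr) {
      endpoint->shut_down = true;
      endpoint = nullptr;
    }
  }

  // Step 5: the deferred completion callback. Without the null check, the
  // endpoint access below would dereference a null pointer in this replay.
  void OnHandshakeDone(const std::string& error) {
    std::string result = error;
    if (result.empty() && shutdown) {
      result = "handshaker shutdown";
      if (endpoint != nullptr) {
        endpoint->shut_down = true;
        endpoint = nullptr;
      }
    }
    reported_error = result;
    done = true;
  }
};

// Replays steps 1-5 in a fixed order and returns the manager's final state.
ToyHandshakeManager ReplayRace() {
  Endpoint ep;
  ToyExecCtx exec_ctx;
  ToyHandshakeManager mgr;
  mgr.endpoint = &ep;
  exec_ctx.Schedule([&mgr] { mgr.OnHandshakeDone(""); });  // steps 1-2
  mgr.Shutdown();    // steps 3-4 (another thread in the real scenario)
  exec_ctx.Flush();  // step 5
  return mgr;
}
```

Calling `ReplayRace()` yields a manager whose callback ran after shutdown with a null endpoint and which reports an error rather than success.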

@kpayson64 kpayson64 requested a review from markdroth June 12, 2018 20:25
@grpc-testing

****************************************************************

libgrpc.so

     VM SIZE                                          FILE SIZE
 ++++++++++++++ GROWING                            ++++++++++++++
  +0.6%     +16 src/core/lib/channel/handshaker.cc     +16  +0.6%
      +1.9%     +16 call_next_handshaker_locked            +16  +1.9%

 -------------- SHRINKING                          --------------
  -0.0%     -16 [None]                                 -16  -0.0%

  [ = ]       0 TOTAL                                    0  [ = ]


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences

Member

@markdroth markdroth left a comment

Good diagnosis of this problem! I think the fix is not quite right, though.

Is there some reasonable way to write a test to catch this?

Member

I think we want to reset error on the next line even if the endpoint is null. Otherwise, we will invoke mgr->on_handshake_done with no error, and the caller will expect the endpoint to be non-null, so we'd just be moving the crash to a different place.
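
A minimal sketch of the distinction drawn here, using hypothetical names (`HandshakeResult`, `FinishNullCheckOnly`, `FinishResetErrorAlways`) rather than the real handshaker code. The first version of the patch is not shown in this thread, so the "null check only" variant is an assumption about its shape: it guards both the endpoint shutdown and the error assignment. The second variant always sets the error on shutdown and guards only the endpoint access.

```cpp
#include <string>

// Toy stand-in for illustration; not a real gRPC type.
struct Endpoint {
  bool shut_down = false;
};

struct HandshakeResult {
  std::string error;   // empty string models GRPC_ERROR_NONE
  Endpoint* endpoint;  // may be null if a handshaker already released it
};

// Assumed shape of a null-check-only fix: no crash here, but when the
// endpoint is already null the result still reports success.
HandshakeResult FinishNullCheckOnly(bool mgr_shutdown, Endpoint* endpoint) {
  HandshakeResult r{"", endpoint};
  if (mgr_shutdown && endpoint != nullptr) {
    endpoint->shut_down = true;
    r.error = "handshaker shutdown";
    r.endpoint = nullptr;
  }
  return r;
}

// The error is set whenever the manager is shut down; only the endpoint
// access is guarded by the null check.
HandshakeResult FinishResetErrorAlways(bool mgr_shutdown, Endpoint* endpoint) {
  HandshakeResult r{"", endpoint};
  if (mgr_shutdown) {
    r.error = "handshaker shutdown";
    if (endpoint != nullptr) {
      endpoint->shut_down = true;
      r.endpoint = nullptr;
    }
  }
  return r;
}

// Models the caller's assumption: no error implies a usable endpoint.
bool CallerWouldDereferenceNull(const HandshakeResult& r) {
  return r.error.empty() && r.endpoint == nullptr;
}
```

With `mgr_shutdown == true` and a null endpoint, the first variant returns a success result with no endpoint, which is the "moved crash" case; the second returns an error, so the caller never touches the endpoint.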

@kpayson64
Contributor Author

I don't see a way to force this failure case with any end-to-end tests because the core surface has no guarantees about when flush gets called.

I think the only way to test this would be to set up some kind of mock no-op handshaker and a unit test for the handshake manager.

Member

@markdroth markdroth left a comment

Thanks for doing this, Ken!

> I think the only way to test this would be to set up some kind of mock no-op handshaker and a unit test for the handshake manager.

I agree. @yashykt, that's probably something to think about doing in the future, maybe as part of the C++-ification of the handshaker APIs.

Member

Please add a comment explaining the race condition that we're catching here (the case you describe in the PR description).

Contributor Author

Done

@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences

@grpc-testing

[microbenchmarks] Performance differences noted:
Benchmark                                                                                 cpu_time    real_time
----------------------------------------------------------------------------------------  ----------  -----------
BM_PumpStreamClientToServer<InProcess>/262144                                             +10%        +10%
BM_PumpStreamServerToClient<InProcess>/32768                                              +8%         +8%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/262144/2                        +4%         +4%
BM_StreamingPingPongMsgs<InProcess, NoOpMutator, NoOpMutator>/262144                      +6%         +6%
BM_StreamingPingPongMsgs<InProcess, NoOpMutator, NoOpMutator>/32768                       +7%         +7%
BM_StreamingPingPongMsgs<MinInProcess, NoOpMutator, NoOpMutator>/32768                    +9%         +9%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/262144/2/1  +5%         +5%
BM_UnaryPingPong<MinInProcess, NoOpMutator, NoOpMutator>/262144/262144                    -4%         -4%

@grpc-testing

[trickle] No significant performance differences

@grpc-testing

[microbenchmarks] No significant performance differences

@kpayson64
Contributor Author

#15693
#15751
#15752

@kpayson64 kpayson64 merged commit 26287ea into grpc:master Jun 14, 2018
@kpayson64 kpayson64 added the "release notes: no" label Jul 23, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 21, 2018