gRPC channels blocking indefinitely and not respecting deadlines on network disconnect #15889
Description
What version of gRPC and what language are you using?
gRPC==1.12.1
python
What operating system (Linux, Windows, …) and version?
Linux Ubuntu "14.04.5 LTS, Trusty Tahr"
What runtime / compiler are you using (e.g., Python version or version of gcc)?
gcc==4:4.8.2-1ubuntu6
python==2.7
What did you do?
I consistently reproduced a bug that causes gRPC channels to block indefinitely when sending to a destination that continuously times out.
I slightly modified greeter_client.py from the Python helloworld example: I added an infinite while loop, a 0.3-second timeout on the SayHello() gRPC call, and exception handling to catch any DEADLINE_EXCEEDED errors.
Running this script against an unmodified greeter_server.py (on a different host, listening on port 6106), I observed the two communicating as expected.
To reproduce the bug in question, I add an iptables rule to the server side to drop all traffic arriving at the port in use:
/sbin/iptables -I INPUT 1 -s xx.xxx.xxx.xxx -j DROP
What did you expect to see?
At first, adding the iptables rule results in lots of DEADLINE_EXCEEDED (timeout) exceptions, as expected:
Traceback (most recent call last):
File "greeter_client.py", line 36, in run
response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 500, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 434, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>
I expected greeter_client.py to keep iterating over the while loop, timing out after 300 ms on every call to response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3).
What did you see instead?
If you leave the client running with everything in this state, it eventually blocks indefinitely on the send response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3) and never times out. This is the bug we've been trying to track down for the past few weeks. After some digging, we found that the blocking call is here: https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_channel.py#L508
Once the iptables rule is removed on the server side, the client unblocks and resumes sending.
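Until the underlying issue is fixed, one possible client-side mitigation (our suggestion, not something from the report) is to enforce a wall-clock deadline from a separate thread, so the caller gives up even when the RPC's own deadline is ignored. A minimal sketch with the standard library, where stuck_rpc is a hypothetical stand-in for the blocked stub.SayHello() call:

```python
import threading
import time


def stuck_rpc():
    # Hypothetical stand-in for the stub.SayHello() call that ignores its
    # deadline and blocks deep inside the C core.
    time.sleep(5)


result = {}


def worker():
    result["response"] = stuck_rpc()


t = threading.Thread(target=worker)
t.daemon = True      # a stuck call must not keep the process alive
t.start()
t.join(timeout=0.3)  # our own wall-clock deadline, mirroring timeout=0.3

timed_out = t.is_alive()
if timed_out:
    # This only unblocks the caller; the underlying send is still wedged,
    # so the channel should probably be recreated before retrying.
    print("gave up after 0.3s instead of blocking forever")
```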
Interestingly, if you increase the size of the msg variable to, say, 'a' * 1000, the client breaks (starts blocking indefinitely) much faster.
On inspection, we discovered that after we add the iptables rule to drop the traffic, the send-q of the socket used by the gRPC client keeps growing as the client keeps attempting to send and timing out. The send-q grows because gRPC buffers each request for sending but never receives an ACK, so the data piles up in the queue on every attempt, until the send-q reaches its maximum threshold, at which point the client exhibits the indefinite blocking behaviour. We believe the kernel refuses to add any more data to the send-q, causing gRPC to block/spin deep within the C code.
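The kernel-buffer behaviour described above can be demonstrated with plain sockets, independent of gRPC. A minimal sketch (this uses an AF_UNIX socketpair rather than TCP, so the exact thresholds differ from the netstat numbers below; the 1000-byte payload mirrors the msg='a'*1000 case):

```python
import socket

# Stand-in for the gRPC client's TCP socket once the server-side iptables
# rule starts dropping traffic: the peer never drains data, so the kernel
# send buffer (the netstat send-q) fills up and further sends cannot proceed.
client_sock, dead_peer = socket.socketpair()  # dead_peer never calls recv()
client_sock.setblocking(False)  # surface EWOULDBLOCK instead of hanging

total = 0
try:
    while True:
        total += client_sock.send(b"a" * 1000)
except BlockingIOError:
    # The kernel refused to queue more data.  A *blocking* socket -- which
    # is effectively what sits underneath stub.SayHello() -- would simply
    # never return here, matching the indefinite block observed.
    pass

print("kernel send buffer filled after %d bytes" % total)
client_sock.close()
dead_peer.close()
```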
For every reproduction of this when sending msg='a', the client blocks once the send-q hits 33425 bytes:
root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 33425 xx.xxx.xxx.xxx:45048 xxx.xxx.xx.xxx:6106 ESTABLISHED 19834/python
For larger messages the breaking threshold rises slightly: for example, sending msg='a'*1000, the send-q reaches 46278 bytes every time before blocking:
root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 46278 xx.xxx.xxx.xxx:36000 xxx.xxx.xx.xxx:6106 ESTABLISHED 27166/python
Anything else we should know about your project / environment?
No