Getting Internal Error (GOAWAY) when server goes down #14109
Description
What version of gRPC and what language are you using?
Upgrading gRPC version from 1.1.0 to 1.8.0
What operating system (Linux, Windows, …) and version?
Windows
What runtime / compiler are you using (e.g. python version or version of gcc)?
C# on .NET Framework 4.5.2
What did you do?
We had been using gRPC 1.1.0 on our production servers for a long time. Stress testing showed that having multiple channels per server gave us better results. So with two servers, A and B, we would create 5 channels each for server A and server B, and then make calls to the servers in round-robin fashion. This gave us pretty good throughput.
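For reference, the round-robin selection described above can be sketched roughly as follows. This is a minimal illustration, not our production code: `RoundRobinPool<T>` and `Demo` are hypothetical names that are not part of Grpc.Core, and plain strings stand in for the channels so the snippet runs without any gRPC dependency.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper (not part of Grpc.Core): hands out its items in
// round-robin order and is safe to call from multiple threads.
public sealed class RoundRobinPool<T>
{
    private readonly T[] _items;
    private int _counter = -1;

    public RoundRobinPool(IEnumerable<T> items)
    {
        _items = new List<T>(items).ToArray();
        if (_items.Length == 0)
            throw new ArgumentException("Pool needs at least one item.", "items");
    }

    public T Next()
    {
        // Interlocked.Increment makes the counter update atomic. The uint cast
        // keeps the index non-negative after the counter overflows int.MaxValue.
        uint ticket = unchecked((uint)Interlocked.Increment(ref _counter));
        return _items[ticket % (uint)_items.Length];
    }
}

public static class Demo
{
    public static void Main()
    {
        // Placeholder labels standing in for 5 channels each to servers A and B.
        var labels = new List<string>();
        foreach (var server in new[] { "A", "B" })
            for (int i = 0; i < 5; i++)
                labels.Add(server + "#" + i);

        var pool = new RoundRobinPool<string>(labels);
        for (int i = 0; i < 12; i++)
            Console.WriteLine(pool.Next());
    }
}
```

In the real client the pool would hold `Grpc.Core.Channel` objects instead of strings, and each call would be made on a client built from whatever `Next()` returns.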
We upgraded our servers and clients to 1.8.0 some time ago. Everything worked fine while all the servers were up, but when one of the servers went down we started getting an Internal RPC error:
ERROR [51] ExceptionHandler LogException - Grpc.Core.RpcException: Status(StatusCode=Internal, Detail="GOAWAY received")
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at Grpc.Core.Internal.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at EditCMSWindowsService.Messages.EditCMSGrpcService.EditCMSGrpcServiceClient.GetArticleViewsCount(GrpcInt request, CallOptions options)
Strangely, this error did not occur every time, and when it did, it appeared on random clients.
Sometimes the error is also seen when the server is actually up.
While trying to reproduce the error, I found that it was easy to trigger when I created many channel instances (10) to the same server. Also, the state of the clients was constantly Idle instead of TransientFailure or Connecting.
As soon as I created only one instance per server, the errors went away.
We also have servers and clients written in Go. The issue does not occur on those.
I also have a few questions:
- Is using multiple instances of the channel to the same server the correct thing to do?
- If yes, the behavior described above needs to be fixed.
- If not, there is a lot of thread-management overhead when there are a large number of calls. What is the best way to mitigate that?