Getting Internal Error (GOAWAY) when server goes down #14109

@pratikvasa

What version of gRPC and what language are you using?

Upgrading gRPC version from 1.1.0 to 1.8.0

What operating system (Linux, Windows, …) and version?

Windows

What runtime / compiler are you using (e.g. python version or version of gcc)?

C# on .NET Framework 4.5.2

What did you do?

We had been using gRPC 1.1.0 on our production servers for a long time. Stress testing showed that opening multiple channels per server gave us better results. So with two servers, A and B, we would create 5 channels for server A and 5 for server B, and then make calls across them in round-robin fashion. This gave us pretty good throughput.
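For readers who want to picture the setup above, here is a minimal sketch of the round-robin selection. `RoundRobinPool<T>` is a hypothetical helper written for illustration and is not part of Grpc.Core; the host names and ports are placeholders, and plain strings stand in for the `Grpc.Core.Channel` objects that our real code would hold.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper (not part of Grpc.Core): hands out pooled items in round-robin order.
public sealed class RoundRobinPool<T>
{
    private readonly T[] _items;
    private int _next = -1;

    public RoundRobinPool(IEnumerable<T> items)
    {
        _items = new List<T>(items).ToArray();
        if (_items.Length == 0)
            throw new ArgumentException("Pool must contain at least one item.", "items");
    }

    public int Count { get { return _items.Length; } }

    // Safe to call from many threads: Interlocked.Increment gives each caller its own ticket.
    public T Next()
    {
        // Clear the sign bit so the index stays non-negative after the counter wraps.
        int ticket = Interlocked.Increment(ref _next) & int.MaxValue;
        return _items[ticket % _items.Length];
    }
}

public static class Program
{
    public static void Main()
    {
        // Placeholder targets; in real code each entry would be a Grpc.Core.Channel.
        var hosts = new[] { "serverA:50051", "serverB:50051" };
        var targets = new List<string>();
        for (int i = 0; i < 5; i++)          // 5 channels per server
            foreach (var host in hosts)      // interleave A and B
                targets.Add(host + "#" + i);

        var pool = new RoundRobinPool<string>(targets);
        for (int call = 0; call < 12; call++)
            Console.WriteLine("call " + call + " -> " + pool.Next());
    }
}
```

Each RPC would pick its channel with `pool.Next()`, so consecutive calls alternate between the two servers and cycle through all ten channels.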

Some time ago we upgraded our servers and clients to 1.8.0. Everything worked fine while all the servers were up, but when one of the servers went down we started getting an Internal RPC error:

ERROR [51] ExceptionHandler LogException - Grpc.Core.RpcException: Status(StatusCode=Internal, Detail="GOAWAY received")
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
   at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Internal.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at EditCMSWindowsService.Messages.EditCMSGrpcService.EditCMSGrpcServiceClient.GetArticleViewsCount(GrpcInt request, CallOptions options)

Strangely, this error did not occur every time, and when it did, it occurred on random clients. Sometimes the error is also seen while the server is actually up.

While trying to reproduce the error, I found that it was easy to trigger when I created many channel instances (10) to the same server. The channel state on the clients also stayed idle the whole time, instead of moving to transient failure or connecting.

As soon as I created only one channel instance to the server, the errors went away.

We also have servers and clients written in Go. The issue does not occur with those.

I also have a few questions:

  1. Is using multiple channel instances to the same server the correct thing to do?
  2. If yes, then the behavior described above needs to be fixed.
  3. If no, then there is a lot of thread-management overhead when there is a large number of calls. What is the best way to mitigate that?
