We experienced an issue where our grpc-java clients filled up their heaps and the entire cluster went down. The cause appears to be a list of futures in io.grpc.internal.TransportSet: five of these sets consumed 47% of a 30 GB heap. The cluster had been running without issue for a month.
Context: Our clients each make about 10,000 requests/s to a cluster of 80 servers. The problem began when the cluster of servers was restarted. The clients filled their heaps and effectively died. A restart of the clients resolved the immediate issue, but we want to fix the root cause.
Each client maintains a single blocking stub for each of the 80 servers. We enforce a 10 ms deadline on every call using withDeadlineAfter(). All calling threads share the same blocking stubs.
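In case it helps, here is a minimal sketch of that setup. GreeterGrpc, HelloRequest, and HelloReply are stand-ins for our real generated service (the hello-world types from the grpc-java examples), and the channel options are simplified.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ClientSetup {
  private final List<GreeterGrpc.GreeterBlockingStub> stubs = new ArrayList<>();

  public ClientSetup(List<String> serverHosts, int port) {
    // One channel and one blocking stub per server (80 in our case).
    for (String host : serverHosts) {
      ManagedChannel channel =
          ManagedChannelBuilder.forAddress(host, port).usePlaintext(true).build();
      stubs.add(GreeterGrpc.newBlockingStub(channel));
    }
  }

  // Called concurrently from many threads; the stubs themselves are shared,
  // and the 10 ms deadline is applied per-RPC via withDeadlineAfter().
  public HelloReply call(int serverIndex, HelloRequest request) {
    return stubs.get(serverIndex)
        .withDeadlineAfter(10, TimeUnit.MILLISECONDS)
        .sayHello(request);
  }
}
```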
We did manage to get a heap dump from an offending client. About 5 blocking stubs account for 47% of the retained heap on the machine, while the other ~70 blocking stubs have the expected ~1 KB size. I'm attaching a screenshot of the Dominator Tree report from Eclipse Memory Analyzer showing the problem (the columns are "shallow heap", "retained heap", and "retained heap %"). It shows a single io.grpc.internal.TransportSet holding what is effectively a linked list of RunnableExecutorPair objects, each containing a Future. The list is 1.6 GB in total, at roughly 1.5 KB per element.
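For what it's worth, the shape of that list looks like Guava's ExecutionList, whose nodes are RunnableExecutorPair objects that are only drained when the future they are attached to completes. Below is a standalone Guava sketch of the accumulation pattern we suspect; this is our guess at the mechanism, not a claim about what TransportSet actually does.

```java
import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.SettableFuture;

public class ListenerAccumulation {
  public static void main(String[] args) {
    // A future that never completes...
    SettableFuture<Void> pending = SettableFuture.create();

    // ...retains one ExecutionList node (a RunnableExecutorPair) per
    // addListener() call; the list is only drained on completion.
    for (int i = 0; i < 1_000_000; i++) {
      pending.addListener(() -> {}, MoreExecutors.directExecutor());
    }

    // At ~1.5 KB per element (per our dump), a million undrained
    // listeners is the same order as the 1.6 GB we saw retained.
  }
}
```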
Any thoughts? What can I do to help debug this?
We are running grpc-java master as of Dec 9 (v0.9 has a bug that makes it unusable for us).
