
Possible Deadlock/Livelock in Pinot Broker at High QPS #9019

@ankitsultana


Around a month ago, we saw an issue where the Pinot brokers for one of our high-QPS use cases appeared to hit a deadlock or livelock. The brokers were serving around 300-400 QPS, and during the incident the number of threads in each broker kept increasing (up to 30k-40k). When we took a thread dump, we saw around 30k threads with the stack trace shown below.

We were able to remediate the issue by increasing the number of brokers in the tenant, which lowered the average QPS per broker instance. We didn't have a chance to dig deeper into the issue, but we are sharing it here in case someone else has seen it as well.

The Pinot base version was 0.7.0 with some cherry-picks from later versions.

"jersey-server-managed-async-executor-154433" #173980 prio=5 os_prio=0 tid=0x00007f5de8468000 nid=0xa1b8 waiting on condition [0x00007f5a0a868000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f6182902770> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at org.glassfish.hk2.utilities.general.Hk2ThreadLocal.get(Hk2ThreadLocal.java:108)
        at org.jvnet.hk2.internal.PerLocatorUtilities.getAutoAnalyzerName(PerLocatorUtilities.java:166)
        at org.jvnet.hk2.internal.ConstantActiveDescriptor.<init>(ConstantActiveDescriptor.java:84)
        at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetInjecteeDescriptor(ServiceLocatorImpl.java:562)
        at org.jvnet.hk2.internal.ServiceLocatorImpl.getInjecteeDescriptor(ServiceLocatorImpl.java:587)
        at org.jvnet.hk2.internal.ThreeThirtyResolver.resolve(ThreeThirtyResolver.java:70)
        at org.jvnet.hk2.internal.ClazzCreator.resolve(ClazzCreator.java:212)
        at org.jvnet.hk2.internal.ClazzCreator.resolveAllDependencies(ClazzCreator.java:229)
        at org.jvnet.hk2.internal.ClazzCreator.create(ClazzCreator.java:358)
        at org.jvnet.hk2.internal.SystemDescriptor.create(SystemDescriptor.java:487)
        at org.jvnet.hk2.internal.PerLookupContext.findOrCreate(PerLookupContext.java:70)
        at org.jvnet.hk2.internal.Utilities.createService(Utilities.java:2022)
        at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetService(ServiceLocatorImpl.java:774)
        at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetService(ServiceLocatorImpl.java:737)
        at org.jvnet.hk2.internal.ServiceLocatorImpl.getService(ServiceLocatorImpl.java:733)
        at org.glassfish.jersey.inject.hk2.SupplierFactoryBridge.provide(SupplierFactoryBridge.java:74)
        at org.jvnet.hk2.internal.FactoryCreator.create(FactoryCreator.java:153)
        at org.jvnet.hk2.internal.SystemDescriptor.create(SystemDescriptor.java:487)
        at org.glassfish.jersey.inject.hk2.RequestContext.findOrCreate(RequestContext.java:59)
        at org.jvnet.hk2.internal.Utilities.createService(Utilities.java:2022)
        at org.jvnet.hk2.internal.ServiceHandleImpl.getService(ServiceHandleImpl.java:114)
        - locked <0x00007f7e5d003c88> (a java.lang.Object)
        at org.jvnet.hk2.internal.ServiceHandleImpl.getService(ServiceHandleImpl.java:88)
        at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.resolve(ContextInjectionResolverImpl.java:103)
        at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.resolve(ContextInjectionResolverImpl.java:121)
        at org.glassfish.jersey.server.internal.inject.DelegatedInjectionValueParamProvider.lambda$getValueProvider$0(DelegatedInjectionValueParamProvider.java:67)
        at org.glassfish.jersey.server.internal.inject.DelegatedInjectionValueParamProvider$$Lambda$199/1475889071.apply(Unknown Source)
        at org.glassfish.jersey.server.spi.internal.ParamValueFactoryWithSource.apply(ParamValueFactoryWithSource.java:50)
        at org.glassfish.jersey.server.spi.internal.ParameterValueHelper.getParameterValues(ParameterValueHelper.java:64)
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$AbstractMethodParamInvoker.getParamValues(JavaResourceMethodDispatcherProvider.java:109)
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.lambda$apply$0(ResourceMethodInvoker.java:381)
        at org.glassfish.jersey.server.model.ResourceMethodInvoker$$Lambda$233/665077335.call(Unknown Source)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2$1.run(ServerRuntime.java:819)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2.run(ServerRuntime.java:814)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Below is the number of lines in the thread dump that reference the lock object which could be part of a deadlock or livelock.

┌  /private/tmp 130 ↵
❯❯❯ cat ~/Desktop/a.thdump| grep "0x00007f6182902770" | wc -l
   33893
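
For anyone who wants to repeat this check on their own dump, here is a minimal sketch in Java that counts waiting threads per lock address. The `grep` above counts every line mentioning the address, which would also include a `- locked <...>` line from the holder; this sketch counts only `- parking to wait for` and `- waiting to lock` lines. The class name `LockWaitCounter` and its method are hypothetical names made up for this example, and it assumes the HotSpot thread-dump line format shown in the stack trace above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pinot: counts waiting threads per lock address.
public class LockWaitCounter {
    // Matches HotSpot thread-dump lines such as:
    //   - parking to wait for  <0x00007f6182902770> (a ...)
    //   - waiting to lock <0x00007f7e5d003c88> (a ...)
    private static final Pattern WAIT_LINE = Pattern.compile(
        "-\\s+(?:parking to wait for|waiting to lock)\\s+<(0x[0-9a-fA-F]+)>");

    // Returns a map from lock address to the number of threads waiting on it.
    public static Map<String, Integer> countWaiters(List<String> dumpLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : dumpLines) {
            Matcher m = WAIT_LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java LockWaitCounter <thread-dump-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // Print the ten most contended lock addresses, highest count first.
        countWaiters(lines).entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue())
            .limit(10)
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Running it as `java LockWaitCounter a.thdump` prints the most contended lock addresses. A single address with tens of thousands of waiters would be consistent with the pile-up described here, though it does not by itself distinguish a true deadlock from heavy contention on that lock.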
