
Change default otlp exporter GRPC load balancer to round robin #10298


Description

@taniyourstruly

Is your feature request related to a problem? Please describe.

Using the pick_first load balancer means the exporter sends all data to the same backend by default, which can cause problems with resource allocation: a single connection to one IP address or backend invites throttling. For scalable services, if that backend is overloaded it cannot accept any more data, and data may then be dropped instead of being sent to another backend, especially when all backends are running at their limits. In addition, pick_first does no actual load balancing (link); it simply tries each address from the name resolver and connects to the first one that works.
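For context, this is roughly what the status quo looks like for a plain gRPC-Go client (a sketch, not the collector's actual dial path; the target address and insecure credentials are made up for illustration): with no load-balancing policy configured, gRPC falls back to pick_first and pins the channel to a single backend.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// No load-balancing policy is set, so gRPC uses its default, pick_first:
	// it walks the resolver's address list and sticks with the first address
	// that connects successfully.
	conn, err := grpc.Dial(
		"dns:///spaningest.example:4317", // hypothetical target; DNS may return many backend IPs
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// Every RPC on conn now goes to the same backend, which is the throttling
	// and overload scenario described above.
}
```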

Describe the solution you'd like

Updating the load balancer to use the round_robin policy allows data to be sent to different backends and therefore allocates resources more evenly. Round robin only picks ready connections, so it is a better fit for a typical cloud-compute setup where clients send data into a scalable service with multiple workers: connections alternate between backends, and when one backend's resources are exhausted, data moves to another available backend. With round_robin, users who want to send data to only one address can register that address more than once so that multiple connections are created (link), which keeps those connections from being throttled.
With pick_first, by contrast, the exporter resolves to that one address and can only send data while that single connection is available, which causes throttling. Having more than one connection to the same address, as round_robin allows, makes throttling much less of an issue, since data can flow over multiple connections that all reach that one address.
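Here is a minimal sketch of the proposed default, again at the plain gRPC-Go client level rather than the collector's exact code path (the target and credentials are illustrative): the only change from the snippet above is selecting round_robin through the default service config.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Identical to the previous sketch except for the default service config,
	// which selects the round_robin policy: RPCs now rotate across every
	// READY connection the resolver produced instead of pinning to one.
	conn, err := grpc.Dial(
		"dns:///spaningest.example:4317",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```

Note that rotation only helps when the resolver actually returns multiple addresses (for example, a headless service with one DNS record per pod); behind a single virtual IP, round_robin and pick_first both see just one address and behave the same.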

Describe alternatives you've considered

The alternative is to leave pick_first as the default and recommend that users make a choice. This is not an adequate solution because users expect reliable delivery by default, and round_robin is substantially more reliable for minimal additional cost. We have restricted the choice to round_robin and pick_first because these two policies are registered by default; any other load balancer would have to be registered by the user.

Additional context

I tested these two load balancers in the Lightstep/SNCO dashboard. Data is sent through envoy-edge, a proxy that load balances traffic, to our service spaningest, which, as its name implies, ingests OTLP spans using the Arrow format.
In the first image, traffic from envoy to spaningest looks fairly even. This run uses round_robin load balancing, which distributes traffic in rotation across the k8s pods, so resource allocation is even.
[screenshot: per-pod span throughput is even under round_robin]
After the switch to pick_first, the amount of data (spans) each pod receives changes drastically, with some pods getting far more spans than others.
[screenshot: per-pod span throughput becomes uneven under pick_first]
Comparing the two, the difference between the round_robin and pick_first load balancers is apparent: since pick_first sends to the first pod that is available, all data goes to that pod while the other pods are left using no resources.
[screenshot: comparison of round_robin and pick_first distribution across pods]
