Retry mechanism does not work if proxy server returns 503 on gRPC call #6780

@OrangeFlag

Describe the bug
The retry mechanism in OkHttpGrpcSender does not retry when Envoy returns a 503 HTTP status code for a gRPC request. The gRPC call fails immediately, even though the configured retry policy should cover this case. The cause appears to be the retry check in OkHttpGrpcSender (quoted under Additional context), which returns early without retrying whenever the HTTP response is not successful.

Steps to reproduce
To make the issue easier to reproduce, you can use WireMock to simulate the proxy server behavior:

  1. Set up WireMock to return a 503 status code for certain gRPC requests.
    • Install and configure WireMock to mock the gRPC service.
    • Create a rule in WireMock to respond with a 503 Service Unavailable for requests.
  2. Configure the OpenTelemetry SDK to use OkHttpGrpcSender for exporting spans.
  3. Run the application, making sure requests go through the WireMock server.
  4. Observe that retries are not triggered when WireMock returns the 503 status code, and the request fails immediately.

Alternatively, use Envoy for more production-like behavior:

  1. Set up Envoy as a proxy server in front of an OpenTelemetry Collector.
  2. Configure Envoy to return a 503 Service Unavailable when the backend is unavailable or as part of error injection for testing.
  3. Use the gRPC exporter in OpenTelemetry (v1.41.0) to send gRPC requests through the proxy server.
  4. Observe that retries are not triggered even though a retry policy is configured, and the gRPC request fails immediately after receiving the 503 HTTP status code.

What did you expect to see?
When the proxy server (Envoy) returns a 503 Service Unavailable, the retry mechanism should trigger based on the provided RetryPolicy and automatically retry the gRPC call.

What did you see instead?
No retries were attempted. The following log is observed:
Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: no healthy upstream.
There are no logs indicating retries were made.

What version and what artifacts are you using?
Artifacts: opentelemetry-sdk, opentelemetry-exporter-sender-okhttp
Version: 1.41.0

Environment
OS: Ubuntu Jammy (22.04 LTS)
Runtime: openjdk 17

Additional context
The issue seems to originate from the following code in OkHttpGrpcSender:

if (!response.isSuccessful()) {
  // Any unsuccessful HTTP response is treated as non-retryable here.
  return false;
}

This check prevents retries whenever the HTTP status is outside the 2xx range (response.isSuccessful() is true only for 200-299). However, transient HTTP status codes such as 503 should be treated as retryable. One option is to consult RetryUtil.retryableHttpResponseCodes when response.isSuccessful() is false, so that the retry decision takes transient HTTP error codes into account.
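
To make the suggestion concrete, here is a minimal, self-contained sketch of the decision logic I have in mind. It is not the actual OkHttpGrpcSender code: the class name, the method signature, and the two hard-coded sets are hypothetical stand-ins. The sets follow the retryable codes listed in the OTLP specification as I understand it, and a real fix would presumably read them from RetryUtil instead.

```java
import java.util.Set;

public final class RetryDecisionSketch {
  // Assumption: transient HTTP statuses treated as retryable (per the OTLP spec).
  // A real fix would likely use RetryUtil.retryableHttpResponseCodes instead.
  static final Set<Integer> RETRYABLE_HTTP_CODES = Set.of(429, 502, 503, 504);

  // Assumption: retryable gRPC status codes as they appear in the grpc-status header
  // (CANCELLED, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, ABORTED, OUT_OF_RANGE,
  // UNAVAILABLE, DATA_LOSS).
  static final Set<String> RETRYABLE_GRPC_CODES = Set.of("1", "4", "8", "10", "11", "14", "15");

  /**
   * Hypothetical replacement for the early return: a non-2xx response is retried
   * when its HTTP status is transient; a 2xx response is retried based on grpc-status.
   */
  static boolean isRetryable(int httpStatus, String grpcStatus) {
    boolean httpSuccessful = httpStatus >= 200 && httpStatus < 300;
    if (!httpSuccessful) {
      // Proxy-level failure (e.g. Envoy answering 503 before the backend is reached).
      return RETRYABLE_HTTP_CODES.contains(httpStatus);
    }
    // HTTP succeeded, so the gRPC status decides.
    return grpcStatus != null && RETRYABLE_GRPC_CODES.contains(grpcStatus);
  }

  public static void main(String[] args) {
    System.out.println(isRetryable(503, null)); // proxy returned 503: retry
    System.out.println(isRetryable(200, "14")); // gRPC UNAVAILABLE: retry
    System.out.println(isRetryable(400, null)); // client error: no retry
    System.out.println(isRetryable(200, "0"));  // gRPC OK: no retry
  }
}
```

With this shape, a 503 from the proxy is handed to the retry policy, while non-transient failures such as 400 still fail fast.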

Labels: Bug