Description
Describe the bug
The retry mechanism in OkHttpGrpcSender fails to retry when Envoy returns a 503 HTTP status code for gRPC requests. The gRPC call fails without any retry being triggered, even though the configured retry policy should handle this case. The cause appears to be the check in OkHttpGrpcSender that skips retrying whenever the HTTP response is not successful.
Steps to reproduce
To make the issue easier to reproduce, you can use WireMock to simulate the proxy server behavior:
- Set up WireMock to return a 503 status code for certain gRPC requests.
  - Install and configure WireMock to mock the gRPC service.
  - Create a rule in WireMock to respond with a 503 Service Unavailable for requests.
- Configure the OpenTelemetry SDK to use OkHttpGrpcSender for exporting spans.
- Run the application, making sure requests go through the WireMock server.
- Observe that retries are not triggered when WireMock returns the 503 status code, and the request fails immediately.
Alternatively, use Envoy for more production-like behaviour:
- Set up Envoy as a proxy server in front of an OpenTelemetry Collector.
- Configure Envoy to return a 503 Service Unavailable when the backend is unavailable or as part of error injection for testing.
- Use the gRPC exporter in OpenTelemetry (v1.41.0) to send gRPC requests through the proxy server.
- Observe that retries are not triggered even though a retry policy is configured, and the gRPC request fails immediately after receiving the 503 HTTP status code.
What did you expect to see?
When the proxy server (Envoy) returns a 503 Service Unavailable, the retry mechanism should trigger based on the provided RetryPolicy and automatically retry the gRPC call.
What did you see instead?
No retries were attempted. The following log is observed:
Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: no healthy upstream.
There are no logs indicating retries were made.
What version and what artifacts are you using?
Artifacts: opentelemetry-sdk, opentelemetry-exporter-sender-okhttp
Version: 1.41.0
Environment
OS: Ubuntu Jammy (22.04 LTS)
Runtime: openjdk 17
Additional context
The issue seems to originate from the following code in OkHttpGrpcSender:
if (!response.isSuccessful()) {
    return false;
}
This logic prevents retries from being attempted whenever the HTTP response code is outside the 2xx range, since response.isSuccessful() is only true for 2xx codes. However, transient HTTP status codes such as 503 should be treated as retryable. It may be beneficial to consult RetryUtil.retryableHttpResponseCodes when response.isSuccessful() is false, so that the decision to retry is based on whether the HTTP error code is transient.
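The suggested change can be sketched in plain Java as follows. This is a minimal illustration, not the actual OkHttpGrpcSender code: the class name RetryDecision, the method names, and both hard-coded sets are assumptions made for the example. A real fix would presumably take the HTTP codes from RetryUtil.retryableHttpResponseCodes and the gRPC codes from the configured retry policy.

```java
import java.util.Set;

public class RetryDecision {
    // Assumption: transient HTTP status codes worth retrying. In the SDK these
    // would come from RetryUtil.retryableHttpResponseCodes rather than a literal.
    public static final Set<Integer> RETRYABLE_HTTP_CODES = Set.of(429, 502, 503, 504);

    // Assumption: retryable gRPC status codes as decimal strings, the form in
    // which they appear in the grpc-status header or trailer (14 = UNAVAILABLE).
    public static final Set<String> RETRYABLE_GRPC_CODES =
            Set.of("1", "4", "8", "10", "11", "14", "15");

    // Mirrors the meaning of OkHttp's Response.isSuccessful(): any 2xx code.
    public static boolean isSuccessful(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 300;
    }

    // Decides whether a response should be retried.
    // grpcStatus may be null when a proxy answers without a grpc-status value.
    public static boolean isRetryable(int httpStatus, String grpcStatus) {
        if (!isSuccessful(httpStatus)) {
            // Current behaviour returns false here unconditionally; the proposal
            // is to retry when the HTTP code itself is transient.
            return RETRYABLE_HTTP_CODES.contains(httpStatus);
        }
        // HTTP-level success: fall back to the gRPC status, as before.
        return grpcStatus != null && RETRYABLE_GRPC_CODES.contains(grpcStatus);
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(503, null)); // prints "true"
        System.out.println(isRetryable(400, null)); // prints "false"
        System.out.println(isRetryable(200, "14")); // prints "true"
        System.out.println(isRetryable(200, "0"));  // prints "false"
    }
}
```

With a check of this shape, a 503 from Envoy or WireMock would be handed to the retry policy instead of failing immediately, while non-transient errors such as 400 would still fail fast.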