Describe the bug
We're running Java applications in Kubernetes with an OpenTelemetry Collector sidecar container that the application container exports to. During pod start-up we occasionally see failed exports (based on otlp_exporter_exported{success="false"}). We do have OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED enabled.
I plotted some of the SDK metrics for a pod where this happened:
Based on the error logs and a brief look at the gRPC exporter code, it doesn't look like these errors are retried. I saw the same behavior with the HTTP exporter.
[otel.javaagent 2024-07-31 21:19:36:628 +0000] [OkHttp http://localhost:4318/...] ERROR io.opentelemetry.exporter.internal.http.HttpExporter - Failed to export spans. The request could not be executed. Full error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
java.net.ConnectException: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor.intercept(RetryInterceptor.java:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Suppressed: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4318
... 21 more
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
... 20 more
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
... 20 more
[otel.javaagent] [OkHttp http://localhost:4317/...] WARN io.opentelemetry.exporter.internal.grpc.GrpcExporter - Failed to export metrics. Server responded with gRPC status code 2. Error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4317
(forgot to capture stacktrace)
My guess is there's some sort of race condition where the sidecar collector isn't ready when the export happens. queueSize isn't reported as hitting the maximum that triggers an export, but it gets very close to it, so maybe the gauge just misses the moment it crosses that threshold?
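One deployment-side mitigation for this race, assuming Kubernetes 1.28+ with the SidecarContainers feature, is to run the collector as a native sidecar so the kubelet starts it (and waits for its startupProbe) before the application container. Container names, image tags, and the health-check port below are illustrative, and this assumes the collector's health_check extension is enabled:

```yaml
# Sketch only: run the collector as a native sidecar so it is up before
# the app container starts. Names/images are placeholders.
spec:
  initContainers:
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib:latest
      restartPolicy: Always          # marks this init container as a sidecar
      ports:
        - containerPort: 4318        # OTLP http/protobuf
      startupProbe:
        httpGet:
          path: /                    # assumes the health_check extension
          port: 13133                # on its default port
  containers:
    - name: app
      image: my-app:latest
```

This doesn't fix the SDK-side behavior, but it removes the start-up ordering race for this particular setup.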
Steps to reproduce
This can probably be reproduced by exporting without a collector running, since I think that hits the same error.
What did you expect to see?
It'd be nice if the SDK retried the request, since the sidecar comes up pretty quickly.
Maybe the SDK could retry on these errors for up to X seconds after application start-up. Always retrying on these errors might backfire, for example during pod termination when the collector sidecar is terminated before the application container.
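The start-up-bounded retry suggested above could look roughly like this. This is a self-contained sketch, not the SDK's actual retry code: connection failures are retried with exponential backoff, but only within a fixed window after process start, so a collector torn down during pod termination doesn't cause endless retries. The class name and parameters are hypothetical.

```java
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

// Sketch of a start-up-bounded retry policy (hypothetical, not SDK code):
// retry connection-level IOExceptions with exponential backoff, but only
// until a deadline measured from process start.
public class StartupRetry {
    private final Instant deadline;        // stop retrying connect errors after this
    private final Duration initialBackoff;

    public StartupRetry(Duration startupWindow, Duration initialBackoff) {
        this.deadline = Instant.now().plus(startupWindow);
        this.initialBackoff = initialBackoff;
    }

    public <T> T call(Callable<T> export) throws Exception {
        Duration backoff = initialBackoff;
        while (true) {
            try {
                return export.call();
            } catch (IOException e) {   // e.g. ConnectException: Connection refused
                if (Instant.now().plus(backoff).isAfter(deadline)) {
                    throw e;            // outside the start-up window: fail fast
                }
                Thread.sleep(backoff.toMillis());
                backoff = backoff.multipliedBy(2);
            }
        }
    }
}
```

An exporter could then wrap its send call in something like new StartupRetry(Duration.ofSeconds(30), Duration.ofMillis(100)), keeping the always-retry-vs-fail-fast trade-off confined to the start-up window.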
What did you see instead?
Failed exports
What version and what artifacts are you using?
We're using version 2.0.0 of the OpenTelemetry Java agent.
Environment
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: http/protobuf
- name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
value: cumulative
- name: OTEL_EXPERIMENTAL_METRICS_CARDINALITY_LIMIT
value: "10000"
- name: OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT
value: "3000"
- name: OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT
value: "60"
- name: OTEL_BSP_MAX_QUEUE_SIZE
value: "10000"
- name: OTEL_METRIC_EXPORT_INTERVAL
value: "15000"
- name: OTEL_METRIC_EXPORT_TIMEOUT
value: "10000"
- name: OTEL_METRICS_EXPORTER
value: none
- name: OTEL_LOGS_EXPORTER
value: none
- name: OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED
value: "true"