Skip to content

Telemetry frequently lost during application start-up when sidecar OTEL collector isn't ready yet, even with retry feature enabled #6587

@swar8080

Description

@swar8080

Describe the bug
We're running java applications in kubernetes with an OpenTelemetry Collector sidecar container that an application container exports to. During pod start-up we're occasionally seeing failed exports (based on otlp_exporter_exported{success="false"}). We do have OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED enabled.

I plotted some of the SDK metrics for a pod where this happened:

image

It doesn't look like the errors are retried based on the error logs and a brief look at the gRPC exporter code? Also tested this with the HTTP exporter as well.

[otel.javaagent 2024-07-31 21:19:36:628 +0000] [OkHttp http://localhost:4318/...] ERROR io.opentelemetry.exporter.internal.http.HttpExporter - Failed to export spans. The request could not be executed. Full error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318 
java.net.ConnectException: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
	at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
	at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
	at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor.intercept(RetryInterceptor.java:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
	Suppressed: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4318
		... 21 more
	Caused by: java.net.ConnectException: Connection refused
		at java.base/sun.nio.ch.Net.pollConnect(Native Method)
		at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
		at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
		at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
		at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
		at java.base/java.net.Socket.connect(Socket.java:633)
		at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
		at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
		... 20 more
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
	at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
	... 20 more
[otel.javaagent] [OkHttp http://localhost:4317/...] WARN io.opentelemetry.exporter.internal.grpc.GrpcExporter - Failed to export metrics. Server responded with gRPC status code 2. Error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4317

(forgot to capture stacktrace)

My guess is there's some sort of race condition where the sidecar collector isn't ready for the export. The queueSize isn't reported as hitting the max that triggers an export. It's really close to that maximum though, so maybe a gauge would miss recording when it hits that threshold?

Steps to reproduce
Probably can reproduce this by exporting without a collector, since I think it'll hit the same error

What did you expect to see?
It'd be nice if the SDK retired the request since the sidecar comes up pretty quickly.

Maybe the SDK retries on these errors for up to X seconds after application start-up. Always retrying on these error might backfire, like during pod termination when the OTEL collector is terminated before the application container

What did you see instead?
Failed exports

What version and what artifacts are you using?
We're using version 2.0.0 of the OpenTelemetry java agent

Environment
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: http/protobuf
- name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
value: cumulative
- name: OTEL_EXPERIMENTAL_METRICS_CARDINALITY_LIMIT
value: 10000
- name: OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT
value: "3000"
- name: OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT
value: "60"
- name: OTEL_BSP_MAX_QUEUE_SIZE
value: "10000"
- name: OTEL_METRIC_EXPORT_INTERVAL
value: "15000"
- name: OTEL_METRIC_EXPORT_TIMEOUT
value: "10000"
- name: OTEL_METRICS_EXPORTER
value: none
- name: OTEL_LOGS_EXPORTER
value: none
- name: OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED
value: "true"

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions