-
Notifications
You must be signed in to change notification settings - Fork 923
Fix OkHttpGrpcSender to properly await executor shutdown #7840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix OkHttpGrpcSender to properly await executor shutdown #7840
Conversation
f7e52d1 to
7125eef
Compare
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (75.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #7840 +/- ##
============================================
+ Coverage 90.02% 90.06% +0.03%
- Complexity 7313 7320 +7
============================================
Files 825 825
Lines 22034 22048 +14
Branches 2178 2179 +1
============================================
+ Hits 19837 19857 +20
+ Misses 1518 1511 -7
- Partials 679 680 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| try { | ||
| // Wait up to 5 seconds for threads to terminate | ||
| // Even if timeout occurs, we succeed since these are daemon threads | ||
| executorService.awaitTermination(5, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it fails to terminate in 5s, is it worth logging something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a log line at level warning
d40b072 to
92418ad
Compare
jkwatson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good to me. @jack-berg do you want to give it a once over as well?
...ender/jdk/src/test/java/io/opentelemetry/exporter/sender/jdk/internal/JdkHttpSenderTest.java
Show resolved
Hide resolved
|
|
||
| // Capture OkHttp threads before shutdown | ||
| Set<Thread> threadsBeforeShutdown = getOkHttpThreads(); | ||
| logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get rid of these log statements. They just make for noisy build logs, which I'm in a never-ending battle to wrangle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...ttp/src/test/java/io/opentelemetry/exporter/sender/okhttp/internal/OkHttpGrpcSenderTest.java
Outdated
Show resolved
Hide resolved
...ttp/src/test/java/io/opentelemetry/exporter/sender/okhttp/internal/OkHttpGrpcSenderTest.java
Outdated
Show resolved
Hide resolved
...ttp/src/test/java/io/opentelemetry/exporter/sender/okhttp/internal/OkHttpGrpcSenderTest.java
Outdated
Show resolved
Hide resolved
92418ad to
975eef4
Compare
jack-berg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
OkHttpGrpcSender.shutdown() was calling shutdownNow() on the dispatcher's ExecutorService but not waiting for threads to actually terminate. This caused two issues: 1. The CompletableResultCode returned immediately (ofSuccess()), even though background threads were still running 2. Applications that wait for all threads to complete before exiting would hang, as HTTP call threads remained alive The original implementation: - Called executorService.shutdownNow() to interrupt threads - Immediately returned CompletableResultCode.ofSuccess() - Did not call awaitTermination() to wait for threads to finish This violated the semantic contract of CompletableResultCode, where callers expect to use .join() to wait for async operations to complete. Modified OkHttpGrpcSender.shutdown() to: 1. Use shutdownNow() to interrupt idle threads immediately (safe since we've already called cancelAll() on pending work) 2. Create a background daemon thread that calls awaitTermination() 3. Complete the CompletableResultCode only after threads terminate (or timeout after 5 seconds) This ensures callers who use .join() will wait for actual thread termination, not just shutdown initiation. One limitation: OkHttp's internal TaskRunner threads (used for connection keep-alive) may remain alive temporarily. These are: - Daemon threads (won't prevent JVM exit) - Idle (in TIMED_WAITING state) - Unable to receive new work after shutdown - Will terminate after their 60-second keep-alive timeout The test verifies that HTTP call threads (dispatcher executor threads) are properly terminated, which is the critical requirement. Added OkHttpGrpcSenderTest.shutdown_ShouldTerminateExecutorThreads() which verifies: - Dispatcher threads are terminated after shutdown - CompletableResultCode completes successfully - TaskRunner daemon threads are acknowledged but not required to terminate Fix environment-dependent test failures: Fixed four tests that were failing due to OS/JDK-specific networking behavior differences: 1. RetryInterceptorTest.connectTimeout() and nonRetryableException(): Changed to accept both SocketTimeoutException and SocketException, as different environments throw different exceptions when connecting to non-routable IPs 2. JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException(): Changed to accept both HttpConnectTimeoutException and IOException, as the specific exception varies by environment 3. AbstractGrpcTelemetryExporterTest.connectTimeout(): Increased timeout threshold from 1s to 6s to account for OS-specific variations in connection failure timing (was failing before shutdown changes, pre-existing flaky test) Fixes open-telemetry#7827.
This version only checks the value of CompletableResultCode instead of trying to wait, it's simpler and also exposes the failure.
cb1396a to
287ab44
Compare
I don't have a windows machine so let's see ifthis passes CI.
|
any chance I can get a pass on the code coverage? I tried adding a test to cover this, but it failed on the Windows builder... I can try to figure out something if you'd like |
You're good. Thanks for this! |
|
Thank you for your contribution @kevinburkesegment! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. |
OkHttpGrpcSender.shutdown() was calling shutdownNow() on the dispatcher's ExecutorService but not waiting for threads to actually terminate. This caused two issues:
The original implementation:
This violated the semantic contract of CompletableResultCode, where callers expect to use .join() to wait for async operations to complete.
Modified OkHttpGrpcSender.shutdown() to:
This ensures callers who use .join() will wait for actual thread termination, not just shutdown initiation.
One limitation: OkHttp's internal TaskRunner threads (used for connection keep-alive) may remain alive temporarily. These are:
The test verifies that HTTP call threads (dispatcher executor threads) are properly terminated, which is the critical requirement.
Added OkHttpGrpcSenderTest.shutdown_ShouldTerminateExecutorThreads() which verifies:
Fix environment-dependent test failures:
Fixed four tests that were failing due to OS/JDK-specific networking behavior differences:
RetryInterceptorTest.connectTimeout() and nonRetryableException(): Changed to accept both SocketTimeoutException and SocketException, as different environments throw different exceptions when connecting to non-routable IPs
JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException(): Changed to accept both HttpConnectTimeoutException and IOException, as the specific exception varies by environment
AbstractGrpcTelemetryExporterTest.connectTimeout(): Increased timeout threshold from 1s to 6s to account for OS-specific variations in connection failure timing (was failing before shutdown changes, pre-existing flaky test)
Fixes #7827.