Skip to content

Conversation

@kevinburkesegment
Copy link
Contributor

OkHttpGrpcSender.shutdown() was calling shutdownNow() on the dispatcher's ExecutorService but not waiting for threads to actually terminate. This caused two issues:

  1. The CompletableResultCode returned immediately (ofSuccess()), even though background threads were still running
  2. Applications that wait for all threads to complete before exiting would hang, as HTTP call threads remained alive

The original implementation:

  • Called executorService.shutdownNow() to interrupt threads
  • Immediately returned CompletableResultCode.ofSuccess()
  • Did not call awaitTermination() to wait for threads to finish

This violated the semantic contract of CompletableResultCode, where callers expect to use .join() to wait for async operations to complete.

Modified OkHttpGrpcSender.shutdown() to:

  1. Use shutdownNow() to interrupt idle threads immediately (safe since we've already called cancelAll() on pending work)
  2. Create a background daemon thread that calls awaitTermination()
  3. Complete the CompletableResultCode only after threads terminate (or timeout after 5 seconds)

This ensures callers who use .join() will wait for actual thread termination, not just shutdown initiation.

One limitation: OkHttp's internal TaskRunner threads (used for connection keep-alive) may remain alive temporarily. These are:

  • Daemon threads (won't prevent JVM exit)
  • Idle (in TIMED_WAITING state)
  • Unable to receive new work after shutdown
  • Will terminate after their 60-second keep-alive timeout

The test verifies that HTTP call threads (dispatcher executor threads) are properly terminated, which is the critical requirement.

Added OkHttpGrpcSenderTest.shutdown_ShouldTerminateExecutorThreads() which verifies:

  • Dispatcher threads are terminated after shutdown
  • CompletableResultCode completes successfully
  • TaskRunner daemon threads are acknowledged but not required to terminate

Fix environment-dependent test failures:

Fixed four tests that were failing due to OS/JDK-specific networking behavior differences:

  1. RetryInterceptorTest.connectTimeout() and nonRetryableException(): Changed to accept both SocketTimeoutException and SocketException, as different environments throw different exceptions when connecting to non-routable IPs

  2. JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException(): Changed to accept both HttpConnectTimeoutException and IOException, as the specific exception varies by environment

  3. AbstractGrpcTelemetryExporterTest.connectTimeout(): Increased timeout threshold from 1s to 6s to account for OS-specific variations in connection failure timing (was failing before shutdown changes, pre-existing flaky test)

Fixes #7827.

@codecov
Copy link

codecov bot commented Nov 13, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.06%. Comparing base (42b04c5) to head (df28814).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...orter/sender/okhttp/internal/OkHttpGrpcSender.java 75.00% 3 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (75.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #7840      +/-   ##
============================================
+ Coverage     90.02%   90.06%   +0.03%     
- Complexity     7313     7320       +7     
============================================
  Files           825      825              
  Lines         22034    22048      +14     
  Branches       2178     2179       +1     
============================================
+ Hits          19837    19857      +20     
+ Misses         1518     1511       -7     
- Partials        679      680       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

try {
// Wait up to 5 seconds for threads to terminate
// Even if timeout occurs, we succeed since these are daemon threads
executorService.awaitTermination(5, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it fails to terminate in 5s, is it worth logging something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a log line at level warning

Copy link
Contributor

@jkwatson jkwatson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me. @jack-berg do you want to give it a once over as well?


// Capture OkHttp threads before shutdown
Set<Thread> threadsBeforeShutdown = getOkHttpThreads();
logger.info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get rid of these log statements. They just make for noisy build logs, which I'm in a never-ending battle to wrangle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

OkHttpGrpcSender.shutdown() was calling shutdownNow() on the
dispatcher's ExecutorService but not waiting for threads to actually
terminate. This caused two issues:

1. The CompletableResultCode returned immediately (ofSuccess()), even
   though background threads were still running
2. Applications that wait for all threads to complete before exiting
   would hang, as HTTP call threads remained alive

The original implementation:

- Called executorService.shutdownNow() to interrupt threads
- Immediately returned CompletableResultCode.ofSuccess()
- Did not call awaitTermination() to wait for threads to finish

This violated the semantic contract of CompletableResultCode, where
callers expect to use .join() to wait for async operations to complete.

Modified OkHttpGrpcSender.shutdown() to:

1. Use shutdownNow() to interrupt idle threads immediately (safe since
   we've already called cancelAll() on pending work)
2. Create a background daemon thread that calls awaitTermination()
3. Complete the CompletableResultCode only after threads terminate
   (or timeout after 5 seconds)

This ensures callers who use .join() will wait for actual thread
termination, not just shutdown initiation.

One limitation: OkHttp's internal TaskRunner threads (used for
connection keep-alive) may remain alive temporarily. These are:

- Daemon threads (won't prevent JVM exit)
- Idle (in TIMED_WAITING state)
- Unable to receive new work after shutdown
- Will terminate after their 60-second keep-alive timeout

The test verifies that HTTP call threads (dispatcher executor threads)
are properly terminated, which is the critical requirement.

Added OkHttpGrpcSenderTest.shutdown_ShouldTerminateExecutorThreads()
which verifies:

- Dispatcher threads are terminated after shutdown
- CompletableResultCode completes successfully
- TaskRunner daemon threads are acknowledged but not required to terminate

Fix environment-dependent test failures:

Fixed four tests that were failing due to OS/JDK-specific networking
behavior differences:

1. RetryInterceptorTest.connectTimeout() and nonRetryableException():
   Changed to accept both SocketTimeoutException and SocketException,
   as different environments throw different exceptions when connecting
   to non-routable IPs

2. JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException():
   Changed to accept both HttpConnectTimeoutException and IOException,
   as the specific exception varies by environment

3. AbstractGrpcTelemetryExporterTest.connectTimeout():
   Increased timeout threshold from 1s to 6s to account for OS-specific
   variations in connection failure timing (was failing before shutdown
   changes, pre-existing flaky test)

Fixes open-telemetry#7827.
This version only checks the value of CompletableResultCode instead of
trying to wait, it's simpler and also exposes the failure.
I don't have a windows machine so let's see ifthis passes CI.
@kevinburkesegment
Copy link
Contributor Author

any chance I can get a pass on the code coverage? I tried adding a test to cover this, but it failed on the Windows builder... I can try to figure out something if you'd like

@jack-berg
Copy link
Member

any chance I can get a pass on the code coverage? I tried adding a test to cover this, but it failed on the Windows builder... I can try to figure out something if you'd like

You're good. Thanks for this!

@jack-berg jack-berg merged commit 92d23b2 into open-telemetry:main Nov 26, 2025
27 of 28 checks passed
@otelbot
Copy link
Contributor

otelbot bot commented Nov 26, 2025

Thank you for your contribution @kevinburkesegment! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OkHttpGrpcSender.shutdown() does not properly await executor thread termination

3 participants