Retry exporting unsuccessfully exported batches #1493

Merged

breedx-splk merged 4 commits into open-telemetry:main from bencehornak:reexport-unsuccessfully-exported-batches on Jan 22, 2026

Conversation

@bencehornak (Contributor) commented Jan 3, 2026

Current behavior

Since #1326, when the SDK exports batches from disk, it essentially ignores any export failures and deletes the unsuccessfully exported batches from disk. Consequently, valuable telemetry might be dropped silently if any exception occurs in the OTel exporter (e.g. an IOException because the device is offline or the backend is experiencing a temporary outage).
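
To make the failure mode concrete, here is a hedged illustration of that pattern, using hypothetical BatchStore/BatchExporter stand-ins (the real disk-buffering API differs in names and details):

// Hypothetical stand-ins for the disk-buffering API; names are illustrative.
interface BatchStore {
    // Reads the next persisted batch, hands it to process, and deletes it
    // from disk only when process returns true. Returns false when the
    // store is empty.
    fun readAndProcess(process: (ByteArray) -> Boolean): Boolean
}

interface BatchExporter {
    // Returns false on failure (e.g. an IOException handled internally).
    fun export(batch: ByteArray): Boolean
}

// The problematic pattern: the export result is discarded and the batch is
// confirmed (and therefore deleted) unconditionally, silently dropping
// telemetry whenever the exporter fails.
fun exportIgnoringFailures(store: BatchStore, exporter: BatchExporter) {
    while (store.readAndProcess { batch ->
            exporter.export(batch) // result ignored
            true                   // unconditionally confirm deletion
        }
    ) {
        // keep draining the store
    }
}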

Proposed behavior

The export scheduler stops exporting on the first failure to prevent the deletion of unconsumed signals. On the next periodic run, exporting is retried from the point where the previous attempt left off.
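
Reusing the hypothetical interfaces above, the proposed behavior would look roughly like this (a sketch of the shape, not the actual DefaultExportScheduler change):

// Proposed shape: abort on the first failed export so unconsumed batches
// stay on disk and are retried by the next periodic run.
fun exportUntilFirstFailure(store: BatchStore, exporter: BatchExporter): Boolean {
    while (true) {
        var exported = true
        val hadData = store.readAndProcess { batch ->
            exported = exporter.export(batch)
            exported // only a successful export confirms (deletes) the batch
        }
        if (!hadData) return true   // store drained: everything was exported
        if (!exported) return false // leave the remaining batches for the next run
    }
}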

Note

I think the root cause of the problem is the incorrect example in the disk-buffering module's README; see open-telemetry/opentelemetry-java-contrib#2539.

@codecov bot commented Jan 3, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.60%. Comparing base (73b6d7f) to head (d4ab340).
⚠️ Report is 29 commits behind head on main.

Files with missing lines                                Patch %   Lines
.../diskbuffering/scheduler/DefaultExportScheduler.kt   0.00%     1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
+ Coverage   63.51%   64.60%   +1.09%     
==========================================
  Files         159      159              
  Lines        3152     3176      +24     
  Branches      326      326              
==========================================
+ Hits         2002     2052      +50     
+ Misses       1050     1021      -29     
- Partials      100      103       +3     


@LikeTheSalad (Contributor) left a comment

Thank you! 🙏

    }
    return atLeastOneWorked
}

fun exportAllSignalsFromDisk(): Boolean = exportSpansFromDisk() && exportMetricsFromDisk() && exportLogsFromDisk()
A Contributor commented:

One downside to this approach is that a failure in one telemetry receiver can impact the ability of the others to even attempt export. That is, if a client is configured to use separate endpoints for metrics, traces, and logs, an outage in the spans receiver endpoint will cause the metrics and logs to not even be attempted!

It's probably a minor point, because I think most users will have the same endpoint for all telemetry, but this is not necessarily the case. Because the signals are stored separately, we could attempt to export the other types even when one type fails. I'm not sure it's worth it, though...and doing so might be overengineering.

@bencehornak (Contributor, Author) replied:

Good point, I overlooked this possibility. It doesn't cost much to implement it the right way; I will make sure that all three methods are called regardless of failures.
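
Presumably something along these lines, reusing the three methods from the quoted snippet but evaluating each export eagerly instead of short-circuiting with && (a sketch, not the actual commit):

// Evaluate every export before combining the results, so an outage for one
// signal type cannot prevent the others from being attempted.
fun exportAllSignalsFromDisk(): Boolean {
    val spansOk = exportSpansFromDisk()
    val metricsOk = exportMetricsFromDisk()
    val logsOk = exportLogsFromDisk()
    return spansOk && metricsOk && logsOk
}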

Comment on lines +58 to +59
* Exports all persisted spans/metrics/logs from the disk and deletes the successfully
* consumed batches from the disk. It aborts the exporting on the first failure.
@breedx-splk (Contributor) commented:

I don't love that the javadocs on these address the deletion -- as pointed out in the other repo/issue/PR, the deletion is a side effect of iteration. I know we're going to be addressing this, but because this class doesn't actually do the deletion, I don't think it should describe behavior performed by a collaborator...unless that is explicitly called out.

@bencehornak (Contributor, Author) replied:

The deletion is implicitly there due to the current tricky API (next() with a side effect). Since the deleting/retrying isn't spelled out in the current version, I'd argue it's even more important to describe when data is removed from persistent storage and whether a failure is retried. The android-sdk makes an opinionated choice here (there are at least two ways of consuming the disk-buffering lib: the old way, ignoring failures, and the proposed way, retrying failures), so that choice is what I aimed to document here.

I can try to improve the documentation of the opinionated part, e.g. I could put an emphasis on the retry mechanism, like

Suggested change
* Exports all persisted spans/metrics/logs from the disk and deletes the successfully
* consumed batches from the disk. It aborts the exporting on the first failure.
* Exports all persisted spans/metrics/logs from the disk.
* If an export fails, the method aborts and retries exporting the
* affected signals on the next invocation.

would that work for you @breedx-splk?

@breedx-splk (Contributor) replied:

I mean, I guess that's fine...but I still think it's incorrect for a javadoc to describe the internal workings of one of its collaborators. If (when!) the collaborator changes, we are left with incorrect javadoc.

I think we can just roll with this for now and address that in a follow-up PR effort.

@breedx-splk merged commit 0ab4348 into open-telemetry:main on Jan 22, 2026
8 checks passed