Retry exporting unsuccessfully exported batches #1493
Conversation
Codecov Report

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
+ Coverage   63.51%   64.60%   +1.09%
==========================================
  Files         159      159
  Lines        3152     3176      +24
  Branches      326      326
==========================================
+ Hits         2002     2052      +50
+ Misses       1050     1021      -29
- Partials      100      103       +3
```
```kotlin
    }
    return atLeastOneWorked
}

fun exportAllSignalsFromDisk(): Boolean = exportSpansFromDisk() && exportMetricsFromDisk() && exportLogsFromDisk()
```
One downside to this approach is that a failure in one telemetry receiver can prevent the others from even attempting to export. That is, if a client is configured to use separate endpoints for metrics, traces, and logs, an outage at the spans endpoint will cause the metrics and logs exports to not even be attempted!

It's probably a minor point, because I think most users will have the same endpoint for all telemetry, but this is not necessarily the case. Because the signals are stored separately, we could attempt to export the other types even when one type fails. I'm not sure it's worth it, though, and doing so might lead to overengineering.
Good point, I overlooked this possibility. It doesn't cost much to implement it the right way; I will make sure that all three methods are called regardless of failures.
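For context: Kotlin's `&&` short-circuits, so the one-liner above never calls `exportMetricsFromDisk()` or `exportLogsFromDisk()` once `exportSpansFromDisk()` returns false. A minimal sketch of a non-short-circuiting variant, assuming only the three methods from the diff above:

```kotlin
// Sketch only: evaluate all three exports eagerly so that a failure in one
// signal type does not prevent the others from being attempted.
fun exportAllSignalsFromDisk(): Boolean {
    val spansOk = exportSpansFromDisk()
    val metricsOk = exportMetricsFromDisk()
    val logsOk = exportLogsFromDisk()
    // Still report failure if any signal type failed, so the run is retried.
    return spansOk && metricsOk && logsOk
}
```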
```
 * Exports all persisted spans/metrics/logs from the disk and deletes the successfully
 * consumed batches from the disk. It aborts the exporting on the first failure.
```
I don't love that the javadoc on these addresses the deletion. As pointed out in the other repo/issue/PR, the deletion is a side effect of iteration. I know we're going to be addressing this, but because this class doesn't actually do the deletion, I don't think it should describe behavior performed by a collaborator, unless that is explicitly called out.
The deletion is implicitly there due to the current tricky API (next() with a side effect). Since the deleting/retrying isn't spelled out in the current version, I'd argue that it's even more important to describe when data is removed from persistent storage and whether a failure is retried. The android-sdk makes an opinionated choice here (there are at least two ways of consuming the disk-buffering lib: the old way, ignoring failures, and the proposed way, retrying failures), so that choice is what I aimed to document here.

I can try to improve the documentation of the opinionated part, e.g. by putting an emphasis on the retry mechanism, like:
```diff
- * Exports all persisted spans/metrics/logs from the disk and deletes the successfully
- * consumed batches from the disk. It aborts the exporting on the first failure.
+ * Exports all persisted spans/metrics/logs from the disk.
+ * If an export fails, the method aborts and retries exporting the
+ * affected signals on the next invocation.
```
Would that work for you, @breedx-splk?
I mean, I guess that's fine... but I still think it's incorrect for a javadoc to describe the internal workings of one of its collaborators. If (when!) the collaborator changes, we are left with incorrect javadoc.

I think we can just roll with this for now and address it in a follow-up PR.
Current behavior
Since #1326, when the SDK exports batches from disk it essentially ignores export failures and deletes the unsuccessfully exported batches anyway. Consequently, valuable telemetry can be dropped silently whenever the OTel exporter throws an exception (e.g. an IOException because the device is offline or the backend is experiencing a temporary outage).
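For illustration, here is a hypothetical sketch of that lossy pattern. This is not the actual SDK or disk-buffering code; `BatchStorage`, `readAndRemoveNextBatch`, and `BatchExporter` are made-up names standing in for the real iteration API:

```kotlin
interface BatchStorage {
    /** Made-up API: returns the next batch and deletes it from disk, or null when empty. */
    fun readAndRemoveNextBatch(): ByteArray?
}

interface BatchExporter {
    /** Returns true if the batch was delivered successfully. */
    fun export(batch: ByteArray): Boolean
}

// Hypothetical sketch of the current (lossy) pattern.
fun exportSpansFromDisk(storage: BatchStorage, exporter: BatchExporter): Boolean {
    var atLeastOneWorked = false
    while (true) {
        // The batch is gone from disk as soon as it is read.
        val batch = storage.readAndRemoveNextBatch() ?: break
        if (exporter.export(batch)) {
            atLeastOneWorked = true
        }
        // If export() returned false, the batch was already deleted: silent data loss.
    }
    return atLeastOneWorked
}
```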
Proposed behavior
The export scheduler now stops exporting on the first failure, preventing the deletion of unconsumed signals. On the next periodic run, exporting is retried from the point where the previous attempt left off.
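Under the same caveat (the storage API below is made up, and `BatchExporter` is reused from the previous sketch), the proposed behavior could look like this: a batch is deleted only after a successful export, and the method aborts on the first failure so the remaining batches stay on disk for the next periodic run.

```kotlin
interface TransactionalBatchStorage {
    /** Made-up API: returns the next batch without deleting it, or null when empty. */
    fun peekNextBatch(): ByteArray?

    /** Made-up API: deletes the batch returned by the last peekNextBatch() call. */
    fun commitNextBatch()
}

fun exportSpansFromDisk(storage: TransactionalBatchStorage, exporter: BatchExporter): Boolean {
    while (true) {
        val batch = storage.peekNextBatch() ?: return true // nothing left to export
        if (!exporter.export(batch)) {
            // Abort without deleting: this batch and the ones after it
            // are retried on the next periodic run.
            return false
        }
        storage.commitNextBatch() // delete only after a successful export
    }
}
```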
Note
I think the root cause of the problem is the incorrect example in the disk-buffering module's README, see open-telemetry/opentelemetry-java-contrib#2539.