fix(telemetry): Fix operator responses appearing as verified and missed #1694
Conversation
avilagaston9
left a comment
Works as expected! Left a nit comment
JuArce
left a comment
Working, I'm reviewing the code
JuArce
left a comment
Check the solution with @Oppen
…ed (#1694) Co-authored-by: Julian Ventura <[email protected]>
Fix operator responses appearing both as verified and missed at the same time
Description
This PR fixes a bug in which the Grafana dashboard shows some operator responses as both missed and received at the same time.
In telemetry, we store information about active traces in the TraceStore. This module implements an Agent, which allows us to perform CRUD operations on the traces. The problem is that those operations are not atomic, so if telemetry receives more than one tracing message at the same time for the same merkle root, a write-on-write conflict may occur while modifying the trace information or while setting the current trace context for OpenTelemetry. This PR fixes this by refactoring the traces.ex module into a GenServer that processes each tracing message sequentially, eliminating the race condition, as sketched below.
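A minimal sketch of the idea, assuming a single GenServer that owns the trace map; the module name, function names, and state shape here are illustrative, not taken from the actual traces.ex:

```elixir
defmodule Telemetry.Traces do
  @moduledoc """
  Hypothetical GenServer-based trace store. All CRUD operations go through
  handle_call/3, so messages for the same merkle root are processed one at
  a time and can no longer interleave.
  """
  use GenServer
  require Logger

  ## Client API

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # Create (or overwrite) the trace entry for a merkle root.
  def create_trace(merkle_root, trace), do: GenServer.call(__MODULE__, {:create, merkle_root, trace})

  # Apply an update function to an existing trace.
  def update_trace(merkle_root, fun) when is_function(fun, 1),
    do: GenServer.call(__MODULE__, {:update, merkle_root, fun})

  # Finish a trace and erase it from the store.
  def finish_trace(merkle_root), do: GenServer.call(__MODULE__, {:finish, merkle_root})

  ## Server callbacks: a single process serializes every operation,
  ## removing the write-on-write races the Agent-based store allowed.

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:create, merkle_root, trace}, _from, traces),
    do: {:reply, :ok, Map.put(traces, merkle_root, trace)}

  def handle_call({:update, merkle_root, fun}, _from, traces) do
    case Map.fetch(traces, merkle_root) do
      {:ok, trace} -> {:reply, :ok, Map.put(traces, merkle_root, fun.(trace))}
      :error -> {:reply, {:error, :not_found}, traces}
    end
  end

  def handle_call({:finish, merkle_root}, _from, traces) do
    case Map.pop(traces, merkle_root) do
      {nil, traces} ->
        # The context was already erased (or never created); the real module
        # logs "Context not found for 0x..." in this situation.
        Logger.warning("Context not found for #{merkle_root}")
        {:reply, {:error, :not_found}, traces}

      {_trace, traces} ->
        {:reply, :ok, traces}
    end
  end
end
```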
How to test
Happy path
Here we test the normal behavior of the system
Not responding operator
Here we test the system's behavior when one operator is not responding
BLS service timeout
The following test scenario should not happen on mainnet, due to the very large configured BLS task timeout. We still test it to make sure the system remains stable if it does occur.
For this, modify the bls_service_task_timeout parameter under config-files/config-aggregator.yaml with a small value, as in the example below. Then, start the system as always.
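For instance, something along these lines; only the bls_service_task_timeout parameter name comes from the PR description, while the value, unit, and surrounding structure of config-aggregator.yaml are illustrative:

```yaml
# config-files/config-aggregator.yaml (fragment; nesting and unit are assumptions)
aggregator:
  bls_service_task_timeout: 1s   # deliberately small so the timeout always fires
```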
You should see some errors thrown on the aggregator due to BLS task expiration. Those batches should have an extra event in the aggregator's Jaeger logs, indicating that the batch verification failed due to expiration. Nevertheless, the remaining logs should still be there, showing that the batch was successfully verified.
In the telemetry execution logs, you should see a "Context not found for 0x..." message. This is because when the task is finished, the context is erased from telemetry's internal storage structure. Notice how in this capture, we first finish the trace successfully, and then a new finish request is received, logging that error message.
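Assuming the hypothetical store sketched earlier, that sequence would look roughly like this; only the quoted log message comes from the PR, and the merkle root is shortened:

```elixir
{:ok, _pid} = Telemetry.Traces.start_link([])
Telemetry.Traces.create_trace("0xabc...", %{status: :pending})

# The first finish succeeds and erases the context from the store...
:ok = Telemetry.Traces.finish_trace("0xabc...")

# ...so a second finish for the same merkle root only logs
# "Context not found for 0xabc..." and returns an error, as in the capture.
{:error, :not_found} = Telemetry.Traces.finish_trace("0xabc...")
```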
Type of change
Please delete options that are not relevant.
Checklist
testnet, everything else to staging