
Conversation

@JulianVentura JulianVentura (Contributor) commented Dec 30, 2024

Fix operator responses appearing as both verified and missed at the same time

Description

This PR fixes a bug where the Grafana dashboard shows some operator responses as both missed and received at the same time.

In telemetry, we store active trace information in the TraceStore. This module implements an Agent that lets us perform CRUD operations on the traces. The problem is that those operations are not atomic, so if telemetry receives more than one tracing message at the same time for the same Merkle root, a write-on-write conflict can occur while modifying the trace information or while setting the current trace context for OpenTelemetry.
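For illustration, here is a minimal sketch of why an Agent-based store is racy (module and function names are hypothetical, not the actual TraceStore API): the read and the write are separate calls into the Agent, so two concurrent tracing messages for the same Merkle root can interleave between them.

```elixir
defmodule TraceStoreSketch do
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Two concurrent callers can both read the same trace here...
  def get_trace(merkle_root) do
    Agent.get(__MODULE__, &Map.get(&1, merkle_root))
  end

  # ...and then both write back, so one update silently overwrites the other.
  def put_trace(merkle_root, trace) do
    Agent.update(__MODULE__, &Map.put(&1, merkle_root, trace))
  end
end
```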

This PR fixes the issue by refactoring the traces.ex module into a GenServer, which processes each tracing message sequentially and eliminates the race condition.
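A minimal sketch of the GenServer approach, with illustrative names rather than the PR's exact implementation: because a GenServer drains its mailbox one message at a time, each read-modify-write completes before the next tracing message is handled.

```elixir
defmodule TracesSketch do
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # Public API: each tracing message becomes a call handled one at a time.
  def register_response(merkle_root, operator) do
    GenServer.call(__MODULE__, {:register_response, merkle_root, operator})
  end

  @impl true
  def init(traces), do: {:ok, traces}

  @impl true
  def handle_call({:register_response, merkle_root, operator}, _from, traces) do
    # The read and the write happen inside a single handle_call, so no other
    # tracing message can interleave between them.
    traces =
      Map.update(traces, merkle_root, %{responses: [operator]}, fn trace ->
        Map.update(trace, :responses, [operator], &[operator | &1])
      end)

    {:reply, :ok, traces}
  end
end
```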

How to test

Happy path

Here we test the normal behavior of the system.

  1. Start all services
  2. Send some proofs
  3. Check that traces are correctly generated on Jaeger.
  4. Check that metrics are correctly updated on Grafana

Not responding operator

Here we test the system's behavior when one operator is not responding.

  1. Start all services, including 4 operators.
  2. Send infinite proofs.
  3. Check that traces are correctly generated on Jaeger.
  4. Check that metrics are correctly updated on Grafana. You should see one of the operators (and just one) not responding to any task.
[screenshot: Grafana dashboard showing one operator with no responses]

BLS service timeout

The following test scenario should not happen on mainnet, given the very large configured BLS task timeout; we test it anyway to make sure the system remains stable if it occurs.
To simulate it, set the bls_service_task_timeout parameter in config-files/config-aggregator.yaml to a small value, for instance:

aggregator:
  bls_service_task_timeout: 20s

Then, start the system as always:

  1. Start all services.
  2. Send infinite proofs.
  3. Check that traces are correctly generated on Jaeger.
  4. Check that metrics are correctly updated on Grafana.

You should see some errors thrown by the aggregator due to BLS task expiration. Those batches should have an extra event in the aggregator's Jaeger logs indicating that batch verification failed due to expiration. The remaining logs should still be there, showing that the batch was eventually verified successfully.

[screenshot: aggregator Jaeger trace including the batch verification expiration event]

In the telemetry execution logs, you should see a "Context not found for 0x..." message. This happens because the context is erased from telemetry's internal storage once the task is finished. Notice how in this capture we first finish the trace successfully, and then a new finish request is received, logging that error message.

[screenshot: telemetry logs showing the "Context not found" message after the trace was already finished]
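For illustration, continuing the hypothetical GenServer sketch from the description (and assuming `require Logger` inside the module), a second finish request for an already-erased trace would take the first branch and log that message:

```elixir
# Inside the same sketch module as above; names are illustrative, not the PR's exact code.
@impl true
def handle_call({:finish_trace, merkle_root}, _from, traces) do
  case Map.pop(traces, merkle_root) do
    {nil, _} ->
      # The trace was already finished and its context erased.
      Logger.warning("Context not found for #{merkle_root}")
      {:reply, :ok, traces}

    {_trace_context, remaining} ->
      # The real module would end the OpenTelemetry span here before
      # dropping the stored context.
      {:reply, :ok, remaining}
  end
end
```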

Type of change

Please delete options that are not relevant.

  • New feature
  • Bug fix
  • Optimization
  • Refactor

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds operator compatibility for both versions and does not change the batcher/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

@JulianVentura JulianVentura self-assigned this Dec 30, 2024
@JulianVentura JulianVentura marked this pull request as ready for review January 2, 2025 16:04
@JulianVentura JulianVentura changed the title from "Fix telemetry traces race condition" to "fix(telemetry): Fix traces race condition" Jan 2, 2025
@avilagaston9 avilagaston9 (Contributor) left a comment

Works as expected! Left a nit comment

@JuArce JuArce (Collaborator) left a comment

Working, I'm reviewing the code

@JuArce JuArce (Collaborator) left a comment

Check the solution with @Oppen

@JulianVentura JulianVentura changed the title from "fix(telemetry): Fix traces race condition" to "fix(telemetry): Fix operator responses appearing as verified and missed" Jan 6, 2025
@MauroToscano MauroToscano added this pull request to the merge queue Jan 7, 2025
Merged via the queue into staging with commit e7adde6 Jan 7, 2025
1 check passed
@MauroToscano MauroToscano deleted the fix-telemetry-traces-race-conditions branch January 7, 2025 15:14
PatStiles pushed a commit that referenced this pull request Jan 9, 2025
PatStiles pushed a commit that referenced this pull request Jan 10, 2025
PatStiles pushed a commit that referenced this pull request Jan 10, 2025