
Conversation

@JulianVentura JulianVentura (Contributor) commented Dec 30, 2024

Fix operator responses appearing as both verified and missed at the same time

Description

This PR fixes a bug where the Grafana dashboard shows some operator responses as both missed and received at the same time.

In telemetry, we store active trace information in the TraceStore. This module implements an Agent that lets us perform CRUD operations on the traces. The problem is that those operations are not atomic, so if telemetry receives more than one tracing message at the same time for the same Merkle root, a write-on-write conflict can occur while modifying the trace information or while setting the current trace context for OpenTelemetry.
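For illustration, here is a minimal sketch of why an Agent-based store is racy (module and function names are hypothetical, not the actual TraceStore API): the read and the write are separate calls into the Agent, so two concurrent tracing messages for the same Merkle root can interleave between them.

```elixir
defmodule TraceStoreSketch do
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Two concurrent callers can both read the same trace here...
  def get_trace(merkle_root) do
    Agent.get(__MODULE__, &Map.get(&1, merkle_root))
  end

  # ...and then both write back, so one update silently overwrites the other.
  def put_trace(merkle_root, trace) do
    Agent.update(__MODULE__, &Map.put(&1, merkle_root, trace))
  end
end
```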

This PR fixes the issue by refactoring the traces.ex module into a GenServer, which processes each tracing message sequentially and eliminates the race condition.
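A minimal sketch of the GenServer approach, with illustrative names rather than the PR's exact implementation: because a GenServer drains its mailbox one message at a time, each read-modify-write completes before the next tracing message is handled.

```elixir
defmodule TracesSketch do
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # Public API: each tracing message becomes a call handled one at a time.
  def register_response(merkle_root, operator) do
    GenServer.call(__MODULE__, {:register_response, merkle_root, operator})
  end

  @impl true
  def init(traces), do: {:ok, traces}

  @impl true
  def handle_call({:register_response, merkle_root, operator}, _from, traces) do
    # The read and the write happen inside a single handle_call, so no other
    # tracing message can interleave between them.
    traces =
      Map.update(traces, merkle_root, %{responses: [operator]}, fn trace ->
        Map.update(trace, :responses, [operator], &[operator | &1])
      end)

    {:reply, :ok, traces}
  end
end
```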

How to test

Happy path

Here we test the normal behavior of the system.

  1. Start all services
  2. Send some proofs
  3. Check that traces are correctly generated on Jaeger.
  4. Check that metrics are correctly updated on Grafana

Not responding operator

Here we test the system's behavior when one operator is not responding.

  1. Start all services, including 4 operators.
  2. Send infinite proofs.
  3. Check that traces are correctly generated on Jaeger.
  4. Check that metrics are correctly updated on Grafana. You should see one of the operators (and just one) not responding to any task.
[screenshot: Grafana dashboard showing one operator with no responses]

BLS service timeout

The following test scenario should not happen on mainnet, given the very large configured BLS task timeout; we test it anyway to make sure the system remains stable if it occurs.
To simulate it, set the bls_service_task_timeout parameter in config-files/config-aggregator.yaml to a small value, for instance:

aggregator:
  bls_service_task_timeout: 20s

Then, start the system as always:

  1. Start all services.
  2. Send infinite proofs.
  3. Check that traces are correctly generated on Jaeger.
  4. Check that metrics are correctly updated on Grafana.

You should see some errors thrown by the aggregator due to BLS task expiration. Those batches should have an extra event in the aggregator's Jaeger logs indicating that batch verification failed due to expiration. The remaining logs should still be there, showing that the batch was eventually verified successfully.

[screenshot: aggregator Jaeger trace including the batch verification expiration event]

In the telemetry execution logs, you should see a "Context not found for 0x..." message. This happens because the context is erased from telemetry's internal storage once the task is finished. Notice how in this capture we first finish the trace successfully, and then a new finish request is received, logging that error message.

[screenshot: telemetry logs showing the "Context not found" message after the trace was already finished]
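For illustration, continuing the hypothetical GenServer sketch from the description (and assuming `require Logger` inside the module), a second finish request for an already-erased trace would take the first branch and log that message:

```elixir
# Inside the same sketch module as above; names are illustrative, not the PR's exact code.
@impl true
def handle_call({:finish_trace, merkle_root}, _from, traces) do
  case Map.pop(traces, merkle_root) do
    {nil, _} ->
      # The trace was already finished and its context erased.
      Logger.warning("Context not found for #{merkle_root}")
      {:reply, :ok, traces}

    {_trace_context, remaining} ->
      # The real module would end the OpenTelemetry span here before
      # dropping the stored context.
      {:reply, :ok, remaining}
  end
end
```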

Type of change

Please delete options that are not relevant.

  • New feature
  • Bug fix
  • Optimization
  • Refactor

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds operator compatibility for both versions and does not change the batcher/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

@JulianVentura JulianVentura self-assigned this Dec 30, 2024
@JulianVentura JulianVentura marked this pull request as ready for review January 2, 2025 16:04
@JulianVentura JulianVentura changed the title from "Fix telemetry traces race condition" to "fix(telemetry): Fix traces race condition" Jan 2, 2025
@avilagaston9 avilagaston9 (Contributor) left a comment

Works as expected! Left a nit comment

@JuArce JuArce (Collaborator) left a comment

Working, I'm reviewing the code

@JuArce JuArce (Collaborator) left a comment

Check the solution with @Oppen

@JulianVentura JulianVentura changed the title from "fix(telemetry): Fix traces race condition" to "fix(telemetry): Fix operator responses appearing as verified and missed" Jan 6, 2025
@MauroToscano MauroToscano added this pull request to the merge queue Jan 7, 2025
Merged via the queue into staging with commit e7adde6 Jan 7, 2025
1 check passed
@MauroToscano MauroToscano deleted the fix-telemetry-traces-race-conditions branch January 7, 2025 15:14
PatStiles pushed a commit that referenced this pull request Jan 9, 2025
PatStiles pushed a commit that referenced this pull request Jan 10, 2025
PatStiles pushed a commit that referenced this pull request Jan 10, 2025