Conversation

avilagaston9 (Contributor) commented Dec 16, 2024

Operator Liveness Metric

Motivation

We need a way to rapidly determine if an operator is down.

Description

Adds a Bar Gauge with the count of missed tasks for each operator over a specified time range.

(screenshot)
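For reviewers who want the gist of the wiring: the panel is fed by a per-operator counter exposed by the telemetry API. The module and function names below are the ones used in the test steps; the metric name, label, and prometheus.ex plumbing are assumptions for illustration, not the actual implementation:

```elixir
defmodule TelemetryApi.PrometheusMetrics do
  # Minimal sketch using the prometheus.ex library; the real module in
  # telemetry_api may declare the metric differently. The metric name
  # :missed_tasks_total and the :operator label are assumptions.
  use Prometheus.Metric

  def initialize_operator_metrics(operator_name) do
    # declare/1 is idempotent, so calling it once per operator is safe.
    Counter.declare(
      name: :missed_tasks_total,
      help: "Tasks an operator failed to respond to before the BLS timeout",
      labels: [:operator]
    )

    # Increment by 0 so the labeled series exists (and renders as a zero
    # bar) before the operator misses anything.
    Counter.inc([name: :missed_tasks_total, labels: [operator_name]], 0)
  end

  def missing_operator(operator_name) do
    Counter.inc(name: :missed_tasks_total, labels: [operator_name])
  end
end
```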

How To Test

  1. Start anvil:
     make anvil_start_with_block_time
  2. Start the telemetry server in interactive mode:
     make telemetry_run_db
     make telemetry_ecto_migrate
     cd telemetry_api
     ALIGNED_CONFIG_FILE="../contracts/script/output/devnet/alignedlayer_deployment_output.json" OPERATOR_FETCHER_WAIT_TIME_MS=5000 ENVIRONMENT=devnet RPC_URL=http://localhost:8545 iex -S mix phx.server
  3. Run metrics:
     make run_metrics
  4. Initialize different operator names from the telemetry terminal:
     TelemetryApi.PrometheusMetrics.initialize_operator_metrics("gaston")
  5. Call the missing_operator method from the telemetry terminal with the initialized operator names (see the iex snippet after this list):
     TelemetryApi.PrometheusMetrics.missing_operator("gaston")
  6. Go to the aggregator-batcher dashboard at localhost:3000; you should see the panel with the values.
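If you want the panel to show more than one bar, you can initialize and bump several operators from the same session. A minimal sketch reusing only the two functions above; "uri" and "oppen" are placeholder operator names:

```elixir
# Run inside the telemetry iex session started in step 2.
for name <- ["gaston", "uri", "oppen"] do
  TelemetryApi.PrometheusMetrics.initialize_operator_metrics(name)
end

# Record a few misses so the bars have distinct heights.
TelemetryApi.PrometheusMetrics.missing_operator("gaston")
TelemetryApi.PrometheusMetrics.missing_operator("gaston")
TelemetryApi.PrometheusMetrics.missing_operator("uri")
```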

Test also the full flow:

  1. Start anvil:
     make anvil_start_with_block_time
  2. Start the batcher:
     make batcher_start_local
  3. Go to config-files/config-aggregator.yaml and reduce the bls_service_task_timeout (see the excerpt after this list).
  4. Start the aggregator:
     make aggregator_start
  5. Register and start the operator:
     make operator_register_and_start
  6. Send tasks:
     make batcher_send_burst_groth16
  7. Kill the operator and watch the panel.
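For step 3, a hypothetical excerpt of the config change; the field name comes from the step above, but the surrounding structure and the 30s value are illustrative:

```yaml
# config-files/config-aggregator.yaml (excerpt; other fields omitted).
# A short timeout makes the aggregator give up on unanswered tasks quickly,
# so a killed operator shows up on the panel within seconds.
aggregator:
  bls_service_task_timeout: 30s
```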

Type of change

  • New feature

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds operator compatibility for both versions and does not change the batcher/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

@avilagaston9 avilagaston9 self-assigned this Dec 16, 2024
@avilagaston9 avilagaston9 changed the base branch from testnet to staging December 16, 2024 16:59
@avilagaston9 avilagaston9 marked this pull request as ready for review December 16, 2024 17:59
uri-99 (Contributor) commented Dec 16, 2024

WIP discussing the possibility of a graph

Oppen (Contributor) commented Dec 17, 2024

WIP discussing the possibility of a graph

Given we need the metric, wouldn't it be better to merge and change later to improve its display?

avilagaston9 (Contributor, Author) replied:

Given we need the metric, wouldn't it be better to merge and change later to improve its display?

I agree with this. I'm not sure how much information a historical graph would add.

MarcosNicolau (Member) left a comment:

Works locally!

Oppen (Contributor) commented Dec 18, 2024

The metric doesn't seem to update locally. The manual triggers work, but killing my real operator doesn't make the metric go up:
(screenshots)

Maybe some instruction is missing?

It also seems like the metric clears itself after some time, I guess that's normal?
(screenshot)

avilagaston9 (Contributor, Author) replied:

The metric doesn't seem to update locally. The manual triggers work, but killing my real operator doesn't make the metric go up:
@Oppen

I was missing the step to send proofs, but from the picture you sent it looks like you are already doing that. I followed the steps again and it worked on my machine 🤔.

It also seems like the metric clears itself after some time, I guess that's normal?

Yes, the metric is configured to show only the missed responses within the time range selected in the top-right corner:

(screenshot)
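That clearing behavior is what a range-scoped query gives you. A hedged sketch of what the panel's query might look like, reusing the assumed metric name from the sketch in the description (the actual dashboard query may differ):

```promql
# Misses per operator within the dashboard's selected time range.
# $__range is Grafana's built-in variable for that range; the metric
# name missed_tasks_total is the assumption from the earlier sketch.
sum by (operator) (increase(missed_tasks_total[$__range]))
```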

Oppen (Contributor) left a comment:

Worked after solving the skill issue by shortening the TTL for tasks.

JuArce (Collaborator) left a comment:

  • Remove the "total" transformation in the Grafana dashboard
  • Sort the values so the operators with the most missed responses come first

avilagaston9 (Contributor, Author) replied:

@JuArce

I added the "total" transformation because, when only one operator was missing tasks, no name was displayed on the bar gauge:

(screenshot)

Also, the bar gauge lacked an option to dynamically order the labels.

I addressed both in #15a53ae by switching from a "Bar Gauge" to a simple "table + gauge display" with successful results:

(screenshot)

@JuArce JuArce enabled auto-merge December 23, 2024 16:28
@JuArce JuArce added this pull request to the merge queue Dec 23, 2024
Merged via the queue into staging with commit b767977 Dec 23, 2024
1 check passed
@JuArce JuArce deleted the operator-liveness-metric branch December 23, 2024 16:56
PatStiles pushed a commit that referenced this pull request Jan 10, 2025