fix(operator): Operator panics when reconnect fails after max retries #1692
Conversation
core/retry.go (Outdated)
```go
RespondToTaskV2MaxElapsedTime = 0 // Maximum time all retries may take. `0` corresponds to no limit on the time of the retries.
RespondToTaskV2NumRetries     = 0 // Total number of retries attempted. If 0, retries indefinitely until maxElapsedTime is reached.

// Retry Parameters for SubscribeToNewTasksV3
```
Suggested change:

```diff
- // Retry Parameters for SubscribeToNewTasksV3
+ // Retry Parameters for SubscribeToNewTasks
```
addressed!
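For reference, a minimal sketch of how retry constants like the ones in this snippet are commonly wired into an exponential backoff. This assumes the cenkalti/backoff/v4 library; the `MaxInterval` constant and the helper function are hypothetical names, not code from this repo:

```go
package retry

import (
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

const (
	RespondToTaskV2MaxInterval    = 60 * time.Second // hypothetical cap on the wait between retries
	RespondToTaskV2MaxElapsedTime = 0                // 0 => no limit on the total time spent retrying
	RespondToTaskV2NumRetries     = 0                // 0 => retry until MaxElapsedTime (i.e. forever when it is 0)
)

// newRespondToTaskV2BackOff shows how the constants above could be turned
// into a backoff policy; the function name is hypothetical.
func newRespondToTaskV2BackOff() backoff.BackOff {
	b := backoff.NewExponentialBackOff()
	b.MaxInterval = RespondToTaskV2MaxInterval
	b.MaxElapsedTime = RespondToTaskV2MaxElapsedTime // zero disables the overall deadline
	if RespondToTaskV2NumRetries > 0 {
		return backoff.WithMaxRetries(b, uint64(RespondToTaskV2NumRetries))
	}
	return b // unbounded: keep retrying with capped intervals
}
```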
avilagaston9 left a comment
Works on my machine! Maybe we should add logs for each retry to warn that we are trying to reconnect. Otherwise, it might seem like nothing is happening.
Agreed! @JuArce, should we do this in a separate PR?
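If such logging lands (here or in a follow-up PR), one option is backoff's notify hook, which fires after each failed attempt. A minimal sketch, assuming cenkalti/backoff/v4 and a hypothetical `subscribe` function:

```go
package main

import (
	"errors"
	"log"
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

// subscribe stands in for the real resubscription call.
func subscribe() error {
	return errors.New("rpc connection refused")
}

func main() {
	b := backoff.NewExponentialBackOff()
	b.MaxInterval = 60 * time.Second
	b.MaxElapsedTime = 0 // no overall deadline

	// RetryNotify runs the callback after every failed attempt, so each
	// reconnection attempt leaves a visible trace in the logs. The retry
	// cap is only here so the example terminates; dropping it gives the
	// retry-forever behaviour this PR switches to.
	err := backoff.RetryNotify(subscribe, backoff.WithMaxRetries(b, 5),
		func(err error, wait time.Duration) {
			log.Printf("subscription failed: %v, retrying in %s", err, wait)
		})
	if err != nil {
		log.Printf("giving up: %v", err)
	}
}
```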
MauroToscano left a comment
I'm not convinced that not shutting down is the best solution. I'm putting this on hold until the description explains in which scenario the subscription retries for a long time while the node is still working.
Closing in favor of #1729
Fix: Operator panic when reconnect fails after max retries
Description
Closes #960.
`SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` are called when the operator starts. Both spawn a goroutine to handle errors from the subscriptions to two RPCs, a primary and a fallback. Upon erroring, the goroutine attempts to subscribe again, up to a retry limit.

When `SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` exhaust the maximum number of retries, the respective subscribe functions panic due to dereferencing a nil reference within the auto-generated contract bindings.

If one of the subscriptions errors and then fails to reconnect, the operator panics even though the other subscription (the fallback) is still available and the operator could continue to function as normal.

If both connections error and fail to reconnect, the operator should shut down.

If the operator fails to subscribe initially, the operator should shut down.
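For context, a minimal sketch of the failure mode described above; all names are hypothetical stand-ins for the operator's subscription logic and the generated bindings:

```go
package main

import "fmt"

// subscription stands in for the subscription type returned by the
// auto-generated contract bindings.
type subscription struct{ errs chan error }

func (s *subscription) Err() <-chan error { return s.errs }

// resubscribe mimics the pre-fix behaviour: after maxRetries failed
// attempts it gives up and hands back a nil subscription instead of
// surfacing an error to the caller.
func resubscribe(maxRetries int) *subscription {
	for i := 0; i < maxRetries; i++ {
		// every attempt fails, e.g. both RPC endpoints are unreachable
	}
	return nil
}

func main() {
	sub := resubscribe(3)
	// The error-handling goroutine then uses the subscription without a
	// nil check, which is where the operator crashed.
	fmt.Println(<-sub.Err()) // nil pointer dereference -> panic
}
```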
This PR addresses the issue by:

1. Adding a `defer()` to the goroutine that converts the panic into an error before the operator process exits.
2. Changing the retry parameters for `SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` to retry indefinitely, with retry intervals capped at 60 seconds.

Both changes are sketched below.
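A minimal sketch of the shape of this fix, assuming cenkalti/backoff/v4 for the retry policy; the function and channel names are hypothetical, not the operator's actual API:

```go
package main

import (
	"fmt"
	"log"
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

// handleSubscriptionErrors sketches the error-handling goroutine after the
// fix: the deferred recover turns any panic (e.g. a nil dereference inside
// the contract bindings) into an error the operator can act on, and the
// backoff policy retries indefinitely with waits capped at 60 seconds.
func handleSubscriptionErrors(subscribe func() error, errCh chan<- error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errCh <- fmt.Errorf("subscription goroutine panicked: %v", r)
			}
		}()

		b := backoff.NewExponentialBackOff()
		b.MaxInterval = 60 * time.Second
		b.MaxElapsedTime = 0 // retry forever

		if err := backoff.Retry(subscribe, b); err != nil {
			errCh <- err
		}
	}()
}

func main() {
	errCh := make(chan error, 1)
	handleSubscriptionErrors(func() error { panic("nil binding dereference") }, errCh)
	log.Fatal(<-errCh) // the operator decides whether to shut down
}
```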
To Test:

- `brew install nginx`
- Update the `config-files` within the aligned-layer repo to use the `:8082` port and `/main/` path.
- `make anvil_start_with_block_time`
- `docker compose -f nginx-docker-compose.yaml up -d`
- `docker compose -f nginx-docker-compose.yaml stop`
- `docker compose -f nginx-docker-compose.yaml up -d`, then restart the aggregator with `make aggregator_start`

Type of change
Please delete options that are not relevant.
Checklist
testnet, everything else to staging