fix(operator): Operator panics when reconnect fails after max retries #1692
Conversation
core/retry.go (Outdated)
```go
RespondToTaskV2MaxElapsedTime = 0 // Maximum time all retries may take. `0` corresponds to no limit on the time of the retries.
RespondToTaskV2NumRetries     = 0 // Total number of retries attempted. If 0, retries indefinitely until maxElapsedTime is reached.

// Retry Parameters for SubscribeToNewTasksV3
```
Suggested change:

```diff
- // Retry Parameters for SubscribeToNewTasksV3
+ // Retry Parameters for SubscribeToNewTasks
```
addressed!
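For reference, a minimal sketch of how retry constants like the ones in this snippet are commonly wired into an exponential backoff. This assumes the cenkalti/backoff/v4 library; the `MaxInterval` constant and the helper function are hypothetical names, not code from this repo:

```go
package retry

import (
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

const (
	RespondToTaskV2MaxInterval    = 60 * time.Second // hypothetical cap on the wait between retries
	RespondToTaskV2MaxElapsedTime = 0                // 0 => no limit on the total time spent retrying
	RespondToTaskV2NumRetries     = 0                // 0 => retry until MaxElapsedTime (i.e. forever when it is 0)
)

// newRespondToTaskV2BackOff shows how the constants above could be turned
// into a backoff policy; the function name is hypothetical.
func newRespondToTaskV2BackOff() backoff.BackOff {
	b := backoff.NewExponentialBackOff()
	b.MaxInterval = RespondToTaskV2MaxInterval
	b.MaxElapsedTime = RespondToTaskV2MaxElapsedTime // zero disables the overall deadline
	if RespondToTaskV2NumRetries > 0 {
		return backoff.WithMaxRetries(b, uint64(RespondToTaskV2NumRetries))
	}
	return b // unbounded: keep retrying with capped intervals
}
```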
avilagaston9 left a comment
Works on my machine! Maybe we should add logs for each retry to warn that we are trying to reconnect. Otherwise, it might seem like nothing is happening.
Agreed! @JuArce, should we do this in a separate PR?
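If such logging lands (here or in a follow-up PR), one option is backoff's notify hook, which fires after each failed attempt. A minimal sketch, assuming cenkalti/backoff/v4 and a hypothetical `subscribe` function:

```go
package main

import (
	"errors"
	"log"
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

// subscribe stands in for the real resubscription call.
func subscribe() error {
	return errors.New("rpc connection refused")
}

func main() {
	b := backoff.NewExponentialBackOff()
	b.MaxInterval = 60 * time.Second
	b.MaxElapsedTime = 0 // no overall deadline

	// RetryNotify runs the callback after every failed attempt, so each
	// reconnection attempt leaves a visible trace in the logs. The retry
	// cap is only here so the example terminates; dropping it gives the
	// retry-forever behaviour this PR switches to.
	err := backoff.RetryNotify(subscribe, backoff.WithMaxRetries(b, 5),
		func(err error, wait time.Duration) {
			log.Printf("subscription failed: %v, retrying in %s", err, wait)
		})
	if err != nil {
		log.Printf("giving up: %v", err)
	}
}
```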
MauroToscano left a comment
I'm not convinced that not shutting down is the best solution. I'm putting this on hold until the description explains in which scenario the subscription retries for a long time while the node is still working.
Closing in favor of #1729
Fix: Operator panic when reconnect fails after max retries
Description
Closes #960.
`SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` are called when the operator starts. Both spawn a goroutine to handle errors from the subscriptions to two RPCs, a primary and a fallback. Upon erroring, the goroutine attempts to subscribe again, up to a retry limit.

When `SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` exhaust the maximum number of retries, the respective subscribe functions panic due to dereferencing a nil reference within the auto-generated contract bindings.

If one of the subscriptions errors and then fails to reconnect, the operator panics even though the other subscription (the fallback) is still available and the operator could continue to function as normal.

If both connections error and fail to reconnect, the operator should shut down.

If the operator fails to subscribe initially, the operator should shut down.
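For context, a minimal sketch of the failure mode described above; all names are hypothetical stand-ins for the operator's subscription logic and the generated bindings:

```go
package main

import "fmt"

// subscription stands in for the subscription type returned by the
// auto-generated contract bindings.
type subscription struct{ errs chan error }

func (s *subscription) Err() <-chan error { return s.errs }

// resubscribe mimics the pre-fix behaviour: after maxRetries failed
// attempts it gives up and hands back a nil subscription instead of
// surfacing an error to the caller.
func resubscribe(maxRetries int) *subscription {
	for i := 0; i < maxRetries; i++ {
		// every attempt fails, e.g. both RPC endpoints are unreachable
	}
	return nil
}

func main() {
	sub := resubscribe(3)
	// The error-handling goroutine then uses the subscription without a
	// nil check, which is where the operator crashed.
	fmt.Println(<-sub.Err()) // nil pointer dereference -> panic
}
```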
This PR addresses the issue by:

1. Adding a `defer()` to the goroutine that converts the panic into an error before the operator process exits.
2. Changing the retry parameters for `SubscribeToNewTasksV2Retryable` and `SubscribeToNewTasksV3Retryable` to retry indefinitely, with retry intervals capped at 60 seconds.

Both changes are sketched below.
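A minimal sketch of the shape of this fix, assuming cenkalti/backoff/v4 for the retry policy; the function and channel names are hypothetical, not the operator's actual API:

```go
package main

import (
	"fmt"
	"log"
	"time"

	backoff "github.com/cenkalti/backoff/v4"
)

// handleSubscriptionErrors sketches the error-handling goroutine after the
// fix: the deferred recover turns any panic (e.g. a nil dereference inside
// the contract bindings) into an error the operator can act on, and the
// backoff policy retries indefinitely with waits capped at 60 seconds.
func handleSubscriptionErrors(subscribe func() error, errCh chan<- error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errCh <- fmt.Errorf("subscription goroutine panicked: %v", r)
			}
		}()

		b := backoff.NewExponentialBackOff()
		b.MaxInterval = 60 * time.Second
		b.MaxElapsedTime = 0 // retry forever

		if err := backoff.Retry(subscribe, b); err != nil {
			errCh <- err
		}
	}()
}

func main() {
	errCh := make(chan error, 1)
	handleSubscriptionErrors(func() error { panic("nil binding dereference") }, errCh)
	log.Fatal(<-errCh) // the operator decides whether to shut down
}
```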
To Test:

- `brew install nginx`
- Update the `config-files` within the aligned-layer repo to use the `:8082` port and `/main/` path.
- `make anvil_start_with_block_time`
- `docker compose -f nginx-docker-compose.yaml up -d`
- `docker compose -f nginx-docker-compose.yaml stop`
- `docker compose -f nginx-docker-compose.yaml up -d`, then restart the aggregator with `make aggregator_start`

Type of change
Please delete options that are not relevant.
Checklist
testnet, everything else to staging