More aggressive retry strategy for NIO#2083
Conversation
(but only when OptionMaxChannelReopens is set)
| // this error isn't marked as retryable since the channel is closed; | ||
| if (reopens < maxChannelReopens && ( | ||
| exs.getMessage().contains("Connection closed prematurely") | ||
| || exs.getCause() instanceof javax.net.ssl.SSLHandshakeException |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| exs.getMessage().contains("Connection closed prematurely") | ||
| || exs.getCause() instanceof javax.net.ssl.SSLHandshakeException | ||
| || exs.getCause() instanceof java.io.EOFException | ||
| || exs.getCause() instanceof java.net.SocketTimeoutException |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| || exs.getCause() instanceof javax.net.ssl.SSLHandshakeException | ||
| || exs.getCause() instanceof java.io.EOFException | ||
| || exs.getCause() instanceof java.net.SocketTimeoutException | ||
| )) { |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| if (exs.getMessage().contains("Connection closed prematurely") && reopens < maxChannelReopens) { | ||
| // this error isn't marked as retryable since the channel is closed; | ||
| if (reopens < maxChannelReopens && ( | ||
| exs.getMessage().contains("Connection closed prematurely") |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| continue; | ||
| } | ||
| throw exs; | ||
| throw new StorageException(exs.getCode(), "Retry failed", exs); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
|
|
||
| private void sleepForAttempt(int attempt) { | ||
| long delay = 500; | ||
| // exponential backoff |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
|
Added my feedback @jean-philippe-martin. Thanks for doing this, it will (hopefully) solve a major issue for us with our use of the gcloud NIO functionality! |
- longer timeout - check exs.getCause() as well - retry for more exception classes - better phrasing
| return exs != null && (exs.getMessage().contains("Connection closed prematurely") | ||
| || exs.getCause() instanceof javax.net.ssl.SSLException | ||
| || exs.getCause() instanceof java.io.EOFException | ||
| || exs.getCause() instanceof java.net.SocketException); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| if (exs.getMessage().contains("Connection closed prematurely") && reopens < maxChannelReopens) { | ||
| // this error isn't marked as retryable since the channel is closed; | ||
| // but here at this higher level we can retry it. | ||
| if (reopens < maxChannelReopens && (isReopenable(exs) || isReopenable(exs.getCause()))) { |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
Also added a test for getCause().getCause()
|
👍 Latest version looks good to me -- hope we can get this merged soon! |
| Throwable throwable = exs; | ||
| while (throwable != null) { | ||
| if (throwable.getMessage().contains("Connection closed prematurely") | ||
| || throwable.getCause() instanceof javax.net.ssl.SSLException |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| int amt; | ||
| int retries = 0; | ||
| int maxRetries = 3; | ||
| int maxRetries = Math.max(3, maxChannelReopens); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| long maxDelay = 120_000; | ||
| // exponential backoff | ||
| for (int i=0; i<attempt; i++) { | ||
| delay *= 2; |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| long delay = 1000; | ||
| long maxDelay = 120_000; | ||
| // exponential backoff | ||
| for (int i=0; i<attempt; i++) { |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| long maxDelay = 120_000; | ||
| // exponential backoff | ||
| for (int i=0; i<attempt; i++) { | ||
| delay *= 2; |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
Simplify backoff computation, add loop variant variable, and code formatting changes.
| long firstdelay = 1000; | ||
| long maxDelay = 120_000; | ||
| // exponential backoff | ||
| long delay = firstdelay * Math.min(1 << attempt, maxDelay); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
vam-google
left a comment
There was a problem hiding this comment.
LGTM, but please address the last comment about if statement first.
| long delay = 1000L * (1L << Math.min(attempt, 7)); | ||
| try { | ||
| Thread.sleep((attempt - 1) * 500); | ||
| Thread.sleep(delay); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| int maxDepth = 10; | ||
| while (throwable != null && maxDepth-- > 0) { | ||
| Throwable cause = throwable.getCause(); | ||
| if (throwable.getMessage().contains("Connection closed prematurely") |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
Also simplify 'isReopenable' code, make its test harder, and add a comment about why 2x backoff.
|
@vam-google I'm not sure which comment you mean. What I see is the last comment on the if statement is the concern about the loop not terminating but I have addressed that with |
| // ensures finite iteration | ||
| int maxDepth = 10; | ||
| while (throwable != null && maxDepth-- > 0) { | ||
| if (throwable.getMessage().contains("Connection closed prematurely") |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
|
@jean-philippe-martin sorry, I meant the comment about extracting check from the loop. But I was wrong about it, wo it is ok. But I added one more comment. Please address it, and then LGTM. |
| } else if (exs.isRetryable() || exs.getCode() == 500 || exs.getCode() == 503) { | ||
| retries++; | ||
| if (retries > maxRetries) { | ||
| throw new StorageException(exs.getCode(), "All retries failed", exs); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
|
Ok, I'm going to merge this pull request. @jean-philippe-martin please confirm that you are done with your changes. |
(comment explains how you may get a retriable exception that says "retries failed")
|
Now that I added the comment, I'm done! Thanks for the review and for doing the merge (since I cannot). I suppose we want to wait for the Travis test to pass, though, yes? |
|
Great, Travis is green! @vam-google I leave it to you to merge in the code. Thank you very much! |
|
Merged. Thanks @jean-philippe-martin for your work! |
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
(but only when OptionMaxChannelReopens is set)
Testing shows that sometimes storage will fail with one of the errors below when load is high. In these situations, the ChannelReopen policy is to attempt to re-open the channel after a delay.