Skip to content

More aggressive retry strategy for NIO#2083

Merged
vam-google merged 9 commits intogoogleapis:masterfrom
jean-philippe-martin:jp_aggressive_reopen
May 31, 2017
Merged

More aggressive retry strategy for NIO#2083
vam-google merged 9 commits intogoogleapis:masterfrom
jean-philippe-martin:jp_aggressive_reopen

Conversation

@jean-philippe-martin
Copy link
Copy Markdown

(but only when OptionMaxChannelReopens is set)

Testing shows that sometimes storage will fail with one of the errors below when load is high. In these situations, the ChannelReopen policy is to attempt to re-open the channel after a delay.

(but only when OptionMaxChannelReopens is set)
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 22, 2017
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.01%) to 80.947% when pulling b1408f0 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

// this error isn't marked as retryable since the channel is closed;
if (reopens < maxChannelReopens && (
exs.getMessage().contains("Connection closed prematurely")
|| exs.getCause() instanceof javax.net.ssl.SSLHandshakeException

This comment was marked as spam.

This comment was marked as spam.

exs.getMessage().contains("Connection closed prematurely")
|| exs.getCause() instanceof javax.net.ssl.SSLHandshakeException
|| exs.getCause() instanceof java.io.EOFException
|| exs.getCause() instanceof java.net.SocketTimeoutException

This comment was marked as spam.

This comment was marked as spam.

|| exs.getCause() instanceof javax.net.ssl.SSLHandshakeException
|| exs.getCause() instanceof java.io.EOFException
|| exs.getCause() instanceof java.net.SocketTimeoutException
)) {

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

if (exs.getMessage().contains("Connection closed prematurely") && reopens < maxChannelReopens) {
// this error isn't marked as retryable since the channel is closed;
if (reopens < maxChannelReopens && (
exs.getMessage().contains("Connection closed prematurely")

This comment was marked as spam.

This comment was marked as spam.

continue;
}
throw exs;
throw new StorageException(exs.getCode(), "Retry failed", exs);

This comment was marked as spam.

This comment was marked as spam.


private void sleepForAttempt(int attempt) {
long delay = 500;
// exponential backoff

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@droazen
Copy link
Copy Markdown

droazen commented May 25, 2017

Added my feedback @jean-philippe-martin. Thanks for doing this, it will (hopefully) solve a major issue for us with our use of the gcloud NIO functionality!

- longer timeout
- check exs.getCause() as well
- retry for more exception classes
- better phrasing
return exs != null && (exs.getMessage().contains("Connection closed prematurely")
|| exs.getCause() instanceof javax.net.ssl.SSLException
|| exs.getCause() instanceof java.io.EOFException
|| exs.getCause() instanceof java.net.SocketException);

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

if (exs.getMessage().contains("Connection closed prematurely") && reopens < maxChannelReopens) {
// this error isn't marked as retryable since the channel is closed;
// but here at this higher level we can retry it.
if (reopens < maxChannelReopens && (isReopenable(exs) || isReopenable(exs.getCause()))) {

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.03%) to 80.903% when pulling b2bb415 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

Also added a test for getCause().getCause()
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.03%) to 80.902% when pulling 2fddc77 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.03%) to 80.903% when pulling 7c81e40 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

@droazen
Copy link
Copy Markdown

droazen commented May 26, 2017

👍 Latest version looks good to me -- hope we can get this merged soon!

Throwable throwable = exs;
while (throwable != null) {
if (throwable.getMessage().contains("Connection closed prematurely")
|| throwable.getCause() instanceof javax.net.ssl.SSLException

This comment was marked as spam.

This comment was marked as spam.

int amt;
int retries = 0;
int maxRetries = 3;
int maxRetries = Math.max(3, maxChannelReopens);

This comment was marked as spam.

This comment was marked as spam.

long maxDelay = 120_000;
// exponential backoff
for (int i=0; i<attempt; i++) {
delay *= 2;

This comment was marked as spam.

This comment was marked as spam.

long delay = 1000;
long maxDelay = 120_000;
// exponential backoff
for (int i=0; i<attempt; i++) {

This comment was marked as spam.

This comment was marked as spam.

long maxDelay = 120_000;
// exponential backoff
for (int i=0; i<attempt; i++) {
delay *= 2;

This comment was marked as spam.

This comment was marked as spam.

Simplify backoff computation, add loop variant variable,
and code formatting changes.
long firstdelay = 1000;
long maxDelay = 120_000;
// exponential backoff
long delay = firstdelay * Math.min(1 << attempt, maxDelay);

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.03%) to 80.903% when pulling 1593376 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.02%) to 80.914% when pulling cc035c9 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

Copy link
Copy Markdown
Contributor

@vam-google vam-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please address the last comment about if statement first.

long delay = 1000L * (1L << Math.min(attempt, 7));
try {
Thread.sleep((attempt - 1) * 500);
Thread.sleep(delay);

This comment was marked as spam.

This comment was marked as spam.

int maxDepth = 10;
while (throwable != null && maxDepth-- > 0) {
Throwable cause = throwable.getCause();
if (throwable.getMessage().contains("Connection closed prematurely")

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

Also simplify 'isReopenable' code, make its test harder,
and add a comment about why 2x backoff.
@jean-philippe-martin
Copy link
Copy Markdown
Author

@vam-google I'm not sure which comment you mean. What I see is the last comment on the if statement is the concern about the loop not terminating but I have addressed that with maxDepth. I apologize, but could you please point me to the part that still needs a fix?

Copy link
Copy Markdown
Contributor

@vam-google vam-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment

// ensures finite iteration
int maxDepth = 10;
while (throwable != null && maxDepth-- > 0) {
if (throwable.getMessage().contains("Connection closed prematurely")

This comment was marked as spam.

This comment was marked as spam.

@vam-google
Copy link
Copy Markdown
Contributor

vam-google commented May 30, 2017

@jean-philippe-martin sorry, I meant the comment about extracting check from the loop. But I was wrong about it, wo it is ok. But I added one more comment. Please address it, and then LGTM.

Copy link
Copy Markdown
Contributor

@vam-google vam-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

} else if (exs.isRetryable() || exs.getCode() == 500 || exs.getCode() == 503) {
retries++;
if (retries > maxRetries) {
throw new StorageException(exs.getCode(), "All retries failed", exs);

This comment was marked as spam.

This comment was marked as spam.

@vam-google
Copy link
Copy Markdown
Contributor

Ok, I'm going to merge this pull request. @jean-philippe-martin please confirm that you are done with your changes.

(comment explains how you may get a retriable exception that says
"retries failed")
@jean-philippe-martin
Copy link
Copy Markdown
Author

Now that I added the comment, I'm done! Thanks for the review and for doing the merge (since I cannot). I suppose we want to wait for the Travis test to pass, though, yes?

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.05%) to 80.981% when pulling 4161392 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.05%) to 80.981% when pulling aba972d on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.05%) to 80.981% when pulling 4161392 on jean-philippe-martin:jp_aggressive_reopen into 0fbd82e on GoogleCloudPlatform:master.

@jean-philippe-martin
Copy link
Copy Markdown
Author

Great, Travis is green! @vam-google I leave it to you to merge in the code. Thank you very much!

@vam-google vam-google merged commit 6aa8964 into googleapis:master May 31, 2017
@vam-google
Copy link
Copy Markdown
Contributor

Merged. Thanks @jean-philippe-martin for your work!

@jean-philippe-martin jean-philippe-martin deleted the jp_aggressive_reopen branch May 31, 2017 16:39
suztomo pushed a commit to suztomo/google-cloud-java that referenced this pull request Mar 23, 2026
meltsufin pushed a commit that referenced this pull request Apr 29, 2026
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
meltsufin pushed a commit that referenced this pull request May 1, 2026
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants