-
Notifications
You must be signed in to change notification settings - Fork 1.5k
If onError fails with cluster_version_changed, retry the error on the new transaction. #1734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
If onError fails with cluster_version_changed, retry the error on the new transaction. #1734
Conversation
atn34
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (other than documentation build)
| Fixes | ||
| ----- | ||
|
|
||
| * If a cluster is upgraded during an `onError` call, the cluster could return a `cluster_version_changed` error. `(PR #1734) <https://github.com/apple/foundationdb/pull/1734>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to break the documentation build
Warning, treated as error:
/this_is_some_very_long_name_dir_needed_to_fix_a_bug_with_debug_rpms/foundationdb/documentation/sphinx/source/release-notes.rst:17: ERROR: Broken role invoked
| f = abortableFuture(f, tr.onChange); | ||
|
|
||
| return flatMapThreadFuture<Void, Void>(f, [this, e](ErrorOr<Void> ready) { | ||
| if(!ready.isError() || ready.getError().code() != error_code_cluster_version_changed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so the exact problem here is:
- The user sets up two client libraries and then uses one of them to connect to the cluster.
- They do a thing that results in an error (like, say, get a transaction conflict).
- While waiting on the error, the cluster is upgraded.
- The
onErrorcall used to propagate that error up, but now it callsonErrorin the new thing and updates the transaction to use the new client library.
I think that means that for existing users of the retry loops, they would have just seen a retriable error propagate up rather than, say, seeing the transaction retry loop try again with an incompatible protocol version (right?). I ask because I think if it were the latter, then there would be a somewhat compelling case that this should go onto 6.1 as a patch, but I think it's fine on master if only the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good summary of the problem.
The old behavior should have resulted in it bubbling up a retryable error, but since it came from onError it's likely they wouldn't have actually attempted to retry it.
No description provided.