Switch cluster file feature #1413
Conversation
This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.
That makes sense. The draft implementation doesn't currently work, since as implemented a transaction can span a cluster switch. What's the testing story for multi-version transactions? Can the simulator test that code?
The simulator does use multi-version transactions in some tests, but I don't think it exercises the switching logic or the part where it calls out to external libraries. In your case, though, it may be a little bit easier to test switching if you have multiple clusters being simulated.
I've been thinking about this some more, and there's a weird externally visible effect that I think can occur no matter how we do this (if we try doing this transparently in the c client). If a client sets the same read version for multiple transactions, and reads a key in each transaction, each transaction won't necessarily return the same value during a switch. One idea would be to allow clients to set the expected [...]. Another idea would be to have the application manage hot standby logic, but this would be unfortunate since it seems to be tricky to get right.
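To make the anomaly concrete, here is a minimal sketch using the existing C API (`fdb_transaction_set_read_version` and `fdb_transaction_get` are real calls); the comment marking where the switch occurs is the hypothetical part, and error handling is omitted:

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

/* Two transactions share one read version, so the application expects
 * identical reads. If a transparent cluster switch happens between the
 * reads, the second is served by the new cluster, which may hold a
 * different value at that version. Error handling omitted for brevity. */
void anomaly_sketch(FDBDatabase* db, int64_t read_version) {
    FDBTransaction *tr1, *tr2;
    fdb_database_create_transaction(db, &tr1);
    fdb_database_create_transaction(db, &tr2);

    fdb_transaction_set_read_version(tr1, read_version);
    fdb_transaction_set_read_version(tr2, read_version);

    FDBFuture* v1 = fdb_transaction_get(tr1, (const uint8_t*)"k", 1, 0);
    /* ...suppose a transparent switch to the standby happens here... */
    FDBFuture* v2 = fdb_transaction_get(tr2, (const uint8_t*)"k", 1, 0);

    /* v1 and v2 can resolve to different values for the same version. */
    fdb_future_destroy(v1);
    fdb_future_destroy(v2);
    fdb_transaction_destroy(tr1);
    fdb_transaction_destroy(tr2);
}
```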
We have another idea for how to get around the problem of reading using a read version from one cluster on a different cluster: we can initialize a `minAcceptableReadVersion` variable [...]. I updated this PR with an implementation of that idea.
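A minimal sketch of how I understand the idea, with hypothetical names (this is not the PR's actual code): remember a version at least as large as anything the old cluster handed out, and reject smaller read versions on the new cluster:

```c
#include <stdint.h>

/* Hypothetical sketch of the minAcceptableReadVersion idea; the names and
 * structure are illustrative, not the PR's implementation. */
typedef struct {
    int64_t min_acceptable_read_version;
} ClusterSwitchState;

/* On a switch, record a version at least as large as any read version the
 * old cluster could have handed out (the new cluster's versions sit above
 * this if its version was advanced past the old cluster's). */
static void on_cluster_switch(ClusterSwitchState* s, int64_t version_bound) {
    s->min_acceptable_read_version = version_bound;
}

/* Before serving a read on the new cluster: versions below the bound may
 * have come from the old cluster, so reject them and force the transaction
 * to retry with a fresh read version (e.g. with transaction_too_old). */
static int read_version_is_acceptable(const ClusterSwitchState* s, int64_t rv) {
    return rv >= s->min_acceptable_read_version;
}
```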
Update: this PR now contains an implementation of the above minAcceptableReadVersion idea. The test I added fails (as expected) with the minAcceptableReadVersion logic missing, and passes (or times out) when the logic is present. I'm also happy to revert the part where writing to [...]. TODO: [...]
I addressed all the above TODOs.
It seems that most of the complexity of this change is juggling read versions carefully, so I don't think the multi-version client actually buys us much here.
Yeah, using it was based on the assumption that you'd be aborting in-flight transactions, but that doesn't seem to be the case here (or maybe at least not directly).
Darn, it seems that the 'writing to [...]' approach runs into a problem.
It's possible you could go a similar route to what we do with options, where you store any modifications to this key and apply them when you connect to a new cluster. I guess you probably also need the ability to specify where an incompatible client is trying to test its connection, and I think this is actually a problem that needs to be solved more generally. See: [...]. It's possible a solution to that issue could be used more generally for your problem as well.
Hmm, not sure. Maybe we should try it again?
@fdb-build test this please
Looks like it passed! Btw, this is based on #1597, so if this gets merged before that we can just close #1597.
ajbeamon left a comment
Will the switch require that the cluster the clients are currently connected to be up? Is there a mechanism to indicate or determine when there is no longer any dependence on the original cluster? I ask because if someone plans to switch clusters in order to bring down the original cluster, you'd want to make sure that no client had any outstanding operation that would fail to complete in its absence.
@ajbeamon PTAL
No, clients can switch away from a down cluster.
My thinking is that any outstanding operation will get retried, either because the db was locked or due to a timeout because the cluster was brought down. I'm not sure how this interacts with [...]. I'm also imagining external orchestration of locking clusters, so that by the time you want to switch away from a cluster it doesn't matter if any outstanding operation failed to complete.
@fdb-build test this please |
Abort reads when connection file changes
This reverts commit f367fca.
Adding the cycle workload failed because every workload is run with a [...]
Actually, it looks like the switch is not working as expected: it's currently hanging, even though the test is passing. We will need to fix this bug before this PR can be merged.
Update the implementation to interact with the new "don't maintain a connection to the cluster controller unless necessary" change, and unlock the originalDB at the end of the workload.
The issues @tclinken mentioned above are resolved now. One remaining issue is that the watch future here fails with [...]
Throw transaction_too_old instead of all_alternatives_failed when the cluster file changes while a read request is outstanding
ajbeamon left a comment
This looks good to me. @etschannen said he also wanted to take a quick look at it, so I'll let him do so before merging.
@fdb-build test this please |
Unlocking the second database can take a while because the local ratekeeper throttles reads until the updateStorage actor catches up after advanceVersion
This is a draft implementation of feature one described below, although in this implementation one writes to `\xff\xff/cluster_file_path` instead of using an additional symbol in the API. Does this make sense to you guys? @alexmiller-apple @mpilman @dyoungworth
I plan to add a workload to test this in the simulator once consensus about the approach and desired interface is reached. It seems to work for the manual testing I've done.
Overall summary of what this is trying to accomplish
I'm interested in two features relevant to hot standbys and cluster connection files.
My understanding is the following sequence of steps accomplishes switching from a primary cluster `p` to a hot standby `s` without compromising any of the guarantees made by foundationdb. Let `s` begin as a locked cluster.

1. Lock `p`, so that clients can't write to `p` or get read versions from `p` unless they are "lock aware". Clients not writing to `p` makes step 2 possible by fixing the set of key-value pairs in `p`. Clients not reading from `p` (or at least not getting an 'up to date' read version from `p`) means that clients who think `p` is still the active cluster won't read something that has since been updated on `s`. Disallowing clients from getting read versions larger than some version v obtained after locking `p` makes it possible in step 3 to guarantee clients see monotonically increasing commit versions.
2. Ensure `p` and `s` contain the same set of key-value pairs (let's ignore how to do this for now).
3. Ensure the version of `s` is at least the commit version of `p` + 1e8, so that clients observe monotonically increasing commit versions. (This can be done with `\xff/minRequiredCommitVersion`; a sketch follows the list.)
4. Unlock `s` and inform all clients that `s` is now the active cluster.
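As a concrete illustration of step 3, here is a minimal sketch that writes the `\xff/minRequiredCommitVersion` key through the existing C API. The 8-byte little-endian value encoding is my assumption, and the commit/retry loop and error handling are elided:

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <string.h>

/* Step 3 sketch: require s's next commit version to be at least
 * min_version by writing the \xff/minRequiredCommitVersion system key.
 * The raw little-endian encoding of the version is an assumption. */
static void advance_version(FDBTransaction* tr, int64_t min_version) {
    const char* key = "\xff/minRequiredCommitVersion";
    fdb_transaction_set_option(tr, FDB_TR_OPTION_ACCESS_SYSTEM_KEYS, NULL, 0);
    fdb_transaction_set(tr, (const uint8_t*)key, (int)strlen(key),
                        (const uint8_t*)&min_version, (int)sizeof(min_version));
    /* ...then fdb_transaction_commit(tr) and wait on the returned future. */
}
```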
My understanding of the current state of the art for how to manage this from the client side is to have separate `p` and `s` databases and have a wrapper database-like object that delegates `create_transaction` to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.

Feature one would be to move the above "retry transactions on the new database and recreate pending watches" logic into the c client itself, and try to make this as transparent as possible. I propose adding something like the following to the C API:
```c
fdb_error_t fdb_change_cluster_file_path(FDBDatabase* database,
                                         const char* cluster_file_path);
```
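To make the intended usage concrete, here is a sketch under the proposal above; `fdb_create_database` is existing API, `fdb_change_cluster_file_path` is only the proposed symbol, and both file paths are made up:

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

/* Open a database against the primary cluster, then later repoint the
 * same FDBDatabase at the standby's cluster file. The switch call is the
 * proposed symbol above, not existing API; paths are illustrative. */
fdb_error_t switch_to_standby(FDBDatabase** db) {
    fdb_error_t err = fdb_create_database("/etc/foundationdb/fdb.cluster", db);
    if (err) return err;

    /* ...later, when operators decide to fail over... All transactions
     * created from *db afterwards would target the standby cluster. */
    return fdb_change_cluster_file_path(*db, "/etc/foundationdb/standby.cluster");
}
```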
The idea would be to introduce the concept of a cluster-file-like interface to the C api, and have multiple implementations. One would be the current local disk implementation. Another could be a key in a remote FDB database.
This could be done by adding an `FDBClusterConnectionInterface` (or a better name) type to the C api, and then adding symbols like:
```c
FDBClusterConnectionInterface* fdb_create_cluster_connection_interface_from_file(const char* path);
FDBClusterConnectionInterface* fdb_create_cluster_connection_interface_from_fdb_key(FDBDatabase* database, const char* key);
fdb_error_t fdb_change_cluster_connection_interface(FDBDatabase* database, FDBClusterConnectionInterface* interface);
```

Then clients could watch some key in a database like
`current_cluster_key`, and change their cluster connection interface to whatever key that key references. Adding a standby cluster and switching all clients to it would then look something like this (see the sketch after this list):

1. Clients watch `current_cluster_key` in a service discovery database for the name of the current cluster.
2. Create a key `name_of_new_cluster` in the service discovery fdb database with the current connection string for the new cluster.
3. Set `current_cluster_key` to `name_of_new_cluster`. Clients change their cluster connection interface to the key `name_of_new_cluster` in the service discovery database, and then proceed to keep the connection string for `name_of_new_cluster` up to date.
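Here is a hedged sketch of the client side of that flow. `fdb_transaction_watch`, `fdb_transaction_commit`, and the future functions are existing API; the two `fdb_*_cluster_connection_interface*` calls are only the proposed symbols, and the key names come from the list above:

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <string.h>

/* Follow current_cluster_key in the service discovery database and point
 * db's connection interface at whatever key it names. Error handling,
 * retry loops, and network setup are omitted; the stored value is assumed
 * to be NUL-terminated for brevity. */
void follow_current_cluster(FDBDatabase* discovery_db, FDBDatabase* db) {
    const char* cur = "current_cluster_key";
    for (;;) {
        FDBTransaction* tr;
        fdb_database_create_transaction(discovery_db, &tr);

        /* Read the name of the current cluster, e.g. "name_of_new_cluster". */
        FDBFuture* vf = fdb_transaction_get(tr, (const uint8_t*)cur, (int)strlen(cur), 0);
        fdb_future_block_until_ready(vf);
        fdb_bool_t present;
        const uint8_t* name;
        int name_len;
        fdb_future_get_value(vf, &present, &name, &name_len);

        if (present) {
            /* Proposed API: repoint the connection interface at that key. */
            FDBClusterConnectionInterface* iface =
                fdb_create_cluster_connection_interface_from_fdb_key(discovery_db, (const char*)name);
            fdb_change_cluster_connection_interface(db, iface);
        }
        fdb_future_destroy(vf);

        /* Watches only fire once the creating transaction commits. */
        FDBFuture* wf = fdb_transaction_watch(tr, (const uint8_t*)cur, (int)strlen(cur));
        FDBFuture* cf = fdb_transaction_commit(tr);
        fdb_future_block_until_ready(cf);
        fdb_future_destroy(cf);

        /* Block until current_cluster_key changes, then loop around. */
        fdb_future_block_until_ready(wf);
        fdb_future_destroy(wf);
        fdb_transaction_destroy(tr);
    }
}
```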