Conversation

@atn34 (Collaborator) commented Apr 3, 2019

This is a draft implementation of feature one described below, although in this implementation one writes to \xff\xff/cluster_file_path instead of using an additional symbol in the API.

Does this make sense to you guys? @alexmiller-apple @mpilman @dyoungworth

I plan to add a workload to test this in the simulator once consensus about the approach and desired interface is reached. It seems to work for the manual testing I've done.

Overall summary of what this is trying to accomplish

I'm interested in two features relevant to hot standbys and cluster connection files.

My understanding is that the following sequence of steps switches from a primary cluster p to a hot standby s without compromising any of the guarantees made by FoundationDB. Let s begin as a locked cluster.

  1. Lock p so that clients can't write to p or get read versions from p unless they are "lock aware". Clients not writing to p makes step 2 possible by fixing the set of key-value pairs in p. Clients not reading from p (or at least not getting an 'up to date' read version from p) means that clients who think p is still the active cluster won't read something that has since been updated on s. Disallowing clients from getting read versions larger than some version v obtained after locking p makes it possible in step 3 to guarantee clients see monotonically increasing commit versions.
  2. Ensure that p and s contain the same set of key-value pairs (let's ignore how to do this for now)
  3. Ensure the commit version of s is at least the commit version of p + 1e8, so that clients observe monotonically increasing commit versions. (This can be done with \xff/minRequiredCommitVersion; a sketch follows this list.)
  4. Unlock s and inform all clients that s is now the active cluster.
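
To make step 3 concrete, here is a rough sketch against the C API. The value encoding of \xff/minRequiredCommitVersion and the option usage are assumptions on my part, and the usual on_error retry loop is omitted, so treat this as illustrative only:

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: raise the minimum commit version of the standby
 * cluster s so that its versions run ahead of the primary's. Assumes the
 * value of \xff/minRequiredCommitVersion is an 8-byte little-endian
 * version. */
fdb_error_t bump_min_commit_version(FDBDatabase* s, int64_t primary_version) {
    FDBTransaction* tr = NULL;
    fdb_error_t err = fdb_database_create_transaction(s, &tr);
    if (err) return err;

    /* Writing a system key on a locked cluster needs these options. */
    fdb_transaction_set_option(tr, FDB_TR_OPTION_ACCESS_SYSTEM_KEYS, NULL, 0);
    fdb_transaction_set_option(tr, FDB_TR_OPTION_LOCK_AWARE, NULL, 0);

    int64_t target = primary_version + 100000000; /* commit version of p + 1e8 */
    uint8_t value[8];
    memcpy(value, &target, sizeof(target)); /* assumed little-endian host/encoding */

    const char* key = "\xff/minRequiredCommitVersion";
    fdb_transaction_set(tr, (const uint8_t*)key, (int)strlen(key), value, sizeof(value));

    FDBFuture* f = fdb_transaction_commit(tr);
    fdb_future_block_until_ready(f);
    err = fdb_future_get_error(f);
    fdb_future_destroy(f);
    fdb_transaction_destroy(tr);
    return err;
}
```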

My understanding of the current state of the art for how to manage this from the client side is to have separate p and s databases and have a wrapper database-like object that delegates create_transaction to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.
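
For illustration, that wrapper might look roughly like this (hypothetical application-side helper, not an existing API; how the application learns about a switch and recreates its watches is left out):

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

/* Hypothetical application-side wrapper: delegate transaction creation to
 * whichever cluster the application currently believes is active. Pending
 * watches would have to be recreated by the application when the flag flips. */
typedef struct {
    FDBDatabase* primary;           /* p */
    FDBDatabase* standby;           /* s */
    volatile int standby_is_active; /* set when the application learns of a switch */
} SwitchableDatabase;

fdb_error_t switchable_create_transaction(SwitchableDatabase* db, FDBTransaction** out) {
    FDBDatabase* active = db->standby_is_active ? db->standby : db->primary;
    return fdb_database_create_transaction(active, out);
}
```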

Feature one would be to move the above "retry transactions on the new database and recreate pending watches" logic into the C client itself, and try to make this as transparent as possible. I propose adding something like
fdb_error_t fdb_change_cluster_file_path(FDBDatabase* database, const char* cluster_file_path); to the C API.
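
With that symbol, the application side of a switch could collapse to something like the following. This is only a sketch: the function is the proposal above (it does not exist yet) and the cluster file path is made up.

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <stdio.h>

/* Sketch of the proposed API: when the application learns that the standby
 * is now active, re-point the existing FDBDatabase at the standby's cluster
 * file. The client library would then retry in-flight transactions and
 * recreate pending watches against the new cluster. */
void switch_to_standby(FDBDatabase* db) {
    /* Path is made up for the example. */
    fdb_error_t err = fdb_change_cluster_file_path(db, "/etc/foundationdb/standby.cluster");
    if (err != 0)
        fprintf(stderr, "cluster switch failed: %s\n", fdb_get_error(err));
}
```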

Feature two is somewhat orthogonal, and aims to simplify the process of adding a new standby cluster and distributing a cluster connection file to all clients.

The idea would be to introduce the concept of a cluster-file-like interface to the C API, and have multiple implementations. One would be the current local-disk implementation. Another could be a key in a remote FDB database.

This could be done by adding a FDBClusterConnectionInterface (or a better name) type to the C API, and then adding symbols like

  • FDBClusterConnectionInterface* fdb_create_cluster_connection_interface_from_file(const char* path)
  • FDBClusterConnectionInterface* fdb_create_cluster_connection_interface_from_fdb_key(FDBDatabase* database, const char* key)
  • fdb_error_t fdb_change_cluster_connection_interface(FDBDatabase* database, FDBClusterConnectionInterface* interface)

Then clients could watch some key (say, current_cluster_key) in a service discovery database, and change their cluster connection interface to whatever key that key references. Adding a standby cluster and switching all clients to it would then look something like this:

  1. Assume clients are watching the key current_cluster_key in a service discovery database for the name of the current cluster
  2. Provision a new cluster
  3. Set a key name_of_new_cluster in the service discovery FDB database with the current connection string for the new cluster
  4. When you get to the 'inform all clients of the active cluster' step in your switch, you set current_cluster_key to name_of_new_cluster. Clients change their cluster connection interface to the key name_of_new_cluster in the service discovery database, and then proceed to keep the connection string for name_of_new_cluster up to date. The client side of this flow is sketched below.
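
Sketched out, the client side might look like this. Every fdb_create_cluster_connection_interface_* and fdb_change_cluster_connection_interface call is the proposal above (nothing here exists yet); for brevity the value of current_cluster_key is assumed to be NUL-terminated and error handling is skipped.

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

/* Illustrative only: follow current_cluster_key in the service discovery
 * database and re-point the data database at whichever cluster it names. */
void follow_current_cluster(FDBDatabase* discovery, FDBDatabase* data) {
    for (;;) {
        FDBTransaction* tr = NULL;
        if (fdb_database_create_transaction(discovery, &tr) != 0)
            return;

        /* Read the name of the currently active cluster. */
        FDBFuture* get = fdb_transaction_get(
            tr, (const uint8_t*)"current_cluster_key", 19, 0 /* not snapshot */);
        fdb_future_block_until_ready(get);

        fdb_bool_t present = 0;
        const uint8_t* name = NULL;
        int name_len = 0;
        fdb_future_get_value(get, &present, &name, &name_len);

        if (present) {
            /* Proposed API: treat the key <name> in the discovery database
             * as the source of the connection string. */
            FDBClusterConnectionInterface* iface =
                fdb_create_cluster_connection_interface_from_fdb_key(discovery, (const char*)name);
            fdb_change_cluster_connection_interface(data, iface);
        }

        /* Set a watch and commit so it takes effect, then block until
         * current_cluster_key changes and loop around. */
        FDBFuture* watch = fdb_transaction_watch(tr, (const uint8_t*)"current_cluster_key", 19);
        FDBFuture* commit = fdb_transaction_commit(tr);
        fdb_future_block_until_ready(commit);
        fdb_future_destroy(commit);
        fdb_future_destroy(get);

        fdb_future_block_until_ready(watch);
        fdb_future_destroy(watch);
        fdb_transaction_destroy(tr);
    }
}
```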

@ajbeamon (Contributor) commented Apr 3, 2019

My understanding of the current state of the art for how to manage this from the client side is to have separate p and s databases and have a wrapper database-like object that delegates create_transaction to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.

This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.

@atn34 force-pushed the change-connection-file branch 2 times, most recently from b4705cd to b3099d3 on April 5, 2019.
@atn34 commented Apr 5, 2019

My understanding of the current state of the art for how to manage this from the client side is to have separate p and s databases and have a wrapper database-like object that delegates create_transaction to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.

This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.

That makes sense. The draft implementation doesn't currently work, since as implemented a single transaction can span clusters.

What's the testing story for multi-version transactions? Can the simulator test that code?

@ajbeamon commented Apr 8, 2019

What's the testing story for multi-version transactions? Can the simulator test that code?

The simulator does use multi-version transactions in some tests, but I don't think it exercises the switching logic or the part where it calls out to external libraries. In your case, though, it may be a little bit easier to test switching if you have multiple clusters being simulated.

@atn34 commented Apr 8, 2019

I've been thinking about this some more, and there's a weird externally visible effect that I think can occur no matter how we do this (if we try doing it transparently in the C client). If a client sets the same read version on multiple transactions and reads a key in each transaction, the transactions won't necessarily return the same value during a switch.
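
For concreteness, the scenario is roughly the following (a sketch; it assumes the client transparently switches db from cluster p to cluster s between the two transactions):

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <stdint.h>

/* Sketch of the anomaly described above: two transactions pinned to the
 * same read version can observe different values for the same key if the
 * client switches clusters between them. */
void same_read_version_anomaly(FDBDatabase* db) {
    int64_t rv = 0;
    FDBTransaction *tr1 = NULL, *tr2 = NULL;

    fdb_database_create_transaction(db, &tr1);
    FDBFuture* rvf = fdb_transaction_get_read_version(tr1);
    fdb_future_block_until_ready(rvf);
    fdb_future_get_version(rvf, &rv); /* fdb_future_get_int64 in later API versions */
    fdb_future_destroy(rvf);
    /* ... read key k in tr1 at version rv: observes p's data ... */

    /* ... the client transparently switches from cluster p to cluster s ... */

    fdb_database_create_transaction(db, &tr2);
    fdb_transaction_set_read_version(tr2, rv);
    /* ... reading key k in tr2 at the same version rv may now return a
     * different value, because rv is interpreted against cluster s ... */

    fdb_transaction_destroy(tr1);
    fdb_transaction_destroy(tr2);
}
```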

One idea would be to allow clients to set the expected description:ID string (or something else that uniquely identifies the cluster) as part of setting the read version, and check that string at the storage servers.

Another idea would be to have the application manage hot standby logic, but this would be unfortunate since it seems to be tricky to get right.

@atn34 force-pushed the change-connection-file branch from b3099d3 to 88d9e75 on April 10, 2019.
@atn34 commented Apr 10, 2019

We have another idea for how to get around the problem of reading using a read version from one cluster on a different cluster: we can initialize a variable minAcceptableReadVersion with the lowest version ever received from a cluster, and enforce that we only attempt to read from that cluster with a read version of at least minAcceptableReadVersion. On switch we would invalidate minAcceptableReadVersion, effectively disallowing all reads until a new read version is received from the new cluster.

I updated this PR with an implementation of that idea.
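
Roughly, the invariant being enforced looks like this (illustrative only; the real logic lives inside the client, not in application code):

```c
#include <stdint.h>

/* Illustrative only: the shape of the minAcceptableReadVersion check. The
 * client remembers the smallest read version it has received from the
 * currently connected cluster; reads below it are rejected, and switching
 * clusters invalidates it until a fresh read version arrives. */
#define VERSION_INVALID (-1LL)

static int64_t minAcceptableReadVersion = VERSION_INVALID;

static void on_read_version_received(int64_t v) {
    if (minAcceptableReadVersion == VERSION_INVALID || v < minAcceptableReadVersion)
        minAcceptableReadVersion = v;
}

static void on_cluster_switch(void) {
    /* Disallow all reads until a new read version arrives from the new cluster. */
    minAcceptableReadVersion = VERSION_INVALID;
}

static int read_version_is_acceptable(int64_t read_version) {
    return minAcceptableReadVersion != VERSION_INVALID &&
           read_version >= minAcceptableReadVersion;
}
```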

@atn34 force-pushed the change-connection-file branch 2 times, most recently from da8ae02 to 8425a00 on April 20, 2019.
@atn34 commented Apr 20, 2019

Update: this PR now contains an implementation of the above minAcceptableReadVersion idea. The test I added fails (as expected) with the minAcceptableReadVersion logic missing, and passes (or times out) when the logic is present.

I'm also happy to revert the part where writing to \xff\xff/cluster_file_path changes the cluster_file_path so we don't have to commit to a particular client API in this PR.

TODO

  • Test recreating watches in the simulator.
  • Resolve documentation TODO
  • Figure out why the simulation test is occasionally timing out.

@atn34 force-pushed the change-connection-file branch 2 times, most recently from bc216b0 to ee9896d on April 22, 2019.
@atn34 commented Apr 22, 2019

I addressed all the above TODOs.

This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.

It seems that most of the complexity of this change is juggling read versions carefully, so I don't think the multi-version client actually buys us much here.

@ajbeamon commented:

It seems that most of the complexity of this change is juggling read versions carefully, so I don't think the multi-version client actually buys us much here.

Yeah, using it was based on the assumption that you'd be aborting in-flight transactions, but that doesn't seem to be the case here (or maybe at least not directly).

@atn34 commented Apr 23, 2019

Darn, it seems that the 'writing to \xff\xff/cluster_file_path' API won't quite work for our use case, since the multi-version client will not propagate the write to each client it's managing; it only goes to the currently active one. I think we need to add something like change_cluster_file to IDatabase and the C API.

@ajbeamon commented:

It's possible you could go a similar route to what we do with options, where you store any modifications to this key and apply them when you connect to a new cluster.

I guess you probably also need the ability to specify where an incompatible client is trying to test its connection, and I think this is actually a problem that needs to be solved more generally. See:

#498

It's possible a solution to that issue could be used more generally for your problem as well.

@atn34 force-pushed the change-connection-file branch 2 times, most recently from 79bc2c8 to 9adb27a on May 17, 2019.
@atn34 changed the title from "Change cluster file via \xff\xff/cluster_file_path" to "Change cluster file feature" on May 17, 2019.
@atn34 commented May 17, 2019

@ajbeamon Any idea what went wrong in the Linux build? The Details link isn't helpful.

@ajbeamon commented:

Hmm, not sure. Maybe we should try it again?

@ajbeamon commented:

@fdb-build test this please

@atn34 commented May 20, 2019

Looks like it passed! Btw, this is based on #1597, so if this gets merged before that we can just close #1597.

@ajbeamon left a comment:

Is it going to be required that the current cluster the clients are connected to be up in order for the switch to occur? Is there a mechanism to indicate or determine when there is no longer any dependence on the original cluster? I ask because if someone plans to switch clusters in order to bring down the original cluster, you'd want to make sure that no client had any outstanding operation that would fail to complete in its absence.

@atn34 changed the title from "Change cluster file feature" to "Switch cluster file feature" on May 24, 2019.
@atn34 force-pushed the change-connection-file branch from 9adb27a to 690a9cd on May 24, 2019.
@atn34 commented May 28, 2019

@ajbeamon PTAL

@atn34 commented May 28, 2019

Is it going to be required that the current cluster the clients are connected to be up in order for the switch to occur?

No, clients can switch away from a down cluster.

Is there a mechanism to indicate or determine when there is no longer any dependence on the original cluster? I ask because if someone plans to switch clusters in order to bring down the original cluster, you'd want to make sure that no client had any outstanding operation that would fail to complete in its absence.

My thinking is that any outstanding operation will get retried either because the db was locked or due to a timeout because the cluster was brought down. I'm not sure how this interacts with loadBalance if there's no timeout set as a network option.

I'm also imagining external orchestration of locking clusters, so that by the time you want to switch away from a cluster it doesn't matter whether any outstanding operation failed to complete.

@atn34 force-pushed the change-connection-file branch from e669a8a to f367fca on June 11, 2019.
@AlvinMooreSr (Contributor) commented:

@fdb-build test this please

@alexmiller-apple added this to the 6.2 milestone on Jun 11, 2019.

@tclinken (Contributor) commented:

Adding the cycle workload failed because every workload is run with a DatabaseContext object, and the data from the first cluster isn't all copied to the second cluster anyway (only one key is copied by the DifferentClustersSameRV workload). Thus, we can't currently combine the DifferentClustersSameRV workload with other workloads. I have begun working on a way to do this in a separate PR. In the meantime, 8af92a9 has addressed @ajbeamon's comment about aborting reads when the connection file changes. Are there any more changes necessary to merge this PR?

@tclinken commented:

Actually, it looks like the switch is not working as expected: it's currently hanging, even though the test is passing. We will need to fix this bug before this PR can be merged.

atn34 added 2 commits on June 22, 2019:
Update the implementation to interact with the new "don't maintain a connection to the cluster controller unless necessary" change, and unlock the originalDB at the end of the workload.

@atn34 commented Jun 24, 2019

The issues @tclinken mentioned above are resolved now. One remaining issue is that the watch future here sometimes fails with future_version, but as far as I can tell this happens regardless of the switch feature. See https://forums.foundationdb.org/t/watch-semantics/1480

atn34 and others added 3 commits on June 24, 2019:
Throw transaction_too_old instead of all_alternatives_failed when the cluster file changes while a read request is outstanding.

@ajbeamon left a comment:

This looks good to me. @etschannen said he also wanted to take a quick look at it, so I'll let him do so before merging.

@AlvinMooreSr commented:

@fdb-build test this please

@etschannen merged commit 8149b5b into apple:master on Jul 26, 2019.