Conversation

@atn34 (Collaborator) commented Apr 3, 2019

This is a draft implementation of feature one described below, although in this implementation one writes to \xff\xff/cluster_file_path instead of using an additional symbol in the API.

Does this make sense to you guys? @alexmiller-apple @mpilman @dyoungworth

I plan to add a workload to test this in the simulator once consensus about the approach and desired interface is reached. It seems to work for the manual testing I've done.

Overall summary of what this is trying to accomplish

I'm interested in two features relevant to hot standbys and cluster connection files.

My understanding is that the following sequence of steps switches from a primary cluster p to a hot standby s without compromising any of the guarantees made by FoundationDB. Let s begin as a locked cluster.

  1. Lock p so that clients can't write to p or get read versions from p unless they are "lock aware". Clients not writing to p makes step 2 possible by fixing the set of key-value pairs in p. Clients not reading from p (or at least not getting an 'up to date' read version from p) means that clients who think p is still the active cluster won't read something that has since been updated on s. Disallowing clients from getting read versions larger than some version v obtained after locking p makes it possible in step 3 to guarantee clients see monotonically increasing commit versions.
  2. Ensure that p and s contain the same set of key-value pairs (let's ignore how to do this for now)
  3. Ensure the commit version of s is at least the commit version of p + 1e8, so that clients observe monotonically increasing commit versions. (This can be done with \xff/minRequiredCommitVersion; a sketch follows this list.)
  4. Unlock s and inform all clients that s is now the active cluster.
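
To make step 3 concrete, here is a rough sketch against the C API. The value encoding of \xff/minRequiredCommitVersion and the option usage are assumptions on my part, and the usual on_error retry loop is omitted, so treat this as illustrative only:

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: raise the minimum commit version of the standby
 * cluster s so that its versions run ahead of the primary's. Assumes the
 * value of \xff/minRequiredCommitVersion is an 8-byte little-endian
 * version. */
fdb_error_t bump_min_commit_version(FDBDatabase* s, int64_t primary_version) {
    FDBTransaction* tr = NULL;
    fdb_error_t err = fdb_database_create_transaction(s, &tr);
    if (err) return err;

    /* Writing a system key on a locked cluster needs these options. */
    fdb_transaction_set_option(tr, FDB_TR_OPTION_ACCESS_SYSTEM_KEYS, NULL, 0);
    fdb_transaction_set_option(tr, FDB_TR_OPTION_LOCK_AWARE, NULL, 0);

    int64_t target = primary_version + 100000000; /* commit version of p + 1e8 */
    uint8_t value[8];
    memcpy(value, &target, sizeof(target)); /* assumed little-endian host/encoding */

    const char* key = "\xff/minRequiredCommitVersion";
    fdb_transaction_set(tr, (const uint8_t*)key, (int)strlen(key), value, sizeof(value));

    FDBFuture* f = fdb_transaction_commit(tr);
    fdb_future_block_until_ready(f);
    err = fdb_future_get_error(f);
    fdb_future_destroy(f);
    fdb_transaction_destroy(tr);
    return err;
}
```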

My understanding of the current state of the art for how to manage this from the client side is to have separate p and s databases and have a wrapper database-like object that delegates create_transaction to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.
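
For illustration, that wrapper might look roughly like this (hypothetical application-side helper, not an existing API; how the application learns about a switch and recreates its watches is left out):

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

/* Hypothetical application-side wrapper: delegate transaction creation to
 * whichever cluster the application currently believes is active. Pending
 * watches would have to be recreated by the application when the flag flips. */
typedef struct {
    FDBDatabase* primary;           /* p */
    FDBDatabase* standby;           /* s */
    volatile int standby_is_active; /* set when the application learns of a switch */
} SwitchableDatabase;

fdb_error_t switchable_create_transaction(SwitchableDatabase* db, FDBTransaction** out) {
    FDBDatabase* active = db->standby_is_active ? db->standby : db->primary;
    return fdb_database_create_transaction(active, out);
}
```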

Feature one would be to move the above "retry transactions on the new database and recreate pending watches" logic into the C client itself, and try to make this as transparent as possible. I propose adding something like
fdb_error_t fdb_change_cluster_file_path(FDBDatabase* database, const char* cluster_file_path); to the C API.
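
With that symbol, the application side of a switch could collapse to something like the following. This is only a sketch: the function is the proposal above (it does not exist yet) and the cluster file path is made up.

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <stdio.h>

/* Sketch of the proposed API: when the application learns that the standby
 * is now active, re-point the existing FDBDatabase at the standby's cluster
 * file. The client library would then retry in-flight transactions and
 * recreate pending watches against the new cluster. */
void switch_to_standby(FDBDatabase* db) {
    /* Path is made up for the example. */
    fdb_error_t err = fdb_change_cluster_file_path(db, "/etc/foundationdb/standby.cluster");
    if (err != 0)
        fprintf(stderr, "cluster switch failed: %s\n", fdb_get_error(err));
}
```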

Feature two is somewhat orthogonal, and aims to simplify the process of adding a new standby cluster and distributing a cluster connection file to all clients.

The idea would be to introduce the concept of a cluster-file-like interface to the C API, and have multiple implementations. One would be the current local-disk implementation. Another could be a key in a remote FDB database.

This could be done by adding a FDBClusterConnectionInterface (or a better name) type to the C API, and then adding symbols like

  • FDBClusterConnectionInterface* fdb_create_cluster_connection_interface_from_file(const char* path)
  • FDBClusterConnectionInterface* fdb_create_cluster_connection_interface_from_fdb_key(FDBDatabase* database, const char* key)
  • fdb_error_t fdb_change_cluster_connection_interface(FDBDatabase* database, FDBClusterConnectionInterface* interface)

Then clients could watch some key (say, current_cluster_key) in a service discovery database, and change their cluster connection interface to whatever key that key references. Adding a standby cluster and switching all clients to it would then look something like this:

  1. Assume clients are watching the key current_cluster_key in a service discovery database for the name of the current cluster
  2. Provision a new cluster
  3. Set a key name_of_new_cluster in the service discovery FDB database with the current connection string for the new cluster
  4. When you get to the 'inform all clients of the active cluster' step in your switch, you set current_cluster_key to name_of_new_cluster. Clients change their cluster connection interface to the key name_of_new_cluster in the service discovery database, and then proceed to keep the connection string for name_of_new_cluster up to date. The client side of this flow is sketched below.
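
Sketched out, the client side might look like this. Every fdb_create_cluster_connection_interface_* and fdb_change_cluster_connection_interface call is the proposal above (nothing here exists yet); for brevity the value of current_cluster_key is assumed to be NUL-terminated and error handling is skipped.

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

/* Illustrative only: follow current_cluster_key in the service discovery
 * database and re-point the data database at whichever cluster it names. */
void follow_current_cluster(FDBDatabase* discovery, FDBDatabase* data) {
    for (;;) {
        FDBTransaction* tr = NULL;
        if (fdb_database_create_transaction(discovery, &tr) != 0)
            return;

        /* Read the name of the currently active cluster. */
        FDBFuture* get = fdb_transaction_get(
            tr, (const uint8_t*)"current_cluster_key", 19, 0 /* not snapshot */);
        fdb_future_block_until_ready(get);

        fdb_bool_t present = 0;
        const uint8_t* name = NULL;
        int name_len = 0;
        fdb_future_get_value(get, &present, &name, &name_len);

        if (present) {
            /* Proposed API: treat the key <name> in the discovery database
             * as the source of the connection string. */
            FDBClusterConnectionInterface* iface =
                fdb_create_cluster_connection_interface_from_fdb_key(discovery, (const char*)name);
            fdb_change_cluster_connection_interface(data, iface);
        }

        /* Set a watch and commit so it takes effect, then block until
         * current_cluster_key changes and loop around. */
        FDBFuture* watch = fdb_transaction_watch(tr, (const uint8_t*)"current_cluster_key", 19);
        FDBFuture* commit = fdb_transaction_commit(tr);
        fdb_future_block_until_ready(commit);
        fdb_future_destroy(commit);
        fdb_future_destroy(get);

        fdb_future_block_until_ready(watch);
        fdb_future_destroy(watch);
        fdb_transaction_destroy(tr);
    }
}
```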

@ajbeamon (Contributor) commented Apr 3, 2019

My understanding of the current state of the art for how to manage this from the client side is to have separate p and s databases and have a wrapper database-like object that delegates create_transaction to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.

This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.

@atn34 force-pushed the change-connection-file branch 2 times, most recently from b4705cd to b3099d3 on April 5, 2019.
@atn34 commented Apr 5, 2019

My understanding of the current state of the art for how to manage this from the client side is to have separate p and s databases and have a wrapper database-like object that delegates create_transaction to the client's current understanding of which cluster is active. If one is using watches, one would want to recreate all pending watches each time the active cluster changes.

This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.

That makes sense. The draft implementation doesn't currently work, since as implemented a single transaction can span clusters.

What's the testing story for multi-version transactions? Can the simulator test that code?

@ajbeamon commented Apr 8, 2019

What's the testing story for multi-version transactions? Can the simulator test that code?

The simulator does use multi-version transactions in some tests, but I don't think it exercises the switching logic or the part where it calls out to external libraries. In your case, though, it may be a little bit easier to test switching if you have multiple clusters being simulated.

@atn34 commented Apr 8, 2019

I've been thinking about this some more, and there's a weird externally visible effect that I think can occur no matter how we do this (if we try doing it transparently in the C client). If a client sets the same read version on multiple transactions and reads a key in each transaction, the transactions won't necessarily return the same value during a switch.
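
For concreteness, the scenario is roughly the following (a sketch; it assumes the client transparently switches db from cluster p to cluster s between the two transactions):

```c
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>
#include <stdint.h>

/* Sketch of the anomaly described above: two transactions pinned to the
 * same read version can observe different values for the same key if the
 * client switches clusters between them. */
void same_read_version_anomaly(FDBDatabase* db) {
    int64_t rv = 0;
    FDBTransaction *tr1 = NULL, *tr2 = NULL;

    fdb_database_create_transaction(db, &tr1);
    FDBFuture* rvf = fdb_transaction_get_read_version(tr1);
    fdb_future_block_until_ready(rvf);
    fdb_future_get_version(rvf, &rv); /* fdb_future_get_int64 in later API versions */
    fdb_future_destroy(rvf);
    /* ... read key k in tr1 at version rv: observes p's data ... */

    /* ... the client transparently switches from cluster p to cluster s ... */

    fdb_database_create_transaction(db, &tr2);
    fdb_transaction_set_read_version(tr2, rv);
    /* ... reading key k in tr2 at the same version rv may now return a
     * different value, because rv is interpreted against cluster s ... */

    fdb_transaction_destroy(tr1);
    fdb_transaction_destroy(tr2);
}
```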

One idea would be to allow clients to set the expected description:ID string (or something else that uniquely identifies the cluster) as part of setting the read version, and check that string at the storage servers.

Another idea would be to have the application manage hot standby logic, but this would be unfortunate since it seems to be tricky to get right.

@atn34 force-pushed the change-connection-file branch from b3099d3 to 88d9e75 on April 10, 2019.
@atn34 commented Apr 10, 2019

We have another idea for how to get around the problem of reading using a read version from one cluster on a different cluster: we can initialize a variable minAcceptableReadVersion with the lowest version ever received from a cluster, and enforce that we only attempt to read from that cluster with a read version of at least minAcceptableReadVersion. On switch we would invalidate minAcceptableReadVersion, effectively disallowing all reads until a new read version is received from the new cluster.

I updated this PR with an implementation of that idea.
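
Roughly, the invariant being enforced looks like this (illustrative only; the real logic lives inside the client, not in application code):

```c
#include <stdint.h>

/* Illustrative only: the shape of the minAcceptableReadVersion check. The
 * client remembers the smallest read version it has received from the
 * currently connected cluster; reads below it are rejected, and switching
 * clusters invalidates it until a fresh read version arrives. */
#define VERSION_INVALID (-1LL)

static int64_t minAcceptableReadVersion = VERSION_INVALID;

static void on_read_version_received(int64_t v) {
    if (minAcceptableReadVersion == VERSION_INVALID || v < minAcceptableReadVersion)
        minAcceptableReadVersion = v;
}

static void on_cluster_switch(void) {
    /* Disallow all reads until a new read version arrives from the new cluster. */
    minAcceptableReadVersion = VERSION_INVALID;
}

static int read_version_is_acceptable(int64_t read_version) {
    return minAcceptableReadVersion != VERSION_INVALID &&
           read_version >= minAcceptableReadVersion;
}
```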

@atn34 force-pushed the change-connection-file branch 2 times, most recently from da8ae02 to 8425a00 on April 20, 2019.
@atn34 commented Apr 20, 2019

Update: this PR now contains an implementation of the above minAcceptableReadVersion idea. The test I added fails (as expected) with the minAcceptableReadVersion logic missing, and passes (or times out) when the logic is present.

I'm also happy to revert the part where writing to \xff\xff/cluster_file_path changes the cluster_file_path so we don't have to commit to a particular client API in this PR.

TODO

  • Test recreating watches in the simulator.
  • Resolve documentation TODO
  • Figure out why the simulation test is occasionally timing out.

@atn34 force-pushed the change-connection-file branch 2 times, most recently from bc216b0 to ee9896d on April 22, 2019.
@atn34 commented Apr 22, 2019

I addressed all the above TODOs.

This sounds similar to what the multi-version client does. In particular, it provides a database wrapper that manages multiple connections (in this case, to the same cluster) and switches between them depending on the version of the cluster. I wonder if we can leverage some of that logic to help with this feature as well.

It seems that most of the complexity of this change is juggling read versions carefully, so I don't think the multi-version client actually buys us much here.

@ajbeamon commented:

It seems that most of the complexity of this change is juggling read versions carefully, so I don't think the multi-version client actually buys us much here.

Yeah, using it was based on the assumption that you'd be aborting in-flight transactions, but that doesn't seem to be the case here (or maybe at least not directly).

@atn34 commented Apr 23, 2019

Darn, it seems that the 'writing to \xff\xff/cluster_file_path' API won't quite work for our use case, since the multi-version client will not propagate the write to each client it's managing; it only goes to the currently active one. I think we need to add something like change_cluster_file to IDatabase and the C API.

@ajbeamon commented:

It's possible you could go a similar route to what we do with options, where you store any modifications to this key and apply them when you connect to a new cluster.

I guess you probably also need the ability to specify where an incompatible client is trying to test its connection, and I think this is actually a problem that needs to be solved more generally. See:

#498

It's possible a solution to that issue could be used more generally for your problem as well.

@atn34 force-pushed the change-connection-file branch 2 times, most recently from 79bc2c8 to 9adb27a on May 17, 2019.
@atn34 changed the title from "Change cluster file via \xff\xff/cluster_file_path" to "Change cluster file feature" on May 17, 2019.
@atn34 commented May 17, 2019

@ajbeamon Any idea what went wrong in the Linux build? The Details link isn't helpful.

@ajbeamon commented:

Hmm, not sure. Maybe we should try it again?

@ajbeamon commented:

@fdb-build test this please

@atn34 commented May 20, 2019

Looks like it passed! Btw, this is based on #1597, so if this gets merged before that we can just close #1597.

@ajbeamon left a comment:

Is it going to be required that the current cluster the clients are connected to be up in order for the switch to occur? Is there a mechanism to indicate or determine when there is no longer any dependence on the original cluster? I ask because if someone plans to switch clusters in order to bring down the original cluster, you'd want to make sure that no client had any outstanding operation that would fail to complete in its absence.

@atn34 changed the title from "Change cluster file feature" to "Switch cluster file feature" on May 24, 2019.
@atn34 force-pushed the change-connection-file branch from 9adb27a to 690a9cd on May 24, 2019.
@atn34 commented May 28, 2019

@ajbeamon PTAL

@atn34 commented May 28, 2019

Is it going to be required that the current cluster the clients are connected to be up in order for the switch to occur?

No, clients can switch away from a down cluster.

Is there a mechanism to indicate or determine when there is no longer any dependence on the original cluster? I ask because if someone plans to switch clusters in order to bring down the original cluster, you'd want to make sure that no client had any outstanding operation that would fail to complete in its absence.

My thinking is that any outstanding operation will get retried either because the db was locked or due to a timeout because the cluster was brought down. I'm not sure how this interacts with loadBalance if there's no timeout set as a network option.

I'm also imagining external orchestration of locking clusters, so that by the time you want to switch away from a cluster it doesn't matter whether any outstanding operation failed to complete.

@atn34 force-pushed the change-connection-file branch from e669a8a to f367fca on June 11, 2019.
@AlvinMooreSr (Contributor) commented:

@fdb-build test this please

@alexmiller-apple added this to the 6.2 milestone on Jun 11, 2019.

@tclinken (Contributor) commented:

Adding the cycle workload failed because every workload is run with a DatabaseContext object, and the data from the first cluster isn't all copied to the second cluster anyway (only one key is copied by the DifferentClustersSameRV workload). Thus, we can't currently combine the DifferentClustersSameRV workload with other workloads. I have begun working on a way to do this in a separate PR. In the meantime, 8af92a9 has addressed @ajbeamon's comment about aborting reads when the connection file changes. Are there any more changes necessary to merge this PR?

@tclinken commented:

Actually, it looks like the switch is not working as expected: it's currently hanging, even though the test is passing. We will need to fix this bug before this PR can be merged.

atn34 added 2 commits on June 22, 2019:
Update the implementation to interact with the new "don't maintain a connection to the cluster controller unless necessary" change, and unlock the originalDB at the end of the workload.

@atn34 commented Jun 24, 2019

The issues @tclinken mentioned above are resolved now. One remaining issue is that the watch future here sometimes fails with future_version, but as far as I can tell this happens regardless of the switch feature. See https://forums.foundationdb.org/t/watch-semantics/1480

atn34 and others added 3 commits on June 24, 2019:
Throw transaction_too_old instead of all_alternatives_failed when the cluster file changes while a read request is outstanding.

@ajbeamon left a comment:

This looks good to me. @etschannen said he also wanted to take a quick look at it, so I'll let him do so before merging.

@AlvinMooreSr commented:

@fdb-build test this please

@etschannen merged commit 8149b5b into apple:master on Jul 26, 2019.