Skip to content

Conversation

@senthil-ram
Copy link
Contributor

Currently, we have a disk based snapshot and restore mechanism implemented in FDB6, but, there are few restrictions with the current implementation, namely:

  • Works only in environments where the disk snapshot request completes in < 5 seconds.
  • Status of snapshot request needs to be verified externally by parsing through the trace events.

Requirements:

  • Ability for FDB snapshot command to work on systems where the disk snapshot command may take few 10s of seconds or up to a minute.
  • Ability to determine the success or failure of the snapshot request and communicate it to the user.

Non Requirements:

  • Ability to support multi-region snapshot and restore

more details here:
https://docs.google.com/document/d/1Udgrpqu2D4DNEv1_2id9E7CQVDkisaTtK31FybVgNYQ/edit?usp=sharing

@senthil-ram senthil-ram force-pushed the snapv2 branch 2 times, most recently from 918aac0 to 00e2a30 Compare June 20, 2019 20:53
@alexmiller-apple alexmiller-apple self-assigned this Jun 21, 2019
Copy link
Contributor

@mpilman mpilman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite through with my review yet. As a general comment: please check your excessive use of state variables. I didn't check all of them and I think a lot of them are unnecessary in the old code. This is bad style, makes your code slower, uses more memory, and makes understanding the code much harder (as state variables are not scoped).

I also added a few other comments

Copy link
Contributor

@alexmiller-apple alexmiller-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run out of time about half way through.

@mpilman, could you please do a closer read through and review of this? I'm still going to be deeply lacking time while my own 6.2 work is unfinished.

// execute payload in 'snapCmd' on all the coordinators, TLogs and
// storage nodes
ACTOR Future<Void> mgmtSnapCreate(Database cx, StringRef snapCmd);
ACTOR Future<Void> mgmtSnapCreate(Database cx, StringRef snapCmd, int version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, as this is about to be removed anyway, int version is a great opportunity to use an enum class instead of an int.

@senthil-ram
Copy link
Contributor Author

Following are the items that we know of that will be addressed in 6.2 or later:

  • disable DD now uses a persistent state and an operator has to unset it if the CLI fails in-between the snap request. This will be addressed in a subsequent PR in 6.2 time-frame to keep the disable DD as a in-memory state.
  • To detect if there was a recovery - currently code checks the client DB id and monitor the server DB info on the DataDistributor. But, Markus and I were thinking that it is possible that there is a possibility of a recovery that goes un-noticed and ending up with a bad coordinated state. One improvement there is to comparing the read generation id of the coordinated state before and after snapshots.
  • Markus pointed out that coordinators need not be workers and for that case we need to find a different mechanism to snapshot the coordinators (for eg: write the coordinated state into tlog/storage snapshots and re-populated it after restore). This will be mostly addressed beyond 6.2.

@senthil-ram
Copy link
Contributor Author

This PR is dependent on a bug fix in setDDMode which is in this PR: #1866

@alexmiller-apple
Copy link
Contributor

I'm not sure why your CI got stuck. Let's see if this works in one line...

@fdb-build, run bindings please, run cmake linux please, run documentation please, run linux please.

Copy link
Contributor

@alexmiller-apple alexmiller-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick skim while building, and this seems fine to merge.

I'm kicking off some extra correctness:
all tests = 20190723-004450-alexmiller-4034900cf15fc655
snap only = 20190723-004752-alexmiller-4d476ce0d0cc6fc7

@senthil-ram
Copy link
Contributor Author

I'm not sure why your CI got stuck. Let's see if this works in one line...

@fdb-build, run bindings please, run cmake linux please, run documentation please, run linux please.

It compiled and tested just fine in our environments. With the commit that removed the commented line of code - it cleared CI. Sounds like CI getting struck for unknown issue.

@senthil-ram
Copy link
Contributor Author

I did a quick skim while building, and this seems fine to merge.

I'm kicking off some extra correctness:
all tests = 20190723-004450-alexmiller-4034900cf15fc655
snap only = 20190723-004752-alexmiller-4d476ce0d0cc6fc7

Few snap tests will fail, because the dependency fix #1866 is not yet merged.

@senthil-ram senthil-ram mentioned this pull request Jul 23, 2019
@alexmiller-apple alexmiller-apple merged commit edeec8a into apple:master Jul 24, 2019
@senthil-ram
Copy link
Contributor Author

Fixed the issue #1675

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants