Skip to content

Conversation

@alexmiller-apple
Copy link
Contributor

@alexmiller-apple alexmiller-apple commented Mar 11, 2019

Part 5 of probably 5 for #1048

This PR introduces 3 important things:

  1. It checksums pages it reads from the disk queue, thereby closing the accidentally existing possibility of feeding bitrot to storage servers.
  2. It removes versions from versionLocation when they're spilled, instead of when they're popped.
    versionLocation is a map of version to location on disk. Rough back-of-the-envelope math suggested that a 1TB DiskQueue could take 10s of GB of memory for versionLocation, which means it has to be cleaned up.
  3. It restricts recovery to only read the portion of the disk queue that needs to be re-indexed in memory. Anything that has already been spilled won't be read. This means recovering a 1TB disk queue will only read ~2GB of it.

hash = UID( (int64_t(part[0])<<32)+part[1], 0xfdb );
}
void updateHash_crc32c() {
uint32_t checksum = crc32c_append( 0xfdbeefdb, (uint8_t*)&seq, sizeof(Page)-sizeof(hash) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great seed choice.

#include "fdbclient/FDBTypes.h"
#include "fdbserver/IKeyValueStore.h"

enum class CheckHashes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use a bool ?

Copy link
Contributor Author

@alexmiller-apple alexmiller-apple Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's personal preference. I really dislike functions that look like doFoo(true, true, false, false, true), because any bool argument always requires me going to look at the declaration to see what each of the trues and falses mean, rather than just giving them a name that's meaningful at the callsite.

@alexmiller-apple
Copy link
Contributor Author

I've left "potato" as "potato" because it's filled with trace events that will need to be rebased out anyway. Otherwise, this code is complete as far as I can tell. I'll find out from the correctness that I'll leave running overnight if that's actually accurate. The only other work I have planned for spill-by-reference is to add TraceEvents and knobs, which can be done on release-6.1, and I'm sure performance tests will result in something to fix.

@xumengpanda
Copy link
Contributor

I've left "potato" as "potato" because it's filled with trace events that will need to be rebased out anyway. Otherwise, this code is complete as far as I can tell. I'll find out from the correctness that I'll leave running overnight if that's actually accurate. The only other work I have planned for spill-by-reference is to add TraceEvents and knobs, which can be done on release-6.1, and I'm sure performance tests will result in something to fix.

Can I assume it is at least faster than non-spilling solution? :)
Maybe more performance tests aim to make it even faster?

@alexmiller-apple
Copy link
Contributor Author

I'm expecting performance tests will reveal things like slow tasks, large packets, OOMs, etc. The sorts of things that simulation is bad at finding, but saturating workloads on a real cluster are good at finding.

Though this format is being deprecated in favor of an eventual plumbing
through of TLogVersion, we should probably bump it anyway.

And also remove the fallback to OldTLogServer code.  It should never be
executed, as OldTLogServer_6_0 is entirely relied upon to execute
OldTLogServer_4_6.
There's various ASSERT()'s that assume firstPages is empty, and enforces
things about `seq`.  Some of these asserts have spuriously passed, since
uninitialized pages look like they have a `seq` of 0, which would be the
beginning of the disk queue.

Now they'll look like the end of the disk queue, which is far easier to
fail on.
We don't have a forward compatibility story for the memory storage
engine, so its DiskQueue will still be hashlittle2 until one exists.
This allows us to do easy upgrades of SpilledData in the future, if the
need arises, because we then have a protocol version to compare against.
If a server has its data spilled, then it's behind the 5s window.
Feeding it data is less important than committing, so we can hide the
extra CPU usage from checksumming the read amplified disk queue pages.
@alexmiller-apple alexmiller-apple changed the title Tstlog9 Spill-By-Reference TLog Part 5: TLogs That Can Spill Large Amounts Of Data Mar 16, 2019
@alexmiller-apple alexmiller-apple marked this pull request as ready for review March 16, 2019 04:44
IKeyValueStore* keyValueStoreMemory( std::string const& basename, UID logID, int64_t memoryLimit, std::string ext ) {
TraceEvent("KVSMemOpening", logID).detail("Basename", basename).detail("MemoryLimit", memoryLimit);
IDiskQueue *log = openDiskQueue( basename, ext, logID);
IDiskQueue *log = openDiskQueue( basename, ext, logID, DiskQueueVersion::V0 );
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've validated in correctness that we'd be fine flipping this to V1 passes restarting tests, so it can be set to V1 for 6.2.

Popping the disk queue now requires potentially recovering the location
to which we can pop from the spilled data itself, and for each tag we
must maintain the first location with relevant data.

The previous queue we had to represent the ordering, queueOrder, was
used by spilling, and popped when a TLog had been spilled.  This means
that as soon as a TLog has been fully spilled, we have no idea how it
relates in order to other fully spilled TLogs.

Instead, use queueOrder to keep track of all the TLog UIDs until they're
removed, and use spillOrder to keep track of the order only for
spilling.
This time, track what location in the DiskQueue has been spilled in
persistent state, and then feed it back into the disk queue before
recovery.

This also introduces an ASSERT that recovery only reads exactly the
bytes that it needs to have in memory.
@alexmiller-apple
Copy link
Contributor Author

Simulation seems to approve of the recent changes, so this should be good to merge.

@etschannen etschannen merged commit ddf8e86 into apple:master Mar 19, 2019
etschannen added a commit to etschannen/foundationdb that referenced this pull request Mar 26, 2019
Spill-By-Reference TLog Part 5: TLogs That Can Spill Large Amounts Of Data
etschannen added a commit to etschannen/foundationdb that referenced this pull request Mar 26, 2019
Spill-By-Reference TLog Part 5: TLogs That Can Spill Large Amounts Of Data
alexmiller-apple pushed a commit to etschannen/foundationdb that referenced this pull request Mar 26, 2019
Spill-By-Reference TLog Part 5: TLogs That Can Spill Large Amounts Of Data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants