Range deletion memory usage improvements #10048
Conversation
saintstack
left a comment
Looks great. Some comments below.
fdbclient/ServerKnobs.cpp (outdated):

init( ROCKSDB_WAL_RECOVERY_MODE, 2 ); // kPointInTimeRecovery, RocksDB default.
init( ROCKSDB_TARGET_FILE_SIZE_BASE, 16777216 ); // 16MB, RocksDB default.
init( ROCKSDB_MAX_OPEN_FILES, 50000 ); // Should be smaller than OS's fd limit.
init( ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS, false ); if (isSimulated) ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS = deterministicRandom()->coinflip();
Is this option meant to address the OOM we've been seeing? I don't think that was happening in system keys.
Where do these two new options get a bit of commentary? They are a little exotic in their operation; some notes on what they are about will help those who come after us, I think.
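(Aside: a sketch of the kind of commentary the two knobs could carry. The explanations below are a reading of this thread, not text from the PR.)

init( ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS, false );
// Assumed meaning: when true, a range clear over system keys is rewritten
// as individual point deletes (by scanning the range first), avoiding the
// accumulation of RocksDB range tombstones that this PR targets.
if (isSimulated) ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS = deterministicRandom()->coinflip();

init( ROCKSDB_CF_RANGE_DELETION_LIMIT, 1000 );
// Assumed meaning: once a column family has accumulated more than this
// many range tombstones, its memtable is flushed so the tombstones stop
// piling up in memory.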
wait(kvStore->commit(false));
}

// TODO: flush memtable. The process is expected to OOM.
Will this test always fail w/o the rocksdb fix, or with ROCKSDB_CF_RANGE_DELETION_LIMIT <= 0?
It will not. Single-shard RocksDB doesn't have the flush API.
// TODO: Disable this once RocksDB is upgraded to a version with range delete improvement.
if (SERVER_KNOBS->ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS && systemKeys.contains(range)) {
	rocksdb::ReadOptions options = getReadOptions();
The read options here won't have the issue Neethu just ran into around prefix for blooms? The auto_prefix_mode setting?
Yes. I updated the read options.
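(For context, a minimal sketch of the prefix-bloom concern, assuming it is the usual pitfall where an iterator on a CF with a prefix_extractor can skip keys outside the seek prefix. getReadOptions here stands in for the helper in the diff.)

#include <rocksdb/options.h>

rocksdb::ReadOptions getReadOptions() {
	rocksdb::ReadOptions options;
	// With a prefix_extractor set on the column family, a plain iterator
	// may consult prefix bloom filters and silently skip keys outside the
	// seek key's prefix. auto_prefix_mode keeps total-order iteration
	// semantics while still using the prefix filters when it is safe.
	options.auto_prefix_mode = true;
	return options;
}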
if (!cursor->status().ok()) {
	// if readrange iteration fails, then do a deleteRange.
	writeBatch->DeleteRange(physicalShard->cf, beginSlice, endSlice);
}
Is this code repeated elsewhere? E.g. The Neethu feature that changes range deletes to point deletes? Can we reuse her implementation?
The main idea is the same. Because we read multiple CFs here, the implementation is quite different.
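(A minimal single-CF sketch of the technique under discussion: scan the range and queue point deletes, falling back to a real DeleteRange if the scan fails. The PR's multi-CF version differs, as noted above; the function and parameter names here are assumptions.)

#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

void clearRangeAsPointDeletes(rocksdb::DB* db,
                              rocksdb::ColumnFamilyHandle* cf,
                              rocksdb::WriteBatch* writeBatch,
                              const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
	rocksdb::ReadOptions options;
	options.auto_prefix_mode = true;    // see the read-options fix above
	options.iterate_upper_bound = &end; // stop the scan at the range end
	std::unique_ptr<rocksdb::Iterator> cursor(db->NewIterator(options, cf));
	for (cursor->Seek(begin); cursor->Valid(); cursor->Next()) {
		// One point tombstone per key instead of a single range tombstone.
		writeBatch->Delete(cf, cursor->key());
	}
	if (!cursor->status().ok()) {
		// If the iteration fails partway, fall back to a range tombstone;
		// it supersedes any point deletes already queued in the batch.
		writeBatch->DeleteRange(cf, begin, end);
	}
}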
fOptions.wait = true;
fOptions.allow_write_stall = true;

for (auto shard : (*a.dirtyShards)) {
Will this be expensive? Will it raise commit latency, since it runs inline with the commit?
I would expect latency to go up. Since we commit every 5 seconds, it shouldn't affect other parts of the system. I added a knob to control blocking/non-blocking flush; we could do a non-blocking flush if latency becomes an issue.
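(Roughly, what the knob-controlled flush could look like; the knob wiring and the shard->cf accessor are assumptions based on the snippets in this thread.)

rocksdb::FlushOptions fOptions;
// Blocking flush: the commit path waits until the memtables are durable.
// With wait = false, RocksDB only schedules the flush and returns.
fOptions.wait = SERVER_KNOBS->ROCKSDB_WAIT_ON_CF_FLUSH;
fOptions.allow_write_stall = true; // flush even if it would stall writes

for (auto shard : (*a.dirtyShards)) {
	db->Flush(fOptions, shard->cf); // shard->cf is an assumption here
}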
Co-authored-by: Jingyu Zhou <[email protected]>
	}
}

a.done.send(Void());
What's the reason to move this line?
We're blocking future commits. We don't want to accept more commits before the flush finishes.
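(In other words, a sketch of the ordering only; flushDirtyShards is a hypothetical wrapper around the flush loop shown earlier in this thread.)

flushDirtyShards(*a.dirtyShards); // possibly blocking, per ROCKSDB_WAIT_ON_CF_FLUSH
a.done.send(Void());              // signalled after the flush, so no new commits mid-flush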
init( ROCKSDB_MAX_OPEN_FILES, 50000 ); // Should be smaller than OS's fd limit.
init( ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS, false ); if (isSimulated) ROCKSDB_USE_POINT_DELETE_FOR_SYSTEM_KEYS = deterministicRandom()->coinflip();
init( ROCKSDB_CF_RANGE_DELETION_LIMIT, 1000 );
init( ROCKSDB_WAIT_ON_CF_FLUSH, true ); if (isSimulated) ROCKSDB_WAIT_ON_CF_FLUSH = deterministicRandom()->coinflip();
Is this a change in behavior? We wait on flush now where before we didn't?
Previously, flushes were scheduled by RocksDB and ran in the background. In this case we're blocking future writes, so it makes sense to wait for the flushes. However, the storage server could observe higher commit latency because of this. If the higher latency causes too much noise, we can turn it off.