@oleg68 oleg68 commented Apr 3, 2023

No description provided.

jzhou77 and others added 30 commits November 5, 2022 11:36
When version vector (VV) is enabled, the comparison between the storage server
version and the read version should use the original read version; otherwise,
the client may get a spurious transaction_too_old error.
Fix transaction_too_old error when version vector is enabled
Cherry pick PR 8630 to release 7.1
This is a patch to release-7.1, made after resolving conflicts with the commit
in the main branch, in order to enable byteLimit in release-7.1.

A fraction of byteLimit will be used as the limit to fetch index.
For the indexes fetched, fetch records for them in batch.

byteLimit always counts the index size, and also counts the record if it
exists. At least one index-record entry is returned, and the last entry is
always included even if adding it exceeds the limit.

There is a knob, STRICTLY_ENFORCE_BYTE_LIMIT. When it is set, records
are discarded once the byteLimit is hit, even though they were fetched.
Otherwise, the whole batch is returned.
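
The limit rules above can be sketched as follows. This is a minimal illustration, not the storage server code: `apply_byte_limit`, the `Entry` tuple, and the `strict` flag (standing in for STRICTLY_ENFORCE_BYTE_LIMIT) are hypothetical names, and the sketch assumes the limit is checked after each entry is added.

```python
from typing import List, Optional, Tuple

# Hypothetical shape: (index_bytes, record_bytes or None if the record is missing).
Entry = Tuple[int, Optional[int]]


def apply_byte_limit(batch: List[Entry], byte_limit: int, strict: bool) -> List[Entry]:
    """Return the entries of an already-fetched batch that would be sent back."""
    if not strict:
        # Knob off: the whole fetched batch is returned.
        return list(batch)

    result: List[Entry] = []
    used = 0
    for index_bytes, record_bytes in batch:
        # The index size always counts; the record counts only if it exists.
        used += index_bytes + (record_bytes or 0)
        # The entry that reaches the limit is still included, so a
        # non-empty batch always yields at least one entry.
        result.append((index_bytes, record_bytes))
        if used >= byte_limit:
            break  # remaining entries are discarded although they were fetched
    return result
```

With a batch of three entries and a limit that is reached by the second one, the strict variant returns two entries and the non-strict variant returns all three.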
This reverts commit fadcb08.
* Add SS read range bytes metrics. (apple#8697)

* Fix build failure

* clang-fmt

* fmt
add bytelimit for prefetch (release-7.1)
The number of released bytes exceeds the number of acquired bytes in locks.
This is because the bytes counted towards release are calculated after a "wait",
when more bytes could have been allocated.
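
The accounting problem described above can be reproduced in a few lines. This is a toy sketch, not the backup worker code: `ByteLock`, `Buffer`, `flush_buggy`, and `flush_fixed` are hypothetical names, and an `await` in Python's asyncio stands in for the "wait". It shows one possible way to avoid the mismatch, which is to capture the size before waiting.

```python
import asyncio


class ByteLock:
    """Toy byte-counting lock that only tracks totals (hypothetical)."""

    def __init__(self) -> None:
        self.acquired = 0
        self.released = 0

    def take(self, n: int) -> None:
        self.acquired += n

    def release(self, n: int) -> None:
        self.released += n


class Buffer:
    def __init__(self) -> None:
        self.bytes = 0


async def writer(buf: Buffer, n: int) -> None:
    # Runs while the flusher is suspended and grows the buffer.
    buf.bytes += n


async def flush_buggy(lock: ByteLock, buf: Buffer) -> None:
    lock.take(buf.bytes)
    await asyncio.sleep(0)   # the "wait": other tasks may allocate here
    lock.release(buf.bytes)  # size read after the wait, may be too large


async def flush_fixed(lock: ByteLock, buf: Buffer) -> None:
    held = buf.bytes         # size captured before the wait
    lock.take(held)
    await asyncio.sleep(0)
    lock.release(held)       # release exactly what was taken


async def run(flush) -> ByteLock:
    lock, buf = ByteLock(), Buffer()
    buf.bytes = 100
    # The flusher starts first, suspends at its await, and the writer runs.
    await asyncio.gather(flush(lock, buf), writer(buf, 50))
    return lock
```

Running `asyncio.run(run(flush_buggy))` leaves more bytes released than acquired, while `asyncio.run(run(flush_fixed))` keeps the two totals equal.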
Fix backup worker assertion failure [release-7.1]
To fix simulation failures where the knob value is too small.
sfc-gh-anoyes and others added 26 commits February 27, 2023 14:40
PTree improvements [release-7.1]
…te-7.1

[Release 7.1] Enhance fdbbackup query command to estimate data processing from a specific snapshot to a target version
Set max length as well to avoid TraceEventOverflow.
Use KeyspaceSnapshotFile to filter range files
To reduce the number of network requests.
Refactor decoder to read file as a whole once
…elease-7.1] (apple#9640)

* Add DcLag tests and workload

* Add disableSimSpeedup to clog network longer

* Ignore the DcLag test

* Refactor LogRouter's pullAsyncData

* Switch DC if log router peek becomes stuck

Try a different DC if this happens.

* Enable DcLag test

* Require at least 2 regions and satellites

* Simplify DcLag code

* Limit connection failures to be within tests

In particular, disable connection failures when initializing the database
during the startup phase, i.e., before running with test specs.

* Revert disableSimSpeedup

* Fix conflicts after cherrypick

* More fixes after cherrypick

* Refactor to address comments

* Use a constant for connectionFailuresDisableDuration

* Fix ClogTlog workload valgrind error

* Address comments

* Reduce running time for DcLag

The switch can happen more quickly than the workload detection time, so the
detection time needs to be lower than LOG_ROUTER_PEEK_SWITCH_DC_TIME.
Seed storage servers are recruited as the initial set of storage servers
when a database is first created. They function a little bit differently
than normal, and do not set an initial version like storages normally do
when they get recruited (typically equal to the recovery version).

Version correction is a feature where versions advance in sync with the
clock, and are equal across FDB clusters. To allow different FDB
clusters to have matching versions, they must share the same base
version. This defaults to the Unix epoch, and clusters with the version
epoch enabled will have a current version equal to the number of
microseconds since the Unix epoch.
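
As a rough illustration of the paragraph above, the version a cluster would be expected to report at a given instant can be computed from the wall clock. This is a sketch only: `expected_version` and its `base` parameter are hypothetical names, not FDB APIs, and real clusters apply this through the version correction logic rather than a single function call.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expected_version(now: datetime, base: datetime = UNIX_EPOCH) -> int:
    """Whole microseconds elapsed between the base version's instant and `now`."""
    return (now - base) // timedelta(microseconds=1)


# Two clusters sharing the same base compute the same version for the same instant.
instant = datetime(2023, 4, 3, 8, 19, tzinfo=timezone.utc)
cluster_a = expected_version(instant)
cluster_b = expected_version(instant)
```

Because both calls use the same base, `cluster_a` and `cluster_b` are equal, which is the property that lets versions match across clusters.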

When the version epoch is enabled on a cluster, it causes a one-time
jump from the cluster's current version to the version based on the
epoch. After a recovery, the recovery version sent to storages should
have advanced by a significant amount.

The recovery path contained a `BUGGIFY` to randomly advance the recovery
version in simulation, testing the version epoch being enabled.
However, it was also advancing the version during an initial recovery,
when the seed storage servers are recruited. If a set of storage
servers were recruited as seed servers, but another recovery occurred
before the bootstrap process was complete, the randomly selected version
increase could be smaller during the second recovery than during the
first. This could cause the initial set of seed servers to think they
should be at a version larger than what the cluster was actually at.

The fix contained in this commit is to only cause a random version jump
when the recovery is occurring on an existing database, and not when it
is recruiting seed storages.
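
The shape of the fix can be sketched as a guard on the random jump. This is a simplified model, not the recovery code: `recovery_version`, `is_new_database`, `BASE_ADVANCE`, and `MAX_RANDOM_JUMP` are hypothetical names and values chosen for illustration, and the `buggify` argument stands in for the `BUGGIFY` condition.

```python
import random

# Hypothetical constants for illustration only; not FDB knob values.
BASE_ADVANCE = 1_000
MAX_RANDOM_JUMP = 1_000_000


def recovery_version(prev_version: int, is_new_database: bool,
                     buggify: bool, rng: random.Random) -> int:
    """Pick the recovery version, jumping randomly only on an existing database."""
    version = prev_version + BASE_ADVANCE
    # The guard: no random jump while seed storage servers are being recruited.
    if buggify and not is_new_database:
        version += rng.randint(1, MAX_RANDOM_JUMP)
    return version


# Two back-to-back initial recoveries with different random streams.
first = recovery_version(0, is_new_database=True, buggify=True, rng=random.Random(1))
second = recovery_version(0, is_new_database=True, buggify=True, rng=random.Random(2))
```

In this model `first` and `second` are equal, so a second recovery during bootstrap cannot hand the seed servers a smaller version than the first one did.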

This commit fixes an issue found in simulation, reproducible with:

Commit: 93dc4bf
Test: fast/DataLossRecovery.toml
Seed: 3101495991
Buggify: on
Compiler: clang
When the ClogTlog workload is running, we may already have passed 450s, i.e., SIM_SPEEDUP_AFTER_SECONDS,
at which point clogging is no longer effective. If that's the case, we want to finish the test quickly.
Fix issue where the versions on seed storage servers decreased [release-7.1]
@oleg68 oleg68 requested a review from foxyholic April 3, 2023 08:19
@foxyholic foxyholic merged commit 9779c5f into owtech:ow-fork-7.1 Apr 4, 2023
@oleg68 oleg68 deleted the ow-fork-7.1-29 branch April 5, 2023 11:02