Skip to content

Conversation

@sfc-gh-yajin
Copy link
Collaborator

@sfc-gh-yajin sfc-gh-yajin commented Mar 8, 2023

In FastTriggeredWatchesWorkload, if the randomized new value for the given setKey happens to be the same as the current value, the following will hold

  • first is true, and
  • getDuration is 0, and
  • assertion lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration) will fail

To fix this, change the assertion to

assert(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
	lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration));

Test plan:
On c27cdcd:

  • Apply to the fix on top of the commit reported by Joshua test. Rerun the command to make sure it passes.
  • devRunCorrectnessFiltered +FastTriggeredWatches* --max-runs 10000

Replace this text with your description here...

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

In FastTriggeredWatchesWorkload, if the randomized new value for the given
`setKey` happens to be the same as the current value, the following will hold

- `first` is true, and
- `getDuration` is 0, and
- assertion `lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)` will fail

To fix this, change the assertion to
```
assert(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
				       lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration));

```

Test plan:
Apply to the fix on top of the commit reported by Joshua test. Rerun the command to make sure it passes.
@sfc-gh-yajin
Copy link
Collaborator Author

Hi @sfc-gh-nwijetunga and @sfc-gh-ljoswiak , given that you have some context in #4539, would you be able to review this one? Thanks

// Assert that the time from setting the key to triggering the watch is no greater than 25s
// TODO: This assertion can cause flaky behaviour since sometimes a watch can take longer to fire
ASSERT(lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
ASSERT(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

Copy link
Collaborator

@sfc-gh-abeamon sfc-gh-abeamon Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it that makes this assert fail in this case? It looks like lastReadVersion could become a value less than ver, so you might have lastReadVersion - ver become a negative number. getDuration will also be 0.

That means that the comparison lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration) would become something like:

lastReadVersion - ver < 25000000, which should still be true even if lastReadVersion < ver. If ver is smaller than lastReadVersion, it doesn't seem like this would extend the gap between them any. Do you know what the actual measured gap was?

Copy link
Collaborator Author

@sfc-gh-yajin sfc-gh-yajin Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a test run with additional printing, I got

first=true lastReadVersion=441719868 ver=409158709 max_in_flight=100000000 versions_per_s=1000000 duration=0.00

Copy link
Collaborator

@sfc-gh-nwijetunga sfc-gh-nwijetunga Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible for the lastReadVersion to be greater if the setter commits its transaction before we start the transaction for the watch

Copy link
Collaborator

@sfc-gh-nwijetunga sfc-gh-nwijetunga Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this fix would not fix all occurrences of this test being flaky if for example we hit this case where first is false but due to failure injection the lastReadVersion >> ver.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exactly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happens at Y < X (i.e. we get a read version and then have a very long delay for reading the key, which I think is possible but should fail to commit

Oh, the transaction trying to set a watch will fail to commit because it's not read-only? If so, it will actually retry with first remaining true which falls back to the case of Y >= X. Correct?

Copy link
Collaborator

@sfc-gh-abeamon sfc-gh-abeamon Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, maybe it won't fail on commit because it is read-only. And if that happens, then we are susceptible to this even on the second pass (where first == false) because our timing of the watch triggering includes all of the time that we spent getting the read version and reading the key initially.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sequence would look like:

  1. Get a read version X
  2. Set the key at version Y = X+n
  3. Long delay (say 30s)
  4. Read the key from a SS that is lagging and can still serve old reads
  5. Set the watch and commit it
  6. Watch fires
  7. Re-read the key at version Z>Y+30s
  8. Fail our assert because first == false

Copy link
Collaborator

@sfc-gh-abeamon sfc-gh-abeamon Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a simple fix for this would be to force the transaction to commit (e.g. by setting a write conflict range). We could even then incorporate the commit version as a way to eliminate the time spent in the first transaction from our timing:

int64_t versionDelta = lastReadVersion - std::max(setVersion, watchCommitVersion)

@sfc-gh-nwijetunga
Copy link
Collaborator

I would also just run FastTriggeredWaetches workload (maybe 10k) to see if there are any other failures

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: c2939ba
  • Duration 0:13:58
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: c2939ba
  • Duration 0:30:35
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@sfc-gh-yajin
Copy link
Collaborator Author

thanks @sfc-gh-nwijetunga for the review!

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: c2939ba
  • Duration 0:50:11
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: c2939ba
  • Duration 1:20:12
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@sfc-gh-ljoswiak
Copy link
Collaborator

This is a good fix, and this is a small nit, but in the case where the values are equal on the first pass, the test isn't testing the watch functionality and so isn't really functioning as intended. What about ensuring setValue never overlaps with defaultValue with something like this

diff --git a/fdbserver/workloads/FastTriggeredWatches.actor.cpp b/fdbserver/workloads/FastTriggeredWatches.actor.cpp
index 0822e0c36..adf4cf965 100644
--- a/fdbserver/workloads/FastTriggeredWatches.actor.cpp
+++ b/fdbserver/workloads/FastTriggeredWatches.actor.cpp
@@ -110,7 +110,7 @@ struct FastTriggeredWatchesWorkload : TestWorkload {
                                state Key setKey = self->keyForIndex(deterministicRandom()->randomInt(0, self->nodes));
                                state Optional<Value> setValue;
                                if (deterministicRandom()->random01() > 0.5)
-                                       setValue = StringRef(format("%010d", deterministicRandom()->randomInt(0, 1000)));
+                                       setValue = StringRef(format("%010d", deterministicRandom()->randomInt(1000, 2000)));
                                // Set the value at setKey to something random
                                state Future<Version> setFuture = self->setter(cx, setKey, setValue);
                                wait(delay(deterministicRandom()->random01()));

@sfc-gh-yajin
Copy link
Collaborator Author

Thanks @sfc-gh-ljoswiak for the suggestion. I may be misreading the code, but choosing a random number from a different range still seems susceptible to the same issue, unless we alternate between two disjoint ranges.

@sfc-gh-nwijetunga
Copy link
Collaborator

sfc-gh-nwijetunga commented Mar 9, 2023

but choosing a random number from a different range still seems susceptible to the same issue, unless we alternate between two disjoint ranges.

Yeah you would have to do something like setValue = StringRef(format("%010d", deterministicRandom()->randomInt(self->nodeCount + 1, self->nodeCount + 1000)));

@sfc-gh-abeamon
Copy link
Collaborator

We could also remember the last value stored and do our generation in a loop until we get a value different than before

Version ver = wait(setFuture);
//TraceEvent("FTWWatchDone").detail("Key", printable(setKey));
// Assert that the time from setting the key to triggering the watch is no greater than 25s
// TODO: This assertion can cause flaky behaviour since sometimes a watch can take longer to fire
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this TODO is actually accurate. It's quite possible all of the cases we saw of the watch firing slowly were actually not timing the watch at all. Perhaps we could delete it and revisit if it recurs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@sfc-gh-yajin
Copy link
Collaborator Author

We could also remember the last value stored and do our generation in a loop until we get a value different than before

We need to do this for the entire key space, right? I think the probability of generating the same value as current should be low, not sure if is worth tracking this information.

@sfc-gh-abeamon
Copy link
Collaborator

sfc-gh-abeamon commented Mar 9, 2023

Yeah you're right, I didn't notice that it was generating random keys. And as pointed out above, even fixing this doesn't eliminate the need for your assert because the same problem occurs even if the value is different. My suggestion would be to just leave it as is, since it doesn't really change much about the test to ensure a change in value.

ASSERT(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration));
ASSERT(!watchCommitted ||
lastReadVersion - std::max(ver, watchCommitVersion) >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we rename ver for clarity, since we now have three versions being evaluated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 8e478b7
  • Duration 0:18:26
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

Change `StringRef()` to `""_sr`.

Co-authored-by: A.J. Beamon <[email protected]>
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 8e478b7
  • Duration 1:06:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 8e478b7
  • Duration 1:06:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 8668266
  • Duration 1:22:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 8668266
  • Duration 0:13:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 8668266
  • Duration 0:28:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 8668266
  • Duration 0:31:15
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 8668266
  • Duration 0:36:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 8668266
  • Duration 1:03:44
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 8668266
  • Duration 1:17:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 63098b0
  • Duration 0:13:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 63098b0
  • Duration 0:23:08
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 63098b0
  • Duration 0:29:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 63098b0
  • Duration 0:36:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 63098b0
  • Duration 0:48:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 63098b0
  • Duration 1:24:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 37b0b08
  • Duration 0:16:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 37b0b08
  • Duration 0:32:32
  • Result: ❌ FAILED
  • Error: Error while executing command: docker build --label "org.foundationdb.version=${FDB_VERSION}" --label "org.foundationdb.build_date=${BUILD_DATE}" --label "org.foundationdb.commit=${COMMIT_SHA}" --progress plain --build-arg FDB_VERSION="${FDB_VERSION}" --build-arg FDB_LIBRARY_VERSIONS="${FDB_VERSION}" --build-arg FDB_WEBSITE="${FDB_WEBSITE}" --tag foundationdb/ycsb:${FDB_VERSION}-${COMMIT_SHA} --file Dockerfile --target ycsb .. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 37b0b08
  • Duration 0:35:57
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@sfc-gh-yajin
Copy link
Collaborator Author

#8 ERROR: executor failed running [/bin/sh -c yum -y install     java-11-openjdk-11.0.13.0.8-1.el7_9 &&     yum clean all &&     rm -rf /var/cache/yum]: exit code: 1
------
 > [ycsb 1/7] RUN yum -y install     java-11-openjdk-11.0.13.0.8-1.el7_9 &&     yum clean all &&     rm -rf /var/cache/yum:
------
executor failed running [/bin/sh -c yum -y install     java-11-openjdk-11.0.13.0.8-1.el7_9 &&     yum clean all &&     rm -rf /var/cache/yum]: exit code: 1
time="2023-03-14T16:42:08.802676880Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {localhost  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc

Unrelated to this PR

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 37b0b08
  • Duration 1:54:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@jzhou77 jzhou77 merged commit 74f4b08 into apple:main Mar 14, 2023
@sfc-gh-yajin sfc-gh-yajin deleted the deflake-test-1 branch March 14, 2023 18:26
sfc-gh-yajin added a commit that referenced this pull request Mar 14, 2023
sfc-gh-yajin added a commit that referenced this pull request Mar 14, 2023
…ease-71.3) (#9689)

Cherry-Pick of #9627

Original Description:

In FastTriggeredWatchesWorkload, if the randomized new value for the given setKey happens to be the same as the current value, the following will hold

first is true, and
getDuration is 0, and
assertion lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration) will fail
To fix this, change the assertion to

assert(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
	lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration));

Test plan:
On c27cdcd:

Apply to the fix on top of the commit reported by Joshua test. Rerun the command to make sure it passes.
devRunCorrectnessFiltered +FastTriggeredWatches* --max-runs 10000
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants