-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Deflake test FastTriggeredWatches #9627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In FastTriggeredWatchesWorkload, if the randomized new value for the given `setKey` happens to be the same as the current value, the following will hold - `first` is true, and - `getDuration` is 0, and - assertion `lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)` will fail To fix this, change the assertion to ``` assert(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)); ``` Test plan: Apply to the fix on top of the commit reported by Joshua test. Rerun the command to make sure it passes.
|
Hi @sfc-gh-nwijetunga and @sfc-gh-ljoswiak , given that you have some context in #4539, would you be able to review this one? Thanks |
| // Assert that the time from setting the key to triggering the watch is no greater than 25s | ||
| // TODO: This assertion can cause flaky behaviour since sometimes a watch can take longer to fire | ||
| ASSERT(lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || | ||
| ASSERT(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is it that makes this assert fail in this case? It looks like lastReadVersion could become a value less than ver, so you might have lastReadVersion - ver become a negative number. getDuration will also be 0.
That means that the comparison lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration) would become something like:
lastReadVersion - ver < 25000000, which should still be true even if lastReadVersion < ver. If ver is smaller than lastReadVersion, it doesn't seem like this would extend the gap between them any. Do you know what the actual measured gap was?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a test run with additional printing, I got
first=true lastReadVersion=441719868 ver=409158709 max_in_flight=100000000 versions_per_s=1000000 duration=0.00
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible for the lastReadVersion to be greater if the setter commits its transaction before we start the transaction for the watch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this fix would not fix all occurrences of this test being flaky if for example we hit this case where first is false but due to failure injection the lastReadVersion >> ver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happens at Y < X (i.e. we get a read version and then have a very long delay for reading the key, which I think is possible but should fail to commit
Oh, the transaction trying to set a watch will fail to commit because it's not read-only? If so, it will actually retry with first remaining true which falls back to the case of Y >= X. Correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, maybe it won't fail on commit because it is read-only. And if that happens, then we are susceptible to this even on the second pass (where first == false) because our timing of the watch triggering includes all of the time that we spent getting the read version and reading the key initially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sequence would look like:
- Get a read version X
- Set the key at version Y = X+n
- Long delay (say 30s)
- Read the key from a SS that is lagging and can still serve old reads
- Set the watch and commit it
- Watch fires
- Re-read the key at version Z>Y+30s
- Fail our assert because
first == false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a simple fix for this would be to force the transaction to commit (e.g. by setting a write conflict range). We could even then incorporate the commit version as a way to eliminate the time spent in the first transaction from our timing:
int64_t versionDelta = lastReadVersion - std::max(setVersion, watchCommitVersion)
|
I would also just run FastTriggeredWaetches workload (maybe 10k) to see if there are any other failures |
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr on Linux CentOS 7
|
|
thanks @sfc-gh-nwijetunga for the review! |
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
|
This is a good fix, and this is a small nit, but in the case where the values are equal on the first pass, the test isn't testing the watch functionality and so isn't really functioning as intended. What about ensuring |
|
Thanks @sfc-gh-ljoswiak for the suggestion. I may be misreading the code, but choosing a random number from a different range still seems susceptible to the same issue, unless we alternate between two disjoint ranges. |
Yeah you would have to do something like |
|
We could also remember the last value stored and do our generation in a loop until we get a value different than before |
| Version ver = wait(setFuture); | ||
| //TraceEvent("FTWWatchDone").detail("Key", printable(setKey)); | ||
| // Assert that the time from setting the key to triggering the watch is no greater than 25s | ||
| // TODO: This assertion can cause flaky behaviour since sometimes a watch can take longer to fire |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this TODO is actually accurate. It's quite possible all of the cases we saw of the watch firing slowly were actually not timing the watch at all. Perhaps we could delete it and revisit if it recurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
We need to do this for the entire key space, right? I think the probability of generating the same value as current should be low, not sure if is worth tracking this information. |
|
Yeah you're right, I didn't notice that it was generating random keys. And as pointed out above, even fixing this doesn't eliminate the need for your assert because the same problem occurs even if the value is different. My suggestion would be to just leave it as is, since it doesn't really change much about the test to ensure a change in value. |
| ASSERT(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || | ||
| lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)); | ||
| ASSERT(!watchCommitted || | ||
| lastReadVersion - std::max(ver, watchCommitVersion) >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we rename ver for clarity, since we now have three versions being evaluated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Change `StringRef()` to `""_sr`. Co-authored-by: A.J. Beamon <[email protected]>
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Unrelated to this PR |
Result of foundationdb-pr on Linux CentOS 7
|
…ease-71.3) (#9689) Cherry-Pick of #9627 Original Description: In FastTriggeredWatchesWorkload, if the randomized new value for the given setKey happens to be the same as the current value, the following will hold first is true, and getDuration is 0, and assertion lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration) will fail To fix this, change the assertion to assert(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)); Test plan: On c27cdcd: Apply to the fix on top of the commit reported by Joshua test. Rerun the command to make sure it passes. devRunCorrectnessFiltered +FastTriggeredWatches* --max-runs 10000
In FastTriggeredWatchesWorkload, if the randomized new value for the given
setKeyhappens to be the same as the current value, the following will holdfirstis true, andgetDurationis 0, andlastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)will failTo fix this, change the assertion to
Test plan:
On c27cdcd:
Replace this text with your description here...
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branchormainif this is the youngest branch)