Back off between attempts to start the tests#15106
Conversation
|
As a note: Run 2 succeeded, with a couple of tests succeeding on try 2&3. It ultimately failed because it couldn't publish the results, because there was already a successful run, whatever.
|
|
I can't understand how "run 4" succeeded in "first try" 2 times? It's a single run, it only ran once? |
|
Shh Dustin ignore all that. I'm basically just ctrl+f'ing the log for |
|
Hmm. After re-doing the tests 6 times, Run 7 actually straight up failed too. I always assumed that just adding a longer delay wouldn't really work. We need to figure out why it's not actually attaching. that being said, it sure is weird that a retry does work sometimes... like, that's weird |
src/host/ft_host/InitTests.cpp
Outdated
| VERIFY_WIN32_BOOL_SUCCEEDED_RETURN(FreeConsole()); | ||
|
|
||
| // Wait a moment for the driver to be ready after freeing to attach. | ||
| Sleep(100); |
src/host/ft_host/InitTests.cpp
Outdated
| Sleep(1000); | ||
|
|
||
| // Sleep with a backoff, to give us longer to try next time. | ||
| Sleep(1000 * (1 + tries)); |
There was a problem hiding this comment.
As far as I can tell, and really really don't quote me on this, but as far as I can tell, Sleep and all other "waitable" functions are all based on KiCheckWaitNext which uses QueryInterruptTimePrecise for calls like Sleep and QueryUnbiasedInterruptTime for APIs like WaitForSingleObject.
The difference between the Query*Time APIs that have unbiased in their name is:
The unbiased interrupt-time count does not include time the system spends in sleep or hibernation.
In other words, as far as I can tell, Sleep sleeps in "real time" and not in "run time", so if our CIs, which probably run on some very cheap spot instances, get suspended quite often, Sleep will effectively be skipped every time, without the OS and the apps on it really progressing at all.
I would thus try and fix it by calling this instead:
WaitForSingleObject(GetCurrentThread(), 1000);There was a problem hiding this comment.
Using exponential backoff with 5-10 retries would probably be a good idea anyways though. A good initial delay would probably be something like 10ms?, so that it doesn't throttle the tests too much on our fast developer machines. (Could you test on your own PC how many ms are enough so that it never hits retry when running locally?) Something like this:
// This will wait for up to 32s in total (from 10ms to 163840ms)
for (DWORD delay = 10; delay < 30000; delay *= 2)
{
// ...
if (succeeded)
{
return true;
}
WaitForSingleObject(GetCurrentThread(), delay);
}
VERIFY_FAILED(); // what's the right call again?
return false;BTW should we really reopen the CRT handles inside the loop? Seems like we should only do that once before the loop...
|
|
0a39269 had 9/10 runs pass. That's annoying, but maybe better than what we're getting now. |
| }; | ||
|
|
||
| VERIFY_IS_LESS_THAN(tries, 100, L"Make sure we set up the new console in time"); | ||
| VERIFY_IS_LESS_THAN(delay, 30000u, L"Make sure we set up the new console in time"); |
There was a problem hiding this comment.
Another option would be to just return true above instead of breaking. That way you can blindly VERIFY_FAIL() here.

This is a test PR for kicking the CI.
Looking through this test, I seriously don't understand how this doesn't work. I mean, I don't really get how it does work, but at this point in the tests, we've actually established that both
Nihilist.exeand openconsole are running. From my read, there's no reason these should be failing at this point.We previously added a "retry 5 times" bit to this test, in #8534. That did work back then. So uh, just do that... again?