Skip to content

Back off between attempts to start the tests#15106

Merged
DHowett merged 5 commits intomainfrom
dev/migrie/b/these-mother-uckers-ucking-with-my-tests
Apr 13, 2023
Merged

Back off between attempts to start the tests#15106
DHowett merged 5 commits intomainfrom
dev/migrie/b/these-mother-uckers-ucking-with-my-tests

Conversation

@zadjii-msft
Copy link
Member

@zadjii-msft zadjii-msft commented Apr 4, 2023

This is a test PR for kicking the CI.

Looking through this test, I seriously don't understand how this doesn't work. I mean, I don't really get how it does work, but at this point in the tests, we've actually established that both Nihilist.exe and openconsole are running. From my read, there's no reason these should be failing at this point.

We previously added a "retry 5 times" bit to this test, in #8534. That did work back then. So uh, just do that... again?

@zadjii-msft
Copy link
Member Author

zadjii-msft commented Apr 4, 2023

As a note: Run 2 succeeded, with a couple of tests succeeding on try 2&3. It ultimately failed because it couldn't publish the results, because there was already a successful run, whatever.

Run 1 Retry 2 Retries 3 Retries 4 Retries
1 0 0 0 0
2 1 1 0 0
3 1 0 0 0
4 2 1 0 0
5 0 0 0 0
6 2 0 0 0
7❌ 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0

@DHowett
Copy link
Member

DHowett commented Apr 4, 2023

I can't understand how "run 4" succeeded in "first try" 2 times? It's a single run, it only ran once?

@zadjii-msft
Copy link
Member Author

zadjii-msft commented Apr 4, 2023

Shh Dustin ignore all that. I'm basically just ctrl+f'ing the log for Succeeded on try #<N> and then logging how many times a test took a retry to get initialized. So far, no test has taken till retry 4. We didn't have the logging before though, so I can't easily tell before this change how deep in the retries tests would usually take.

@zadjii-msft
Copy link
Member Author

Hmm. After re-doing the tests 6 times, Run 7 actually straight up failed too. I always assumed that just adding a longer delay wouldn't really work. We need to figure out why it's not actually attaching.

that being said, it sure is weird that a retry does work sometimes... like, that's weird

@zadjii-msft
Copy link
Member Author

image

well that's ironic

VERIFY_WIN32_BOOL_SUCCEEDED_RETURN(FreeConsole());

// Wait a moment for the driver to be ready after freeing to attach.
Sleep(100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed?

Sleep(1000);

// Sleep with a backoff, to give us longer to try next time.
Sleep(1000 * (1 + tries));
Copy link
Member

@lhecker lhecker Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, and really really don't quote me on this, but as far as I can tell, Sleep and all other "waitable" functions are all based on KiCheckWaitNext which uses QueryInterruptTimePrecise for calls like Sleep and QueryUnbiasedInterruptTime for APIs like WaitForSingleObject.

The difference between the Query*Time APIs that have unbiased in their name is:

The unbiased interrupt-time count does not include time the system spends in sleep or hibernation.

In other words, as far as I can tell, Sleep sleeps in "real time" and not in "run time", so if our CIs, which probably run on some very cheap spot instances, get suspended quite often, Sleep will effectively be skipped every time, without the OS and the apps on it really progressing at all.

I would thus try and fix it by calling this instead:

WaitForSingleObject(GetCurrentThread(), 1000);

Copy link
Member

@lhecker lhecker Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using exponential backoff with 5-10 retries would probably be a good idea anyways though. A good initial delay would probably be something like 10ms?, so that it doesn't throttle the tests too much on our fast developer machines. (Could you test on your own PC how many ms are enough so that it never hits retry when running locally?) Something like this:

// This will wait for up to 32s in total (from 10ms to 163840ms)
for (DWORD delay = 10; delay < 30000; delay *= 2)
{
    // ...
    if (succeeded)
    {
        return true;
    }

    WaitForSingleObject(GetCurrentThread(), delay);
}

VERIFY_FAILED(); // what's the right call again?
return false;

BTW should we really reopen the CRT handles inside the loop? Seems like we should only do that once before the loop...

@zadjii-msft
Copy link
Member Author

zadjii-msft commented Apr 12, 2023

Run 1 Retry 2 Retries 3 Retries >=4 Retries
1✅ 2 0 0 0
2✅ 2 1 0 1
3✅ 2 1 2 0
4✅ 1 3 2 1
5❔ 0 0 0 0
6❔ 0 0 0 0
7❔ 0 0 0 0
8❔ 0 0 0 0
9❔ 0 0 0 0
10❔ 0 0 0 0

@zadjii-msft
Copy link
Member Author

0a39269 had 9/10 runs pass. That's annoying, but maybe better than what we're getting now.

};

VERIFY_IS_LESS_THAN(tries, 100, L"Make sure we set up the new console in time");
VERIFY_IS_LESS_THAN(delay, 30000u, L"Make sure we set up the new console in time");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to just return true above instead of breaking. That way you can blindly VERIFY_FAIL() here.

Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jeez. if it works.

@DHowett DHowett merged commit 72d0566 into main Apr 13, 2023
@DHowett DHowett deleted the dev/migrie/b/these-mother-uckers-ucking-with-my-tests branch April 13, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants