Skip to content

fix a flaky test in advanced TLS#8474

Merged
ZhenLian merged 3 commits intogrpc:masterfrom
ZhenLian:zhen_advancedtls_5
Sep 8, 2021
Merged

fix a flaky test in advanced TLS#8474
ZhenLian merged 3 commits intogrpc:masterfrom
ZhenLian:zhen_advancedtls_5

Conversation

@ZhenLian
Copy link
Copy Markdown
Contributor

@ZhenLian ZhenLian commented Sep 2, 2021

This is intended to fix #8454.
The problem was because at the time we scheduled the thread for the client, we didn't reserve enough time before starting the connection, so there could be a situation where the handshake has established, while the client hasn't finished its first read/reload yet.
I added the code to sleep for 5 seconds and then begin our connection. Ideally, to fully remove the race condition, we might need some ways for the reloading thread telling us that the first read has been completed in the main test. But I can't think of any way that won't introduce any changes to the main code(I assume it is a bad practice to change code under "src" to better fit code under "test").
I've verified internally that this changes worked fine by running it for 1000 times(it failed previously).

@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Sep 2, 2021

Oh, gosh, sleeps. No, we need to remove the sleeps. This is an API issue and will impact users as well.

I don't understand what code you are talking about with "we scheduled the thread for the client". You only changed advancedTlsKeyManagerTrustManagerMutualTlsTest and it doesn't use the executor at all. Seems you needed to modify onFileReloadingKeyManagerTrustManagerTest instead, since that was the flaky one.

I think all the sleeps need to be removed though, and we need to change the behavior to immediately load the file as soon as updateIdentityCredentialsFromFile and the like are called. The other option is to block within a handshake if the files haven't been loaded (doesn't seem good) or to have a method that can load from file without polling (seems we really should have that now even) and require/expect users to load from file manually the first time.

GeneralSecurityException {
UpdateResult newResult = readAndUpdate(keyFile, certFile, 0, 0);
if (!newResult.success) {
throw new GeneralSecurityException("The initial update was not successful");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user, it isn't clear what to do with this error. As we discussed, success really has more of updated semantics. Maybe use the language, "Files were unmodified before their initial update. Probably a bug."? Ditto for trust manager.

@ZhenLian
Copy link
Copy Markdown
Contributor Author

ZhenLian commented Sep 8, 2021

@ejona86 Hi Eric, the newly added code contains some conditions, e.g. the first time readAndUpdate(), that I couldn't think of a way to deliberately make them fail to cover all code execution paths. That makes the code coverage(codecov/patch) a bit low for this PR. Can we proceed without considering about the coverage at this moment? Or can you think of some other ways to enhance the code coverage?

@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Sep 8, 2021

Yeah, @ZhenLian, it is fine as-is.

@ZhenLian
Copy link
Copy Markdown
Contributor Author

ZhenLian commented Sep 8, 2021

Cool, thank you so much!

@ZhenLian ZhenLian merged commit fb00463 into grpc:master Sep 8, 2021
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Dec 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AdvancedTlsTest.onFileReloadingKeyManagerTrustManagerTest is flaky

2 participants