xds_end2end_test: avoid flakes from lingering shutdown from previous test by markdroth · Pull Request #25561 · grpc/grpc

markdroth · 2021-02-25T18:50:40Z

This addresses xds_end2end_test flakes caused by lingering shutdown from a previous test impacting the following test. As an example, one common flake pattern was the following:

A test that uses load reporting finishes, but the channel code is not yet done shutting down its LB policies.
The next test, which does not use load reporting, starts up. The first thing it does is clear the global XdsClient instance, so that the new test will create a new one that points to the new xDS server instance.
At this point, the lingering client code from the previous test winds up creating a new XdsClient instance, which means that the lingering client code from the previous test will now be talking to the xDS server instance for the new test.
The client code from the previous test starts a new LRS stream, since the previous test was configured to do load reporting. However, the new test is not configured for load reporting, so when it sees an LRS stream, the test fails.

To fix this, we make two changes.

First, we change the order in which we initialize the infrastructure for each test. We now set up the new xDS server instances before we clear the global XdsClient state, which gives the previous test more time to finish shutting down. Note that this is still not a complete guarantee that the previous test will be finished shutting down by the time we clear the global XdsClient state, so it may still be possible for there to be more flakes in the future. But in practice, it seems to give enough leeway to avoid the problem.

The second change is to stop reusing ports between tests. Reusing ports between tests means that the xDS server will usually be on the same port, which means that the lingering client code from the previous test will wind up talking to the xDS server instance from the new test. Not reusing ports avoids this by ensuring that any lingering client code from the previous test will not interfere with the state in the new test.

Note that the code to reuse ports between tests was added back in #20787 to address a limitation of the number of ports that can be used by a test internally. To avoid hitting that internal limit, I have increased the number of shards for the test, so that each test instantiation is using fewer ports and therefore not bumping into the limit (which I have verified). I have also added a TODO for ways that we can further decrease port usage in the future.

With this PR, there is no more risk of cross-test contamination with regard to communication with the backend or balancer instances themselves. The only remaining risk here is that it's still possible that the previous test might still recreate the XdsClient after the new test clears the global state, but the probability of that has been lowered by changing the initialization order of the tests.

…test

lidizheng · 2021-02-25T20:55:01Z

In #25008, the comment did mention a possibility that a reset of state may be required. It's a debt we choose to bear.

Without an explicit grpc_init()/grpc_shutdown() there is going to be some state that does not get reset, but that should not affect test outcomes atleast currently. There is a chance that in the future that we might have such state that needs to be reset between tests and a blocking shutdown is the only solution I can think of for now (though it would potentially slow down our tests).

There is no internal C-core API that will solve this.

We could add a new API to XdsClient. E.g., XdsClient::Destory to block until the object goes away.

(My bad to comment on the other PR, I though by merging the changes you want to submit both of them in one batch.)

donnadionne

Seems to provide much better isolation from one test to the next! Wondering if this would make the test run longer? but it's good to have the isolation.

markdroth · 2021-02-25T21:00:04Z

@lidizheng

We could add a new API to XdsClient. E.g., XdsClient::Destory to block until the object goes away.

Yes, we can consider something like that if we need to in the future. For now, I think this is good enough.

@donnadionne

Seems to provide much better isolation from one test to the next! Wondering if this would make the test run longer? but it's good to have the isolation.

No, this shouldn't make the test run any longer. (Although if we implement Lidi's suggestion in the future, that would make the tests take a bit longer, because it would delay starting each test until the previous one is completely done cleaning up.)

markdroth · 2021-02-26T08:29:26Z

The "Bazel C/C++ Dbg MacOS" failure is an infrastructure failure.
The "Artifact Build Linux" failure is an infrastructure timeout.

Known issues: #24204

xds_end2end_test: avoid flakes from lingering shutdown from previous …

4333d5d

…test

markdroth added the release notes: no Indicates if PR should not be in release notes label Feb 25, 2021

markdroth requested a review from donnadionne February 25, 2021 18:50

markdroth mentioned this pull request Feb 25, 2021

Second attempt: xDS HTTP filter support for gRPC client #25558

Merged

donnadionne approved these changes Feb 25, 2021

View reviewed changes

markdroth merged commit d482f12 into grpc:master Feb 26, 2021

markdroth deleted the xds_test_init_fix branch February 26, 2021 08:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xds_end2end_test: avoid flakes from lingering shutdown from previous test#25561

xds_end2end_test: avoid flakes from lingering shutdown from previous test#25561
markdroth merged 1 commit intogrpc:masterfrom
markdroth:xds_test_init_fix

markdroth commented Feb 25, 2021

Uh oh!

lidizheng commented Feb 25, 2021

Uh oh!

donnadionne left a comment

Uh oh!

markdroth commented Feb 25, 2021

Uh oh!

markdroth commented Feb 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

markdroth commented Feb 25, 2021

Uh oh!

lidizheng commented Feb 25, 2021

Uh oh!

donnadionne left a comment

Choose a reason for hiding this comment

Uh oh!

markdroth commented Feb 25, 2021

Uh oh!

markdroth commented Feb 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants