Skip to content

xds_end2end_test: avoid flakes from lingering shutdown from previous test#25561

Merged
markdroth merged 1 commit intogrpc:masterfrom
markdroth:xds_test_init_fix
Feb 26, 2021
Merged

xds_end2end_test: avoid flakes from lingering shutdown from previous test#25561
markdroth merged 1 commit intogrpc:masterfrom
markdroth:xds_test_init_fix

Conversation

@markdroth
Copy link
Copy Markdown
Member

This addresses xds_end2end_test flakes caused by lingering shutdown from a previous test impacting the following test. As an example, one common flake pattern was the following:

  1. A test that uses load reporting finishes, but the channel code is not yet done shutting down its LB policies.
  2. The next test, which does not use load reporting, starts up. The first thing it does is clear the global XdsClient instance, so that the new test will create a new one that points to the new xDS server instance.
  3. At this point, the lingering client code from the previous test winds up creating a new XdsClient instance, which means that the lingering client code from the previous test will now be talking to the xDS server instance for the new test.
  4. The client code from the previous test starts a new LRS stream, since the previous test was configured to do load reporting. However, the new test is not configured for load reporting, so when it sees an LRS stream, the test fails.

To fix this, we make two changes.

First, we change the order in which we initialize the infrastructure for each test. We now set up the new xDS server instances before we clear the global XdsClient state, which gives the previous test more time to finish shutting down. Note that this is still not a complete guarantee that the previous test will be finished shutting down by the time we clear the global XdsClient state, so it may still be possible for there to be more flakes in the future. But in practice, it seems to give enough leeway to avoid the problem.

The second change is to stop reusing ports between tests. Reusing ports between tests means that the xDS server will usually be on the same port, which means that the lingering client code from the previous test will wind up talking to the xDS server instance from the new test. Not reusing ports avoids this by ensuring that any lingering client code from the previous test will not interfere with the state in the new test.

Note that the code to reuse ports between tests was added back in #20787 to address a limitation of the number of ports that can be used by a test internally. To avoid hitting that internal limit, I have increased the number of shards for the test, so that each test instantiation is using fewer ports and therefore not bumping into the limit (which I have verified). I have also added a TODO for ways that we can further decrease port usage in the future.

With this PR, there is no more risk of cross-test contamination with regard to communication with the backend or balancer instances themselves. The only remaining risk here is that it's still possible that the previous test might still recreate the XdsClient after the new test clears the global state, but the probability of that has been lowered by changing the initialization order of the tests.

@markdroth markdroth added the release notes: no Indicates if PR should not be in release notes label Feb 25, 2021
@lidizheng
Copy link
Copy Markdown
Contributor

In #25008, the comment did mention a possibility that a reset of state may be required. It's a debt we choose to bear.

Without an explicit grpc_init()/grpc_shutdown() there is going to be some state that does not get reset, but that should not affect test outcomes atleast currently. There is a chance that in the future that we might have such state that needs to be reset between tests and a blocking shutdown is the only solution I can think of for now (though it would potentially slow down our tests).

There is no internal C-core API that will solve this.

We could add a new API to XdsClient. E.g., XdsClient::Destory to block until the object goes away.

(My bad to comment on the other PR, I though by merging the changes you want to submit both of them in one batch.)

Copy link
Copy Markdown
Contributor

@donnadionne donnadionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to provide much better isolation from one test to the next! Wondering if this would make the test run longer? but it's good to have the isolation.

@markdroth
Copy link
Copy Markdown
Member Author

@lidizheng

We could add a new API to XdsClient. E.g., XdsClient::Destory to block until the object goes away.

Yes, we can consider something like that if we need to in the future. For now, I think this is good enough.

@donnadionne

Seems to provide much better isolation from one test to the next! Wondering if this would make the test run longer? but it's good to have the isolation.

No, this shouldn't make the test run any longer. (Although if we implement Lidi's suggestion in the future, that would make the tests take a bit longer, because it would delay starting each test until the previous one is completely done cleaning up.)

@markdroth
Copy link
Copy Markdown
Member Author

The "Bazel C/C++ Dbg MacOS" failure is an infrastructure failure.
The "Artifact Build Linux" failure is an infrastructure timeout.

Known issues: #24204

@markdroth markdroth merged commit d482f12 into grpc:master Feb 26, 2021
@markdroth markdroth deleted the xds_test_init_fix branch February 26, 2021 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants