xds_end2end_test: avoid flakes from lingering shutdown from previous test#25561
xds_end2end_test: avoid flakes from lingering shutdown from previous test#25561markdroth merged 1 commit intogrpc:masterfrom
Conversation
|
In #25008, the comment did mention a possibility that a reset of state may be required. It's a debt we choose to bear.
We could add a new API to XdsClient. E.g., (My bad to comment on the other PR, I though by merging the changes you want to submit both of them in one batch.) |
donnadionne
left a comment
There was a problem hiding this comment.
Seems to provide much better isolation from one test to the next! Wondering if this would make the test run longer? but it's good to have the isolation.
Yes, we can consider something like that if we need to in the future. For now, I think this is good enough.
No, this shouldn't make the test run any longer. (Although if we implement Lidi's suggestion in the future, that would make the tests take a bit longer, because it would delay starting each test until the previous one is completely done cleaning up.) |
|
The "Bazel C/C++ Dbg MacOS" failure is an infrastructure failure. Known issues: #24204 |
This addresses xds_end2end_test flakes caused by lingering shutdown from a previous test impacting the following test. As an example, one common flake pattern was the following:
XdsClientinstance, so that the new test will create a new one that points to the new xDS server instance.XdsClientinstance, which means that the lingering client code from the previous test will now be talking to the xDS server instance for the new test.To fix this, we make two changes.
First, we change the order in which we initialize the infrastructure for each test. We now set up the new xDS server instances before we clear the global
XdsClientstate, which gives the previous test more time to finish shutting down. Note that this is still not a complete guarantee that the previous test will be finished shutting down by the time we clear the globalXdsClientstate, so it may still be possible for there to be more flakes in the future. But in practice, it seems to give enough leeway to avoid the problem.The second change is to stop reusing ports between tests. Reusing ports between tests means that the xDS server will usually be on the same port, which means that the lingering client code from the previous test will wind up talking to the xDS server instance from the new test. Not reusing ports avoids this by ensuring that any lingering client code from the previous test will not interfere with the state in the new test.
Note that the code to reuse ports between tests was added back in #20787 to address a limitation of the number of ports that can be used by a test internally. To avoid hitting that internal limit, I have increased the number of shards for the test, so that each test instantiation is using fewer ports and therefore not bumping into the limit (which I have verified). I have also added a TODO for ways that we can further decrease port usage in the future.
With this PR, there is no more risk of cross-test contamination with regard to communication with the backend or balancer instances themselves. The only remaining risk here is that it's still possible that the previous test might still recreate the
XdsClientafter the new test clears the global state, but the probability of that has been lowered by changing the initialization order of the tests.