pull-kubernetes-federation-e2e-gce is flaky #45978
The clusters have been re-cycled (jenkins job re-build) and test runs are
passing as of ~10:10 PDT. We'll have to figure out why the base clusters
timed out.
…On Wed, May 17, 2017 at 11:15 AM, Erick Fejta ***@***.***> wrote:
from @nikhita on kubernetes/test-infra#2787
I keep hitting this flake: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/45721/pull-kubernetes-federation-e2e-gce/5138/. Noticed that most of the recently updated PRs are failing on this one. See also: kubernetes/kubernetes#42072.
I mentioned this in the sig-testing channel as well and wasn't sure if I should be mentioning in the above issue or create a new one in this repo but thought it would be better to document it here. :)
@kubernetes/sig-federation-test-failures
/kind flake
|
This was an infrastructure issue. @shashidharatd and @irfanurrehman who were around when this happened both noticed this but they did not have access to Jenkins to manually trigger a new run. We will prioritize moving the federation presubmit deploy job from Jenkins to prow. Filed an issue here - kubernetes/test-infra#2791 |
It has been a little over a week since we started reporting the results of this job on all PRs. We have not had any major issues, only a couple of minor hiccups: issues kubernetes/kubernetes#45795 and kubernetes/kubernetes#45978. We had foreseen problems of the first type, but the second one was a little surprising. SIG-Federation is starting a buildcop rotation, and the buildcops should be able to handle both types of situations. We don't yet have all the tooling in place for non-Googlers to handle these issues because that requires access to Jenkins, so they still need to ping a Googler when they see a problem. We are working on moving these jobs out of Jenkins to prow (issue kubernetes/test-infra#2791). Empirically, these problems have been uncommon and shouldn't affect the submit queue often.
Thanks! We are debugging this. We have made federation presubmits non-blocking for now. You should be able to merge PRs without that job passing. |
Can't seem to get past this flake in #46169 |
Ok, just saw @madhusudancs's comment and it seems that the PR is in the queue, sorry for the noise |
My PR hit this flaky test as well #46213 |
In my PR #46138, this job fails along with pull-kubernetes-kubemark-e2e-gce and pull-kubernetes-node-e2e. I'm not sure whether flake issues should be filed for the other two jobs. |
@pmichali as we discussed on slack, the other 2 failures were most likely related to changes in your PR itself and not actual flakes. |
Corrected the other two issues with my latest commit on #46138, but still see this issue. |
Is it flaky or broken? I haven't seen it pass |
There was a fix yesterday for one issue that was causing flakiness, but it appears that it was not comprehensive. We're looking into this. |
This appears to have been fixed by recycling the clusters. The earlier fix was merged on May 25, after that day's cluster recycling, so some clusters were left in a bad state from previous failures. Once the clusters were recycled on the morning of the 26th, the tests stopped being flaky. |
/assign @perotinus |
/close |
I'm reopening the issue. @madhusudancs could you take a look? Since the error message is |
@caesarxuchao this test ran when we had just fixed the test infra issue and were redeploying things. You shouldn't see this problem if you re-run the test now. |
Thanks. Closing. |
@madhusudancs I still see this issue (I think it is this) on #46138 and #46874. Can you please advise? |
I see a few other people have the same log messages that I see. The failure seems to occur consistently (I retested twice). |
There's another recent string of failures for this: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-federation/? For example, the most recent: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-federation/5221/
|
Looking into it.
…On Thu, Jul 13, 2017 at 9:05 AM, Matt Moyer ***@***.***> wrote:
There's another recent string of failures for this:
https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-federation/?
For example, the most recent: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-federation/5221/
W0713 15:10:55.034] 2017/07/13 15:10:54 main.go:191: Something went wrong: error starting federation: error during ./federation/cluster/federation-up.sh: exit status 124
W0713 15:10:
|
I believe this has been addressed. |
/close |
from @nikhita on kubernetes/test-infra#2787
I keep hitting this flake: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/45721/pull-kubernetes-federation-e2e-gce/5138/. Noticed that most of the recently updated PRs are failing on this one. See also: #42072.
I mentioned this in the sig-testing channel as well and wasn't sure if I should be mentioning in the above issue or create a new one in this repo but thought it would be better to document it here. :)
@kubernetes/sig-federation-test-failures
/kind flake