pull-kubernetes-federation-e2e-gce is flaky #45978

Closed
fejta opened this issue May 17, 2017 · 27 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@fejta
Contributor

fejta commented May 17, 2017

from @nikhita on kubernetes/test-infra#2787

I keep hitting this flake: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/45721/pull-kubernetes-federation-e2e-gce/5138/. Noticed that most of the recently updated PRs are failing on this one. See also: #42072.

I mentioned this in the sig-testing channel as well. I wasn't sure whether to mention it in the above issue or create a new one in this repo, but thought it would be better to document it here. :)

@kubernetes/sig-federation-test-failures
/kind flake

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. sig/federation labels May 17, 2017
@csbell
Contributor

csbell commented May 17, 2017 via email

@madhusudancs
Contributor

This was an infrastructure issue. @shashidharatd and @irfanurrehman, who were around when this happened, both noticed it, but they did not have access to Jenkins to manually trigger a new run. We will prioritize moving the federation presubmit deploy job from Jenkins to prow.

Filed an issue here - kubernetes/test-infra#2791

madhusudancs added a commit to madhusudancs/test-infra that referenced this issue May 17, 2017
It has been a little over a week since we started reporting the results
of this job on all the PRs. We did not have any major issues. We had a
couple of minor hiccups: Issues kubernetes/kubernetes#45795 and kubernetes/kubernetes#45978. We had foreseen
problems of the first type but the second one was a little surprising.

SIG-Federation is starting a buildcop rotation and the buildcops should
be able to handle both these types of situations. We don't have all the
tooling in place for non-Googlers to handle these issues because it
needs access to Jenkins, so they still need to ping a Googler when
they see a problem. We are working on moving these jobs out of Jenkins
to prow (Issue kubernetes#2791).

Empirically, these problems have been uncommon and shouldn't affect
the submit queue often.
@foxish
Contributor

foxish commented May 22, 2017

The same issue appears to be happening again with #46071 and #46071

@madhusudancs
Contributor

Thanks! We are debugging this. We have made federation presubmits non-blocking for now. You should be able to merge PRs without that job passing.

@0xmichalis
Contributor

Can't seem to get past this flake in #46169

@0xmichalis
Contributor

OK, just saw @madhusudancs's comment, and it seems that the PR is in the queue. Sorry for the noise.

@xiao-zhou
Contributor

My PR hit this flaky test as well: #46213

@pmichali
Contributor

In my PR #46138, I see this fail, along with pull-kubernetes-kubemark-e2e-gce and pull-kubernetes-node-e2e. Not sure whether flake issues should be filed for the other two tests.

@ncdc
Member

ncdc commented May 23, 2017

@pmichali as we discussed on Slack, the other two failures were most likely caused by changes in your PR itself and were not actual flakes.

@pmichali
Contributor

Corrected the other two issues with my latest commit on #46138, but still see this issue.

@janetkuo
Member

Is it flaky or broken? I haven't seen it pass

@janetkuo janetkuo added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label May 25, 2017
@perotinus
Contributor

It's flaky: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-federation-e2e-gce

There was a fix yesterday for one issue that was causing flakiness, but it appears that it was not comprehensive. We're looking into this.

@perotinus
Contributor

This appears to have been fixed by recycling the clusters. The earlier fix was merged on May 25, after the daily cluster recycling, so some clusters were left in a bad state because of previous failures. Once the clusters were recycled on the morning of the 26th, the tests stopped being flaky.

@perotinus
Contributor

/assign @perotinus

@perotinus
Contributor

/close

@caesarxuchao
Member

I'm reopening the issue. @madhusudancs could you take a look? The error message is "error during ./federation/cluster/federation-up.sh: exit status 255", so it might be a test-infra issue.

@caesarxuchao caesarxuchao reopened this Jun 2, 2017
@madhusudancs
Contributor

@caesarxuchao this test ran when we had just fixed the test-infra issue and were redeploying things. You shouldn't see this problem if you re-run the test now.

@caesarxuchao
Member

Thanks. Closing.

@pmichali
Contributor

@madhusudancs I still see this issue (I think it is this) on #46138 and #46874. Can you please advise?

@sttts sttts reopened this Jul 10, 2017
@pmichali
Contributor

I see a few other people have the same log messages as I do. The failure seems to occur consistently (I retested twice).

@mattmoyer
Contributor

There's another recent string of failures for this: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-federation/?

For example, the most recent: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-federation/5221/

W0713 15:10:55.034] 2017/07/13 15:10:54 main.go:191: Something went wrong: error starting federation: error during ./federation/cluster/federation-up.sh: exit status 124
W0713 15:10:

@csbell
Contributor

csbell commented Jul 13, 2017 via email

@perotinus
Contributor

I believe this has been addressed.

@perotinus
Contributor

/close
