
Addressing some PR feedback on tests #34898

Closed
pradipd wants to merge 3 commits into moby:master from pradipd:test_fixes

Conversation

@pradipd
Contributor

@pradipd pradipd commented Sep 19, 2017

Signed-off-by: Pradip Dhara [email protected]

- What I did
Address additional feedback from @dnephin on the following PR:
#34674 (review)

- How I did it

- How to verify it
make test-integration

- Description for the changelog
Some test cleanup in integration/services/create_test.go

- A picture of a cute animal (not mandatory but encouraged)
Imagine some kittens sleeping

Comment thread integration/service/inspect_test.go Outdated
}

-func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
+func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64, checkState bool, desiredState swarm.TaskState) func(log poll.LogT) poll.Result {
Member

I think we can remove these new arguments. Checking for running state should be fine in all cases.

Contributor Author

That's what I thought too. Turns out the service created in TestInspect does not go into Running state. If I remember correctly, it ends up Rejected or Failed.
Do you want me to make that change and add a log message to see what the final state is?

Member

I'll try it out and see why it doesn't get to the running state.

Member

@dnephin dnephin Sep 22, 2017

I tried this branch locally with this diff:

diff --git a/integration/service/create_test.go b/integration/service/create_test.go
index 423b929d48..2f80c340b9 100644
--- a/integration/service/create_test.go
+++ b/integration/service/create_test.go
@@ -43,7 +43,7 @@ func TestCreateWithLBSandbox(t *testing.T) {
 	require.NoError(t, err)
 
 	serviceID := serviceResp.ID
-	poll.WaitOn(t, serviceContainerCount(client, serviceID, instances, true, swarm.TaskStateRunning))
+	poll.WaitOn(t, serviceContainerCount(client, serviceID, instances))
 
 	network, err := client.NetworkInspect(context.Background(), overlayID, types.NetworkInspectOptions{})
 	require.NoError(t, err)
diff --git a/integration/service/inspect_test.go b/integration/service/inspect_test.go
index 3d19d78275..c5adfa4bf0 100644
--- a/integration/service/inspect_test.go
+++ b/integration/service/inspect_test.go
@@ -37,7 +37,7 @@ func TestInspect(t *testing.T) {
 	require.NoError(t, err)
 
 	id := resp.ID
-	poll.WaitOn(t, serviceContainerCount(client, id, instances, false, swarm.TaskStateNew))
+	poll.WaitOn(t, serviceContainerCount(client, id, instances))
 
 	service, _, err := client.ServiceInspectWithRaw(ctx, id, types.ServiceInspectOptions{})
 	require.NoError(t, err)
@@ -129,7 +129,7 @@ func newSwarm(t *testing.T) *daemon.Swarm {
 	return d
 }
 
-func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64, checkState bool, desiredState swarm.TaskState) func(log poll.LogT) poll.Result {
+func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
 	return func(log poll.LogT) poll.Result {
 		filter := filters.NewArgs()
 		filter.Add("service", id)
@@ -140,11 +140,9 @@ func serviceContainerCount(client client.ServiceAPIClient, id string, count uint
 		case err != nil:
 			return poll.Error(err)
 		case len(tasks) == int(count):
-			if checkState {
-				for _, task := range tasks {
-					if task.Status.State != desiredState {
-						return poll.Continue("waiting for tasks to enter %v", desiredState)
-					}
+			for _, task := range tasks {
+				if task.Status.State != swarm.TaskStateRunning {
+					return poll.Continue("waiting for tasks to enter state %v", swarm.TaskStateRunning)
 				}
 			}
 			return poll.Success()

and it worked for me

Member

@dnephin dnephin left a comment

I'm also seeing TestCreateWithLBSandbox run 3x slower than TestInspect, which I wouldn't expect. Any idea why that might be?

@pradipd
Contributor Author

pradipd commented Sep 25, 2017

Whew! For a second I thought I was going crazy.

Checkout:
https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/6097/console

From the logs, the following 2 messages are repeated for every task that is created.

time="2017-09-25T20:55:26.502388186Z" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=8ozhptw4rxg5zqlvecf32nzrs service.id=m2cgf1szlx1cqlsr4y3likpxj task.id=4lncndlo2jxknd4wqaqksawov

time="2017-09-25T20:55:26.502535290Z" level=debug msg="state changed" module=node/agent/taskmanager node.id=8ozhptw4rxg5zqlvecf32nzrs service.id=m2cgf1szlx1cqlsr4y3likpxj state.desired=RUNNING state.transition="RUNNING->FAILED" task.id=4lncndlo2jxknd4wqaqksawov

@dnephin
Member

dnephin commented Sep 26, 2017

Thanks for reproducing the error. I guess there is some setting in the service config that fails on powerpc/z, but it's not obvious from the logs.

@tophj-ibm any idea what that might be? Are there things that aren't supported? The config is here https://github.com/moby/moby/blob/master/integration/service/inspect_test.go#L55-L109

@tophj-ibm
Contributor

@dnephin I'm not aware of anything not supported, and that config looks like it should work to me. I'll run this locally and see what I can find

@tophj-ibm
Contributor

I can't get this to fail on either my machine or the CI machine this was run on.

Looking through the logs of the swarm node, I'm seeing over 2,300 calls to GET tasks, so something is definitely going on. I'll take a more thorough look at the logs tomorrow.

@dnephin
Member

dnephin commented Sep 26, 2017

It might need to sleep longer between polls. We could try adding poll.WithDelay(time.Second)

@pradipd pradipd requested a review from tianon as a code owner September 26, 2017 21:05
@pradipd
Contributor Author

pradipd commented Sep 26, 2017

NOTE: Do NOT merge this. I modified .integration-test-helpers to see if I can get a repro on why powerpc and z fail in TestInspect.

@pradipd
Contributor Author

pradipd commented Sep 26, 2017

Sorry. All of that provided nothing more than we already knew.
task: non-zero exit (1)

@tophj-ibm
Contributor

tophj-ibm commented Sep 27, 2017

I think you need to change https://github.com/moby/moby/pull/34898/files#diff-19c15b89215cbd1096e51081369e2b50R142 to case len(tasks) >= int(count):

I'm guessing that the tasks and count are equal in that case, but the swarm task isn't running yet so it triggers a continue, and then never gets back to that loop because tasks > count. The test does a GET task every iteration, so I'm also guessing that's why I'm seeing all these calls to GET task.
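The failure mode described above can be sketched with hypothetical task records (the `task` type below is a stand-in for the Swarm API's task objects): once a replica fails and is restarted, the task list keeps the failed record alongside its replacement, so a 2-replica service reports 3 tasks and the strict `len(tasks) == int(count)` check can never match again:

```go
package main

import "fmt"

// task is a minimal stand-in for a Swarm task record.
type task struct{ state string }

// countMatches is the strict equality check under discussion.
func countMatches(tasks []task, count int) bool {
	return len(tasks) == count
}

func main() {
	count := 2
	// One replica failed and was restarted: three records remain.
	tasks := []task{{"running"}, {"failed"}, {"running"}}
	fmt.Println(countMatches(tasks, count)) // strict == misses forever
	fmt.Println(len(tasks) >= count)        // the suggested >= still fires
}
```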

@dnephin
Member

dnephin commented Sep 27, 2017

hmm, why do you think len(tasks) would be greater than expected count? The test uses the same variable for both replicas and expected count.

@tophj-ibm
Contributor

@dnephin oh, you're right; I thought it was being incremented in the test. The task count is definitely wrong, though. This is the error msg from Jenkins: poll.go:121: timeout hit after 10s: task count at 10 waiting for 2

@dnephin
Member

dnephin commented Sep 27, 2017

Ah, maybe the tasks need to be filtered to only the active tasks. But even then, the only reason it would ever go above 2 would be because the earlier tasks failed.

@tophj-ibm
Contributor

yeah I'm testing it now, it's definitely failing and hitting the max restart policy limit (4)

@tophj-ibm
Contributor

Okay here we go, from the service logs

 /bin/top: invalid option -- 'u'

I guess the busybox top version doesn't support that option.

@pradipd
Contributor Author

pradipd commented Sep 27, 2017

Shouldn't the behavior be the same across all platforms? The same command runs on experimental.

@dnephin
Member

dnephin commented Sep 28, 2017

Ya, I agree that it is strange that busybox on one arch has different flags from busybox on another.

Maybe this would be fixed by using the new canonical busybox multi-arch image, instead of the s390x/busybox image?

@tophj-ibm
Contributor

They should be the same, this also fails on x86_64 and you can test it with docker run -it busybox top -u root

What is happening is that the tests just want both replicas to be running at the same time, and as soon as they are, it passes. Because the top command exists in the container, it actually does change state to running and stays there for a few seconds before exiting and changing state to failed. However, because of timing, it's possible both replicas go from start->running->failed->restart and neither of them lands on running at the same time, which is what is happening with the p/z/janky nodes.

IMO we should:

1.) definitely change the config and remove the -u root arg, as that is the main issue.

2.) If one of the nodes does legitimately trigger a restart, tasks does get incremented so this line will never run case len(tasks) == int(count): I'm not sure if this incrementation is intended or not @dnephin but if it is (the other tasks are the failed tasks) then that should be changed to >=

3.) To get around future flakiness, one option would be to increase the number of replicas, that way it's a lot less likely that this will falsely pass with a failing node, or you could check and make sure the service logs stderr is nil before proceeding, and if it isn't, error out.

I have a test branch here that monitors the states of the nodes over 30 seconds, you can see what is going on https://github.com/tophj-ibm/moby/tree/test-pr-34898
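The success condition the poll is really waiting for, as described above, is "every replica in the running state at the same instant", which this minimal sketch captures (names illustrative); if the replicas cycle through running->failed out of phase, this predicate may never hold in any single poll:

```go
package main

import "fmt"

// allRunning reports whether every replica's observed state is
// "running" simultaneously, the condition the test polls for.
func allRunning(states []string) bool {
	for _, s := range states {
		if s != "running" {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allRunning([]string{"running", "running"})) // both up at once
	fmt.Println(allRunning([]string{"running", "failed"}))  // out of phase
}
```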

@dnephin
Member

dnephin commented Sep 28, 2017

@tophj-ibm thanks for looking into this, that all makes sense. That is my mistake. I must have looked at top somewhere else and not on busybox.

I think we can fix this by:

  1. changing the args to -n 200 (i'd like to keep args in the test, since it's supposed to be a full config)
  2. fixing the task inspect to only return "active" tasks (I think that's possible, but I have to look into the proper filters)

@pradipd if you'd like me to make these changes I will try to push a commit to this PR in the next day or so
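Fix 2 above can be sketched stdlib-only (the `task` type and `activeTasks` helper below are illustrative, not the Docker client API): count only tasks whose desired state is still running, so failed records left behind by a restart no longer inflate the count. In the real test this would presumably be done server-side via a task-list filter rather than client-side as here:

```go
package main

import "fmt"

// task is a minimal stand-in for a Swarm task record; failed tasks
// that have been replaced get a desired state of "shutdown".
type task struct {
	desiredState string
	state        string
}

// activeTasks keeps only the tasks the orchestrator still wants running.
func activeTasks(tasks []task) []task {
	var active []task
	for _, t := range tasks {
		if t.desiredState == "running" {
			active = append(active, t)
		}
	}
	return active
}

func main() {
	// One replica failed and was restarted: three records, two active.
	tasks := []task{
		{"running", "running"},
		{"shutdown", "failed"},
		{"running", "running"},
	}
	fmt.Println(len(activeTasks(tasks)))
}
```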

@tophj-ibm
Contributor

tophj-ibm commented Sep 28, 2017

those fixes seem good to me, -n 200 definitely works

@pradipd
Contributor Author

pradipd commented Sep 28, 2017

@dnephin I would appreciate if you could try them out. thanks!

fix args passed to top
set default polling config

Signed-off-by: Daniel Nephin <[email protected]>
@dnephin
Member

dnephin commented Sep 28, 2017

It seems to be working now with the changes I pushed

Contributor

@tophj-ibm tophj-ibm left a comment

power job passed all the tests and failed on removing for some reason, will look at that shortly. LGTM, thanks @pradipd, @dnephin!

Member

@vdemeester vdemeester left a comment

LGTM 🐯

@thaJeztah
Member

ping @johnstep PTAL

@yongtang
Member

ping @johnstep to take a look.

@thaJeztah
Member

@pradipd looks like this needs a rebase 😢

ping @johnstep PTAL

@thaJeztah
Member

May need some squashing as well

@vieux vieux self-assigned this Nov 9, 2017
@thaJeztah thaJeztah assigned johnstep and unassigned vieux Jun 7, 2018
@thaJeztah
Member

ping @pradipd @johnstep were you still working on this? PTAL

@cpuguy83
Member

Going to close this because it seems inactive. Ping me if you'd still like to get this in and I can re-open.

Thanks! 🙇 👼

@cpuguy83 cpuguy83 closed this Nov 27, 2018

10 participants