
Addressing some PR feedback on tests #34898

Closed
pradipd wants to merge 3 commits into moby:master from pradipd:test_fixes

Conversation

@pradipd
Contributor

@pradipd pradipd commented Sep 19, 2017

Signed-off-by: Pradip Dhara [email protected]

- What I did
Address additional feedback from @dnephin on the following PR:
#34674 (review)

- How I did it

- How to verify it
make test-integration

- Description for the changelog
Some test cleanup in integration/services/create_test.go

- A picture of a cute animal (not mandatory but encouraged)
Imagine some kittens sleeping

Comment thread integration/service/inspect_test.go Outdated
}

-func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
+func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64, checkState bool, desiredState swarm.TaskState) func(log poll.LogT) poll.Result {
Member

I think we can remove these new arguments. Checking for running state should be fine in all cases.

Contributor Author

That's what I thought too. Turns out the service created in TestInspect does not go into Running state. If I remember correctly, it ends up Rejected or Failed.
Do you want me to make that change and add a log message to see what the final state is?

Member

I'll try it out and see why it doesn't get to the running state.

Member

@dnephin dnephin Sep 22, 2017

I tried this branch locally with this diff:

diff --git a/integration/service/create_test.go b/integration/service/create_test.go
index 423b929d48..2f80c340b9 100644
--- a/integration/service/create_test.go
+++ b/integration/service/create_test.go
@@ -43,7 +43,7 @@ func TestCreateWithLBSandbox(t *testing.T) {
 	require.NoError(t, err)
 
 	serviceID := serviceResp.ID
-	poll.WaitOn(t, serviceContainerCount(client, serviceID, instances, true, swarm.TaskStateRunning))
+	poll.WaitOn(t, serviceContainerCount(client, serviceID, instances))
 
 	network, err := client.NetworkInspect(context.Background(), overlayID, types.NetworkInspectOptions{})
 	require.NoError(t, err)
diff --git a/integration/service/inspect_test.go b/integration/service/inspect_test.go
index 3d19d78275..c5adfa4bf0 100644
--- a/integration/service/inspect_test.go
+++ b/integration/service/inspect_test.go
@@ -37,7 +37,7 @@ func TestInspect(t *testing.T) {
 	require.NoError(t, err)
 
 	id := resp.ID
-	poll.WaitOn(t, serviceContainerCount(client, id, instances, false, swarm.TaskStateNew))
+	poll.WaitOn(t, serviceContainerCount(client, id, instances))
 
 	service, _, err := client.ServiceInspectWithRaw(ctx, id, types.ServiceInspectOptions{})
 	require.NoError(t, err)
@@ -129,7 +129,7 @@ func newSwarm(t *testing.T) *daemon.Swarm {
 	return d
 }
 
-func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64, checkState bool, desiredState swarm.TaskState) func(log poll.LogT) poll.Result {
+func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
 	return func(log poll.LogT) poll.Result {
 		filter := filters.NewArgs()
 		filter.Add("service", id)
@@ -140,11 +140,9 @@ func serviceContainerCount(client client.ServiceAPIClient, id string, count uint
 		case err != nil:
 			return poll.Error(err)
 		case len(tasks) == int(count):
-			if checkState {
-				for _, task := range tasks {
-					if task.Status.State != desiredState {
-						return poll.Continue("waiting for tasks to enter %v", desiredState)
-					}
+			for _, task := range tasks {
+				if task.Status.State != swarm.TaskStateRunning {
+					return poll.Continue("waiting for tasks to enter state %v", swarm.TaskStateRunning)
 				}
 			}
 			return poll.Success()

and it worked for me

Member

@dnephin dnephin left a comment

I'm also seeing TestCreateWithLBSandbox run 3x slower than TestInspect, which I wouldn't expect. Any idea why that might be?

@pradipd
Contributor Author

pradipd commented Sep 25, 2017

Whew! For a second I thought I was going crazy.

Checkout:
https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/6097/console

From the logs, the following 2 messages are repeated for every task that is created.

time="2017-09-25T20:55:26.502388186Z" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=8ozhptw4rxg5zqlvecf32nzrs service.id=m2cgf1szlx1cqlsr4y3likpxj task.id=4lncndlo2jxknd4wqaqksawov

time="2017-09-25T20:55:26.502535290Z" level=debug msg="state changed" module=node/agent/taskmanager node.id=8ozhptw4rxg5zqlvecf32nzrs service.id=m2cgf1szlx1cqlsr4y3likpxj state.desired=RUNNING state.transition="RUNNING->FAILED" task.id=4lncndlo2jxknd4wqaqksawov

@dnephin
Member

dnephin commented Sep 26, 2017

Thanks for reproducing the error. I guess there is some setting in the service config that fails on powerpc/z, but it's not obvious from the logs.

@tophj-ibm any idea what that might be? Are there things that aren't supported? The config is here https://github.com/moby/moby/blob/master/integration/service/inspect_test.go#L55-L109

@tophj-ibm
Contributor

@dnephin I'm not aware of anything not supported, and that config looks like it should work to me. I'll run this locally and see what I can find

@tophj-ibm
Contributor

I can't get this to fail on either my machine or the CI machine this was run on.

Looking through the logs of the swarm node, I'm seeing over 2,300 calls to GET tasks, so something is definitely going on. I'll take a more thorough look at the logs tomorrow.

@dnephin
Member

dnephin commented Sep 26, 2017

It might need to sleep longer between polls. We could try adding poll.WithDelay(time.Second)

@pradipd pradipd requested a review from tianon as a code owner September 26, 2017 21:05
@pradipd
Contributor Author

pradipd commented Sep 26, 2017

NOTE: Do NOT merge this. I modified .integration-test-helpers to see if I can get a repro on why powerpc and z fail in TestInspect.

@pradipd
Contributor Author

pradipd commented Sep 26, 2017

Sorry. All of that provided nothing more than we already knew.
task: non-zero exit (1)

@tophj-ibm
Contributor

tophj-ibm commented Sep 27, 2017

I think you need to change https://github.com/moby/moby/pull/34898/files#diff-19c15b89215cbd1096e51081369e2b50R142 to case len(tasks) >= int(count):

I'm guessing that the tasks and count are equal in that case, but the swarm task isn't running yet so it triggers a continue, and then never gets back to that loop because tasks > count. The test does a GET task every iteration, so I'm also guessing that's why I'm seeing all these calls to GET task.
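The failure mode described above can be sketched with hypothetical task records (the `task` type below is a stand-in for the Swarm API's task objects): once a replica fails and is restarted, the task list keeps the failed record alongside its replacement, so a 2-replica service reports 3 tasks and the strict `len(tasks) == int(count)` check can never match again:

```go
package main

import "fmt"

// task is a minimal stand-in for a Swarm task record.
type task struct{ state string }

// countMatches is the strict equality check under discussion.
func countMatches(tasks []task, count int) bool {
	return len(tasks) == count
}

func main() {
	count := 2
	// One replica failed and was restarted: three records remain.
	tasks := []task{{"running"}, {"failed"}, {"running"}}
	fmt.Println(countMatches(tasks, count)) // strict == misses forever
	fmt.Println(len(tasks) >= count)        // the suggested >= still fires
}
```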

@dnephin
Member

dnephin commented Sep 27, 2017

hmm, why do you think len(tasks) would be greater than expected count? The test uses the same variable for both replicas and expected count.

@tophj-ibm
Contributor

@dnephin oh, you're right; I thought it was being incremented in the test. The task count is definitely wrong, though. This is the error msg from Jenkins: poll.go:121: timeout hit after 10s: task count at 10 waiting for 2

@dnephin
Member

dnephin commented Sep 27, 2017

Ah, maybe the tasks need to be filtered to only the active tasks. But even then, the only reason it would ever go above 2 would be because the earlier tasks failed.

@tophj-ibm
Contributor

yeah I'm testing it now, it's definitely failing and hitting the max restart policy limit (4)

@tophj-ibm
Contributor

Okay here we go, from the service logs

 /bin/top: invalid option -- 'u'

I guess the busybox top version doesn't support that option.

@pradipd
Contributor Author

pradipd commented Sep 27, 2017

Shouldn't the behavior be the same across all platforms? The same command runs on experimental.

@dnephin
Member

dnephin commented Sep 28, 2017

Ya, I agree that it is strange that busybox on one arch has different flags from busybox on another.

Maybe this would be fixed by using the new canonical busybox multi-arch image, instead of the s390x/busybox image?

@tophj-ibm
Contributor

They should be the same, this also fails on x86_64 and you can test it with docker run -it busybox top -u root

What is happening is that the tests just want both replicas to be running at the same time, and as soon as they are, it passes. Because the top command exists in the container, it actually does change state to running and stays there for a few seconds before exiting and changing state to failed. However, because of timing, it's possible both replicas go from start->running->failed->restart and neither of them lands on running at the same time, which is what is happening with the p/z/janky nodes.

IMO we should:

1.) definitely change the config and remove the -u root arg, as that is the main issue.

2.) If one of the nodes does legitimately trigger a restart, tasks does get incremented so this line will never run case len(tasks) == int(count): I'm not sure if this incrementation is intended or not @dnephin but if it is (the other tasks are the failed tasks) then that should be changed to >=

3.) To get around future flakiness, one option would be to increase the number of replicas, that way it's a lot less likely that this will falsely pass with a failing node, or you could check and make sure the service logs stderr is nil before proceeding, and if it isn't, error out.

I have a test branch here that monitors the states of the nodes over 30 seconds, you can see what is going on https://github.com/tophj-ibm/moby/tree/test-pr-34898
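The success condition the poll is really waiting for, as described above, is "every replica in the running state at the same instant", which this minimal sketch captures (names illustrative); if the replicas cycle through running->failed out of phase, this predicate may never hold in any single poll:

```go
package main

import "fmt"

// allRunning reports whether every replica's observed state is
// "running" simultaneously, the condition the test polls for.
func allRunning(states []string) bool {
	for _, s := range states {
		if s != "running" {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allRunning([]string{"running", "running"})) // both up at once
	fmt.Println(allRunning([]string{"running", "failed"}))  // out of phase
}
```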

@dnephin
Member

dnephin commented Sep 28, 2017

@tophj-ibm thanks for looking into this, that all makes sense. That is my mistake. I must have looked at top somewhere else and not on busybox.

I think we can fix this by:

  1. changing the args to -n 200 (i'd like to keep args in the test, since it's supposed to be a full config)
  2. fixing the task inspect to only return "active" tasks (I think that's possible, but I have to look into the proper filters)

@pradipd if you'd like me to make these changes I will try to push a commit to this PR in the next day or so
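Fix 2 above can be sketched stdlib-only (the `task` type and `activeTasks` helper below are illustrative, not the Docker client API): count only tasks whose desired state is still running, so failed records left behind by a restart no longer inflate the count. In the real test this would presumably be done server-side via a task-list filter rather than client-side as here:

```go
package main

import "fmt"

// task is a minimal stand-in for a Swarm task record; failed tasks
// that have been replaced get a desired state of "shutdown".
type task struct {
	desiredState string
	state        string
}

// activeTasks keeps only the tasks the orchestrator still wants running.
func activeTasks(tasks []task) []task {
	var active []task
	for _, t := range tasks {
		if t.desiredState == "running" {
			active = append(active, t)
		}
	}
	return active
}

func main() {
	// One replica failed and was restarted: three records, two active.
	tasks := []task{
		{"running", "running"},
		{"shutdown", "failed"},
		{"running", "running"},
	}
	fmt.Println(len(activeTasks(tasks)))
}
```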

@tophj-ibm
Contributor

tophj-ibm commented Sep 28, 2017

those fixes seem good to me, -n 200 definitely works

@pradipd
Contributor Author

pradipd commented Sep 28, 2017

@dnephin I would appreciate if you could try them out. thanks!

fix args passed to top
set default polling config

Signed-off-by: Daniel Nephin <[email protected]>
@dnephin
Member

dnephin commented Sep 28, 2017

It seems to be working now with the changes I pushed

Contributor

@tophj-ibm tophj-ibm left a comment

power job passed all the tests and failed on removing for some reason, will look at that shortly. LGTM, thanks @pradipd, @dnephin!

Member

@vdemeester vdemeester left a comment

LGTM 🐯

@thaJeztah
Member

ping @johnstep PTAL

@yongtang
Member

ping @johnstep to take a look.

@thaJeztah
Member

@pradipd looks like this needs a rebase 😢

ping @johnstep PTAL

@thaJeztah
Member

May need some squashing as well

@vieux vieux self-assigned this Nov 9, 2017
@thaJeztah thaJeztah assigned johnstep and unassigned vieux Jun 7, 2018
@thaJeztah
Member

ping @pradipd @johnstep were you still working on this? PTAL

@cpuguy83
Member

Going to close this because it seems inactive. Ping me if you'd still like to get this in and I can re-open.

Thanks! 🙇 👼

@cpuguy83 cpuguy83 closed this Nov 27, 2018

10 participants