Addressing some PR feedback on tests #34898
Signed-off-by: Pradip Dhara <[email protected]>
Review thread on this diff (the serviceContainerCount signature in integration/service/inspect_test.go):

```diff
-func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
+func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64, checkState bool, desiredState swarm.TaskState) func(log poll.LogT) poll.Result {
```
I think we can remove these new arguments. Checking for running state should be fine in all cases.
That's what I thought too. It turns out the service created in TestInspect does not go into the Running state; if I remember correctly, it ends up Rejected or Failed.
Do you want me to make that change and add a log message to see what the final state is?
I'll try it out and see why it doesn't get to the running state.
I tried this branch locally with this diff:
```diff
diff --git a/integration/service/create_test.go b/integration/service/create_test.go
index 423b929d48..2f80c340b9 100644
--- a/integration/service/create_test.go
+++ b/integration/service/create_test.go
@@ -43,7 +43,7 @@ func TestCreateWithLBSandbox(t *testing.T) {
 	require.NoError(t, err)
 	serviceID := serviceResp.ID
 
-	poll.WaitOn(t, serviceContainerCount(client, serviceID, instances, true, swarm.TaskStateRunning))
+	poll.WaitOn(t, serviceContainerCount(client, serviceID, instances))
 
 	network, err := client.NetworkInspect(context.Background(), overlayID, types.NetworkInspectOptions{})
 	require.NoError(t, err)
diff --git a/integration/service/inspect_test.go b/integration/service/inspect_test.go
index 3d19d78275..c5adfa4bf0 100644
--- a/integration/service/inspect_test.go
+++ b/integration/service/inspect_test.go
@@ -37,7 +37,7 @@ func TestInspect(t *testing.T) {
 	require.NoError(t, err)
 	id := resp.ID
 
-	poll.WaitOn(t, serviceContainerCount(client, id, instances, false, swarm.TaskStateNew))
+	poll.WaitOn(t, serviceContainerCount(client, id, instances))
 
 	service, _, err := client.ServiceInspectWithRaw(ctx, id, types.ServiceInspectOptions{})
 	require.NoError(t, err)
@@ -129,7 +129,7 @@ func newSwarm(t *testing.T) *daemon.Swarm {
 	return d
 }
 
-func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64, checkState bool, desiredState swarm.TaskState) func(log poll.LogT) poll.Result {
+func serviceContainerCount(client client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
 	return func(log poll.LogT) poll.Result {
 		filter := filters.NewArgs()
 		filter.Add("service", id)
@@ -140,11 +140,9 @@ func serviceContainerCount(client client.ServiceAPIClient, id string, count uint
 		case err != nil:
 			return poll.Error(err)
 		case len(tasks) == int(count):
-			if checkState {
-				for _, task := range tasks {
-					if task.Status.State != desiredState {
-						return poll.Continue("waiting for tasks to enter %v", desiredState)
-					}
+			for _, task := range tasks {
+				if task.Status.State != swarm.TaskStateRunning {
+					return poll.Continue("waiting for tasks to enter state %v", swarm.TaskStateRunning)
 				}
 			}
 			return poll.Success()
```

and it worked for me.
dnephin left a comment:
I'm also seeing TestCreateWithLBSandbox run 3x slower than TestInspect, which I wouldn't expect. Any idea why that might be?
Signed-off-by: Pradip Dhara <[email protected]>
Whew! For a second I thought I was going crazy. From the logs, the following two messages are repeated for every task that is created:

```
time="2017-09-25T20:55:26.502388186Z" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=8ozhptw4rxg5zqlvecf32nzrs service.id=m2cgf1szlx1cqlsr4y3likpxj task.id=4lncndlo2jxknd4wqaqksawov
time="2017-09-25T20:55:26.502535290Z" level=debug msg="state changed" module=node/agent/taskmanager node.id=8ozhptw4rxg5zqlvecf32nzrs service.id=m2cgf1szlx1cqlsr4y3likpxj state.desired=RUNNING state.transition="RUNNING->FAILED" task.id=4lncndlo2jxknd4wqaqksawov
```
Thanks for reproducing the error. I guess there is some setting in the service config that fails on powerpc/z, but it's not obvious from the logs. @tophj-ibm, any idea what that might be? Are there things that aren't supported? The config is here: https://github.com/moby/moby/blob/master/integration/service/inspect_test.go#L55-L109
@dnephin I'm not aware of anything not supported, and that config looks like it should work to me. I'll run this locally and see what I can find.
I can't get this to fail on either my machine or the CI machine this was run on. Looking through the logs of the swarm node, I'm seeing over 2300 calls to GET tasks, so something is definitely going on; I'll take a more thorough look at the logs tomorrow.
It might need to sleep longer between polls. We could try adding
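For context, the poll helper these tests use accepts settings for exactly this kind of tuning. A minimal sketch, assuming the gotestyourself `poll` package these tests already import, with `client`, `id`, and `instances` as the variables already in scope in TestInspect; the concrete durations are illustrative assumptions, not values from this PR:

```go
// Space out the polling loop so each iteration does not immediately
// re-issue GET /tasks. poll.WithDelay and poll.WithTimeout are settings
// exported by the poll package; requires importing "time".
poll.WaitOn(t, serviceContainerCount(client, id, instances),
	poll.WithDelay(100*time.Millisecond),
	poll.WithTimeout(30*time.Second))
```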
NOTE: Do NOT merge this. I modified .integration-test-helpers to see if I can get a repro on why powerpc and z fail in TestInspect.
Sorry. All of that provided nothing more than we already knew.
I think you need to change https://github.com/moby/moby/pull/34898/files#diff-19c15b89215cbd1096e51081369e2b50R142 to

I'm guessing that the tasks and count are equal in that case, but the swarm task isn't running yet, so it triggers a continue, and then never gets back to that loop because tasks > count. The test does a GET tasks every iteration, so I'm also guessing that's why I'm seeing all these calls to GET tasks.
Hmm, why do you think len(tasks) would be greater than the expected count? The test uses the same variable for both.
@dnephin oh, you're right; I thought it was being incremented in the test. The task count is definitely wrong, though; this is the error msg from Jenkins:
Ah, maybe the tasks need to be filtered to only the active tasks. But even then, the only reason it would ever go above 2 would be because the earlier tasks failed.
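A sketch of what that filtering could look like, assuming the `desired-state` task filter that the tasks endpoint supports (the same one `docker service ps --filter desired-state=running` uses). `activeTasks` is a hypothetical helper for illustration, not code from this PR:

```go
package service

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

// activeTasks lists only the tasks whose desired state is still running,
// so tasks left over from failed-and-restarted attempts do not inflate
// the count that the poll check compares against.
func activeTasks(c client.ServiceAPIClient, serviceID string) ([]swarm.Task, error) {
	filter := filters.NewArgs()
	filter.Add("service", serviceID)
	filter.Add("desired-state", string(swarm.TaskStateRunning))
	return c.TaskList(context.Background(), types.TaskListOptions{Filters: filter})
}
```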
yeah I'm testing it now, it's definitely failing and hitting the max restart policy limit (4)
Okay, here we go: from the service logs, I guess the busybox top version doesn't support that option.
Shouldn't the behavior be the same across all platforms? The same command runs on experimental.
Ya, I agree that it is strange that busybox on one arch has different flags from busybox on another. Maybe this would be fixed by using the new canonical
They should be the same; this also fails on x86_64, and you can test it with

What is happening is that the tests just want both replicas to be running at the same time, and as soon as they are, it passes. Because the top command exists in the container, it actually does change state to running and stays there for a few seconds before exiting and changing state to failed. However, because of timing, it's possible that both replicas go from start -> running -> failed -> restart and neither of them lands on running at the same time, which is what is happening with the p/z/janky nodes.

IMO we should:

1. Definitely change the config and remove the
2. If one of the nodes does legitimately trigger a restart, tasks does get incremented, so this line will never run.
3. To get around future flakiness, one option would be to increase the number of replicas, so that it's a lot less likely this will falsely pass with a failing node; or you could check that the service logs' stderr is nil before proceeding, and error out if it isn't (see the sketch below).

I have a test branch here that monitors the states of the nodes over 30 seconds, so you can see what is going on: https://github.com/tophj-ibm/moby/tree/test-pr-34898
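One hypothetical way to act on point 3, sketched here as a fail-fast variant of the poll check rather than the stderr check described above. This is illustration only, not code from this PR, and the import paths are assumed to match what these tests vendor:

```go
package service

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
	"github.com/gotestyourself/gotestyourself/poll"
)

// serviceRunningCount is a hypothetical variant of serviceContainerCount:
// instead of letting a crashing task restart until the poll times out, it
// turns the first observed failure into a hard error so the flake surfaces
// immediately instead of passing or timing out by luck.
func serviceRunningCount(c client.ServiceAPIClient, id string, count uint64) func(log poll.LogT) poll.Result {
	return func(log poll.LogT) poll.Result {
		filter := filters.NewArgs()
		filter.Add("service", id)
		tasks, err := c.TaskList(context.Background(), types.TaskListOptions{Filters: filter})
		switch {
		case err != nil:
			return poll.Error(err)
		case len(tasks) > int(count):
			// More tasks than replicas means at least one task already
			// failed and was replaced by the restart policy.
			return poll.Error(fmt.Errorf("expected %d tasks, found %d: a task failed and was restarted", count, len(tasks)))
		case len(tasks) == int(count):
			for _, task := range tasks {
				if task.Status.State == swarm.TaskStateFailed {
					return poll.Error(fmt.Errorf("task %s failed: %s", task.ID, task.Status.Err))
				}
				if task.Status.State != swarm.TaskStateRunning {
					return poll.Continue("waiting for tasks to enter state %v", swarm.TaskStateRunning)
				}
			}
			return poll.Success()
		}
		return poll.Continue("waiting for %d tasks, have %d", count, len(tasks))
	}
}
```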
@tophj-ibm thanks for looking into this; that all makes sense. That is my mistake: I must have looked at top somewhere else and not on busybox. I think we can fix this by:

@pradipd if you'd like me to make these changes, I will try to push a commit to this PR in the next day or so.
Those fixes seem good to me.
@dnephin I would appreciate it if you could try them out. Thanks!
fix args passed to top

set default polling config

Signed-off-by: Daniel Nephin <[email protected]>
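The diff for the first of those commits isn't reproduced above. As a rough illustration only (the image tag, field shapes, and bare-top command below are assumptions, not the PR's actual change), busybox's top takes essentially no options, so a portable spec would invoke it bare:

```go
package service

import "github.com/docker/docker/api/types/swarm"

// portableTopSpec is a hypothetical illustration of "fix args passed to
// top": busybox's top accepts almost no flags, so invoking it with no
// extra args sidesteps the per-architecture differences discussed above.
func portableTopSpec() swarm.ServiceSpec {
	return swarm.ServiceSpec{
		TaskTemplate: swarm.TaskSpec{
			ContainerSpec: swarm.ContainerSpec{
				Image:   "busybox:latest",
				Command: []string{"/bin/top"},
			},
		},
	}
}
```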
It seems to be working now with the changes I pushed.
ping @johnstep PTAL
ping @johnstep to take a look.
May need some squashing as well.
Going to close this because it seems inactive. Ping me if you'd still like to get this in and I can re-open. Thanks! 🙇 👼
Signed-off-by: Pradip Dhara <[email protected]>
- What I did
Address additional feedback from @dnephin on the following PR:
#34674 (review)
- How I did it
- How to verify it
make test-integration
- Description for the changelog
Some test cleanup in integration/service/create_test.go
- A picture of a cute animal (not mandatory but encouraged)
Imagine some kittens sleeping