Skip to content

Fix racey TestHealthKillContainer#42958

Merged
thaJeztah merged 1 commit intomoby:masterfrom
cpuguy83:fix_racey_health_test
Oct 21, 2021
Merged

Fix racey TestHealthKillContainer#42958
thaJeztah merged 1 commit intomoby:masterfrom
cpuguy83:fix_racey_health_test

Conversation

@cpuguy83
Copy link
Copy Markdown
Member

Before this change if you assume that things work the way the test
expects them to (it does not, but lets assume for now) we aren't really
testing anything because we are testing that a container is healthy
before and after we send a signal. This will give false positives even
if there is a bug in the underlying code. Sending a signal can take any
amount of time to cause a container to exit or to trigger healthchecks
to stop or whatever.

Now lets remove the assumption that things are working as expected,
because they are not.
In this case, top (which is what is running in the container) is
actually exiting when it receives USR1.
This totally invalidates the test.

We need more control and knowledge as to what is happening in the
container to properly test this.
This change introduces a custom script which traps USR1 and flips the
health status each time the signal is received.
We then send the signal twice so that we know the change has occurred
and check that the value has flipped so that we know the change has
actually occurred.

Fixes #41930

Before this change if you assume that things work the way the test
expects them to (it does not, but lets assume for now) we aren't really
testing anything because we are testing that a container is healthy
before and after we send a signal. This will give false positives even
if there is a bug in the underlying code. Sending a signal can take any
amount of time to cause a container to exit or to trigger healthchecks
to stop or whatever.

Now lets remove the assumption that things are working as expected,
because they are not.
In this case, `top` (which is what is running in the container) is
actually exiting when it receives `USR1`.
This totally invalidates the test.

We need more control and knowledge as to what is happening in the
container to properly test this.
This change introduces a custom script which traps `USR1` and flips the
health status each time the signal is received.
We then send the signal twice so that we know the change has occurred
and check that the value has flipped so that we know the change has
actually occurred.

Signed-off-by: Brian Goff <[email protected]>
Copy link
Copy Markdown
Member

@tianon tianon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adorable 😂

LGTM

Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ❤️


while true; do sleep 1; done
`
c.Config.Cmd = []string{"/bin/sh", "-c", cmd}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually surprised the multiline cmd works. Makes sense that it does, just that I didn't directly thing of doing it this way (without writing it to a file)

Nice! 👍👍

Copy link
Copy Markdown
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but maybe slightly better would be to use different signals (usr1, usr2) to set healthy vs unhealthy instead of flipping between them. This would make it more certain that we are reacting to the correct kill call. Just checking [ ! -f /unhealthy ] could be simpler as well then.

Copy link
Copy Markdown
Member

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, glad this is an actual test now.

@thaJeztah
Copy link
Copy Markdown
Member

Failure is unrelated, and a known flaky test

=== RUN   TestCreateServiceConfigFileMode
    create_test.go:347: timeout hit after 10s: running task count at 0 waiting for 1 (total tasks: 3)
--- FAIL: TestCreateServiceConfigFileMode (11.22s)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestHealthKillContainer fails intermittently

5 participants