
ci: misc test enhancements#2815

Merged
georglauterbach merged 10 commits into master from ci/fix-tests-bats-and-more on Oct 14, 2022

Conversation

@georglauterbach
Member

@georglauterbach georglauterbach commented Oct 5, 2022

Description

Fresh start of #2811

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (README.md or the documentation under docs/)
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Comment thread .gitignore Outdated
Lacking justification for being in PR.
@georglauterbach
Member Author

@polarathene do you have any idea, btw, why the test job does not use the cache from the build job?

@georglauterbach
Member Author

I know this isn't pretty, but I spent too much time yesterday without really getting to the bottom of it. Adding the sleeps is not a permanent solution, but for the time being it is better than an always-failing CI.
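For context, the stopgap amounts to something like the following in the test setup (placement, container name and duration here are illustrative only, not the exact diff):

# Illustrative only: extra padding after the existing wait helper in setup_file.
wait_for_smtp_port_in_container mail
sleep 15   # give Postfix/Dovecot time to settle after the change-detection restart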

@casperklein
Member

adding the sleeps is not a permanent solution, but for the time being it is better than an always-failing CI.

I haven't looked at this current issue in detail. But another pragmatic approach could be: find out when the failing tests started and then revert the last changes made to the tests. That should fix it, or am I wrong?

Something was introduced that caused the issue. So instead of fixing the symptoms, I would roll back the change that introduced the issue in the first place.

@georglauterbach
Member Author

georglauterbach commented Oct 6, 2022

adding the sleeps is not a permanent solution, but for the time being it is better than an always-failing CI.

I haven't looked at this current issue in detail. But another pragmatic approach could be: find out when the failing tests started and then revert the last changes made to the tests. That should fix it, or am I wrong?

IIRC #2729 showed that CI is fragile. While I do not think #2729 introduced a bug, it unveiled the fragility. That's why it is hard to pinpoint: there is no obvious change that serves as the cause.

Something was introduced that caused the issue. So instead of fixing the symptoms, I would roll back the change that introduced the issue in the first place.

This works in theory, but in this case it is difficult for the reasons mentioned above.


I'm not in favor of introducing sleep either, but unless someone else finds the cause (which may take time), I prefer it over restarting CI 3 or 4 times.

@polarathene
Member

@polarathene do you have any idea, btw, why the test job does not use the cache from the build job?

The build job did not create a new cache. It had a primary key cache hit: the file contents that derive the cache key have been seen by the CI before.

When this happens, even if something has changed that invalidates that cache, the cache action unfortunately does not upload a new one. An upload only happens when the expected cache key for the contents (the primary key we used) didn't have a match and a fallback cache key was used.

So we restored a previous cached build that, for some reason, still got invalidated despite using the same key. I only remember this happening when I worked on the feature and an ARG was invalidating the build, or when something was causing incorrect cache key generation.
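To illustrate the behaviour described above, here is a simplified bash sketch of the save/restore flow (restore_cache, restore_cache_by_prefix, save_cache, run_docker_build and the key derivation are hypothetical stand-ins, not the actual actions/cache implementation):

# Hypothetical sketch of the cache action's behaviour as described above.
PRIMARY_KEY="cache-buildx-$(sha256sum Dockerfile | cut -d' ' -f1)"  # key derivation is an assumption

if restore_cache "${PRIMARY_KEY}"           # exact (primary key) hit
then
  EXACT_HIT=true
else
  EXACT_HIT=false
  restore_cache_by_prefix "cache-buildx-"   # fall back to the most recent partial match
fi

run_docker_build

# Post-job save step: skipped on an exact primary-key hit, even if the restored
# layers turned out to be invalidated during the build - hence no new upload.
[[ ${EXACT_HIT} == false ]] && save_cache "${PRIMARY_KEY}"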


I am wondering if we implemented the re-usable workflows correctly, or if they're incompatible here. I'd need to check whether the cache key generated is for this PR's contents or whether it was generated from master.

Cache key CI used: cache-buildx-a0ceadd91401151c4991b14c3f1e2d38e77032937df0f7d9dcf9a12582b50b89

I just noticed this is quite a problem with commits on master often failing to pass/complete the Build, Test & Deploy workflows.

@polarathene
Member

polarathene commented Oct 6, 2022

As suspected... (EDIT: Tracked down where the key was from and why it hasn't changed, totally fine)

Cache restored from key: cache-buildx-a0ceadd91401151c4991b14c3f1e2d38e77032937df0f7d9dcf9a12582b50b89

That is the same cache key used by this PR's CI build/test as was used on master for merging that recent PR. It was also the cache key used in that PR's last commits (merges from master; that was the specific job that generated the cache key quoted above), which isn't too surprising. We can also see that non-merge commits in that PR were using different cache keys, as they should (that one's build step restored a fallback key, then generated a new one, which was used to restore the cache in later commits on the PR and to generate the same cache key for the image build being used here).

Oh ok, just noticed that this PR doesn't contribute anything that should cause a new image cache key to be created, so it appears to be behaving correctly.


Looking back at the related PR merge commit on master and the CI build job there: it does not have access to the known build cache key from the PR due to the cache security scoping, so it needs to do a fresh build and cache upload. We can see that it selects and restores the cache-buildx cache key, which isn't valid but was created by a bug AFAIK; if that hasn't been evicted after 7 days of whenever it was created, GitHub might renew the expiry when it's accessed within the past 7 days.

Anyway, that cache from whenever that was is invalidated and we get a fresh build. At the end of the build job, since we did not restore a primary cache key for the given contents, it uploads to the master branch cache under the same key this PR is using. Now any branch/PR that derives this cache key should have access to it, and it should presently be the most recent fallback cache key as well.

I am not sure what exactly is causing this invalidation, but I can see that your previous version of this PR was able to build using the cache for this same cache key:

#9 [internal] load metadata for docker.io/clamav/clamav:latest
#9 DONE 0.2s

#8 [internal] load metadata for docker.io/library/debian:11-slim
#8 DONE 0.4s

#10 importing cache manifest from local:3943879211245529940
#10 DONE 0.0s

#11 FROM docker.io/clamav/clamav:latest@sha256:0353dfffc8e8d6e035208e380cdc0f7f64299b849acef4b20ca8d87d874ba409
#11 resolve docker.io/clamav/clamav:latest@sha256:0353dfffc8e8d6e035208e380cdc0f7f64299b849acef4b20ca8d87d874ba409 done
#11 DONE 0.0s

#12 [stage-base  1/47] FROM docker.io/library/debian:11-slim@sha256:5cf1d98cd0805951484f33b34c1ab25aac7007bb41c8b9901d97e4be3cf3ab04
#12 resolve docker.io/library/debian:11-slim@sha256:5cf1d98cd0805951484f33b34c1ab25aac7007bb41c8b9901d97e4be3cf3ab04 0.0s done
#12 DONE 0.0s

#13 [internal] load build context
#13 transferring context: 290.82kB 0.0s done
#13 DONE 0.1s

#14 [stage-base 21/47] RUN <<EOF (chmod 755 /etc/init.d/postgrey...)
#14 CACHED

Your last commit on that PR that built with cache was "properly provide ENV"; afterwards it started invalidating the cache. Nothing stands out as the cause though.

@polarathene
Member

I'm not in favor of introducing sleep either, but unless someone else finds the cause (which may take time), I prefer it over restarting CI 3 or 4 times.

It's fine for now, and I doubt we're going to benefit much time-wise by reverting. It's a minor hit considering how long the tests normally take, and, the sleep part aside, we're not implicitly hiding the timing problems that were surfaced by recent additions to tests.bats.

However, we can do better:

function wait_for_finished_setup_in_container() {
  local STATUS=0
  repeat_until_success_or_timeout --fatal-test "container_is_running ${1}" "${TEST_TIMEOUT_IN_SECONDS}" sh -c "docker logs ${1} | grep 'is up and running'" || STATUS=1

  if [[ ! $(__is_changedetector_processing) ]]
  then
    # A new change event is expected, wait for it:
    repeat_until_success_or_timeout 60 __is_changedetector_processing
  fi
  # Change event is in progress, wait until it finishes:
  repeat_until_success_or_timeout 60 __is_changedetector_finished
  # NOTE: Although the change event has completed, services like Postfix and Dovecot
  # may still be in the process of restarting.
  # You may still want to wait longer if depending on those to be ready.
}

function wait_for_empty_mail_queue_in_container() {
  local CONTAINER_NAME="${1}"
  local TIMEOUT=${TEST_TIMEOUT_IN_SECONDS}
  # shellcheck disable=SC2016
  repeat_in_container_until_success_or_timeout "${TIMEOUT}" "${CONTAINER_NAME}" bash -c '[[ $(mailq) == *"Mail queue is empty"* ]]'
}

function wait_for_tcp_port_in_container() {
  repeat_until_success_or_timeout --fatal-test "container_is_running ${2}" "${TEST_TIMEOUT_IN_SECONDS}" docker exec "${2}" /bin/sh -c "nc -z 0.0.0.0 ${1}"
}

# @param ${1} name of the postfix container
function wait_for_smtp_port_in_container() {
  wait_for_tcp_port_in_container 25 "${1}"
}

These are the methods you're adding sleep around. Apart from the helper I wrote with a fixed 60 seconds (and the advice to separately wait on Postfix/Dovecot afterwards if necessary), we can see a common way to avoid the sleeps so that they're only needed optionally:

# default timeout is 120 seconds
TEST_TIMEOUT_IN_SECONDS=${TEST_TIMEOUT_IN_SECONDS-120}

If 15 seconds of sleep is really all that's required, surely we could just bump up the default a bit more. Otherwise I'd expect failures to still occur, and increasing the sleep each time isn't going to be any different. You could bump that up from 2 minutes to 3, which is more than enough padding.
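If we went that route, it would be a one-line change to the default shown above (the value here is illustrative):

# default timeout raised from 120 to e.g. 180 seconds
TEST_TIMEOUT_IN_SECONDS=${TEST_TIMEOUT_IN_SECONDS-180}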

If we still get failures, it's not something extra time is likely to resolve, and more an indication that something else is going wrong (as was suggested in the earlier version of this PR regarding what we could try).

Additionally, as per my comments here, we can switch check-for-changes.sh to have Postfix and Dovecot use reload instead of restart, as that should minimize any race conditions from the downtime of a restart. That is quite plausibly the issue, since we are handling a change detection event in the middle of the setup.
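Roughly, the idea is for the change-detection handler to do something like this (sketch only; the actual commands and service management in check-for-changes.sh may differ):

# Sketch: reload keeps the listeners up, so tests polling port 25/IMAP don't race a restart.
postfix reload   # re-read configuration without dropping the smtpd listener
dovecot reload   # re-read configuration without dropping IMAP/LMTP listeners
# instead of something like:
# supervisorctl restart postfix dovecot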

I also suggested considering wait_for_service "${TEST_NAME}" postfix directly after wait_until_change_detection_event_completes is called if we're to wait on ensuring Postfix is ready.
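In other words, the suggested ordering in the test setup would look roughly like this (helper names as referenced above; exact signatures assumed):

wait_until_change_detection_event_completes "${TEST_NAME}"
wait_for_service "${TEST_NAME}" postfix   # only proceed once Postfix is confirmed up again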

casperklein previously approved these changes Oct 7, 2022
@polarathene

This comment was marked as outdated.

@polarathene
Member

polarathene commented Oct 8, 2022

Resolved

test/tests.bats failed early during setup again at a common place:

not ok 223 setup_file failed
# (from function `setup_file' in test file test/tests.bats, line 77)
#   `docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/non-existing-user.txt"' failed
# 5512b8d095f19644a77e931d6acc10041cdf7751b88e13bdc51f7664ccc0438a
# [   INF   ]  mail.my-domain.com is up and running

docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/non-existing-user.txt"

Worth noting: this is a failure because of the return code/exit status of the docker exec command run in the test setup, not due to any test assertions failing. Perhaps we could simply move these into a test case directly after setup?

There is nothing in the output that indicates the cause of the failure, and if previous nc commands were successful, I'm not sure why this one would fail other than perhaps being run without a pause. We also have helpers to run commands like this until they succeed.
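For instance, the failing setup line could be wrapped with the existing retry helper instead of being run as a single attempt (sketch using the same command from the failure above):

repeat_until_success_or_timeout "${TEST_TIMEOUT_IN_SECONDS}" \
  docker exec mail /bin/sh -c \
  "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/non-existing-user.txt"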


Casper looked into a similar nc failure on a PR of his, and said that Postfix was not reachable at the time (even though we don't start these commands until port 25 is reachable). Since this is a common failure during the setup at different points, presumably Postfix is still being restarted after we waited on it the first time.

Comment thread test/tests.bats
@georglauterbach
Member Author

georglauterbach commented Oct 9, 2022

Very nice investigative work @polarathene!

I think this PR has some nice additions too that should make it into v12 later :) Feel free to review :D

PS: Sorry for the force-push - forgot to sign the commit.


@polarathene polarathene left a comment


A few questions and a request to remove the TEST_TIMEOUT_IN_SECONDS addition at the start of tests.bats.

The rest of the feedback is just optional suggestions for clamav.bats, as the "checking" bit seemed redundant and I'd rather the space suffix be moved out of the ENV - although the suggestions reduce the usefulness of the ENV a bit 😅

Comment thread test/tests.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/tests.bats
Comment thread test/tests.bats Outdated
Comment thread Makefile Outdated
- removed unnecessary configuration
- added test for virus detection to `clamav.bats`
@georglauterbach georglauterbach changed the title from "ci: fix transient test failures" to "ci: misc test enhancements" Oct 10, 2022
@georglauterbach
Member Author

I have changed the title as @casperklein proposed (it was indeed not fitting anymore).

polarathene previously approved these changes Oct 10, 2022

@polarathene polarathene left a comment


LGTM 👍

Two small questions remain :)

Comment thread test/clamav.bats
Comment thread test/clamav.bats

@polarathene polarathene left a comment


LGTM

Good work! 😎 🎉

@georglauterbach georglauterbach modified the milestones: v12.0.0, v11.3.0 Oct 11, 2022
@georglauterbach georglauterbach removed the meta/feature freeze On hold due to upcoming release process label Oct 13, 2022
@georglauterbach georglauterbach merged commit 488b256 into master Oct 14, 2022
@georglauterbach georglauterbach deleted the ci/fix-tests-bats-and-more branch October 14, 2022 07:48