
ci: misc test enhancements#2815

Merged
georglauterbach merged 10 commits into master from ci/fix-tests-bats-and-more on Oct 14, 2022

Conversation

@georglauterbach
Member

@georglauterbach georglauterbach commented Oct 5, 2022

Description

Fresh start of #2811

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (README.md or the documentation under docs/)
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Comment thread .gitignore Outdated
Lacking justification for being in PR.
@georglauterbach
Member Author

@polarathene do you have any idea, btw, why the test job does not use the cache from the build job?

@georglauterbach
Member Author

I know this isn't pretty, but I spent too much time yesterday without really getting to the bottom of it. Adding the sleeps is not a permanent solution, but for the time being it is better than an always-failing CI.
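For context, the stopgap amounts to something like the following in the test setup (placement, container name and duration here are illustrative only, not the exact diff):

# Illustrative only: extra padding after the existing wait helper in setup_file.
wait_for_smtp_port_in_container mail
sleep 15   # give Postfix/Dovecot time to settle after the change-detection restart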

@casperklein
Member

adding the sleeps is not a permanent solution, but for the time being it is better than an always-failing CI.

I haven't looked at this current issue in detail. But another pragmatic approach could be: find out when the failing tests started and then revert the last changes made to the tests. That should fix it, or am I wrong?

Something was introduced that caused the issue. So instead of fixing the symptoms, I would roll back the change that introduced the issue in the first place.

@georglauterbach
Member Author

georglauterbach commented Oct 6, 2022

adding the sleeps is not a permanent solution, but for the time being it is better than an always-failing CI.

I haven't looked at this current issue in detail. But another pragmatic approach could be: find out when the failing tests started and then revert the last changes made to the tests. That should fix it, or am I wrong?

IIRC #2729 showed that CI is fragile. While I do not think #2729 introduced a bug, it unveiled the fragility. That's why it is hard to pinpoint: there is no obvious change that serves as the cause.

Something was introduced that caused the issue. So instead of fixing the symptoms, I would roll back the change that introduced the issue in the first place.

This works in theory, but in this case it is difficult for the reasons mentioned above.


I'm not in favor of introducing sleep either, but unless someone else finds the cause (which may take time), I prefer it over restarting CI 3 or 4 times.

@polarathene
Member

@polarathene do you have any idea, btw, why the test job does not use the cache from the build job?

The build job did not create a new cache. It had a primary key cache hit: the file contents that derive the cache key have been seen by the CI before.

When this happens, even if something has changed that invalidates that cache, the cache action unfortunately does not upload a new one. An upload only happens when the expected cache key for the contents (the primary key we used) didn't have a match and a fallback cache key was used.

So we restored a previous cached build that, for some reason, still got invalidated despite using the same key. I only remember this happening when I worked on the feature and an ARG was invalidating the build, or when something was causing incorrect cache key generation.
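To illustrate the behaviour described above, here is a simplified bash sketch of the save/restore flow (restore_cache, restore_cache_by_prefix, save_cache, run_docker_build and the key derivation are hypothetical stand-ins, not the actual actions/cache implementation):

# Hypothetical sketch of the cache action's behaviour as described above.
PRIMARY_KEY="cache-buildx-$(sha256sum Dockerfile | cut -d' ' -f1)"  # key derivation is an assumption

if restore_cache "${PRIMARY_KEY}"           # exact (primary key) hit
then
  EXACT_HIT=true
else
  EXACT_HIT=false
  restore_cache_by_prefix "cache-buildx-"   # fall back to the most recent partial match
fi

run_docker_build

# Post-job save step: skipped on an exact primary-key hit, even if the restored
# layers turned out to be invalidated during the build - hence no new upload.
[[ ${EXACT_HIT} == false ]] && save_cache "${PRIMARY_KEY}"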


I am wondering if we implemented the re-usable workflows correctly, or if they're incompatible here. I'd need to check whether the cache key generated is for this PR's contents or whether it was generated from master.

Cache key CI used: cache-buildx-a0ceadd91401151c4991b14c3f1e2d38e77032937df0f7d9dcf9a12582b50b89

I just noticed this is quite a problem with commits on master often failing to pass/complete the Build, Test & Deploy workflows.

@polarathene
Member

polarathene commented Oct 6, 2022

As suspected... (EDIT: Tracked down where the key was from and why it hasn't changed, totally fine)

Cache restored from key: cache-buildx-a0ceadd91401151c4991b14c3f1e2d38e77032937df0f7d9dcf9a12582b50b89

That is the same cache key used by this PR's CI build/test as was used on master for merging that recent PR. It was also the cache key used in that PR's last commits (merges from master; that was the specific job that generated the cache key quoted above), which isn't too surprising. We can also see that non-merge commits in that PR were using different cache keys, as they should (that one's build step restored a fallback key, then generated a new one, which was used to restore the cache in later commits on the PR and to generate the same cache key for the image build being used here).

Oh ok, just noticed that this PR doesn't contribute anything that should cause a new image cache key to be created, so it appears to be behaving correctly.


Looking back at the related PR merge commit on master and the CI build job there: it does not have access to the known build cache key from the PR due to the cache security scoping, so it needs to do a fresh build and cache upload. We can see that it selects and restores the cache-buildx cache key, which isn't valid but was created by a bug AFAIK; if that hasn't been evicted after 7 days of whenever it was created, GitHub might renew the expiry when it's accessed within the past 7 days.

Anyway, that cache from whenever that was is invalidated and we get a fresh build. At the end of the build job, since we did not restore a primary cache key for the given contents, it uploads to the master branch cache under the same key this PR is using. Now any branch/PR that derives this cache key should have access to it, and it should presently be the most recent fallback cache key as well.

I am not sure what exactly is causing this invalidation, but I can see that your previous version of this PR was able to build using the cache for this same cache key:

#9 [internal] load metadata for docker.io/clamav/clamav:latest
#9 DONE 0.2s

#8 [internal] load metadata for docker.io/library/debian:11-slim
#8 DONE 0.4s

#10 importing cache manifest from local:3943879211245529940
#10 DONE 0.0s

#11 FROM docker.io/clamav/clamav:latest@sha256:0353dfffc8e8d6e035208e380cdc0f7f64299b849acef4b20ca8d87d874ba409
#11 resolve docker.io/clamav/clamav:latest@sha256:0353dfffc8e8d6e035208e380cdc0f7f64299b849acef4b20ca8d87d874ba409 done
#11 DONE 0.0s

#12 [stage-base  1/47] FROM docker.io/library/debian:11-slim@sha256:5cf1d98cd0805951484f33b34c1ab25aac7007bb41c8b9901d97e4be3cf3ab04
#12 resolve docker.io/library/debian:11-slim@sha256:5cf1d98cd0805951484f33b34c1ab25aac7007bb41c8b9901d97e4be3cf3ab04 0.0s done
#12 DONE 0.0s

#13 [internal] load build context
#13 transferring context: 290.82kB 0.0s done
#13 DONE 0.1s

#14 [stage-base 21/47] RUN <<EOF (chmod 755 /etc/init.d/postgrey...)
#14 CACHED

Your last commit on that PR that built with cache was "properly provide ENV"; afterwards it started invalidating the cache. Nothing stands out as the cause though.

@polarathene
Member

I'm not in favor of introducing sleep either, but unless someone else finds the cause (which may take time), I prefer it over restarting CI 3 or 4 times.

It's fine for now, and I doubt we're going to benefit much time-wise by reverting. It's a minor hit considering how long the tests normally take, and, the sleep part aside, we're not implicitly hiding the timing problems that were surfaced by recent additions to tests.bats.

However, we can do better:

function wait_for_finished_setup_in_container() {
  local STATUS=0
  repeat_until_success_or_timeout --fatal-test "container_is_running ${1}" "${TEST_TIMEOUT_IN_SECONDS}" sh -c "docker logs ${1} | grep 'is up and running'" || STATUS=1

  if [[ ! $(__is_changedetector_processing) ]]
  then
    # A new change event is expected, wait for it:
    repeat_until_success_or_timeout 60 __is_changedetector_processing
  fi
  # Change event is in progress, wait until it finishes:
  repeat_until_success_or_timeout 60 __is_changedetector_finished
  # NOTE: Although the change event has completed, services like Postfix and Dovecot
  # may still be in the process of restarting.
  # You may still want to wait longer if depending on those to be ready.
}

function wait_for_empty_mail_queue_in_container() {
  local CONTAINER_NAME="${1}"
  local TIMEOUT=${TEST_TIMEOUT_IN_SECONDS}
  # shellcheck disable=SC2016
  repeat_in_container_until_success_or_timeout "${TIMEOUT}" "${CONTAINER_NAME}" bash -c '[[ $(mailq) == *"Mail queue is empty"* ]]'
}

function wait_for_tcp_port_in_container() {
  repeat_until_success_or_timeout --fatal-test "container_is_running ${2}" "${TEST_TIMEOUT_IN_SECONDS}" docker exec "${2}" /bin/sh -c "nc -z 0.0.0.0 ${1}"
}

# @param ${1} name of the postfix container
function wait_for_smtp_port_in_container() {
  wait_for_tcp_port_in_container 25 "${1}"
}

These are the methods you're adding sleep around. Apart from the helper I wrote with a fixed 60 seconds (and the advice to separately wait on Postfix/Dovecot afterwards if necessary), we can see a common way to avoid the sleeps so that they're only needed optionally:

# default timeout is 120 seconds
TEST_TIMEOUT_IN_SECONDS=${TEST_TIMEOUT_IN_SECONDS-120}

If 15 seconds of sleep is really all that's required, surely we could just bump up the default a bit more. Otherwise I'd expect failures to still occur, and increasing the sleep each time isn't going to be any different. You could bump that up from 2 minutes to 3, which is more than enough padding.
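If we went that route, it would be a one-line change to the default shown above (the value here is illustrative):

# default timeout raised from 120 to e.g. 180 seconds
TEST_TIMEOUT_IN_SECONDS=${TEST_TIMEOUT_IN_SECONDS-180}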

If we still get failures, it's not something extra time is likely to resolve, and more an indication that something else is going wrong (as was suggested in the earlier version of this PR regarding what we could try).

Additionally, as per my comments here, we can switch check-for-changes.sh to have Postfix and Dovecot use reload instead of restart, as that should minimize any race conditions from the downtime of a restart. That is quite plausibly the issue, since we are handling a change detection event in the middle of the setup.
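Roughly, the idea is for the change-detection handler to do something like this (sketch only; the actual commands and service management in check-for-changes.sh may differ):

# Sketch: reload keeps the listeners up, so tests polling port 25/IMAP don't race a restart.
postfix reload   # re-read configuration without dropping the smtpd listener
dovecot reload   # re-read configuration without dropping IMAP/LMTP listeners
# instead of something like:
# supervisorctl restart postfix dovecot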

I also suggested considering wait_for_service "${TEST_NAME}" postfix directly after wait_until_change_detection_event_completes is called if we're to wait on ensuring Postfix is ready.
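In other words, the suggested ordering in the test setup would look roughly like this (helper names as referenced above; exact signatures assumed):

wait_until_change_detection_event_completes "${TEST_NAME}"
wait_for_service "${TEST_NAME}" postfix   # only proceed once Postfix is confirmed up again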

casperklein previously approved these changes Oct 7, 2022
@polarathene

This comment was marked as outdated.

@polarathene
Member

polarathene commented Oct 8, 2022

Resolved

test/tests.bats failed early during setup again at a common place:

not ok 223 setup_file failed
# (from function `setup_file' in test file test/tests.bats, line 77)
#   `docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/non-existing-user.txt"' failed
# 5512b8d095f19644a77e931d6acc10041cdf7751b88e13bdc51f7664ccc0438a
# [   INF   ]  mail.my-domain.com is up and running

docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/non-existing-user.txt"

Worth noting: this is a failure because of the return code/exit status of the docker exec command run in the test setup, not due to any test assertions failing. Perhaps we could simply move these into a test case directly after setup?

There is nothing in the output that indicates the cause of the failure, and if previous nc commands were successful, I'm not sure why this one would fail other than perhaps being run without a pause. We also have helpers to run commands like this until they succeed.
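For instance, the failing setup line could be wrapped with the existing retry helper instead of being run as a single attempt (sketch using the same command from the failure above):

repeat_until_success_or_timeout "${TEST_TIMEOUT_IN_SECONDS}" \
  docker exec mail /bin/sh -c \
  "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/non-existing-user.txt"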


Casper looked into a similar nc failure on a PR of his, and said that Postfix was not reachable at the time (even though we don't start these commands until port 25 is reachable). Since this is a common failure during the setup at different points, presumably Postfix is still being restarted after we waited on it the first time.

Comment thread test/tests.bats
@georglauterbach
Member Author

georglauterbach commented Oct 9, 2022

Very nice investigative work @polarathene!

I think this PR has some nice additions too that should make it into v12 later :) Feel free to review :D

PS: Sorry for the force-push - forgot to sign the commit.


@polarathene polarathene left a comment


A few questions and a request to remove the TEST_TIMEOUT_IN_SECONDS addition at the start of tests.bats.

The rest of the feedback is just optional suggestions for clamav.bats, as the "checking" bit seemed redundant and I'd rather the space suffix be moved out of the ENV - although the suggestions reduce the usefulness of the ENV a bit 😅

Comment thread test/tests.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/clamav.bats Outdated
Comment thread test/tests.bats
Comment thread test/tests.bats Outdated
Comment thread Makefile Outdated
- removed unnecessary configuration
- added test for virus detection to `clamav.bats`
@georglauterbach georglauterbach changed the title from "ci: fix transient test failures" to "ci: misc test enhancements" Oct 10, 2022
@georglauterbach
Member Author

I have changed the title as @casperklein proposed (it was indeed not fitting anymore).

polarathene previously approved these changes Oct 10, 2022

@polarathene polarathene left a comment


LGTM 👍

Two small questions remain :)

Comment thread test/clamav.bats
Comment thread test/clamav.bats

@polarathene polarathene left a comment


LGTM

Good work! 😎 🎉

@georglauterbach georglauterbach modified the milestones: v12.0.0, v11.3.0 Oct 11, 2022
@georglauterbach georglauterbach removed the meta/feature freeze On hold due to upcoming release process label Oct 13, 2022
@georglauterbach georglauterbach merged commit 488b256 into master Oct 14, 2022
@georglauterbach georglauterbach deleted the ci/fix-tests-bats-and-more branch October 14, 2022 07:48