ci: Better build caching for CI (#2742)
Conversation
For the cache to work properly, we need to derive a cache key from the build context (the files that affect the Dockerfile build) instead of having the cache key change with every commit SHA. We also need to prevent failing tests from blocking the caching of a build, hence the split into separate jobs. This first pass used `upload-artifact` and `download-artifact` to transfer the built image, but that has quite a bit of overhead and prevented multi-platform builds (without complicating the workflow further).
While `download-artifact` + `docker load` is a little faster than rebuilding the image from cached layers, `upload-artifact` takes about 2 minutes to upload the AMD64 (330MB) tar image export (likely due to compression during upload?). The `actions/cache` approach however does not incur that hit and is very quick (<10 secs) to complete its post-upload work. The dependent job still gets a cache hit, and the build job is able to properly support multi-platform builds. Added additional notes about timing and size of including ARM builds.
When the `ARG` changes due to the commit SHA, it invalidates the entire cache because of the `LABEL` layers at the start. Any `RUN` layers are then implicitly invalidated, even when the `ARG` is not used. Introduced a basic multi-stage build, and relocated the container config / metadata to the end of the build. This avoids needlessly invalidating expensive cache layers (in size and build time).
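A minimal sketch of that relocation (stage and package names are illustrative, not the project's actual Dockerfile): the expensive layers live in a first stage, and the SHA-dependent `ARG`/`LABEL` metadata only appears in a final stage, so a new commit SHA never invalidates the earlier cached work:

```dockerfile
# Stage 1: expensive layers, unaffected by per-commit build args
FROM debian:11-slim AS stage-base
RUN apt-get update && apt-get install -y postfix dovecot-core  # cached across commits

# Stage 2: cheap metadata-only stage; a new VCS_REF only rebuilds from here
FROM stage-base AS stage-final
ARG VCS_REF
LABEL org.opencontainers.image.revision=${VCS_REF}
```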
Very nice and welcome changes / additions! Would just adding another job (dependent on whether the AMD64 build was successful, maybe) that builds the ARM64 image in parallel to the BATS test suite cut down on the build time (the 13 minutes)? Is there something I am missing that would prevent this solution?
Details

```
not ok 217 setup_file failed
# (from function `setup_file' in test file test/tests.bats, line 72)
# `docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/existing-added.txt"' failed
# ba44729a0ebac54aa83742d2eb89408862474caaa88b64f4ee0f085f46b3b4ed
# [ INF ] mail.my-domain.com is up and running
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with .
# 250 2.0.0 Ok: queued as 053F2BDDDC
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with .
# 250 2.0.0 Ok: queued as 2F89ABDE21
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with .
# 250 2.0.0 Ok: queued as 56EC7BDE23
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with .
# 250 2.0.0 Ok: queued as 7E1E6BDE26
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with .
# 250 2.0.0 Ok: queued as 9FFBABDE28
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with .
# 250 2.0.0 Ok: queued as CF2F3BDE21
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# mail
# bats warning: Executed 217 instead of expected 318 tests
make: *** [Makefile:43: tests] Error 1
```
You are not wrong, I am just really stretched for time right now. I am enrolled in a course until December with 40-80 hour weeks demanded from students, and need to be prepared before it starts in a few days. You're welcome to adjust the PR or add a follow-up change 👍 I mostly didn't want to copy/paste more of the same step, as we have similar jobs in other workflows. GitHub docs mention a thing called "re-usable workflows" that might be an appropriate way of keeping that DRY. Alternatively, we could also explore a
Oh, and the test failure is probably related to the recent PR, tripping up more timing-sensitive tests :(
I will provide a follow-up PR :) I will have a look at the proposals you gave afterwards too!
I didn't have time yet to look at this content-wise. Just some stylistic questions so far:
Probably just different style preference between me and @georglauterbach , happy to be consistent with whatever preference 👍 I don't think it makes any difference in this case (at least for
Fair 👍 In both cases I chose to prefix the name with the scope/type of value for added clarity, as they're a fair bit apart. I'm not sure how familiar other maintainers are with multi-job workflows or a multi-stage
Happy to remove it if you like,
Ack 👍
Fair. It's just that I've mostly seen
If it's only me, keep it 😄
casperklein
left a comment
Good work! I haven't tested the CI part, but I think you know what you're doing 😉
Yeah, I'm fine with that too and don't mind doing so on my own Dockerfiles 👍 I've only been a bit more verbose since it's harmless and may help maintainers / contributors that aren't as familiar :) I'll merge as-is. If any maintainer wants to remove the
Another question for clarity: the tests have been moved to a new job (`job-run-tests`), where the image is built again (from cache). In
That's a good question! :) The TL;DR version is: not that I know of. This is just a build cache; changes to the Dockerfile or content that
When I wrote this improvement I did encounter one failure case. This was due to the

The image would build successfully, and then upload the cache. Then any future build where the

So it rebuilt from scratch each time, and at the end of the job refused to update the cache for that key, because it was a successful cache hit (this only applies to an exact match for the cache key,
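A sketch of the `actions/cache` behaviour being described, with a hypothetical cache path and key name: the post-job save step is skipped whenever the primary `key` matched exactly, so a build that ignores the restored layers can end up never refreshing the stored cache for that key:

```yaml
- name: Cache Docker layers
  uses: actions/cache@v3
  with:
    path: /tmp/.buildx-cache
    # If this exact key already exists, the post-job upload is skipped.
    key: cache-buildx-${{ hashFiles('Dockerfile', 'VERSION', 'target/**') }}
    # A restore-keys match is only a fallback (a partial hit), and the
    # post-job step WILL upload a fresh cache under the new primary key.
    restore-keys: cache-buildx-
```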
I was careful to handle that here, as I had the same concern. I did notice I messed up with the doc previews workflow: it uploads an artifact to a common name, and I can't recall if it's scoped to the workflow or gets overwritten by other PRs building doc previews in parallel. This provides the docs build result and PR metadata. I don't think I've seen it get mixed up yet, but it might be worth investigating 😅 Thankfully it's just for doc previews and shouldn't cause any risks.
Description
Fixes flaws to make the tests workflow and Dockerfile more cache-friendly.
Change Overview
In Sep 2019, a PR introduced the VCS `ARG` directives with acknowledgement that it invalidates the build cache.

Every time the commit SHA changes, due to this placement all layers afterwards are invalidated and must be rebuilt, even though only `LABEL` directive values were being updated (even without those, `RUN` will implicitly be invalidated due to the `ARG` changing, as it is cached as part of the build ENV).

I've shifted all directives related to image config that aren't necessary until the end into a 2nd stage. The additional `FROM` introducing a 2nd stage is not strictly necessary, but provides a basis for later refactoring the Dockerfile to split out the ClamAV and Fail2Ban layers into their own stages to further improve build caching.

The original intent of this PR was to address caching with a more stable cache key (instead of the commit SHA), such that if the build context (the files involved in the image build process) has not changed since the last build, we can re-use the build cache.
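To illustrate the implicit invalidation with a hypothetical minimal case (not the project's Dockerfile): once an `ARG` is in scope, subsequent `RUN` instructions consume it implicitly as an environment variable, so a new value busts their cache even when unused:

```dockerfile
FROM debian:11-slim
ARG VCS_REF
# This RUN never references VCS_REF, but because the ARG is in scope its
# value is implicitly part of the layer's cache key; a new commit SHA
# therefore forces a rebuild of this (and every later) RUN layer.
RUN apt-get update && apt-get install -y clamav
```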
Presently the existing cache key would:

The new cache key:

- `Dockerfile`, `VERSION` and `target/**` has not changed between commits.
- `key` hit (`restore-key` fallback to an older build cache, reducing build time, and upon job success uploading for the new derived cache key).

Complementary changes:
- `type=local` cache mode needs a workaround to avoid accidentally accumulating cache wastefully, which adds to upload/download time and more evictions (330-810MB + 7-20 sec depending on supported platforms, vs the 2-3GB I've seen presently cached, with download times of approx 1 minute).
- `type=gha` cache mode has since been made available (still deemed experimental) earlier this year that uses the `actions/cache` API (reverse engineered IIRC, not officially supported); I've chosen to avoid that for now and have not had time to try and compare it.
- The `Dockerfile` issue with `ARG` usage is an important one, as while the derived cache key can be effective at caching, when the commit SHA changed the cache was completely ignored due to the builder invalidating cache... which caused a problem, as that cache hit would not upload the new build cache (build args passed in aren't part of the derived cache key). Although the built image will still be slightly different due to these `ARG`, we are only sharing the build layer cache between jobs, so there is minimal work to recreate the image from cache and append the `ARG` changes (avoiding wasteful cache upload for `ARG` changes).
- `Dockerfile` and buildx `builder` instance.

Impact of platforms on cache storage + build time:
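The `type=local` accumulation workaround typically looks like the following (paths and action versions are illustrative): write the new cache to a separate directory, then swap it in wholesale so stale layers from earlier builds are dropped instead of piling up:

```yaml
- name: Build image
  uses: docker/build-push-action@v3
  with:
    cache-from: type=local,src=/tmp/.buildx-cache
    # Write to a fresh directory; mode=max also caches intermediate stages
    cache-to: type=local,dest=/tmp/.buildx-cache-new,mode=max

# Replace the old cache entirely, rather than letting buildx append to it,
# so the cached size stays bounded to what the latest build actually used:
- name: Move cache
  run: |
    rm -rf /tmp/.buildx-cache
    mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```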
Additional details
In the first commit, I also experimented with `upload-artifact` and `download-artifact` to transfer across jobs, but this was much worse than just using `actions/cache`.

Future improvements could leverage the buildx `COPY --link` feature that has been available since March 2022. `type=gha` for the image cache might be interesting too if it caches on step success instead of job success, allowing for tests to be in the same job. Similar workflows could also potentially be more DRY with GitHub Actions support for re-usable workflows.

For now though, this PR should help, especially if a follow-up moves out the ARM64 build.
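As an illustrative sketch (paths are hypothetical, not from this PR): `COPY --link` creates the copied layer independently of the layers before it, so changing the base image or earlier `RUN` steps no longer busts the copy's cache; it requires the Dockerfile 1.4+ syntax directive:

```dockerfile
# syntax=docker/dockerfile:1.4
FROM debian:11-slim

# The linked COPY is built as its own snapshot, not rebased onto the
# layers above, so it stays cached even when those layers change.
COPY --link target/ /usr/local/share/mailserver/
```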
Type of change
Checklist: