check-for-changes: performance improvements + wait for settle v2 by NorseGaud · Pull Request #2157 · docker-mailserver/docker-mailserver

NorseGaud · 2021-08-28T23:30:39Z

Original PR that was reverted : #2104

Description

See: #2098

This PR will:

Have check-for-changes.sh wait for changes to "settle" (checksum comparison has to not have changed over two different checks with 5 second gaps between) before modifying and restarting anything
Added backgrounding of processes and using wait throughout instead of sleeps for more efficiency
Removed the while loop from addmailuser due to check-for-changes.sh logic changes
Fixed a bug in _monitored_files_checksums
TODO: Add resource usage tests of default setup and ensure the minimum requirements for the container/host specs stay the same (or as close as possible)

This is, in my opinion, critically needed for an API to be performing changes -- and lots of them all at once! See docker-mailserver/docker-mailserver-admin#3 for more discussion about this.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (non-breaking change that does improve existing functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
[] I have made corresponding changes to the documentation (README.md or the documentation under docs/)
If necessary I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

NorseGaud · 2021-08-28T23:31:32Z

Per #2154 we need to check why so much CPU is used before merging

…formance improvements throughout + removed addmailuser while loops

… various test fixes for new check-for-changes logic

NorseGaud · 2021-08-30T12:46:22Z

On boot of the docker container/starting services:

 12:39:32 up 23 min,  3 users,  load average: 1.04, 0.59, 0.30
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
ec2-user pts/0    inet-64-112-178- 12:17   21:38   1.45s  1.43s top -c
ec2-user pts/1    inet-64-112-178- 12:17    1:32   0.89s  0.65s docker-compose up
ec2-user pts/2    inet-64-112-178- 12:17    0.00s  0.07s  0.00s w

Once booted, it drops down but hovers around 0.50. Any commands I run cause the cpu load average to spike again, even if briefly.

I'm trying to determine what specifically is causing this.

NorseGaud · 2021-08-30T12:48:42Z

Ok, it looks to be fine after I remove

      for FILE in ${CHANGED}
      do
        case "${FILE}" in
          "/etc/letsencrypt/acme.json" )
              for CERTDOMAIN in ${SSL_DOMAIN} ${HOSTNAME} ${DOMAINNAME}
              do
                {
                  _extract_certs_from_acme "${CERTDOMAIN}" && break
                } &
                WAIT_FOR_PIDS+=($!)
              done
            ;;

          * )
            _notify 'warn' 'File not found for certificate in check_for_changes.sh'
            ;;

        esac
      done

Continuing to confirm...

NorseGaud · 2021-08-30T13:24:18Z

Supervisor logs were showing spam:

cmp: EOF on /tmp/docker-mailserver-config-chksum.new after byte 596, line 4

Turns out the chksum.new file was being written too slowly and the cmp command was trying to read it when it was partially written to.

it looks like we've wrapped _monitored_files_checksums's code in ( ) and that's likely causing it to execute and continue when we really want it to wait.

… changes for

NorseGaud · 2021-08-31T02:54:21Z

Odd... check for changes uses a helper that doesn't work...

set -x and a few modifications show:

+ _monitored_files_checksums
+ echo 'HOSTNAME: deliver'
+ cd /tmp/docker-mailserver
+ exec sha512sum -- postfix-accounts.cf postfix-virtual.cf postfix-aliases.cf dovecot-quotas.cf /etc/letsencrypt/acme.json /etc/letsencrypt/live/deliver/key.pem /etc/letsencrypt/live/deliver/privkey.pem /etc/letsencrypt/live/deliver/fullchain.pem

root@deliver:/# ls -alht /etc/letsencrypt/live/deliver/
ls: cannot access '/etc/letsencrypt/live/deliver/': No such file or directory
root@deliver:/# ls -alht /etc/letsencrypt/live/
total 20K
drwxr-xr-x 9 root root 6.0K Aug 31 00:00 ..
drwxr-xr-x 2 root root 6.0K Aug 16 00:01 xkr.email
drwxr-xr-x 2 root root 6.0K Aug 16 00:01 pierce.us
drwx------ 4 root root 6.0K Mar 30 02:26 .
-rw-r--r-- 1 root root  740 Aug 11  2020 README

Why would anyone generate a letsencrypt with just the hostname and not the domain?

…tting of DOMAINNAME in check-for-changes.sh

…mailserver

github-actions · 2021-10-09T17:09:23Z

Documentation preview for this PR is ready! 🎉

Built with commit: 5f48041

polarathene

I've completed my personal notes on this PR, it's prior PRs and other related history in the past regarding the change detection logic. This PR remains open as a reminder for me to tackle, but it could be closed in favor of a new issue instead.

The current state of this PR has fallen a fair bit out of sync with the changes/refactoring for the v10.3.0 release, with more changes to occur again in the near future as check-for-changes.sh is refactored some more. Some of the changes this PR covered have also since become part of the v10.3.0 release, I don't think it's worth reviving this PR.

Personally, I'm not fond of the approach proposed here with wait.

Changes can be detected via FS event notifications, with fallback to polling when that is not supported (eg: remote FS like NFS):

Issue #1958 was about reducing the rate of calculating checksums for the changedetector service, with PR #1966 attempting implementing inotifywait where it was noted something regarding volume mounts (insufficient details provided) were having trouble with acme.json but not postfix-accounts.cf (may not have been a bind mount?) and making a suggestion to swap the cmp command for sha512sum -c --status /path/to/existing/checksum.

That user also had a separate PR attempt with little details that branched PR-1966 to swap inotifywait with fswatch for some reason. Neither PR was merged, nor a comment regarding their rejection.

Additionally when looking into all of this I did a deep dive into file locking and believe the changes made to replace flock introduced regressions.

Switch back to `flock`

Both in regards to:

The intent of locking files in the first place - flock locked the configs being interacted with, while the replacement applied locks to individual scripts instead, thus locks only blocked a specific script from being concurrently called, but not concurrent access to the actual config files being modified which was the original intent with flock being introduced.
And performance/timing - flock had no lockfile to cleanup or timeout on, timeout could be configured via option, as could immediate exit, otherwise it would cause a process to block/wait until a lock was released and could be acquired to continue processing, all with no sleep or similar loop involved..

flock was removed only for the benefit of remote filesystems that it was deemed unavailable (NFS) which was inaccurate and referencing outdated information, the error experienced appears to be for different reasons. I additionally looked into lockf and fcntl as the other common locking alternatives:

lockf is a wrapper around fcntl and doesn't seem to offer much benefit.
fcntl has some pro/cons, but is what NFS will use to emulate flock with and will need some testing to verify correct behaviour with multiple clients involved.
And some utilities like lockfile (from procmail package) that provide a more convenient command and API to use for actual lockfiles if other options like flock aren't viable.

File based locks are generally not ideal however as they're not atomic (checking for the lockfile and creating one if it doesn't exist is not a single operation and thus prone to some race conditions), mkdir is often cited as an atomic variant for that, but using syscalls with flock and fcntl is preferable from what I came across.

Avoiding background processes with wait for parallelization due to concerns

Using locks properly should resolve that addmailuser loop removal just as well AFAIK. The background of processes and wait if I understood correctly was to try parallelize the logic, but this seems unwise as IIRC there is some synchronous flow between some of the logic and you'd be complicating that flow for maintainers for minimal time savings as most of that is related to the service restarts and any sleep delays (which can be better reduced / optimized via FS event notifications when supported, few users will use remote network file systems like NFS where polling should work just as well as current sleep delays).

Change Detection (consider `watchexec`)

Settling on changes is a good improvement for check-for-changes.sh, and I think we can leverage a file watcher daemon like watchexec for this. It supports both FS event notifications and polling modes, and can debounce (wait for no changes until X time has passed before running) although I'm not sure if it can ensure we force processing if debounce stalls processing due to sustained high frequency of changes, which should be unlikely and better served by such users abstracting changes via queue to batch apply rather than directly modifying or calling other utilities (eg: what the approach by the DMS API project is currently taking). At least I think watchexec would cover everything we'd do and more/better and allow us to offload maintenance for that functionality.

TL;DR

This PR (and it's history in prior PRs and issues involved) serves as a good reference but probably not worth resuming or the correct approach to merge.
File locking appears to have regressed and should bring back flock, NFS does actually support it and the failure just needs to be investigated/resolved.
Detecting changes can avoid the outer loop and it's frequency by using something like watchexec with FS event notifications for file changes (or polling mode for NFS), using the debounce feature to get settled changes. I'm unsure currently how beneficial the checksum functionality is to keep (existing in master, not additions from this PR) if adopting watchexec (it can inject ENV with values of affected files, but checksum is probably nicer to work with, we can switch from sha512 to blake3 for optimization).

Additionally (improvements, but can come later in separate PRs):

While I recently de-duplicated the check-for-changes.sh logic shared with start-mailserver.sh startup scripts, we probably can refactor that some more to diff accordingly as issue #2096 notes. (Note: They are an LDAP + NFS user, noting the chown -R perf issue on /var/mail which is likely compounding together, not that check-for-changes.sh presently runs that for LDAP users however..they also cite adding many users via API service and proposing for a queue in the container as well to batch apply)
I also recall a PR some years ago moved away from hash type config DB files with Postfix, these could still use text configs but were possible to update at runtime without restarting Postfix IIRC, I'd need to look up that discussion again for why the change was made and if it makes sense to revert.

Actionable Tasks

If I find time for these, it should be around Dec or Jan (excluding "extras" tasks), or someone else is welcome to attempt these before me. I've built up a rather verbose list of notes for when I tackle this, but I'm just as happy to review/collaborate (_just give me a heads up in advance, if you the reader are going to tackle this 😅 )

Use watchexec (with debounce) instead for change detection, possibly requiring an ENV for NFS/polling.
Look at replacing the existing file locking logic with lockfile (from procmail package) or better flock (will require reproducing @NorseGaud multi-instance of DMS containers with NFS volumes that produce the bad file descriptor error and resolving that). Probably more importantly address that files being locked are the actual relevant ones, not the scripts themselves..
If keeping the checksum logic around (eg: the monitor function), consider using blake3 instead of sha512 for checksums as minor optimization.

Extras:

Investigate viability / complexity of only responding to diff changes and/or changing from text: to hash: based config records for Postfix. Users primary issue may be more related to NFS in combination with chown -R (being investigated and resolved in PR #2256), but they appear to be an LDAP user with usage for spinning up throwaway images for testing, check-for-changes.sh doesn't apply to LDAP at present, it could be startup related but their API to add users at runtime shouldn't be involved with that only check-for-changes.sh 🤷‍♂️
Investigate possibility to support batch change requests, potentially via a queue service, though at present this is likely better deferred to the API project, it may be related to file locking concerns depending on implementation. File based queues may be similarly unreliable to in-memory queue system with different failure risks, but the core project is also meant to be DB averse 🤷‍♂️

georglauterbach · 2023-04-24T13:35:02Z

Same as #2443; see #2443 (comment).

polarathene · 2023-04-25T00:15:55Z

This shouldn't be too important anymore. Many changes were done since to improve how check-for-changes.sh works. It is much faster at processing now, with minimal down time.

However, it may not be sufficient for bulk adding of users. Especially for supporting encrypted mailbox creation.

For this I still think debouncing with throttle is better than the current polling approach. My advice in previous comment is still valid for approaching that with watchexec. I also still recommend switching the file locking approach back to flock.

EDIT: Relevant issue that has all closed PRs being linked didn't reference this one. Including this PR for traceability.

NorseGaud self-assigned this Aug 28, 2021

NorseGaud added kind/improvement Improve an existing feature, configuration file or the documentation meta/wip labels Aug 28, 2021

NorseGaud added 10 commits August 29, 2021 20:13

added some helpful stuff to check for changes

596791d

more tweaks

5515288

test

2582d02

check-for-changes.sh now makes sure that changes settle + various per…

e737017

…formance improvements throughout + removed addmailuser while loops

cleanup

2583b26

delmailuser Mailbox directory not existing shouldn't throw an error +…

cfb8ac7

… various test fixes for new check-for-changes logic

quick test fix

584002a

sleep should only happen when there is a difference

79cfc70

fix for path to use pwd instead fo CDIR

30ee9df

bats bump

92f87d3

georglauterbach removed the meta/wip label Aug 30, 2021

NorseGaud marked this pull request as draft August 30, 2021 12:14

Fix high CPU usage from backgrounding in check-for-changes

253ce4b

NorseGaud added 6 commits August 30, 2021 21:12

prevent EOF errors for cmp by dynamically loading in certs to monitor…

2a82921

… changes for

removed a few backgrounded processes

fba8655

fixed shellcheck

de84ad1

test

cd6eda7

test

4ad1aea

test

642839f

NorseGaud added 3 commits August 30, 2021 23:14

fixed improper usage of HOSTNAME in _monitored_files_checksums and se…

f1d44d9

…tting of DOMAINNAME in check-for-changes.sh

fixed improper usage of HOSTNAME in _monitored_files_checksums and se…

c6641bb

…tting of DOMAINNAME in check-for-changes.sh

merge upstream

a1cc2ec

georglauterbach removed the meta/feature freeze On hold due to upcoming release process label Oct 7, 2021

NorseGaud added 2 commits October 9, 2021 12:57

Merge branch 'master' of https://github.com/docker-mailserver/docker-…

9373b8d

…mailserver

merged in master

5f48041

quick fix to conflict

d8ec45b

NorseGaud mentioned this pull request Oct 9, 2021

[BUG] Let's encrypt certificate renew not recognised. mailserver uses old one and not updated one. #2239

Closed

4 tasks

github-actions Bot added the Stale label Oct 30, 2021

polarathene modified the milestones: v10.3.0, v11.0.0 Nov 2, 2021

github-actions Bot closed this Nov 12, 2021

polarathene removed the Stale label Nov 14, 2021

polarathene reopened this Nov 14, 2021

polarathene reviewed Nov 29, 2021

View reviewed changes

polarathene added the stale-bot/ignore Indicates that this issue / PR shall not be closed by our stale-checking CI label Nov 29, 2021

polarathene self-assigned this Nov 29, 2021

polarathene mentioned this pull request Nov 29, 2021

[FR] Streamline alias code #1721

Open

oblitum mentioned this pull request Feb 7, 2022

relieve some pain: restart check-for-changes.sh regularly #2398

Closed

7 tasks

NorseGaud mentioned this pull request Feb 9, 2022

check-for-changes.sh uses a lot of resources #2348

Closed

georglauterbach mentioned this pull request Feb 14, 2022

Add changedetector functionality for ${SSL_TYPE} == manual #2404

Merged

7 tasks

NorseGaud mentioned this pull request Feb 28, 2022

check-for-changes wait for settle #2443

Closed

11 tasks

polarathene mentioned this pull request Mar 18, 2022

scripts: update log #2486

Closed

8 tasks

polarathene removed this from the v11.0.0 milestone Apr 4, 2022

polarathene added this to the v12.0.0 milestone Jun 10, 2022

polarathene unassigned NorseGaud Jun 10, 2022

georglauterbach modified the milestones: v12.0.0, v13.0.0 Mar 3, 2023

georglauterbach closed this Apr 24, 2023

polarathene mentioned this pull request Apr 25, 2023

Unset mydestination in postfix/main.cf #3264

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

check-for-changes: performance improvements + wait for settle v2#2157

check-for-changes: performance improvements + wait for settle v2#2157
NorseGaud wants to merge 35 commits intodocker-mailserver:masterfrom
NorseGaud:check-for-changes-performance

NorseGaud commented Aug 28, 2021 •

edited

Loading

Uh oh!

NorseGaud commented Aug 28, 2021

Uh oh!

NorseGaud commented Aug 30, 2021 •

edited

Loading

Uh oh!

NorseGaud commented Aug 30, 2021

Uh oh!

NorseGaud commented Aug 30, 2021

Uh oh!

NorseGaud commented Aug 31, 2021

Uh oh!

github-actions Bot commented Oct 9, 2021

Uh oh!

polarathene left a comment

Uh oh!

georglauterbach commented Apr 24, 2023

Uh oh!

polarathene commented Apr 25, 2023 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

NorseGaud commented Aug 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Checklist:

Uh oh!

NorseGaud commented Aug 28, 2021

Uh oh!

NorseGaud commented Aug 30, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NorseGaud commented Aug 30, 2021

Uh oh!

NorseGaud commented Aug 30, 2021

Uh oh!

NorseGaud commented Aug 31, 2021

Uh oh!

github-actions Bot commented Oct 9, 2021

Uh oh!

polarathene left a comment

Choose a reason for hiding this comment

TL;DR

Actionable Tasks

Uh oh!

georglauterbach commented Apr 24, 2023

Uh oh!

polarathene commented Apr 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NorseGaud commented Aug 28, 2021 •

edited

Loading

NorseGaud commented Aug 30, 2021 •

edited

Loading

polarathene commented Apr 25, 2023 •

edited

Loading