Skip to content

check-for-changes: performance improvements + wait for settle v2#2157

Closed
NorseGaud wants to merge 35 commits intodocker-mailserver:masterfrom
NorseGaud:check-for-changes-performance
Closed

check-for-changes: performance improvements + wait for settle v2#2157
NorseGaud wants to merge 35 commits intodocker-mailserver:masterfrom
NorseGaud:check-for-changes-performance

Conversation

@NorseGaud
Copy link
Copy Markdown
Member

@NorseGaud NorseGaud commented Aug 28, 2021

Original PR that was reverted : #2104

Description

See: #2098

This PR will:

  • Have check-for-changes.sh wait for changes to "settle" (checksum comparison has to not have changed over two different checks with 5 second gaps between) before modifying and restarting anything
  • Added backgrounding of processes and using wait throughout instead of sleeps for more efficiency
  • Removed the while loop from addmailuser due to check-for-changes.sh logic changes
  • Fixed a bug in _monitored_files_checksums
  • TODO: Add resource usage tests of default setup and ensure the minimum requirements for the container/host specs stay the same (or as close as possible)

This is, in my opinion, critically needed for an API to be performing changes -- and lots of them all at once! See docker-mailserver/docker-mailserver-admin#3 for more discussion about this.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (non-breaking change that does improve existing functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • [] I have made corresponding changes to the documentation (README.md or the documentation under docs/)
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@NorseGaud NorseGaud self-assigned this Aug 28, 2021
@NorseGaud NorseGaud added kind/improvement Improve an existing feature, configuration file or the documentation meta/wip labels Aug 28, 2021
@NorseGaud
Copy link
Copy Markdown
Member Author

Per #2154 we need to check why so much CPU is used before merging

@NorseGaud NorseGaud marked this pull request as draft August 30, 2021 12:14
@NorseGaud
Copy link
Copy Markdown
Member Author

NorseGaud commented Aug 30, 2021

On boot of the docker container/starting services:

 12:39:32 up 23 min,  3 users,  load average: 1.04, 0.59, 0.30
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
ec2-user pts/0    inet-64-112-178- 12:17   21:38   1.45s  1.43s top -c
ec2-user pts/1    inet-64-112-178- 12:17    1:32   0.89s  0.65s docker-compose up
ec2-user pts/2    inet-64-112-178- 12:17    0.00s  0.07s  0.00s w

Once booted, it drops down but hovers around 0.50. Any commands I run cause the cpu load average to spike again, even if briefly.

I'm trying to determine what specifically is causing this.

@NorseGaud
Copy link
Copy Markdown
Member Author

Ok, it looks to be fine after I remove

      for FILE in ${CHANGED}
      do
        case "${FILE}" in
          "/etc/letsencrypt/acme.json" )
              for CERTDOMAIN in ${SSL_DOMAIN} ${HOSTNAME} ${DOMAINNAME}
              do
                {
                  _extract_certs_from_acme "${CERTDOMAIN}" && break
                } &
                WAIT_FOR_PIDS+=($!)
              done
            ;;

          * )
            _notify 'warn' 'File not found for certificate in check_for_changes.sh'
            ;;

        esac
      done

Continuing to confirm...

@NorseGaud
Copy link
Copy Markdown
Member Author

Supervisor logs were showing spam:

cmp: EOF on /tmp/docker-mailserver-config-chksum.new after byte 596, line 4

Turns out the chksum.new file was being written too slowly and the cmp command was trying to read it when it was partially written to.

it looks like we've wrapped _monitored_files_checksums's code in ( ) and that's likely causing it to execute and continue when we really want it to wait.

@NorseGaud
Copy link
Copy Markdown
Member Author

Odd... check for changes uses a helper that doesn't work...

set -x and a few modifications show:

+ _monitored_files_checksums
+ echo 'HOSTNAME: deliver'
+ cd /tmp/docker-mailserver
+ exec sha512sum -- postfix-accounts.cf postfix-virtual.cf postfix-aliases.cf dovecot-quotas.cf /etc/letsencrypt/acme.json /etc/letsencrypt/live/deliver/key.pem /etc/letsencrypt/live/deliver/privkey.pem /etc/letsencrypt/live/deliver/fullchain.pem
root@deliver:/# ls -alht /etc/letsencrypt/live/deliver/
ls: cannot access '/etc/letsencrypt/live/deliver/': No such file or directory
root@deliver:/# ls -alht /etc/letsencrypt/live/
total 20K
drwxr-xr-x 9 root root 6.0K Aug 31 00:00 ..
drwxr-xr-x 2 root root 6.0K Aug 16 00:01 xkr.email
drwxr-xr-x 2 root root 6.0K Aug 16 00:01 pierce.us
drwx------ 4 root root 6.0K Mar 30 02:26 .
-rw-r--r-- 1 root root  740 Aug 11  2020 README

Why would anyone generate a letsencrypt with just the hostname and not the domain?

@georglauterbach georglauterbach removed the meta/feature freeze On hold due to upcoming release process label Oct 7, 2021
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Oct 9, 2021

Documentation preview for this PR is ready! 🎉

Built with commit: 5f48041

Copy link
Copy Markdown
Member

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've completed my personal notes on this PR, it's prior PRs and other related history in the past regarding the change detection logic. This PR remains open as a reminder for me to tackle, but it could be closed in favor of a new issue instead.

The current state of this PR has fallen a fair bit out of sync with the changes/refactoring for the v10.3.0 release, with more changes to occur again in the near future as check-for-changes.sh is refactored some more. Some of the changes this PR covered have also since become part of the v10.3.0 release, I don't think it's worth reviving this PR.


Personally, I'm not fond of the approach proposed here with wait.

Changes can be detected via FS event notifications, with fallback to polling when that is not supported (eg: remote FS like NFS):

Issue #1958 was about reducing the rate of calculating checksums for the changedetector service, with PR #1966 attempting implementing inotifywait where it was noted something regarding volume mounts (insufficient details provided) were having trouble with acme.json but not postfix-accounts.cf (may not have been a bind mount?) and making a suggestion to swap the cmp command for sha512sum -c --status /path/to/existing/checksum.

That user also had a separate PR attempt with little details that branched PR-1966 to swap inotifywait with fswatch for some reason. Neither PR was merged, nor a comment regarding their rejection.


Additionally when looking into all of this I did a deep dive into file locking and believe the changes made to replace flock introduced regressions.

Switch back to `flock`

Both in regards to:

  • The intent of locking files in the first place - flock locked the configs being interacted with, while the replacement applied locks to individual scripts instead, thus locks only blocked a specific script from being concurrently called, but not concurrent access to the actual config files being modified which was the original intent with flock being introduced.
  • And performance/timing - flock had no lockfile to cleanup or timeout on, timeout could be configured via option, as could immediate exit, otherwise it would cause a process to block/wait until a lock was released and could be acquired to continue processing, all with no sleep or similar loop involved..

flock was removed only for the benefit of remote filesystems that it was deemed unavailable (NFS) which was inaccurate and referencing outdated information, the error experienced appears to be for different reasons. I additionally looked into lockf and fcntl as the other common locking alternatives:

  • lockf is a wrapper around fcntl and doesn't seem to offer much benefit.
  • fcntl has some pro/cons, but is what NFS will use to emulate flock with and will need some testing to verify correct behaviour with multiple clients involved.
  • And some utilities like lockfile (from procmail package) that provide a more convenient command and API to use for actual lockfiles if other options like flock aren't viable.

File based locks are generally not ideal however as they're not atomic (checking for the lockfile and creating one if it doesn't exist is not a single operation and thus prone to some race conditions), mkdir is often cited as an atomic variant for that, but using syscalls with flock and fcntl is preferable from what I came across.


Avoiding background processes with wait for parallelization due to concerns

Using locks properly should resolve that addmailuser loop removal just as well AFAIK. The background of processes and wait if I understood correctly was to try parallelize the logic, but this seems unwise as IIRC there is some synchronous flow between some of the logic and you'd be complicating that flow for maintainers for minimal time savings as most of that is related to the service restarts and any sleep delays (which can be better reduced / optimized via FS event notifications when supported, few users will use remote network file systems like NFS where polling should work just as well as current sleep delays).

Change Detection (consider `watchexec`)

Settling on changes is a good improvement for check-for-changes.sh, and I think we can leverage a file watcher daemon like watchexec for this. It supports both FS event notifications and polling modes, and can debounce (wait for no changes until X time has passed before running) although I'm not sure if it can ensure we force processing if debounce stalls processing due to sustained high frequency of changes, which should be unlikely and better served by such users abstracting changes via queue to batch apply rather than directly modifying or calling other utilities (eg: what the approach by the DMS API project is currently taking). At least I think watchexec would cover everything we'd do and more/better and allow us to offload maintenance for that functionality.

TL;DR

  • This PR (and it's history in prior PRs and issues involved) serves as a good reference but probably not worth resuming or the correct approach to merge.
  • File locking appears to have regressed and should bring back flock, NFS does actually support it and the failure just needs to be investigated/resolved.
  • Detecting changes can avoid the outer loop and it's frequency by using something like watchexec with FS event notifications for file changes (or polling mode for NFS), using the debounce feature to get settled changes. I'm unsure currently how beneficial the checksum functionality is to keep (existing in master, not additions from this PR) if adopting watchexec (it can inject ENV with values of affected files, but checksum is probably nicer to work with, we can switch from sha512 to blake3 for optimization).

Additionally (improvements, but can come later in separate PRs):

  • While I recently de-duplicated the check-for-changes.sh logic shared with start-mailserver.sh startup scripts, we probably can refactor that some more to diff accordingly as issue #2096 notes. (Note: They are an LDAP + NFS user, noting the chown -R perf issue on /var/mail which is likely compounding together, not that check-for-changes.sh presently runs that for LDAP users however..they also cite adding many users via API service and proposing for a queue in the container as well to batch apply)
  • I also recall a PR some years ago moved away from hash type config DB files with Postfix, these could still use text configs but were possible to update at runtime without restarting Postfix IIRC, I'd need to look up that discussion again for why the change was made and if it makes sense to revert.

Actionable Tasks

If I find time for these, it should be around Dec or Jan (excluding "extras" tasks), or someone else is welcome to attempt these before me. I've built up a rather verbose list of notes for when I tackle this, but I'm just as happy to review/collaborate (_just give me a heads up in advance, if you the reader are going to tackle this 😅 )

  • Use watchexec (with debounce) instead for change detection, possibly requiring an ENV for NFS/polling.
  • Look at replacing the existing file locking logic with lockfile (from procmail package) or better flock (will require reproducing @NorseGaud multi-instance of DMS containers with NFS volumes that produce the bad file descriptor error and resolving that). Probably more importantly address that files being locked are the actual relevant ones, not the scripts themselves..
  • If keeping the checksum logic around (eg: the monitor function), consider using blake3 instead of sha512 for checksums as minor optimization.

Extras:

  • Investigate viability / complexity of only responding to diff changes and/or changing from text: to hash: based config records for Postfix. Users primary issue may be more related to NFS in combination with chown -R (being investigated and resolved in PR #2256), but they appear to be an LDAP user with usage for spinning up throwaway images for testing, check-for-changes.sh doesn't apply to LDAP at present, it could be startup related but their API to add users at runtime shouldn't be involved with that only check-for-changes.sh 🤷‍♂️
  • Investigate possibility to support batch change requests, potentially via a queue service, though at present this is likely better deferred to the API project, it may be related to file locking concerns depending on implementation. File based queues may be similarly unreliable to in-memory queue system with different failure risks, but the core project is also meant to be DB averse 🤷‍♂️

@polarathene polarathene added the stale-bot/ignore Indicates that this issue / PR shall not be closed by our stale-checking CI label Nov 29, 2021
@polarathene polarathene self-assigned this Nov 29, 2021
@NorseGaud NorseGaud mentioned this pull request Feb 28, 2022
11 tasks
@polarathene polarathene mentioned this pull request Mar 18, 2022
8 tasks
@polarathene polarathene removed this from the v11.0.0 milestone Apr 4, 2022
@polarathene polarathene added this to the v12.0.0 milestone Jun 10, 2022
@georglauterbach georglauterbach modified the milestones: v12.0.0, v13.0.0 Mar 3, 2023
@georglauterbach
Copy link
Copy Markdown
Member

Same as #2443; see #2443 (comment).

@polarathene
Copy link
Copy Markdown
Member

polarathene commented Apr 25, 2023

This shouldn't be too important anymore. Many changes were done since to improve how check-for-changes.sh works. It is much faster at processing now, with minimal down time.

However, it may not be sufficient for bulk adding of users. Especially for supporting encrypted mailbox creation.

For this I still think debouncing with throttle is better than the current polling approach. My advice in previous comment is still valid for approaching that with watchexec. I also still recommend switching the file locking approach back to flock.


EDIT: Relevant issue that has all closed PRs being linked didn't reference this one. Including this PR for traceability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/improvement Improve an existing feature, configuration file or the documentation stale-bot/ignore Indicates that this issue / PR shall not be closed by our stale-checking CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants