check-for-changes: performance improvements + wait for settle v2#2157
check-for-changes: performance improvements + wait for settle v2#2157NorseGaud wants to merge 35 commits intodocker-mailserver:masterfrom NorseGaud:check-for-changes-performance
Conversation
|
Per #2154 we need to check why so much CPU is used before merging |
…formance improvements throughout + removed addmailuser while loops
… various test fixes for new check-for-changes logic
|
On boot of the docker container/starting services: Once booted, it drops down but hovers around I'm trying to determine what specifically is causing this. |
|
Ok, it looks to be fine after I remove Continuing to confirm... |
|
Supervisor logs were showing spam: Turns out the chksum.new file was being written too slowly and the cmp command was trying to read it when it was partially written to. it looks like we've wrapped _monitored_files_checksums's code in ( ) and that's likely causing it to execute and continue when we really want it to wait. |
|
Odd... check for changes uses a helper that doesn't work...
Why would anyone generate a letsencrypt with just the hostname and not the domain? |
…tting of DOMAINNAME in check-for-changes.sh
…tting of DOMAINNAME in check-for-changes.sh
|
Documentation preview for this PR is ready! 🎉 Built with commit: 5f48041 |
polarathene
left a comment
There was a problem hiding this comment.
I've completed my personal notes on this PR, it's prior PRs and other related history in the past regarding the change detection logic. This PR remains open as a reminder for me to tackle, but it could be closed in favor of a new issue instead.
The current state of this PR has fallen a fair bit out of sync with the changes/refactoring for the v10.3.0 release, with more changes to occur again in the near future as check-for-changes.sh is refactored some more. Some of the changes this PR covered have also since become part of the v10.3.0 release, I don't think it's worth reviving this PR.
Personally, I'm not fond of the approach proposed here with wait.
Changes can be detected via FS event notifications, with fallback to polling when that is not supported (eg: remote FS like NFS):
Issue #1958 was about reducing the rate of calculating checksums for the changedetector service, with PR #1966 attempting implementing inotifywait where it was noted something regarding volume mounts (insufficient details provided) were having trouble with acme.json but not postfix-accounts.cf (may not have been a bind mount?) and making a suggestion to swap the cmp command for sha512sum -c --status /path/to/existing/checksum.
That user also had a separate PR attempt with little details that branched PR-1966 to swap inotifywait with fswatch for some reason. Neither PR was merged, nor a comment regarding their rejection.
Additionally when looking into all of this I did a deep dive into file locking and believe the changes made to replace flock introduced regressions.
Switch back to `flock`
Both in regards to:
- The intent of locking files in the first place -
flocklocked the configs being interacted with, while the replacement applied locks to individual scripts instead, thus locks only blocked a specific script from being concurrently called, but not concurrent access to the actual config files being modified which was the original intent withflockbeing introduced. - And performance/timing -
flockhad no lockfile to cleanup or timeout on, timeout could be configured via option, as could immediate exit, otherwise it would cause a process to block/wait until a lock was released and could be acquired to continue processing, all with no sleep or similar loop involved..
flock was removed only for the benefit of remote filesystems that it was deemed unavailable (NFS) which was inaccurate and referencing outdated information, the error experienced appears to be for different reasons. I additionally looked into lockf and fcntl as the other common locking alternatives:
lockfis a wrapper aroundfcntland doesn't seem to offer much benefit.fcntlhas some pro/cons, but is what NFS will use to emulateflockwith and will need some testing to verify correct behaviour with multiple clients involved.- And some utilities like
lockfile(fromprocmailpackage) that provide a more convenient command and API to use for actual lockfiles if other options likeflockaren't viable.
File based locks are generally not ideal however as they're not atomic (checking for the lockfile and creating one if it doesn't exist is not a single operation and thus prone to some race conditions), mkdir is often cited as an atomic variant for that, but using syscalls with flock and fcntl is preferable from what I came across.
Avoiding background processes with wait for parallelization due to concerns
Using locks properly should resolve that addmailuser loop removal just as well AFAIK. The background of processes and wait if I understood correctly was to try parallelize the logic, but this seems unwise as IIRC there is some synchronous flow between some of the logic and you'd be complicating that flow for maintainers for minimal time savings as most of that is related to the service restarts and any sleep delays (which can be better reduced / optimized via FS event notifications when supported, few users will use remote network file systems like NFS where polling should work just as well as current sleep delays).
Change Detection (consider `watchexec`)
Settling on changes is a good improvement for check-for-changes.sh, and I think we can leverage a file watcher daemon like watchexec for this. It supports both FS event notifications and polling modes, and can debounce (wait for no changes until X time has passed before running) although I'm not sure if it can ensure we force processing if debounce stalls processing due to sustained high frequency of changes, which should be unlikely and better served by such users abstracting changes via queue to batch apply rather than directly modifying or calling other utilities (eg: what the approach by the DMS API project is currently taking). At least I think watchexec would cover everything we'd do and more/better and allow us to offload maintenance for that functionality.
TL;DR
- This PR (and it's history in prior PRs and issues involved) serves as a good reference but probably not worth resuming or the correct approach to merge.
- File locking appears to have regressed and should bring back
flock, NFS does actually support it and the failure just needs to be investigated/resolved. - Detecting changes can avoid the outer loop and it's frequency by using something like
watchexecwith FS event notifications for file changes (or polling mode for NFS), using the debounce feature to get settled changes. I'm unsure currently how beneficial the checksum functionality is to keep (existing in master, not additions from this PR) if adoptingwatchexec(it can inject ENV with values of affected files, but checksum is probably nicer to work with, we can switch from sha512 to blake3 for optimization).
Additionally (improvements, but can come later in separate PRs):
- While I recently de-duplicated the
check-for-changes.shlogic shared withstart-mailserver.shstartup scripts, we probably can refactor that some more to diff accordingly as issue #2096 notes. (Note: They are an LDAP + NFS user, noting thechown -Rperf issue on/var/mailwhich is likely compounding together, not thatcheck-for-changes.shpresently runs that for LDAP users however..they also cite adding many users via API service and proposing for a queue in the container as well to batch apply) - I also recall a PR some years ago moved away from
hashtype config DB files with Postfix, these could still use text configs but were possible to update at runtime without restarting Postfix IIRC, I'd need to look up that discussion again for why the change was made and if it makes sense to revert.
Actionable Tasks
If I find time for these, it should be around Dec or Jan (excluding "extras" tasks), or someone else is welcome to attempt these before me. I've built up a rather verbose list of notes for when I tackle this, but I'm just as happy to review/collaborate (_just give me a heads up in advance, if you the reader are going to tackle this 😅 )
- Use
watchexec(with debounce) instead for change detection, possibly requiring an ENV for NFS/polling. - Look at replacing the existing file locking logic with
lockfile(fromprocmailpackage) or betterflock(will require reproducing @NorseGaud multi-instance of DMS containers with NFS volumes that produce the bad file descriptor error and resolving that). Probably more importantly address that files being locked are the actual relevant ones, not the scripts themselves.. - If keeping the checksum logic around (eg: the monitor function), consider using blake3 instead of sha512 for checksums as minor optimization.
Extras:
- Investigate viability / complexity of only responding to diff changes and/or changing from
text:tohash:based config records for Postfix. Users primary issue may be more related to NFS in combination withchown -R(being investigated and resolved in PR #2256), but they appear to be an LDAP user with usage for spinning up throwaway images for testing,check-for-changes.shdoesn't apply to LDAP at present, it could be startup related but their API to add users at runtime shouldn't be involved with that onlycheck-for-changes.sh🤷♂️ - Investigate possibility to support batch change requests, potentially via a queue service, though at present this is likely better deferred to the API project, it may be related to file locking concerns depending on implementation. File based queues may be similarly unreliable to in-memory queue system with different failure risks, but the core project is also meant to be DB averse 🤷♂️
|
Same as #2443; see #2443 (comment). |
|
This shouldn't be too important anymore. Many changes were done since to improve how However, it may not be sufficient for bulk adding of users. Especially for supporting encrypted mailbox creation. For this I still think debouncing with throttle is better than the current polling approach. My advice in previous comment is still valid for approaching that with EDIT: Relevant issue that has all closed PRs being linked didn't reference this one. Including this PR for traceability. |
Original PR that was reverted : #2104
Description
See: #2098
This PR will:
check-for-changes.shwait for changes to "settle" (checksum comparison has to not have changed over two different checks with 5 second gaps between) before modifying and restarting anything_monitored_files_checksumsThis is, in my opinion, critically needed for an API to be performing changes -- and lots of them all at once! See docker-mailserver/docker-mailserver-admin#3 for more discussion about this.
Type of change
Checklist:
docs/)