check-for-changes wait for settle by NorseGaud · Pull Request #2443 · docker-mailserver/docker-mailserver

NorseGaud · 2022-02-28T14:38:05Z

Description

Original PR this was split from: #2157

Goal: Have check-for-changes.sh wait for changes to "settle" (checksum comparison has to not have changed over two different checks with 5-second gaps between) before modifying and restarting anything.

Why: The eventual API (docker-mailserver/docker-mailserver-admin#3) will allow users to make frequent changes to the config and we don't want them triggering service restarts until those requests/changes settle down.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (non-breaking change that does improve existing functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (README.md or the documentation under docs/)
If necessary I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…es to configs and prevention of unnecessary service restarts

polarathene · 2022-03-01T14:20:26Z

I haven't looked over this PR update, but chiming in to say I still feel strongly about what I said in the previous PR back in Nov 2021. I've mentioned it a few times but had no feedback.

I was hoping to have started working on that by now, but lacked resources. That should be resolved sometime this month thankfully, and I can start contributing again.

Did you want to discuss the alternate route I think we should pursue, or for the time being get your approach in first?

NorseGaud · 2022-03-01T14:38:51Z

I haven't looked over this PR update, but chiming in to say I still feel strongly about what I said in the previous PR back in Nov 2021. I've mentioned it a few times but had no feedback.

I was hoping to have started working on that by now, but lacked resources. That should be resolved sometime this month thankfully, and I can start contributing again.

Did you want to discuss the alternate route I think we should pursue, or for the time being get your approach in first?

Do you mean watchexec? It adds a dependency which we'd have to wait on fixes for should something break or stop working. I don't think it's a good idea. This isn't that hard to solve in just pure bash.

casperklein · 2022-03-01T15:49:20Z

Regarding watchexec: If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

NorseGaud · 2022-03-01T15:51:44Z

If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

What is wrong with NFS and how does it relate to waiting for changes to settle? I'd like to keep this PR to only the essentials if possible. These massive PRs (which I'm responsible for) are painful for us all.

casperklein · 2022-03-01T15:55:43Z

I've updated my post above, I hope this makes it clearer 😉 There is nothing wrong with this PR from my point of view.

polarathene · 2022-03-01T23:31:47Z

Regarding watchexec: If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

watchexec already handles that and switches to polling instead of more efficient file watcher.

It additionally has a debounce setting, so that it won't run any script if an event triggers again within a duration (must be idle for X time after last event to run script), and a throttle IIRC to ensure that it will not debounce forever, at least once per Y time.

Do you mean watchexec? It adds a dependency which we'd have to wait on fixes for should something break or stop working.

Yes, watchexec. What fixes do you expect to wait on or breakage to happen? It's a rust binary, I don't see that causing any major concerns.

The developer is quite friendly / responsive, chances are if a problem needed to be investigated/addressed, it'd be done sooner on their end than us with bash given how long some bug-fixes/features can take for us volunteer maintainers to complete.

Regardless, if we take your approach first for now, you'd have a fallback option to rollback to and maintain, but I don't think that'd be necessary in practice.

I don't think it's a good idea. This isn't that hard to solve in just pure bash.

It might not be that hard to add to the existing bash support, but overall we could reduce a fair amount of bash to maintain, which I consider a decent reason in this case since we've already had bash introduce breakage that wasn't easy to identify and resolve quickly..

That was due to a recent base image update caused memory leak because check-for-changes was running the sha512sum via subshell, and this was possibly related to an incompatibility with libc update with older kernels (from about a year ago or so, but still common for servers).

We've already had other maintainers (myself included) find this area a bit tricky/confusing in the past to understand (thankfully a fair amount of refactoring by you and me has greatly improved that at least). I don't know about you, but my memory on such inner workings / layers in a codebase can get foggy over time, and I know I won't be as active of a maintainer after this year, so someone else needs to retain/accumulate that knowledge and so forth.

As time goes by digging through git blame (especially with big changes and logic being split into separate files or renamed/relocated), trying to learn from past contributors/PRs can be frustrating, I did that extensively for this area especially for file locking logic.

IMO, simplifying this sort of thing as much as possible so that it's easy to grok and maintain is more important than it being handled in pure bash. Similar to the python acme_extract, just because it could be done in bash doesn't mean it's appropriate to. Bash makes sense elsewhere in the project, @georglauterbach and @casperklein do a stellar job keeping that in tip-top shape.

This PR is probably fine for now, but my opinion is it only further complicates the logic / control-flow. If watchexec reduces maintenance cost, provides better polling approach (compared to managing loops+sleeps) as a fallback all the while letting us leverage more effective file watchers for the majority of users... I'm quite down for that 😅

Not sure if you're aware but related tests updated by myself and @georglauterbach and perhaps your own experience have had failures due to the sleep approach required by current check-for-changes approach, and increasing the sleep or bytes checked by logs to accommodate has accidentally resulted in false positives where the author didn't consider the logs they were checking might be unrelated older logs from a previous change event... and probably introduced some race conditions.

TL;DR

watchexec has the promise of:

Less maintenance woes.
Probably faster resolution to any issues compared to existing history with the current bash script.
File watching with automatic polling fallback for users that need it (eg NFS).

Bash solution may be simple, but that doesn't always make it appropriate to prefer bash for the sake of it. Personally, watchexec seems it'd be simpler and more flexible to maintain on our end. I'm more concerned about check-for-changes logic as a whole, rather than what this PR alone is solving.

I'm not against reviewing and merging this PR in the meantime though, but I think there's a strong case for watchexec at a later date. We can always roll back to your bash approach if I'm wrong about watchexec.

polarathene

LGTM 👍 Just a few minor changes to consider.

NorseGaud · 2022-03-02T12:45:17Z

Seeing some flakiness in tests. I'm going to run them a few more times to see what and how often it fails.

casperklein · 2022-03-02T12:59:39Z

As always, good summary @polarathene 👍

Regarding watchexec: If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

watchexec already handles that and switches to polling instead of more efficient file watcher. It additionally has a debounce setting, so that it won't run any script if an event triggers again within a duration (must be idle for X time after last event to run script), and a throttle IIRC to ensure that it will not debounce forever, at least once per Y time.

This sounds like a jack of all trades 😄 If that works reliable and the code base can be considerably reduced, I think it's worth to evaluate this solution.

But until then, there are no reasons to not improve the current script with this PR.

NorseGaud · 2022-03-03T00:22:08Z

Flakey test:

not ok 183 checking relay hosts: default mapping is added from env vars for new virtual user entry in 11546ms
[191](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:191)
# (from function `run_until_success_or_timeout' in file test/test_helper/common.bash, line 58,
[192](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:192)
#  in test file test/mail_with_relays.bats, line 62)
[193](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:193)
#   `run_until_success_or_timeout 10 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map' failed

georglauterbach · 2022-03-03T07:41:08Z

Flakey test:


not ok 183 checking relay hosts: default mapping is added from env vars for new virtual user entry in 11546ms

[191](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:191)

# (from function `run_until_success_or_timeout' in file test/test_helper/common.bash, line 58,

[192](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:192)

#  in test file test/mail_with_relays.bats, line 62)

[193](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:193)

#   `run_until_success_or_timeout 10 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map' failed

Difficult to say but I have never experienced this test to be flakey with other PRs or the current code base. How certain we be that this test really is flakey and it's not the PR's fault?

polarathene · 2022-03-03T07:52:50Z

From the shared log, it's possible that the timeout of 10 seconds is too short due to the extra sleeps and slower CI hardware vs local testing (I recall this being a gotcha in the past for other maintainers). If the test is consistently passing locally it's likely the cause is CI environment.

There was many more failures in the failed commit earlier, likely all due to this reason.. It may be a good indication to take caution with eventual merging if it's going to ripple through to other PRs?

For tests that have too short timeouts, we could increase them (presumably safe), but others may be failing due to assumptions that delays between service availability wouldn't be as long without any explicit timing? 🤷‍♂️

casperklein · 2022-03-03T11:43:43Z

How certain we be that this test really is flakey and it's not the PR's fault?

We merged a lot PRs in the last days and this test was never flaky IIRC.

PS: This test also fails (reliable) when I run the tests locally.

NorseGaud · 2022-03-03T13:10:53Z

Yes, with this commit 94a72dd

Thanks, give me a bit to dig into this.

…3.x in /bin/bash, and brew installs 5.x into /usr/local/bin

…-changes wait for settle

NorseGaud · 2022-03-04T13:36:51Z

To everyone watching this PR: I figured that I would change the shebang in setup.sh as part of this PR since it was blocking me running tests on macOS (3964c98). It is the same issue described in #2448. I don't expect that it will be a breaking change... What do you think?

NorseGaud · 2022-03-04T13:46:38Z

  assert_output ''
  run ./setup.sh -c mail_with_relays alias add [email protected] [email protected]
-  run_until_success_or_timeout 10 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map
+  run_until_success_or_timeout 11 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map


Turns out adding aliases takes 1 second longer than anything else :)

casperklein · 2022-03-05T11:17:42Z

I see, you add more and more Mac specific workarounds. This will be a cat & mouse game forever. Every PR is only tested against linux. So it's just a matter of time, until something breaks on MacOS again (unnoticed, until you stumble upon it.).
Wouldn't it be much easier, to just spin up a dev (linux) virtual machine and use that? That's what I do, instead of adding Windows specific fixes 😆

NorseGaud · 2022-03-05T15:03:37Z

I see, you add more and more Mac specific workarounds. This will be a cat & mouse game forever. Every PR is only tested against linux. So it's just a matter of time, until something breaks on MacOS again (unnoticed, until you stumble upon it.). Wouldn't it be much easier, to just spin up a dev (linux) virtual machine and use that? That's what I do, instead of adding Windows specific fixes 😆

I agree. We should pull all custom macOS changes from the code and update the Contributor guide to explain macOS users need to use a Linux VM (I can do in a different PR). What does @georglauterbach and @polarathene think?

georglauterbach · 2022-03-05T19:58:38Z

I see, you add more and more Mac specific workarounds. This will be a cat & mouse game forever. Every PR is only tested against linux. So it's just a matter of time, until something breaks on MacOS again (unnoticed, until you stumble upon it.). Wouldn't it be much easier, to just spin up a dev (linux) virtual machine and use that? That's what I do, instead of adding Windows specific fixes 😆

I agree. We should pull all custom macOS changes from the code and update the Contributor guide to explain macOS users need to use a Linux VM (I can do in a different PR). What does @georglauterbach and @polarathene think?

I very much agree that this should be the way to go :)

polarathene · 2022-03-06T03:48:35Z

What does @georglauterbach and @polarathene think?

I agree as well. I personally just spin up a VM to run tests.

github-actions · 2022-03-27T01:29:47Z

This pull request has become stale because it has been open for 20 days without activity.
This pull request will be closed in 10 days automatically unless:

a maintainer removes the meta/stale label or adds the stale-bot/ignore label
new activity occurs, such as a new comment

NorseGaud · 2022-03-27T14:41:46Z

Not stale! I just moved so haven't had time for this

georglauterbach · 2022-03-27T14:43:58Z

Not stale! I just moved so haven't had time for this

If you want to you can add the stale-bot/ignore label so GH actions won't mark this PR as stale.

PS: Also sorry for the upcoming merge conflicts due to #2498 🙃🙈

georglauterbach · 2022-06-06T10:50:08Z

This PR is now open for quite some time without recent contribution, and we have quite a lot of merge conflicts here.

I‘d advocate for closing this and, if someone want‘s to try again in the future, open a new PR.

Any objections from @docker-mailserver/maintainers?

polarathene · 2022-06-06T13:16:26Z

I intend to tackle this within a week once the v11.1 PRs are sorted and I can focus on v12. I'll also be addressing other long-standing draft PRs at that time.

georglauterbach · 2023-04-24T13:31:04Z

As part of a cleanup series for v13.0.0, I will close this as part of a TODO list (see #3289).

It may be a personal thing, but I like the PR list to be clean and up-to-date with non-stale PRs. I hope this does not offend anyone, but maintaining DMS for me is easier when I see all relevant PRs as open (or non-stale WIP). Maintaining the PR and issue list is a lot of work, especially for me, so I'd like to keep it in a way that is comfortable for me.

This PR is just too old for my taste; this does not mean you cannot incorporate these changes though! Please (re-)open when you're working on this again.

v1 of check-for-changes settle logic to handle lots of frequent chang…

4d0051a

…es to configs and prevention of unnecessary service restarts

NorseGaud requested review from casperklein and georglauterbach February 28, 2022 14:38

NorseGaud marked this pull request as draft February 28, 2022 14:39

NorseGaud added 2 commits February 28, 2022 09:52

various test changes

523b192

lint fix

7f39160

NorseGaud self-assigned this Mar 1, 2022

polarathene requested changes Mar 2, 2022

View reviewed changes

Comment thread test/mail_changedetector.bats

Comment thread target/scripts/check-for-changes.sh Outdated

Comment thread target/scripts/check-for-changes.sh Outdated

Comment thread target/scripts/check-for-changes.sh

NorseGaud added 3 commits March 1, 2022 19:58

Merge branch 'master' into improvement/check-for-changes-settle

77cfcc3

copy and fix problem for _log_date

5ce5b38

using unset

9ee0290

NorseGaud force-pushed the improvement/check-for-changes-settle branch from 8a71d5c to 9ee0290 Compare March 2, 2022 01:03

Merge branch 'master' into improvement/check-for-changes-settle

4bbfd32

polarathene previously approved these changes Mar 2, 2022

View reviewed changes

NorseGaud added area/scripts kind/improvement Improve an existing feature, configuration file or the documentation priority/low labels Mar 2, 2022

Merge branch 'master' into improvement/check-for-changes-settle

94a72dd

NorseGaud added 2 commits March 4, 2022 08:22

/usr/bin/env to setup.sh to support running on macOS; macOS has bash …

3964c98

…3.x in /bin/bash, and brew installs 5.x into /usr/local/bin

1 second bump to run_until_success_or_timeout to handle new check-for…

077483f

…-changes wait for settle

NorseGaud dismissed polarathene’s stale review via 077483f March 4, 2022 13:28

NorseGaud commented Mar 4, 2022

View reviewed changes

NorseGaud force-pushed the improvement/check-for-changes-settle branch from 38c855a to 077483f Compare March 5, 2022 14:41

NorseGaud added 2 commits March 6, 2022 11:28

Merge branch 'master' into improvement/check-for-changes-settle

2fdd5de

quick change

266a197

github-actions Bot added the meta/stale This issue / PR has become stale and will be closed if there is no further activity label Mar 27, 2022

NorseGaud removed the meta/stale This issue / PR has become stale and will be closed if there is no further activity label Mar 27, 2022

polarathene added the stale-bot/ignore Indicates that this issue / PR shall not be closed by our stale-checking CI label Mar 27, 2022

polarathene assigned polarathene and unassigned NorseGaud Jun 10, 2022

polarathene added this to the v12.0.0 milestone Jun 10, 2022

georglauterbach modified the milestones: v12.0.0, v13.0.0 Mar 3, 2023

georglauterbach closed this Apr 24, 2023

georglauterbach mentioned this pull request Apr 24, 2023

check-for-changes: performance improvements + wait for settle v2 #2157

Closed

10 tasks

georglauterbach deleted the improvement/check-for-changes-settle branch April 24, 2023 14:15

Uh oh!

Conversation

NorseGaud commented Feb 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Checklist:

Uh oh!

polarathene commented Mar 1, 2022

Uh oh!

NorseGaud commented Mar 1, 2022

Uh oh!

casperklein commented Mar 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NorseGaud commented Mar 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

casperklein commented Mar 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

polarathene commented Mar 1, 2022

TL;DR

Uh oh!

polarathene left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NorseGaud commented Mar 2, 2022

Uh oh!

casperklein commented Mar 2, 2022

Uh oh!

NorseGaud commented Mar 3, 2022

Uh oh!

georglauterbach commented Mar 3, 2022

Uh oh!

polarathene commented Mar 3, 2022

Uh oh!

casperklein commented Mar 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NorseGaud commented Mar 3, 2022

Uh oh!

NorseGaud commented Mar 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NorseGaud Mar 4, 2022

Choose a reason for hiding this comment

Uh oh!

casperklein commented Mar 5, 2022

Uh oh!

NorseGaud commented Mar 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

georglauterbach commented Mar 5, 2022

Uh oh!

polarathene commented Mar 6, 2022

Uh oh!

github-actions Bot commented Mar 27, 2022

Uh oh!

NorseGaud commented Mar 27, 2022

Uh oh!

georglauterbach commented Mar 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

georglauterbach commented Jun 6, 2022

Uh oh!

polarathene commented Jun 6, 2022

Uh oh!

georglauterbach commented Apr 24, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

NorseGaud commented Feb 28, 2022 •

edited

Loading

casperklein commented Mar 1, 2022 •

edited

Loading

NorseGaud commented Mar 1, 2022 •

edited

Loading

casperklein commented Mar 1, 2022 •

edited

Loading

casperklein commented Mar 3, 2022 •

edited

Loading

NorseGaud commented Mar 4, 2022 •

edited

Loading

NorseGaud commented Mar 5, 2022 •

edited

Loading

georglauterbach commented Mar 27, 2022 •

edited

Loading