Skip to content

check-for-changes wait for settle#2443

Closed
NorseGaud wants to merge 12 commits intomasterfrom
improvement/check-for-changes-settle
Closed

check-for-changes wait for settle#2443
NorseGaud wants to merge 12 commits intomasterfrom
improvement/check-for-changes-settle

Conversation

@NorseGaud
Copy link
Copy Markdown
Member

@NorseGaud NorseGaud commented Feb 28, 2022

Description

Original PR this was split from: #2157

Goal: Have check-for-changes.sh wait for changes to "settle" (checksum comparison has to not have changed over two different checks with 5-second gaps between) before modifying and restarting anything.

Why: The eventual API (docker-mailserver/docker-mailserver-admin#3) will allow users to make frequent changes to the config and we don't want them triggering service restarts until those requests/changes settle down.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (non-breaking change that does improve existing functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (README.md or the documentation under docs/)
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…es to configs and prevention of unnecessary service restarts
@NorseGaud NorseGaud marked this pull request as draft February 28, 2022 14:39
@polarathene
Copy link
Copy Markdown
Member

I haven't looked over this PR update, but chiming in to say I still feel strongly about what I said in the previous PR back in Nov 2021. I've mentioned it a few times but had no feedback.

I was hoping to have started working on that by now, but lacked resources. That should be resolved sometime this month thankfully, and I can start contributing again.

Did you want to discuss the alternate route I think we should pursue, or for the time being get your approach in first?

@NorseGaud
Copy link
Copy Markdown
Member Author

I haven't looked over this PR update, but chiming in to say I still feel strongly about what I said in the previous PR back in Nov 2021. I've mentioned it a few times but had no feedback.

I was hoping to have started working on that by now, but lacked resources. That should be resolved sometime this month thankfully, and I can start contributing again.

Did you want to discuss the alternate route I think we should pursue, or for the time being get your approach in first?

Do you mean watchexec? It adds a dependency which we'd have to wait on fixes for should something break or stop working. I don't think it's a good idea. This isn't that hard to solve in just pure bash.

@NorseGaud NorseGaud self-assigned this Mar 1, 2022
@casperklein
Copy link
Copy Markdown
Member

casperklein commented Mar 1, 2022

Regarding watchexec: If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

@NorseGaud
Copy link
Copy Markdown
Member Author

NorseGaud commented Mar 1, 2022

If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

What is wrong with NFS and how does it relate to waiting for changes to settle? I'd like to keep this PR to only the essentials if possible. These massive PRs (which I'm responsible for) are painful for us all.

@casperklein
Copy link
Copy Markdown
Member

casperklein commented Mar 1, 2022

I've updated my post above, I hope this makes it clearer 😉 There is nothing wrong with this PR from my point of view.

@polarathene
Copy link
Copy Markdown
Member

Regarding watchexec: If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

watchexec already handles that and switches to polling instead of more efficient file watcher.

It additionally has a debounce setting, so that it won't run any script if an event triggers again within a duration (must be idle for X time after last event to run script), and a throttle IIRC to ensure that it will not debounce forever, at least once per Y time.


Do you mean watchexec? It adds a dependency which we'd have to wait on fixes for should something break or stop working.

Yes, watchexec. What fixes do you expect to wait on or breakage to happen? It's a rust binary, I don't see that causing any major concerns.

The developer is quite friendly / responsive, chances are if a problem needed to be investigated/addressed, it'd be done sooner on their end than us with bash given how long some bug-fixes/features can take for us volunteer maintainers to complete.

Regardless, if we take your approach first for now, you'd have a fallback option to rollback to and maintain, but I don't think that'd be necessary in practice.


I don't think it's a good idea. This isn't that hard to solve in just pure bash.

It might not be that hard to add to the existing bash support, but overall we could reduce a fair amount of bash to maintain, which I consider a decent reason in this case since we've already had bash introduce breakage that wasn't easy to identify and resolve quickly..

That was due to a recent base image update caused memory leak because check-for-changes was running the sha512sum via subshell, and this was possibly related to an incompatibility with libc update with older kernels (from about a year ago or so, but still common for servers).

We've already had other maintainers (myself included) find this area a bit tricky/confusing in the past to understand (thankfully a fair amount of refactoring by you and me has greatly improved that at least). I don't know about you, but my memory on such inner workings / layers in a codebase can get foggy over time, and I know I won't be as active of a maintainer after this year, so someone else needs to retain/accumulate that knowledge and so forth.

As time goes by digging through git blame (especially with big changes and logic being split into separate files or renamed/relocated), trying to learn from past contributors/PRs can be frustrating, I did that extensively for this area especially for file locking logic.

IMO, simplifying this sort of thing as much as possible so that it's easy to grok and maintain is more important than it being handled in pure bash. Similar to the python acme_extract, just because it could be done in bash doesn't mean it's appropriate to. Bash makes sense elsewhere in the project, @georglauterbach and @casperklein do a stellar job keeping that in tip-top shape.

This PR is probably fine for now, but my opinion is it only further complicates the logic / control-flow. If watchexec reduces maintenance cost, provides better polling approach (compared to managing loops+sleeps) as a fallback all the while letting us leverage more effective file watchers for the majority of users... I'm quite down for that 😅

Not sure if you're aware but related tests updated by myself and @georglauterbach and perhaps your own experience have had failures due to the sleep approach required by current check-for-changes approach, and increasing the sleep or bytes checked by logs to accommodate has accidentally resulted in false positives where the author didn't consider the logs they were checking might be unrelated older logs from a previous change event... and probably introduced some race conditions.


TL;DR

watchexec has the promise of:

  • Less maintenance woes.
  • Probably faster resolution to any issues compared to existing history with the current bash script.
  • File watching with automatic polling fallback for users that need it (eg NFS).

Bash solution may be simple, but that doesn't always make it appropriate to prefer bash for the sake of it. Personally, watchexec seems it'd be simpler and more flexible to maintain on our end. I'm more concerned about check-for-changes logic as a whole, rather than what this PR alone is solving.

I'm not against reviewing and merging this PR in the meantime though, but I think there's a strong case for watchexec at a later date. We can always roll back to your bash approach if I'm wrong about watchexec.

Copy link
Copy Markdown
Member

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Just a few minor changes to consider.

Comment thread test/mail_changedetector.bats
Comment thread target/scripts/check-for-changes.sh Outdated
Comment thread target/scripts/check-for-changes.sh Outdated
Comment thread target/scripts/check-for-changes.sh
@NorseGaud NorseGaud force-pushed the improvement/check-for-changes-settle branch from 8a71d5c to 9ee0290 Compare March 2, 2022 01:03
@NorseGaud
Copy link
Copy Markdown
Member Author

Seeing some flakiness in tests. I'm going to run them a few more times to see what and how often it fails.

@casperklein
Copy link
Copy Markdown
Member

As always, good summary @polarathene 👍

Regarding watchexec: If I understand correctly, we would also need a work-around for at least NFS. I think we should aim a one-for-all solution (if possible).

watchexec already handles that and switches to polling instead of more efficient file watcher. It additionally has a debounce setting, so that it won't run any script if an event triggers again within a duration (must be idle for X time after last event to run script), and a throttle IIRC to ensure that it will not debounce forever, at least once per Y time.

This sounds like a jack of all trades 😄 If that works reliable and the code base can be considerably reduced, I think it's worth to evaluate this solution.

But until then, there are no reasons to not improve the current script with this PR.

polarathene
polarathene previously approved these changes Mar 2, 2022
@NorseGaud NorseGaud added area/scripts kind/improvement Improve an existing feature, configuration file or the documentation priority/low labels Mar 2, 2022
@NorseGaud
Copy link
Copy Markdown
Member Author

Flakey test:

not ok 183 checking relay hosts: default mapping is added from env vars for new virtual user entry in 11546ms
[191](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:191)
# (from function `run_until_success_or_timeout' in file test/test_helper/common.bash, line 58,
[192](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:192)
#  in test file test/mail_with_relays.bats, line 62)
[193](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:193)
#   `run_until_success_or_timeout 10 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map' failed

@georglauterbach
Copy link
Copy Markdown
Member

Flakey test:


not ok 183 checking relay hosts: default mapping is added from env vars for new virtual user entry in 11546ms

[191](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:191)

# (from function `run_until_success_or_timeout' in file test/test_helper/common.bash, line 58,

[192](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:192)

#  in test file test/mail_with_relays.bats, line 62)

[193](https://github.com/docker-mailserver/docker-mailserver/runs/5399413386?check_suite_focus=true#step:8:193)

#   `run_until_success_or_timeout 10 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map' failed

Difficult to say but I have never experienced this test to be flakey with other PRs or the current code base. How certain we be that this test really is flakey and it's not the PR's fault?

@polarathene
Copy link
Copy Markdown
Member

From the shared log, it's possible that the timeout of 10 seconds is too short due to the extra sleeps and slower CI hardware vs local testing (I recall this being a gotcha in the past for other maintainers). If the test is consistently passing locally it's likely the cause is CI environment.

There was many more failures in the failed commit earlier, likely all due to this reason.. It may be a good indication to take caution with eventual merging if it's going to ripple through to other PRs?

For tests that have too short timeouts, we could increase them (presumably safe), but others may be failing due to assumptions that delays between service availability wouldn't be as long without any explicit timing? 🤷‍♂️

@casperklein
Copy link
Copy Markdown
Member

casperklein commented Mar 3, 2022

How certain we be that this test really is flakey and it's not the PR's fault?

We merged a lot PRs in the last days and this test was never flaky IIRC.

PS: This test also fails (reliable) when I run the tests locally.

@NorseGaud
Copy link
Copy Markdown
Member Author

Yes, with this commit 94a72dd

Thanks, give me a bit to dig into this.

@NorseGaud
Copy link
Copy Markdown
Member Author

NorseGaud commented Mar 4, 2022

To everyone watching this PR: I figured that I would change the shebang in setup.sh as part of this PR since it was blocking me running tests on macOS (3964c98). It is the same issue described in #2448. I don't expect that it will be a breaking change... What do you think?

assert_output ''
run ./setup.sh -c mail_with_relays alias add [email protected] [email protected]
run_until_success_or_timeout 10 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map
run_until_success_or_timeout 11 docker exec mail_with_relays grep -e domain2.tld /etc/postfix/relayhost_map
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out adding aliases takes 1 second longer than anything else :)

@casperklein
Copy link
Copy Markdown
Member

I see, you add more and more Mac specific workarounds. This will be a cat & mouse game forever. Every PR is only tested against linux. So it's just a matter of time, until something breaks on MacOS again (unnoticed, until you stumble upon it.).
Wouldn't it be much easier, to just spin up a dev (linux) virtual machine and use that? That's what I do, instead of adding Windows specific fixes 😆

@NorseGaud NorseGaud force-pushed the improvement/check-for-changes-settle branch from 38c855a to 077483f Compare March 5, 2022 14:41
@NorseGaud
Copy link
Copy Markdown
Member Author

NorseGaud commented Mar 5, 2022

I see, you add more and more Mac specific workarounds. This will be a cat & mouse game forever. Every PR is only tested against linux. So it's just a matter of time, until something breaks on MacOS again (unnoticed, until you stumble upon it.). Wouldn't it be much easier, to just spin up a dev (linux) virtual machine and use that? That's what I do, instead of adding Windows specific fixes 😆

I agree. We should pull all custom macOS changes from the code and update the Contributor guide to explain macOS users need to use a Linux VM (I can do in a different PR). What does @georglauterbach and @polarathene think?

@georglauterbach
Copy link
Copy Markdown
Member

I see, you add more and more Mac specific workarounds. This will be a cat & mouse game forever. Every PR is only tested against linux. So it's just a matter of time, until something breaks on MacOS again (unnoticed, until you stumble upon it.). Wouldn't it be much easier, to just spin up a dev (linux) virtual machine and use that? That's what I do, instead of adding Windows specific fixes 😆

I agree. We should pull all custom macOS changes from the code and update the Contributor guide to explain macOS users need to use a Linux VM (I can do in a different PR). What does @georglauterbach and @polarathene think?

I very much agree that this should be the way to go :)

@polarathene
Copy link
Copy Markdown
Member

What does @georglauterbach and @polarathene think?

I agree as well. I personally just spin up a VM to run tests.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has become stale because it has been open for 20 days without activity.
This pull request will be closed in 10 days automatically unless:

  • a maintainer removes the meta/stale label or adds the stale-bot/ignore label
  • new activity occurs, such as a new comment

@github-actions github-actions Bot added the meta/stale This issue / PR has become stale and will be closed if there is no further activity label Mar 27, 2022
@NorseGaud
Copy link
Copy Markdown
Member Author

Not stale! I just moved so haven't had time for this

@NorseGaud NorseGaud removed the meta/stale This issue / PR has become stale and will be closed if there is no further activity label Mar 27, 2022
@georglauterbach
Copy link
Copy Markdown
Member

georglauterbach commented Mar 27, 2022

Not stale! I just moved so haven't had time for this

If you want to you can add the stale-bot/ignore label so GH actions won't mark this PR as stale.

PS: Also sorry for the upcoming merge conflicts due to #2498 🙃🙈

@polarathene polarathene added the stale-bot/ignore Indicates that this issue / PR shall not be closed by our stale-checking CI label Mar 27, 2022
@georglauterbach
Copy link
Copy Markdown
Member

This PR is now open for quite some time without recent contribution, and we have quite a lot of merge conflicts here.

I‘d advocate for closing this and, if someone want‘s to try again in the future, open a new PR.

Any objections from @docker-mailserver/maintainers?

@polarathene
Copy link
Copy Markdown
Member

I intend to tackle this within a week once the v11.1 PRs are sorted and I can focus on v12. I'll also be addressing other long-standing draft PRs at that time.

@polarathene polarathene assigned polarathene and unassigned NorseGaud Jun 10, 2022
@polarathene polarathene added this to the v12.0.0 milestone Jun 10, 2022
@georglauterbach georglauterbach modified the milestones: v12.0.0, v13.0.0 Mar 3, 2023
@georglauterbach
Copy link
Copy Markdown
Member

As part of a cleanup series for v13.0.0, I will close this as part of a TODO list (see #3289).

It may be a personal thing, but I like the PR list to be clean and up-to-date with non-stale PRs. I hope this does not offend anyone, but maintaining DMS for me is easier when I see all relevant PRs as open (or non-stale WIP). Maintaining the PR and issue list is a lot of work, especially for me, so I'd like to keep it in a way that is comfortable for me.

This PR is just too old for my taste; this does not mean you cannot incorporate these changes though! Please (re-)open when you're working on this again.

@georglauterbach georglauterbach deleted the improvement/check-for-changes-settle branch April 24, 2023 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/scripts kind/improvement Improve an existing feature, configuration file or the documentation priority/low stale-bot/ignore Indicates that this issue / PR shall not be closed by our stale-checking CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants