Skip to content

relieve some pain: restart check-for-changes.sh regularly#2398

Closed
georglauterbach wants to merge 4 commits intomasterfrom
pain-reduction/check-for-changes
Closed

relieve some pain: restart check-for-changes.sh regularly#2398
georglauterbach wants to merge 4 commits intomasterfrom
pain-reduction/check-for-changes

Conversation

@georglauterbach
Copy link
Copy Markdown
Member

@georglauterbach georglauterbach commented Feb 7, 2022

Description

The script does not need to run every single second. This is not to say
that there is huge resource usage due to running every second, but every
ten two seconds is more than enough. This should act like a plaster when
it comes to #2348 -
I KNOW THIS IS NOT A FIX. But this way, people
can first of all stay on Debian 11 and DMS 10.4.0, secondly, we may get
some more feedback on this mysterious issue due to people staying in DMS
10.4.0.
Restarting the script will provide a new PID and should in theory stop the resource leakage. Not sure whether we want to go through with it though, as this will also disguise the error at hand to some degree.

Independently of #2348, increasing the timeout is a very valid concern I had, and on my deployment, I already raised it.

Fixes nothing, but should increase the time before DMS starts using all the RAM on a system.

Type of change

  • Improvement (non-breaking change that does improve existing functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (README.md or the documentation under docs/)
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

The script does not need to run every single second. This is not to say
that there is huge resource usage due to running every second, but every
ten seconds is more than enough. This _should_ act like a plaster when
it comes to #2348 - I KNOW THIS IS **NOT** A FIX. But this way, people
can first of all stay on Debian 11 and DMS 10.4.0, secondly, we may get
some more feedback on this mysterious issue due to people staying in DMS
10.4.0.
@georglauterbach georglauterbach added area/scripts kind/improvement Improve an existing feature, configuration file or the documentation labels Feb 7, 2022
@georglauterbach georglauterbach requested a review from a team February 7, 2022 10:28
@georglauterbach georglauterbach self-assigned this Feb 7, 2022
@georglauterbach
Copy link
Copy Markdown
Member Author

@polarathene would providing the one failing test with more time help here, or do you think the issue is somewhere else entirely?

@oblitum
Copy link
Copy Markdown

oblitum commented Feb 7, 2022

This is probably related to PR #2157.

@georglauterbach
Copy link
Copy Markdown
Member Author

This is probably related to PR #2157.

It is :D But #2157 has stalled.


I'm not even sure whether this PR is the right way, but I'm running my deployment with 10s and all is well for me. Independently of the problem I think that increasing sleep time is a good idea.

@casperklein
Copy link
Copy Markdown
Member

Fixes nothing, but should increase the time before DMS starts using all the RAM on a system.

I think a better approach would be, to restart the changedetector service once a day or so.
Your solution does only slow down a potential memory leak, but will not prevent it. Restarting the service should however "fix" it.

@georglauterbach
Copy link
Copy Markdown
Member Author

Good idea. Will supervisorctl changedetector restart suffice here?

@casperklein
Copy link
Copy Markdown
Member

Yes, I think so. But haven't tested it. If the PID of the script changes afterwards, it works.

@casperklein
Copy link
Copy Markdown
Member

BTW: While reviewing the check-for-changes script, I stumbled upon

(
cd /tmp/docker-mailserver || exit 1
exec sha512sum 2>/dev/null -- \
postfix-accounts.cf \
postfix-virtual.cf \
postfix-aliases.cf \
dovecot-quotas.cf \
/etc/letsencrypt/acme.json \
${CERT_FILES[@]}
)

@georglauterbach @polarathene Do you have an idea, why there is a subshell used + exec statement? IMO it should work without those too. I cannot imagine a reason for that.

@georglauterbach
Copy link
Copy Markdown
Member Author

BTW: While reviewing the check-for-changes script, I stumbled upon

(
cd /tmp/docker-mailserver || exit 1
exec sha512sum 2>/dev/null -- \
postfix-accounts.cf \
postfix-virtual.cf \
postfix-aliases.cf \
dovecot-quotas.cf \
/etc/letsencrypt/acme.json \
${CERT_FILES[@]}
)

@georglauterbach @polarathene Do you have an idea, why there is a subshell used + exec statement? IMO it should work without those too. I cannot imagine a reason for that.

I see no reason why the subshell is there either... maybe the writer wanted to be able to not run the SHA-sum when he/she could not change directories, but this is an awful way of doing it. I think this can be done in a better way :D

Comment thread target/scripts/startup/fixes-stack.sh Outdated
@georglauterbach georglauterbach changed the title relieve some pain: increased sleep duration for check-for-changes.sh relieve some pain: restartcheck-for-changes.sh regularly Feb 7, 2022
@georglauterbach georglauterbach changed the title relieve some pain: restartcheck-for-changes.sh regularly relieve some pain: restart check-for-changes.sh regularly Feb 7, 2022
@georglauterbach georglauterbach requested a review from a team February 7, 2022 21:31
@oblitum
Copy link
Copy Markdown

oblitum commented Feb 7, 2022

I think this can be done in a better way :D

TBH, this spot in the code is the strange part that makes me feel it's the culprit. I wish that, if you folks could refactor this to something less astonishing, we could then test whether that actually removes the issue, before having a PR applied that could daily kill the problem.

@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Feb 7, 2022

I think this can be done in a better way :D

TBH, this spot in the code is the strange part that makes me feel it's the culprit. I wish that, if you folks could refactor this to something less astonishing, we could then test whether that actually removes the issue, before having a PR applied that could daily kill the problem.

I will see where I can provide something here unless @casperklein beats me to it :D @casperklein maybe you can have a look at the mailbox in #2361 if you have some time - not sure how to approach this properly, and you have more experience in this regard.

@casperklein
Copy link
Copy Markdown
Member

casperklein commented Feb 7, 2022

function _fix_restart_changedetector_daily

One last thing: I think it's a good idea, to add a small comment referencing the issue, so it's clear, why this function exists.
Otherwise it LGTM 👍

Edit:

before having a PR applied that could daily kill the problem.

Valid point. Once this "fix" is applied and works, people will not notice that there is a problem, which makes it harder to find the root cause and apply a final fix.

@oblitum
Copy link
Copy Markdown

oblitum commented Feb 8, 2022

Just to be clear, I suspect of sub shell part because it does a full clone of shell script state (without due reason afaik), and this state cloning might be increasing each time it happens, accumulating over previous state (although I couldn't yet realize what could be linearly increasing), while leaving leftovers behind in the parent shell process. This can explain why both RAM and CPU usage increases linearly.

@NorseGaud
Copy link
Copy Markdown
Member

Just to be clear, I suspect of sub shell part because it does a full clone of shell script state (without due reason afaik), and this state cloning might be increasing each time it happens, accumulating over previous state (although I couldn't yet realize what could be linearly increasing), while leaving leftovers behind in the parent shell process. This can explain why both RAM and CPU usage increases linearly.

I think that's a really good point.

@NorseGaud
Copy link
Copy Markdown
Member

I've seen exec used instead of eval but both for the same purpose: allowing interpolation of dynamic variables to happen before execution. Likely why because of the array expansion/interpolation happening. Though, it should work just fine without exec afaik

@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Feb 8, 2022

Currently testing the new version locally. If this works, I'll create another PR and we can close this one.

While the removal of exec is certainly worthwhile, if this fixes the resource leak, it seems as if the problem is with the version of Bash on Debian 11 (some bug in Bash?), or the exec does something very weird we do not understand. I highly doubt that this is a problem with the underlying exec syscall, but maybe Docker does a weird translation as well. Who knows...

@georglauterbach
Copy link
Copy Markdown
Member Author

I will close this in favor of #2401.

@georglauterbach georglauterbach deleted the pain-reduction/check-for-changes branch February 8, 2022 13:15
@polarathene
Copy link
Copy Markdown
Member

Just to add to the discussion. It was maybe related to syscall change with the newer Debian image, perhaps not due to Bash itself but another core package (EDIT: glibc AFAIK).

I don't recall the specifics but Alpine has a similar problem I became aware of a week or so ago (EDIT: this answer explains it, for Debian 10 vs 11 it seems to be related to the glibc update, notably with kernels prior to 5.8), where a change since their 3.14 release resulted in a syscall that older kernels didn't support properly.. I was surprised as it broke the hash command from bash from working correctly (the Sentry.io CLI installer script was failing because of that).

@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Feb 16, 2022

Very interesting, so it was as I suspected earlier in the issue discussion (a kernel thing...). Thank you for sharing the information here, very much appreciated!

EDIT: glibc is broken anyway (if one belives L. Torvalds :D)

@oblitum
Copy link
Copy Markdown

oblitum commented Feb 16, 2022

@polarathene tbh I'm not sure how that relates to a leak due to subshell'ing, I couldn't connect the dots. Pointed problem refers to failing testing file writable, seems not related at first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/scripts kind/improvement Improve an existing feature, configuration file or the documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants