fix: Reload the rspamd service instead of restarting it#4632
fix: Reload the rspamd service instead of restarting it#4632casperklein merged 12 commits intodocker-mailserver:masterfrom
rspamd service instead of restarting it#4632Conversation
|
You should provide more context with the PR, appears to be as identified here: #4579 (comment) End of that comment I am unsure of the current PR approach resolving the problem properly due to change detection still being racey (although the issue should be less frequently encountered by dropping the redundant restart). I proposed for change detection to be refactored with adopting |
|
In rspamd is restarted twice, once in _setup_default_signing_conf and the second time in _final_steps as it is mentioned in the referenced issue. |
|
TL;DR:
EDIT: I see your test run was from the merge commit of that PR, so the same spurious failure detailed in #4579 and referenced in #4627 (comment) Change detection is presently at a default 2s poll rate, Rspamd can take a few seconds to complete a restart IIRC. So depending on how close the two restarts are from change detection and the EDIT: Sending a
Yes, but it will still be possible that it is triggered and causes this failure even with your PR here. Here is our change detection logic for rspamd support, it will detect the DKIM files created, then at the end of the snippet restart the rspamd service: docker-mailserver/target/scripts/check-for-changes.sh Lines 187 to 212 in 0f5763c The DKIM keys generated into the DMS Config Volume docker-mailserver/target/bin/rspamd-dkim Lines 163 to 176 in 0f5763c Just depends on the timing 🤔 The logic for both were contributed for DMS v13 back in 2023, where I raised some concerns:
So technically both restarts in this Rspamd DKIM script could be avoided I think? (LDAP is now enabled for change detection support, so I think that'll go smoothly). Doing so would make the following commentary a bit redundant?: docker-mailserver/target/bin/rspamd-dkim Lines 253 to 257 in 0f5763c We may still want to go ahead with the immediate copy though, but instead of a restart a |
There was a problem hiding this comment.
Pull request overview
This PR removes a duplicated rspamd restart call and changes the remaining restart operations to reload operations using SIGHUP signals. The change aims to address issue #4579 by eliminating redundant service restarts and using a more lightweight reload mechanism instead of full restarts.
- Replaced
supervisorctl restart rspamdwithpkill --signal SIGHUP $(pgrep --oldest rspamd)in change detection script - Removed one redundant
supervisorctl restart rspamdcall from rspamd-dkim script - Changed the remaining restart in rspamd-dkim to use SIGHUP signal reload
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| target/scripts/check-for-changes.sh | Changes rspamd restart to reload via SIGHUP when configuration changes are detected |
| target/bin/rspamd-dkim | Removes one redundant restart call and changes the final restart to a reload via SIGHUP |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _log 'debug' 'Rspamd configuration has changed - restarting service' | ||
| supervisorctl restart rspamd | ||
| _log 'debug' 'Reloading Rspamd configuration' | ||
| pkill --signal SIGHUP $(pgrep --oldest rspamd) |
There was a problem hiding this comment.
The command substitution in the pkill command should be quoted to prevent word splitting and globbing issues. Without quotes, if pgrep returns multiple PIDs or unexpected output, the command could fail or behave unexpectedly.
| pkill --signal SIGHUP $(pgrep --oldest rspamd) | |
| pkill --signal SIGHUP "$(pgrep --oldest rspamd)" |
There was a problem hiding this comment.
All the explanation makes sense as the service restarting should be synncronized in some way.
I did the requested changes in the branch but now there is another issue:
https://github.com/scristian71/docker-mailserver/actions/runs/20763050368/job/59622717076
Will try to analyze in the next days.
There was a problem hiding this comment.
The new issue looks to be:
- RSpamd has a wrong config and pkill "reloads" the config but fails to check the status (rspamd is dead)
- User fixes the config
- now a pkill cannot send a signal to a not existing process, and we need to probably start rspamd with
supervisorctl start rspamd
(We are also constantly failing the "[Rspamd] (DKIM) argument 'keysize' is applied correctly for RSA keys" test on our CI)
There was a problem hiding this comment.
And here i changed it to:
local rspamd_pid
rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid=""
if [[ -z ${rspamd_pid} ]]; then
_log 'debug' 'Starting Rspamd after configuration change'
supervisorctl start rspamd
else
_log 'debug' 'Reloading Rspamd configuration'
if ! kill -s SIGHUP "${rspamd_pid}"; then
_log 'error' "Failed to send SIGHUP to Rspamd (PID: ${rspamd_pid})"
fi
fi
| _log 'debug' 'Rspamd configuration has changed - restarting service' | ||
| supervisorctl restart rspamd | ||
| _log 'debug' 'Reloading Rspamd configuration' | ||
| pkill --signal SIGHUP $(pgrep --oldest rspamd) |
There was a problem hiding this comment.
The pkill command lacks error handling. If rspamd is not running, pgrep will fail and the pkill command will execute with no PID argument, which will also fail. This change detection script does not have strict error handling (no set -e), so the failure will be silent. Consider adding error handling to log a warning if the reload fails, or verify that rspamd is running before attempting to signal it.
| pkill --signal SIGHUP $(pgrep --oldest rspamd) | |
| local rspamd_pid | |
| rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid="" | |
| if [[ -z ${rspamd_pid} ]]; then | |
| _log 'warn' 'Rspamd reload requested, but no rspamd process found' | |
| else | |
| if ! pkill --signal SIGHUP "${rspamd_pid}"; then | |
| _log 'warn' "Failed to send SIGHUP to Rspamd (PID: ${rspamd_pid})" | |
| fi | |
| fi |
ap-wtioit
left a comment
There was a problem hiding this comment.
I added the code changes that worked for me locally based on your current PR.
- checking if there is an rspamd pid
- reloading (with sighup) if rspamd pid can be found
- checking that kill was executed successfully
- logging errors not warnings (i missed that pkill did nothing when using the AI suggestion at first and only logging a warning)
- checking the status of rspamd after changes in rspamd-dkim to emit an error if config is broken (the previous
supervisorctl restart rspamddid return an error if the config was broken) but ignoring the error in check-for-changes (as this seems to loop and should start rspamd again after the user fixes the config error)
| @@ -275,10 +272,8 @@ function _transform_public_key_file_to_dns_record_contents() { | |||
| } | |||
|
|
|||
| function _final_steps() { | |||
There was a problem hiding this comment.
This is what seems to work for us:
function _final_steps() {
local rspamd_pid
rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid=""
if [[ -z ${rspamd_pid} ]]; then
supervisorctl start rspamd
elif ! kill -s SIGHUP "${rspamd_pid}"; then
_log 'error' "Failed to send SIGHUP to Rspamd (PID: ${rspamd_pid})"
fi
# check status of rspamd after a possible reload (that could have killed rspamd in case of a breaking config)
supervisorctl status rspamd
_log 'trace' 'Finished DKIM key creation'
}
| _log 'debug' 'Rspamd configuration has changed - restarting service' | ||
| supervisorctl restart rspamd | ||
| _log 'debug' 'Reloading Rspamd configuration' | ||
| pkill --signal SIGHUP $(pgrep --oldest rspamd) |
There was a problem hiding this comment.
And here i changed it to:
local rspamd_pid
rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid=""
if [[ -z ${rspamd_pid} ]]; then
_log 'debug' 'Starting Rspamd after configuration change'
supervisorctl start rspamd
else
_log 'debug' 'Reloading Rspamd configuration'
if ! kill -s SIGHUP "${rspamd_pid}"; then
_log 'error' "Failed to send SIGHUP to Rspamd (PID: ${rspamd_pid})"
fi
fi
|
I have added the changes suggested by @ap-wtioit and the tests are passing now: |
| rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid="" | ||
| if [[ -z ${rspamd_pid} ]]; then |
There was a problem hiding this comment.
| rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid="" | |
| if [[ -z ${rspamd_pid} ]]; then | |
| rspamd_pid=$(supervisorctl pid rspamd) | |
| if (( rspamd_pid == 0 )); then |
There was a problem hiding this comment.
I think this is a bit cleaner. If rspamd is running, the PID is returned, otherwise 0.
There was a problem hiding this comment.
EDIT: Not sure why this is listed under my posted review, it's a reply to earlier feedback.
I'm not sure if it's better since a condition is still required? supervisorctl pid <service name> is neat, but the main difference is instead of an empty value on failure it returns a 0 🤔
Then again, it's clearer for intent than the pgrep command, and might actually be better for our older test case on process restarts where I noted a few services that were less than obvious when piecing that together that extra context was required for maintenance.
Why the (( rspamd_pid == 0 )) condition vs [[ rspamd_pid -eq 0 ]]? (which is what we typically use elsewhere in our scripts AFAIK?)
There was a problem hiding this comment.
The reason was not to avoid the condition. "grepping" for something that we (supervisord) already know seemed odd to me.
Why the
(( rspamd_pid == 0 ))condition vs[[ rspamd_pid -eq 0 ]]? (which is what we typically use elsewhere in our scripts AFAIK?)
That was just a mistake. Personally I use (( )) for pure arithmetic stuff. You are right, in DMS, we should stick to the style we typically use elsewhere 👍
| # check status of rspamd after a possible reload (that could have killed rspamd in case of a breaking config) | ||
| supervisorctl status rspamd |
There was a problem hiding this comment.
This doesn't work as you think. If rspamd fails to start due an invalid configuration, supervisor will immediately start rspam again (in a loop).
The output will almost always be "running".
There was a problem hiding this comment.
Ok than my local tests were not doing what i thought was going on. I was on the wrong track there. I mistook the pkill not working with pids (that i discovered later) for one of the negative tests leaving the config in a dirty state.
|
This is in response to @ap-wtioit review input, but on the main thread to keep the added information below visible.
Whoops that's my bad sorry 😓 I was referencing some test logic I wrote a while back but didn't pay attention to how the PID was used. This is additional context for my benefit on the topic. No need to read below this unless it interests you 👍 NotesUsing the $ pgrep --oldest rspamd
1360
$ pgrep --list-full rspamd
1360 rspamd: main process
1639 rspamd: rspamd_proxy process (127.0.0.1:11332)
1640 rspamd: rspamd_proxy process (127.0.0.1:11332)
1641 rspamd: controller process (0.0.0.0:11334)
1642 rspamd: hs_helper process
# Reload (NOTE: `--echo` produces the output lines instead of being silent):
$ pkill --echo --signal SIGHUP rspamd
rspamd killed (pid 1360)
rspamd killed (pid 1639)
rspamd killed (pid 1640)
rspamd killed (pid 1641)
rspamd killed (pid 1642)
# Only sends signal to the main rspamd process:
# (NOTE: This still respawns the child processes)
$ pkill --oldest --echo --signal SIGHUP rspamd
rspamd killed (pid 1360)
# Original PID 1360 wasn't actually "killed":
# (Child processes were respawned)
$ pgrep --full --list-full rspamd
1360 rspamd: main process
2115 rspamd: rspamd_proxy process (127.0.0.1:11332)
2116 rspamd: rspamd_proxy process (127.0.0.1:11332)
2117 rspamd: controller process (0.0.0.0:11334)
2118 rspamd: hs_helper process
# Process tree (`pstree` via `psmisc` package):
$ pstree -p -s "$(pgrep --oldest rspamd)"
dumb-init(1)---supervisord(7)---rspamd(1360)-+-rspamd(2115)
|-rspamd(2116)
|-rspamd(2117)
`-rspamd(2118)NOTE: The increasingly larger PID of the child processes appears to be from Related logs (different PIDs from another run), completes within 1 second: Rspamd SIGHUP reload event logsRspamd service management and failure handlingIf Rspamd service itself is not running, I don't think we need to start the service. Supervisord handles restarting the main process when the service was started during container init, and it will handle restarting if the rspamd process is killed abruptly too (which is what the linked test case covers, albeit it is missing rspamd coverage it should still be handled the same).
Rspamd shouldn't be dead? The actions run failure linked by @scristian71 has test cases fail with: Not too different from the error at the start of my associated report (minus the earlier log lines related to a However the current failure of ( $ pkill --signal SIGHUP 123
$ echo $?
1I also tried to trigger a bad config failure, but this didn't affect the I'm lacking further context on where bad config is causing rspamd to die in our test suite 😅 Side-Note - Accidental
|
polarathene
left a comment
There was a problem hiding this comment.
I'm still a bit inclined towards pkill as shown below, but open to feedback from @casperklein or @georglauterbach if using supervisorctl pid <service name> is the preferred approach (both variants are covered below as suggestions).
I was a bit slow to give feedback, but thanks for looking into this too @ap-wtioit ❤️
We want to avoid duplicating this functionality across the two locations, historically that diverges over time with contributors / reviewers forgetting to carry changes over to the other location of duplicate code.
Instead we now have the utils.sh helper that we can extract that to, both of these files have that already in scope, so you just need to apply the suggestions I've provided and add the following to our utils.sh:
function _reload_rspamd() {
local PID_RSPAMD
PID_RSPAMD="$(pgrep --oldest rspamd)"
if [[ -n ${PID_RSPAMD} ]]; then
_log 'debug' 'Reloading Rspamd configuration'
if ! kill -s SIGHUP "${PID_RSPAMD}"; then
_log 'error' "Failed to send `SIGHUP` to Rspamd (PID: ${PID_RSPAMD})"
fi
fi
}@casperklein made a nice suggestion for getting the rspamd PID via supervisorctl pid rspamd. This returns 0 on failure (eg: service not running).
The behaviour of kill with a PID of 0 kills processes in the same process group as the current process:
https://man7.org/linux/man-pages/man2/kill.2.html
If pid equals 0, then sig is sent to every process in the process group of the calling process.
Which in this case would terminate check-for-changes.sh process or your shell running the setup CLI command. Neither ideal, so we still need the variable + condition:
function _reload_rspamd() {
local PID_RSPAMD
PID_RSPAMD=$(supervisorctl pid rspamd)
if [[ ${PID_RSPAMD} -ne 0 ]]; then
_log 'debug' 'Reloading Rspamd configuration'
if ! kill -s SIGHUP "${PID_RSPAMD}"; then
_log 'error' "Failed to send `SIGHUP` to Rspamd (PID: ${PID_RSPAMD})"
fi
fi
}-neoperator for number comparison (note the lack of quote wrappingPID_RSPAMDnow.- Probably still want the conditional as there is output you cannot silence if encountering an error and we may still want to output our own error message?:
That wouldn't happen with the logic AFAIK though..
$ kill -s SIGHUP 12345 bash: kill: (12345) - No such process $ echo $? 1
Variant - Without error support
The error log could be omitted if we're sure PID_RSPAMD being a value other than 0 is valid? (no string output from supervisorctl):
function _reload_rspamd() {
local PID_RSPAMD
PID_RSPAMD=$(supervisorctl pid rspamd)
if [[ ${PID_RSPAMD} -ne 0 ]]; then
_log 'debug' 'Reloading Rspamd configuration'
kill -s SIGHUP "${PID_RSPAMD}"
fi
}$ kill -s SIGHUP $(supervisorctl pid what)
bash: kill: No: arguments must be process or job IDs
bash: kill: such: arguments must be process or job IDs
bash: kill: process: arguments must be process or job IDs
bash: kill: what: arguments must be process or job IDsActual failure in the above case would occur at the -ne conditional:
[[: No such process what: syntax error in expression (error token is "such process what")
# If switching to a string value and using `!= '0'` comparision instead of `-ne 0`:
$ kill -s SIGHUP "$(supervisorctl pid what)"
bash: kill: No such process what: arguments must be process or job IDsAlternatively... if we don't bother with the PID we can just use pkill (as I demonstrated here):
function _reload_rspamd() {
_log 'debug' 'Reloading Rspamd configuration'
if ! pkill --signal SIGHUP --oldest 'rspamd'; then
_log 'error' "Failed to send `SIGHUP` to the main rspamd process"
fi
}| rspamd_pid=$(pgrep --oldest rspamd) || rspamd_pid="" | ||
| if [[ -z ${rspamd_pid} ]]; then |
There was a problem hiding this comment.
EDIT: Not sure why this is listed under my posted review, it's a reply to earlier feedback.
I'm not sure if it's better since a condition is still required? supervisorctl pid <service name> is neat, but the main difference is instead of an empty value on failure it returns a 0 🤔
Then again, it's clearer for intent than the pgrep command, and might actually be better for our older test case on process restarts where I noted a few services that were less than obvious when piecing that together that extra context was required for maintenance.
Why the (( rspamd_pid == 0 )) condition vs [[ rspamd_pid -eq 0 ]]? (which is what we typically use elsewhere in our scripts AFAIK?)
|
I was about to write a comment for the _reload_rspamd utils method. When i read Supervisor/supervisor#53 once again and came to Supervisor/supervisor#53 (comment). It seems there is a Edit: confirmed that for us the tests are working with |
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
|
I have applied the suggested changes. Locally the tests are passing. |
rspamd service instead of restarting it
georglauterbach
left a comment
There was a problem hiding this comment.
Didn't read the whole conversation, but the changes look sensible.
There was a problem hiding this comment.
I'll apply these (nevermind, not given permission to do so)
In addition to these final changes, please add a Changelog entry just above this section:
docker-mailserver/CHANGELOG.md
Lines 11 to 13 in 0f5763c
Copy/paste this:
- **Rspamd:**
- Configuration changes now trigger a service reload instead of a restart ([#4632](https://github.com/docker-mailserver/docker-mailserver/pull/4632))Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
polarathene
left a comment
There was a problem hiding this comment.
LGTM 👍 Thanks for taking the time to contribute! ❤️
polarathene
left a comment
There was a problem hiding this comment.
My mistake, once this change is applied we can merge :)
Co-authored-by: Brennan Kinney <[email protected]>

Description
Context: #4579 (comment)
Fixes the failure by requesting Rspamd reload configuration via sending a
HUPsignal to the mainrspamdprocess.supervisorctl restart rspamdwas prone to a race condition causing tests to fail and sometimes putting the rspamd service into a bad state.As per review below, the decision was to switch send the signal via
supervisorctl signal <signal> <service name>.Type of change
Checklist
docs/)CHANGELOG.md