
tests(refactor): mail_fetchmail.bats + co-locate test cases for processes#3010

Merged
polarathene merged 16 commits into docker-mailserver:master from
polarathene:tests/refactor-check-restart-processes
Jan 18, 2023

Conversation

@polarathene
Member

@polarathene polarathene commented Jan 17, 2023

Description

All common test cases that checked if a process was running or that it restarts have been migrated into a new test file that co-locates them. Seems more appropriate to centralize that into one place. Brings serial test time down by about 2 minutes (540 sec run-time was my best result).

Commits have been staged out with additional context in their commit messages to explain certain changes/choices made.

Additionally:

  • The fetchmail test files were merged into one and updated to the new format. A postgrey test that only checked the service was disabled by ENV was also removed as redundant.
  • postsrsd in the supervisord config had its autorestart policy set to unexpected instead of true like other services. I looked up the relevant PR; a commit there changes it to unexpected, but provides no context as to why, nor is there any discussion in the PR thread. Tests show no issues when switching to true, but with unexpected they would fail to restart postsrsd.
  • The SASLAuthd config for supervisord has most commands use -O with an ENV; if that ENV is unset, the process fails to run since the command itself becomes invalid. Quote wrapping the ENV in the config instead passes the arg as an empty string, and the process runs without the ENV being configured explicitly. I decided to quote-wrap each ENV referenced in the supervisord config.
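As a rough illustration only (the program name, path, and ENV name below are made up, not the actual supervisor-app.conf entries), the two supervisord changes described above look like:

```ini
; Hypothetical sketch of the two supervisord changes:
[program:example-service]
; Quote-wrapping the ENV expansion means an unset variable becomes an
; empty-string argument instead of producing an invalid command line:
command=/usr/sbin/example-service -O "%(ENV_EXAMPLE_OPTION)s"
; Was `unexpected` for postsrsd, which prevented a restart after a kill:
autorestart=true
```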

Details

The original test cases were not great, and early revisions were a bit too eager to be reliable. A common test function is called by iterating over an array of processes to test against.

  • It'll wait until the process has been running for a little bit of time.
  • Then kill the process and wait until that process is no longer running (by polling the process age until no result), as some processes take a couple seconds to be killed.
  • Finally, the new process should be running, but we poll again as this isn't always immediate. Postfix and Fail2Ban are managed via wrapper scripts with a `sleep 5` that delays the restart. I've documented local observations of the timings in the test itself.

Earlier commits did have separate test cases calling the same function instead of running through a loop. The loop seems fine and correctly provides errors mentioning the involved process in the event of a failure. A similar loop is used for the disabled ENV test case.
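Outside of Bats and Docker, the three phases above can be sketched with plain procps-ng tools (the dummy `sleep` process, pattern, and timings are illustrative; the real test targets services inside the container):

```shell
#!/bin/bash
# Dummy stand-in for a supervised service, with a distinctive argument:
N=7431
MIN_AGE=2
sleep "${N}" &

# 1. Wait until the process has been running for at least MIN_AGE seconds:
until pgrep --full --older "${MIN_AGE}" "sleep ${N}" >/dev/null; do sleep 1; done
echo "process is at least ${MIN_AGE}s old"

# 2. pkill only sends the signal; the process may take a moment to die:
pkill --full "sleep ${N}"

# 3. Poll until no process that old still matches (with a supervisor, a
#    younger replacement would now exist, which `--older` filters out):
while pgrep --full --older "${MIN_AGE}" "sleep ${N}" >/dev/null; do sleep 1; done
echo "old process is gone"
```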

Optional Optimization (Remove up to 30 seconds test time)

I was able to bring the enabled test case time down to approx 10 seconds (it can be 20-30 currently, with another 10 if keeping the separate ClamAV test case). It would be a bit more verbose and add extra complexity for maintenance though, so I opted against it.

Realistically it runs in 10 seconds vs 20 seconds for the current approach (with clamd not in a separate test case, and wrapper scripts removed). If useful, here is that earlier attempt:

Alternative process checking test case
@test "enabled - should restart processes when killed" {
  local MIN_PROCESS_AGE=4

  # Average time: 4 seconds (7 seconds with chroot + wrapper)
  run repeat_until_success_or_timeout 10 _should_have_all_processes_running "${MIN_PROCESS_AGE}"
  assert_success

  # Average time: 2 seconds
  _should_kill_all_processes_successfully

  # Average time: 3 seconds
  local PROCESS_LIST=("${ENABLED_PROCESS_LIST[@]}")
  run repeat_until_success_or_timeout 10 _should_have_restarted_all_processes "${MIN_PROCESS_AGE}"
  assert_success

  # Average time: 1 second (4-6 seconds with chroot + wrapper)
  run repeat_until_success_or_timeout 10 _should_have_all_processes_running
  assert_success
}

function _is_process_running() {
  local PROCESS=${1}
  local MIN_SECS_RUNNING
  [[ -n ${2} ]] && MIN_SECS_RUNNING="--older ${2}"

  docker exec "${CONTAINER_NAME}" pgrep --list-full ${MIN_SECS_RUNNING} "${PROCESS}"
}

function _should_have_all_processes_running() {
  for PROCESS in "${ENABLED_PROCESS_LIST[@]}"
  do
    if [[ ! $(_is_process_running "${PROCESS}" "${1}") =~ "${PROCESS}" ]]
    then
      echo "(fail) '${PROCESS}' is not running"
      return 1
    fi
  done
}

# Kill each process to trigger a restart:
function _should_kill_all_processes_successfully() {
  for PROCESS in "${ENABLED_PROCESS_LIST[@]}"
  do
    _run_in_container pkill --echo "${PROCESS}"
    assert_output --partial "${PROCESS}"
  done
}

# NOTE: Processes that have been restarted are removed from additional iterations,
# so that they don't fail the check while waiting on other processes to restart.
# NOTE: `--older` being too low (eg: 2) will match restarted process age, causing failure.
function _should_have_restarted_all_processes() {
  local STILL_RUNNING=()

  for i in "${!PROCESS_LIST[@]}"
  do
    local PROCESS="${PROCESS_LIST[$i]}"

    if [[ "$(_is_process_running "${PROCESS}" "${1}")" =~ "${PROCESS}" ]]
    then
      # Output helpful for debugging failures:
      echo "(fail) '${PROCESS}' is still running"
      STILL_RUNNING+=("${PROCESS}")
    else
      # Output helpful for debugging failures:
      echo "(pass) Excluding '${PROCESS}' from next iteration"
      unset PROCESS_LIST[$i]
    fi
  done

  # Fail if any older processes were still found running:
  [[ ${#STILL_RUNNING[@]} -eq 0 ]]
}

Maintenance notes

Some frustrations in helpers to be aware of if someone maintains the test in future:

  • run_until_success_or_timeout ... - Cannot assert output when a timeout failure occurs. Only the timed-out command message is emitted.
  • run repeat_until_success_or_timeout ... - Captures the output of every polled function call. Fine for asserting success status, but if you refute output on success, output from the earlier failed attempts is matched as well.
  • repeat_until_success_or_timeout run ... - The first call is always treated as successful even if the called function failed, since run doesn't bubble that error up. This causes false positives, so it cannot be paired with _run_in_container() (or anything else with an implicit run call inside the helper); caution needs to be taken there.
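A plain-bash demo of that third gotcha (the `swallow` function here is a stand-in for bats' `run`, not the real helper): a wrapper that swallows the exit status makes a retry loop accept the very first attempt.

```shell
#!/bin/bash
# `swallow` mimics a wrapper (like bats' `run`) that captures the status
# and itself exits successfully:
swallow() { "$@"; return 0; }

attempts=0
always_fail() { attempts=$((attempts + 1)); return 1; }

# The retry loop stops after one attempt; a false positive:
for _ in 1 2 3; do swallow always_fail && break; done
echo "attempts with wrapper: ${attempts}"    # prints "attempts with wrapper: 1"

# Without the wrapper the failure propagates and all retries run:
attempts=0
for _ in 1 2 3; do always_fail && break; done || true
echo "attempts without wrapper: ${attempts}" # prints "attempts without wrapper: 3"
```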

Type of change

  • Improvement (non-breaking change that does improve existing functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Extract the test cases for checking a process is running and properly restarts from various test files into a single one:

Core (always running):
opendkim, opendmarc, master (postfix)

ENV dependent:
amavi (amavisd-new), clamd, dovecot, fail2ban-server (fail2ban), fetchmail, postgrey, postsrsd, saslauthd

These now run off a single container with the required ENV and call a common function (the revised version in parallel test cases).
`saslauth.conf` passes the `-O` option to most commands with an ENV value that may be empty/null. This would cause the process to silently fail / die.

This doesn't happen if quote wrapping the ENV, which calls `-O` with an empty string.

Not necessary, but since one of the `postgrey` ENVs was quote-wrapped in `supervisor-app.conf`, I've done the same there.
The PR that introduced the config switched from `true` to `unexpected` without any context. That prevents the restart from working when the process is killed. Setting it to `true` instead will correctly restart the service.
`mail_with_postgrey_disabled_by_default.bats` only checked the migrated test cases, removed as no longer serving a purpose.
The previous version did not ensure that the checked process was actually restarted, only that it was running.

It turns out that `pkill` only sends the signal; there can be some delay before the original process is actually killed and restarted.

This can be identified with `pgrep --older <seconds>`. First ensure the process has reached a specified age; after killing it, check that no process at least that old is still running; finally check that a younger process is actually running (_this could fail if a process doesn't restart, or if there is a delay such as the `sleep` imposed by the wrapper scripts for postfix and fail2ban_).

The helper method is not used anywhere else now, so it is moved into this test instead. It has been refactored to accommodate the needs of `--older`, and `--list-full` provides some output that can be matched (similarly for `pkill --echo`).
Moves the list of processes into array vars to iterate through instead.

If a failure occurs, the process name is visible along with line number in `_should_restart_when_killed()` to identify what went wrong.
Additional coverage to match what other test files were doing before, ensuring that these ENV can prevent their respective service from running.
Not sure about this.

It reduces the duration of CPU activity (sustained full load on a thread) and of the increased memory usage (1GB+ for loading the signatures database), but as a separate test case it also adds 10 seconds without reducing the time of the test case it was extracted from.
Additionally merges in the parallel test file.
Keep out of the default base config for tests.
Changed the first config's remote and local user values to more clearly document what they should represent (_they don't need to be a full mail address; that's just what our Dovecot is configured with for login_).

Shifted the `here` to the end of the `is` line. It's optional syntax, only intended to contrast with the remote `there` for readability.

Additionally configured the imap protocol. Not tested or verified that this is the correct configuration for use with the imap protocol. The fetchmail feature tests are currently lacking.

Added an inline doc into the fetchmail test to reference a PR about the importance of the trailing `.` in the config. Updated the partial matching to ensure it matches for that in the value as well.
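For illustration only, a minimal fetchmail config in the shape described (host, credentials, and local address are placeholders, not the repo's actual test config), with `here` shifted to the end of the `is` line:

```
# Hypothetical fetchmail config sketch (placeholder values):
poll 'remote.example.test' proto imap
  user 'remote-user' pass 'secret' is 'user1@localhost.localdomain' here
```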
A few minor adjustments. The other ENV for clamd doesn't seem to provide any benefit, so that noise is trimmed out. Added a note about why it's been split out.

Fetchmail parallel configs are matching the config file path in the process command that is returned. The `.rc` suffix is just to add further clarity to that.
@polarathene polarathene added priority/medium area/tests kind/improvement Improve an existing feature, configuration file or the documentation area/configuration (file) service/fetchmail labels Jan 17, 2023
@polarathene polarathene added this to the v12.0.0 milestone Jan 17, 2023
@polarathene polarathene self-assigned this Jan 17, 2023
assert_output --partial "${PROCESS}"
}

# NOTE: CONTAINER_NAME is implicit; it should have be set prior to calling.
Member Author


Suggested change
# NOTE: CONTAINER_NAME is implicit; it should have be set prior to calling.
# NOTE: CONTAINER_NAME is implicit; it should have been set prior to calling.

Member Author

@polarathene polarathene Jan 17, 2023


@georglauterbach did you want this change applied?

EDIT: Deferred in favour of merging.

Member

@georglauterbach georglauterbach left a comment


LGTM 👍🏼


I will add documentation to the functions for process checking with my next PR. I will put out my PR (for this branch) when this PR is merged (resolving the conflicts). Please wait before merging other PRs after this one until my new PR is merged in order to avoid more conflicts :)


Moreover, I'd opt for moving the already adjusted tests under test/tests/serial/ into the parallel sets with another PR, so we can see which tests are really still left to be revised. @polarathene tell me whether you'd like to do this or whether I should take care of it. We should try to evenly distribute tests across the sets to save as much time as possible.

@polarathene
Member Author

I will add documentation to the functions for process checking with my next PR.

What do you mean? In our actual docs?

Or did you mean regarding the gotchas I found with polling / timeout helper methods?

Please wait before merging other PRs after this one until my new PR is merged in order to avoid more conflicts :)

No worries I'll hold off from tests for the time being 👍


I'd opt for moving the already adjusted test under test/tests/serial/ into the parallel sets with another PR so we can see which tests are really still left to be revised.

After this PR is merged, the following can be moved:

  • mail_changedetector.bats
  • mail_fetchmail.bats
  • mail_lmtp_ip.bats
  • mail_tls_dhparams.bats
  • setup-cli.bats (Converted but I will still refactor it a bit, it can be moved though)

tests.bats has been partially converted. My plan is to continue splitting smaller test files out of it. A bunch of other tests can be converted quickly without trouble; they could be bundled into another PR if you'd like to wait on that before moving into parallel sets, or deferred to another PR if you'd rather get the existing list above relocated first.

tell me whether you'd like to do this or whether I should take care of this.
We should try to evenly distribute tests across the sets to save as much time as possible.

You're welcome to move the above list (either after this merge, or a separate PR that defers mail_fetchmail.bats to a 2nd round with others).

I don't mind doing it myself, I've just been delaying until I've converted the bulk of tests (often with some refactoring first from looking into the test cases history or test times).

@casperklein
Member

I haven't looked in detail at what exactly that is used for; just a general thought/improvement for the future:

test/tests/parallel/set3/process-check-restart.bats:

# (this allows us to more confidently check the process was restarted)
# Wait until process has been running for at least MIN_PROCESS_AGE:

Instead of checking the process age, the PID before/after restart could be compared.
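The PID-comparison idea could look roughly like this (a plain-shell sketch where a dummy `sleep` stands in for a supervised service, and a manual relaunch stands in for the supervisor's restart):

```shell
#!/bin/bash
# Start a dummy "service" and record its PID:
sleep 60 & OLD_PID=$!

# Kill it and reap it:
kill "${OLD_PID}"
wait "${OLD_PID}" 2>/dev/null || true

# A supervisor would respawn it; emulate that by starting it again:
sleep 60 & NEW_PID=$!

# The restart is confirmed by the PID having changed:
if [[ "${OLD_PID}" -ne "${NEW_PID}" ]]; then
  echo "restarted: PID changed (${OLD_PID} -> ${NEW_PID})"
fi

kill "${NEW_PID}"
```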

@polarathene
Member Author

polarathene commented Jan 18, 2023

the PID before/after restart could be compared.

That gets a bit complicated though; as per the commented notes for each process, several have multiple processes returned.

I think they're all children of a single parent process (one of the returned results), so if that can be better filtered, that could work too perhaps?

I did like matching the process with --list-full so the returned output contains information about the actual process instead of just PID(s) though. At least while writing the tests it was helpful for debugging (master at one point was matching the Postfix master process, as well as an Amavis one that had "(master)" in its command line).

@polarathene polarathene merged commit fb82082 into docker-mailserver:master Jan 18, 2023
@georglauterbach
Member

I will add documentation to the functions for process checking with my next PR.

What do you mean? In our actual docs?

Just what the function does and its parameters.

Or did you mean regarding the gotchas I found with polling / timeout helper methods?

Probably a good idea to add them as well.

[...] A bunch of other tests can be converted quickly without trouble; they could be bundled into another PR if you'd like to wait on that before moving into parallel sets, or deferred to another PR if you'd rather get the existing list above relocated first.

I'll have a look at that later today. Will probably do an intermediate PR to move already revised tests.

You're welcome to move the above list (either after this merge, or a separate PR that defers mail_fetchmail.bats to a 2nd round with others).

Alright, I will tend to it.
