tests(refactor): mail_fetchmail.bats + co-locate test cases for processes#3010
Conversation
Extract the test cases for checking a process is running and properly restarts from various test files into a single one: Core (always running): opendkim, opendmarc, master (postfix) ENV dependent: amavi (amavisd-new), clamd, dovecot, fail2ban-server (fail2ban), fetchmail, postgrey, postsrsd, saslauthd These now run off a single container with the required ENV and call a common function (the revised version in parallel test cases).
`saslauth.conf` calls `-O` option for most commands defined with an ENV that may be empty/null. This would cause the process to silently fail / die. This doesn't happen if quote wrapping the ENV, which calls `-O` with an empty string. Not necessary, but since one of `postgrey` ENV were quote wrapped in `supervisor-app.conf`, I've also done the same there.
The PR that introduced the config switched from `true` to `unexpected` without any context. That prevents restart working when the process is killed. Setting to `true` instead will correctly restart the service.
`mail_with_postgrey_disabled_by_default.bats` only checked the migrated test cases, removed as no longer serving a purpose.
The previous version did not ensure that the last checks process was actually restarted, only that it was running. It turns out that `pkill` is only sending the signal, there can be some delay before the original process is actually killed and restarted. This can be identified with `pgrep --older <seconds>`. First ensure the process is at a specified age, then after killing check that the process is not running that is at least that old, finally check that there is a younger process actually running.. (_could fail if a process doesn't restart, or there is a delay such as imposed by `sleep` in wrapper scripts for postfix and fail2ban_) The helper method is not used anywhere else now, move it into this test instead. It has been refactored to accomodate the needs for `--older`, and `--list-full` provides some output that can be matched (similar for `pkill --echo`).
Moves the list of processes into array vars to iterate through instead. If a failure occurs, the process name is visible along with line number in `_should_restart_when_killed()` to identify what went wrong.
Additional coverage to match what other test files were doing before, ensuring that these ENV can prevent their respective service from running.
Not sure about this. It reduces the time of CPU activity (sustained full load on a thread) and increase in memory usage (1GB+ loading signatures database), but as a separate test case it also adds 10 seconds without reducing the time of the test case it was extracted from.
Additionally merges in the parallel test file.
Keep out of the default base config for tests.
Changed the first configs remote and local user values to more clearly document what their values should represent (_and that they don't need to be a full mail address, that's just what our Dovecot is configured with for login_). Shifted the `here` to the end of the `is` line. It's optional syntax, only intended to contrast with the remote `there` for readability. Additionally configured imap protocol. Not tested or verified if that's correct configuration for usage with imap protocol instead. The fetchmail feature tests are currently lacking. Added an inline doc into the fetchmail test to reference a PR about the importance of the trailing `.` in the config. Updated the partial matching to ensure it matches for that in the value as well.
Few minor adjustments. The other ENV for clamd doesn't seem to provide any benefit, trim out the noise. Added a note about why it's been split out. Fetchmail parallel configs are matching the config file path in the process command that is returned. The `.rc` suffix is just to add further clarity to that.
| assert_output --partial "${PROCESS}" | ||
| } | ||
|
|
||
| # NOTE: CONTAINER_NAME is implicit; it should have be set prior to calling. |
There was a problem hiding this comment.
| # NOTE: CONTAINER_NAME is implicit; it should have be set prior to calling. | |
| # NOTE: CONTAINER_NAME is implicit; it should have been set prior to calling. |
There was a problem hiding this comment.
@georglauterbach did you want this change applied?
EDIT: Deferred in favour of merging.
There was a problem hiding this comment.
LGTM 👍🏼
I will add documentation to the functions for process checking with my next PR. I will put out my PR (for this branch) when this PR is merged (resolving the conflicts). Please wait before merging other PRs after this one until my new PR is merged in order to avoid more conflicts :)
Moreover, I'd opt for moving the already adjusted test under test/tests/serial/ into the parallel sets with another PR so we can see which tests are really still left to be revised. This should be done with another PR. @polarathene tell me whether you'd like to do this or whether I should take care of this. We should try to evenly distribute tests across the sets to save as much time as possible.
What do you mean? In our actual docs? Or did you mean regarding the gotchas I found with polling / timeout helper methods?
No worries I'll hold off from tests for the time being 👍
After this PR is merged, the following can be moved:
You're welcome to move the above list (either after this merge, or a separate PR that defers I don't mind doing it myself, I've just been delaying until I've converted the bulk of tests (often with some refactoring first from looking into the test cases history or test times). |
|
I haven't look in detail for what exactly that is used for, just a general thought/improvement for the future: Instead of checking the proccess age, the PID before/after restart could be compared. |
That gets a bit complicated though, as the commented section of notes for each process; there is several that have multiple processes returned. I think they're all children to a single parent process (one of the returned results), so if that can be better filtered that could work too perhaps? I did like matching by process with |
Just what the function does and its parameters.
Probably a good idea to add them as well.
I'll have a look at that later today. Will probably do an intermediate PR to move already revised tests.
Alright, I will tend to it. |
Description
All common test cases that checked if a process was running or that it restarts have been migrated into a new test file that co-locates them. Seems more appropriate to centralize that into one place. Brings serial test time down by about 2 minutes (540 sec run-time was my best result).
Commits have been staged out with additional context in their commit messages to explain certain changes/choices made.
Additionally:
postsrsdin supervisord config had it'sautorestartpolicy asunexpectedinstead oftruelike other services. I looked up the relevant PR and there is a commit that changes tounexpectedbut it provides no context as to why, nor any discussion in the PR thread. Tests aren't showing any issues in switching totruebut would fail to restartpostsrsdwithunexpected.-Owith an ENV, which if unset fails to run the process as the command itself is invalid. Quote wrapping the ENV in the config however treats the arg with an empty string value and runs without configuring the ENV explicitly. I decided to quote wrapped each ENV referenced in supervisord config.Details
The original test cases were not great and the revised versions were a bit too eager to be reliable. A common test function is called by iterating an array of processes to test against.
sleep 5that delay the restart. I've documented local observations of the timings in the test itself.Earlier commits did have separate test cases calling the same function instead of running through a loop. The loop seems fine and correctly provides errors mentioning the involved process in the event of a failure. A similar loop is used for the disabled ENV test case.
Optional Optimization (Remove up to 30 seconds test time)
I was able to bring the enabled test case time down to approx 10 seconds (it can be 20-30 currently, with another 10 if keeping the separate ClamAV test case). It would be a bit more verbose and add extra complexity for maintenance though, so I opted against it.
Realistically it seems to run in 10 seconds vs 20 seconds of the current approach (with clamd not in a separate test case, and wrapper scripts removed). If useful here is that earlier attempt:
Alternative process checking test case
Maintenance notes
Some frustrations in helpers to be aware of if someone maintains the test in future:
run_until_success_or_timeout ...- Cannot assert output if encountering a timeout failure. Only thetimedoutcommand message is emitted.run repeat_until_success_or_timeout ...- Captures the output of each polled function call. Fine for success status, but if you try to refute output against the success, the previous failures get considered as well.repeat_until_success_or_timeout run ...- The first call is always successful even if the function called failed, sincerundoesn't bubble that error up. Causes false positives. Cannot pair with_run_in_container()as a result, caution needs to be taken there (or anything else with an implicitruncall inside the helper).Type of change
Checklist: