Skip to content

[BUG] ENABLE_SRS=1 causing high CPU usage with postsrsd #2722

@polarathene

Description

@polarathene

Miscellaneous first checks

  • I checked that all ports are open and not blocked by my ISP / hosting provider.
  • I know that SSL errors are likely the result of a wrong setup on the user side and not caused by DMS itself. I'm confident my setup is correct.

Affected Component(s)

postsrsd

What happened and when does this occur?

Another user reported to upstream postsrsd experiencing this same issue earlier in the year (specifically with DMS).

I noticed this while working on a PR and debugging local test failures. Why the tests do not fail in our CI is unclear (might be due to older environment/packages in CI?):

 ✗ checking configuration: hostname/domainname override: check headers of received mail [235]
   (from function `assert_success' in file test/test_helper/bats-assert/src/assert.bash, line 114,
    in test file test/mail_hostname.bats, line 104)
     `assert_success' failed
   
   -- command failed --
   status : 1
   output : ls: cannot access '/var/mail/localhost.localdomain/user1/new': No such file or directory
   --
 ✗ checking configuration: non-subdomain: check headers of received mail [308]
   (from function `assert_success' in file test/test_helper/bats-assert/src/assert.bash, line 114,
    in test file test/mail_hostname.bats, line 163)
     `assert_success' failed
   
   -- command failed --
   status : 1
   output : ls: cannot access '/var/mail/localhost.localdomain/user1/new': No such file or directory
   --

I had a process monitor active in the background and noticed the four containers spun up in that test each used 25% CPU for the containers postsrsd process.

When starting a single container in my VM (2vCPU) postsrsd was consistently showing 50% CPU usage, despite no mail to process.

The ports 10001 and 10002 are not bound to localhost:

$ ss -tulpn | grep LISTEN

tcp   LISTEN 0      4096       127.0.0.1:10024      0.0.0.0:*                                     
tcp   LISTEN 0      100        127.0.0.1:10025      0.0.0.0:*                                     
tcp   LISTEN 0      100          0.0.0.0:587        0.0.0.0:*                                     
tcp   LISTEN 0      100          0.0.0.0:143        0.0.0.0:*    users:(("dovecot",pid=642,fd=32))
tcp   LISTEN 0      100          0.0.0.0:465        0.0.0.0:*                                     
tcp   LISTEN 0      100          0.0.0.0:65265      0.0.0.0:*    users:(("dovecot",pid=642,fd=54))
tcp   LISTEN 0      100          0.0.0.0:25         0.0.0.0:*                                     
tcp   LISTEN 0      4096       127.0.0.1:8891       0.0.0.0:*                                     
tcp   LISTEN 0      4096       127.0.0.1:8893       0.0.0.0:*                                     
tcp   LISTEN 0      100             [::]:587           [::]:*                                     
tcp   LISTEN 0      100             [::]:143           [::]:*    users:(("dovecot",pid=642,fd=33))
tcp   LISTEN 0      100             [::]:465           [::]:*                                     
tcp   LISTEN 0      100             [::]:65265         [::]:*    users:(("dovecot",pid=642,fd=55))
tcp   LISTEN 0      100             [::]:25            [::]:*

postsrsd process is running:

$ ps aux | grep postsrsd

root         591  0.0  0.0   4864  3872 pts/0    S    02:37   0:00 /bin/bash /usr/local/bin/postsrsd-wrapper.sh
root         623  0.0  0.0   2420  1592 pts/0    S    02:37   0:00 /bin/sh /etc/init.d/postsrsd start
root         774 98.2  0.0   2332   708 pts/0    R    02:37   3:01 /usr/sbin/postsrsd -f 10001 -r 10002 -d example.test -s /etc/postsrsd.secret -a = -n 4 -N 4 -u postsrsd -p /var/run/postsrsd.pid -c /var/lib/postsrsd -l 127.0.0.1 -D -X

However after roughly 10 minutes running like this, postsrsd heavy CPU activity ceases and the ports do show as bound. It's unclear why it takes this long.

What did you expect to happen?

postsrsd should not be consuming so much CPU activity, nor taking about 10 minutes to finish starting up. Possibly stuck in this upstream while loop?

How do we replicate the issue?

  1. Next step uses the config dir of the failed test container, but it should reproduce with a minimal config containing postfix-accounts.cf from the make generate-accounts command.

  2. Run:

    docker run --rm -it \
      --volume "${PWD}/test/duplicate_configs/mail_override_hostname:/tmp/docker-mailserver" \
      --volume "${PWD}/test/test-files:/tmp/docker-mailserver-test:ro" \
      -e ENABLE_SRS=1 \
      -h 'example.com' \
      --name 'postsrsd_cpu' \
      "mailserver/docker-mailserver:edge"
  3. Monitor the postsrsd process, it should show high CPU activity (can be observed on the host, does not require running anything within the container).

  4. Run docker exec -it postsrsd_cpu bash in a separate terminal, and then try send a mail sendmail -f '[email protected]' '[email protected]' < /tmp/docker-mailserver-test/email-templates/existing-user1.txt

  5. DMS will respond to it but fail as it can't reach the PostSRSD service on localhost:10001, as the service is running (ps aux | grep postsrsd) but ports are not yet bound (ss -tulpn | grep LISTEN), and our logs appear to be useless (cat /var/log/supervisor/postsrsd.log only has a line about the service being started, nothing added once it quiets down 10 mins later).

  6. Monitor the postsrsd process, it should cease the heavy workload after a while, and ports for the service finally bound. Mail can then be sent to the local user account as before and actually be received (ls /var/mail/localhost.localdomain/user1).

DMS version

v11.1

What operating system is DMS running on?

Linux

What instruction set architecture is DMS running on?

x86_64 / AMD64

What container orchestration tool are you using?

Docker

Docker CLI command

docker run --rm -it \
  --volume "${PWD}/test/duplicate_configs/mail_override_hostname:/tmp/docker-mailserver" \
  --volume "${PWD}/test/test-files:/tmp/docker-mailserver-test:ro" \
  -e ENABLE_SRS=1 \
  -h 'example.com' \
  --name 'postsrsd_cpu' \
  'mailserver/docker-mailserver:edge'

Relevant log output

Nothing relevant was logged, other than the failure to handle the mail due to the service not yet being up.

Other relevant information

I suspect the environment may have something to do with it.

Environment is:

  • VMWare Workstation Player 16.3 on a Linux 5.17 x86_64 host.
  • VM guest is Fedora 36 (5.18.9 kernel), Docker version is 20.10.17.
  • CPU (2vCPU) under 50% load reported in the guest OS, 4GB RAM + 4GB swap (ZRAM), 2GB of each free.

Network if it's something going wrong with the binding step:

$ ip addr show

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

1511: eth0@if1512: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Code of conduct

Improvements to this form?

It's eating up markdown syntax when some inputs are wrapped in codefences upon submission, while others are assuming formats such as yml for docker-compose.yml config sharing, despite saying you can also provide a docker CLI command instead.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions