Switching from SpamAssassin+Bogofilter+Exim to Rspamd+Exim+Dovecot

For more than ten years, I used SpamAssassin and Bogofilter along with Exim to filter spams, along with SFP and Greylisting directly within Exim.

Why changing?

I must say that almost no spam reached me unflagged for years. Why changing anything then?

First, I have more users and the system was not really multiuser-aware. For instance, the bayesian filter training cronjob had configured SPAMDIR, etc.

Second, my whole setup was based on using specific transports and routers in exim to send mails first to bogofilter, then to spamassassin. It means that filtering is done after SMTP-time, when the mail has been already accepted. You filter but do not discourage or block spam sources.

Rspamd?

General
Written inC/LuaPerlC
Process modelevent drivenpre-forked poolLDA and pre-forked
MTA integrationmilter, LDA, custommilter, custom (Amavis)LDA
Web interfaceembedded3rd party
Languages supportfull, UTF-8 conversion/normalisation, lemmatizationnaïve (ASCII lowercase)naïve
Scripting supportLua APIPerl plugins
LicenceApache 2Apache 2GPL
Development statusvery activeactiveabandoned

Rspamd seems activitely developed and easy to integrate not only with Exim, the SMTP, but also with Dovecot, which is use as IMAPS server.

Instead of having:

Exim SMTP accept with greylist -> bogofilter -> spamassassin -> procmail -> dovecot 

The idea is to have:

Exim SMTP accept with greylist and rspamd -> dovecot with sieve filtering 

It blocks rejects/discard spam earlier and makes filtering easier in a multiuser environment (sieve is not dangerous, unlike procmail, and can be managed by clients, if desirable)

My new setup is contained in my rien-mx package: the initial greylist system is still there.

Exim

What matters most is acl_check_rcpt definition (already used in previous version) and new acl_check_data definition.:

### acl/41_rien-check_data_spam
#################################
# based on https://rspamd.com/doc/integration.html
# -  using CHECK_DATA_LOCAL_ACL_FILE included in the acl_check_data instead a creating a new acl
# - and scan all the messages no matter the source:
#    because some might be forwarded by smarthost client, requiring scanning with no defer/reject

## process earlier scan

# find out if a (positive) spam level is already set
warn
  condition = ${if match{$h_X-Spam-Level:}{\N\*|\+\N}}
  set acl_m_spamlevel = $h_X-Spam-Level:
warn
  condition = ${if match{$h_X-Spam-Bar:}{\N\*|\+\N}}
  set acl_m_spamlevel = $h_X-Spam-Bar:
warn
  condition = ${if match{$h_X-Spam_Bar:}{\N\*|\+\N}}
  set acl_m_spamlevel = $h_X-Spam_Bar:

# discard high probability spam identified by earlier scanner
# (probably forwarded by a friendly server, since it is unlikely that a spam source would shoot
# itself in the foot, no point to generate bounces)
discard
  condition = ${if >={${strlen:$acl_m_spamlevel}}{15}}
  log_message = discard as high-probability spam announced

# at least make sure X-Spam-Status is set if relevant
warn
  condition = ${if and{{ !def:h_X-Spam-Status:}{ >={${strlen:$acl_m_spamlevel}}{6} }}}
  add_header = X-Spam-Status: Yes, earlier scan ($acl_m_spamlevel)

# accept content from relayed hosts with no spam check
# unless registered in final_from_hosts (they are outside the local network)
accept
  hosts = +relay_from_hosts
  !hosts = ${if exists{CONFDIR/final_from_hosts}\
		      {CONFDIR/final_from_hosts}\
		      {}}

# rename earlier reports and score
warn
  condition = ${if def:h_X-Spam-Report:}
  add_header = X-Spam-Report-Earlier: $h_X-Spam-Report:
warn
  condition = ${if def:h_X-Spam_Report:}
  add_header = X-Spam-Report-Earlier: $h_X-Spam_Report:
warn
  condition = ${if def:h_X-Spam-Score:}
  add_header = X-Spam-Score-Earlier: $h_X-Spam-Score:
warn
  condition = ${if def:h_X-Spam_Score:}
  add_header = X-Spam-Score-Earlier: $h_X-Spam_Score:


# scan the message with rspamd
warn spam = nobody:true
# This will set variables as follows:
# $spam_action is the action recommended by rspamd
# $spam_score is the message score (we unlikely need it)
# $spam_score_int is spam score multiplied by 10
# $spam_report lists symbols matched & protocol messages
# $spam_bar is a visual indicator of spam/ham level

# remove foreign headers except spam-status, because it better to have twice than none 
warn
  remove_header = x-spam-bar : x-spam_bar : x-spam-score : x-spam_score : x-spam-report : x-spam_report : x-spam_score_int : x-spam_action : x-spam-level
  
# add spam-score and spam-report header
# (possible to add condition to add header rspamd recommend:
#   condition  = ${if eq{$spam_action}{add header})
warn
  add_header = X-Spam-Score: $spam_score
  add_header = X-Spam-Report: $spam_report

# add x-spam-status header if message is not ham
# do not match when $spam_action is empty (e.g. when rspamd is not running)
warn
  ! condition  = ${if match{$spam_action}{^no action\$|^greylist\$|^\$}}
  add_header = X-Spam-Status: Yes

# add x-spam-bar header if score is positive
warn
  condition = ${if >{$spam_score_int}{0}}
  add_header = X-Spam-Bar: $spam_bar

## delay/discard/deny depending on the scan
  
# use greylisting with rspamd
# (unless coming from authenticated or relayed host)
defer message    = Please try again later
   condition  = ${if eq{$spam_action}{soft reject}}
   !hosts = ${if exists{CONFDIR/final_from_hosts}\
		       {CONFDIR/final_from_hosts}\
		       {}}
   !authenticated = *
   log_message  = greylist $sender_host_address according to soft reject spam filtering

# high probability spam get silently discarded if 
# coming from authenticated or relayed host
discard
   condition  = ${if eq{$spam_action}{reject}}
   hosts = ${if exists{CONFDIR/final_from_hosts}\
		       {CONFDIR/final_from_hosts}\
		       {}}
   log_message  = discard as high-probability spam from final from host

discard
   condition  = ${if eq{$spam_action}{reject}}
   authenticated = *
   log_message  = discard as high-probability spam from authentificated
   
# refuse high probability spam from other sources
deny  message    = Message discarded as high-probability spam
   condition  = ${if eq{$spam_action}{reject}}
   log_message	= reject mail from $sender_host_address as high-probability spam

These two will take to send through rspamd and accept/reject/discard mails.

A dovecot_lmtp transport is also necessary:

dovecot_lmtp:   
  debug_print = "T: dovecot_lmtp for $local_part@$domain"   
  driver = lmtp   
  socket = /var/run/dovecot/lmtp   
  #maximum number of deliveries per batch, default 1   
  batch_max = 200   
  # remove suffixes/prefixes   
  rcpt_include_affixes = false 

There are also other internal files, especially in conf.d/main. For instance. If you want to follow my setup, you are encouraged to download the whole mx/etc/exim folder at least. Most files have comments, easy to find out if they are relevant or not. Or you can just copy/paste relevant settings into etc/conf.d/main/10_localsettings, like for instance:

# path of rspamd 
spamd_address = 127.0.0.1 11333 variant=rspamd 

# data acl definition 
CHECK_DATA_LOCAL_ACL_FILE =  /etc/exim4/conf.d/acl/41_rien-check_data_spam

# memcache traditional greylioting
GREY_MINUTES  = 0.4
GREY_TTL_DAYS = 25
# we greylist servers, so we keep it to the minimum required to cross-check with SPF
#   sender IP, sender domain
GREYLIST_ARGS = {${quote:$sender_host_address}}{${quote:$sender_address_domain}}{GREY_MINUTES}{GREY_TTL_DAYS}

Other files exim4/conf.d/ are useful for other local features a bit outside the scope of this article (business per target email aliases, specific handling of friendly relays, SMTP forward to specific authenticated SMPT for specific domains when sending mails).

Dovecot

This assumes that dovecot already works (with all components installed). Nonetheless, you need to edit LTMP delivery by editing /etc/dovecot/conf.d/20-lmtp.conf as follow:

# to be added
lmtp_proxy = no
lmtp_save_to_detail_mailbox = no
lmtp_rcpt_check_quota = no
lmtp_add_received_header = no 

protocol lmtp {
  # Space separated list of plugins to load (default is global mail_plugins).
  mail_plugins = $mail_plugins
  # remove domain from user name
  auth_username_format = %n
}

You also need to edit /etc/dovecot/conf.d/90-sieve.conf:

# to be added

 # editheader is restricted to admin global sieve
 sieve_global_extensions = +editheader

 # run global sieve (sievec must ran manually every time they are updated)
 sieve_before = /etc/dovecot/sieve.d/

You also need to edit /etc/dovecot/conf.d/20-imap.conf:

protocol imap {   
  mail_plugins = $mail_plugins imap_sieve
}

You also need to edit /etc/dovecot/conf.d/90-plugin.conf:

plugin {
  sieve_plugins = sieve_imapsieve sieve_extprograms
  sieve_extensions = +vnd.dovecot.pipe +vnd.dovecot.environment

  imapsieve_mailbox4_name = Spam
  imapsieve_mailbox4_causes = COPY APPEND
  imapsieve_mailbox4_before = file:/usr/local/lib/dovecot/report-spam.sieve

  imapsieve_mailbox5_name = *
  imapsieve_mailbox5_from = Spam
  imapsieve_mailbox5_causes = COPY
  imapsieve_mailbox5_before = file:/usr/local/lib/dovecot/report-ham.sieve

  imapsieve_mailbox3_name = Inbox
  imapsieve_mailbox3_causes = APPEND
  imapsieve_mailbox3_before = file:/usr/local/lib/dovecot/report-ham.sieve

  sieve_pipe_bin_dir = /usr/local/lib/dovecot/
}

You need custom scripts to train dovecot: both shell and sieve filters. /usr/local/lib/dovecot/report-ham.sieve:

require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"];

if environment :matches "imap.mailbox" "*" {
  set "mailbox" "${1}";
}

if string "${mailbox}" "Trash" {
  stop;
}

if environment :matches "imap.user" "*" {
  set "username" "${1}";
}

pipe :copy "sa-learn-ham.sh" [ "${username}" ];

/usr/local/lib/dovecot/report-spam.sieve:

require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"];

if environment :matches "imap.user" "*" {
  set "username" "${1}";
}

pipe :copy "sa-learn-spam.sh" [ "${username}" ];

/usr/local/lib/dovecot/sa-learn-ham.sh

#!/bin/sh
exec /usr/bin/rspamc learn_ham

/usr/local/lib/dovecot/sa-learn-spam.sh

#!/bin/sh 
exec /usr/bin/rspamc learn_spam

Then you need a /etc/dovecot/sieve.d similar as mine to put all site-wide sieve scripts. Mine are shown as example of what can be done easily with sieve. Regarding spam, they will only flag spam. End user sieve filter will matter:

#; -*-sieve-*-
require ["editheader", "regex", "imap4flags", "vnd.dovecot.pipe", "copy"];

# simple flagging for easy per-user sorting
# chained, so only a single X-Sieve-Mark is possible

## flag Spam
if anyof (
	  header :regex "X-Spam-Status" "^Yes",
	  header :regex "X-Spam-Flag" "^YES",
	  header :regex "X-Bogosity" "^Spam",
	  header :regex "X-Spam_action" "^reject")
{
  # flag for the mail client
  addflag "Junk";
  # header for further parsing
  addheader "X-Sieve-Mark" "Spam";
  # autolearn
  pipe :copy "sa-learn-spam.sh";
}
## sysadmin
elsif address :localpart ["from", "sender"] ["root", "netdata", "mailer-daemon"]
{
  addheader "X-Sieve-Mark" "Sysadmin";
} 
## social network
elsif address :domain :regex ["to", "from", "cc"] ["^twitter\.",
						   "^facebook\.",
						   "^youtube\.",
						   "^mastodon\.",
						   "instagram\."]
{
  addheader "X-Sieve-Mark" "SocialNetwork";
}
## computer related
elsif address :domain :regex ["to", "from", "cc"] ["debian\.",
						   "devuan\.",
						   "gnu\.",
						   "gitlab\.",
						   "github\."]
{
  addheader "X-Sieve-Mark" "Cpu";
}

Each time scripts are modified in this folder, sievec must be run by root (because otherwise sieve script are compiled by current user, which cannot write in /etc for obvious reasons):

sievec -D /etc/dovecot/sieve.d
sievec -D /usr/local/lib/dovecot

Finally, as example of final user sieve script (to put in ~/.dovecot.sieve:

#; -*-sieve-*-
require ["fileinto", "regex", "vnd.dovecot.pipe", "copy"];

if header :is "X-Sieve-Mark" "Spam"
{
   # no care for undisclosed recipients potential false positive
  if address :contains ["to", "cc", "bcc"] ["undisclosed recipients", "undisclosed-recipients"]
		{
		  discard;
		  stop;
		}

  # otherwise just put in dedicated folder		
  fileinto "Spam";
  stop;
}

Rspamd

Rspamd was installed by devuan/debian package (not clear to me why Rspamd people discourage using these packages on their website, lacking context). It work out of the box.

I also installed clamav, razor and redis. Rspamd require lot of small tuning, check the folder /etc/rspamd

To get razor running, it pass request through /etc/xinetd.d/razor:

service razor
{
#	disable		= yes
	type		= UNLISTED
	socket_type     = stream
	protocol	= tcp
	wait		= no
	user		= _rspamd
	bind            = 127.0.0.1
	only_from	= 127.0.0.1
	port		= 11342
	server		= /usr/local/bin/razord
}

going along with wrapper script /usr/local/bin/razord:

#!/bin/sh
/usr/bin/razor-check && echo -n "spam" || echo -n "ham"

It is configured to work in the same way with pyzor but so far it does not work (not clear to me why – seems also an IPv6 issue, see below).

I noticed issues with IPv6: so far my mail servers are still IPv4 only and Rspamd nonetheless tries sometimes to connect on IPv6. I solved issue by commenting ::1 localhost in /etc/hosts.

Results

So far it works as expected (except the issues IPv4 vs IPv6 and pyzor). Rspamd required a bit more work than expected, but once it is going, it seems good.

Obviously, in the process, I lost the benefit of the well trained Bogofilter, but I hope soon enough Rspamd own bayesian filters will kick in.

In my setup there are extra files related to replicating over multiple servers that I might cover in another article (replication of email, sieve users filter through nextcloud and redis shared database via stunnel). The switch to Rspamd+Exim+Dovecot made this replication of multiples servers much better.

UPDATE: pipe :copy vs execute :pipe

Using pipe :copy in sieve script is actually causing issues. Sieve pipe is a disposition-type action, it is intended to deliver the message, similarly to a fileinto or redirect command. As such, if the command return failure, sieve filter stop. That is not desirable, if we use rspamd with learn_condition (defined in statistic.conf) to avoid multiple learning of the same file, etc. It would lead to such error in logs and sieve scripts prematurely ended:

[dovecot log]
Apr  9 21:08:10 mx dovecot: lmtp(userx)<31807><11nYLZrZUWI/fAAA4k3FvQ>: program exec:/usr/local/lib/dovecot/sa-learn-spam.sh (31810): Terminated with non-zero exit code 1
Apr  9 21:08:10 mx dovecot: lmtp(userx)<31807><11nYLZrZUWI/fAAA4k3FvQ>: Error: sieve: failed to execute to program `sa-learn-spam.sh': refer to server log for more information.
Apr  9 21:08:10 mx dovecot: lmtp(userx)<31807><11nYLZrZUWI/fAAA4k3FvQ>: sieve: msgid=<20220409190707.0A81E808EB@xxxxxxxxxxxxxxxxxxx>: stored mail into mailbox 'INBOX'
Apr  9 21:08:10 mx dovecot: lmtp(userx)<31807><11nYLZrZUWI/fAAA4k3FvQ>: Error: sieve: Execution of script /etc/dovecot/sieve.d/20_rien-mark.sieve failed, but implicit keep was successful

[rspamd log]
2022-04-09 21:08:10 #4264(controller) <eb2984>; csession; rspamd_stat_classifier_is_skipped: learn condition for classifier bayes returned: already in class spam; probability 93.38%; skip classifier
2022-04-09 21:08:10 #4264(controller) <eb2984>; csession; rspamd_task_process: learn error: all learn conditions denied learning spam in default classifier

We got an implicit keep, with an already known and identified spam forcefully sent to INBOX due to learning failure since it was already known.

Using execute :pipe instead solves the issue and match what we really want: the spam/ham learning process is extra step, it is neither involved in the filtering or delivery of the message. Its failure or success is irrelevant to the delivery process.

Using execute, the non-zero error return code from the executed script will be logged too, but without any other effect, especially not stopping sieve further processing:

[dovecot log]
Apr 10 15:12:09 mx dovecot: lmtp(userx)<3450><iqoWCKnXUmJ6DQAA4k3FvQ>: program exec:/usr/local/lib/dovecot/sa-learn-spam.sh (3451): Terminated with non-zero exit code 1
Apr 10 15:12:09 mx dovecot: lmtp(userx)<3450><iqoWCKnXUmJ6DQAA4k3FvQ>: sieve: msgid=<6252b1ee.1c69fb81.8a4e2.f847@xxxxxxxxxxxx>: fileinto action: stored mail into mailbox 'Spam'

[rspamd log]
2022-04-10 15:12:09 #9025(controller) <820d3e>; csession; rspamd_task_process: learn error: all learn conditions denied learning spam in default classifier

Check dovecot related files for up-to-date example/version:/etc/dovecot /usr/local/lib/dovecot

Avoiding Spams with SPF and greylisting within Exim

A year ago, I posted an article describing a way to slay Spams with both Bogofilter and SpamAssassin embedded in exim. This method was proven effective for my mailboxes:  since then, during a timespan of one year, Bogofilter caught ~ 85 % of actual spams, SpamAssassin (called only if mail not already flagged defavorably by Bogofilter) caught ~ 15 %. Do the math, I had almost none to flag by hand.

Why would I change such setup? For fun, obviously 🙂

Actually, I made no change, I just implemented SPF (Sender Policy Framework) and greylisting.

I noticed that plenty of spams were sent to my server @thisdomain claiming to be sent by whoever@thisdomain. These dirty spams were easily caught by the duo Bogofilter / SpamAssassin, but still, it annoyed me that @thisdomain was misused. SPF allows, using DNS records, to list which servers/computers are allowed to send mails from addresses @thisdomain.   SPF checks are predefined in Exim out of the box, so I’ll skip its configuration. The relevant DNS record (with bind9), allowing only two boxes (primary and secondary mail servers) designated by their IP to send mails @thisdomain, looks like:

thisdomain. IN  TXT  "v=spf1 ip4:78.249.xxx.xxx ip4:86.65.xxx.xxx -all"

Result: Since I implemented SPF on my domains, there was no change in the number of spam caughts. However, during this period, my primary server list of temporary bans dropped from 200/100 IPs to 40/20 IPs. I cannot pinpoint with certainty the cause of this evolution because the temporary bans list depends on plenty of things. But, surely, pretending to send mails @thesedomainsgrilledbySPF surely lost some interest for spambots. Implementing SPF is actually not about helping ourselves directly but indirectly: reducing effectiveness of spambots helps everybody.

I use greylisting on my secondary mail server since a while and I noticed over years that this one almost never had to ban IPs. Not that he never received spam, but that he almost never received mails from very obvious spam sources identified at STMP time.  Seems that most very obvious spam sources never insist enough to pass through greylisting. I guess that most spambots are coded to skip any mail server that does not immediatly accept a proper SMTP transaction, because it has no time to waste, considering how little is the percentage of spams sent actually reaching someone real.

This greylisting use the following files (an assumes memcached and libcache-memcached-perl are properly installed):

So I gave a try using greylist my primary mail server, but with a very short waiting time, because 5 minutes, for example, to receive mail from a not-yet-known  source is not acceptable. So I edited the relevant conf.d/main/ file to GREY_MINUTES = 0.5 and GREY_TTL_DAYS = 25.

Result: no changes regarding the number of caught spams. However, like on the secondary mail server, the number of banned IPs is near to none. Looks like most obvious spam sources don’t wait even only 30 seconds – actually, it’s a very acute choice as they would be anyway banned if they did.

Slaying Spams with both Bogofilter and SpamAssassin embedded in exim

Ads are spam. Good thing with the internet’s ads is that you can set up countermeasures.

(Disclaimer: yes, there is nothing new here, just an example of setup)

I have plenty of email addresses from different providers, some are definitely history. I could go through the websites of all of these and set up forwarding for the one I no longer use but still want to be able to get mail from, just in case. Well, I would do that if I was using my mail client to fetch mails – because otherwise fetching mails would actually take ages.

But, as I have a local home underclocked 🙂 server, I find way easier and potent to, instead, use ESR’s fetchmail to download them all to a single account that is accessed by my mail client through IMAPS. I have a /etc/fetchmailrc like:

poll pop.free.fr with proto POP3
user 'XXX' there with password 'XXX' is 'localuser' here
poll imap.gmail.com with proto IMAP
user '[email protected]' there with password 'XXX' is 'localuser' here with ssl
user '[email protected]' there with password 'XXZ' is 'localuser' here with ssl

Fetchmail download mails than then relies on the installed SMTP, which is Exim, to deliver it to end user account mailbox accessible through IMAPS.

What’s so nifty nifty about? Well, mails will also be filtered for spam. As it happens on the local home server, it will be unnoticeable for the end user that is me. We’ll use several anti-spam tools, not caring about redundancy and time-consumption: DNSBLs, Bogofilter, SpamAssassin, razor2.

So, here we go. Note that Exim (exim4) in Debian use the user Debian-exim. localuser is the recipient end-user, it belongs to the group localuser name after himself.
We will add Debian-user to the group localuser and create a system group dedicated to spamchecking to easily share bayesian databases.:

# addgroup --system spamslayer
# adduser Debian-exim spamslayer
# adduser Debian-exim localuser
# adduser localuser spamslayer

* Bogofilter is a bayesian spam filter . It is said to be faster and lesser time consuming than the SpamAssassin’s own bayesian filter so will run mails through it first. It is installed with the debian package.

Edit /etc/bogofilter.cf as follows:

bogofilter_dir=/var/lib/bogofilter
db_transaction=yes

The bayes directory must be created by hand:

# mkdir /var/lib/bogofilter
# chgrp spamslayer /var/lib/bogofilter
# chmod 2777 /var/lib/bogofilter

* SpamAssassin is a powerful, at the cost of time-consumption, spam-killer. It is installed with the debian package.

In the following site-wide config /etc/spamassassin/local.cf, I use bayesian filters, razor2, several DNSBLs and I adjust some tests according to my needs:

# Save spam messages as a message/rfc822 MIME attachment instead of
# modifying the original message (0: off, 2: use text/plain instead)
#
# Keep as it is because bogofilter would not learn properly otherwise,
# as it cannot distinguish report from the spam.
report_safe 0
# Set which networks or hosts are considered 'trusted' by your mail
# server (i.e. not spammers)
#
trusted_networks 192.168.1.
# Locales
#
# (I only receive mails in English or French)
ok_locales en fr
# Set the threshold at which a message is considered spam (default: 5.0)
#
required_score 3.3
# Use Bayesian classifier (default: 1)
#
# (I created the relevant directory)
use_bayes 1
bayes_file_mode 0777
bayes_path /var/lib/spamassassin-bayes/bayes
score BAYES_20 0.3
score BAYES_40 0.5
score BAYES_50 0.8
score BAYES_60 1
score BAYES_80 2
score BAYES_95 2.5
score BAYES_99 6
# Bayesian classifier auto-learning (default: 1)
#
# (I may change that, not sure about it)
bayes_auto_learn 1
# Set headers which may provide inappropriate cues to the Bayesian
# classifier
#
bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
# use razor
# (/etc/razor is the standard debian path)
use_razor2 1
razor_config /etc/razor/razor-agent.conf
score RAZOR2_CF_RANGE_51_100 3.2
# some rbl checks are already made by exim, at RCPT time, not all.
skip_rbl_checks 0
rbl_timeout 30
score RCVD_IN_SBL 15
score RCVD_IN_XBL 15
score RCVD_IN_SORBS_HTTP 15
score RCVD_IN_SORBS_SOCKS 15
score RCVD_IN_SORBS_MISC 15
score RCVD_IN_SORBS_SMTP 15
score RCVD_IN_SORBS_ZOMBIE 15
# adjust some tests scores: lower DUL test
score FROM_ENDS_IN_NUMS 0.2
score FROM_HAS_MIXED_NUMS 0.2
score FROM_HAS_MIXED_NUMS3 0.2
score RCVD_IN_NJABL_DUL 0.1
score RCVD_IN_SORBS_DUL 0.1
# lower stupid test
score DNS_FROM_SECURITYSAGE 0.0
# adjust some tests scores
score FAKE_HELO_HOTMAIL 3
score FORGED_HOTMAIL_RCVD 3
score HTML_FONT_BIG 2.4
score NO_REAL_NAME 2
score RCVD_IN_BL_SPAMCOP_NET 3
score SUBJ_ILLEGAL_CHARS 4.8
score EXTRA_MPART_TYPE 2.8
score SUBJ_ALL_CAPS 2.6
# increase all scores related to drugs: what do I care, duh
score DRUGS_ANXIETY 5
score DRUGS_ANXIETY_EREC 5
score DRUGS_ANXIETY_OBFU 5
score DRUGS_DIET 5
score DRUGS_DIET_OBFU 5
score DRUGS_ERECTILE 5
score DRUGS_ERECTILE_OBFU 5
score DRUGS_MANYKINDS 10
score DRUGS_MUSCLE 5
score DRUGS_PAIN 5
score DRUGS_PAIN_OBFU 5
score DRUGS_SLEEP 5
score DRUGS_SLEEP_EREC 5
score DRUGS_SMEAR1 5
# same goes for porn
score AMATEUR_PORN 5
score BEST_PORN 5
score DISGUISE_PORN 5
score DISGUISE_PORN_MUNDANE 5
score FREE_PORN 5
score HARDCORE_PORN 5
score LIVE_PORN 5
score PORN_15 5
score PORN_16 5
score PORN_URL_MISC 5
score PORN_URL_SEX 5
score PORN_URL_SLUT 5

The bayes directory must be created:

# mkdir /var/lib/spamassassin-bayes
# chown Debian-exim /var/lib/spamassassin-bayes
# chmod 0777 /var/lib/spamassassin-bayes

Obviously, it implies that razor2 must be properly installed. We install the debian package then set it up. Remember it must run with user Debian-exim, so we do:

# chown -R Debian-exim:spamslayer /etc/razor
# su Debian-exim
$ razor-admin -home=/etc/razor -register
$ razor-admin -home=/etc/razor -create
$ razor-admin -home=/etc/razor -discover

To save ressources, we start SpamAssassin as a daemon (spamd), that will be called using its specific client (spamc). Before using the initd script, edit as follows /etc/defaut/spamassassin:

# Change to one to enable spamd
ENABLED=1
# SpamAssassin uses a preforking model, so be careful! You need to
# make sure --max-children is not set to anything higher than 5,
# unless you know what you're doing.
OPTIONS="--create-prefs --max-children 5 --helper-home-dir -u Debian-exim -g spamslayer"
# Cronjob
# Set to anything but 0 to enable the cron job to automatically update
# spamassassin's rules on a nightly basis
CRON=1

All that being do, you’ll want to (re)start the daemon with the relevant initd script (/etc/init.d/spamassassin restart here).

* Now we’ll tune Exim to call all by himself first Bogofilter and then SpamAssassin, if necessary only. We use splitted configuration in /etc/exim4/conf.d/. That is debian-specific I think but it does make any difference anyway.

First we define useful transports in /etc/exim4/conf.d/transport/35_spamblock (the name 35_spamblock is arbitrary and the number does not matter here):

spamslay_bogofilter:
driver = pipe
command = /usr/sbin/exim4 -oMr spamslayed-bogofilter -bS
use_bsmtp = true
transport_filter = /usr/bin/bogofilter -l -p -e
home_directory = "/tmp"
current_directory = "/tmp"
# must use a privileged user to set $received_protocol
# on the way back in!
user = Debian-exim
group = spamslayer
log_output = true
return_fail_output = true
return_path_add = false
message_prefix =
message_suffix =
#
spamslay_spamd:
driver = pipe
command = /usr/sbin/exim4 -oMr spamslayed-spamd -bS
use_bsmtp = true
transport_filter = /usr/bin/spamc
home_directory = "/tmp"
current_directory = "/tmp"
# must use a privileged user to set $received_protocol
# on the way back in!
user = Debian-exim
group = spamslayer
log_output = true
return_fail_output = true
return_path_add = false
message_prefix =
message_suffix =

Second we define routers, here in /etc/exim4/conf.d/router/450_spamblock – the order matters, here it is just after 400_exim4-config_system_aliases and before 500_exim4-config_hubuser:

# spam checking
# first bogofilter
spamslay_router_bogofilter:
debug_print = "R: bogofilter for $local_part@$domain received with protocol $received_protocol with X-Spam-Flag=$h_X-Spam-Flag and X-Bogosity=$h_X-Bogosity"
# When to scan a message :
# - it isn't already flagged as spam
# - it has not yet been spamslayed at all
# - it isn't local ($received_protocol eq "" or local)
condition = "${if and{ {!eqi{$h_X-Spam-Flag:}{yes}} {!eq{$received_protocol}{spamslayed-bogofilter}} {!eq{$received_protocol}{spamslayed-spamd}} {!eq{$received_protocol}{local}} {!eq{$received_protocol}{}} }}"
driver = accept
transport = spamslay_bogofilter
#
# second spamd
spamslay_router_spamd:
debug_print = "R: spamd for $local_part@$domain received with protocol $received_protocol with X-Spam-Flag=$h_X-Spam-Flag and X-Bogosity=$h_X-Bogosity"
# When to scan a message :
# - it isn't already flagged as spam
# - it has not yet been spamslayed with SA
# - it isn't local ($received_protocol eq "" or local)
condition = "${if and { {!eqi{$h_X-Spam-Flag:}{yes}} {!match{$h_X-Bogosity:}{^Spam}} {!eq {$received_protocol}{spamslayed-spamd}} {!eq{$received_protocol}{local}} {!eq{$received_protocol}{}} }}"
driver = accept
transport = spamslay_spamd
#
# This route will send any mail that got here to the devnull alias, that
# should be configured in /etc/aliases to be a real link to /dev/null.
# This route should get only mails that have spam score higher than 14.
# This will affect users mails!
spamslay_killit:
condition = "${if ge{$h_X-Spam-Level:}{\*\*\*\*\*\*\*\*\*\*\*\*\*\*} {1}{0} }"
driver = redirect
data = spam
file_transport = address_file
pipe_transport = address_pipe

* Next step, now that spams are flagged, it makes sense to put them apart in the Maildir that will be accessed through IMAPS. I do this with procmail. We set umask for procmail (the IMAP server is configured as such too) to make sure Debian-exim can access stored mails (we want mode 0640, group read access, so the umask is 666-640=026). Here’s the relevant bit of /home/localuser/.procmailrc:

UMASK=026
IMAPDIR=$HOME/.Maildir/
SPAM=$IMAPDIR".Poubelle.Spam/"
#
:0
* ^X-Spam-Status: Yes
$SPAM
:0
* ^X-Spam-Flag: YES
$SPAM
#
:0
* ^X-Bogosity: Spam
$SPAM

At the same time, we make sure Debian-exim can access mails already there (so not affected by umask):

# cd /home/localuser
# chmod 750 -Rv .Maildir
# chmod 0640 -v `find .Maildir -type f`

(PS: you may want to enforce a more restrictive policy, depending on how your server is accessed – but, anyway, Debian-exim is by essence able to tamper with mails you receive, so it won’t make a big difference)

* Training bayesian filters.

Now that spam ended up in a specific Maildir, both SpamAssassin and Bogofilter bayesians filters must be trained to be effective.

We add the following in /etc/cron.d/bayes:

# trains bayesian filters
SPAMDIR="/home/localuser/.Maildir/.Poubelle.Spam/cur/ /home/localuser/.Maildir/.Poubelle.Spam/new/"
#
# spamd: can handle by itself bogofiltered headers
25 * * * * Debian-exim /usr/bin/sa-learn --spam $SPAMDIR
#
# bogofilter: not able to clean inappropriate cues from spamd, will do it
# by removing:
# - informational SpamAssassin headers
# - SpamAssassin score and decision (irrelevant)
# (-u was not set as it is discouraged perf-wise in bogofilter's manual)
28 * * * * Debian-exim for file in `find $SPAMDIR -type f`; do cat $file; done | grep -v -E "^X-Spam-(Checker|Flag|Level|Report)" | sed s/"^X-Spam-Status.*score.*required.*tests="//g | /usr/bin/bogofilter --register-spam

Obviously, if you want it to learn from plenty of different users, you’ll have to think of something more elaborated 🙂
Anyway, regarding plenty of users, it would actually probably wise to think twice about the whole concept of sharing bayesian filters that may not at all be accurate for very differents users.

One alternative would have been to avoid meddling with Exim and to run both bogofilter and spamd via procmail. Sure, it would not have been site-wide setup but for a few users, ~/.procmailrc can be replicated easily. But actually I enjoy messing with Exim, that’s kind of a hobby. I skipped here the part where we call DNSBLs in Exim (working out-of-the-box anyway). And on a production server, with the SMTP wide opened to the web, it is possible to follow this approach just to shut off spammers at SMTP-time -which induces a huge resources gain- and even ban them.