SpamAssassin training and spam cleanup script

Spam is a constant battle as it is ever changing and always creeping into your Inbox. Spam wrangling is only effective with proper training, SpamAssassin does a decent job out-of-the-box but needs users input to truly be effective. This script will run SpamAssassin’s built in sa-learn tool against users known spam and known ham.

With my setup — A spam message is received (postfix) and identified as spam (spamassassin), it will be moved to the users Junk directory per the Sieve (dovecot) rule at the bottom of this post. If a spam message is received and not matched as such, it will be delivered to the Inbox. The user will identify it and manually move it to the Junk directory. I like to configure Thunderbird to “Move new junk messages to: “Junk” folder on: … in Account Settings.” Now a user marks a message as Junk, it automatically gets moved.

When this script runs it will mark messages in the Inbox as ham and messages in the Junk directory as spam. The command “sa-learn –sync” adds these results to the Bayes database that SpamAssassin consults when determining a spam score for each received message. This database is optionally backed up in the event a mistake was made and you need to revert back to previous versions. The script logs to /var/log/train-mail.log, information about how much spam and ham is being added, how many total messages have been processed and stats on the auto clean feature built into sa-learn can be gleaned. Additionally, I setup a spam cleanup using find and rm.

If sa-learn scans a mail as ham when it is spam, or vice versa, simply move the messages to the correct directory (Inbox=ham/Junk=spam), and the mistake will be corrected on the next run. SpamAssassin will automatically ‘forget’ the previous indications.

Its important to have an equal part ham to spam. As a result I run this script daily in an effort to capture users ham before they delete it or sort it into sub folders. Another thing to be aware of is that typically you should aim to train with at least 1000 messages of spam, and 1000 ham messages, if possible. More is better, but anything over about 5000 messages does not improve accuracy significantly in SpamAssassins tests.

Obviously a lot of these options are site/user specific. This is a good foundation to use as-is or build from.

#!/bin/bash

#specify one or more users, space padded [user=(user1 user2 user3)] or empty [user=()] to include all users. All users is considered uid ≥ 1000.
user=()

#After how many days should Spam be deleted?
cleanafter=30

#backup path, comment out to disable backups
bk=/home/backup/sa-learn_bayes_`date +%F`.backup

log=/var/log/train-mail.log
#log=/dev/stdout

echo -e "\n`date +%c`"  >> $log 2>&1

if [ -z ${user[@]} ]; then
echo user is empty, using all users from system
user=(`awk -F':' '$3 >= 1000 && $3 < 65534' /etc/passwd |awk -F':' '{print $1}'`)
fi

for u in ${user[@]}; do
if [ ! -d /home/$u/Maildir ]; then
echo "No such Maildir for $u" >> $log 2>&1
else
echo "Proceeding with ham and spam training on user \"$u\""
#add all messages in "junk" directory to spamassassin
echo spam >> $log 2>&1
#change this path to match your spam directory, in this case its "Junk"
#add current and new messages in Junk directory as spam
sa-learn --no-sync --spam /home/$u/Maildir/.Junk/{cur,new} >> $log 2>&1
echo ham >> $log 2>&1
#only add current mail to ham, not new. This gives user a chance to move it to spam dir.
sa-learn --no-sync --ham /home/$u/Maildir/{cur} >> $log 2>&1
fi
done

#sync the journal created above with the database
echo sync >> $log 2>&1
sa-learn --sync >> $log 2>&1
if [ $? -eq 0 ]; then
for u in ${user[@]}; do
echo "deleting spam for $u older than 30 days" >> $log 2>&1
find /home/$u/Maildir/.Junk/cur/ -type f -mtime +$cleanafter -exec rm {} \;
done
else
echo "sa-learn wasn't able to sync. Something is broken. Skipping spam cleanup"
fi

echo "Statistics:" >> $log 2>&1
sa-learn --dump magic >> $log 2>&1
echo ============================== >> $log 2>&1

if [ -n $bk ]; then
echo "backup writing to $bk" >> $log 2>&1
sa-learn --backup > $bk
fi

Here is my sieve rule for moving messages that are marked as spam to my Junk directory. I setup roundcube for people to manage their sieve filters.

$ cat /home/jason/sieve/managesieve.sieve
require ["fileinto"];
# rule:[SPAM]
if header :contains "X-Spam-Status" "Yes"
{
	fileinto "Junk";
	stop;
}

Posted

in

, ,

by

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *