SpamAssassin

1/8/2024

SpamAssassin won’t do much if it hasn’t been trained.

While it does come with a few plugins enabled for DKIM, SPF, RBL, and content checks, SpamAssassin is limited unless you train its Bayesian filter.

Bayes' theorem, named after 18th-century British mathematician Thomas Bayes, is a mathematical formula for determining conditional probability. Conditional probability is the likelihood of an outcome occurring, based on a previous outcome occurring. The Bayesian filter compares past content from known spam and ham emails to determine the likelihood that a message is spam.

Training SpamAssassin against your own data is preferred, but it’s only effective if you have a significant amount of spam and ham available. There are a few online databases available to initially feed into SpamAssassin’s Bayesian database. Once the initial training is done, it’s recommended to train SpamAssassin routinely: spam is always changing, and the more training you do with your own data, the more accurate the filter becomes.

The tool to train SpamAssassin is sa-learn. In default usage, it takes a directory of spam or ham emails and adds their tokens to the database. A token is a word or short sequence of characters that is commonly found in spam or ham. It’s important to run sa-learn as the same user who starts spamc inside of your mail content filter. You can either run sa-learn manually or, preferably, add it to a cron job to routinely update the database. sa-learn ignores emails that have already been processed, which prevents adding extra weight to certain tokens.

The --spam and --ham options learn spam and ham respectively from a folder containing emails. This is the common method for use with the Maildir format. It’s also possible to teach SpamAssassin from a single email or from mbox or mbx files. Read about additional options on the man page.

SpamAssassin’s learning utility can handle wildcards inside of the path. This is helpful to quickly update the Bayes database for many users. You may also use curly braces to match one of several possible folder names in the path.

$ sa-learn --spam /var/vmail/*/Maildir/Spam/

There are a few sources of spam and ham emails available online to download. There are also SpamAssassin backups available that you can restore to get started. Using public spam data is helpful to get started but may not be specific to your use case. It’s important to keep training SpamAssassin with the incoming emails you receive.

This is my preferred data source for training as it has an initial database you can restore. Every day the service archives newly received spam for you to train with.

An archive of spam received since 1998. It appears that since 2020 most of their archives don’t have new messages. Training with emails that old won’t be of much help, but they are still active and update their archives daily.
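As a rough sketch of routine training with sa-learn, the script below wraps the --spam and --ham calls in one helper. The ham path, the DRY_RUN switch, and the train function are assumptions made for illustration; only the spam path comes from the article. It prints the commands by default so you can check the paths before training anything.

```shell
#!/bin/sh
# Sketch only: adjust the paths to your own mail layout.
# Run as the same user that starts spamc in your mail content filter.

# Spam path taken from the article's example; the ham path is hypothetical.
SPAM_PATH="${SPAM_PATH:-/var/vmail/*/Maildir/Spam/}"
HAM_PATH="${HAM_PATH:-/var/vmail/*/Maildir/cur/}"
# Curly braces can match several folder names (hypothetical names):
#   SPAM_PATH='/var/vmail/*/Maildir/{Spam,Junk}/'

# Assumption: default to a dry run that only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# train KIND PATH: KIND is "spam" or "ham".
# The path is passed quoted, so sa-learn receives the pattern itself.
train() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'sa-learn --%s %s\n' "$1" "$2"
    else
        sa-learn "--$1" "$2"
    fi
}

train spam "$SPAM_PATH"
train ham "$HAM_PATH"
```

Once the printed commands look right, running the script with DRY_RUN=0 performs the actual training, and the same invocation can go into a cron job for that user.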
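For reference, Bayes' theorem applied to a single token can be written as below. The symbols S (the message is spam), H (the message is ham), and T (the message contains a given token) are notation introduced here for illustration. SpamAssassin’s actual scoring combines many tokens and differs in detail, so treat this as a simplified sketch.

```latex
P(S \mid T) = \frac{P(T \mid S)\,P(S)}{P(T \mid S)\,P(S) + P(T \mid H)\,P(H)}
```

With made-up numbers, if a token appears in 40% of known spam and 1% of known ham, and spam and ham are equally likely beforehand, then P(S | T) = 0.2 / 0.205, or about 0.98. This is why training on enough of both spam and ham matters: both terms in the denominator come from the learned token counts.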
