Bayesian Filters


Syneto’s Bayesian antispam module is based on the SpamAssassin library for spam detection and tagging. To enable spam filtering on the POP3 or SMTP protocol, navigate to Email->Antispam and press the corresponding buttons in the ‘Activate’ area (see Figure 1). From the same menu you can choose what happens to detected spam:

  • you can add a text to the subject
  • you can add a custom header to the message
  • you can send a copy to an address
  • you can silently drop the spam
  • you can place it in quarantine

These options can be selected in any combination, though some combinations make little sense: if you add text to the subject or a custom header and also drop the email, the email is dropped along with your text and header.

Warning: if you choose a combination of actions, all of them will be applied to every email identified as spam by either engine, Bayesian or Commtouch.

Right now the spam filter runs with a default configuration which, in most cases, should catch obvious spamming attempts. However, to reduce the amount of spam that passes through it and to reduce the rate of false positives (valid emails that are wrongly identified as spam), the spam filter needs to be trained and fine-tuned. This usually happens over time, so do not expect to have a perfect spam filter in five minutes.

To train and configure the spam filter properly, you need some background on how it operates and what its main parameters are. We also introduce a few terms that help in understanding the concepts. After the basics have been explained, a thorough tour of the interface used to configure, train and update the spam filter follows, showing the preferred modes of operation.
 

 

Figure 1. Enabling spam filtering for SMTP and POP protocols

Background Information on the Bayesian Antispam Engine

The spam filter uses multiple sources of information to categorize an email as spam or ham (ham means legitimate email):

  • a Bayesian engine, which uses a heuristic method to determine whether an email is spam, based on a corpus of spam and a corpus of ham: it does algorithmically what a human would do if presented with two stacks of spam and ham emails. It finds the characteristics of spam, then those of ham, and finally checks which characteristics are predominant in the email that has to be categorized
  • a list of rules that try to match different features of spam emails (keywords, specific headers, etc.)
  • a list of DNS blacklists – servers which list IPs of compromised networks and machines, open relays, open proxies and other problematic sources of spam, viruses and attacks
  • a whitelisting/blacklisting database: whitelisted entries are always marked as ham, while blacklisted entries are always marked as spam
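
The Bayesian item above can be illustrated with a toy token counter. This is only a minimal sketch under stated assumptions, not the engine's actual implementation: the `tokenize` function, the `TokenDB` class and the smoothing used here are hypothetical and chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its distinct word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class TokenDB:
    """Hypothetical token database: per-class token counts and message counts."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Add one message to the 'spam' or 'ham' corpus."""
        self.tokens[label].update(tokenize(text))
        self.messages[label] += 1

    def forget(self, text, label):
        """Remove a previously learned message from a corpus."""
        # Counter's in-place subtraction drops counts that reach zero or below.
        self.tokens[label] -= Counter(tokenize(text))
        self.messages[label] = max(0, self.messages[label] - 1)

    def spam_probability(self, text):
        """Combine per-token evidence into one probability (naive Bayes)."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Start from the smoothed prior odds, then add each token's log-odds.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in tokenize(text):
            p_spam = (self.tokens["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.tokens["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable conversion from log-odds to a probability.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)
```

A message whose tokens appear mostly in the spam corpus gets a probability close to 1, and one whose tokens appear mostly in the ham corpus gets a probability close to 0. This is also why the two corpora should be of similar size: a heavily unbalanced corpus skews the prior.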

Each of these sources assigns the email a score; if the total score is above a user-configurable threshold, the email is marked as spam.
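
The decision described above can be sketched as follows. This is an illustration only: the function name, the source names and the score values are invented, and treating a total equal to the threshold as spam is an assumption.

```python
def classify(scores, required_score=5.0, whitelisted=False, blacklisted=False):
    """Return 'spam' or 'ham' from per-source scores (hypothetical helper).

    scores: mapping of source name -> score contributed by that source.
    """
    # Whitelist and blacklist entries override the scoring, as described above.
    if whitelisted:
        return "ham"
    if blacklisted:
        return "spam"
    total = sum(scores.values())
    # Assumption: a total that reaches the threshold counts as spam.
    return "spam" if total >= required_score else "ham"


# Invented example scores for one message.
verdict = classify({"bayes": 3.5, "rules": 1.2, "dnsbl": 1.0})
```

Raising `required_score` makes the filter more permissive (fewer false positives, more missed spam); lowering it does the opposite.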

Configuring Parameters 

After the spam filtering service has been started, visit the ‘Parameters’ tab (in the same menu, Proxies > Email Proxies > Spam filtering). The ‘Basic Parameters’ area holds the following items to be configured:

  • ‘Enable autolearn’ – when checked, it allows the Bayesian engine to reinforce its own decisions by adding any successfully classified email to the matching database (spam is added to the spam corpus and ham is added to the ham corpus)
  • ‘Required score’ – the default value of 5 is usually good enough if the Bayesian engine is properly trained and DNS blacklists are turned on; be careful, because a value that is too low may raise the number of false positives. It is often better to have more spam in the inbox than to miss an important email that was categorized as spam by mistake.
  • ‘Allowed charsets’ – the character sets allowed in emails; if an email uses a character set that is not allowed, its score is raised, making it more likely to be classified as spam. This may prevent receiving spam in, say, Chinese or Russian.
  • ‘Trusted networks’ – a list of the networks which are considered trusted: the chances of receiving spam from a host on those networks are very slim. Warning: this doesn’t mean that emails received from those networks are automatically whitelisted; use the whitelisting service to do that. It simply reduces the overall spam score by a large margin.
  • ‘Scan files of maximum’ – if the email is larger than this value, it is not spam filtered; it is advisable to keep this number low, both because applying Bayesian filtering to a large email consumes a lot of system resources and because spam messages are usually quite small
  • ‘Update automatically’ – select a value (daily, weekly, monthly) to enable the automatic update service for the Bayesian engine

Figure 2. Spam filter parameters configuration screen

The ‘Email reporting level’ area includes two options that may help in debugging the activity of your spam filter. ‘Include report details for spam filtered emails’ adds a report to any email tagged as spam, showing which rules were triggered and with what score. ‘Include spam headers for clean emails’ does almost the same thing for clean emails. These can be invaluable tools when trying to find out why the spam filter is behaving unexpectedly.

One of the greatest features of the spam filtering module is the seamless integration of several DNS blacklists. The ‘DNS blacklist’ entry under the ‘Whitelists and blacklists’ box lets you choose which of the seven possible blacklists you want to activate. A good choice is to have all of them activated, but you may want to turn some or all of them off if, for some reason, they generate too much traffic on your network.
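
For background, a DNS blacklist is conventionally queried by reversing the octets of the sender's IPv4 address and looking the result up under the blacklist's DNS zone; an answer means the address is listed. The sketch below illustrates that convention only. The zone name `dnsbl.example.org` and both function names are placeholders, not the lists or code used by the appliance.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for invalid input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the blacklist zone answers for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such record (or a lookup failure): treated as "not listed" here.
        return False


# Placeholder zone; real blacklists publish their own zone names.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Each activated blacklist costs at least one DNS lookup per checked address, which is where the extra network traffic mentioned above comes from.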

 

Figure 3. The Bayesian engine lets you choose among several public DNS blacklists to enhance the spam detection process

The blacklisting/whitelisting system is a simple way to specify which email addresses should always be tagged as spam and which should always be tagged as ham. To define a blacklisting/whitelisting policy, press the ‘Whitelisting and blacklisting’ button (see Figure 4).
 

Figure 4. Bayesian engine white and blacklist in action

Training the Bayesian Engine

To operate properly, the Bayesian filter has to be trained with roughly equal amounts of spam and ham. To train the spam filter by hand, you need to place the spam and ham emails in Unix mailboxes (mbox files) before starting the training operation. As of now, Outlook and Mail.app mailboxes are not supported. In the ‘Manual training’ section, use the ‘Upload UNIX mbox file’ area to upload the spam and ham mailboxes, clicking the corresponding button each time. If you want to remove some emails from the spam or ham corpus, upload the mailbox containing those emails and click the ‘Forget’ button. The spam and ham emails are decomposed into tokens, which are used to form a token database.

The second area, ‘Manage spam database’, can be used to download or upload an already trained token database. Uploading a token database is a very fast operation compared to the training process, and it’s always a good idea to back up your token database in case something goes wrong or if you want to replicate the same database on more than one appliance.
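
If your spam and ham samples are stored as individual message files, one way to assemble them into a Unix mailbox is Python's standard `mailbox` module. This is a sketch under assumptions: the function name is hypothetical, and it assumes each input file holds one complete raw message (headers and body).

```python
import mailbox
from pathlib import Path


def eml_files_to_mbox(eml_paths, mbox_path):
    """Append each raw message file to a Unix mbox file; return the message count."""
    box = mailbox.mbox(str(mbox_path))  # the file is created if it does not exist
    try:
        for eml in eml_paths:
            # Raw bytes are stored as-is, with an mbox "From " separator line added.
            box.add(Path(eml).read_bytes())
        box.flush()
        return len(box)
    finally:
        box.close()
```

Build one mailbox for spam and a separate one for ham, then upload each with its corresponding button.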

Figure 5. Bayesian engine training interface

Instead of collecting emails from your users, you can let them train the filter themselves by setting up three mailboxes (three accounts) on your local SMTP server, for example spam@example.com, ham@example.com and forget@example.com. Any email sent to these mailboxes that passes through the spam filter, via the SMTP proxy, will train the spam filter.
 

Figure 6. Training the Bayesian engine via email

To configure this service, use the ‘Training via email’ section. You have to define the necessary email addresses and the network from which training emails will be accepted. When a user wants to train an email as spam, they must send it to spam@example.com. The same works for ham, or for removing an email message from the database.
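
Conceptually, the service maps the recipient address of an incoming message to a training action and ignores messages that do not come from the allowed network. The sketch below only illustrates that logic; the action names, the `192.168.1.0/24` network and the function itself are assumptions, not the appliance's implementation.

```python
import ipaddress

# Example addresses from the text above; the action names are hypothetical.
TRAINING_ADDRESSES = {
    "spam@example.com": "learn_spam",
    "ham@example.com": "learn_ham",
    "forget@example.com": "forget",
}

# Assumed network from which training emails are accepted.
ALLOWED_NETWORK = ipaddress.ip_network("192.168.1.0/24")


def training_action(recipient, sender_ip,
                    addresses=TRAINING_ADDRESSES, allowed=ALLOWED_NETWORK):
    """Return the training action for a message, or None if it is not a training email."""
    if ipaddress.ip_address(sender_ip) not in allowed:
        return None  # senders outside the allowed network cannot train the filter
    return addresses.get(recipient.strip().lower())
```

Restricting training to a known network matters because anyone able to reach the training addresses could otherwise poison the spam or ham corpus.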

The spam database can be updated automatically on a daily, weekly or monthly basis. When the automatic update mode is selected, the spam updater fetches only the patches that haven’t already been applied to your spam database. These patches are, in fact, Unix mailboxes containing spam, which are trained like any other Unix mailbox.

The spam update server, however, collects these patches and, when they reach a certain number, produces a spam token database, which installs much faster than the individual patches. You have the option of installing such a spam database, plus all the latest patches that haven’t yet been integrated into it, by using the last section of the ‘Bayesian Filter’ tab: ‘Update’. This section also displays the last time an update occurred for your Bayesian engine.

Warning! Installing a spam database will wipe out your current database! If you want to keep the content you have trained and also get the content from the spam update, press the ‘Latest patches’ button in the ‘Update now’ area. The update process will take longer, but you will keep the already trained data. Please be patient if you choose this route.