Another method for spam filtering involves close inspection of an email's "header" for warning flags. For instance, an apparently forged "From" address is usually an indication of a spam message. This occurs when a spammer sends a message through one ISP but claims a completely different address in the header. Unfortunately, other aspects of email headers can be forged as well, including the originating IP address and client type, usually in an effort to keep network administrators from identifying the actual source of the message. Fortunately, there are aspects of a message header that cannot be forged, and this is how the forgery can be revealed. However, a misconfigured email server or email client can give the appearance of forgery even when the discrepancies are innocent, and the result is the message getting blocked as probable spam. So while this technique is fairly effective, it is prone to false positives.
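As a rough illustration of the idea, the sketch below compares the domain claimed in a message's "From" header against the hostnames recorded in its "Received" chain, which receiving servers add themselves. This is a simplified, hypothetical check, not the actual filter described here; the sample message and domain-matching rule are invented for the example.

```python
# Hypothetical header-consistency check (illustration only): flag a message
# when no relay hostname in the Received chain matches the claimed From domain.
import email
import re

RAW_MESSAGE = """\
Received: from mail.bulkhost.example (mail.bulkhost.example [203.0.113.9])
\tby mx.ourdomain.example; Mon, 23 Aug 2004 10:00:00 -0400
From: "Support" <support@bigbank.example>
Subject: Account notice

Please verify your account.
"""

def from_domain(msg):
    """Extract the domain from the From: header, if any."""
    match = re.search(r"@([\w.-]+)", msg.get("From", ""))
    return match.group(1).lower() if match else None

def received_domains(msg):
    """Collect hostnames mentioned in Received: headers."""
    domains = set()
    for header in msg.get_all("Received", []):
        domains.update(d.lower() for d in re.findall(r"from\s+([\w.-]+)", header))
    return domains

msg = email.message_from_string(RAW_MESSAGE)
claimed = from_domain(msg)
relays = received_domains(msg)
# Suspicious if the claimed From domain never appears among the relays.
suspicious = claimed is not None and not any(
    claimed in d or d in claimed for d in relays
)
print(claimed, relays, suspicious)
```

A real filter would be far more careful here; an innocent mismatch (say, a company sending through an outsourced mail provider) is exactly the kind of misconfiguration that produces the false positives mentioned above.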
A more sophisticated technique for spam filtering is called "Bayesian Analysis," which makes statistical comparisons against a tokenized database of known spam messages and known legitimate messages. Incoming mail is compared with the database and assigned a probability based on how its tokens match those in the database. The central weakness of Bayesian Analysis is the classic conundrum of "garbage in, garbage out": a large database of existing email with accurate classifications is remarkably effective against spam, but a database that is too small or contains misclassified email can be quite ineffective.
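To make the tokenization idea concrete, here is a minimal sketch of building per-token spam probabilities from hand-classified corpora. The tiny corpora, the tokenizer, and the clamping values are all invented for illustration; real filters also tokenize headers and handle rare tokens more carefully.

```python
# Minimal sketch of a Bayesian token database: count how often each token
# appears in classified spam vs. legitimate mail, then estimate a per-token
# spam probability. Corpora here are toy examples, not real training data.
import re
from collections import Counter

def tokenize(text):
    """Lowercase word-like tokens; real filters tokenize headers too."""
    return re.findall(r"[a-z0-9$!']+", text.lower())

spam_corpus = ["Cheap v i a g r a Biz! offer", "Limited time offer inside"]
ham_corpus = ["Meeting moved to Tuesday", "Pity the report is smaller than hoped"]

spam_counts = Counter(t for m in spam_corpus for t in tokenize(m))
ham_counts = Counter(t for m in ham_corpus for t in tokenize(m))

def spam_probability(token):
    """Per-token probability, clamped so no token is a perfect 0 or 1."""
    s = spam_counts[token] / len(spam_corpus)
    h = ham_counts[token] / len(ham_corpus)
    if s + h == 0:
        return 0.4  # mildly "innocent" default for never-seen tokens
    return min(0.99, max(0.01, s / (s + h)))

print(spam_probability("offer"))    # seen only in spam
print(spam_probability("tuesday"))  # seen only in legitimate mail
```

The clamping is what keeps a single unlucky token from forcing a verdict by itself, and the default for unseen tokens is why a freshly fed database can still score brand-new spam.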
Recently, I launched our Bayesian database with over 2,000 spam messages and 1,000 legitimate messages. There should have been few or no misclassifications because I went through the messages by hand before "feeding" them to the database for tokenization. My personal experience shows our database correctly identifies spam over 90% of the time with very few false positives. Bayes maintains its effectiveness against spam as long as an email administrator keeps feeding it properly classified messages. Because of the tokenization process, the database can identify new spam it has never seen before, but it's still a good idea to keep it current; otherwise it will slowly degrade as spammers try new techniques.
Here is a small excerpt from our Bayes database. The figures on the far left are the probabilities, with 0.140 meaning a 14% likelihood of being spam. The actual tokens being matched are on the far right.
0.140 1 4 1093368741 134
0.049 0 1 1093305963 Pity
0.985 3 0 1093370171 phoenicia
0.995 10 0 1093371791 capstan
0.184 7 19 1092264308 smaller
0.991 5 0 1093371613 v i a g r a
0.958 1 0 1092770619 Biz!
0.958 1 0 1093354427 sk:byronic
0.995 9 0 1093374838 coloratura
0.992 6 0 1093371309 voltmeter
0.958 1 0 1093370178 wavenumber
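Per-token probabilities like these still have to be combined into a single score for the whole message. A common way to do this is Paul Graham's combining rule (an assumption here, not necessarily what our filter uses internally), sketched below with a few probabilities borrowed from the excerpt above:

```python
# Sketch of combining per-token spam probabilities into one message score
# using Graham's rule: P(spam) = prod(p) / (prod(p) + prod(1 - p)).
# The combining rule is an illustrative assumption, not our filter's exact method.
from math import prod

# Token probabilities taken from the database excerpt above.
token_probs = {"phoenicia": 0.985, "capstan": 0.995, "smaller": 0.184}

def combined_score(probs):
    """Combine per-token probabilities into an overall spam probability."""
    probs = list(probs)
    p = prod(probs)
    q = prod(1 - x for x in probs)
    return p / (p + q)

score = combined_score(token_probs.values())
print(round(score, 4))
```

Note how two strongly "spammy" tokens overwhelm one mildly innocent one: the combined score lands very close to 1, which is why even a spam message containing some ordinary words still gets caught.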