

  Evaluating Spam Costs and Filtering Techniques

Another method for spam filtering involves close inspection of an email's "header" for warning flags. For instance, an apparently forged "From" address is usually an indication of a spam message. This appears when a spammer sends a message from one ISP but claims a completely different address in the header. Unfortunately, other aspects of email headers can also be forged, including the originating IP address and client type. All of this is usually an effort to elude network administrators trying to identify the actual source of the message. Fortunately, some parts of a message header cannot be forged by the sender, and these are how the forgery can be revealed. However, a misconfigured email server or email client can give the appearance of forgery even though the discrepancies are innocent. The result is a legitimate message getting blocked as probable spam. So while this technique is fairly effective, it is prone to false positives.
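To make the header check concrete, here is a minimal sketch in Python using the standard `email` package. The heuristic it applies (the "From" domain must appear somewhere in the "Received" lines) and the function names `from_domain` and `looks_forged` are assumptions chosen for illustration, not the rules of any particular filter. Real filters weigh many more header fields.

```python
from email import message_from_string
from email.utils import parseaddr


def from_domain(msg):
    """Return the lowercased domain of the From address, or '' if unusable."""
    _, addr = parseaddr(msg.get("From", ""))
    if "@" not in addr:
        return ""
    return addr.rpartition("@")[2].lower()


def looks_forged(raw_message):
    """Flag a message whose From domain appears in none of its Received lines.

    This is a deliberately crude, hypothetical rule. Legitimate mail is often
    relayed through hosts in other domains, which is exactly the source of
    the false positives described above.
    """
    msg = message_from_string(raw_message)
    domain = from_domain(msg)
    if not domain:
        return True  # no usable From address at all
    received = " ".join(msg.get_all("Received", [])).lower()
    return domain not in received


# Two made-up messages using reserved example domains and addresses.
CONSISTENT = (
    "Received: from mail.example.com (mail.example.com [192.0.2.10])\n"
    "\tby mx.example.net; Wed, 25 Aug 2004 18:00:00 -0400\n"
    "From: Alice <alice@example.com>\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon.\n"
)
MISMATCHED = (
    "Received: from dialup.example.org (dialup.example.org [198.51.100.7])\n"
    "\tby mx.example.net; Wed, 25 Aug 2004 18:05:00 -0400\n"
    "From: Bob <bob@example.com>\n"
    "Subject: Act now\n"
    "\n"
    "Limited offer.\n"
)

if __name__ == "__main__":
    print(looks_forged(CONSISTENT))
    print(looks_forged(MISMATCHED))
```

A mismatch like the second message is only a hint. A traveling user sending through a hotel's mail relay would trip the same rule, so a check of this kind is better used as one score among several than as an outright block.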

A more sophisticated technique for spam filtering is called "Bayesian Analysis", which statistically compares incoming mail against a tokenized database of known spam messages and known legitimate messages. Each incoming message is broken into tokens, and the message is assigned a spam probability based on how those tokens match the database. The central weakness of Bayesian Analysis is the classic conundrum of "garbage in, garbage out". A large database of existing email with accurate classification is extremely effective against spam, but a database which is too small or contains misclassified email can be quite ineffective.
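The idea can be sketched in a few lines of Python. This is one common textbook formulation (per-token spam probabilities combined as log-odds), not necessarily what any given filter does. The class name `TinyBayes`, the tokenizer, and the clamping limits of 0.01 and 0.99 are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplistic tokenizer)."""
    return re.findall(r"[a-z0-9$!']+", text.lower())


class TinyBayes:
    """Toy token database: counts how many spam and ham messages use a token."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Add one hand-classified message to the database."""
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """Estimate P(spam | token); None for tokens never seen in training."""
        s = self.spam_tokens[token]
        h = self.ham_tokens[token]
        if s + h == 0:
            return None
        spam_rate = s / max(self.n_spam, 1)
        ham_rate = h / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        return min(0.99, max(0.01, p))  # avoid absolute certainty

    def score(self, text):
        """Combine the known tokens' probabilities into one message score."""
        probs = [self.token_probability(t) for t in set(tokenize(text))]
        probs = [p for p in probs if p is not None]
        if not probs:
            return 0.5  # nothing recognized: no evidence either way
        log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


db = TinyBayes()
db.train("cheap pills buy now", is_spam=True)
db.train("buy cheap watches now", is_spam=True)
db.train("meeting notes for monday", is_spam=False)
db.train("lunch on monday?", is_spam=False)

if __name__ == "__main__":
    print(round(db.score("buy cheap pills"), 3))
    print(round(db.score("monday meeting"), 3))
```

With only four training messages the scores are extreme and untrustworthy, which is the "garbage in, garbage out" problem in miniature: the estimates only become meaningful once the database holds a large, correctly classified sample.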

Recently, I launched our Bayesian database with over 2,000 spam messages and 1,000 legitimate messages. There should have been few or no misclassifications because I went through the messages by hand before "feeding" them to the database for tokenization. In my experience, our database correctly identifies spam over 90% of the time, with very few false positives. Bayes maintains its effectiveness against spam as long as an email administrator keeps feeding it properly classified messages. Because of the tokenization process, the database can identify new spam it has never seen before, but it is still a good idea to keep it current. Otherwise it will slowly degrade as spammers try new techniques.

Here is a small example from our Bayes database. The figures on the far left are the probabilities, with 0.140 meaning a 14% likelihood of being spam. The actual tokens being matched are on the far right.

0.140 1 4 1093368741 134

0.049 0 1 1093305963 Pity

0.985 3 0 1093370171 phoenicia

0.995 10 0 1093371791 capstan

0.184 7 19 1092264308 smaller

0.991 5 0 1093371613 v i a g r a

0.958 1 0 1092770619 Biz!

0.958 1 0 1093354427 sk:byronic

0.995 9 0 1093374838 coloratura

0.992 6 0 1093371309 voltmeter

0.958 1 0 1093370178 wavenumber
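As a small illustration of how such a listing can be read programmatically, the Python sketch below parses the rows above and picks out the tokens that point strongly toward spam. It relies only on what is stated here: the first column is the probability and the last is the token. The three middle columns are carried along as opaque values, and the 0.9 cutoff and the names `parse_listing` and `strong_spam_tokens` are assumptions for the example.

```python
LISTING = """\
0.140 1 4 1093368741 134
0.049 0 1 1093305963 Pity
0.985 3 0 1093370171 phoenicia
0.995 10 0 1093371791 capstan
0.184 7 19 1092264308 smaller
0.991 5 0 1093371613 v i a g r a
0.958 1 0 1092770619 Biz!
0.958 1 0 1093354427 sk:byronic
0.995 9 0 1093374838 coloratura
0.992 6 0 1093371309 voltmeter
0.958 1 0 1093370178 wavenumber
"""


def parse_listing(text):
    """Return (probability, token) pairs from a listing like the one above."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most four times so a token containing spaces
        # (such as "v i a g r a") stays in one piece.
        prob, _col2, _col3, _col4, token = line.split(None, 4)
        entries.append((float(prob), token))
    return entries


def strong_spam_tokens(entries, cutoff=0.9):
    """Tokens whose spam probability meets the (assumed) cutoff, highest first."""
    strong = [(p, t) for p, t in entries if p >= cutoff]
    return [t for p, t in sorted(strong, key=lambda pt: pt[0], reverse=True)]


entries = parse_listing(LISTING)

if __name__ == "__main__":
    for token in strong_spam_tokens(entries):
        print(token)
```

Note that innocuous-looking words such as "capstan" and "voltmeter" score very high. That is consistent with spammers padding messages with random dictionary words, and it shows why the database learns from whatever actually arrives rather than from a fixed word list.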

Table of Contents
Page 1: The Cost of Spam
Page 2: Simple Techniques
Page 3: Complex Techniques
Page 4: Integrated Techniques
Page 5: The Future
Page 6: Final Thoughts

      Posted by: , August 25, 2004, 6:00 pm  

2001 - 2004 Digital Silence
Digital Silence is not responsible for the information or the accuracy of the information above.
All trademarks and copyrights owned by their respective companies.