Michael Morris
07-10-2008, 03:09 AM
I'm patching up an old but very effective Bayesian filtering spam package for the 3.7.2 release. I'm looking for question ideas for the program.
Bayesian filtering works because, no matter what the spammers try, at the end of the day they still have to deliver their message. The filters ask yes or no questions about the post and score the answers. The higher the score, the more likely the post is to be spam. Here are the questions I've currently encoded and their reasons for being part of the test. I'm looking for commentary on the point value of each question and also for other questions that can be added. This is meta code - I'm not going to cite the exact function call(s) needed to perform these tests. I expect to have the product out by September.
Note that Bayesian filtering is best limited to users with fewer than 5 posts.
Here goes.
1. Is the post a new thread? (5)
Spammers usually create new threads for increased visibility.
2. Does the post contain URLs?
Spammers usually use URL code.
2a. Does the URL take up 50% or more of the total length of the post?
Spammers tend to do this as well.
2b. Is the post nothing but a link?
Almost a certain sign of spam.
2c. Are there multiple URLs?
Another almost certain sign.
3. Does the post contain multiple $ characters?
4. Are more than 50% of the characters in the post outside the regex class [A-Za-z0-9]?
I've observed spam posts that were utterly illegible, probably from misconfigured bots. To be honest, they are among the most amusing.
5. Has any URL in the post been the origin point of spam before?
The previous version of vSpamScan 2 used an independently maintained blacklist. The new version will maintain its own: each time a spam post is made, it will harvest the URLs from that post and add them to the blacklist. Admins will also be able to whitelist domains to prevent them from being added to the blacklist.
6. Is the estimated typing speed of the post greater than 100 wpm?
Bots move fast. Time the interval between the request for the editor page and the submission of the post.
7. Image code?
7a. More than one?
8. Does the IP of the poster match the first 3 octets of a prior spam attempt?
For instance, say you're spammed by a bot at 211.67.31.12. From then on vSpamScan will score this question on any post from 211.67.31.* Not an outright IP ban, but it's a clue.
9. Post contains the words porn, viagra, or other common spam terms
If so count them up.
10. More than 50% of words in post misspelled.
Punishes attempts to dodge 9 above.
11. Post shorter than 50 characters, not including characters in bbcode tags
12. Post contains fewer than 2 sentences (here looking for periods not used in URLs)
13. Post Title contains multiple $ or multiple !!!
14. Post contains email addresses.
15. No MD5 hash of post
Client-side JavaScript will be used to hash new users' posts. There are 2 cases where this will fail - users who disable JavaScript and bots that don't parse JavaScript.
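To make a few of the questions above concrete, here is a rough Python sketch of questions 2a/2b, 4, 6 and 8. It is only an illustration, not the vSpamScan code: the function names, the URL pattern, and the choice to ignore whitespace in question 4 are all assumptions.

```python
import re

# Hypothetical pattern; real bbcode URL handling would need more cases.
URL_RE = re.compile(r"https?://[^\s\[\]]+", re.IGNORECASE)


def url_share(text: str) -> float:
    """Fraction of the post's characters that belong to URLs (questions 2a/2b)."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_RE.findall(text))
    return url_chars / len(text)


def non_alnum_share(text: str) -> float:
    """Fraction of characters outside [A-Za-z0-9] (question 4).

    Assumption: whitespace is ignored so ordinary sentences are not penalized.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    outside = sum(1 for c in chars if not re.fullmatch(r"[A-Za-z0-9]", c))
    return outside / len(chars)


def words_per_minute(text: str, seconds_elapsed: float) -> float:
    """Estimated typing speed between editor request and submission (question 6)."""
    words = len(text.split())
    if seconds_elapsed <= 0:
        return float("inf") if words else 0.0
    return words * 60.0 / seconds_elapsed


def same_ip_prefix(ip: str, known_spam_ip: str) -> bool:
    """True when two IPv4 addresses share their first three octets (question 8)."""
    return ip.split(".")[:3] == known_spam_ip.split(".")[:3]
```

Each function returns a raw measurement; the yes or no answer comes from comparing it to the cutoff in the question, for example `url_share(text) >= 0.5` for 2a or `words_per_minute(text, elapsed) > 100` for 6.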
The idea is that after a certain score, say 20, the post is pushed to the moderation queue. At another threshold - say 40 - the post just gets deleted. Admins of a board will be able to change the scoring of each of these questions or even disable them.
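The scoring and the two thresholds could look something like the sketch below. The question keys and every point value except the 5 for a new thread are placeholders I made up for illustration, and treating the thresholds as "greater than or equal" is also an assumption.

```python
from typing import Dict

# Placeholder point values; only question 1's value (5) comes from the post.
WEIGHTS: Dict[str, int] = {
    "new_thread": 5,
    "has_url": 5,
    "only_link": 15,
    "known_spam_url": 20,
}

MODERATE_AT = 20  # example moderation threshold from the post
DELETE_AT = 40    # example deletion threshold from the post


def spam_score(answers: Dict[str, bool], weights: Dict[str, int] = WEIGHTS) -> int:
    """Sum the points of every question answered 'yes'.

    A question an admin has disabled is simply left out of `weights`,
    so it contributes 0.
    """
    return sum(weights.get(question, 0) for question, yes in answers.items() if yes)


def route(score: int) -> str:
    """Decide what happens to a post with the given score."""
    if score >= DELETE_AT:
        return "delete"
    if score >= MODERATE_AT:
        return "moderate"
    return "publish"
```

An admin changing the scoring would amount to passing a different `weights` dictionary, and changing the thresholds would amount to editing the two constants.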
vSpamScan 2 will also incorporate some features seen in some of the scattered spam packages on the board, like timing the length of registration, verbal captcha coding, and user pseudo-moderation: any new poster's post will have a "report post as spam" button. Each time a registered user clicks that button (only once per user), 10 points are added to the post's spam score. So say a spammer squeaks past the filter but still scores 15. If a user hits the report button the score goes up to 25 and BOOM - outta here. The system will also track which users consistently file false reports and ignore their use of the button after a set number of false alarms.
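The report button logic might be sketched like this. The class names and the limit of 3 false reports are hypothetical (the post only says "a set number"); the 10 points per report is the value given above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

REPORT_POINTS = 10     # points per report, from the post
MAX_FALSE_REPORTS = 3  # hypothetical limit; the post only says "a set number"


@dataclass
class Post:
    score: int
    reporters: Set[str] = field(default_factory=set)


@dataclass
class ReportTracker:
    false_reports: Dict[str, int] = field(default_factory=dict)

    def report(self, post: Post, user: str) -> int:
        """Add points for a report and return the post's new score.

        Reports are ignored when the user has too many false alarms
        or has already reported this post.
        """
        if self.false_reports.get(user, 0) >= MAX_FALSE_REPORTS:
            return post.score
        if user in post.reporters:
            return post.score
        post.reporters.add(user)
        post.score += REPORT_POINTS
        return post.score

    def mark_false(self, user: str) -> None:
        """Record that a moderator judged one of this user's reports a false alarm."""
        self.false_reports[user] = self.false_reports.get(user, 0) + 1
```

With the example from the post, a `Post(score=15)` reported once ends up at 25, which crosses the moderation threshold of 20.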