After some further thought, and having used SpamAssassin a fair bit in the past, I think I'm going to run with a similar rule-based system.
A config file has a list of rules within it. Each rule has a score which is added to an overall score for the post should that rule match the post.
At the end of testing a post for "spaminess" the overall score of the post is compared to a configurable "threshold" score. If the score for a post is over the threshold then it's deemed to be spam.
This setup, coupled with support for Perl-style regular expressions and similar_text, should make for a fairly simple but highly configurable system.
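To make the scoring idea above a bit more concrete, here's a rough sketch of how it could hang together. It's written in Python purely for illustration and isn't the actual SpamBuster code. The class names, the example patterns, the scores and the threshold are all made up, and difflib's SequenceMatcher is only standing in for similar_text.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class RegexRule:
    """Hypothetical rule: matches when the pattern is found anywhere in the post."""
    pattern: str
    score: float

    def matches(self, post: str) -> bool:
        return re.search(self.pattern, post, re.IGNORECASE) is not None


@dataclass
class SimilarityRule:
    """Hypothetical rule: matches when the post is close enough to a known spam sample.

    SequenceMatcher.ratio() is used here as a stand-in for similar_text.
    """
    sample: str
    min_ratio: float
    score: float

    def matches(self, post: str) -> bool:
        ratio = SequenceMatcher(None, self.sample.lower(), post.lower()).ratio()
        return ratio >= self.min_ratio


# Assumed example rules; a real list would come from the config file.
RULES = [
    RegexRule(r"\bviagra\b", 3.0),
    RegexRule(r"https?://\S+", 2.5),
    SimilarityRule("buy cheap pills online now", 0.8, 4.0),
]

# Assumed example threshold; configurable in the real thing.
THRESHOLD = 5.0


def spam_score(post, rules=RULES):
    """Add up the scores of every rule that matches the post."""
    return sum(rule.score for rule in rules if rule.matches(post))


def is_spam(post, rules=RULES, threshold=THRESHOLD):
    """A post is deemed spam when its overall score is over the threshold."""
    return spam_score(post, rules) > threshold
```

With these made-up numbers, a post that mentions viagra and also contains a link scores at least 5.5 and gets flagged, while a post that trips only one of those rules stays under the threshold. Tuning then comes down to editing scores and the threshold rather than touching code.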
Once there's a stable release of this code I'd hope that a number of us could work together to produce a decent standard list of patterns that people can use out of the box. I'm going to need some help with that stage, as the forums I admin get fairly low amounts of spam compared to some other sites.
As an aside, SpamBuster nipped 8 spam posts in the bud this morning on visordown.com