Simon Willison’s Weblog

Solving comment spam

28th January 2004

There are two main schools of thought concerning comment spam: the optimists and the defeatists. Optimists believe that comment spam can be beaten with technology; defeatists (maybe I should call them pessimists) believe that comments are as doomed as email and we’re all going to hell in a handbasket.

The story so far

I fall squarely into the techno-optimist category. Back in September I started blacklisting domains linked to from spam comments, defending against return visits from spammers and allowing others to syndicate my block list to run on their own sites. Then in October I tweaked my comment system to eliminate PageRank from links in comments, making spamming for search engine optimisation a futile exercise. Of course, this measure only works if spammers realise it’s there (I know at least one has), which is why I’m personally very happy to see that the latest release of Movable Type has adopted the technique, albeit to mixed reviews from the MT community.
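To make the blacklisting idea concrete, here is a minimal sketch in Python. It is not the code running on this site: the domain names, `BLOCKED_DOMAINS`, `linked_domains` and `is_blocked` are all hypothetical, and the sketch assumes that a comment counts as spam if any URL in it points at a blocked domain or one of its subdomains.

```python
import re
from urllib.parse import urlparse

# Hypothetical block list; a real one would be loaded from a shared,
# syndicated file rather than hard-coded.
BLOCKED_DOMAINS = {"spam-pills.example", "cheap-casino.example"}

# Rough pattern for absolute URLs inside comment text or HTML attributes.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_domains(comment_text):
    """Return the set of lower-cased host names linked to from a comment."""
    domains = set()
    for url in URL_PATTERN.findall(comment_text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal); skip it.
            continue
        if host:
            domains.add(host.lower())
    return domains


def is_blocked(comment_text, blocked=BLOCKED_DOMAINS):
    """True if the comment links to a blocked domain or any subdomain of one."""
    for host in linked_domains(comment_text):
        if any(host == d or host.endswith("." + d) for d in blocked):
            return True
    return False
```

Matching on the linked domain rather than on the commenter is what makes a list like this worth sharing: the spammer can change names and IP addresses cheaply, but the site being advertised stays the same.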

There have been a whole bunch of other technological innovations over the past few months. Sam Ruby has implemented throttling to ban people who post three consecutive comments, and has some great ideas about guarding against strangers. Jay Allen’s MT-Blacklist makes the blacklisting concept available to a wide audience. Meanwhile, James Seng’s MT-Bayesian introduces trainable spam filters adapted from the fight against email spam.
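The throttling idea can be sketched in a few lines of Python. This is an illustration only, not Sam Ruby’s implementation: the class name `ConsecutiveThrottle`, the use of an IP address string to identify the poster, and the choice to refuse the third consecutive comment (rather than accept it and ban afterwards) are all assumptions.

```python
from collections import deque


class ConsecutiveThrottle:
    """Ban any poster who makes `limit` comment attempts in a row."""

    def __init__(self, limit=3):
        self.limit = limit
        # Remembers who made the most recent `limit` attempts.
        self.recent = deque(maxlen=limit)
        self.banned = set()

    def allow(self, poster):
        """Return True if the comment should be accepted."""
        if poster in self.banned:
            return False
        self.recent.append(poster)
        # A full window holding a single poster means `limit` in a row.
        if len(self.recent) == self.limit and len(set(self.recent)) == 1:
            self.banned.add(poster)
            return False
        return True
```

A poster whose comments are interleaved with other people’s never fills the window alone, so ordinary back-and-forth discussion is unaffected.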

The challenges ahead

So those are the solutions so far; the critical question is whether they work. The amount of spam I’ve been getting has definitely decreased, but as I run a completely custom blogging system I’m safe from the automated scripts that target more widespread systems (other sites make easier targets). Now that the less ethical search engine optimisers have started to catch on to the potential of comment spam to improve their PageRank, the amount of spam can only increase. Some bloggers have already started to disable comments entirely (thankfully Dan turned them back on again shortly afterwards), setting a worrying precedent for the elimination of the two-way interactions that comments allow between bloggers and non-bloggers.

I’ll put it in writing now: I will never disable comments on this blog. In the past few months the comments here have proved far more interesting and valuable than my actual posts, and I really appreciate the quality of the discussions that have arisen here. I will take whatever steps are necessary to keep this a useful environment for discussion.

Many people have hailed user registration as the ultimate solution to spam. It isn’t, because the value of PageRank is just too high—and writing a script to automatically create accounts (even with email confirmation required) is child’s play to anyone who is competent in an internet-aware scripting language. Even accessibility-impeding captchas are no defence against spammers who can afford to employ cheap labour to defeat them—and with search engine rankings as critical as they are there’s no shortage of spam dollars.

With those ruled out, let’s look at the remaining solutions:

The killer

Without links, comment spam has no purpose. To eliminate spam, eliminate links. Redirecting them through a PageRank killer already achieves this, but proves too subtle for spammers intent on spreading their links as widely as they can. To truly eliminate spam, strip out links and anything that even looks like a URL, and force the spammer to preview their carefully crafted advertisement before hitting submit. Seeing as hyperlinks are the single most important feature of the web this may seem draconian, and indeed it is. But on a site that serves more as a discussion forum than a farm, and where the alternative to killing links is killing comments entirely, this could be the saving factor.
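As a rough illustration of what “strip out links and anything that even looks like a URL” could mean in practice, here is a Python sketch. The function name `strip_links`, the `[link removed]` placeholder and the short list of top-level domains are assumptions chosen for the example; a real filter would need a longer list and more testing against actual spam.

```python
import re

# Opening and closing anchor tags, e.g. <a href="..."> and </a>.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

# Text that looks like a URL: an explicit scheme, a www. prefix,
# or a bare host name ending in a common top-level domain.
URL_LIKE = re.compile(
    r"(?:https?://|www\.)\S+"
    r"|\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org|info|biz)\b\S*",
    re.IGNORECASE,
)


def strip_links(comment_text):
    """Remove anchor tags, then replace anything URL-like with a placeholder."""
    without_tags = ANCHOR_TAG.sub("", comment_text)
    return URL_LIKE.sub("[link removed]", without_tags)
```

Showing the stripped result on a compulsory preview page is the important part: the spammer sees the placeholder where their link used to be before the comment is ever published.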

For most blogs, however, links are an essential part of the discourse, and I certainly wouldn’t want to disable them here. Not only do they add huge value to the discussions, but more importantly they act as a “signature” for many commenters: knowing a comment is by “Dan” is far less useful than knowing that it’s by Dan from www.simplebits.com.

Finding a compromise

Draconian measures such as the above wouldn’t be necessary if spammers would wise up to the fact that their carefully crafted missives were having no effect on their precious PageRank. The real challenge then is to make anti-PageRank measures obvious to even the most brain-addled viagra peddlers. I’ve taken the first step towards this by turning on compulsory previewing for comments, which should have the added benefit of reminding legitimate commenters to use paragraph tags. I’ll be working on ways of making the anti-PageRank measures more obvious over the next few days, as and when work permits.

I’ve seen people argue that depriving legitimate commenters of PageRank is a poor compromise. I disagree: if the only cost of eliminating the incentive to spam is the loss of some Google ego then I see it as a price well worth paying. Of course, I say that as someone who’s already built up their Google ego but at the end of the day it’s my blog, my rules. One solution I’ve considered is creating a whitelist of sites that frequent commenters use in their signatures, causing them to be displayed without a redirect.
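The whitelist idea might look something like the following Python sketch. It is a guess at the shape of the thing, not the code behind this site: `TRUSTED_HOSTS`, `REDIRECT_PREFIX` and `outbound_href` are hypothetical names, and the sketch assumes the redirect script takes the destination as a URL-encoded `url` parameter.

```python
from urllib.parse import quote, urlparse

# Hypothetical whitelist of hosts used by frequent commenters.
TRUSTED_HOSTS = {"www.simplebits.com"}

# Hypothetical path of the PageRank-killing redirect script.
REDIRECT_PREFIX = "/redirect?url="


def outbound_href(url, trusted=TRUSTED_HOSTS):
    """Return the href to publish: direct for trusted hosts, redirected otherwise."""
    host = (urlparse(url).hostname or "").lower()
    if host in trusted:
        return url
    # Encode the whole destination so it survives as one query parameter.
    return REDIRECT_PREFIX + quote(url, safe="")
```

Because the check is an exact match on the host name, adding a trusted commenter is a one-line change, and everyone else keeps going through the redirect by default.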

Comment spam is a solvable problem. Furthermore, blogging about comment spamming is almost as dull as blogging about blogging. Let’s hurry up and solve it so we can go back to blogging about cats.

This is Solving comment spam by Simon Willison, posted on 28th January 2004.

Previously hosted at http://simon.incutio.com/archive/2004/01/28/solvingCommentSpam