In the beginning…
Ever since Sergey and Larry invented the PageRank algorithm, published it and started Google, people have been looking for ways to build lots of external links to their sites. Up to a little while ago, this was done through sending out email spam looking for link exchanges, and having ‘link’ pages. While there is still benefit in trading links with relevant people, I suspect most people delete link exchange requests immediately without even reading them, just like I do.
Along comes ‘Web 2.0’
While I’m not a fan of the term, one of the features of ‘Web 2.0’ was the level of interaction people could have with websites. While forums are nothing new, being able to create and upload content on wiki-style sites, as well as leave comments on blogs without creating accounts, was the next big thing for SEO spammers. The early blog software was pretty basic, and it didn’t take much effort to write a bot to automatically post your links all over the web. The response to this was the ‘Captcha’ field : pretty much a feature of most blogs these days. I hate the things : a programmer solution to a user problem if I ever saw one. You’ll note I don’t use one with this blog. I just manually delete the spam comments as they come in.
The rel=nofollow era
Next along was a suggestion from Google that people start using a new attribute value, ‘rel=nofollow’, on links. This would instruct search engines not to pass any rank on to a link marked with this bit of HTML. While this still works, and is effective in stopping PageRank leaking away through commenters’ links, it’s a blanket approach. I don’t mind passing on PageRank from a blog page to a useful link posted in a comments field. In the linking world, what goes around comes around. If a page is that important for ranking, you wouldn’t have any user-contributed content on it anyway. I certainly would hope that if I contributed a useful comment to another blog, a bit of link love might come back my way.
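To make the ‘blanket approach’ concrete, here is a minimal sketch in Python of what a comment system might do before rendering a comment: force rel="nofollow" onto every link, whoever posted it. The names NofollowRewriter and add_nofollow are made up for this illustration and are not taken from any real blog engine. It also assumes plain comment markup: HTML comments are dropped, and script or style content is not handled specially.

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit an HTML fragment, forcing rel="nofollow" on every <a> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        if tag == "a":
            # Drop any rel the commenter supplied, then add our own.
            attrs = [(k, v) for k, v in attrs if k != "rel"]
            attrs.append(("rel", "nofollow"))
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def add_nofollow(html_fragment):
    """Return the fragment with rel="nofollow" applied to all links."""
    parser = NofollowRewriter()
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.out)


comment = '<p>See <a href="http://example.com">this</a></p>'
rewritten = add_nofollow(comment)
```

Every link gets the same treatment here, which is exactly the problem: the useful link and the spam link are indistinguishable to this code.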
The Army of a Thousand Monkeys
With captchas and nofollow common in the blog comments world, the spammers were slowed down a bit. You could easily stop automated bots from posting to comment fields, and if you didn’t use a captcha, you didn’t lose anything because of the nofollow links anyway.
But the prize is still there : all it takes is a couple of hundred links ‘out there’ and a page can start ranking for a given keyword/keyphrase. So SEO spammers have turned to a new trick : the ‘mechanical turk’ approach. No disrespect to Amazon but the term existed before they started using it.
What I’ve noticed lately (and discussions with other site owners confirm this) is that there appears to be an army of workers who will diligently create an account and start posting links all over your site. You can’t stop them with a Captcha because it is a real human working the controls. You can’t stop them posting links because you need to let your legitimate users post links. You can only realise that it’s a spam post after the fact, when your powerful head-mounted computer instantly recognises it.
I’ve found that these spammers like forums the best. Just a nice, empty page, usually freedom to enter all sorts of html and the ability to make lots of different posts once an account has been created.
I’ve no idea how people get an ROI on this activity, but I suppose a determined poster can rack up a couple hundred links in an hour if they set out to do it.
How to stop the Tide
I’ve been mulling over this question a lot lately, specifically because I have seen the problem affect DotNetNuke sites badly : there isn’t a lot of spam protection (apart from a [somewhat troublesome] captcha installation). Things like the forum software don’t have a captcha built in, and besides, we’ve already established that a captcha doesn’t work against human SEO bots.
Some time back I switched to a Bayesian spam filter for my email, after getting frustrated with other [lame] solutions which rely on black lists and keyword lists and other ultimately doomed technology. I say ultimately doomed because it is defence by whack-a-mole. It’s always reactive and takes a huge effort to be successful. It also creates too many false positives : that is, normal email that gets trapped in the spam folder, never to be seen. That’s because your friends are adults, and occasionally they use the word ‘sex’ in emails as well.
The reason I got onto this was through Paul Graham’s excellent book Hackers and Painters, in which a chapter is devoted to his Bayesian spam filter software. By the time I got around to reading it, there was already an implementation of the Bayesian algorithm for Outlook called SpamBayes, which I promptly downloaded and installed. It’s even open source. And I’ve been happy with it ever since.
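For anyone who hasn’t seen how a statistical filter works, here is a rough sketch in Python. It is a plain naive Bayes classifier with add-one smoothing, not SpamBayes’ actual code and not the exact formula from Paul Graham’s chapter. The class name, the tokenizer and the tiny training set are all invented for the illustration.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")


def tokenize(text):
    """Lower-case the text and return its set of distinct tokens."""
    return set(TOKEN_RE.findall(text.lower()))


class NaiveBayesFilter:
    """Counts how many spam and ham messages each token has appeared in."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message you have judged yourself ('spam' or 'ham')."""
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Estimate the probability that the text is spam."""
        n_spam = self.doc_counts["spam"]
        n_ham = self.doc_counts["ham"]
        # Work in log-odds so many small factors don't underflow.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing things out.
            p_spam = (self.token_counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp before converting back, to avoid overflow in exp().
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills buy now http://example.com", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("thanks for the DotNetNuke module tip", "ham")
spam_filter.train("the forum module works well for me", "ham")

likely_spam = spam_filter.spam_probability("buy cheap pills")
likely_ham = spam_filter.spam_probability("thanks for the module")
```

With only four training messages the numbers mean very little, but the shape is the point: there is no keyword list anywhere, and the filter only knows what you have taught it.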
Experimenting with DotNetNuke
I decided that the way forwards was to try and implement the same algorithm in a one-size-fits-all implementation for DotNetNuke. The idea would be that you could point it to any module where visitors are allowed to create content. If it suspected an entry in a page was Spam, it would either send back a 404, redirect the user to somewhere else, or ‘silently fail’ : meaning the content would be created but deleted later, or perhaps modified in such a way that it was a pointless post for the user (for example, replace the offending text with ‘This message was detected as Spam and deleted’).
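The three responses described above can be sketched as a small decision function. This is Python rather than DotNetNuke code, and SpamAction, Outcome, handle_submission, the 0.9 threshold and the default redirect target are all hypothetical names and values picked for the illustration. The ‘silent fail’ branch shows only the replace-the-text variant, not the delete-later one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SpamAction(Enum):
    NOT_FOUND = "404"
    REDIRECT = "redirect"
    SILENT_REPLACE = "silent_replace"


@dataclass
class Outcome:
    status: int                      # HTTP status to send back
    body: str                        # text that ends up stored or shown
    location: Optional[str] = None   # redirect target, if any


PLACEHOLDER = "This message was detected as Spam and deleted"


def handle_submission(text, spam_probability, action,
                      threshold=0.9, redirect_url="/"):
    """Decide what happens to a submitted post, given its spam score."""
    if spam_probability < threshold:
        # Looks legitimate: store it unchanged.
        return Outcome(status=200, body=text)
    if action is SpamAction.NOT_FOUND:
        return Outcome(status=404, body="")
    if action is SpamAction.REDIRECT:
        return Outcome(status=302, body="", location=redirect_url)
    # Silent fail: the post appears to succeed, but the text is replaced.
    return Outcome(status=200, body=PLACEHOLDER)


result = handle_submission("buy cheap pills", 0.95, SpamAction.SILENT_REPLACE)
```

Keeping the decision separate from the scoring means the site owner can choose the response per module without touching the filter itself.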
I’ve been experimenting with this approach and have had some success, although it needs a lot more trials to figure it out better. The beauty of the statistics-based approach is that it is tailored to each and every site, though you do need to ‘teach’ your particular version what you consider to be spam. I’ve come to a fork in the road, though, and before I spend much more time, I thought I’d better get some early feedback to see if I’m on the right track.
This post is really to put it ‘out there’ and see what people think. Are human spam bots creating content on your site, bloating your blog comments and filling up your forums? Have you already found a successful solution? Is a generic solution for any type of DNN module the right way, or do people want specific solutions for specific content modules (e.g. blog, forum, etc.)? Let me know via the comments - no spam please :) That means you, charlie2342435@yahoo.com!