A blacklist can be a handy tool for preventing a certain kind of spam: the kind that seeks to gain page rank by planting lots of links back to a certain page. It works because the URL is the one thing the spammer must include if the spam is to have any effect. If you build the blacklist in a plain text file, with one keyword (a domain, a word, whatever) per line, it is really easy to plug into that list from a variety of applications.
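For example, a spam_urls.txt might look like this (made-up entries, purely for illustration):

cheap-meds.example.net
online-casino-payouts.example.com
texas-holdem

Any post containing one of those strings gets rejected.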
I was getting some post spam in my PunBB forums, so I hooked up the forum posting to the same blacklist I use to prevent comment spam here in my blog. Here is the PHP code if you’d like to try it yourself:
// Load the blacklist: one keyword per line.
$spam_urls = file("spam_urls.txt");
foreach ($spam_urls as $spam_url) {
    // Drop the trailing newline and normalize case.
    $spam_url = str_replace("\n", "", strtolower($spam_url));
    // Skip blank lines; an empty needle would warn or match everything.
    if ($spam_url == "") {
        continue;
    }
    // Block the post if the field contains a blacklisted keyword.
    if (strstr(strtolower(stripslashes($_REQUEST["field_name"])), $spam_url)) {
        die('Oops, blocked.');
    }
}
To use:
- build your spam_urls.txt file
- change the field name (field_name) to the name of the form field that can contain spam URLs (you can check additional fields as needed; the easiest way is to copy the “if” block, or loop over the field names as in the sketch after this list)
- add this check to the code just before something gets posted
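If you have several fields to check, here is one way to loop over them instead of copying the if block. A quick sketch; the field names are placeholders for whatever your form actually uses:

$fields_to_check = array("message", "subject", "homepage"); // placeholders

$spam_urls = file("spam_urls.txt");
foreach ($spam_urls as $spam_url) {
    $spam_url = str_replace("\n", "", strtolower($spam_url));
    if ($spam_url == "") {
        continue; // skip blank lines in the blacklist
    }
    foreach ($fields_to_check as $field) {
        // Check each submitted field against the current keyword.
        if (isset($_REQUEST[$field])
            && strstr(strtolower(stripslashes($_REQUEST[$field])), $spam_url)) {
            die('Oops, blocked.');
        }
    }
}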
If you use WordPress and prefer managing your blacklist there, you can set up a cron job to regenerate the text file from the blacklist in the database once an hour or so. If the list only lives in the database, it’s harder to leverage it from other applications.
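A minimal sketch of what that cron script could look like, assuming the blacklist is stored in WordPress’s blacklist_keys option (newline-separated, which is where the comment blacklist lives) and the script sits in the blog root; adjust the include for your setup:

// dump_blacklist.php (hypothetical name): write the WordPress
// comment blacklist out to the shared spam_urls.txt file.
require_once('wp-config.php'); // bootstraps WordPress

$keys = get_option('blacklist_keys');
if ($keys) {
    $fp = fopen('spam_urls.txt', 'w');
    fwrite($fp, trim($keys) . "\n");
    fclose($fp);
}

An hourly crontab entry along the lines of 0 * * * * php /path/to/dump_blacklist.php (the path is a placeholder) keeps the file fresh.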
Great idea. One other thing you could snag with the cron job would be a published blacklist to augment your own. That would be nice because then you wouldn’t be starting from scratch.
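A rough sketch of that merge, assuming the published list is just a plain text file at some URL (the URL below is a placeholder, and file() needs allow_url_fopen enabled to read it):

// Merge a published blacklist into the local spam_urls.txt.
$published = file('http://example.com/published_blacklist.txt'); // placeholder URL
$local     = file('spam_urls.txt');

$merged = array();
foreach (array_merge($published, $local) as $line) {
    $line = strtolower(rtrim($line));
    if ($line != '') {
        $merged[$line] = true; // array keys give us de-duplication for free
    }
}

$fp = fopen('spam_urls.txt', 'w');
fwrite($fp, implode("\n", array_keys($merged)) . "\n");
fclose($fp);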
Two quick tips: use chop or rtrim instead of that str_replace, and instead of strstr, do the comparison with strpos, checking the result against false with the !== operator. strstr is wasteful when you’re just testing for a match because it returns the rest of the string after the match.
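With both tips applied, the check would look something like this:

$spam_urls = file("spam_urls.txt");
$field = strtolower(stripslashes($_REQUEST["field_name"]));
foreach ($spam_urls as $spam_url) {
    // rtrim drops the trailing newline (and any other trailing whitespace).
    $spam_url = rtrim(strtolower($spam_url));
    if ($spam_url == "") {
        continue;
    }
    // strpos returns the match position or false, so compare with !==;
    // unlike strstr, it never copies the rest of the string.
    if (strpos($field, $spam_url) !== false) {
        die('Oops, blocked.');
    }
}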