blog spam - a solution
Today, this blog got its first ever spam, via the trackback interface. How annoying. Here's how I've stopped it (yes, the regexes could be better, and the parse_url() call eliminated, but it's late and this is a quick hack):
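<?php
// Returns false if $ip (address or host name) is on any of the block lists.
function ne_rbl_check($ip) {
    static $lists = array('.sbl-xbl.spamhaus.org');
    // gethostbyname() returns its argument unchanged when the lookup fails
    $ip = gethostbyname($ip);
    foreach ($lists as $bl) {
        // the query name is the reversed octets plus the list's zone
        $octets = array_reverse(explode('.', $ip));
        $h = implode('.', $octets) . $bl;
        $x = gethostbyname($h);
        if ($h != $x) {
            return false; // the name resolved, so the address is listed
        }
    }
    return true;
}

// Returns false if any argument is a listed IP or contains a URL on a listed host.
function ne_surbl_checks() {
    $things = func_get_args();
    foreach ($things as $thing) {
        if (preg_match('/^\\d+\\.\\d+\\.\\d+\\.\\d+$/', $thing)) {
            if (!ne_rbl_check($thing)) return false;
        }
        // preg_match_all() takes $m by reference, so it must be a real variable
        $m = array();
        if (preg_match_all('~(http|https|ftp|news|gopher)://([^ ]+)~si', $thing, $m, PREG_SET_ORDER)) {
            foreach ($m as $match) {
                $url = parse_url($match[0]);
                // skip URLs with no parsable host
                if (isset($url['host']) && !ne_rbl_check($url['host'])) return false;
            }
        }
    }
    return true;
}
?>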
These two functions implement RBL and SURBL checks. RBLs, as you probably already know, are real-time block lists; you can look up an IP address in a block list using DNS, and if you get a record back, that address is in the block list. The first of the two functions implements this, in a bit of a lame hackish way.
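To make the DNS trick a bit more concrete, here is a small sketch of how the name that gets looked up is put together: the octets of the address are reversed and the block list's zone is appended. ne_rbl_query_name() is a made-up helper for illustration only, not something from my blog code, and it only handles IPv4.

```php
<?php
// Hypothetical helper: build the DNS name to query for an IPv4 address.
function ne_rbl_query_name($ip, $zone) {
    // Reverse the octets: 192.0.2.1 becomes 1.2.0.192
    $octets = array_reverse(explode('.', $ip));
    // Append the block list zone, tolerating a leading dot in $zone
    return implode('.', $octets) . '.' . ltrim($zone, '.');
}

echo ne_rbl_query_name('192.0.2.1', '.sbl-xbl.spamhaus.org'), "\n";
// prints 1.2.0.192.sbl-xbl.spamhaus.org
```

If a lookup of that name returns a record, the address is listed; if the lookup fails, it isn't.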
The second function implements content-based checks in the style of SURBL: the text is scanned for things that look like IP addresses or URLs, and those IP addresses or host names are extracted from the content and then looked up in the RBL using the first function.
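The extraction step on its own might look something like the sketch below. ne_extract_hosts() is a hypothetical helper (it isn't part of the code above), and its regex is just one slightly stricter guess at what counts as a URL; it does no DNS lookups, it only collects the candidates you would then pass to the RBL check.

```php
<?php
// Hypothetical helper: collect IPs and URL host names found in a piece of text.
function ne_extract_hosts($text) {
    $hosts = array();
    // A bare dotted-quad string is treated as an IP address
    if (preg_match('/^\d+\.\d+\.\d+\.\d+$/', $text)) {
        $hosts[] = $text;
    }
    $m = array();
    // Stop at whitespace, angle brackets and quotes so markup isn't swallowed
    if (preg_match_all('~(?:https?|ftp|news|gopher)://[^\s<>"\']+~i', $text, $m)) {
        foreach ($m[0] as $url) {
            $host = parse_url($url, PHP_URL_HOST);
            if (is_string($host) && $host !== '') {
                $hosts[] = strtolower($host);
            }
        }
    }
    // Drop duplicates and reindex from 0
    return array_values(array_unique($hosts));
}

print_r(ne_extract_hosts('see http://Spam.example/buy and https://other.example:8080/x?y=1'));
// the array holds spam.example and other.example
```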
Why is this good? A comment spammer will typically want to inject a link to their site onto your blog, and you can be fairly sure that their site is listed in a good RBL. The RBL used in my sample above is an aggregation of the SBL and XBL lists, which contain known spammers and known zombie/exploited machines, so it should catch most of them.
Now to hook it up to the blog; this snippet is taken from my trackback interface:
<?php
if (!ne_surbl_checks(get_ip(), $_REQUEST['excerpt'], $_REQUEST['url'], $_REQUEST['blog_name'])) {
    respond('you appear to be on SBL/XBL, or referring to content that is', 1);
}
?>
get_ip() is a function that determines the IP address of the person submitting the page. I haven't included it here for the sake of brevity; it's fairly simple to code one, but keep in mind that it needs to be aware of HTTP proxies. respond() returns an appropriate error message to the person making the trackback and exits the script.
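For what it's worth, a proxy-aware get_ip() could be sketched roughly as follows. This is not the function my blog uses. ne_client_ip(), the trusted proxy list and the 127.0.0.1 default are all assumptions made for illustration. The idea is to believe the X-Forwarded-For header only when the connection itself comes from a proxy you control, since anyone can forge that header.

```php
<?php
// Hypothetical sketch of a proxy-aware client IP lookup.
// $server is normally $_SERVER; $trusted_proxies lists proxy IPs you control.
function ne_client_ip(array $server, array $trusted_proxies = array()) {
    $remote = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';
    // Only look at the header when the peer is one of our own proxies
    if (in_array($remote, $trusted_proxies, true) && !empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header is a comma-separated chain; walk it from the right and
        // return the first address that is not one of our proxies
        $chain = array_reverse(array_map('trim', explode(',', $server['HTTP_X_FORWARDED_FOR'])));
        foreach ($chain as $candidate) {
            if (filter_var($candidate, FILTER_VALIDATE_IP) === false) {
                break; // malformed entry: stop trusting the header
            }
            if (!in_array($candidate, $trusted_proxies, true)) {
                return $candidate;
            }
        }
    }
    return $remote;
}

// Assumption: the only proxy in front of the blog runs on the same machine
function get_ip() {
    return ne_client_ip($_SERVER, array('127.0.0.1'));
}
```

With no trusted proxies configured this simply returns REMOTE_ADDR, which is the safe default if you aren't sure what sits in front of your web server.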
And that's all there is to it; you can do similar things with your comments submission and pingback interfaces.
Enjoy.