Lately I’ve been bombarded by a new kind of spam comment, about 50 a day. It’s been slipping through my MT-Blacklist filters because it creates intelligible sentences by varying verbs (like “check” and “visit”) and nouns (like “site” and “pages”). I sometimes see the same spam comments when I’m browsing other sites, so I figured I would post the regular expression I wrote to block it, as I did with one a few months ago, in case anyone happens to be searching for one.
(check|visit)[\w\-_.]*(pages|sites|information|info)[\w\-_. ]*
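For anyone who wants to try the expression before adding it to a filter, here is a minimal Python sketch. The sample comments are made up for illustration, and the `WITH_SPACE` variant is my own assumption, not part of the expression above: it only adds a space to the first character class so that ordinary words can sit between the verb and the noun.

```python
import re

# The expression from the post, copied verbatim.
POSTED = re.compile(r"(check|visit)[\w\-_.]*(pages|sites|information|info)[\w\-_. ]*")

# Assumed variant (not from the post): a space added to the first character
# class so that ordinary words may sit between the verb and the noun.
WITH_SPACE = re.compile(r"(check|visit)[\w\-_. ]*(pages|sites|information|info)[\w\-_. ]*")

# Made-up sample comments, for illustration only.
SAMPLES = [
    "check_the_pages about poker",
    "visit-my-sites.today",
    "Please check the pages about poker",
    "Nice post, thanks!",
]


def matches(pattern, comment):
    """Return True if the pattern is found anywhere in the comment."""
    return pattern.search(comment) is not None


for comment in SAMPLES:
    print(f"{comment!r}: posted={matches(POSTED, comment)}, "
          f"with_space={matches(WITH_SPACE, comment)}")
```

In Python's `re` module, the posted expression catches the first two samples, where the words are joined by underscores, hyphens, or dots, but not the third, where they are separated by spaces. The assumed variant catches all three. Neither flags the last, harmless comment. A looser pattern is also more likely to catch legitimate comments, so it is worth running against a few real ones first.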
This has been the most difficult spam variation I’ve had to deal with. The one weakness of most comment spam is that it’s bound to a static website address. Since spam is usually generated by robots, it follows patterns that can be matched in order to block it. The key is figuring out what the pattern is: a recurring IP address (very unlikely and unreliable) or a recurring website address (most likely). This one is different, though, because the advertised websites keep changing. Not only that, but the sentences used to present the site are also inconsistent. As a result, the pattern is more complex.
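To illustrate why a changing address defeats the usual approach, here is a rough sketch of address-based blocking. The domain names and the `blocked_by_domain` helper are hypothetical placeholders, not how MT-Blacklist works internally.

```python
# Hypothetical blacklist entries; the domains are placeholders, not real spam sources.
BLACKLISTED_DOMAINS = ["cheap-pills.example", "casino-bonus.example"]


def blocked_by_domain(comment):
    """Return True if the comment mentions any blacklisted domain."""
    text = comment.lower()
    return any(domain in text for domain in BLACKLISTED_DOMAINS)


# Made-up comments: the second advertises the same thing from a new address.
known = "Great post! http://cheap-pills.example/buy"
rotated = "Great post! http://cheap-pills-2.example/buy"

print(blocked_by_domain(known))
print(blocked_by_domain(rotated))
```

The first comment is blocked because its address is on the list. The second slips through because the address changed, which is why this kind of spam has to be matched on its wording instead.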

