I’ve been bombarded lately (about 50 a day) by a new kind of spam comment. It’s been slipping through my MT-Blacklist filters because it creates intelligible sentences by varying verbs (like “check” and “visit”) and nouns (like “site” and “pages”). Sometimes, when I’m browsing other sites, I see the same spam comments, so I figured I would post the regular expression I wrote to block it, in case anyone happens to be searching for one, as I did with the one I wrote a few months ago.

(check|visit)[\w\-_.]*(pages|sites|information|info)[\w\-_. ]*
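
As a rough illustration of what the expression above does and doesn’t catch, here is a minimal Python sketch. The function name `is_spam` and the sample comments are made up for illustration, and Python’s `re` module stands in for whatever regex engine MT-Blacklist uses, so treat it as an approximation rather than the filter’s actual behavior.

```python
import re

# The pattern from the post, unchanged. Note that the first character class
# contains no space, so the verb and noun must be joined by word characters,
# hyphens, underscores, or dots for the pattern to match.
SPAM_PATTERN = re.compile(
    r"(check|visit)[\w\-_.]*(pages|sites|information|info)[\w\-_. ]*"
)


def is_spam(comment: str) -> bool:
    """Return True if the comment contains text matching the pattern.

    Hypothetical helper for illustration; matching is case-sensitive here,
    which is an assumption about how the filter is configured.
    """
    return SPAM_PATTERN.search(comment) is not None


# Made-up sample comments, not real spam from the blog.
samples = [
    "Please visit-sites for cheap pills",
    "check_the_pages about poker",
    "Please check the pages",
    "Great post, thanks!",
]

for text in samples:
    print(f"{is_spam(text)!s:5}  {text}")
```

As written, the expression matches the first two samples but not “Please check the pages”, because a space between the verb and the noun falls outside the first character class. If the spam separates those words with spaces, that class would presumably need a space as well, the way the trailing class already has one; that is a guess about intent, not something stated here.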

This has been the most difficult spam variation I’ve had to deal with. The one weakness of most comment spam is that it’s bound to a static website address. Since spam is usually generated by robots, there are patterns that can be matched in order to block it. The key is figuring out what the pattern is: a recurring IP address (very unlikely, and unreliable) or a recurring website address (most likely). This one is different, though, because the advertised websites keep changing. Not only that, but the sentences used to present the site are also inconsistent. The pattern, as a result, is more complex.