Phil Factor's Phrenetic Phoughts

Simple-Talk columnist
The wilder shores of Transact SQL    Phil on Twitter   Phil on SQL Server Central"

TSQL Spam-killer

Published Friday, April 11, 2008 7:53 AM

It's very refreshing to take a break from a tedious bit of routine code to try out something unusual or tricky. Sometimes one can be quite startled by finding out how easy it is to do in TSQL.

Jonathan Snook's blog is my favourite "developer's blog" of all. It is packed with ideas. I was reading it the other day and came across How I built an effective blog comment spam blocker. This, I thought, was great fun, because it told you how to build it but didn't give you the code. The algorithm scores a string according to how likely it is to be spam. For everything in a comment that the filter likes like, the string gets a point. For everything it don't like, the string loses points. If the string totals 1 or higher, it lands on the site as a valid comment. If it scores a 0, it's set for moderation, If it's below 0, it's marked as spam. Some of the things that score negative are rather amusing ( .pl or .cn sites for example) but done from experience.

With Jonathan's permission, here is a TSQL implementation. I've simplified it slightly, to fit it in a blog, by using temporary tables for the word-banks, dodgy URL suffixes and so on. Normally, you'd put these in permanent tables and fine-tune the system as the language of spam changes, without having to alter the code.

CREATE FUNCTION [dbo].[fnSpamScore] (@comment VARCHAR(MAX))
RETURNS INT
AS
BEGIN
DECLARE
@links INT
DECLARE
@LenComment INT
DECLARE
@points INT
DECLARE
@hit INT
DECLARE
@ii INT
DECLARE
@Start INT
DECLARE
@Length INT
DECLARE
@KeywordLength INT
DECLARE
@Starta INT
DECLARE
@urls TABLE(url VARCHAR(255))

DECLARE @SpamWords TABLE(spamword VARCHAR(30))
--stock the spam-words table
INSERT INTO @spamwords(spamword)
SELECT 'Levitra' UNION SELECT 'viagra' UNION SELECT 'casino' UNION
SELECT
'cialis' UNION SELECT 'nude' UNION SELECT 'tramadol' UNION
SELECT
'phentermine' UNION SELECT 'xanax' UNION SELECT 'alprazolam' UNION
SELECT
'amoxicillin' UNION SELECT 'xxx' UNION SELECT 'porn'

DECLARE @commentstart TABLE(word VARCHAR(30))
---stock the comment-start table
INSERT INTO @commentstart(word)
SELECT 'interesting' UNION SELECT 'cool' UNION SELECT 'sorry' UNION
SELECT
'nice'

DECLARE @keyWords TABLE(keyword VARCHAR(30))
--stock the keyword table
INSERT INTO @keywords(keyword)
SELECT '.html' UNION SELECT 'free' UNION SELECT '?' UNION
SELECT
'&' UNION SELECT '.info' UNION SELECT '.pl' UNION
SELECT
'.de' UNION SELECT '.cn'

DECLARE @URLStarts TABLE(start VARCHAR(30),offset INT)
--stock the URL Start table
INSERT INTO @URLStarts(start,offset)
      
SELECT 'HREF=',5--unquoted
UNION SELECT 'HREF="',6--quoted
UNION SELECT 'HREF=" ',7--I've seen this trick in spam
UNION SELECT 'HTTP://',0--not in an anchor
UNION SELECT 'HTTPS://',0--not in an anchor
UNION SELECT 'mailto://',0--not in an anchor

--Get the length of the comments and initialise things
SELECT @LenComment=LEN(REPLACE(@comment,' ','|')),@points=0,
  
@ii=@LenComment,@links=0
WHILE @ii>0--find every URL in the comments and put them in a table
  
BEGIN--check for the next HREF, possibly 'quoted'
  
SELECT @start=0
  
SELECT TOP 1 @start=hit,@keywordLength=offset FROM (
      
SELECT [hit]=PATINDEX ('%'+start+'%',RIGHT(@comment,@ii)),offset
              
FROM @urlStarts)f
    
WHERE hit >0 ORDER BY hit ASC, offset DESC

   IF
COALESCE(@start,0)=0 BREAK--no more?
   --so we isolate the actual URL in the anchor
  
SELECT @Length= PATINDEX ('%["> ]%',
              
RIGHT(@comment,@ii-@start-@keywordLength))
  
SELECT @links=@links+1,--increment the URL tally
          
@Length=CASE @length WHEN 0 THEN @ii ELSE @length END
                                                      
--no termination?
  
INSERT INTO @urls(url) --add to our URL table
          
SELECT LEFT(SUBSTRING(RIGHT(@comment,@ii),
              
@start+@keywordLength,@Length),255)
  
--and reduce the length of the string we look at past the URL
  
SELECT @ii=@ii-@start-@keywordLength-@Length
  
END
--How many links are in the body?  if more than 2 then -1 point per link
-- if Less than 2 then +2 points
SELECT @points=@points+ CASE WHEN @links < 2 THEN 2 ELSE -@links END
--How long is the comment? More than 20 characters and there's no links
                                                       --then + 2 points
-- Less than 20 characters then -1 point
SELECT @points=@points+CASE WHEN @links=0
      
AND @lencomment>20 THEN +2 ELSE -1 END

--Number of previous comments from same ID Approved comments +1 point per
-- Marked as spam  -1 point per

--Keyword search in body of comments (viagra, casino, etc.) -1 point per
SELECT @points=@points-COUNT(*)
          
FROM @spamwords WHERE CHARINDEX(spamword,@comment)>0

--URLs that have certain words or characters in them
--                                .html, .info, ?, & or free -1 point per
--or URLs that have certain TLDs .de, .pl, or .cn (sorry guys) -1 point
SELECT @points=@points-COUNT(*)
          
FROM @keywords INNER JOIN @urls ON CHARINDEX(keyword,url)>0

--URL length More than 30 characters -1 point
SELECT @points=@points-COUNT(*)
          
FROM @urls WHERE LEN(url)>30
--Body starts with... Interesting, Sorry, Nice or Cool. -10 points
SELECT @points=@points-(10*COUNT(*))
          
FROM @commentStart WHERE CHARINDEX(word,@comment)>0

--Random character match 5 consonannts -1 point per
SELECT @ii=@LenComment
WHILE @ii>0
  
BEGIN
   SELECT
@hit= PATINDEX ('%'+REPLICATE('[bcdfghjklmnpqrstvwxyz]',5)
                       +
'%',RIGHT(@comment,@ii))
  
IF @hit=0 BREAK
   SELECT
@points=@points-1, @ii=@ii-(@hit+5)
  
END
RETURN
@points
END

SELECT
dbo.fnSpamScore('this is a perfectly legitimate comment that
points out that phil''s code is horribly broken due to him being called
out for a beer half way through writing it.'
)
--  +4
SELECT dbo.fnSpamScore('Cool. Buy herbal viagra at http:\\DodgySite.cn
and impress your neighbours.'
)
--  -7

The system seems pretty effective. If a spam slips through, or a legitimate comment falls foul of the system, you can tweak it very simply . It is in place in the forthcoming Simple-Talk/SSC Wiki which I'm currently helping to build.

As a strange by-product of developing this, I've taken a sudden liking to spam, purely to test the efficiency of the algorithm. Unfortunately, Simple-Talk is a martyr to it, but I can't find a way of plugging this routine into community server. Still, it means we have a vast test-bank of the stuff to check the routine with!

It occurs to me that there must be a number of ways of writing the SQL code to Jonathan Snooks algorithm. What is the fastest and most effective way? Perhaps we should have a 'Lionel-style' competition.

For anyone without Firefox who can't copy n' paste, here is the source file

Comments

 

Phil Factor said:

For a good set of tips for reducing comment spam to negligible levels, have a read of this
http://bjorkoy.com/past/2008/1/6/bulletproof_protection_against_comment_spam/
April 19, 2008 10:59 AM
 

spooky1966 said:

Great article, can't wait to try this.  I tried to copy and paste the code in SQL Server Management Studio and it is all goofed up.  Can you possibly attach a link to a txt file?
April 25, 2008 8:34 AM
 

Phil Factor said:

I've added a link at the end of the Blog. Cut n' Paste seems to work fine with Firefox, but IE seems to make a mess of it.
April 25, 2008 9:02 AM
 

mgp22 said:

Great script.. Can you possibly show how it would be done with permanent tables?  also, would this proc be effective in a production web site environment with tons of hits?
tx,
Matt
April 30, 2008 12:27 PM
 

Phil Factor said:

If you were getting enough comments submitted for performance to become an issue then you'd be running a gigantic site. On my system it runs so fast that I have problems measuring how long it takes.
April 30, 2008 12:55 PM
 

mgp22 said:

Ok - am a relative newbie.. i understand.  Does it matter if it's done with "permanent" tables or not (for storing spam keywords - e.g. xanax)?
April 30, 2008 1:08 PM
 

Phil Factor said:

I'd stick with what's there for the time being- get a feel for how it works; play with it.  If performance later becomes an issue, or if you want someone else to be able to change the word bank, then alter it to permanent tables. You can always rewrite the algorithm in a compiled procedural language later on.
April 30, 2008 1:14 PM
 

mgp22 said:

Ok Thanks a lot!!
Matt
April 30, 2008 2:15 PM
You need to sign in to comment on this blog

















<April 2008>
SuMoTuWeThFrSa
303112345
6789101112
13141516171819
20212223242526
27282930123
45678910
Virtualizing Exchange: points for discussion
 With the increasing acceptance of the use of Virtualization as a means of providing server... Read more...

Encouraging .NET Reflector Add-ins
 Jason Haley is well-known for the resources he's provided to developers who wish to extend Reflector's... Read more...

Using .NET Reflector Add-ins
 .NET Reflector by itself is great, but it really comes into its own with the help of some add-ins. Here... Read more...

Unique Experiences!
 You'd have thought that a unique constraint was an easy concept - Not a bit of it; it can cause a lot... Read more...

Dynamic Data Templates in ASP.NET 3.5
 Gayani gives an introduction to Dynamic Data Templates in ASP.NET 3.5 and explains how one can save a... Read more...