It's very refreshing to take a break from a tedious bit of routine code to try out something unusual or tricky. Sometimes one can be quite startled by finding out how easy it is to do in TSQL.
Jonathan Snook's blog is my favourite "developer's blog" of all. It is packed with ideas. I was reading it the other day and came across How I built an effective blog comment spam blocker. This, I thought, was great fun, because it told you how to build it but didn't give you the code. The algorithm scores a string according to how likely it is to be spam. For everything in a comment that the filter likes like, the string gets a point. For everything it don't like, the string loses points. If the string totals 1 or higher, it lands on the site as a valid comment. If it scores a 0, it's set for moderation, If it's below 0, it's marked as spam. Some of the things that score negative are rather amusing ( .pl or .cn sites for example) but done from experience.
With Jonathan's permission, here is a TSQL implementation. I've simplified it slightly, to fit it in a blog, by using temporary tables for the word-banks, dodgy URL suffixes and so on. Normally, you'd put these in permanent tables and fine-tune the system as the language of spam changes, without having to alter the code.
CREATE FUNCTION [dbo].[fnSpamScore] (@comment VARCHAR(MAX))
RETURNS INT
AS
BEGIN
DECLARE @links INT
DECLARE @LenComment INT
DECLARE @points INT
DECLARE @hit INT
DECLARE @ii INT
DECLARE @Start INT
DECLARE @Length INT
DECLARE @KeywordLength INT
DECLARE @Starta INT
DECLARE @urls TABLE(url VARCHAR(255))
DECLARE @SpamWords TABLE(spamword VARCHAR(30))
--stock the spam-words table
INSERT INTO @spamwords(spamword)
SELECT 'Levitra' UNION SELECT 'viagra' UNION SELECT 'casino' UNION
SELECT 'cialis' UNION SELECT 'nude' UNION SELECT 'tramadol' UNION
SELECT 'phentermine' UNION SELECT 'xanax' UNION SELECT 'alprazolam' UNION
SELECT 'amoxicillin' UNION SELECT 'xxx' UNION SELECT 'porn'
DECLARE @commentstart TABLE(word VARCHAR(30))
---stock the comment-start table
INSERT INTO @commentstart(word)
SELECT 'interesting' UNION SELECT 'cool' UNION SELECT 'sorry' UNION
SELECT 'nice'
DECLARE @keyWords TABLE(keyword VARCHAR(30))
--stock the keyword table
INSERT INTO @keywords(keyword)
SELECT '.html' UNION SELECT 'free' UNION SELECT '?' UNION
SELECT '&' UNION SELECT '.info' UNION SELECT '.pl' UNION
SELECT '.de' UNION SELECT '.cn'
DECLARE @URLStarts TABLE(start VARCHAR(30),offset INT)
--stock the URL Start table
INSERT INTO @URLStarts(start,offset)
SELECT 'HREF=',5--unquoted
UNION SELECT 'HREF="',6--quoted
UNION SELECT 'HREF=" ',7--I've seen this trick in spam
UNION SELECT 'HTTP://',0--not in an anchor
UNION SELECT 'HTTPS://',0--not in an anchor
UNION SELECT 'mailto://',0--not in an anchor
--Get the length of the comments and initialise things
SELECT @LenComment=LEN(REPLACE(@comment,' ','|')),@points=0,
@ii=@LenComment,@links=0
WHILE @ii>0--find every URL in the comments and put them in a table
BEGIN--check for the next HREF, possibly 'quoted'
SELECT @start=0
SELECT TOP 1 @start=hit,@keywordLength=offset FROM (
SELECT [hit]=PATINDEX ('%'+start+'%',RIGHT(@comment,@ii)),offset
FROM @urlStarts)f
WHERE hit >0 ORDER BY hit ASC, offset DESC
IF COALESCE(@start,0)=0 BREAK--no more?
--so we isolate the actual URL in the anchor
SELECT @Length= PATINDEX ('%["> ]%',
RIGHT(@comment,@ii-@start-@keywordLength))
SELECT @links=@links+1,--increment the URL tally
@Length=CASE @length WHEN 0 THEN @ii ELSE @length END
--no termination?
INSERT INTO @urls(url) --add to our URL table
SELECT LEFT(SUBSTRING(RIGHT(@comment,@ii),
@start+@keywordLength,@Length),255)
--and reduce the length of the string we look at past the URL
SELECT @ii=@ii-@start-@keywordLength-@Length
END
--How many links are in the body? if more than 2 then -1 point per link
-- if Less than 2 then +2 points
SELECT @points=@points+ CASE WHEN @links < 2 THEN 2 ELSE -@links END
--How long is the comment? More than 20 characters and there's no links
--then + 2 points
-- Less than 20 characters then -1 point
SELECT @points=@points+CASE WHEN @links=0
AND @lencomment>20 THEN +2 ELSE -1 END
--Number of previous comments from same ID Approved comments +1 point per
-- Marked as spam -1 point per
--Keyword search in body of comments (viagra, casino, etc.) -1 point per
SELECT @points=@points-COUNT(*)
FROM @spamwords WHERE CHARINDEX(spamword,@comment)>0
--URLs that have certain words or characters in them
-- .html, .info, ?, & or free -1 point per
--or URLs that have certain TLDs .de, .pl, or .cn (sorry guys) -1 point
SELECT @points=@points-COUNT(*)
FROM @keywords INNER JOIN @urls ON CHARINDEX(keyword,url)>0
--URL length More than 30 characters -1 point
SELECT @points=@points-COUNT(*)
FROM @urls WHERE LEN(url)>30
--Body starts with... Interesting, Sorry, Nice or Cool. -10 points
SELECT @points=@points-(10*COUNT(*))
FROM @commentStart WHERE CHARINDEX(word,@comment)>0
--Random character match 5 consonannts -1 point per
SELECT @ii=@LenComment
WHILE @ii>0
BEGIN
SELECT @hit= PATINDEX ('%'+REPLICATE('[bcdfghjklmnpqrstvwxyz]',5)
+'%',RIGHT(@comment,@ii))
IF @hit=0 BREAK
SELECT @points=@points-1, @ii=@ii-(@hit+5)
END
RETURN @points
END
SELECT dbo.fnSpamScore('this is a perfectly legitimate comment that
points out that phil''s code is horribly broken due to him being called
out for a beer half way through writing it.')
-- +4
SELECT dbo.fnSpamScore('Cool. Buy herbal viagra at http:\\DodgySite.cn
and impress your neighbours.')
-- -7
The system seems pretty effective. If a spam slips through, or a legitimate comment falls foul of the system, you can tweak it very simply . It is in place in the forthcoming Simple-Talk/SSC Wiki which I'm currently helping to build.
As a strange by-product of developing this, I've taken a sudden liking to spam, purely to test the efficiency of the algorithm. Unfortunately, Simple-Talk is a martyr to it, but I can't find a way of plugging this routine into community server. Still, it means we have a vast test-bank of the stuff to check the routine with!
It occurs to me that there must be a number of ways of writing the SQL code to Jonathan Snooks algorithm. What is the fastest and most effective way? Perhaps we should have a 'Lionel-style' competition.
For anyone without Firefox who can't copy n' paste, here is the source file