Click here to monitor SSC
Robyn Page and Phil Factor

RSS Newsfeed Workbench

06 July 2007

/* Have you ever fancied being able to put Simple-Talk's content into a database? We haven't either, but we'll do it just for practice. The objective of this workbench is to show just how easy it is to implement an RSS-based datafeed in SQL Server 2005. The purpose is to try out downloading an XML file from the internet, checking it for validity, and if not, seeing if it is a redirect. We'll show a cool SQL Server 2005 way of reading a text file into a VARCHAR(MAX) variable. We'll do some shredding of an XML file too. By the time we've finished we hope we will have illustrated some useful techniques and given you an application that you can turn into a production-quality newsreader.

It is very easy to read RSS feeds and OPML feeds into SQL Server 2005. This is because the files are XML and are easily eaten by the XML functions. RSS, which doesn't actually stand for anything other than RDF Site Summary, is a simple way of publishing, or broadcasting, frequently updated Web content such as articles and blog entries. OPML, originally conceived for transferring outlines, is a format that has been used widely for distributing collections of RSS feeds. However, it has a large number of uses in transferring simple structured list-based information. As it is used for exchanging lists of RSS feeds, we'll show how to read in an OPML list of feeds

Please remember before we start that this code is for illustration and is not 'production quality'. You will need to sort out the security issues (see the process delegation workbench) and add logging and more error-checking.

We won't attempt to store all the information that RSS2 defines, simply because we can't find any sites that use all this information. In fact, most feeds use only a small part of the information and few aggregators seem to display the extra information anyway. We'll boil it down to two relational tables */

/* Firstly, create a new database called RSSFeedWorkbench. Make sure you have the rights to execute xp_cmdshell too!*/

.USE RSSFeedWorkbench 
GO
IF EXISTS ( SELECT  *
            
FROM    sysobjects --delete it if it exists
            
WHERE   id OBJECT_ID('RSSitem') ) 
  
DROP TABLE [RSSitem]
/*
We create two tables. RSSItem contains every news item that we collect. 
It is linked to an RSSFeed table which contains every feed that we want to 
read from. We have not included every parameter of a feed item because
they are rarely used. */
CREATE TABLE dbo.RSSitem
  
(
    
RSSitem_ID INT IDENTITY(11)
                   NOT NULL,
    
RSSFeed_ID INT NULL,
    
title VARCHAR(80) NOT NULL,
    
link VARCHAR(200) NULL,
    
[description] VARCHAR(MAX) NULL,
    
PubDate VARCHAR(40) NULL,
    
[GUID] VARCHAR(80) NULL,
    
insertiondate DATETIME NOT NULL
                           
CONSTRAINT dfRssItemDate DEFAULT GETDATE() ),
    
CONSTRAINT PK_RSSitem PRIMARY KEY CLUSTERED RSSitem_ID ASC )
  )
ON
  
[PRIMARY] 
/****** Object:  Table  dbo . RSSFeed     Script Date: 07/02/2007 10:32:19 ******/
IF EXISTS ( SELECT  *
            
FROM    sysobjects --delete it if it exists
            
WHERE   id OBJECT_ID('RSSFeed') ) 
  
DROP TABLE [RSSFeed]
/* the RSSFeed table contains every RSS feed that we want to read from
.
It is used to poll around the feeds. This routine checks to see whether 
the TYPE field is set to RSS.(or RSS2 etc) If it isn't then it isn't
read. */ 
CREATE TABLE dbo.RSSFeed
  
(
    
RSSFeed_ID INT IDENTITY(11)
                   NOT NULL,
    
title VARCHAR(80) NOT NULL,
    
link VARCHAR(200) NOT NULL,
    
[description] VARCHAR(2000) NOT NULL
                                
DEFAULT '',
    
[language] VARCHAR(10) NOT NULL
                           
DEFAULT 'en-US',
    
Category VARCHAR(100) NOT NULL
                          
DEFAULT 'Generic',
    
docs VARCHAR(80) NULL,
    
Generator VARCHAR(80) NULL,
    
ManagingEditor VARCHAR(80) NULL,
    
Webmaster VARCHAR(80) NULL,
    
actualURL VARCHAR(200) NULL,
    
[type] VARCHAR(80) NOT NULL
                       
DEFAULT 'RSS',--is it RSS or ATOM?
    
insertiondate DATETIME NOT NULL
                           
DEFAULT GETDATE() ),
    
CONSTRAINT PK_RSSFeed PRIMARY KEY CLUSTERED RSSFeed_ID ASC )
  )
ON
  
[PRIMARY] 

GO

ALTER TABLE  dbo.RSSitem WITH CHECK
ADD
  CONSTRAINT 
FK_RSSitem_RSSFeed FOREIGN KEY RSSFeed_ID 
      
REFERENCES dbo.RSSFeed RSSFeed_ID )
GO
ALTER TABLE  dbo.RSSitem CHECK CONSTRAINT  FK_RSSitem_RSSFeed 
GO

CREATE NONCLUSTERED INDEX [idxRSSFeedTitle] ON [dbo].[RSSFeed] [title] ASC )

CREATE NONCLUSTERED INDEX [idxTitle] ON [dbo].[RSSitem] [title] ASC )
GO

/* Although there are more portable ways of getting the contents of the URL, we'll use CURL.EXE, which is free, and very easy to install, and useful for a wide range of purposes. You'll need to download and install this to get the Workbench to run. You'll need to install the OpenSSL package too from here .

Some RSS feeds will try to stop v non-browser software from reading an RSS feed. Curl has useful facilities that allow it to mimic a browser to the extent of allowing cookies and letting you specify the user agent. */

/* this is the stored procedure that grabs the contents of the feed, checks to see if is one we know about and, if not, saves its attributes. then it saves any items it has not already saved. As it saves the url of the feed, you can use it to add the feed to your repository, and then it will automatically refresh it (if you install the spUpdateAllFeeds stored procedure in the right place. This stored procedure is more complex than it might be because one can try to access a feed only to find it is a redirection to the actual feed. If this is the case then one has to read the file as an HTML file, get the anchor, and use the contents of the HREF as the source of the newsfeed, before trying again! */

IF EXISTS ( SELECT  *
            
FROM    sysobjects
            
WHERE   id OBJECT_ID('spUpdateRSSArchiveFrom') ) 
  
DROP PROCEDURE [spUpdateRSSArchiveFrom]
GO
CREATE PROCEDURE [dbo].[spUpdateRSSArchiveFrom]
  
@url VARCHAR(200),--URL of the RSS datafeed to register and read from
  
@tempfile VARCHAR(80'c:\rssfeed.xml'--optional file to save as
AS /*
execute spUpdateRSSArchiveFrom 
 'http://www.simple-talk.com/community/forums/rss.aspx?ForumID=142&Mode=0',
 'C:\forums.rss'

*/
  
SET nocount ON
  DECLARE 
@ErrorMessage VARCHAR(100)
  
DECLARE @command NVARCHAR(255--the command string used for spExecuteSQL
  
DECLARE @ExitCode INT--the code returned by the xp_cmdShell procedure
  
DECLARE @ExitCodeASCII VARCHAR(10)--the ascii version of above code
  
DECLARE @Badfile VARCHAR(MAX)--we use this to pull in any HTML files to
--examine if there was an error in reading in the XML file
  
DECLARE @Retries INT--the number of retries we allow
  
DECLARE @Ref VARCHAR(1000)--the reference supplied by the feed redirector
  
SELECT  @retries 0--set up the retry count
  
CREATE TABLE #lines line VARCHAR(2000) )--table used to remember what is
--passed back from the command line when executing xp_cmdShell

  
retry:--yes we use GOTOs. We sometimes do...
  
RAISERROR 'processing %s'01@url )--a way of getting an immediate 
--progress report
/* now we make up the command line for CURL.EXE. You may want to change the
parameters here for various reasons such as a proxy. If an error happens
then whatever CURL prints out is returned to the procedure and printed out
by the error processing */
  
SELECT  @command 'curl "' @url '" -o"' @tempfile
         
'" -A"Mozilla/4.0 (compatible; MSIE 6.0;'
         
+' Windows NT 5.0; .NET CLR 1.1.4322"'
/*Firstly we read the RSSfeed into a file*/
  
INSERT  INTO #lines line )
          
EXECUTE @exitcodexp_cmdshell @command--and execute it
  
SELECT  @ErrorMessage COALESCE(@errorMessage'') + COALESCE(line'')
  
FROM    #lines--put the entire result into one string
  
IF EXISTS ( SELECT  1
              
FROM    #lines
              
WHERE   line LIKE '%is not recognized as an%' 
    
BEGIN--the silly moo hasn't installed CURL!
      
RAISERROR 'Sorry. You must have CURL installed first!'16)
      
RETURN 1
    
END
  IF 
EXISTS ( SELECT  1--an error was returned from CURL
              
FROM    #lines
              
WHERE   line LIKE '%Could not resolve host%' )
    OR EXISTS ( 
SELECT  1
                
FROM    #lines
                
WHERE   line LIKE 'curl: (%' 
    
BEGIN
      RAISERROR 
'Sorry. Could not get RSS feed because %s'161,
        
@ErrorMessage )
      
RETURN 1
    
END
  IF 
@Exitcode <> --ah CURL command line returned an error code
    
BEGIN
      SELECT  
@ExitcodeASCII CONVERT(VARCHAR(5), @Exitcode)
      
SELECT  @ErrorMessage COALESCE(@errorMessage'') + COALESCE(line'')
      
FROM    #lines
      
RAISERROR 'Sorry. Errorcode %s. Curl reports %s'161,
        
@ExitcodeASCII@ErrorMessage )
      
RETURN 1
    
END


/* ...then we read it in...*/
  
DECLARE @RSSfeed XML
  
DECLARE @Feed_ID INT

  BEGIN 
TRY--we will catch errors at this point as they are usually XML
    
SELECT  @Command 'SELECT  @RSSfeed = BulkColumn
FROM    OPENROWSET(BULK ''' 
@TempFile ''', SINGLE_BLOB) AS x ' 
    
EXEC sp_executesql @commandN'@RSSfeed xml output'@RSSfeed OUTPUT 
  
END TRY
  
BEGIN CATCH
 
IF @@Error IN (9413,9422)--if it was an XML parsing error
   
BEGIN
   RAISERROR 
'%s was not valid XML. Was it a redirect?'01@url )
   
SELECT  @Command 'SELECT  @Badfile = BulkColumn
FROM    OPENROWSET(BULK ''' 
@TempFile ''', SINGLE_BLOB) AS x ' 
   
EXEC sp_executesql @command--have a look at it
   --read the file as a TEXT file rather than an XML
     
N'@BadFile varchar(MAX) output'@BadFile OUTPUT 
   
--we examine it to see if it is a redirect file
   
SELECT  @ref SUBSTRING(@BadFille,
            
CHARINDEX('<a href="'@BadFile '<a href="'1)
            + 
91000)
   
SELECT  @ref LEFT(@RefCHARINDEX('"'@ref '"') - 1)
   
IF LEN(@Ref) < 10--check for obvious signs of problems
     
OR @ref NOT LIKE 'http://%' 
     
BEGIN
     RAISERROR 
'The url %s was not a valid redirect!'161@URL )
     
RETURN 1
     
END
        UPDATE 
RSSFeed SET type='BAD' WHERE ActualURL=@URL
   
SELECT  @URL @ref,
       
@Retries @Retries 1
   
IF @retries 
     
GOTO retry
   
END
    RAISERROR 
'The url %s was not a valid RSS feed!'161@URL )
    
RETURN 1
  
END CATCH
/*Now we see if this is a new feed. If it is we add it to the feed 
database. 

*/
  
DECLARE @RSSFeedAttributes TABLE
    
(
      
title VARCHAR(80),
      
link VARCHAR(200),
      
[description] VARCHAR(2000),
      
[language] VARCHAR(10DEFAULT 'en-US',
      
Category VARCHAR(100DEFAULT 'Generic',
      
docs VARCHAR(80),
      
Generator VARCHAR(80),
      
ManagingEditor VARCHAR(80),
      
Webmaster VARCHAR(80)
    ) 
/*
<title>My Title</title>
<link>My Link</link>
<description>My Description</description>
<language>My Language e.g. en-us</language>
<category>Newspapers</category><docs>My Docs URL</docs>
<generator>My RSS Generator</generator>
<managingEditor>My.Editorial@Email.Address</managingEditor>
<webMaster>My.Webmaster@Email.Address</webMaster>
*/ 

  
INSERT  INTO @RSSFeedAttributes
          
(
            
title,
            
link,
            
[description],
            
[language],
            
Category,
            
docs,
            
Generator,
            
ManagingEditor,
            
Webmaster
            
          
)
   
SELECT
     
x.feed.value('title[1]''varchar(80)'AS title,
     
x.feed.value('link[1]''varchar(200)'AS link,
     
x.feed.value('description[1]''varchar(2000)'AS [description],
     
COALESCE(x.feed.value('language[1]''Varchar(10)'), 'en-US'
                                                             
AS [language],
     
COALESCE(x.feed.value('Category[1]''Varchar(100)'),
          
'Generic'AS [Category],
     
x.feed.value('docs[1]''Varchar(80)'AS [docs],
     
x.feed.value('generator[1]''Varchar(80)'AS [Generator],
     
x.feed.value('managingeditor[1]''Varchar(80)'AS [ManagingEditor],
     
x.feed.value('webmaster[1]''Varchar(80)'AS [Webmaster]
   
FROM    @RSSfeed.nodes('//rss/channel'AS feed )

--insert the feed if it doesn't exist
  
INSERT  INTO RSSFeed
          
(
            
title,
            
link,
            
[description],
            
[language],
            
Category,
            
docs,
            
Generator,
            
ManagingEditor,
            
Webmaster,
            
ActualURL
            
          
)
          
SELECT  f.title,
                  
f.link,
                  
f.[description],
                  
f.[language],
                  
f.Category,
                  
f.docs,
                  
f.Generator,
                  
f.ManagingEditor,
                  
f.Webmaster,
                  
@URL
          
FROM    @RSSFeedAttributes f
                  
LEFT OUTER JOIN RSSFeed 
                          
ON RSSfeed.title f.title
                          
AND rssFeed.link rssfeed.link
          
WHERE   rssFeed.rssFeed_ID IS NULL

--..and get the ID of the feed
  
SELECT TOP 1
          
@Feed_ID rssFeed.rssFeed_ID
  
FROM    @RSSFeedAttributes f
          
INNER JOIN RSSFeed ON RSSfeed.title f.title
                                
AND rssFeed.link rssfeed.link
--and add any RSSfeed item that doesn't exist
  
INSERT  INTO RSSitem
      
(
        
RSSFeed_ID,
        
title,
        
link,
        
[description],
        
PubDate,
        
[GUID]
      
)
      
SELECT  @feed_ID,
              
f.title,
              
f.link,
              
f.description,
              
COALESCE(f.pubdate'-'),
              
COALESCE(f.[GUID]'-')
      
FROM    SELECT
                    
x.feed.value('title[1]''varchar(80)'AS title,
                    
x.feed.value('link[1]''varchar(200)'AS link,
                    
x.feed.value('description[1]''varchar(max)'
                                                     
AS [description],
                    
x.feed.value('pubdate[1]''Varchar(40)'AS [pubdate],
                    
CONVERT(VARCHAR(80), x.feed.value('guid[1]',
                                               
'Varchar(2000)')) AS [GUID]
                
FROM    @RSSfeed.nodes('//rss/channel/item'AS feed )
              ) 
f
      
LEFT OUTER JOIN RSSitem ON COALESCE(f.title'-'RSSitem.title
                         
AND COALESCE(f.GUID'-'RSSitem.GUID
                         
AND COALESCE(f.PubDate'-'rssitem.pubdate
      
WHERE   rssitem.RSSitem_ID IS NULL

GO

IF EXISTS ( SELECT  *
            
FROM    sysobjects
            
WHERE   id OBJECT_ID('spReadOPMLFile') ) 
  
DROP PROCEDURE [spReadOPMLFile]
GO
/*
Now we provide a routine for reading in any ompl FILE. This is just a list
so it is relatively easy to read, but note that the list may be 
hierarchical. We are not too concerned about the OPML hierarchy so we
won't bother to record the hierarchy.
*/
CREATE PROCEDURE spReadOPMLFile @Filename VARCHAR(100)
/*
spReadOPMLFile 'S:MyFavouriteRSSFeeds.opml'
*/
AS 
  DECLARE 
@opmlfeed XML
  
DECLARE @command NVARCHAR(255)

  
SELECT  @Command 'SELECT  @opmlfeed = BulkColumn
FROM    OPENROWSET(BULK ''' 
@filename ''', SINGLE_BLOB) AS x ' 
  
EXEC sp_executesql @commandN'@opmlfeed xml output'@opmlfeed OUTPUT 

  
INSERT  INTO RSSfeed
      
(
        
title,
        
link,
        
[description],
        
[language],
        
Category,
        
actualURL,
        
[type]
        
      
)
      
SELECT  f.*
      
FROM    
     
SELECT  
       
x.opml.value('@title''nvarchar(80)'AS title,
       
x.opml.value('@xmlUrl''nvarchar(200)'AS link,
       
COALESCE(x.opml.value('@description',
           
'nvarchar(2000)'), ''AS [description],
       
COALESCE(x.opml.value('@language',
           
'nvarchar(2000)'), 'Een-US'AS [language],
       
'Generic' AS [category],
       
x.opml.value('@xmlUrl''nvarchar(200)'AS actualURL,
       
COALESCE(x.opml.value('@version''nvarchar(80)'),
            
'RSS'AS [type]
     
FROM    @opmlfeed.nodes('/descendant::outline[@xmlUrl]')
         
AS opml )
     ) 
f
              
LEFT OUTER JOIN rssfeed ON f.link rssfeed.link
      
WHERE   rssfeed.rssFeed_ID IS NULL

GO
IF EXISTS ( SELECT  *
            
FROM    sysobjects
            
WHERE   id OBJECT_ID('spUpdateAllFeeds') ) 
  
DROP PROCEDURE [spUpdateAllFeeds]
GO

/* this is the stored procedure that polls all your existing RSS feeds
 and reads in any new news entries*/

CREATE PROCEDURE spUpdateAllFeeds
AS 
  DECLARE 
@command VARCHAR(MAX)

  
SELECT  @Command COALESCE(@command'') + '
execute spUpdateRSSArchiveFrom ''' 
actualURL '''
'
  
FROM    rssFeed
  
WHERE   type LIKE 'RSS%'
  
EXECUTE @command )--execute them as a string
GO

/* In the list of files to download will be some sample OMPL files to try out. Now when you've loaded in a few RSS feeds, all you have to do is to put this stored procedure on the scheduler and this will keep your News up-to-date. Instant Blogrolls! The first person to write an OMPL exporter for this will get a Simple-Talk goodie bag! */

Robyn Page and Phil Factor

Author profile:


Robyn Page has worked as a consultant with Enformatica and USP Networks with a special interest in the provision of broadcast services over IP intranets. She was also a well known actress, being most famous for her role as Katie Williams, barmaid and man-eater in the Television Series Family Affairs. She is currently having a career break to raise a young family.

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 20 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

Search for other articles by Robyn Page and Phil Factor

Rate this article:   Avg rating: from a total of 17 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: ISNULL OR COALESCE?
Posted by: Darren (view profile)
Posted on: Monday, July 09, 2007 at 10:40 AM
Message: hi, was just wondering why u are using coalesce instead of isnull in ur code:

SELECT @ErrorMessage = COALESCE(@errorMessage, '') + COALESCE(line, '')

Thanks.

Subject: Security risks
Posted by: Andy (view profile)
Posted on: Tuesday, July 10, 2007 at 4:14 AM
Message: Are there potential security risks here by allowing the database to access websites? Could a malicious user us it to download and run code on the server?

Subject: Re IsNull and Security Risks.
Posted by: Robyn Page (view profile)
Posted on: Tuesday, July 10, 2007 at 2:14 PM
Message: Darren, is the IsNull function better in some way than the COALESCE function? We use COALESCE because its use seems clearer in meaning, but we're open to persuasion about its use.

Andy,
We never give ordinary users access to xp_cmdshell. We implemented this system to run on the SQL Server Agent under a special user created for the job with just the sp and xp permissions required. We say in paragraph 3 'You will need to sort out the security issues' (this article is to illustrate techniques) but even as it stands it wouldn't have to be unsafe unless your system is wide-open already. If it is, then you have problems enough already!

Subject: IsNull
Posted by: Anonymous (not signed in)
Posted on: Thursday, July 12, 2007 at 10:08 AM
Message: Hi Robyn,
i was asking the question just because i was unaware of any advantages between them both -so i did a little research and it seems they are practically the same perf wise -with IsNull very,very slightly coming out on top. Aside from all that i think IsNull is easier to read.
While reading about it i found this perf tip that i think is very relevant indeed -"IsNull() is unquestionably faster and easier to spell than Coalesce()" LOL

Darren.

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Searching for Strings in SQL Server Databases
 Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.