Simple-Talk columnist

A PowerShell RSS Reader using an OPML file

Published 18 March 2013 7:12 pm

To celebrate the announcement of the planned demise of Google Reader, I’ve written a PowerShell script that gives you the items from the OPML collection of feeds that you import or export between your feed readers. Basically, you create your own primitive feed reader. I’m afraid it isn’t as good as Google Reader.

So what is involved? RSS/Atom is a rather loose definition, in that the only elements a feed item actually needs are the link and the content. The spec has been liberally interpreted too, so you can’t really guarantee being able to read every RSS file.

Reading a well-constructed RSS feed is trivial. In PowerShell v3, it is a one-liner. The problem is in getting resilience. Getting every feed to work is a struggle, and so I apologise for giving up at a certain point.
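As a rough illustration of that one-liner, here is a minimal sketch. The live call is left commented out because it needs network access; the inline sample document and its example.com links are made up purely so that the same projection can be run offline.

```powershell
# Against a live, well-formed feed the whole reader is roughly this one line:
#   Invoke-RestMethod 'https://www.simple-talk.com/feed/' | Select-Object title, link, pubDate

# The same projection on an inline sample document (invented for illustration)
$sampleRss = [xml]@'
<rss version="2.0"><channel><title>Sample</title>
  <item><title>First post</title><link>http://example.com/1</link><pubDate>Mon, 18 Mar 2013 19:12:00 GMT</pubDate></item>
  <item><title>Second post</title><link>http://example.com/2</link><pubDate>Tue, 19 Mar 2013 08:00:00 GMT</pubDate></item>
</channel></rss>
'@

# Each <item> surfaces its child elements as properties, so Select-Object can pick them by name
$items = $sampleRss.rss.channel.item | Select-Object title, link, pubDate
$items
```

Everything else in the full script is there to cope with feeds that are not this tidy.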

Because I can throw lists of links at this routine instead of an OPML, or use it in a function with several OPML files, I use this type of PowerShell routine for specific tasks such as checking to see if particular groups of sites have had postings. It is very easy to set up an alert if a particular site gets a posting.
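An alert of that kind might look something like the sketch below. The function name Get-NewPosting, its parameters and the sample items are all hypothetical; it assumes the items carry an absolute link and a PubDate that is already a date, as they do at the end of the pipeline in the script.

```powershell
# Hypothetical helper: pass through only the items from one site that are newer than a cutoff
function Get-NewPosting {
    param(
        [Parameter(ValueFromPipeline = $true)] $Item,  # objects with link and PubDate properties
        [string]   $SiteHost,                          # host name to watch
        [datetime] $Since                              # only report items published after this
    )
    process {
        # Assumes every link is an absolute URL, so that .Host is available
        if (([uri]$Item.link).Host -eq $SiteHost -and $Item.PubDate -gt $Since) { $Item }
    }
}

# Made-up items standing in for the output of the feed-reading pipeline
$sample = @(
    [pscustomobject]@{ Title = 'Fresh';     link = 'https://www.simple-talk.com/a'; PubDate = (Get-Date).AddHours(-2) }
    [pscustomobject]@{ Title = 'Stale';     link = 'https://www.simple-talk.com/b'; PubDate = (Get-Date).AddDays(-9) }
    [pscustomobject]@{ Title = 'Elsewhere'; link = 'http://example.com/c';          PubDate = (Get-Date).AddHours(-1) }
)

$alerts = $sample | Get-NewPosting -SiteHost 'www.simple-talk.com' -Since (Get-Date).AddDays(-1)
if ($alerts) { 'New posting: ' + (@($alerts | ForEach-Object { $_.Title }) -join ', ') }
```

What you do once $alerts is non-empty (an email, a log entry, a popup) is up to you.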

I’ve added things to the script to take out all the HTML tags from the description and show just the first five hundred characters. I’ve limited it to the first hundred feeds just to test it, and I’ve limited it to report just the current day’s articles. You’ll want to change all that, I expect.

You’ll need to fill in the path to your OPML file (basically an XML list of links) and the number of days back you want to read items from, and either change or delete the 'Select -first 100 | ' bit, which just takes the first hundred feeds. You’ll want to change the (truncate ($_.xxx -replace "<.*?>") 500) expressions, which take out all the HTML tags and truncate to 500 characters or fewer, to suit your tastes.
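To see what the tag-stripping and truncation do in isolation, here is a small sketch on a made-up HTML fragment. The truncate function is the one from the script; the HtmlDecode call at the end is an optional extra that the script does not use.

```powershell
function truncate([string]$value, [int]$MaxLength)
{
    if ($value.Length -gt $MaxLength) { $value.Substring(0, $MaxLength) }
    else { $value }
}

$html = '<p>Google Reader is <b>closing</b>.</p><p>Export your feeds.</p>'

# The lazy pattern <.*?> matches each tag separately; -replace with no
# replacement string simply deletes every match
$plain = $html -replace '<.*?>'
$summary = truncate $plain 20
$summary

# Optional: entities such as &amp; survive tag-stripping, so decode them if you care
$decoded = [System.Net.WebUtility]::HtmlDecode('Fish &amp; chips')
```

Note that removing the tags also removes the paragraph breaks, so sentences from adjacent paragraphs run together. Replacing with a space instead, as in -replace '<.*?>', ' ', is one way round that.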

At the end of the pipeline you can, of course, save the results to a database or file, or maybe send it as an email, or format it into an HTML file, but there is no sense in adding all that stuff because you know it already!

Note: one source of weakness is the date-parsing routine, {try {get-date ($_.PubDate -replace "UT")} catch {Get-Date '01 January 2006 00:00:00'}}, which gives any item whose date still cannot be parsed a date in 2006, so some feeds are relegated into the past. As you can see, PowerShell happily eats dates with GMT at the end but not UT for some reason.
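If you want something a little more defensive, one possibility is sketched below. This is not what the script does: the function name ConvertTo-FeedDate is hypothetical, and it swaps Get-Date for the .NET DateTimeOffset parser, treating a trailing UT as GMT instead of deleting it so that the time zone is kept. It returns $null when parsing fails, so the caller can decide what fallback date to use.

```powershell
# Hypothetical helper: parse an RFC 822 style feed date into a UTC DateTime, or return $null
function ConvertTo-FeedDate([string]$Text)
{
    # Rewrite a trailing UT as GMT, which the .NET parser understands
    $normalised = $Text.Trim() -replace '\bUT$', 'GMT'
    $parsed = [DateTimeOffset]::MinValue
    $ok = [DateTimeOffset]::TryParse($normalised,
              [Globalization.CultureInfo]::InvariantCulture,
              [Globalization.DateTimeStyles]::AssumeUniversal,
              [ref]$parsed)
    if ($ok) { $parsed.UtcDateTime } else { $null }
}

$fromUT  = ConvertTo-FeedDate 'Mon, 18 Mar 2013 19:12:00 UT'
$fromGMT = ConvertTo-FeedDate 'Mon, 18 Mar 2013 19:12:00 GMT'
$bad     = ConvertTo-FeedDate 'sometime last week'
```

There will still be feeds whose dates defeat it, but at least an unparsable date is reported as such instead of silently becoming an old one.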

$MyOPMLFile = '.\AllMyFeeds.opml' #change this to the name of your OPML file
$RestError = [xml]'<broken></broken>'
$DaysBack = [int]-1 #the number of days back you want articles from

function truncate([string]$value, [int]$MaxLength)
{#can you believe there is no powershell built-in way of doing this?
    if ($value.Length -gt $MaxLength) { $value.Substring(0, $MaxLength) }
    else { $value }
}

[xml]$opml= Get-Content $MyOPMLFile # grab the OPML file of feeds
$opml.opml.body.outline.outline.xmlurl| Select -first 100 | # only the first few for testing
 foreach {try{Invoke-RestMethod $_} catch{ $RestError }} | # flag if an error happened
   where { $null -ne $(try {$_.SelectSingleNode('link')} catch{$null}) } | #filter out 404s, malformed items and bad links

<# Each <item> within a feed represents an article. The <item> must include at least the following elements:
    <link>: The canonical URL for the article.
    <content:encoded>: The full HTML content of the article.
But you are also likely to find ..
    <title>: The article's headline. If it isn't there, you'd need to find it in the content
    <pubDate>: The date of the article's publication, in RFC822 format.
    <description>: A short summary or abstract of the article.
    <dc:creator>: Name of the person who wrote the article.
    <media:content> and <media:group>: URLs and metadata for image, video, and audio assets.
#>
     Select @{name="Title"; Expression = {try {$_.title} catch {'Unknown title'}}},
       @{name="Description"; # this isn't mandatory, but you can get the content
         Expression ={try { if ($_.SelectSingleNode('description') -eq $null)
                                 { truncate ($_.encoded.'#cdata-section' -replace "<.*?>") 500}
                            elseif ( $_.description.ToString() -eq 'description')
                                 {truncate ($_.description.'#cdata-section'  -replace "<.*?>") 500 }
                            else {truncate ($_.description  -replace "<.*?>") 500 }}
                      catch {'error'}}},
       @{name="PubDate"; Expression = {try {get-date ($_.PubDate -replace "UT")} # force it into a PS date
                                       catch {Get-Date '01 January 2006 00:00:00'}}},
       @{name="author"; Expression = {try {if ( $_.author.length -eq 0) {$_.creator}
                                           else {$_.author}}
                                      catch{'Unknown Author'}}},
        link | #we already checked for a link!
                      where-object {$_.Pubdate -gt  (Get-Date).AddDays($DaysBack)}
                           # we only get the fresh news: how far back is set by $DaysBack

If you don’t already have an OPML file to practice on, here is one I’ve put together that gives you exciting articles and blogs from Simple-Talk. Just save it to a file, extend it with your favourite blogs and sites, and you’ll soon be wondering why you ever felt that Google Reader was essential! Of course, you can still use the routine above with a simple list of RSS feeds, but then you wouldn’t have something that could be stitched into your news feed reader’s OPML file.

<?xml version="1.0" encoding="ISO-8859-1"?>
<opml version="1.1">
    <head>
        <title>SimpleTalk Subscriptions</title>
        <dateModified>Wed, 20 Mar 2013 07:21:56 GMT</dateModified>
    </head>
    <body>
        <outline text="simple-talk">
            <outline text="Home Page" title="Simple Talk Home Page" type="rss" xmlUrl="https://www.simple-talk.com/feed/" htmlUrl="https://www.simple-talk.com/"/>
            <outline text="SQL Articles" title="SQL Home" type="rss" xmlUrl="https://www.simple-talk.com/sql/rss.aspx" htmlUrl="https://www.simple-talk.com/sql/"/>
            <outline text=".NET Articles" title=".NET Articles" type="rss" xmlUrl="https://www.simple-talk.com/dotnet/rss.aspx" htmlUrl="https://www.simple-talk.com/dotnet/"/>
            <outline text="SysAdmin Articles" title="SysAdmin Articles" type="rss" xmlUrl="https://www.simple-talk.com/sysadmin/rss.aspx" htmlUrl="https://www.simple-talk.com/sysadmin/"/>
            <outline text="Opinion and Geeks" title="Opinion and Geeks" type="rss" xmlUrl="https://www.simple-talk.com/opinion/rss.aspx" htmlUrl="https://www.simple-talk.com/opinion/"/>
            <outline text="Books and Book Reviews" title="Books and Book Reviews" type="rss" xmlUrl="https://www.simple-talk.com/books/rss.aspx" htmlUrl="https://www.simple-talk.com/books/"/>
            <outline text="Cloud" title="Cloud" type="rss" xmlUrl="https://www.simple-talk.com/cloud/rss.aspx" htmlUrl="https://www.simple-talk.com/cloud/"/>
            <outline text="Blogs" title="Blogs" type="rss" xmlUrl="https://www.simple-talk.com/blogs/feed/" htmlUrl="https://www.simple-talk.com/blogs/"/>
        </outline>
        <outline text="SQL Server Central">
            <outline title="Main Articles" text="www.sqlservercentral.com/Xml/Rss/articles" type="rss" xmlUrl="http://www.sqlservercentral.com/Xml/Rss/articles"/>
            <outline title="SQL Server Central Blogs" text="www.sqlservercentral.com/blogs/feed/" type="rss" xmlUrl="http://www.sqlservercentral.com/blogs/feed/"/>
            <outline title="Ask Sqlservercentral Questions" text="ask.sqlservercentral.com/feed/questions.rss" type="rss" xmlUrl="http://ask.sqlservercentral.com/feed/questions.rss"/>
        </outline>
    </body>
</opml>
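The script expects exactly this shape: feeds grouped one level down inside the body. If your OPML file nests its outlines differently, an XPath query is one way to collect the feed URLs regardless of depth. The sketch below is only a suggestion; Get-OpmlFeedUrl is a hypothetical name, and the sample file with its example.com URLs is invented for the demonstration.

```powershell
# Hypothetical helper: return every feed URL in an OPML file, whatever the nesting depth
function Get-OpmlFeedUrl([string]$Path)
{
    [xml]$doc = Get-Content -Path $Path
    # XPath is case-sensitive, so this assumes the usual 'xmlUrl' spelling of the attribute
    $doc.SelectNodes('//outline[@xmlUrl]') |
        ForEach-Object { $_.GetAttribute('xmlUrl') }
}

# A made-up OPML file with one top-level feed and one feed two levels down
$sampleFile = Join-Path ([IO.Path]::GetTempPath()) 'sample-feeds.opml'
@'
<opml version="1.1">
  <body>
    <outline text="Flat" type="rss" xmlUrl="http://example.com/flat.xml"/>
    <outline text="Group">
      <outline text="Subgroup">
        <outline text="Deep" type="rss" xmlUrl="http://example.com/deep.xml"/>
      </outline>
    </outline>
  </body>
</opml>
'@ | Set-Content -Path $sampleFile

$urls = Get-OpmlFeedUrl $sampleFile
$urls
```

The resulting list of URLs can be piped into the rest of the script in place of $opml.opml.body.outline.outline.xmlurl.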

3 Responses to “A PowerShell RSS Reader using an OPML file”

  1. hemanth.damecharla says:

    Thanks for posting the sample opml content. I have been trying to get the code working on some of the opml documents at wikipedia, without luck.
    Maybe, something wrong from my end.

  2. Phil Factor says:

    The only thing wrong is that you’d need to change the path if you change the structure of the OPML file:
    $opml.opml.body.outline.outline.xmlurl
    That is what gets each item. I use the format generally accepted by the news readers.

    I’ve added some useful references and fixed a date-parsing bug that prevented some feeds being seen. (I’ve added a warning too, in the text)

  3. mbourgon says:

    Phil, very cool. Nice to see another solution to the whole Google Reader issue.
