
John Magnabosco

SQL Server Development and Data Security

Scraped and Bruised

Published Sunday, April 26, 2009 5:09 PM

The content of blogs certainly does not fall into the category of sensitive data. Blogs are a collection of thoughts, ideas and bits of shared knowledge that are placed in a very public location. Despite not being considered sensitive, the information in a blog can still be stolen and its integrity compromised.

Recently I posted a blog entry that was immediately scraped by a splogger, injected with less-than-positive words and published on a blog at WordPress. While many of the adjectives were modified, the copy kept my name as well as the other names that I noted within that entry. This act was very upsetting. Not only did they take my original content and infringe on my, as well as Simple-Talk's, copyright; they also compromised my reputation by changing positive statements about the people and organizations that I noted in my blog into negative ones. Granted, most of the modifications were nonsensical; but those who are not familiar with me may encounter this version and draw an erroneous conclusion from the blog entry.

A few months ago I encountered another scenario where there were many blog entries from Simple-Talk copied verbatim on another site. While this occurrence did not modify the content, it did scramble the credit of the blog entries. I was credited for an entry on C# and XML which I did not author.

Both of these occurrences were identified through a Google Alert that I set up. This alert sends me an e-mail whenever my name is mentioned on the Internet. It also provides me with the URL where the mention occurred. It is indeed a very handy tool!

The effort to protect this data and respond to these occurrences of theft can be quite perplexing. If you find that your blog has been scraped, here are some suggestions:

In my specific case, I blog at Simple-Talk.com along with other bloggers. It is in Simple-Talk's interest that its content is protected. The first step that I took was to notify Simple-Talk.com that this was occurring so that they could take action on their side to protect their content as a whole.

Another step is to seek the contact information for the offending sites. A whois search can often reveal the contact information as well as host information. There is a possibility that the scraping stems from the site owner's ignorance of the nature of their actions rather than from malice. If that is the case, contacting them gives the site an opportunity to correct its ways.
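For those who would rather script the whois search than use a web form, here is a minimal Python sketch. It assumes a plain WHOIS query over TCP port 43 (RFC 3912). The function names, the default server and the keyword list are my own choices for illustration, and the labels in a WHOIS reply vary from registry to registry, so treat the filtering as a starting point only.

```python
import socket


def whois_query(domain, server="whois.iana.org", port=43, timeout=10.0):
    """Send one WHOIS query (RFC 3912) and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        # A WHOIS request is just the query text followed by CRLF.
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def contact_lines(whois_text):
    """Return (label, value) pairs whose label hints at a contact or host."""
    # Assumed keywords; adjust them to the labels the registry actually uses.
    keywords = ("abuse", "email", "registrar", "refer")
    found = []
    for line in whois_text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # comment or blank line, not a "label: value" pair
        label, value = label.strip(), value.strip()
        if value and any(k in label.lower() for k in keywords):
            found.append((label, value))
    return found
```

Calling contact_lines(whois_query("example.com")) prints nothing by itself; it returns the matching lines for you to review. The IANA server usually answers with a "refer" line naming the registry's own WHOIS server, which can then be queried the same way by passing it as the server argument.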

If the offending splog is hosted on a shared blogging site, such as WordPress, a complaint could be submitted to the hosting company noting that the site is violating their terms of use policies. This may get the site shut down... at least until they move it to somewhere else.

Many search engines offer a means to submit a Digital Millennium Copyright Act infringement notification. An example of what is involved with this approach would be the following link from Google: http://www.google.com/dmca.html. This requires some preparation of evidence as well as potentially engaging an attorney; but it provides a way of pursuing a violation through civil action.

I recently ran across an article that discussed one option in the fight against scraping called cloaking. In a nutshell, the concept is to insert some PHP into the content that will present alternate content if the page is requested from a specific IP address. While this approach may not work on all sites, it is an interesting option.
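To make the cloaking idea concrete, here is a small sketch of the decision it relies on. The article I read used PHP; this is my own Python illustration of the concept, not that article's code. The function name and the blocked addresses are hypothetical (the addresses come from ranges reserved for documentation), and a real site would take the requester's IP address from its web server.

```python
import ipaddress

# Hypothetical scraper addresses, taken from documentation-only ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def select_content(client_ip, original, alternate):
    """Return the alternate text for known scraper addresses, else the original."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # An unparseable address is treated as a normal reader.
        return original
    if any(addr in net for net in BLOCKED_NETWORKS):
        return alternate
    return original
```

The alternate text could be as simple as a notice that names the original author and links back to the source. The obvious weakness is that a scraper can change its IP address, which is one reason the approach may not work everywhere.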

I am interested in hearing from other bloggers who have had their blogs scraped, and in the actions they took to address the issue.

by Johnm


About Johnm

John Magnabosco manages the Data Services Group at one of the fastest growing companies in the United States. He is also a Co-Founder of the Indianapolis Professional Association for SQL Server (IndyPASS), Co-Founder of IndyTechFest, the author of the book titled "Protecting SQL Server Data" and contributing author of "SQL Server MVP Deep Dives Volume 2".