Phil Factor

How to Import Data from HTML pages

30 March 2011

It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data is in tables, lists or DIVs. Phil finds, to his surprise, that it is easier to use PowerShell and the HTML Agility Pack than some of the more traditional approaches. Web-scraping suddenly becomes more resilient.

Quite a lot of developers would like to read the data reliably from websites, usually in order to subsequently load the data into a database. There are several ways of doing so, and I’ve used most of them. If it is a one-off process, such as getting the names of countries, colours, or words for snow, then it isn’t much of a problem. If you need to do it more regularly when data gets updated, then it can become more tedious. Any system that you use is likely to require constant maintenance because of the shifting nature of most websites. There are a number of snags which aren’t always apparent when you’re starting out with this sort of ‘web-scraping’ technology. From a distance, it seems easy.

An HTML table is the most obvious place to find data. An HTML table isn’t in any way equivalent to a database table. For a start, there seems to be a wide range of opinions about how an HTML data table should be structured. The data, too, must always be kept at arm’s length within the database until it is thoroughly checked. Some people dream of being able to blast data straight from the web page into a data table. So easily, the dream can become a nightmare. Imagine that you have an automated routine that is set up to get last week’s price movements for a commodity from a website: Sugar Beet, let us say. Fine. Because the table you want doesn’t have any id or class to uniquely identify it within the website, you choose instead to exploit the fact that it is the second table on the page. It all works well, and you are unaware that the first table, used for formatting the headings and logo prettily, is replaced by some remote designer in another country with a CSS solution using DIVs. The second table then becomes the following table, containing the prices for an entirely different commodity: Oilseed Rape. Because the prices are similar and you do not check often, you don’t notice, and the business takes decisions on buying and selling sugar beet based on the fluctuations in the price of Oilseed Rape.

Other things can go wrong. Designers can change tables by combining cells, either vertically or horizontally. The order of columns can change (some designers apparently don’t think that column headings are 'cool'). Other websites use different table structures or don’t use TH tags for headings. Some designers put extra columns into a table of data purely as ‘spacers’, for sub-headings, pictures, or fancy borders.

There are plenty of ways of fighting back against sloppy web-pages to do web-scraping. You can get to your data more reliably if it is identified by its ID or the class assigned to it, or if it is contained within a consistent structure. To do this, you need a rather efficient way of getting to the various parts of the webpage.
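
To make the point concrete, here is a minimal PowerShell sketch of the difference, using the HTML Agility Pack that is introduced below. The URL, the id and the class are hypothetical; the selectors are the point, not the page:

add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml((New-Object System.Net.WebClient).DownloadString("http://www.example.com/prices.html"))
# Fragile: relies on the table you want being the second one on the page
$byPosition = $doc.DocumentNode.SelectSingleNode("(//table)[2]")
# More resilient: relies on an id or class that identifies the data itself
$byId    = $doc.DocumentNode.SelectSingleNode("//table[@id='SugarBeetPrices']")
$byClass = $doc.DocumentNode.SelectSingleNode("//table[contains(@class,'commodity-prices')]")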

If you are doing an aggregation and ‘warehousing’ of data from a number of sources, you need to do sanity checks. Is the data valid and reasonable? Has it fluctuated beyond the standard deviation? Is it within range? You need to do occasional cross-checks of a sample of the data against other sources of the same data. In one company I worked for, the quantitative data for product comparisons with competitors was stored in metric units for some products and imperial units for others, without anyone noticing for months. It only came to light when we bought one of the competitors and their engineers questioned the data. Any anomalies, however slight they seem, have to be logged, notified and then checked by an administrator. There must be constant vigilance. Only when you are confident can you allow data updates. I see DBAs as being more like data-zoo-keepers than data-police. I like to put the data in a transit area within the database so that I can do detailed checking before using it to update the database.
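
To give a flavour of the sort of check I mean, here is a minimal T-SQL sketch. The staging and history tables, the column names and the three-standard-deviation threshold are all purely illustrative:

-- Flag any newly-arrived price that is more than three standard deviations
-- away from the last ninety days of accepted prices for that commodity
SELECT s.Commodity, s.PriceDate, s.Price
FROM   CommodityPriceStaging s
CROSS APPLY
      (SELECT AVG(h.Price) AS MeanPrice, STDEV(h.Price) AS SdPrice
       FROM   CommodityPriceHistory h
       WHERE  h.Commodity = s.Commodity
         AND  h.PriceDate >= DATEADD(DAY, -90, s.PriceDate)) stats
WHERE  stats.SdPrice IS NOT NULL
  AND  ABS(s.Price - stats.MeanPrice) > 3 * stats.SdPrice;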

The alternatives

Of the various methods of extracting data from HTML tables that I’ve come across…

Regex dissection.
This can be made to work, but it is a bad idea: regexes were not designed for slicing up hierarchical data, and an HTML page is deeply hierarchical. You can get them to work, but the results are overly fragile (see the brief sketch at the end of this list). I’ve used programs in a language such as Perl or PHP to do this in the past, though any .NET language will do just as well.
Recursive dissection.
This will only work for well-formed HTML. Unfortunately, browsers are tolerant of badly-formed syntax. If you can be guaranteed a diet of only valid XHTML, then it will work, but why bother, when tools already exist to do the job properly?
Iterative dissection.
This is a handy approach in a language like SQL, when you have plenty of time to do the dissection to get at the table elements. You have more control than in the recursive solution, but the same fundamental problems have to be overcome: the HTML that is out there just doesn’t always have close-tags such as ‘</p>’, and there are many interpretations of the table structure out there. Robyn and I showed how this works in TSQL here.
ODBC.
Microsoft released a Jet driver that was able to read a table from an HTML page if you gave it the URL; you used it if you were feeling lucky. It is great when it works, but it often disappoints, and it is no longer actively maintained by Microsoft. I wrote about it here.
Driving the browser
Sometimes you'll need to drive the IE browser as a COM object, especially if the site's content is dynamic or refreshed via AJAX. You then have to read the DOM via the document property to get at the actual data. Let's not go there in this article! In the old days, we would occasionally use the Lynx text browser for this, and then parse the data out of the resulting text file.
OCR.
You may come across a Flash site, or one where the data is rendered as text-in-images. Just take a screen-dump and OCR the results (for one-offs I use ABBYY Screenshot Reader!).
XPath queries
You can use either the .NET classes for XML or the built-in XQuery in SQL Server to do this. It will only work for valid XHTML.
XSLT
This is always a good conversation-stopper: ‘it would be easier to use XSLT’. I’ve never tried it for this purpose, but there will always be someone who says it is soooo easy. The problem is that you are not dealing with XML or, for that matter, XHTML.
HTML Agility Pack
This is so easy that it makes XPath seem quite fun. It works like standard XPath, but on ordinary HTML, warts and all. With it you can slice and dice HTML to your heart’s content.
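
Just to show why the regex approach mentioned above is so brittle, here is a throwaway PowerShell sketch of the sort of thing I mean, run against one of the test pages used later in this article. Nested tables or tags inside a cell will defeat the pattern, and you get markup back rather than clean text:

$html = (New-Object System.Net.WebClient).DownloadString("http://www.simple-talk.com/blogbits/philf/quotations.html")
# Naive cell-matching: fine for a quick demo, hopeless as a production scraper
[regex]::Matches($html, '(?is)<td[^>]*>(.*?)</td>') | ForEach-Object { $_.Groups[1].Value.Trim() }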

Using the HTML Agility Pack with PowerShell.

The HTML Agility Pack (HAP) was originally written by Simon Mourier, who spent fourteen years at Microsoft before becoming the CTO and co-founder of SoftFluent. The HAP is an assembly that works as a parser, building a read/write DOM and supporting plain XPath or XSLT. It exploits .NET's implementation of XPath to allow you to parse HTML files straight from the web.

The parser has no intrinsic understanding of the significance of the HTML tags; it treats HTML as if it were slightly zany XML. The HAP works in a similar way to System.Xml, but is very tolerant of the malformed HTML documents, fragments and streams that are so typically found around the Web. It can be used for a variety of purposes, such as fixing or generating HTML pages, fixing links, or adding to pages. It is ideal for automating the drudgery of creating web pages, such as creating tables of contents, footnotes, or references for HTML documents. It is perfect for web-scrapers too. It can even turn a web page into an RSS feed with just an XSLT file as the binding. It can be downloaded from here. It has recently been converted to support LINQ to Objects via a 'LINQ to XML'-like interface.

It isn’t possible to use the HTML Agility Pack easily from within SQL Server, though you could probably write a CLR library pretty quickly. The downside is that you’d need to access the internet from your production server, which would be silly. However, driving the HAP from PowerShell, or any other language that you might prefer, is pretty easy.

First step

Here is some PowerShell that lists out some classical insults as text, from a test website. Quite simply, it downloads the file, gulps it into the HAP as a parsed document and then pulls the text out of the second table. ('(//table)[2]' means the second table, wherever it is in the document, whereas '//table[2]' means any table that is the second table within its parent element, which isn't quite the same thing.)

add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$result = $HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/quotations.html"))
$HTMLDocument.DocumentNode.SelectSingleNode("(//table)[2]").InnerText

This will give a list of insults. You would need to install the Agility Pack and give the path to it or put it in your Global Assembly Cache (GAC). In this, and subsequent PowerShell scripts, I’m giving quick ‘n dirty code that gets results, along with some test HTML pages, just so you can experiment and get a feel for it.

You’ll notice that this first script extracts the meat out of the sandwich by means of XPath queries. Normally, ordinary humans have no business looking at these, but in certain circumstances they are very useful.

Of course, if you’re a database programmer, you’ll probably be familiar with XPath. Here is an example of shredding XHTML document fragments via XPath using SQL. I give just a small part of the document. Make sure that, in your mind's eye, the precious text is surrounded by pernicious adverts and other undesirable content.

DECLARE @xml xml
SELECT @XML='<body>
    <ol id="insults">
      <li>You have two ears and one mouth so you might listen the more and talk the less.</li>
      <li>You like your friends to be just clever enough to comprehend your cleverness and just stupid enough to admire it.</li>
      <li>You own and operate a ferocious temper. </li>
      <li>There is nothing wrong with you, that a miracle couldn''t cure.</li>
      <li>Why be disagreeable, when with a little effort you could be impossible? </li>
      <li>You look as if you had been poured into your clothes and had forgotten to say when.</li>
    </ol>
  </body>
'

SELECT @xml.query('//li[2]/text()')--text of second element
SELECT Loc.query('.') FROM @xml.nodes('/body/ol/li/text()') as T2(Loc)

However, that works with XML documents, or XHTML fragments, but not HTML.

Getting the contents of a list

Let’s try PowerShell’s .NET approach with a list (an OL, in this case), which is the other main data-container you’re likely to find in an HTML page.

add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/listExample.html"))
$list=$HTMLDocument.DocumentNode.SelectNodes("(//ol)[1]/li") #only one list
foreach ($listElement in $list) {$listElement.InnerText.Trim() -replace ('\n', "`r`n ") -replace('(?<=\s)\s' ,'') }

This is fun. Not only that, but it is resilient too. You’ll notice that, in this example, I’ve tidied up the white-space within the text of each tag, which would otherwise all be treated as significant, just by using a couple of regex substitutions. If you execute this from xp_cmdshell in SQL Server, you’re laughing, since you can read it straight into a single-column table and from there into the database.
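
As a rough sketch of that last idea (xp_cmdshell must be enabled, and the script path is hypothetical; it assumes you have saved the list-reading script above as a .ps1 file):

CREATE TABLE #Insults (Insult VARCHAR(500) NULL);
INSERT INTO #Insults (Insult)
   EXEC xp_cmdshell 'powershell.exe -NoProfile -ExecutionPolicy Bypass -File "C:\Scripts\GetInsults.ps1"';
SELECT Insult FROM #Insults WHERE Insult IS NOT NULL; -- filter out the NULL row(s) that xp_cmdshell emits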

Getting data that isn't in a table into an array

Let’s try something more serious: the front page of Simple-Talk.

add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/default.aspx"))
$Table=@()
$articleDetails=$HTMLDocument.DocumentNode.SelectNodes("//div[@class='middle']/div[@class='articlesummary']")
foreach ($articleDiv in $articleDetails)
    {
    $Tuple = "" | Select Title, Author, Synopsis
    $Tuple.Title = $articleDiv.SelectSingleNode("a").innertext
    $Tuple.Author = $articleDiv.SelectSingleNode("div[1]").innertext
    $Tuple.Synopsis = $articleDiv.SelectSingleNode("div[2]").innertext
    $Table += $Tuple
    }
$Table

Here, you’ll immediately notice that we are specifying nodes by their class. We could use their IDs too. We’ve specified a couple of DIVs by their order within each of the articleDetails nodes. Where am I getting all this information on how to address individual parts of an HTML document via XPath? Well, we have just published a new wallchart that has the information in it, and maps the functionality between XPath, CSS and JavaScript DOM. It should help a lot with the trickier bits of web-scraping.

We have also created an array of custom objects, one per article, in which to put the data. This makes the data easier to manipulate once we've finished.
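
For instance, once the array is populated, you can filter, sort or reshape it with the ordinary pipeline cmdlets (the author filter here is just an example):

$Table | Where-Object { $_.Author -match 'Phil' } | Sort-Object Title | Format-Table -AutoSize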

Getting your data into SQL Server

Now that we have this data, what can we do with it? The easiest approach is to write it to a file and then read it into SQL Server via BCP. There is no need to explain how to read a file into SQL Server, as this is already well-covered on Simple-Talk, here, here, and here. For the PowerShell file, we simply replace the last line with:

$Table | export-CSV 'SimpleTalkArticles.csv'

I’ve left out any optional switches for simplicity. Alternatively, you can use the Export-Clixml cmdlet to export it as an XML file.
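
In other words, something like this (the filename is arbitrary):

$Table | Export-Clixml 'SimpleTalkArticles.xml'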

There are a number of switches you can use to fine-tune the export of this XML data to file. You can even do this ...

$Table |ConvertTo-HTML

...in order to convert that data into a table in an HTML file (or a table fragment if you prefer)! If you then write it to file, it can be read into Excel easily.
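
On the SQL Server side, the import of the CSV can be as simple as a BULK INSERT into a staging table. This is only a sketch: the path, the table and the column sizes are hypothetical, and you will probably want to deal with the quotes that Export-CSV puts around each field:

CREATE TABLE SimpleTalkArticlesStaging
   (Title VARCHAR(200), Author VARCHAR(100), Synopsis VARCHAR(2000));
BULK INSERT SimpleTalkArticlesStaging
   FROM 'C:\workbench\SimpleTalkArticles.csv'
   WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 3);
-- FIRSTROW = 3 skips the #TYPE line and the column-header row that Export-CSV writes by default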

At this stage, I hope you feel empowered to create some quick-‘n-dirty  PowerShell routines to grab HTML data from the internet and produce files that can then be imported into SQL Server or Excel.

The next stage would be to  import into a SQL Server table from a PowerShell script. You can call a PowerShell script directly from SQL Server and import the data it returns into a table.  Chad Miller has already covered an ingenious way of doing this without requiring a DataTable. If you are likely to want to import a lot of data, then a DataTable might be a good alternative.

Using a DataTable

In this next ‘quick and dirty’ script we’ll create a DataTable that can then be exported to SQL Server. The most effective way of doing this is to use Chad Miller’s   ‘Write-DataTable’ function. You can, if the urge takes you, write to an XML file as if it were a multidimensional array, and thence gulp it into SQL Server.

$DataTable.WriteXml(".\TableData.xml")
$DataTable.WriteXmlSchema(".\TableData.xsd")

To take out some tedious logic, we’ll do a generic table-scraper that doesn’t attempt to give the proper column names but gives, for each   cell, the cell value, the row of the cell and the column of the cell. The headers are added as row 0 if they can be found. With this information, it is pretty easy to produce a SQL Server result, as we'll show later on.

add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$result = $HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/TestForHTMLAgilityPack.html"))
#create the table and the columns
$DataTable = New-Object system.Data.DataTable 'TableData'
$datacol1 = New-Object system.Data.DataColumn Row,([int])
$datacol2 = New-Object system.Data.DataColumn Col,([int])
$datacol3 = New-Object system.Data.DataColumn Value,([string])
#add the columns to the DataTable
$DataTable.columns.add($datacol1)
$DataTable.columns.add($datacol2)
$DataTable.columns.add($datacol3)
#iterate through the cells in the third table
foreach ($Cell in $HTMLDocument.DocumentNode.SelectNodes("(//table)[3]//tr/td")) {
      $ThisRow=$DataTable.NewRow()
      $ThisRow.Item('Value')=$Cell.InnerText.Trim() -replace ('\n', "`r`n ") -replace('(?<=\s)\s' ,'')
      $ThisRow.Item('Col')=$Cell.XPath -creplace '(?sim)[^\s]*td\[([\d]{1,6})\]\s*\z', '$1' #column
      $ThisRow.Item('Row')=$Cell.XPath -creplace '(?sim)[^\s]*tr\[([\d]{1,6})\]/td\[[\d]{1,6}\]\s*\z', '$1' #row
      $DataTable.Rows.Add($ThisRow)
      }
#and iterate through the headers (these may be in a thead block)
foreach ($heading in $HTMLDocument.DocumentNode.SelectNodes("(//table)[3]//th")) {
      $ThisRow=$DataTable.NewRow()
      $ThisRow.Item('Value')=$heading.InnerText.Trim()
      $ThisRow.Item('Col')=$heading.XPath -creplace '(?sim)[^\s]*th\[([\d]{1,6})\]\s*\z', '$1' #column
      $ThisRow.Item('Row')=0
      $DataTable.Rows.Add($ThisRow)
      }
$DataTable

This may look a bit wacky. The reason is that we’re taking advantage of a short-cut: the HTML Agility Pack returns the XPath of every HTML element that it returns, so you have all the information you need about the column number and row number without iterative counters. It assumes that the table is regular, without colspans or rowspans. Here is an example from one of the cells that we just parsed.

/html[1]/body[1]/table[1]/tr[47]/td[1]

See that? It is the first table (table[1]), row 47, column 1. You want the rightmost figures, in case you have nesting; you just have to parse them out of the XPath string. I used a regex to get the ordinal values for the final elements in the path. The sample file has three tables, so you can try the script out on the other two by changing the table index in the XPath.

If we export the DataTable to XML using the …

$DataTable.WriteXml(".\TableData.xml")

...at the end of the last PowerShell script, then we can soon get the result into a SQL table that is easy to manipulate.

DECLARE @Data XML
SELECT @Data = BulkColumn
FROM   OPENROWSET(BULK 'C:\workbench\TableData.XML', SINGLE_BLOB) AS x

In this example, I’ve left out most of the rows so that you can try this, out of sequence, without needing the table.

DECLARE @data XML
SET @Data= '<?xml version="1.0" standalone="yes" ?>
<DocumentElement>
  <TableData>
    <Row>1</Row>
    <Col>1</Col>
    <Value>Advertising agencies are eighty-five per cent confusion and fifteen per cent commission.</Value>
  </TableData>
  <TableData>
    <Row>1</Row>
    <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>2</Row>
    <Col>1</Col>
    <Value>An associate producer is the only guy in Hollywood who will associate with a producer.</Value>
  </TableData>
  <TableData>
    <Row>2</Row>
    <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>3</Row>
    <Col>1</Col>
    <Value>California is a fine place to live, if you happen to be an orange.</Value>
  </TableData>
  <TableData>
    <Row>3</Row>
    <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>0</Row>
    <Col>1</Col>
    <Value>Quotation</Value>
  </TableData>
  <TableData>
    <Row>0</Row>
    <Col>2</Col>
    <Value>Author</Value>
  </TableData>
</DocumentElement>'

SELECT MAX(CASE WHEN Col=1 THEN Value ELSE '' END) AS Quote,
       MAX(CASE WHEN Col=2 THEN Value ELSE '' END) AS Author
FROM  (SELECT x.y.value('Col[1]', 'int') AS [Col],
              x.y.value('Row[1]', 'int') AS [Row],
              x.y.value('Value[1]', 'VARCHAR(200)') AS [Value]
       FROM @data.nodes('//DocumentElement/TableData') AS x ( y )
      ) rawTableData
GROUP BY [Row]
HAVING [Row] > 0
ORDER BY [Row]

Yes, you’re right, we’ve used XPath once again to produce a SQL Result.

By having the data in the same three columns whatever the source, you can simplify the transition to proper relational data. You can define a Table-Valued Parameter to do a lot of the work for you, or even pass the data from PowerShell to SQL Server using a TVP. It makes the whole process very simple.
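
A rough sketch of that idea, using the same three generic columns. The type name, procedure name and parameter name are invented for illustration; from PowerShell, the DataTable built earlier could be passed as the parameter by setting the SqlParameter's SqlDbType to Structured:

CREATE TYPE dbo.ScrapedCellType AS TABLE
   ([Row] INT NOT NULL, Col INT NOT NULL, [Value] VARCHAR(MAX) NULL);
GO
CREATE PROCEDURE dbo.LoadScrapedTable @Cells dbo.ScrapedCellType READONLY
AS
   SELECT MAX(CASE WHEN Col = 1 THEN [Value] ELSE '' END) AS Quote,
          MAX(CASE WHEN Col = 2 THEN [Value] ELSE '' END) AS Author
   FROM   @Cells
   WHERE  [Row] > 0
   GROUP  BY [Row]
   ORDER  BY [Row];
GO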

Using a web-scraper within an application

So, how would you go about scraping a site? The data is there, but generally you have no idea of the structure of the site. I usually start by checking that the site can be read by this technique (dynamic content requires a very different technique) and that robots have not been specifically forbidden. If it requires authentication, a POST, or any other complication such as a particular user-agent, then it is better to use a more specialised tool to do the download. I like both cURL and Wget.

With most pages, it is easy to work out where the data that you want is held, just by inspecting the source. Firebug in Firefox will help you locate an element that is difficult to find. For more complicated sites, it is simple to scan a site for a particular word or phrase in the InnerText using the HAP, and get back the absolute XPath for the elements that have the phrase in their InnerText. You can also sometimes locate your data by the HREF of an anchor. This snippet displays the link for every anchor on the page and its corresponding XPath.

foreach( $link in $HTMLDocument.DocumentNode.SelectNodes("//a[@href]"))
      {
      $link.Attributes["href"].value +' is at address ' + $link.XPath
      }

...or you can search for text within an element, using contains() or starts-with(), and get the XPath of the matching elements:

foreach( $element in $HTMLDocument.DocumentNode.SelectNodes("//td[contains(text(),'Fred') ]"))
      {
      $element.XPath + ' --- '''+$element.innerText+ ''''
      }

With a fair wind and an XPath reference wallchart, a great deal is possible. If, for example, the data always has the same title, even if its location in the page varies, you can write a script that finds its location purely by looking for the heading. Some data is in ordinary paragraph tags, but you can still get at it via XPath if it follows a particular heading. XPath has a great deal of magic for awkward data-gathering.
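
For example, if the figures you want always sit in the first table after a heading with a known title, something like this will find them wherever they have drifted to on the page. This assumes $HTMLDocument has been loaded as in the earlier scripts, and the heading text and element are, of course, hypothetical:

$node = $HTMLDocument.DocumentNode.SelectSingleNode("//h2[normalize-space(.)='Weekly prices']/following-sibling::table[1]")
if ($node) { $node.InnerText }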

For data-gathering, I generally use dedicated PCs within the domain. These need very little power, so you can use old nags. I never let SQL Server itself anywhere near the internet. On these PCs, I have a scheduled task that runs a script that downloads the next task (ID and parameters) from the SQL Server and runs it if it is due, returning the data to SQL Server using Windows authentication. Each task corresponds to a single data collection on one site. All errors and warnings are logged, together with the task ID, the user, and the time of day, within the SQL Server database. When the data is received, it is scrubbed, checked, compared with the existing data and then, if all is well, the delta is entered into the database.
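
The supporting tables need not be elaborate; a minimal sketch (table and column names are purely illustrative) might look like this:

CREATE TABLE dbo.ScrapeTask
   (TaskID INT IDENTITY PRIMARY KEY,
    SiteURL VARCHAR(2000) NOT NULL,
    Parameters VARCHAR(2000) NULL,
    ScheduledEvery INT NOT NULL,      -- minutes between collections
    LastRun DATETIME NULL);

CREATE TABLE dbo.ScrapeLog
   (LogID INT IDENTITY PRIMARY KEY,
    TaskID INT NOT NULL REFERENCES dbo.ScrapeTask,
    LoggedAt DATETIME NOT NULL DEFAULT GETDATE(),
    LoggedBy SYSNAME NOT NULL DEFAULT SUSER_SNAME(),
    Severity VARCHAR(10) NOT NULL,    -- 'info', 'warning' or 'error'
    Message VARCHAR(MAX) NULL);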

Conclusions

Using the HTML Agility Pack is great for the run-of-the-mill reading of data, and you may never hit its limitations, unless of course you are scraping the contents of an AJAX site, or the data is in Flash. It isn't perfect, since the HTML file is treated without understanding the semantics of HTML. This is fine up to a level of detail, but HTML tables really are awful, because they allow you to mix presentation and meaning. Colspan and rowspan have no equivalent meaning in any sensible data table, and make extracting the data more tiresome than it need be.

Although I've tended to use regex queries in the past, I'm now convinced that the HTML Agility Pack is a more sensible approach for general use in extracting data from HTML in .NET.

Phil Factor

Author profile:

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

Subject: Great Tip on HTML Agility Pack!
Posted by: Chad Miller (not signed in)
Posted on: Saturday, April 02, 2011 at 11:14 AM
Message: I've pointed folks to the pack when questions of web scrapping have come up, but haven't tried it myself.

There is a minor update to the write-datatable function that fixes an issue with error handling http://poshcode.org/2276. PoshCode repository doesn't replace scripts so the original link is still accessible http://poshcode.org/2118


Subject: Javascript issue
Posted by: DataBigBang (not signed in)
Posted on: Tuesday, April 05, 2011 at 2:42 PM
Message: The issue is with sites that are using javascript/AJAX. You need to use a browser control or something like HtmlUnit for Java.
I'd recommend to use it with http://www.ikvm.net and there is an article for further details: http://blog.stevensanderson.com/2010/03/30/using-htmlunit-on-net-for-headless-browser-automation/

Also, I'd like to contribute with an article I wrote some months ago on this subject: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/


Subject: Re: Javascript issue
Posted by: Phil Factor (view profile)
Posted on: Wednesday, April 06, 2011 at 2:19 AM
Message: This blog post at http://blog.databigbang.com/ is well worth reading. Thanks for the reference!

Subject: Getting HTML from sites with dynamic content
Posted by: Phil Factor (view profile)
Posted on: Friday, April 08, 2011 at 6:20 AM
Message: I didn't really mention in the article that this presents no real difficulties. To get the whole of the HTML content after it has been loaded dynamically, you can do something like this (in this case, getting the Google search results for the string "Phil Factor"):

$exploder = New-Object -Com InternetExplorer.Application
$searchString='Phil+Factor'
$exploder.Navigate("http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=$searchString")
$exploder.Visible=$true
While ($exploder.Busy) { Start-Sleep -Milliseconds 100 }
$exploder.Document.DocumentElement.OuterHtml

Subject: Another T-SQL HTML parser: SQLDOM
Posted by: drueter (view profile)
Posted on: Sunday, February 19, 2012 at 5:09 PM
Message: Phil, you do great work--both with your SQL code and your writing. Thank you.

I recently had need for parsing HTML in T-SQL, and ended up creating my own parser and DOM manipulation routines. Feel free to take a look and use: http://sourceforge.net/projects/sqldom/

I wrote mine first (before I took a look at yours)--I ended up taking a different approach that seems to have worked out pretty well.

I appreciate the information on HTMLAgilityPack--I haven't checked that out. But like you, my first choice tends to be T-SQL whereever possible.

Subject: Re: Another T-SQL HTML parser: SQLDOM
Posted by: Phil Factor (view profile)
Posted on: Monday, February 20, 2012 at 2:30 AM
Message: David Rueter's TSQL solution to DOM parsing has some very nifty SQL in it and I reckon it will be very useful for many tasks. Definitely worth getting. Although the parsing of structured data in documents isn't that easy, it is definitely an option. Robyn and I once did a primitive XHTML parser which is here http://www.simple-talk.com/content/article.aspx?article=489

Subject: ocr lib
Posted by: seejoc (view profile)
Posted on: Sunday, July 15, 2012 at 6:13 AM
Message: Great article thanks for giving us this tips, but do you know any open source OCR library? (maybe not in some exotic language but in something like c++,python,php etc).
Thanks again for your tips.

Subject: SSIS to pull the LIVE data from the website and dump the data to db
Posted by: Sreehari Katageri (view profile)
Posted on: Friday, April 04, 2014 at 1:30 AM
Message: HI, I want to use SSIS to pull the LIVE data from the website and dump the data into my sql server database. Could any one suggest me the steps to achieve this. Can this be accomplished by Web services task. For example I want to capture LIVE gold price or cricket score and dump into my DB table.
