Click here to monitor SSC
  • Av rating:
  • Total votes: 32
  • Total comments: 5
Seth Delconte

Ad-Hoc XML File Querying

05 March 2013

When you need to shred just part of the data within a large XML file into a SQL Server table, the most efficient way is to just select what you need via XQuery or by using XPath, before shredding it into a table. But precisely how would you do that?

XML data is often stored in files, outside of a database. One obvious use for XML file storage is to hold configuration data that is read by web servers and other systems. If ad-hoc querying of XML files becomes necessary, it’s important to find accurate and efficient ways to extract just the data you require from the XML, and return the results as a table. In this post, we’ll look into finding the best ways to perform ad-hoc queries against XML files using SQL Server’s XQuery implementation and end up with the results in a table.

Let’s use Microsoft’s sample Books.xml file, which contains 12 book entity records that include element and attribute values for book id, author, title, genre, publication date, and description. A sample book node looks like this:

<book id="bk109">
	<author>Kress, Peter</author>
	<title>Paradox Lost</title>
	<genre>Science Fiction</genre>
	<price>6.95</price>
	<publish_date>2000-11-02</publish_date>
	<description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description>
</book>

For our purposes, I’ve created a local copy of Books.xml on C:. Let’s assume that we want to query this local file from SQL Server, returning the results of our query in a relational format - this can be referred to as ‘shredding’ XML node values.

Importing XML data using a CTE

Our first step will be to find a way to access the XML file from SQL Server. One way to do this is to use the versatile OPENROWSET function. It can be used to import many different formats of bulk data. We can use OPENROWSET in a CTE (Common Table Expression) to directly access the XML data from the BLOB (Binary Large OBject) that OPENROWSET converts it into, for each time we run a query against it:

--query the XML Blob using a CTE (pulling from the XML file each time)
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT *
FROM XmlFile
GO

The ‘SINGLE_BLOB’ argument of OPENROWSET designates the XML data into a Blob of data type VARBINARY(MAX) – this type also is limited to a size of 2GB. The CTE creates a table expression named ‘XMLFile’ and its singlular column to hold the data - giving us the ability to reference the XML data properly while running XQuery statements against it.

Importing XML data using a function

Another way we can access the XML data from the file on demand is by creating a function that returns an XML variable. A variable of the XML data type can hold up to 2 GB of XML data – for context, our entire Books.xml file is only 5KB. We can easily create a SQL Server scalar function to do this:

--create function to return XML data from file as XML variable
USE AdventureWorks2012
GO
CREATE FUNCTION fn_BooksXML()
RETURNS XML
AS
BEGIN

	DECLARE @books XML;
	SELECT @books = CONVERT (XML, BulkColumn)
	FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData;

	RETURN @books;
END
GO

We’ve created this function in the AdventureWorks2012 database. It can be called like this:

--call function to return all XML file data
SELECT AdventureWorks2012.dbo.fn_BooksXML() AS [Books XML Data];

Obviously, creating a function to pull XML data is more of a permanent way to access an XML file, as opposed to an ad-hoc script.

Filtering XML data using FLWOR

Now that we have found ways to access our XML file data from SQL Server, we can start looking for the best way to run queries against it. To start with, let’s say we are looking for a query that will return author names for books in the ‘Computer’ genre category. An obvious first possible solution might be to use the iterative power of the XQuery FLWOR statement. FLWOR stands for For, Let, Where, Order by, and Return. FLWOR is an integral part of XQuery. A possible FLWOR solution to our problem might look like this:

--use a FLWOR statement to find all authors in Computer genre
SELECT AdventureWorks2012.dbo.fn_BooksXML().query('for $Book in /catalog/book
				let $Genre := $Book/genre
				let $Author := $Book/author
			    where $Genre = "Computer"
			    return $Author
			   ') AS [Computer Book Authors]
GO

We’ve used a FLWOR statement inside of an XQuery query() method, accessing the XML data from the file using our function This does return the values that we want:

The returned values

<author>Gambardella, Matthew</author>
<author>O'Brien, Tim</author>
<author>O'Brien, Tim</author>
<author>Galos, Mike</author>

But it definitely does not present the results in the relational format we are looking for – the data is returned in its native format, an XML fragment, instead. We could try stripping the FLWOR results of the XML tags using the XQuery data() function:

--use a FLWOR statement to find all authors in Computer genre (remove XML tags)
SELECT AdventureWorks2012.dbo.fn_BooksXML().query('for $Book in /catalog/book
				let $Genre := $Book/genre
				let $Author := $Book/author
			    where $Genre = "Computer"
			    return data($Author)
			   ') AS [Computer Book Authors]
GO

However, this method still returns the results contained in one record - only now without the XML tags:

The single reult, missing its tags

Incidentally, using the data() method makes it easy to return a delimited list. A few changes to our script, which includes the introduction of the concat() XQuery function, produces a delimited result set:

--use a FLWOR statement to find all authors in Computer genre (remove XML tags and delimit)
SELECT AdventureWorks2012.dbo.fn_BooksXML().query('for $Book in /catalog/book
				let $Genre := $Book/genre
				let $Author := $Book/author
			    where $Genre = "Computer"
			    return concat("[",data($Author)[1],"]")
			   ') AS [Computer Book Authors]
GO

The delimited results set

Although FLWOR is a powerful way to produce iterative results, it returns values in an XML format by default.

Filtering XML data using XQuery value() method

Another option we can try is to do our search using a path inside of an XQuery value() method. The value() method simply pulls the value data from an XML node element or attribute. Let’s use value() for our situation, this time using the CTE method to pull the XML data from the file:

--use XQuery value() method within a CTE
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT Contents.value('(/catalog/book/genre[. = "Computer"]/../author)[1]', 'varchar(50)') AS [Computer Book Authors]
FROM   XmlFile
GO

Within the value() method’s XQuery path, we are indicating here that only genre nodes with a value of ‘Computer’ should be examined. Then, we use the parent node axis abbreviation (‘..’) path step to back up to the book node - where its genre node meets that requirement - and then look at its author node. This works because the genre and author nodes are on the same level inside of the book node. Our results for the above query are:

Results for the above query

We can immediately see a major problem with using this solution for our purposes – it returns matches for just the first book node; The Value() method will only return one node at a time. This is due to the singleton requirement of the value() method. The singleton requirement forces us to indicate which book node instance we are referring to – even if there is only one instance. However, the XML data contains multiple book nodes, and we need to look at each one in order to get an accurate answer to our request. We’ve indicated that we want the results from the first book node instance returned - by using the ‘[1]’ as a singleton. Therefore, if we change that to ‘[2],’ we get:

Results using the [2] value

This type of value() query will work fine when using XML fragments having only one node instance, or where only one instance’s values were required. In our case, however, this is not what we want.

Filtering XML data using XQuery nodes() method

We have another possible way to get the results we want: using the XQuery nodes() method. The nodes() method is designed to shred XML node values into a relational format – nodes() actually returns a rowset based on the shredded XML. This sounds like exactly what we need. Let’s try an XQuery nodes() query:

--use XQuery nodes() method within a CTE
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
FROM   XmlFile CROSS APPLY Contents.nodes ('(//book/genre[. = "Computer"]/..)') AS t(c);

This query uses CROSS APPLY to create a kind of self join back to the XmlFile table expression. The CROSS APPLY requires table (t) and column (c) aliases. You may notice that we’ve used a basic XQuery statement in the value() method, but most of the real filtering is done in the nodes() method. Again, we have indicated that the values should be looked for in the genre nodes, and we’ve used the parent::node() axis abbreviation again to point back to the genre node’s parent book node. You’ll also notice that we are again able to employ the value() method, despite its singleton requirement. This is possible because the nodes() method shreds the XML nodes into a relational structure – the value() method is returning values for each record in a rowset generated by nodes(). So, the rowset generated by nodes()contains all book nodes where the genre is ‘Computer.’ The value() method indicates that the author node from the given book node is where the value should be taken from.

The query returns the following:

Results displayed in the relational format

We’ve finally returned the results in the relational format we are looking for. We can eliminate redundant authors by adding the following change:

--use XQuery nodes() method within a CTE to get distinct authors
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT DISTINCT x.[Computer Book Authors] FROM(
	SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
	FROM   XmlFile CROSS APPLY Contents.nodes ('(//book/genre[. = "Computer"]/..)') AS t(c)
)x;

The relational resultswith the redundant authors removed

Working around Case-sensitivity in XQuery

Our query is case-sensitive. Genre node matches for the value ‘Computer’ are not the same as matches for the value ‘computer.’ To work around case-sensitivity in XQuery statements, you can use either the upper-case() or lower-case() functions:

 --use XQuery nodes() method within a CTE to get distinct authors, resolving case-sensitivity
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT DISTINCT x.[Computer Book Authors] 
FROM
(
	SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
	FROM   XmlFile CROSS APPLY Contents.nodes ('(//book/genre[lower-case(.) = "computer"]/..)') AS t(c)
)x;

The lower-case() XQuery function forces the genre node value to be lower-case, and then compares it with the lower-case value ‘computer.’

Using the XQuery contains() function to find substrings

Our query is also currently limited to finding exact matches to the lower-case value ‘computer’ in genre nodes. This means that node values with any accidental spaces, for example, will be excluded from our results. Also, if we wanted to search for text in the description nodes, we would have to type the entire lengthy text in our XQuery statement in order to find matches. To allow for substring searches, XQuery provides the contains() function. We’ll use this in conjunction with our current query by making the following changes:

--use XQuery nodes() method within a CTE to get distinct authors, resolve case-sensitivity - returning substring matches
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT DISTINCT x.[Computer Book Authors] 
FROM
(
	SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
	FROM   XmlFile CROSS APPLY Contents.nodes ('(//book/genre[contains(lower-case(.), "comp")]/..)') AS t(c)
)x;

We have modified the way the lower-case() function is used so that we could nest it inside of the contains() function. Even though we are only searching for a substring of the genre node value (‘comp’), we still get the correct results:

The query results

Importing and filtering XML data using a view

If we want to establish a more complete, permanent database-oriented solution, we can create a view that encapsulates some of the work we did earlier, allowing for very easy SQL querying against a result set returned by XQuery:

--create view to return all XML data from file
USE AdventureWorks2012
GO
CREATE VIEW v_BooksXML
AS
	WITH XmlFile (Contents) AS (
	SELECT CONVERT (XML, BulkColumn) 
	FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
	)
	SELECT	
			c.value('(@id)', 'varchar(10)') AS [ID],
			c.value('(author)[1]', 'varchar(100)') AS [Author],
			c.value('(title)[1]', 'varchar(100)') AS [Title],
			c.value('(genre)[1]', 'varchar(50)') AS [Genre],
			c.value('(price)[1]', 'decimal(8,2)') AS [Price],
			c.value('(publish_date)[1]', 'date') AS [Date],
			c.value('(description)[1]', 'varchar(200)') AS [Description]
	FROM   XmlFile CROSS APPLY Contents.nodes ('(//book)') AS t(c)

GO

Now we can simple query the view, without using XQuery at all:

--query view
SELECT DISTINCT Author FROM AdventureWorks2012..v_BooksXML
WHERE Genre LIKE 'comp%'
GO

Viewing the results without XQuery

Efficiency comparisons

How do all of these different methods of accessing XML file data compare in terms of query speed? To check this, we’ll run the different methods together as a batch (making changes to WHERE criteria in the FLWOR statements to provide a better comparison to the other queries):

--XQuery nodes() method within a CTE
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT DISTINCT x.[Computer Book Authors] 
FROM
(
	SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
	FROM   XmlFile CROSS APPLY Contents.nodes ('(//book/genre[contains(lower-case(.), "comp")]/..)') AS t(c)
)x

--XQuery nodes() method using function
SELECT DISTINCT x.[Computer Book Authors] 
FROM
(
	SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
	FROM   (SELECT AdventureWorks2012.dbo.fn_BooksXML()) AS y(z) 
	CROSS APPLY z.nodes ('(//book/genre[contains(lower-case(.), "comp")]/..)') AS t(c)
)x

--use a FLWOR statement within CTE
;WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn) 
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT Contents.query('for $Book in /catalog/book
				let $Genre := $Book/genre
				let $Author := $Book/author
			    where $Genre [contains(lower-case(.), "comp")]
			    return $Author
			   ') AS [Computer Book Authors]
FROM XmlFile
GO;

--FLWOR statement using function
SELECT AdventureWorks2012.dbo.fn_BooksXML().query('for $Book in /catalog/book
				let $Genre := $Book/genre
				let $Author := $Book/author
			    where $Genre [contains(lower-case(.), "comp")]
			    return $Author
			   ') AS [Computer Book Authors]
GO

--view
SELECT DISTINCT Author FROM AdventureWorks2012..v_BooksXML
WHERE Genre LIKE 'comp%'
GO

Comparison of the efficiency of the various methods

The view is the most expensive method by far, and at 71% of the total batch cost, its query cost is roughly 10x either of the others. The differences between using a CTE versus a function seem negligible, as both the nodes() methods and the FLWOR statements cost around 7% of the batch. A minor exception is the FLWOR statement that uses a function – costing slightly more than its CTE counterpart at 8%.

Summary

We started out with an XML file that we wanted to filter for specific results, returning the values in a relational format. We first found a way to pull the XML data into a form that SQL Server could easily access and run queries against, using the OPENROWSET function. A more permanent extension of that idea was to create a function that encapsulated the same logic. We made several attempts to find the proper query solution, at first using an iterative FLWOR statement, but that did not return results relationally. We then tried using a plain XQuery value() method query, but discovered that the value() method’s singleton requirement restricted us to returning only one record at a time. We ended up finding a good resolution by using the XQuery nodes() method – the advantage that nodes() has is that it returns a rowset of the XML data, shredded and organized relationally. We also made a few modification to our nodes() query, eliminating the case-sensitivity and exact-value-match restrictions. We then incorporated some of that logic into a view that can be filtered using a SQL Server WHERE clause, rather than using XQuery criteria. The view solution is arguably the most straightforward solution, providing the creation of a permanent object is allowed. However, a batch query cost comparison showed us that the view method is by far the most costly.

Seth Delconte

Author profile:

Seth Delconte is a SQL Server Developer. In addition to SQL Server database development and administration, he enjoys C#.NET and Linux development.

Search for other articles by Seth Delconte

Rate this article:   Avg rating: from a total of 32 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Good article
Posted by: vchandm23 (view profile)
Posted on: Friday, March 15, 2013 at 6:59 PM
Message: This is of the standard of university lab worksheet. Awesome. xquery is an amazing technology in database systems. I have been doing this in Oracle but good to see it in MSSQL finally.

Cheers,
vchandm23

Subject: Views
Posted by: NBSteve (view profile)
Posted on: Thursday, March 28, 2013 at 1:14 PM
Message: Thanks for the article, lots of good XML examples here. My only critique is that your efficiency comparison makes the use of views appear inefficient, which isn't directly the case. The use of a view itself doesn't add any significant overhead to the query. For example, if you instead created your view directly from the CTE example as follows:

[code]
CREATE VIEW v_BooksXML
AS
WITH XmlFile (Contents) AS (
SELECT CONVERT (XML, BulkColumn)
FROM OPENROWSET (BULK 'C:\Books.xml', SINGLE_BLOB) AS XmlData
)
SELECT DISTINCT x.[Computer Book Authors] AS Author
FROM
(
SELECT c.value('(author)[1]', 'varchar(50)') AS [Computer Book Authors]
FROM XmlFile CROSS APPLY Contents.nodes ('(//book/genre[contains(lower-case(.), "comp")]/..)') AS t(c)
)x

GO
[/code]

You'll find the same query using this view performs evenly with the other non-view examples. However, the view you're using in the article has 7 calls to the value() xml function, while all the other queries only use 1 value() call.

The View has all the extra value() calls so that it can be more useful and flexible, allowing it to be reused in more cases. However, that means it's also doing a lot of extra work even when a bunch of the needed columns aren't being used, hurting the performance. This is a danger of views... they can be useful for encapsulating routines in reusable code, but they may be encapsulating far more processing than is actually necessary for an individual query.

Subject: RE:Views
Posted by: delcons (view profile)
Posted on: Thursday, March 28, 2013 at 1:51 PM
Message: Thanks - good explanation of why the view is more expensive. However, I still think the way I present the view is a valid comparison to the other XML data accessibly options in terms of functionality - note that I didn't use the DISTINCT keyword or any WHERE criteria in the SQL function creation. Those options are used in the script that utilizes the object, not in the object creation. So, while you are correct in explaining how multiple XQuery value() method calls drag down the view's speed, I'm sticking to my original intent of comparing it to other XML accessibility options ;).

Subject: execution plans of XML querying
Posted by: DGPerson (view profile)
Posted on: Tuesday, April 02, 2013 at 11:56 AM
Message: The most common problem with shredding XML into relational tables lies with coders doing it 1 row at a time. It is best to use OPENXML and load the results into a temp table. Then work the data according to the needs of the app. Are there occasions when this won't work? Sure but if you receive well formed XML this will work and reduce the remote calls to the XML parser down to one instead of tens or hundreds of callouts depending upon the size of the XML body.

Subject: RE:Good article
Posted by: delcons (view profile)
Posted on: Friday, April 19, 2013 at 11:18 AM
Message: Thanks for the comments :)

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Highway to Database Recovery
 Discover the best backup and recovery articles on Simple-Talk, all in one place. Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.