Ad-Hoc XML File Querying

When you need to shred just part of the data within a large XML file into a SQL Server table, the most efficient way is to just select what you need via XQuery or by using XPath, before shredding it into a table. But precisely how would you do that?

XML data is often stored in files, outside of a database. One obvious use for XML file storage is to hold configuration data that is read by web servers and other systems. If ad-hoc querying of XML files becomes necessary, it’s important to find accurate and efficient ways to extract just the data you require from the XML, and return the results as a table. In this post, we’ll look into finding the best ways to perform ad-hoc queries against XML files using SQL Server’s XQuery implementation and end up with the results in a table.

Let’s use Microsoft’s sample Books.xml file, which contains 12 book entity records that include element and attribute values for book id, author, title, genre, publication date, and description. A sample book node looks like this:

For our purposes, I’ve created a local copy of Books.xml on C:. Let’s assume that we want to query this local file from SQL Server, returning the results of our query in a relational format – this can be referred to as ‘shredding‘ XML node values.

Importing XML data using a CTE

Our first step will be to find a way to access the XML file from SQL Server. One way to do this is to use the versatile OPENROWSET function. It can be used to import many different formats of bulk data. We can use OPENROWSET in a CTE (Common Table Expression) to directly access the XML data from the BLOB (Binary Large OBject) that OPENROWSET converts it into, for each time we run a query against it:

The ‘SINGLE_BLOB’ argument of OPENROWSET designates the XML data into a Blob of data type VARBINARY(MAX) – this type also is limited to a size of 2GB. The CTE creates a table expression named ‘XMLFile’ and its singlular column to hold the data – giving us the ability to reference the XML data properly while running XQuery statements against it.

Importing XML data using a function

Another way we can access the XML data from the file on demand is by creating a function that returns an XML variable. A variable of the XML data type can hold up to 2 GB of XML data – for context, our entire Books.xml file is only 5KB. We can easily create a SQL Server scalar function to do this:

We’ve created this function in the AdventureWorks2012 database. It can be called like this:

Obviously, creating a function to pull XML data is more of a permanent way to access an XML file, as opposed to an ad-hoc script.

Filtering XML data using FLWOR

Now that we have found ways to access our XML file data from SQL Server, we can start looking for the best way to run queries against it. To start with, let’s say we are looking for a query that will return author names for books in the ‘Computer’ genre category. An obvious first possible solution might be to use the iterative power of the XQuery FLWOR statement. FLWOR stands for For, Let, Where, Order by, and Return. FLWOR is an integral part of XQuery. A possible FLWOR solution to our problem might look like this:

We’ve used a FLWOR statement inside of an XQuery query() method, accessing the XML data from the file using our function This does return the values that we want:

1756-bd72eb0e-705f-42a0-95f1-f75dde42ad3

But it definitely does not present the results in the relational format we are looking for – the data is returned in its native format, an XML fragment, instead. We could try stripping the FLWOR results of the XML tags using the XQuery data() function:

However, this method still returns the results contained in one record – only now without the XML tags:

1756-5efccd48-ed98-4930-ac44-b0e08ccc4c0

Incidentally, using the data() method makes it easy to return a delimited list. A few changes to our script, which includes the introduction of the concat() XQuery function, produces a delimited result set:

1756-601d0e91-e1c4-4f54-aee0-7bbd099f990

Although FLWOR is a powerful way to produce iterative results, it returns values in an XML format by default.

Filtering XML data using XQuery value() method

Another option we can try is to do our search using a path inside of an XQuery value() method. The value() method simply pulls the value data from an XML node element or attribute. Let’s use value() for our situation, this time using the CTE method to pull the XML data from the file:

Within the value() method’s XQuery path, we are indicating here that only genre nodes with a value of ‘Computer’ should be examined. Then, we use the parent node axis abbreviation (‘..’) path step to back up to the book node – where its genre node meets that requirement – and then look at its author node. This works because the genre and author nodes are on the same level inside of the book node. Our results for the above query are:

1756-8a4e7d42-f8a6-4eee-9f76-66adb6d589a

We can immediately see a major problem with using this solution for our purposes – it returns matches for just the first book node; The Value() method will only return one node at a time. This is due to the singleton requirement of the value() method. The singleton requirement forces us to indicate which book node instance we are referring to – even if there is only one instance. However, the XML data contains multiple book nodes, and we need to look at each one in order to get an accurate answer to our request. We’ve indicated that we want the results from the first book node instance returned – by using the ‘[1]’ as a singleton. Therefore, if we change that to ‘[2],’ we get:

1756-0dae9f7f-051c-4382-9622-882f195b892

This type of value() query will work fine when using XML fragments having only one node instance, or where only one instance’s values were required. In our case, however, this is not what we want.

Filtering XML data using XQuery nodes() method

We have another possible way to get the results we want: using the XQuery nodes() method. The nodes() method is designed to shred XML node values into a relational format – nodes() actually returns a rowset based on the shredded XML. This sounds like exactly what we need. Let’s try an XQuery nodes() query:

This query uses CROSS APPLY to create a kind of self join back to the XmlFile table expression. The CROSS APPLY requires table (t) and column (c) aliases. You may notice that we’ve used a basic XQuery statement in the value() method, but most of the real filtering is done in the nodes() method. Again, we have indicated that the values should be looked for in the genre nodes, and we’ve used the parent::node() axis abbreviation again to point back to the genre node’s parent book node. You’ll also notice that we are again able to employ the value() method, despite its singleton requirement. This is possible because the nodes() method shreds the XML nodes into a relational structure – the value() method is returning values for each record in a rowset generated by nodes(). So, the rowset generated by nodes()contains all book nodes where the genre is ‘Computer.’ The value() method indicates that the author node from the given book node is where the value should be taken from.

The query returns the following:

1756-85416852-1819-40b8-bf11-52c8de2af18

We’ve finally returned the results in the relational format we are looking for. We can eliminate redundant authors by adding the following change:

1756-2fbcd3f0-20c6-4fc3-8cf0-a58a454a460

Working around Case-sensitivity in XQuery

Our query is case-sensitive. Genre node matches for the value ‘Computer’ are not the same as matches for the value ‘computer.’ To work around case-sensitivity in XQuery statements, you can use either the upper-case() or lower-case() functions:

The lower-case() XQuery function forces the genre node value to be lower-case, and then compares it with the lower-case value ‘computer.’

Using the XQuery contains() function to find substrings

Our query is also currently limited to finding exact matches to the lower-case value ‘computer’ in genre nodes. This means that node values with any accidental spaces, for example, will be excluded from our results. Also, if we wanted to search for text in the description nodes, we would have to type the entire lengthy text in our XQuery statement in order to find matches. To allow for substring searches, XQuery provides the contains() function. We’ll use this in conjunction with our current query by making the following changes:

We have modified the way the lower-case() function is used so that we could nest it inside of the contains() function. Even though we are only searching for a substring of the genre node value (‘comp’), we still get the correct results:

1756-67dfdda4-b2a0-40ab-bc4c-52df68c8e17

Importing and filtering XML data using a view

If we want to establish a more complete, permanent database-oriented solution, we can create a view that encapsulates some of the work we did earlier, allowing for very easy SQL querying against a result set returned by XQuery:

Now we can simple query the view, without using XQuery at all:

1756-ae2ac41e-a88b-4956-a68c-70fafbf873c

Efficiency comparisons

How do all of these different methods of accessing XML file data compare in terms of query speed? To check this, we’ll run the different methods together as a batch (making changes to WHERE criteria in the FLWOR statements to provide a better comparison to the other queries):

1756-a40d9e5f-ed99-46ee-ba2d-be4885e7082

The view is the most expensive method by far, and at 71% of the total batch cost, its query cost is roughly 10x either of the others. The differences between using a CTE versus a function seem negligible, as both the nodes() methods and the FLWOR statements cost around 7% of the batch. A minor exception is the FLWOR statement that uses a function – costing slightly more than its CTE counterpart at 8%.

Summary

We started out with an XML file that we wanted to filter for specific results, returning the values in a relational format. We first found a way to pull the XML data into a form that SQL Server could easily access and run queries against, using the OPENROWSET function. A more permanent extension of that idea was to create a function that encapsulated the same logic. We made several attempts to find the proper query solution, at first using an iterative FLWOR statement, but that did not return results relationally. We then tried using a plain XQuery value() method query, but discovered that the value() method’s singleton requirement restricted us to returning only one record at a time. We ended up finding a good resolution by using the XQuery nodes() method – the advantage that nodes() has is that it returns a rowset of the XML data, shredded and organized relationally. We also made a few modification to our nodes() query, eliminating the case-sensitivity and exact-value-match restrictions. We then incorporated some of that logic into a view that can be filtered using a SQL Server WHERE clause, rather than using XQuery criteria. The view solution is arguably the most straightforward solution, providing the creation of a permanent object is allowed. However, a batch query cost comparison showed us that the view method is by far the most costly.

For more articles like this, sign up to the fortnightly Simple-Talk newsletter.

Tags: , , , , ,

  • 34204 views

  • Rate
    [Total: 44    Average: 4.6/5]
  • vchandm23

    Good article
    This is of the standard of university lab worksheet. Awesome. xquery is an amazing technology in database systems. I have been doing this in Oracle but good to see it in MSSQL finally.

    Cheers,
    vchandm23

  • NBSteve

    Views
    Thanks for the article, lots of good XML examples here. My only critique is that your efficiency comparison makes the use of views appear inefficient, which isn’t directly the case. The use of a view itself doesn’t add any significant overhead to the query. For example, if you instead created your view directly from the CTE example as follows:

    [code]
    CREATE VIEW v_BooksXML
    AS
    WITH XmlFile (Contents) AS (
    SELECT CONVERT (XML, BulkColumn)
    FROM OPENROWSET (BULK ‘C:Books.xml’, SINGLE_BLOB) AS XmlData
    )
    SELECT DISTINCT x.[Computer Book Authors] AS Author
    FROM
    (
    SELECT c.value(‘(author)[1]’, ‘varchar(50)’) AS [Computer Book Authors]
    FROM XmlFile CROSS APPLY Contents.nodes (‘(//book/genre[contains(lower-case(.), "comp")]/..)’) AS t(c)
    )x

    GO
    [/code]

    You’ll find the same query using this view performs evenly with the other non-view examples. However, the view you’re using in the article has 7 calls to the value() xml function, while all the other queries only use 1 value() call.

    The View has all the extra value() calls so that it can be more useful and flexible, allowing it to be reused in more cases. However, that means it’s also doing a lot of extra work even when a bunch of the needed columns aren’t being used, hurting the performance. This is a danger of views… they can be useful for encapsulating routines in reusable code, but they may be encapsulating far more processing than is actually necessary for an individual query.

  • delcons

    RE:Views
    Thanks – good explanation of why the view is more expensive. However, I still think the way I present the view is a valid comparison to the other XML data accessibly options in terms of functionality – note that I didn’t use the DISTINCT keyword or any WHERE criteria in the SQL function creation. Those options are used in the script that utilizes the object, not in the object creation. So, while you are correct in explaining how multiple XQuery value() method calls drag down the view’s speed, I’m sticking to my original intent of comparing it to other XML accessibility options ;).

  • DGPerson

    execution plans of XML querying
    The most common problem with shredding XML into relational tables lies with coders doing it 1 row at a time. It is best to use OPENXML and load the results into a temp table. Then work the data according to the needs of the app. Are there occasions when this won’t work? Sure but if you receive well formed XML this will work and reduce the remote calls to the XML parser down to one instead of tens or hundreds of callouts depending upon the size of the XML body.

  • delcons

    RE:Good article
    Thanks for the comments 🙂