25 September 2012

Getting Started With XML Indexes

XML Indexes make a huge difference to the speed of XML queries, as Seth Delconte explains; and demonstrates by running queries against half a million XML employee records. The execution time of a query is reduced from two seconds to being too quick to measure, purely by creating the right type of secondary index for the query.

Introduction

Today, many organizations store and transmit data in an XML data format. Because of the ubiquity and readability of the XML format, it has become convenient and popular to both consume and provide data using XML. Because XML data is used so widely, it is increasingly important to be able to query XML values efficiently. XML indexes were designed to complement the XML data type and have been available since SQL Server 2005. When these indexes are used properly, they can dramatically reduce execution time in queries made against XML columns.

How Do XML Indexes Work?

When no indexes are present, an XML field must be ‘shredded’ when the query is executed. This means that the XML data is peeled apart from the XML tags, and organized into a relational format. An XML index does this work ahead of time, representing the XML data in an already-shredded version, thereby allowing easy filtering.

Some XML Index Ground Rules

XML indexes can be applied only to columns of the XML data type. In order to use XML indexes on a specific table, the table must have a clustered index on the primary key column (a primary key constraint includes a clustered index upon creation by default).

There are 2 types of XML indexes – primary and secondary.

A primary XML index essentially contains one row of information for each node in the XML column. This information is made up of XML node values, types, and paths. A primary index is a ‘pre-shredded’ representation of the XML blob – an easy-to-filter copy of the XML data.

Secondary XML indexes are dependent on primary XML indexes – you cannot create any secondary indexes without first having created a primary index.

Secondary XML indexes come in 3 types: PATH, VALUE, and PROPERTY indexes. Secondary indexes are designed to provide additional optimization for certain types of XML queries.

  • PATH secondary indexes often help to optimize queries that make use of a path. An exist() method in the query usually indicates that a PATH index may improve the query execution.
  • PROPERTY secondary indexes are designed to be helpful with queries where the primary key field is used, and where multiple values are retrieved from a single XML instance.
  • VALUE secondary indexes are good for queries that use wildcards in a path to find known values – if the full path is not known, but the value being filtered IS known, a VALUE index will probably optimize the query execution.

The basic syntax to create an XML index is:

Using a Primary XML Index

Let’s jump right in and start experimenting with XML indexes.

We have an existing Employee table with 500000 employee records. A sampling of the table is shown below:

1567-1-97bd54b1-f2aa-4c49-becb-88898ad6f

We will need to generate XML data from the Employee table for the purposes of testing the benefits of XML indexes. To do this, we’ll create a script that builds out an XML representation of the Employee data fields – but first, we’ll need to create a table to hold the XML data:

We’ve created the table with a primary key, which by default creates a clustered index on the primary key column – a necessity if we plan on creating XML indexes on this new table. We can now run a script that populates the new EMP_XML table. Our script will do this via the FOR XML clause, using the PATH mode to create the ‘Employee’ root node:

Notice that we’ve designated some of the source table’s fields as nodes or nested nodes, and others as node attributes in the XML table. We now have a table containing all 500000 records of employee data, in XML format:

1567-1-726509a0-b434-4cd4-9164-3fc15d2d8

The first record’s EMP_DETAILS field, expanded:

1567-1-3f688bfe-f38c-4ff0-93f5-96f3c6d2c

We can see that all fields from the source table are now represented as XML nodes and attributes. We’ll use this table to test the performance benefits of the XML index.

Comparing Query Execution Times

Let’s now look at some basic filtering operations that we may want to perform on our new employee XML data.

Before running any queries, let’s turn on STATISTICS TIME to measure the query execution time:

We’ll now use a simple example to demonstrate the benefits of using a primary XML index. To locate the street address for an employee having a Social Security Number of 574582264, we can use the following XQuery statement:

1567-1-31a4d35d-d925-4219-ae36-9fba9cb56

1567-1-c5574b08-fde0-44e0-abbe-33429cc36

The elapsed time, under SQL Server Execution Timesin the Messages tab, represents the actual query execution time in milliseconds. We can see that the query time was 2463 ms. Now let’s add a primary XML index to the table:

We’ve created a primary index on the EMP_DETAILS column of the table – the only XML field. Creating the primary XML index on a table with this many records may take a few minutes. XML data is being collected and stored during the primary XML index creation, including node values, node types, and the paths from each node to the root.

Now that a primary XML index is on the EMP_DETAILS field, we’ll run the same query again:

1567-1-7d85f9ad-973f-4deb-9129-740c6b4fe

We see that the elapsed time and CPU time were both reduced by about 60% as a result of creating the primary XML index! We can tell for sure that the XML index was used by examining the graphical execution plan. To see this in SSMS, go to the Query menu before running the query and then select Include Actual Execution Plan. After running the query, look at the results in the Execution Plan tab:

1567-1-83d0415d-2173-4512-bdf9-795dd6e27

A node that is displayed for a PrimaryXML index seek or scan in the graphical execution plan indicates that the XML index was indeed used for the query.

Disabling XML Indexes

Let’s assume we have received a request to find all address information for employees who:

  1. Live in Michigan
  2. Have a first name of ‘Desiree’

Before we create a query for this request, let’s disable the primary XML index, so we can compare the before and after time statistics. By disabling an index, we can retain the index definition whilst clearing out the stored data about the table. A disabled index is not used by the query execution plan. The current status of XML indexes can be checked by using:

1567-1-9ce1b681-d33d-4690-ac78-2c0d22364

To disable our primary index, we’ll run the following ALTER INDEX statement, with the DISABLE argument:

The primary XML index is now rendered ineffective.

To return the results we need for the ‘addresses’ request, we can include a FLWOR expression in our query, within an XQuery query() method:

1567-1-5a3d504b-e46a-4d6a-9ad4-92c5e35bb

1567-1-b90ecb03-dc0f-4f8c-b4ac-666875ef6

This query gives us the results we are looking for; every element and attribute value under the Address node, and it also includes FirstName and LastName values for clarity. The FLWOR expression iterated the Address node and returned the State, Zip, Street, and City values. After running this query a few times, the elapsed execution time consistently shows a time of around 9539ms. Now that we’ve tested this query without using any XML indexes, let’s re-enable our primary XML index.

Rebuilding XML Indexes

An XML index, like any index, can be re-enabled by simply using the REBUILD option in an ALTER INDEX statement. To rebuild our primary XML index, we’ll run the following:

The primary XML index is now re-enabled. Be aware that rebuilding an index can have serious performance implications, limiting access to the table from other processes. Other types of indexes accept the REBUILD argument with the WITH (ONLINE = ON) hint to reduce this performance hit, but not XML indexes. If we were to attempt a REBUILD WITH (ONLINE = ON) on our primary XML index, we’d get the following error:

1567-1-a03a2fe5-d2bc-4845-aa76-ba8164820

XML indexes can only use the WITH (ONLINE = OFF) option.

Now that we have re-enabled our primary XML index, it will be considered in the query execution plan. Let’s re-run our previous query and examine the statistics time again:

1567-1-6e3cf8a4-ff4d-4f3a-a6f1-676b1ea35

Our previous execution time was 9539 milliseconds. We’ve reduced the time by 90% by using just a primary XML index! Although this is a great optimization gain, let’s see if we can increase efficiency for this query even more, using a secondary XML index.

Using Secondary XML Indexes

How can we decide which secondary index to use for our query?

PATH secondary indexes are designed to help improve queries that contain a good amount of path expressions. Queries that use the exist() method are usually good candidates for a PATH index. Our last query uses 2 instances of the exist() method in its WHERE clause, and also includes path expressions in the select list. The PATH index seems promising for our query.

PROPERTY secondary indexes are intended for queries where the primary key is a known value, and multiple return values are required from the same XML instance. We are returning multiple return values from the same XML instance. A PROPERTY index also seems like it would be helpful with our query.

VALUE secondary indexes are useful for searching for known element or attribute values, without necessarily knowing the element or attribute names, or the full path to the value. Queries that use wildcards for portions of a path would probably benefit from a VALUE index. In our query, we do know the paths to the elements and attributes we are filtering, as well as the values. A VALUE index could help, but probably not as much as a PATH index would.

Using PATH Secondary XML Indexes

The PATH secondary XML index is known to help with queries that have an exist() expression in the WHERE clause. Since our query uses 2 instances of the exist() method, let’s try aPATH secondary index, and see if it improves performance. We’ll create thePATH index as follows:

Note that we need to reference the primary XML index (IX_EMP_DETAILSx) in the index create statement with the USING clause, to specify that it will be a secondary XML index. A primary XML index is required in order to implement any secondary index. The ‘FOR PATH’ option is what designates the index as a PATH index. Now that we have a PATH secondary index, let’s re-run our query and view the execution time results:

1567-1-6c35585e-8778-4e9a-8235-3599f16bd

We see another huge performance gain by using the PATH secondary XML index – another reduction in execution time by more than 60%.

Using PROPERTY Secondary XML Indexes

With our ‘addresses’ query, a PROPERTY secondary XML index could be slightly more advantageous than a PATH index, because in addition to using paths, our query also returns multiple values. Let’s experiment and see if adding a PROPERTY index will increase optimization.

1567-1-7d55d440-fb39-468d-bede-aaeb72bce

After running the ‘addresses’ query again, this time with a PROPERTY index in place, we see yet another reduction in execution time – from 332 ms. to 190 ms. It seems pretty obvious that the PROPERTY index was the reason for the performance gain here, but how can we tell for sure? We now have 2 different secondary indexes on our table, PATH and PROPERTY, and their nodes both look similar in the graphical execution plan:

1567-1-d6a8654d-d69c-439c-9bc7-783ac8972

To find out, all we have to do is hover over the SecondaryXML node and look in the ‘Object’ section of the popup – OR right-click the node, select Properties, and examine the ‘Object’ section of the pane:

1567-1-6cf8db62-5162-4662-8043-ad62d47c5

We can clearly see the name of the index (IX_EMP_DETAILS_propertyx) that was used in the query. The PROPERTY index was utilized for our query. This is a good example of how experimenting with secondary XML indexes can result in highly optimized queries.

Using VALUE Secondary XML Indexes

We have received another request: write a query that returns all details for all employees living in zip code 98679. Let’s imagine that we aren’t familiar with the XML structure, but still need to query it without bothering to look up schema details. We are not sure of the node attribute name – is it ‘zip‘, ‘zipcode‘, or ‘zip code‘? We do know that it is listed as a node attribute, not a node element. We’ll choose a query that reflects what we know about the XML structure:

We’ve used the ‘//‘ path wildcard to indicate that we don’t know the full path to the attribute, and the ‘*‘ attribute wildcard to allow for the value to be found in any element attribute. The ‘@‘ shows that we know that the zip code value is an element attribute value.

1567-1-d6ae2520-11bd-469d-a5cc-c5a505726

1567-1-d68aa08a-8c32-4d73-b997-721d03371

The results are processed in 1809 ms. If we look at the graphical execution plan, we see that the PATH secondary index was used almost exclusively:

1567-1-617d04db-afad-44e8-bde8-e11d85097

Since our ‘zipcode’ query uses an incomplete path AND a wildcard, it’s a good candidate for a VALUE secondary index. Let’s create a VALUE index, and test the execution time from a second query run:

1567-1-a252421d-eb68-44d1-aa29-7eeb0ad8d

The VALUEindex reduced the query execution time to less than a millisecond!

We can see the SecondaryXML index node in the graphical execution plan:

1567-1-5a6db675-2a27-40b8-b36a-0ca2199f3

Let’s check the ‘Object’ section of the node’s Properties pane to confirm that the VALUE index was used:

1567-1-6c0d2aea-4c03-4fc3-ac34-ad44d6b08

We see that the VALUE index (IX_EMP_DETAILS_valuex) was used for the query.

If we had run this query without any XML indexes (by disabling the primary index and all secondary indexes), the query execution time results would have been as follows:

1567-1-4937bc4a-0848-401f-a0a8-2e8bf674f

So, a query that originally cost more than 7 seconds has now been optimized so that it takes less than 1 millisecond.

Disadvantages

Some disadvantages to using XML indexes include:

  • As with any index, data modifications are more expensive. Updates and inserts to an XML field that is indexed will take longer, as the index(es) must be maintained and updated.
  • XML indexes can claim massive amounts of disk space. To get an idea of how our XML indexes have affected disk space, consider how the sp_spaceused statistics have changed with each index addition:
    • With no XML indexes:
    • 1567-1-3125d89f-8b09-4b6a-bb80-e111a9970
    • After adding the primary XML index:
    • 1567-1-505cc84a-7470-4873-a449-3bf602650
    • After adding the PATH secondary index:
    • 1567-1-49f31374-8e68-4528-9ada-00698f74b
    • After adding the VALUE secondary index:
    • 1567-1-3a085bbc-c8d5-41a5-b759-9df079985
    • Ater adding the PROPERTY secondary index:
    • 1567-1-a8c6bd9e-3841-4e16-bca4-00f84c75e
    • The cumulative index space used for the table was increased by more than 2GB after adding the primary XML index and the 3 secondary indexes. That is a substantial disk space cost overhead for the XML indexes , considering that the data size of the 500000-row table is only about 300MB.

Conclusion

We’ve looked at some of the reasons that using XML indexes can increase a query’s performance, and have gone over some of the basic rules for using them.

We’ve delved into some specific test cases in order to prove the performance benefits of XML indexes on the XML data type. We found by comparing time statistics that using a primary XML index can reduce query execution times dramatically. We looked at the different types of secondary XML indexes, and saw that by implementing the correct secondary index, a query’s performance can be optimized even further.

We also reviewed some disadvantages to using XML indexes, such as increased table modification times due to index maintenance, and increased disk space consumption. However, when XML indexes are used correctly to support specific query demands, any increased disk and index maintenance costs may be seen as a small price for such large query performance gains.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

This post has been viewed 34364 times – thanks for reading.

Tags: , , ,

  • Rate
    [Total: 45    Average: 4.6/5]
  • Share

Seth Delconte

View all articles by Seth Delconte

  • Anonymous

    XML Data Size
    Tried this around two years ago.
    I found that the primary index does not make any difference when the XML data size is important , in my case, 100kb.
    Also the index affects the insert performance as i insert an average of 500k XML per day.

    This seem fine for low data volumes and small xml size.

  • George

    Xml indeces slowed down my query!
    When I applied all 4 indeces to the query below, it took almost 10 minutes for the query to complete. Then by trial an error I removed all indeces and the query is taking less that 1 minute.

    SELECT x.y.value(‘@CkSelect’,’bit’) AS CkSelect,
    x.y.value(‘@CkPos’,’varchar(18)’) AS ckPos,
    PayFileID,
    x.y.value(‘@CkSeqNo’,’char(18)’) AS CkSeqNo,
    x.y.value(‘@CkNo’,’nvarchar(30)’) AS ckNo,
    (SUBSTRING(dbo.jul2GregDt(x.y.value(‘@CkDate’,’varchar(10)’)),1,6) + SUBSTRING(dbo.jul2GregDt(x.y.value(‘@CkDate’,’varchar(10)’)),9,2)) AS ckDate, " _
    (SUBSTRING(dbo.jul2GregDt(x.y.value’@CkRxDate’,’varchar(10)’)),1,6) + SUBSTRING(dbo.jul2GregDt(x.y.value(‘@CkRxDate’,’varchar(10)’)),9,2)) AS ckRxDate,
    CONVERT(varchar(15),x.y.value(‘@CkAmount’,’money’),1) AS ckAmnt,
    CONVERT(varchar(15),x.y.value(‘@CkPLBAdjust’,’money’) + x.y.value(‘@CkPLBTax’,’money’),1) AS ckAdjust,
    CONVERT(varchar(15),x.y.value’@CkAmount’,’money’) + x.y.value(‘@CkPLBAdjust’,’money’) + x.y.value(‘@CkPLBTax’,’money’) – x.y.value(‘@CkCredited’,’money’),1) AS ckBal,
    x.y.value(‘@PayerNm’,’nvarchar(60)’) AS ckPayerNm,
    x.y.value(‘@PayerID’,’nvarchar(60)’) AS ckPayerID,
    x.y.value(‘@PayeeNm’,’varchar(60)’) AS ckPayeeNm,
    x.y.value(‘@PayeeNPI’,’nvarchar(10)’) AS ckPayeeNPI,
    x.y.value(‘@CkProcessedFlag’,’char(1)’) AS ckProcessedFlag,
    x.y.value(‘@PLB’,’varchar(100)’) AS ckPLB,
    x.y.value(‘@CkRegID’,’varchar(18)’) AS ckRegID FROM PayFiles CROSS APPLY PayFileXml.nodes(‘PayFile/Check’) x(y) INNER JOIN #payFilID ON id = PayFileID

  • delcons

    RE:Xml indeces slowed down my query!
    George, try a primary XML index only, without any secondary indexes. Make sure to run the query at least twice to allow the query optimizer to build the best execution plan – sometimes the very first execution IS slower… I also remember reading that there was a bug that caused primary XML indexes to be slower… I am not sure which version of SQL Server that was on, so make sure you have the latest SQL Server updates from MS.

  • Jeff Moden

    Nicely done, Seth
    Really nice article, Seth. Well done.

    As a side bar, though, if anyone tried to store such XML in one of my databases except for storing it in a temporary staging table to convert it to a "normal" table, we’d have to have a real serious meeting about it out in the wood shed. 😉

  • delcons

    RE:Nicely done, Seth
    Thanks Jeff!

  • tfountain

    The “wood shed”
    Nice line there Jeff! I’m fighting this same battle now. But since there is one DBA and many developers, logic loses over the whining.

  • mkalim_79@yahoo.com

    Need help with XML blob in one single row
    I have a situation where i have a 110MB file that needs to be dumped into a single row on an XML column type. Then I want to query this large XML test using XPath in SQL queries. Is there any efficient way to perform this query or if any indexes can be created on XML data contained in this XML column.

  • Peso

    XMLNAMESPACE
    It seems that using a namespace renders all xml-indexes useless, except for the primary xml.