01 February 2012

Converting String Data to XML and XML to String Data

We all appreciate that, in general, XML documents or fragments are held in strings as text markup. In SQL Server, XML variables and columns are instead tokenised to allow rapid access to the data within. This is fine, but can cause some odd problems, auch as 'entitization'. What, also, do you do if you need to preserve the formatting? As usual Rob Sheldon comes to our aid.

When you’re working with XML data, you might find that you want to convert a value from the XML data type to another type, or from another type to the XML type. SQL Server lets you do both, to a limited degree. You can convert XML data to any of the string or binary types, and you can convert any of the string or binary types to XML. The process you follow if fairly straightforward in either case. Mostly, you need to know when SQL Server will do implicit conversions or when you must explicitly cast data into another type. And you should be aware of a couple other subtleties when converting to or from XML data.

NOTE: This article is the fourth in a series about working with XML data in SQL Server. The first three articles cover the XML data type (“Working with the XML Data Type in SQL Server“), its methods (“The XML Methods in SQL Server“), and incorporating XML into database objects (“Incorporating XML into Your Database Objects“).

Converting String Data to XML

In SQL Server, you can convert data configured with any of the character or binary data types-such as CHAR, VARCHAR, and VARBINARY-to the XML data type. You can use the CAST() or CONVERT() function to explicitly cast the data to a different type, or you can let SQL Server implicitly convert the data. For instance, SQL Server will automatically convert a string value stored with the NVARCHAR data type into an XML value, as shown in the following example:

In this set of statements, I first declare the @string variable with the NVARCHAR(MAX) data type and then the @xml variable with the XML data type. I assign a string value-an XML fragment-to the @string data type, then set the value of @xml to equal @string. Because SQL Server can implicitly convert an NVARCHAR value to an XML value, the assignment is very straightforward, which I confirm by querying the @xml variable. As expected, the SELECT statement returns the following results:

<bookstore><book>Candide</book></bookstore>

The result set contains the XML fragment as it was originally assigned to the @string variable. The same holds true for other character data types, in terms of performing implicit conversions. For instance, in the following example, I replace the NVARCHAR(MAX) data type with VARCHAR(100):

This example is identical to the preceding one, except for switching the character data types. As a result, the SELECT statement returns the same XML fragment as before:

<bookstore><book>Candide</book></bookstore>

Again, SQL Server has implicitly converted the string data to XML. However, you can explicitly convert that data if you want. In the following example, I use the CAST() function to convert the @string value:

When you specify the CAST() function, you must pass in an expression that references the data to be converted, in this case, the @string variable, followed by the AS keyword and the name of the target data type, XML. All this is enclosed in parentheses, as I’ve done here. I then assign the value returned by the CAST() function to the @xml variable. As to be expected, the SELECT statement returns the same results as the previous examples:

<bookstore><book>Candide</book></bookstore>

You can just as easily use the CONVERT() function to achieve the same results. The main difference is how you structure the arguments that you pass into the function. For CONVERT(), you must first specify the target data type (XML), followed by the expression that references the source data (@string), as shown in the following example:

Notice that I follow the XML data type argument with a comma, then I specify the @string variable. Once again, the SELECT statement returns the XML fragment as it is assigned to the @string variable:

<bookstore><book>Candide</book></bookstore>

You might be wondering why you would explicitly cast string or binary data to the XML type if SQL Server handles such conversions implicitly. Let’s look at the CAST() function first. One reason you might want to use CAST()is if you plan to run your SQL script against a database management system in addition to SQL Server and that system doesn’t support implicit conversions. Because the CAST() function conforms to ANSI specifications, you can use it with any database systems that conform to those standards.

That’s not the case with the CONVERT() function, which is specific to Transact-SQL in SQL Server. If you want to run your script against another database system, you have to modify your code to conform to that system’s specifications. However, if you’re running your script only against SQL Server, you can take advantage of the CONVERT() function’s support for additional options, which let you better refine the conversion process.

For instance, in the following example I use tabs between the parent and child elements in the XML fragment:

The tabs serve only to add white space between the XML elements and do not affect the elements themselves. Then, as I do in the previous example, I use the CONVERT() function to cast the string as XML data. However, when I run the SELECT statement, the data no longer includes the white space, as shown in the following results:

<bookstore><book>Candide</book></bookstore>

But I can preserve the white space in the original string by adding a third argument to the CONVERT() function, as shown in the following example:

Notice that the function’s third argument is 1, which tells SQL Server to preserve the white space in the source data during the conversion process. Now the SELECT statement returns the following results:

<bookstore><book>Candide</book></bookstore>

This time the tabs are intact. I could have used spaces instead and received similar results. Now let’s look at what happens when we add linefeeds to the XML fragment. In the following example, I modify the string to include an additional book along with linefeeds and tabs:

Because I pass the third argument (1) into the CONVERT() function, as I did in the previous example, the conversion process preserves the white space and linefeeds, as shown in the results returned by the SELECT statement:

<bookstore>
   <book>Candide</book>
   <book>Pride and Prejudice</book>
</bookstore>

Another thing worth noting about converting data to the XML data type is that once you’ve converted the data, you can use the data type’s methods to retrieve data. For example, I modified the SELECT statement in the previous example to include the query() method:

As you can see, I pass in the second instance of the child element (/bookstore/book[2]) as an argument to the query() method. The SELECT statement now returns the following results.

<book>Pride and Prejudice</book>

You can, of course, use the query() method to perform more complex queries, and you can use other methods. The key is to save the string data to the XML type and then call the method.

Converting XML to String Data

Although SQL Server implicitly converts string and binary data to the XML data type, it doesn’t work the other way. All conversions from the XML type to string and binary types must be done explicitly using the CAST() or CONVERT() function. If you try to perform an implicit conversion, your statement will fail. For instance, the following example tries to convert the XML fragment to VARCHAR(100):

After I declare the @xml and @string variables, I assign the XML fragment to the @xml variable. I then try to set the @string variable to equal the @xml variable, but when I try to run these statements, SQL Server returns the following error:

Implicit conversion from data type xml to varchar is not allowed. Use the CONVERT function to run this query.

The problem is easy enough to fix. In the following example, I include the CAST() function to explicitly convert the XML value to VARCHAR(100):

Now the XML value is converted into a string with no problem. When I retrieve the value of @string, the SELECT statement returns the XML fragment, as shown in the following results:

<bookstore><book>Candide</book></bookstore>

I could have just as easily used the CONVERT() function to cast the XML value to VARCHAR:

As to be expected, the conversion occurs without a hitch and the XML fragment is assigned to the @string variable, which the SELECT statement confirms. Now suppose you insert linefeeds and tabs into your XML fragment, as I do in the following example:

Because the XML fragment contains these new elements, you might expect that you can simply add the third argument to the CONVERT() function to preserve the tabs and linefeeds. However, although I take this approach, the value saved to the @string variable does not preserve these elements, as shown in the results returned by the SELECT statement:

<bookstore><book>Candide</book><book>Pride and Prejudice</book></bookstore>

The problem is not with how I convert the @xml value, but with the way I assign the XML fragment to that variable. Notice that I assign the value simply by setting @xml to equal the XML fragment, enclosed in single quotes. What is essentially happening here is that SQL Server is implicitly converting a string value to the XML type, which means that the tabs and linefeeds are not being preserved during that assignment. The way to get around this is to use the CONVERT() function to explicitly cast the string to the XML type, as I do in the following example:

As you can see, I include a third argument in the CONVERT() function that specifies these elements be preserved. The SELECT statement returns the following results:

<bookstore>
   <book>Candide</book>
   <book>Pride and Prejudice</book>
</bookstore>

Now the string includes the tabs and linefeeds as they exist in the original XML fragment. Let’s take a closer look at this process so you can better understand what’s happening behind the scenes when you convert XML data.

XML Entitization

As the last example demonstrates, you can use the CONVERT() function to preserve characters that the XML parser normally ignores, such as spaces and tabs between elements. What this points to is that SQL Server handles certain components of an XML fragment-whether converting to or from the XML type-differently from the basic elements, attributes, and their values.

To better understand what’s happening, take a look at another example. The following set of Transact-SQL statements are similar to the last example, except that I’ve included a second SELECT statement to retrieve the @xml value and I’ve modified the second instance of the CONVERT() function so it doesn’t include the third argument (1):

As you might expect, the new SELECT statement returns the XML fragment with the linefeeds and tabs in place:

<bookstore>
   <book>Candide</book>
   <book>Pride and Prejudice</book>
</bookstore>

However, the second SELECT statement returns several unexpected values:

<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book> </bookstore>

NOTE: If I were to try to run this example without converting the original XML fragment, or converting the fragment without specifying the third argument in the CONVERT() function, the tabs and linefeeds would not be preserved, and the two SELECT statements would return the string without those tabs and linefeeds.

When the XML parser stores characters such as tabs and linefeeds within an XML document or fragment, it must do so in a way that preserves them throughout the parser’s normalization process, allowing these elements to be maintained whether being stored or retrieved or whether being converted to or from XML. Tabs and linefeeds are just two of the special characters that need to be preserved in this way, as the above example demonstrates, with tabs being saved as and linefeeds saved as . When you retrieve the XML value directly, these special characters are automatically displayed in a readable format.

The process of preserving certain characters with these special symbols is known as entitization. Unfortunately, the XML parser is not always consistent in entitizing characters, as demonstrated by the fact that only one linefeed is stored as &#x0A;. Another issue is the way the parser handles special characters that appear in an element or attribute value. In such cases, SQL Server simply returns an error, rather than trying to entitize them. For instance, in the following example I replace the word “and” in the second book title with an ampersand (&):

When I run this statement, I receive the following error:

XML parsing: line 3, character 15, illegal name character

The XML parser doesn’t like characters such as ampersands, left brackets (<), and right brackets (>) within element and attribute values. To store XML that contains these characters, you must manually entitize them by replacing the character with their symbols. In this case, I replace the ampersand with &amp;:

The parser can now parse the XML fragment. In doing so, it preserves the entitized ampersand, as shown in the results returned by the first SELECT statement:

<bookstore>
   <book>Candide</book>
   <book>Pride &amp; Prejudice</book>
</bookstore>

As you can see, the second book title contains the entitized ampersand. Now SQL Server has no problem handling the element’s value when that value is stored, retrieved, or converted. The entitized ampersand is also preserved when you convert the XML value to a string, which the second SELECT statement confirms:

<bookstore>
<book>Candide</book>
<book>Pride &amp; Prejudice</book> </bookstore>

Of course, chances are pretty good you won’t want your string value to show the entitized characters. As before, you can try to get rid of them by adding the third argument to the second CONVERT() function, as shown in the following Transact-SQL:

The second SELECT statement now returns the following results:

<bookstore>
   <book>Candide</book>
   <book>Pride &amp; Prejudice</book>
</bookstore>

Almost there. The entitized tab and linefeed characters are returned as regular characters, but the ampersand within the element value is still entitized. To address this issue, you have to specifically remove it from your string. For example, in the following Transact-SQL I use the REPLACE() function to replace the &amp; value with an ampersand:

For the first argument of the REPLACE() function, I pass in the converted @xml value, then specify the value to be replaced (&amp;) and the new value (&). Now the second SELECT statement returns the value we want:

<bookstore>
   <book>Candide</book>
   <book>Pride & Prejudice</book>
</bookstore>

As the results show, the string value no longer contains any entitized characters. As you work with XML data, you’ll get a better feel for when to manually entitize and de-entitize data. For more information about the entitization process in SQL Server, see the topic “Serialization of XML Data” in SQL Server Books Online.

XML Data Conversion

For the most part, converting data from the XML data type to another type or converting it from another type to the XML type is a fairly painless process. Not surprisingly, there are a few gotchyas along the way, entitization being one of them. For more details about converting XML data, see the topics “Generating XML Instances” and “CAST and CONVERT (Transact-SQL)” in SQL Server Books Online. Also worth checking out is how to incorporate the FOR XML clause in a SELECT statement to return results as XML. You can find details about that in my Simple-Talk article “Using the FOR XML Clause to Return Query Results as XML.” As you can see, SQL Server provides a number of ways to work with XML data. This article, along with the first three in the series, should have provided you with a good overview of the various considerations to take into account when working with XML.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

This post has been viewed 228487 times – thanks for reading.

Tags: , , , ,

  • Rate
    [Total: 81    Average: 4/5]
  • Share

Robert Sheldon

View all articles by Robert Sheldon

  • Anonymous

    XML and string!!
    Many thanks for this summary and All the Best for your endeavours …..

  • kalpanaceo

    XML Conversion services
    I like this post. this post very important. we can get lot of information thought this post and this site. thanks for giving these information, good luck…!!!!

    Informatics Outsourcing – XML Conversion services

  • bkt

    Converting XML via FOR XML to String with preserved entity formatting question
    Thanks Robert for a great article!

    I’m trying to use the above in conjunction with the FOR XML (path, elements) to then create a string which retains the line/tab and line feeds but can’t seem to get it to work.

    My example is:

    DECLARE @RetXML XML
    DECLARE @ConvertXML XML
    DECLARE @RetStr nvarchar(max)

    Select @RetXML = (

    SELECT Student.Id,
    Student.First_Name,
    Student.Last_Name,

    (
    SELECT
    SA.Address1 As Address1,
    SA.Address2 As Address2,
    SA.Address3 As Address3,
    SA.City As TownCity
    FROM Students_Addresses AS SA
    WHERE SA.Student_Id = Student.Id

    FOR XML PATH(‘Student_Addresses_Local’), TYPE)

    FROM Students AS Student
    WHERE Student.Id = 54000

    FOR XML PATH (‘Students’), ELEMENTS )

    SET @ConvertXML = CONVERT(XML,@RetXML,1)

    SET @RetStr = CONVERT(nvarchar(max),@ConvertXML,1)

    PRINT @RetStr

    but the @RetStr is unformatted:

    <Students><Id>54000</Id><First_Name>MyFirstName</First_Name><Last_Name>MyLastName</Last_Name><Student_Addresses_Local><Address1>123 My Street</Address1></Student_Addresses_Local></Students>

    Is there any way to retain the text formatting in this case?

    Thanks in advance for any help 🙂

    Brett

  • binki

    Entitization of ampersand
    You say “the ampersand within the element value is still entitized. To address this issue, you have to specifically remove it from your string. For example, in the following Transact-SQL I use the REPLACE() function to replace the &amp; value with an ampersand.”. Why would anyone ever want to do this? The correct way to extract a string from an XML document is *not* to first make the document invalid. Simply, if your data looks like “<book>Pride & Prejudice</book>”, it is not well-formed and, thus, is NOT even XML. You should just leave it looking like “<book>Pride &amp; Prejudice</book>” since that actually means something in XML.

    If you want to extract a string from an XML, use an XML parser in your application layer or, in TSQL, use something like the following which lets TSQL’s XML parser unserialize the node’s value for you: “SELECT CAST(‘<bookstore><book>Candide</book><book>Pride &amp; Prejudice</book></bookstore>’ AS XML).value(‘(/bookstore/book)[2]’, ‘VARCHAR(MAX)’)”. Never try to programmatically parse, regex, or rewrite XML yourself *unless* if you are actually the author of the XML parser. If TSQL’s XML parser is broken, everyone is in trouble. If you go out of your way to handle serialized XML directly in your own code, you’re only going to end up getting yourself in trouble.