30 January 2012

What’s the Point of Using VARCHAR(n) Anymore?

The arrival of the (MAX) data types in SQL Server 2005 was one of the most popular features for the database developer. At the time, there was a lot of discussion as to whether this freedom from having to specify string length came at a cost. Rob attempts to give a final answer as to any downside.

When you are storing large strings, or any other data that you’d store as VARCHAR or VARBINARY, there are factors that affect how and where that data is stored inside SQL Server. This, in turn, affects the performance of inserts and updates.

Basics: Allocation Unit Types, Row and Page Size Limits

SQL Server’s unit of physical storage is the page. Each page has a fixed size of 8 KB (ref). Where possible, data is stored in rows within these pages. A row cannot overlap a page boundary. If the space allocated to a data type is fixed, or if the variable-sized data is small enough to allow the row to fit, the base allocation unit type, “IN_ROW_DATA”, is used. However, there are two ways of storing data that would otherwise overflow a page boundary.

  1. When a single row grows too large to be stored in-row, data can be offloaded to “ROW_OVERFLOW_DATA” pages.
  2. When a single column stores more than 8,000 bytes, or if the developer chooses to force this behavior, data is stored in “LOB_DATA” pages.

This becomes much clearer through examples.

Exercising Row and Page Size Limits

Single String Column with VARCHAR(n) and VARCHAR(MAX)

8,000 Characters

First, let’s build two tables and exercise some long strings to see how SQL Server handles both normal and large strings. Here, we will be exercising row size limits.
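The original scripts are in the linked downloads; what follows is a minimal sketch of the two tables and the 8,000-character inserts (the table and column names are assumptions chosen to match the result listings below):

    CREATE TABLE dbo.OneColVarcharN   (Col1 VARCHAR(8000) NULL);
    CREATE TABLE dbo.OneColVarcharMax (Col1 VARCHAR(MAX)  NULL);

    -- Insert one 8,000-character string into each table.
    INSERT INTO dbo.OneColVarcharN   (Col1) SELECT REPLICATE('x', 8000);
    INSERT INTO dbo.OneColVarcharMax (Col1) SELECT REPLICATE('x', 8000);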

After each insert, we will use versions of the following query to see into which type of page these records were loaded (see MSDN for details on the DMV):
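The exact query is in the downloads; this sketch of a call to sys.dm_db_index_physical_stats reproduces the columns shown in the results (the object name is a placeholder):

    SELECT OBJECT_NAME(ps.object_id)         AS TableName,
           ps.alloc_unit_type_desc           AS AllocUnitTp,
           ps.page_count                     AS PgCt,
           ps.avg_page_space_used_in_percent AS AvgPgSpcUsed,
           ps.record_count                   AS RcdCt,
           ps.min_record_size_in_bytes       AS MinRcdSz,
           ps.max_record_size_in_bytes       AS MaxRcdSz,
           ps.forwarded_record_count         AS FwdRcdCt
    FROM sys.dm_db_index_physical_stats
             (DB_ID(), OBJECT_ID('dbo.OneColVarcharN'), NULL, NULL, 'DETAILED') AS ps;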

The query results show that only “IN_ROW_DATA” pages were written for both tables, and the pages are 98.97% full.

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMax IN_ROW_DATA 1 98.97 1 8011 8011 0
OneColVarcharN IN_ROW_DATA 1 98.97 1 8011 8011 0

8,001 Characters

Clearly we can’t insert 8,001 characters into an 8,000-character column, but we can in the VARCHAR(MAX) column.
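One wrinkle worth noting when building the over-limit test string (an aside, not part of the original scripts): REPLICATE returns at most 8,000 characters unless its input is already VARCHAR(MAX), so the cast matters:

    -- Without the CAST, REPLICATE would silently truncate at 8,000 characters.
    INSERT INTO dbo.OneColVarcharMax (Col1)
    SELECT REPLICATE(CAST('x' AS VARCHAR(MAX)), 8001);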

The query results here show that both “IN_ROW_DATA” and “LOB_DATA” pages were written.

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMax IN_ROW_DATA 1 0.43 1 35 35 0
OneColVarcharMax LOB_DATA 1 99.02 1 8015 8015 NULL

Note that the “in row” record is very small. In this case, the only thing stored “in row” is the 24-byte pointer. The full 8,001-character string is moved to the LOB page.

So, by default, SQL Server stores both normal VARCHAR(n) columns and VARCHAR(MAX) columns “in row”. When a VARCHAR(MAX) value exceeds 8,000 characters, only the pointer is stored “in row”, and the string itself is stored in “LOB” pages.

Two String Columns with VARCHAR(n) and VARCHAR(MAX)

Let’s try this again but with two columns. This time, we will exercise page size limits.
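A sketch of the two-column variants (again, the real DDL is in the linked scripts; names are assumed):

    CREATE TABLE dbo.TwoColVarcharN   (Col1 VARCHAR(8000) NULL, Col2 VARCHAR(8000) NULL);
    CREATE TABLE dbo.TwoColVarcharMax (Col1 VARCHAR(MAX)  NULL, Col2 VARCHAR(MAX)  NULL);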

8,000 Characters

As expected, two 4,000-character strings fit fine in both tables.

The query results show that only “IN_ROW_DATA” pages are written for each table, and the pages are 99.00% full.

8,060 Characters

The maximum number of bytes per row is 8,060. Let’s see how this works.

It actually doesn’t require the full 8,060 characters to fill the row. 8,047 characters fit entirely in “in row” pages, but at 8,048 characters the data spills over: it writes to both “in row” pages and either “row overflow” pages (VARCHAR(n)) or “LOB” pages (VARCHAR(MAX)).

Results with 8,047 Characters (4,023 and 4,024 characters)
TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
TwoColVarcharMax IN_ROW_DATA 1 99.58 1 8060 8060 0
TwoColVarcharN IN_ROW_DATA 1 99.58 1 8060 8060 0
Results with 8,048 Characters (4,024 and 4,024 characters)
TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
TwoColVarcharMax IN_ROW_DATA 1 50.17 1 4061 4061 0
TwoColVarcharMax LOB_DATA 1 49.89 1 4038 4038 NULL
TwoColVarcharN IN_ROW_DATA 1 50.17 1 4061 4061 0
TwoColVarcharN ROW_OVERFLOW_DATA 1 49.98 1 4038 4038 NULL

Notice that at 8,047 characters, the record size is exactly 8,060 bytes, the maximum allowed. The extra 13 bytes are record overhead: the row header, null bitmap, and variable-length column offsets. As for why a record cannot use the full 8,192 bytes of the page: “Each data page contains a page header which stores page meta-data, such as database file identifier, current page number, previous and next page numbers, number of free bytes per page and so forth.” (ref)

200,000 Characters

For the VARCHAR(MAX) table, the results are basically the same even if the column width is pushed far past the 8,000-character limit. Here, we write two 100,000-character strings.

Results
TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
TwoColVarcharMax IN_ROW_DATA 1 0.75 1 61 61 0
TwoColVarcharMax LOB_DATA 27 91.89 28 228 8054 NULL

Here, we have a single page storing two pointers and 27 pages storing the large strings.

In-Row vs. Out-of-Row LOB Storage

For “large-value data types” like VARCHAR(MAX), SQL Server allows the data to be stored in-row (up to 8,000 bytes) or out-of-row (ref).

We will use two new tables, one in-row and one out-of-row:
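A sketch of the pair, assuming the same single-column shape as before:

    CREATE TABLE dbo.OneColVarcharMaxIn  (Col1 VARCHAR(MAX) NULL); -- default: in-row
    CREATE TABLE dbo.OneColVarcharMaxOut (Col1 VARCHAR(MAX) NULL); -- switched to out-of-row below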

To change a VARCHAR(MAX) column from the default (in row) to out-of-row, execute this statement:
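The option name below is the documented sp_tableoption setting; the table name matches the sketch above:

    EXEC sp_tableoption N'dbo.OneColVarcharMaxOut', 'large value types out of row', 1;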

1 Character

Writing a single-character string to each column in these tables creates an interesting result:

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 0.15 1 12 12 0
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 1 1.04 1 84 84 NULL

The in-row table created just one page. The out-of-row table created one page for the pointer and one page for the out-of-row string. Notice the byte-count difference: the in-row record takes 12 bytes, whereas the out-of-row version takes 111 bytes in total (27 + 84).

This illustrates two downsides of out-of-row storage when small records are written:

  1. Multiple pages are affected regardless of the size of the data being written.
  2. Extra storage space is required.

8,000 Characters

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 98.97 1 8011 8011 0
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 2 50.02 2 84 8014 NULL

Notice that the out-of-row table writes two pages instead of just one (“PgCt” column). It actually takes only 65 characters in the out-of-row table to cause it to create the second page.

8,001 Characters

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 0.43 1 35 35 0
OneColVarcharMaxIn LOB_DATA 1 99.02 1 8015 8015 NULL
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 2 98.43 2 84 8015 NULL

Once the byte count goes over 8,000, the in-row table writes to a LOB page.

1,000,000 Characters

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 0.43 1 35 35 0
OneColVarcharMaxIn LOB_DATA 126 98.42 126 2020 8054 NULL
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 126 98.43 127 84 8054 NULL

With such large strings, the results are very similar between default (in-row) and out-of-row settings.

Performance

So how do these various choices affect performance? Let’s code some inserts and record results with different settings and values.

INSERT – 0 to 8,000 Characters

First, we’ll test the performance of a process that inserts strings which fit in VARCHAR(n).

The basic idea here is to insert 10,000 rows of data and record the timing. I ran the tests on my desktop, which is not a server-class machine but is a very high-powered one. All tests were executed on SQL Server 2008 R2 and Windows 7 Enterprise.
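The linked scripts contain the exact harness; the basic pattern, sketched here with assumed table and variable names, is a timed loop of single-row inserts:

    DECLARE @i INT = 1;
    DECLARE @s VARCHAR(8000) = REPLICATE('x', 1000);   -- string length varies per test run
    DECLARE @start DATETIME2(7) = SYSDATETIME();

    WHILE @i <= 10000                                  -- 10,000 single-row inserts
    BEGIN
        INSERT INTO dbo.VarcharN (Col1) VALUES (@s);
        SET @i += 1;
    END;

    SELECT DATEDIFF(MICROSECOND, @start, SYSDATETIME()) AS ElapsedMcs;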

The results here show microseconds per insert for record lengths between 0 and 8,000 characters. The complete collection of test scripts is linked in the References section at the end of the article.

[Chart: microseconds per insert, by string length, 0 to 8,000 characters]

See detailed results in Table 1.

The results show that VARCHAR(n) is slightly faster than VARCHAR(MAX) (in-row) only for 0-character and 1,000-character strings. At 2,000 characters and above, VARCHAR(MAX) (in-row) is fastest. This result was not expected.

Intuitively, we expect that VARCHAR(n) should be faster than VARCHAR(MAX) (in-row), and in some cases it is. What I think this shows is that, if there is a performance difference, it is small enough that we can’t measure it consistently.

INSERT – Greater Than 8,000 Characters

These tests are limited to VARCHAR(MAX) but compare in-row to out-of-row.

The results here show microseconds per insert for record lengths between 9,000 and 30,000 characters.

[Chart: microseconds per insert, by string length, 9,000 to 30,000 characters]

See detailed results in Table 2.

UPDATE – Column Limit

When a VARCHAR(MAX) in-row column grows past the 8,000-character limit, the data is moved to a LOB_DATA page. The linked script Allocation Units – UPDATE.sql tests this by updating the value from 8,000 to 8,001 and back to 8,000 characters, illustrating the move of data from an IN_ROW_DATA page to a LOB_DATA page and back again for the in-row table.
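The heart of that test, sketched with the single-column in-row table from earlier:

    -- Push the value over the in-row limit (moves it to a LOB_DATA page) ...
    UPDATE dbo.OneColVarcharMaxIn SET Col1 = REPLICATE(CAST('x' AS VARCHAR(MAX)), 8001);
    -- ... then pull it back (moves it back to an IN_ROW_DATA page).
    UPDATE dbo.OneColVarcharMaxIn SET Col1 = REPLICATE('x', 8000);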

These operations can be expensive, but how expensive? Make a quick mental prediction before reading on.

The linked script PerfTest – UPDATE.sql tests both the 8,000-to-8,001 scenario and the 8,001-to-8,000 scenario. The results were actually quite surprising. I kept checking the code to see if I had done something wrong because the results didn’t match my expectations at all.

UPDATE Test Results

In/Out      From   To     ElapsedMcs  Penalty vs Out-of-Row
In-Row      8,000  8,001  34,238,839  82%
Out-of-Row  8,000  8,001  18,804,768
In-Row      8,001  8,000   9,969,989  -47%
Out-of-Row  8,001  8,000  18,804,768

As expected, the 8,000-to-8,001 update takes longer for in-row than for out-of-row. The surprise is that the 8,001-to-8,000 update is faster for in-row. Based on these results, there must be a shortcut employed inside SQL Server that optimizes this move of data from a LOB_DATA page to an IN_ROW_DATA page.

UNICODE

All of these tests were conducted using non-UNICODE strings. Without compression, all the byte counts have to be doubled for UNICODE strings.
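A quick illustration of the doubling, using DATALENGTH (which returns bytes, not characters):

    SELECT DATALENGTH(CAST('abc' AS VARCHAR(10)))  AS VarcharBytes,   -- 3
           DATALENGTH(CAST('abc' AS NVARCHAR(10))) AS NVarcharBytes;  -- 6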

Recommendations

Don’t Use VARCHAR(MAX) Everywhere

One design-simplification technique that has been suggested is to use VARCHAR(MAX) for every string column. Though the tests here show that using VARCHAR(MAX) (in-row) instead of VARCHAR(n) is not a performance problem for strings under 8,000 characters, there are a number of reasons to limit the length of strings in your database.

UI Issues

Anyone with experience creating an application UI knows that handling long strings is difficult. When laying out the UI for an application, it is important to know the expected and maximum length of strings that will be displayed. Leaving this open to whatever someone writes to the database makes designing, building, and testing an application very difficult.

Performance – Writing or Updating Long Strings

From the performance results, you can see that writing long strings definitely increases write times. Writing a record of any size (even zero characters) takes time, but longer strings take longer: using VARCHAR(n), writing 1,000 characters takes an average of 217 microseconds per insert, while writing 8,000 characters takes an average of 448 microseconds.

Compression

What I believe is the most compelling reason to avoid over-8,000-character strings is compression: data stored on LOB pages never gets compressed.

“When considering data compression one very key thing to remember is that out of row data (LOB data) isn’t compressed. If the LOB data is stored in row, then it will be compressed, but only when you compress the table with PAGE level compression. If however you use ROW level compression then the LOB data will never be compressed, no matter if it is stored in row or if it is stored out of row.” (ref)
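So if compression matters, a large value needs to stay in-row and the table needs PAGE compression. A sketch of the rebuild (SQL Server 2008 and later syntax; table name assumed):

    ALTER TABLE dbo.OneColVarcharMaxIn REBUILD WITH (DATA_COMPRESSION = PAGE);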

Conclusions

  • Where appropriate, use VARCHAR(n) over VARCHAR(MAX)
    • for reasons of good design if not performance benefits, and
    • because VARCHAR(MAX) data stored on LOB pages does not compress
  • Storing large strings takes longer than storing small strings.
  • Updating an in-row VARCHAR(MAX) value from below 8,000 to over 8,000 will be relatively slow, but the difference for a single transaction will likely not be measurable.
  • Updating an in-row VARCHAR(MAX) value from over 8,000 to below 8,000 will be faster than if the table is set to store data out-of-row.
  • Using the out-of-row option for VARCHAR(MAX) will cause slower writes until the strings are very long.

All the scripts used here are attached, so you are encouraged to retest the results.

References

Row-Overflow Data Exceeding 8 KB (MSDN): http://msdn.microsoft.com/en-us/library/ms186981.aspx
Table and Record Structure (SQLServerPedia): http://sqlserverpedia.com/wiki/Table_and_Record_Structure

Appendix

Table 1

TableName      MaxRecords  StrLen  ElapsedMcs
VarcharMaxIn   10,000      0       1,341,626
VarcharMaxOut  10,000      0       1,950,038
VarcharN       10,000      0       1,216,823
VarcharMaxIn   10,000      1,000   2,246,443
VarcharMaxOut  10,000      1,000   2,511,649
VarcharN       10,000      1,000   2,168,441
VarcharMaxIn   10,000      2,000   2,308,844
VarcharMaxOut  10,000      2,000   3,104,460
VarcharN       10,000      2,000   2,823,654
VarcharMaxIn   10,000      3,000   2,667,651
VarcharMaxOut  10,000      3,000   3,057,659
VarcharN       10,000      3,000   2,698,852
VarcharMaxIn   10,000      4,000   2,839,255
VarcharMaxOut  10,000      4,000   3,307,264
VarcharN       10,000      4,000   3,010,858
VarcharMaxIn   10,000      5,000   3,229,262
VarcharMaxOut  10,000      5,000   3,525,668
VarcharN       10,000      5,000   3,603,669
VarcharMaxIn   10,000      6,000   3,416,466
VarcharMaxOut  10,000      6,000   4,056,078
VarcharN       10,000      6,000   3,712,871
VarcharMaxIn   10,000      7,000   3,728,472
VarcharMaxOut  10,000      7,000   4,602,088
VarcharN       10,000      7,000   4,056,078
VarcharMaxIn   10,000      8,000   4,321,284
VarcharMaxOut  10,000      8,000   4,945,295
VarcharN       10,000      8,000   4,477,286

Table 2

TableName      MaxRecords  StrLen  ElapsedMcs
VarcharMaxIn   10,000      9,000   5,670,605
VarcharMaxOut  10,000      9,000   6,326,708
VarcharMaxIn   10,000      10,000  5,295,689
VarcharMaxOut  10,000      10,000  5,280,067
VarcharMaxIn   10,000      11,000  5,717,469
VarcharMaxOut  10,000      11,000  5,123,852
VarcharMaxIn   10,000      12,000  6,014,277
VarcharMaxOut  10,000      12,000  5,623,740
VarcharMaxIn   10,000      13,000  6,514,166
VarcharMaxOut  10,000      13,000  6,686,002
VarcharMaxIn   10,000      14,000  6,576,651
VarcharMaxOut  10,000      14,000  6,873,460
VarcharMaxIn   10,000      15,000  6,764,110
VarcharMaxOut  10,000      15,000  6,748,488
VarcharMaxIn   10,000      16,000  6,639,137
VarcharMaxOut  10,000      16,000  7,154,647
VarcharMaxIn   10,000      17,000  7,279,619
VarcharMaxOut  10,000      17,000  7,420,213
VarcharMaxIn   10,000      18,000  7,435,834
VarcharMaxOut  10,000      18,000  7,089,147
VarcharMaxIn   10,000      19,000  7,804,550
VarcharMaxOut  10,000      19,000  7,070,923
VarcharMaxIn   10,000      20,000  7,336,277
VarcharMaxOut  10,000      20,000  8,023,077
VarcharMaxIn   10,000      21,000  8,538,178
VarcharMaxOut  10,000      21,000  7,976,250
VarcharMaxIn   10,000      22,000  7,679,677
VarcharMaxOut  10,000      22,000  8,007,468
VarcharMaxIn   10,000      23,000  7,617,241
VarcharMaxOut  10,000      23,000  7,788,941
VarcharMaxIn   10,000      24,000  7,835,768
VarcharMaxOut  10,000      24,000  8,194,778
VarcharMaxIn   10,000      25,000  8,475,741
VarcharMaxOut  10,000      25,000  8,506,960
VarcharMaxIn   10,000      26,000  8,819,141
VarcharMaxOut  10,000      26,000  8,382,087
VarcharMaxIn   10,000      27,000  9,084,496
VarcharMaxOut  10,000      27,000  8,881,578
VarcharMaxIn   10,000      28,000  8,975,233
VarcharMaxOut  10,000      28,000  9,053,278
VarcharMaxIn   10,000      29,000  9,505,942
VarcharMaxOut  10,000      29,000  8,944,014
VarcharMaxIn   10,000      30,000  9,100,105
VarcharMaxOut  10,000      30,000  9,022,060

Rob Garrison is a Data Architect for Nike in Beaverton, Oregon. He has over twenty years of IT experience, has had a number of articles published, and speaks regularly at user groups and other events. His immediate family includes his wife, Karen, and his son Tanner. Outside of work, much of their time is spent in activities and service projects involving their church (http://pfcn.org).

  • GilaMonster

    Not the entire picture
    There are two things that have been missed here.

    1) Online index rebuilds

    If you have a varchar(max) in the table, you cannot (in SQL 2005, 2008 or 2008 R2) rebuild the index online. It has to be an offline operation.

    For a 24x7x365 system, this is a major reason to avoid varchar/nvarchar/varbinary(max) or XML as much as possible (and where it is necessary, maybe even to consider vertically partitioning the table to move the LOB column to another table)

    2) Moving LOB data

    Moving tables or indexes to another filegroup is easy. Just rebuild the clustered/nonclustered index on the target filegroup. Well, when there’s no LOB data, that is. If there is LOB data (stored out of row), it stays behind on the original filegroup, and the only way to move it is to create a new table with TEXTIMAGE_ON pointing at the desired filegroup, copy the rows across, delete the old table, and then add all the indexes, constraints, etc. to the new table.

    Sure, moving data to a new filegroup isn’t something you do every day, but when it does need doing, that decision to use VARCHAR(max) is likely to be a regretted one

  • Jesse

    another drawback
    another drawback of varchar(max) is that it can’t be part of an index key.

    This is rather important for things like phonebooks where you look up things by people’s names. 🙂

  • Neil Burnett

    Varchar(max) or 8000?
    I was confused by your recommendations. The article’s thrust is using max instead of n, but then the recommendations are about using 8000+ or <8000 char string data. If your data is 8000+ then you have no choice.

  • Neil Burnett

    Rebuild index online
    @gilamonster – I didn’t know that you can’t rebuild an index online if it contains a varchar(max). For me, that is sufficient reason to use varchar(n) wherever possible.

    Btw I can’t imagine why one would ever use a column with >8000 chars in an index. Any suggestions?

  • GilaMonster

    Rebuild online
    Well the clustered index contains every single column in the table, so if you have a table with a LOB column and the table has a clustered index, then that clustered index has the LOB column as part of it and it cannot be rebuilt online

    Nonclustered indexes there are very few reasons why you would include a LOB column (and they certainly can’t be part of the index key).

    That said, when it comes to rebuilding online, the clustered index is usually the one you most want to rebuild online.

  • Phil Factor

    LOBs
    Do XML columns suffer from the same drawbacks?

  • GilaMonster

    XML columns
    As in clustered index can’t be rebuilt online and the LOB pages are hard to move?

    Yes.

    Msg 2725, Level 16, State 2, Line 1
    An online operation cannot be performed for index ‘PK__testing’ because the index contains column ‘Blob’ of data type text, ntext, image, varchar(max), nvarchar(max), varbinary(max), xml, or large CLR type.

    The limitation around online index rebuilds has been removed in SQL 2012, but that still leaves those on 2005, 2008 and 2008 R2 the problem to deal with.

  • Karen

    Typo?
    In the “Single String Column with VARCHAR(n) and VARCHAR (MAX)” > “8,001 Characters” section, you have OneColVarcharN as the TableName for the second row in the table. Should this be OneColVarcharMax?

  • RowlandG

    Said it best
    I was going to answer along the lines of Gila Monster but s/he beat me to it!

    Anyone relying on 24×7 index rebuilds should hate the VARCHAR(MAX) approach. Nasty! Ick! Spit it out!

  • twilson

    What about data modeling fundamentals
    Before one even gets to thinking about the performance issues of using varchar(max) everywhere – what about the fundamentals of data modeling? If an attribute is has a maximum length in the real world – say, 11 characters for an SSN with embedded dashes – then making it be char(10) or varchar(11) in the database eliminates an entire category of invalid data, and documents to a consumer of the data that this column has a definite maximum length.

  • RGarrison

    Typo
    Karen, good catch. We’ll get that updated.

  • RGarrison

    Responding to Neil
    Neil, you wrote, “The article’s thrust is using max instead of n, …”

    I would say that the article’s thrust was exploration. Given the various choices available, how do those choices affect performance and storage? Even with MAX, you still have the choice of in-row versus out-of-row.

    As you point out, three of the four recommendations are only valid if you have an option of using VARCHAR(n). The other, regarding INSERT/UPDATE performance, simply states what is intuitively obvious: writing or updating larger strings takes more time. But now you have a graph and real results that you can reference.

  • MidBar

    8060 Character
    As per Microsoft
    http://msdn.microsoft.com/en-us/library/ms190969.aspx

    Page size is 8192 bytes and header is 96 bytes
    so remaining is 8096

    1: so what is this 8060 and 8047 or 8048 limitation?

    Please clarify or also include in the initial description that

    Page size limitation, i.e. 8,192
    Record size limitation, i.e. 8,060
    Column size limitation for varchar, i.e. 8,000 even if it is varchar(max)

    thanks.

  • RGarrison

    Responding to MidBar
    There are two references in the article that relate to page size:

    Row-Overflow Data Exceeding 8 KB
    http://msdn.microsoft.com/en-us/library/ms186981.aspx

    Table and Record Structure
    http://sqlserverpedia.com/wiki/Table_and_Record_Structure

    What is the 8060 limitation? From the MSDN reference: “A table can contain a maximum of 8,060 bytes per row.”

    Kevin adds details in the SQLServerPedia article.

    The exact number of bytes that can be written was found through experimentation, not calculation.

    Rob

  • srutzky

    RE: UPDATE Performance Difference with IN/OUT of Row
    Hi Rob. I was testing the “PerfTest – UPDATE.sql” script in the In-Row vs. Out-of-Row LOB Storage | UPDATE | Performance section and noticed a minor typo. When displaying the results at the end, you have the @etOut8001 variable used for both out-of-row rows when the @etOut8000 variable should be used for the last row: ‘out-of-row, 8000’.

    After making this change I noticed that the time difference for the out-of-row set varied as to which direction was faster but the in-row set was always faster going to 8000.

    My theory is that the out-of-row set fluctuates due to both directions most likely being the same basic operation in that the same LOB pages are being overwritten in both cases. But rather than there being a short-cut for going from LOB to IN_ROW pages, couldn’t it be that going from IN_ROW to LOB pages is two steps (write the LOB page and then write the pointer to the LOB page in the IN_ROW page) whereas going from LOB to IN_ROW pages is a single step (write the IN_ROW page)?

    Take care,
    Solomon…

  • phorstink

    SSIS
    I have had bad experiences with varchar(MAX) when using it in SSIS. By default the data will be typed as DT_TEXT, which introduces all kinds of problems, like not being able to use it in a lookup.
    That means we have to ‘substring’ the data, but how many characters? This goes back to the design consideration mentioned before…

  • Dave.Poole

    Indexing VARCHAR columns
    As the maximum size of a VARCHAR column that can participate in an index is 900 bytes (see Index Key size in http://msdn.microsoft.com/en-us/library/ms188783.aspx) the actual ability of a VARCHAR(n) to store more than that isn’t relevant from an indexing perspective.

    Incidentally if you are looking for an exact match in a large VARCHAR column then putting an index on a computed column that holds the CHECKSUM of that VARCHAR column is a useful technique.

    WHERE YourCheckSumColumn = @CheckSumValue AND YourGreatBigUnindexedString = @YourGreatBigUnindexedStringValue

    The CHECKSUM value is often selective enough to get an index seek so for the cost of a 4 byte integer plus an index on it you gain from not having to have a great big index on your large string.

    It’s probably preaching to the converted but if your database can have dodgy data in it then it will have dodgy data in it. I’ve just profiled data where the UK postcode field is a VARCHAR(20) and even though there are no UK postcodes wider than 8 characters there are all sorts of strings longer than that in the field. And this is from something that is allegedly populated from a PAF system!
    Sticking VARCHAR(MAX) everywhere is committing a bizarre form of data suicide worthy of an entry in the Darwin awards.

  • TheSQLGuru

    Insert Test Code?
    How did you do the insert tests? Were they singular inserts one row at a time? If so the implicit transactions and/or log buffer flushes would likely overwhelm any other performance issues.

  • Anonymous

    Re: LOB and Compression – NVARCHAR
    I am surprised that no one mentions NVARCHAR and NVARCHAR(max) as there are really interesting and difficult issues with index size, data size, and performance. This is the only solution for UTF characters, and UTF 8 remains unsupported in favor of UCS-2 and compression. Might be good as a follow-up article.

  • Brian Feifarek

    Re: Indexing VARCHAR columns
    Regarding Dave.Poole’s suggesting to use a CHECKSUM value to look for exact matches in large varchar columns, my exploration of CHECKSUM in SQL Server 2000 found it to not be very useful under the ‘real world’ data values I was working with. I found very many identical check sums for very different data values in the small data set I was testing (unfortunately I do not recall details, but it was not remotely helpful for flagging identical matches)

  • Anonymous

    Well Done
    great article

  • Jeff Moden

    There IS a performance hit with VARCHAR(MAX)
    Actually, there IS a performance hit with VARCHAR(MAX) for many things including doing delimited splits and the like. Even if the MAX data is <8000 characters, the code will run at least twice as slow because SQL Server really doesn’t like joining to the MAX datatypes even if it’s just a Tally table or Tally cte.

  • Martin Smith

    Compression
    The section on compression seems somewhat confusing. To be clear row compression has no effect on varchar(x) stored either in row or off row either (see http://msdn.microsoft.com/en-us/library/cc280576.aspx). It is true that if the strings are greater than 8,000 characters they will necessarily be stored off row but I’m not sure what solution you propose to that. Split them up into multiple rows?

  • lewisar

    Performance issue with 0 rows?
    I came across this discussion as a result of searching for anyone experiencing this bizarre and unexpected behaviour… and I’m just wondering whether anyone here has come across this?

    I am inserting using a statement of the form

    INSERT INTO A columnlist SELECT column list FROM B JOIN C WHERE B.X NOT IN (SELECT X FROM A)

    So just inserting into a table only if the data doesn’t already exist.

    Basically it worked fine until the situation where the SELECT returned 0 rows – then it would take up to a minute to complete, even with no other activity on the whole db server.

    I took all the indexes off table A and it was still the same. So I started inserting data one column at a time, then 2 columns and so on, and it was fine until it hit the first VARCHAR(MAX) column – and suddenly it took a minute to insert ZERO rows!

    I have now change all VARCHAR(MAX) columns to VARCHAR(N) and the problem has gone away!

    If it was actually inserting data it was fine performance-wise, it was only when inserting 0 rows it ground to a halt, and it was definitely the VARCHAR(MAX) columns which killed it.

    Any ideas anyone?

    Ade.

  • Daniel_Adeniji

    Nice one
    Thanks for sharing!