Table Variables: an empirical approach.Published 27 October 2011 9:06 pm
It isn’t entirely a pleasant experience to publish an article only to have it described on Twitter as ‘Horrible’, and to have it criticized on the MVP forum. When this happened to me in the aftermath of publishing my article on Temporary tables recently, I was taken aback, because these critics were experts whose views I respect. What was my crime? It was, I think, to suggest that, despite the obvious quirks, it was best to use Table Variables as a first choice, and to use local Temporary Tables if you hit problems due to these quirks, or if you were doing complex joins using a large number of rows.
What are these quirks? Well, table variables have advantages if they are used sensibly, but this requires some awareness by the developer about the potential hazards and how to avoid them. You can be hit by a badly-performing join involving a table variable. Table Variables are a compromise, and this compromise doesn’t always work out well. Explicit indexes aren’t allowed on Table Variables, so one cannot use covering indexes or non-unique indexes. The query optimizer has to make assumptions about the data rather than using column distribution statistics when a table variable is involved in a join, because there aren’t any column-based distribution statistics on a table variable. It assumes a reasonably even distribution of data, and is likely to have little idea of the number of rows in the table variables that are involved in queries. However complex the heuristics that are used might be in determining the best way of executing a SQL query, and they most certainly are, the Query Optimizer is likely to fail occasionally with table variables, under certain circumstances, and produce a Query Execution Plan that is frightful. The experienced developer or DBA will be on the lookout for this sort of problem.
In this blog, I’ll be expanding on some of the tests I used when writing my article to illustrate the quirks, and include a subsequent example supplied by Kevin Boles.
A simplified example.
We’ll start out by illustrating a simple example that shows some of these characteristics. We’ll create two tables filled with random numbers and then see how many matches we get between the two tables. We’ll forget indexes altogether for this example, and use heaps. We’ll try the same Join with two table variables, two table variables with OPTION (RECOMPILE) in the JOIN clause, and with two temporary tables.
It is all a bit jerky because of the granularity of the timing that isn’t actually happening at the millisecond level (I used DATETIME). However, you’ll see that the table variable is outperforming the local temporary table up to 10,000 rows. Actually, even without a use of the OPTION (RECOMPILE) hint, it is doing well. What happens when your table size increases?
The table variable is, from around 30,000 rows, locked into a very bad execution plan unless you use OPTION (RECOMPILE) to provide the Query Analyser with a decent estimation of the size of the table. However, if it has the OPTION (RECOMPILE), then it is smokin’. Well, up to 120,000 rows, at least. It is performing better than a Temporary table, and in a good linear fashion.
What about mixed table joins, where you are joining a temporary table to a table variable? You’d probably expect that the query analyzer would throw up its hands and produce a bad execution plan as if it were a table variable. After all, it knows nothing about the statistics in one of the tables so how could it do any better? Well, it behaves as if it were doing a recompile. And an explicit recompile adds no value at all. (we just go up to 45000 rows since we know the bigger picture now)
Now, if you were new to this, you might be tempted to start drawing conclusions. Beware! We’re dealing with a very complex beast: the Query Optimizer. It can come up with surprises
What if we change the query very slightly to insert the results into a Table Variable? We change nothing else and just measure the execution time of the statement as before.
Suddenly, the table variable isn’t looking so much better, even taking into account the time involved in doing the table insert.
OK, if you haven’t used OPTION (RECOMPILE) then you’re toast. Otherwise, there isn’t much in it between the Table variable and the temporary table. The table variable is faster up to 8000 rows and then not much in it up to 100,000 rows. Past the 8000 row mark, we’ve lost the advantage of the table variable’s speed. Any general rule you may be formulating has just gone for a walk. What we can conclude from this experiment is that if you join two table variables, and can’t use constraints, you’re going to need that Option (RECOMPILE) hint.
Count Dracula and the Horror Join.
These tables of integers provide a rather unreal example, so let’s try a rather different example, and get stuck into some implicit indexing, by using constraints.
What unusual words are contained in the book ‘Dracula’ by Bram Stoker?
Here we get a table of all the common words in the English language (60,387 of them) and put them in a table. We put them in a Table Variable with the word as a primary key, a Table Variable Heap and a Table Variable with a primary key. We then take all the distinct words used in the book ‘Dracula’ (7,558 of them). We then create a table variable and insert into it all those uncommon words that are in ‘Dracula’. i.e. all the words in Dracula that aren’t matched in the list of common words. To do this we use a left outer join, where the right-hand value is null.
The results show a huge variation, between the sublime and the gorblimey.
- If both tables contain a Primary Key on the columns we join on, and both are Table Variables, it took 33 Ms.
- If one table contains a Primary Key, and the other is a heap, and both are Table Variables, it took 46 Ms.
- If both Table Variables use a unique constraint, then the query takes 36 Ms.
- If neither table contains a Primary Key and both are Table Variables, it took 116383 Ms. Yes, nearly two minutes!!
- If both tables contain a Primary Key, one is a Table Variables and the other is a temporary table, it took 113 Ms.
- If one table contains a Primary Key, and both are Temporary Tables, it took 56 Ms.
- If both tables are temporary tables and both have primary keys, it took 46 Ms.
Here we see table variables which are joined on their primary key again enjoying a slight performance advantage over temporary tables. Where both tables are table variables and both are heaps, the query suddenly takes nearly two minutes! So what if you have two heaps and you use option Recompile? If you take the rogue query and add the hint, then suddenly, the query drops its time down to 76 Ms. If you add unique indexes, then you’ve done even better, down to half that time. Here are the text execution plans.
So where have we got to? Without drilling down into the minutiae of the execution plans we can begin to create a hypothesis. If you are using table variables, and your tables are relatively small, they are faster than temporary tables, but as the number of rows increases you need to do one of two things: either you need to have a primary key on the column you are using to join on, or else you need to use option (RECOMPILE) If you try to execute a query that is a join, and both tables are table variable heaps, you are asking for trouble, well- slow queries, unless you give the table hint once the number of rows has risen past a point (30,000 in our first example, but this varies considerably according to context).
In describing the table-size, I used the term ‘relatively small’. Kevin Boles produced an interesting case where a single-row table variable produces a very poor execution plan when joined to a very, very skewed table. In the original, pasted into my article as a comment, a column consisted of 100000 rows in which the key column was one number (1) . To this was added eight rows with sequential numbers up to 9. When this was joined to a single-tow Table Variable with a key of 2 it produced a bad plan.
This problem is unlikely to occur in real usage, and the Query Optimiser team probably never set up a test for it. Actually, the skew can be slightly less extreme than Kevin made it. The following test showed that once the table had 54 sequential rows in the table, then it adopted exactly the same execution plan as for the temporary table and then all was well.
Undeniably, real data does occasionally cause problems to the performance of joins in Table Variables due to the extreme skew of the distribution. We’ve all experienced Perfectly Poisonous Table Variables in real live data. As in Kevin’s example, indexes merely make matters worse, and the OPTION (RECOMPILE) trick does nothing to help. In this case, there is no option but to use a temporary table. However, one has to note that once the slight de-skew had taken place, then the plans were identical across a huge range.
Where you need to hold intermediate results as part of a process, Table Variables offer a good alternative to temporary tables when used wisely. They can perform faster than a temporary table when the number of rows is not great. For some processing with huge tables, they can perform well when only a clustered index is required, and when the nature of the processing makes an index seek very effective. Table Variables are scoped to the batch or procedure and are unlikely to hang about in the TempDB when they are no longer required. They require no explicit cleanup. Where the number of rows in the table is moderate, you can even use them in joins as ‘Heaps’, unindexed. Beware, however, since, as the number of rows increase, joins on Table Variable heaps can easily become saddled by very poor execution plans, and this must be cured either by adding constraints (UNIQUE or PRIMARY KEY) or by adding the OPTION (RECOMPILE) hint if this is impossible. Occasionally, the way that the data is distributed prevents the efficient use of Table Variables, and this will require using a temporary table instead.
Tables Variables require some awareness by the developer about the potential hazards and how to avoid them. If you are not prepared to do any performance monitoring of your code or fine-tuning, and just want to pummel out stuff that ‘just runs’ without considering namby-pamby stuff such as indexes, then stick to Temporary tables. If you are likely to slosh about large numbers of rows in temporary tables without considering the niceties of processing just what is required and no more, then temporary tables provide a safer and less fragile means-to-an-end for you.