Click here to monitor SSC

Tony Davis

Simple-Talk Editor
News, views and good brews

Inappropriate Updates?

Published Thursday, February 18, 2010 5:10 PM

A recent Simple-talk article by Kathi Kellenberger dissected the fastest SQL solution, submitted by Peter Larsson as part of Phil Factor's SQL Speed Phreak challenge, to the classic "running total" problem. In its analysis of the code, the article re-ignited a heated debate regarding the techniques that should, and should not, be deemed acceptable in your search for fast SQL code.

Peter's code for running total calculation uses a variation of a somewhat contentious technique, sometimes referred to as a "quirky update":

SET @Subscribers = Subscribers = @Subscribers + PeopleJoined - PeopleLeft

This form of the UPDATE statement, @variable = column = expression, is documented and it allows you to set a variable to the value returned by the expression. Microsoft does not guarantee the order in which rows are updated in this technique because, in relational theory, a table doesn’t have a natural order to its rows and the UPDATE statement has no means of specifying the order.

Traditionally, in cases where a specific order is requires, such as for running aggregate calculations, programmers who used the technique have relied on the fact that the UPDATE statement, without the WHERE clause, is executed in the order imposed by the clustered index, or in heap order, if there isn’t one. Peter wasn’t satisfied with this, and so used the ingenious device of assuring the order of the UPDATE by the use of an "ordered CTE", based on an underlying temporary staging table (a heap). However, in either case, the ordering is still not guaranteed and, in addition, would be broken under conditions of parallelism, or partitioning.

Many argue, with validity, that this reliance on a given order where none can ever be guaranteed is an abuse of basic relational principles, and so is a bad practice; perhaps even irresponsible. More importantly, Microsoft doesn't wish to support the technique and offers no guarantee that it will always work. If you put it into production and it breaks in a later version, you can't file a bug. As such, many believe that the technique should never be tolerated in a production system, under any circumstances.

Is this attitude justified? After all, both forms of the technique, using a clustered index to guarantee the order or using an ordered CTE, have been tested rigorously and are proven to be robust; although not guaranteed by Microsoft, the ordering is reliable, provided none of the conditions that are known to break it are violated. In Peter's particular case, the technique is being applied to a temporary table, where the developer has full control of the data ordering, and indexing, and knows that the table will never be subject to parallelism or partitioning. It might be argued that, in such circumstances, the technique is not really "quirky" at all and to ban it from your systems would server no real purpose other than to deprive yourself of a reliable technique that has uses that extend well beyond the running total calculations. Of course, it is doubly important that such a technique, including its unsupported status and the assumptions that underpin its success, is fully and clearly documented, preferably even when posting it online in a competition or forum post.

Ultimately, however, this technique has been available to programmers throughout the time Sybase and SQL Server has existed, and so cannot be lightly cast aside, even if one sympathises with Microsoft for the awkwardness of maintaining an archaic way of doing updates. After all, a Table hint could easily be devised that, if specified in the WITH (<Table_Hint_Limited>) clause, could be used to request the database engine to do the update in the conventional order. Then perhaps everyone would be satisfied.

Cheers,

Tony.

Comments

 

Damon said:

I'm of the opinion that pragmatism wins out over theory, especially when the pragmatic solution accounts for the assumptions of why an approach is "bad" in theory.  In this case, the approach is theoretically bad because the ordering is not guaranteed.  In reality, however, the ordering is always the same. Should we dismiss reality?  Sounds like the theory needs to be updated.
February 18, 2010 3:26 PM
 

timothyawiseman@gmail.com said:

The technique, when used with full knowledge of the conditions it works under, does reliably work and has for many editions in the past and it does bring a performance update.  It is a perfectly valid technique.

However, it has one massive problem that you did not mention in your article: maintainability.  Many SQL developers do not know about the technique or understand what conditions it is or is not reliable under.  So if a T-SQL master writes code using it, a more junior program may not understand it when they come along later to maintain it.  This will likely cost developer-time while that junior developer looks it up and figures it out, and in the current era even junior developer time is often (often, not always) much more expensive than the tiny sliver of processor time that this saves over more understandable solutions.

Worse, that second developer that comes along may misunderstand it and in the process of making tweaks create a situations where it does return the wrong results entirely, and that can lead to very bad results if decisions are actually made based on that data.  

In short, it is a perfectly valid technique and I would be comfortable using it in production code, but it is a technique to be used sparingly and only where that tiny performance boost truly justifies the increased maintenance cost.
February 18, 2010 4:11 PM
 

BuggyFunBunny said:

>>  that tiny performance boost

Hmmm.  Peter's approach was a tad more than that.  

So far as the "junior developer" argument goes, if the titles were "surgeon" and "surgery resident", the idea that the "surgeon" has to perform so as not to tax the "surgery resident" would be laughed at, loudly.  It's the surgeon's task to inculcate the resident; ours to beat the snot out of coders.  (Too strong??)  We need to stop with the "lowest common denominator" approach, particularly where that means building applications as if SQL and the database were java/C#/COBOL and files.  RDBMS have value, just because, they aren't like language iterators over file data.  It may be a coincidence (I suspect, not) that there just appeared another Celko posting on the evils of procedural coding in the database.  What we as database geeks have to offer is bullet proof transactional data which the coders can't mess up.  Iterating in SQL (and its SP dialects) plays to our weakness, not our strength.  Pandering to them won't get us anywhere.
February 19, 2010 1:30 PM
 

Celko said:

Some of us older SQL Server guys will remember when the GROUP BY was done with a sort so we could skip an ORDER BY. That does not work any more.  
Or not putting semi-colons after statements? The advantage of sticking to Standard SQL is that all vendors aim for it and already have most of it.  How about when BIT was changed to a numeric data type and could suddenly be NULL? I made a lot of money off of code that had been written by guys who think that "pragmatism wins out over theory" and not that standards win over dialect in the long run.




February 19, 2010 5:44 PM
 

timothyawiseman@gmail.com said:

BuggyFunBunny, you make some good points, but your surgeon analogy is rather strained in this case.  The surgeon in a teaching hospital is often explicitly expected to train the resident, and the resident is there specifically to learn.  While it does vary from place to place, a JR Programmer is normally expected to actually produce useful work with learning mostly on their personal time and a DBA often has no obligation to train the developers.  Again, that does vary from place to place and perhaps that is not the way it should be, but in many places that is the way it is.

Another way it is strained is that you rarely hear about surgeons maintaining much less modifying and upgrading a previous surgeon's work through further surgery.  Again, there are exceptions, but they are rare, normally explicitly called experimental, and well documented.  In software you are frequently trying to do something somewhat novel, maintenance and upgrades involving radical changes to the code are positively expected.  Further, very often the original programmer is long gone, leaving behind little documentation besides the code itself.

I am in no way saying that we should cater to the lowest common denominator.  What I am saying is that there is a great value in maintainable programming and that if you actually care about the long term use of that code you will both assume that you will not be the one maintaining that code and that the next person after you may be less experienced (maintenance coders are very frequently more junior than the people who did the original development).  Because of that, all production code should be refactored for readability, and standards should not be ignored lightly.  In some cases it is indeed worth focusing on understandability and compliance to standards over efficiency.  When a clever solution is used for efficiency's sake, it should be very well commented to describe how and why it works and how it might be readily broken.
February 19, 2010 6:35 PM
 

BuggyFunBunny said:

Tim, I generally agree about shop standards and the like.  Where we disagree is on the meaning of those standards, I suspect.  One can object to Peter's solution because it is too advanced, in the sense that it is not procedural or diverges from common procedural practice in stored procedures, for later maintenance developers who will not understand it.  My point is that Peter's solution is not too advanced for a shop intent on using SQL Server (or any industrial strength database) as Dr. Codd intended, as a set processor.  Not that Dr. Codd defined the CTE syntax, of course.  Progress occurs when the deficient is replaced with the superior, doing so in RDBMS applications is to use them in ways that play to their strengths.
February 19, 2010 9:12 PM
 

Damon said:

Life is all about trade-offs, and development is no different.  If you are trying to eek out every last ounce of performance, then maintenance will suffer.  In this case, the code was all about performance so you cannot fault the maintainability.  While developing, I'm of the opinion that you always code for maintainability before performance.  If you need additional performance, you go back and tweak.  If you've been given performance metrics that have to be met, and to do so requires less maintainable code, then you have to do what you have to do to meet requirements.  You just need to let people know what the trade-off is in the long run.  

Celko: Instead of "pragmatism" I probably should have said "rational pragmatism" or something of the like to avoid looking like I subscribe to the "whatever works" mentality.  If you have developers who are breaking rules because it's easy and it works, then you're going to invite misery. But in limited scenarios, where the requirement dictate that it is necessary (and you document it), I have no problem "breaking the rules" to achieve goals.
February 20, 2010 6:25 PM
 

AlexFekken said:

I expect and hope that Microsoft will do what it can to make the make the next version of SQL Server faster. I.e. I expect that a next version of SQL Server (hopefully sooner rather than later) will be able to automatically parallelize a SET statement like this to improve performance. Obviously that will change the "sort order" and so I see it as a good sign that I can't rely on the order being fixed and a bad sign if it is fixed in practice (because this implies insufficient optimization).

And before the anti-cursor police breaks down my door for what I will say next: the statement in question obviously uses a cursor in the background, perhaps one that is stripped down to its bare essentials, but a cursor nevertheless (correct me if I am wrong). I would even argue that this particular syntax (like other, more recent features) was originally introduced to create the impression that one can do without cursors by hiding them behind special syntax.  

So if I want to do this sort of thing with a given sort order then I will probably use a cursor to achieve that. It takes a bit more code but it will give me a lot more control and I don't have to compromise the next version SQL Server for it. I might incur a performance hit but that's not because I am using a cursor (because every other solution will need to use a cursor as well, even if you don't see it) but because the RDBMS vendor hasn't done a good enough job optimizing. The problem in question is not a relational one so why pretend that the solution has to be either relational (i.e. set-based) or fake-relational (e.g. SET-based)?

In my opinion cursors are a useful and powerful language construct for problems that are not relational and RDBMS vendors should spend more time optimizing them (performance seems to be the only tenable argument against them) instead of going out of their way to hide them behind optimized fake-relational syntax to satisfy the anti-cursor police.
February 22, 2010 4:29 PM
 

BuggyFunBunny said:

@Alex:
In my opinion cursors are a useful and powerful language construct for problems that are not relational and RDBMS vendors should spend more time optimizing them (performance seems to be the only tenable argument against them) instead of going out of their way to hide them behind optimized fake-relational syntax to satisfy the anti-cursor police.

I'd pay real money to watch a match twixt you and Celko.  In a steel cage, of course; only one man exits.
February 23, 2010 2:44 PM
 

timothyawiseman@gmail.com said:

Alex you make some good points, though I do not agree with all of it.  

If you have not tried Oracle, it is worth noting that their cursors are more highly optimized than SQL Servers are and there is not nearly so much of a performance penalty for using them.  

However, there are reasons beyond performance for not using cursors.  While I acknolwedge this is at least partially a matter of style, I find code with cursors generally is longer, more verbose and harder to read than the same code written without it, and readability contributes to maintainablity and that is very important in production code.  Also, set based logic stays closer to the relational model.

In terms of performance, I mentioned that Oracle does better with cursors than SQL Server does, but I think in the long run set based code will tend to be faster.  The trend is towards more cores, more processors, and more parallelization.  It is easier to parallelize code that is set based than it is to parallelize a cursor, which is inherently serial and sequential.

To be clear, I simply do not know enough about the internals of SQL Server to say whether this particular solution uses a "hidden cursor", but in general there are both performance and non-performance reasons to use cursors only as a last resort.
February 23, 2010 5:37 PM
 

AlexFekken said:

@BuggyFunBunny

I am happy to hear any counter arguments to what I wrote without getting into a cage. But I am willing to go into a cage to fight dogma for the sake of dogma any time :-)

@timothyawiseman@gmail.com:

Most of your arguments assume that a set-based solution is possible and for cases where it is I agree with you. But I was talking about situations that do not have a set-based solution. If you really want to stay close to the relational model and use set-based logic all the time then you shouldn't use ORDER BY (that's the reason why it isn't allowed in views except through certain non-relational language hacks).
And although I have no hard proof of this because I don't know enough about the internals of the engine either, I think it is obvious that any task that requires (based on the problem definition) iterating through a set of records in a specific order does not have a set-based solution and requires one or more cursors of some sort, whether you actually call them cursors or not. Of course even such sequential operations can be optimized through more cores, more CPUs etc, but not because they are set-based and therefore using different techniques.

I also agree with you that cursors are generally longer and more verbose. But I disagree that they are harder to read or less maintainable: on the contrary I think that the verbosity makes them easier to read (and write: there is a reason why they are used inappropriately in so many cases) and more maintainable. I can easily imagine a set-based solution being broken by a minor change in the requirements that makes a set-based approach impossible (e.g. imposing order as in the SET-based example). In other words I think that set-based solutions are much more fragile than cursor-based solutions in a situation like that.

And I actually find the SET-based example hard to read: it is very compact but totally breaks with everything SQL because in spite of appearances it is not set-based and implies explicit iteration. To me it is a hack in the language that shouldn't be there; like a word borrowed from a foreign language. Comprende? But I agree with you that these things are also a matter of style.

So I guess you can summarize my "mantra" as: don't use a cursor if there is a set-based solution, but don't introduce or use fake-relational language constructs merely to "avoid" the use of a cursor either.

Remember: ORDER BY = CURSOR.  
February 23, 2010 6:36 PM
You need to sign in to comment on this blog
<February 2010>
SuMoTuWeThFrSa
31123456
78910111213
14151617181920
21222324252627
28123456
78910111213
Migrating from OCS 2007 R2 to Lync: Part 4
 Having migrated the rest of our users and legacy resources across, and start getting ready to... Read more...

Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL... Read more...

Seth Godin: Big in the IT Business
 Seth Godin has transformed our understanding of marketing in IT. He invented the concept of 'permission... Read more...

Using SQL Test Database Unit Testing with TeamCity Continuous Integration
 With database applications, the process of test and integration can be frustratingly slow because so... Read more...

Converting String Data to XML and XML to String Data
 We all appreciate that, in general, XML documents or fragments are held in strings as text markup. In... Read more...