Click here to monitor SSC
Av rating:
Total votes: 31
Total comments: 10


Fabiano Amorim
Query Optimizer and Cartesian Products
22 October 2009

In his continuing quest to bring a deeper understanding of Query Optimizer to the world at large, Fabiano takes a moment to point out a potential pitfall you may encounter. A light read, but one worth perusing.

If you use the Query Optimizer (QO) enough, you will eventually run into this interesting observation: There are some situations where the QO just decides to remove a join instruction from a query, and that behavior, in turn, creates a Cartesian product (in other words, the same thing as a join but without the structured relationship – essentially a Cross Join). For instance, take the following example:

USE tempdb
GO
DECLARE @tab1 TABLE(a INT)
INSERT INTO @tab1
  
SELECT TOP 1000 1
  
FROM sysobjects b, sysobjects a
SET STATISTICS io ON
SET STATISTICS
time ON
SELECT
*
  
FROM @tab1 a
      
INNER JOIN @tab1 b
      
ON a.a = b.a
SELECT *
  
FROM @tab1 a
      
INNER JOIN @tab1 b
      
ON a.a = b.a
  
WHERE a.a = 1
SET STATISTICS io OFF
SET STATISTICS
time OFF

Both of the queries above return the same rows from the @tab1 table, but there is obviously a difference between the first instruction and the second. As we can easily see, the difference is the WHERE condition present in the second query (“a.a” must be equal to 1), and that is the key. The QO knows that there is a redundancy of filters - if “a.a” is equal to 1 and “a.a” is also equal to “b.a” (the JOIN condition), then obviously “b.a” will also be equal to 1. In that case, the QO removes the join condition and applies just the one “a = 1” filter to both “a” and “b” tables.

This behavior can be clearly seen when we look at the execution plan:

StmtText

Argument

select * from @tab1 a   inner join @tab1 b      on a.a = b.a   where a.a = 1

NULL

  |--Nested Loops(Inner Join)

NULL

       |--Table Scan(OBJECT:(@tab1 AS [b]), WHERE:(@tab1.[a] as
       [b].[a]=(1)))

OBJECT:(@tab1 AS [b]), WHERE:(@tab1.[a] as [b].[a]=(1))

       |--Table Scan(OBJECT:(@tab1 AS [a]), WHERE:(@tab1.[a] as
       [a].[a]=(1)))

OBJECT:(@tab1 AS [a]), WHERE:(@tab1.[a] as [a].[a]=(1))

 
Take notice of the argument column; the QO has generated just one argument to read both tables, and there is no join condition in the Nested Loops. That characterizes a Cartesian product, in that for each row of the table 'a', the SQL will link with all rows of the table 'b'.

Ok, but is that a bad thing? In this case, yes. To have that particular query use a hash algorithm to make the join will certainly be better than a Loop Join. If you’re not sure about that, just look at the first query, remembering that they are both semantically equivalent. If you then look at the plan for that first query (below), you can see the QO has generated a Hash Join to link the table:

StmtText

Argument

select * from @tab1 a   inner join @tab1 b      on a.a = b.a

NULL

  |--Hash Match(Inner Join, HASH:([b].[a])=([a].[a]),
  RESIDUAL:(@tab1.[a] as [b].[a]=@tab1.[a] as [a].[a]))

HASH:([b].[a])=([a].[a]), RESIDUAL:(@tab1.[a] as
[b].[a]=@tab1.[a] as [a].[a])

       |--Table Scan(OBJECT:(@tab1 AS [b]))

OBJECT:(@tab1 AS [b])

       |--Table Scan(OBJECT:(@tab1 AS [a]))

OBJECT:(@tab1 AS [a])

If we then analyze the results of the IO and time statistics, we will see that the first query used less time and read fewer pages to return the data:

First query (Hash Join):
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#2E75B1C0'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
SQL Server Execution Times:
   CPU time = 328 ms,  elapsed time = 19976 ms.

  
Second query (Loop Join):
Table '#2E75B1C0'. Scan count 2, logical reads 2002, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
SQL Server Execution Times:
   CPU time = 407 ms,  elapsed time = 20377 ms.

Now, if you’re clever, you’re probably thinking, “Hmmm, in that case we could just put a Hint in to force a Hash Join, yes?” Wrong (at least at this stage...). If you try this, you will receive the follow error message:

SELECT *
  
FROM @tab1 a
      
INNER JOIN @tab1 b
      
ON a.a = b.a
  
WHERE a.a = 1
      
OPTION(hash JOIN)

Msg 8622, Level 16, State 1, Line 20
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

When SQL removes the join condition, it needs to make a Cartesian product, and the QO can’t do this with a Hash or Merge Join, only with a Loop join. So in this case, the logic of the QO clashes with the hint.

What you can do to fix that? Wait.

Yes, that’s all you can do. Wait for the next version or point release of SQL Server, and trust the development team to imitate Oracle. To ’fix’ this problem, Oracle has a handy parameter, which you can look here to learn more about. There is a Connect item talking about the problem, so we should keep watching this space.

that‘s all folks.



This article has been viewed 11589 times.
Fabiano Amorim

Author profile: Fabiano Amorim

Fabiano is fascinated by the SQL Server Query Processor and the way it works to optimize queries, procedures and functions. He graduated as a Technical Processor from Colégio Bezerra de Menezes, SP- Brazil, and has worked for several years with SQL Server, focusing in creating Data Warehouses and optimizing T-SQL codes for many companies in Brazil and Argentina. Fabiano is a SQL Server MVP, MCP for SQL Server 2000, MCTS and MCITP Data Base Developer for SQL Server 2005 and 2008. He also is actively involved in SQL Server community though forums such as MSDN and TechNet Brazil, writes articles for Simple-Talk and SQL Server Magazine Brazil, and he also presents online Webcasts and In-Person events for Microsoft Brazil. His blog is on http://blogfabiano.com

Search for other articles by Fabiano Amorim

Rate this article:   Avg rating: from a total of 31 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: queries
Posted by: arpit trivedi (view profile)
Posted on: Wednesday, November 11, 2009 at 10:22 PM
Message: i would like to see all type of syntax of all sql queries with its example......

Subject: I can't see...
Posted by: niikola (view profile)
Posted on: Monday, December 28, 2009 at 5:18 AM
Message: what is the difference between inner join and Cartesian product in your example?

Subject: "Wait"...I've been waiting
Posted by: Scott D. Nelson (view profile)
Posted on: Monday, December 28, 2009 at 8:29 AM
Message: Like so many other nifty Oracle features, I've been waiting for years for them to show up in SQL Server, and I'm still waiting...

CREATE OR REPLACE would be a nice touch....So would Oracle-style cursors, %ROWTYPE, %TYPE, ROWID, no lock escalation, materialized views that are actually useful...I could go on and on....but then I'd probably get depressed :)

Subject: .
Posted by: mcflyamorim (view profile)
Posted on: Saturday, January 02, 2010 at 7:59 AM
Message: Nikola,
I think i did not get your question, can you explain better?

Scott,
I don't know where start :-( .... but, you don't know how I feel bad to talk about this. We know Oracle have many good features, but SQL Server have their too, I don't want to start that famous argue about which was better, anyway, I don't get depressed too.
Cheers

Subject: Query Cost
Posted by: sivaraba (view profile)
Posted on: Monday, January 04, 2010 at 8:56 AM
Message: If you look at the Estimated Execution Plan, the subtree query cost is much lower for the second query (the one with the WHERE condition) compared to the first one. According to your explanation, if the first one uses a HASH join, it should (in theory) have a lower query cost...but thats not the case.
Can you explain this?

Subject: Query Cost
Posted by: mcflyamorim (view profile)
Posted on: Monday, January 04, 2010 at 9:47 AM
Message: Well done Sivaraba, that is a very interesting question.

Just to you know, run the query with the option to include the execution plan turned on (ctrl-m), than take a look at the estimated number of rows to the second query, and look the big difference between the estimated vs actual.

Why the SQL make this estimation so wrong?
When the actual batch query is compiled the SQL Server don't know how many rows are in the table @tab1, because of this he estimate just one row will be returned.

Another interesting thing is the query cost estimation, this is created based on the cardinality, on the other words, the Query Optimizer uses the estimated number of rows to create plan and show the estimated cost.
As we know variable tables does not keep statistic information of the columns, so if you change the variable table to a temporary table and than run the two queries again, you will see the correctly estimation... the first (hash join) is lower than the second (loop join).

Cheers...

Take advantage of that comment, I want to answer to Scott... I was thinking about that features you have mentioned... The SQL Server have many equivalents... ROWNUMBER(), Indexed Views, NOLOCK hint, Isolation Levels... and so on...

Subject: Invalid Example
Posted by: Algebrizer (view profile)
Posted on: Monday, January 04, 2010 at 6:57 PM
Message: I don't think this example is valid. The main issue is that there's no selectivity in the data - @tab contains a thousand ones, so there's no way to uniquely identify a row and you could never create a primary key on the table. Normally, when you join two tables, you want one row in one table to match up with one or more rows in the other table. In this case, every row matches every row. The where clause doesn't provide any useful filtering at all and you have a logical Cartesian product, no matter how the query is physically processed. In order of efficiency, in most cases a merge join is best, followed by a nested loop join, followed by a hash join. The latter is, in most cases, an attempt for the optimizer to make the best of a bad situation. So, methinks this example proves nothing.

Subject: Invalid
Posted by: mcflyamorim (view profile)
Posted on: Monday, January 04, 2010 at 8:06 PM
Message: Algebrizer, my intention is not provide a valid sample, just show the behavior of the QO. The change one query with a valid join into a cartesian.

Thanks

Subject: Invalid Example
Posted by: Algebrizer (view profile)
Posted on: Monday, January 04, 2010 at 8:30 PM
Message: I don't think this example is valid. The main issue is that there's no selectivity in the data - @tab contains a thousand ones, so there's no way to uniquely identify a row and you could never create a primary key on the table. Normally, when you join two tables, you want one row in one table to match up with one or more rows in the other table. In this case, every row matches every row. The where clause doesn't provide any useful filtering at all and you have a logical Cartesian product, no matter how the query is physically processed. In order of efficiency, in most cases a merge join is best, followed by a nested loop join, followed by a hash join. The latter is, in most cases, an attempt for the optimizer to make the best of a bad situation. So, methinks this example proves nothing.

Subject: Invalid Example
Posted by: Algebrizer (view profile)
Posted on: Tuesday, January 05, 2010 at 4:26 PM
Message: Touche. That IS pretty interesting behavior - how the QO does a hash join in one case and a nested loops join in the other since the where clause should make no difference between the two queries. I still think it's logically a cartesian product no matter what since every row in the first table alias will match every row in the second one. I'll have to ponder on this one for awhile longer. (Also now sure why my message got posted twice in this blog. I may have refreshed my browser, but not sure.)

 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
Using SQL Test Database Unit Testing with TeamCity Continuous Integration
 With database applications, the process of test and integration can be frustratingly slow because so... Read more...

SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk