Av rating:
Total votes: 48
Total comments: 11


Arthur Fuller
Intelligent Database Design Using Hash Keys
17 February 2006

Using Hash Keys instead of String Indexes

Your application may require an index based on a lengthy string, or even worse, a concatenation of two strings, or of a string and one or two integers. In a small table, you might not notice the impact. But suppose the table of interest contains 50 million rows? Then you will notice the impact both in terms of storage requirements and search performance.

You don’t have to do it this way. There is a very slick alternative, using what are known alternatively as hash buckets or hash keys.

What is a Hash?

In brief, a hash is the integer result of an algorithm (known as a hash function) applied to a given string. You feed said algorithm a string and you get back an integer. If you use an efficient hash function then there will be only a small chance that two different strings will yield the same hash value. If this does occur, then it is known as a hash collision. Suppose that you fed this article into a hash algorithm, then changed one character in the article and fed the article back into the hashing algorithm: it would return a different integer.

Hash Keys in Database Design

Now, how can we apply hash leys intelligently in our database designs? Suppose that we have these columns in the table of interest:

Column Name

Data Type

Name

Varchar(50)

GroupName

Varchar(50)

A compound index on both these columns would consume 50 + 50 characters per row. Given 50 million rows, this is a problem. A hash key based on these two columns is vastly smaller (4 bytes per row). Even better, we don’t have to store the hash keys themselves – or more accurately, we have to store them just once. We create a calculated column whose formula is the hash key of these two columns. Now, we index the hash key row and don’t bother with the index on the two columns mentioned above.

The basic process is as follows:

  1. The user (whether a human or an application) queries the values of interest
  2. These values are then converted into a hash key
  3. The database engine searches the index on the hashed column, returning the required row, or a small subset of matching rows.

In a 50 million row table, there will undoubtedly be hash collisions, but that isn’t the point. The set of rows returned will be dramatically smaller than the set of rows the engine would have to visit in order to find an exact match on the original query values. You isolate a small subset of rows using the hash key and then perform an exact-string match against the hits. A search based on an integer column can be dramatically faster than a search based on a lengthy string key, and more so if it is a compound key.

Hash Key Algorithms using the Checksum Function

There are several algorithms available, the simplest of which is built into SQL Server in the form of the Checksum function. For example, the following query demonstrates how to obtain the hash key for any given value or combination of values:

USE AdventureWorks
SELECT Name, GroupName, Checksum(Name,GroupName)AS HashKey
FROM Adventureworks.HumanResources.Department
ORDER BY HashKey

This results in the following rows (clipped to 10 for brevity):

Name

GroupName

Hashkey

Tool Design

Research and Development

-2142514043

Production

Manufacturing

-2110292704

Shipping and Receiving

Inventory Management

-1405505115

Purchasing

Inventory Management

-1264922199

Document Control

Quality Assurance

-922796840

Information Services

Executive General and Administration

-904518583

Quality Assurance

Quality Assurance

-846578145

Sales

Sales and Marketing

-493399545

Production Control

Manufacturing

-216183716

Marketing

Sales and Marketing

-150901473

You have a number of choices as to how you create the hash key. You might elect to fire an INSERT trigger, or use a stored procedure to create the hash key once the values of interest have been obtained, or even to execute an UPDATE query that creates the hash keys and populates the hash column retroactively (so that you can apply this technique to tables that already contain millions of rows). As stated above, my preferred solution is to "store" the hash keys in a calculated column that is then indexed. As such, the index contains the hash keys but the table itself does not.

Using this technique, you might approach the problem as follows, assuming that the front end passes in the target values for Name and GroupName:

CREATE PROCEDURE DemoHash
(
@Name Varchar(50),
@GroupName Varchar(50)
)
AS
-- USE AdventureWorks
DECLARE @id as int
SET @id = Checksum(@Name,@GroupName)
SELECT * FROM Adventureworks.HumanResources.Department
WHERE HashKey = @id
AND Name = @Name AND GroupName = @GroupName

Conclusion

This approach can yield considerable performance benefits and I encourage you to test it out on your own systems. The technique, as presented here, assumes that the search targets exist in a single table, which may not always be the case. I am still experimenting with ways to use this technique to search joined tables, and when I come up with the best approach, I will let you know.



This article has been viewed 13470 times.
Arthur Fuller

Author profile: Arthur Fuller

Arthur Fuller has been developing database applications for more than 20 years. He frequently works with Access ADPs, Microsoft SQL 2000 and 2005, MySQL, and .NET.

Search for other articles by Arthur Fuller

Rate this article:   Avg rating: from a total of 48 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Useful Article
Posted by: Anonymous (not signed in)
Posted on: Tuesday, July 04, 2006 at 10:39 PM
Message: Hi,

Thanks for writing this. Very useful concept. Quick question do you know if MySQL has a Checksum or similar function, in particular in version 5?

Subject: Checksum woes
Posted by: RedSevina (view profile)
Posted on: Tuesday, July 18, 2006 at 7:20 AM
Message: I am a huge fan of hashkeys. In my experience the checksum function is woefully inadequate. In working with massive amounts of data I have found that checksum will produce duplicates on disparate data thrown at it. By using an MD5 algorithm with the Hashbytes function in SQL 2005, it gets beyond the checksum problem. If you are working with a small amount of data and rows then checksum should work for you, but beware if you have massive amounts of data - which is the reason for hashing in the first place. What say you?

Subject: CRC32
Posted by: Anonymous (not signed in)
Posted on: Friday, August 25, 2006 at 11:42 PM
Message: For small set of strings, use crc32() is fine. But don't count on it. For eg:

fc0591 => 1009521187
123rainerbommert => 1009521187

Subject: How Useful?
Posted by: Anonymous (not signed in)
Posted on: Saturday, October 07, 2006 at 12:27 AM
Message: Basing an index on a hashed field of some other information would lose the natural sort of the original data. So the index is useless for ordered queries and range lookups. The only use for the index would be lookups on discreet values. Unless the hash function is engineered to group the data in some manner and used accordingly then this is useful in only very specific situations. Or am I missing something?


Subject: Some Doubts
Posted by: Anonymous (not signed in)
Posted on: Thursday, November 09, 2006 at 4:16 AM
Message: HI

Subject: MD5 and HashBytes
Posted by: Anonymous (not signed in)
Posted on: Monday, November 13, 2006 at 8:23 AM
Message: RedSevina,

I am experiencing the problem as you described in that CheckSum is not good enough. Could you give some more detail on you MD5 and HashBytes solution,

Many Thanks

Subject: Take care hashing
Posted by: Moh. hassan (not signed in)
Posted on: Wednesday, January 10, 2007 at 1:42 AM
Message: Hash function is good, but mandatory ,hash function must be designed to avoid collision based on good hashing algorithm.
It will take cpu process , so it is better to be computed using DML trigger, and avoid using calculated field.
Search data using hash function will be a problem if a collision occured, because a decision is taken based on search result, like, delete tablex where hash (x,y), so take care on hashing

Subject: quite good
Posted by: Anonymous (not signed in)
Posted on: Friday, March 23, 2007 at 9:19 AM
Message: am just amazed at how u use hashtables to do that

Subject: Hash functions are not intended to be unique
Posted by: Anonymous (not signed in)
Posted on: Monday, April 30, 2007 at 6:57 AM
Message: Several readers have pointed out that hash functions result in duplicates. That is beside the point. They are not intended to produce unique values. If they were, then you might as well just index on the actual values rather than hashing them.

The point of hashes is to reduce the number of rows to examine in the shortest possible time. Let us suppose that we are searching for hash 12345, and that 20 rows out of sixty million correspond to said hash. This is not a problem, but rather a solution! Having reduced (very rapidly) the rowset to 20 rows, we can now apply further logic to those 20 rows, i.e. city = 'Birmingham' and surname = 'Oliphant'.

I stand by my original position, as measured against a database of approximately 60M rows. Hashing worked very well. Typically the initial search reduced the result set to 20 or fewer rows, which could then be examined with more granular logic.

Arthur

Subject: It is a question. Can you help?
Posted by: Anonymous (not signed in)
Posted on: Friday, June 29, 2007 at 7:21 AM
Message: How do you get this worked with the NVARCHAR datatype

Subject: Anonymous Comments Disabled
Posted by: Sarah Grady (view profile)
Posted on: Monday, November 19, 2007 at 3:42 AM
Message: Due to a high volume of spam, anonymous comments have been disabled.

 









Phil Factor
The Data Center that Exploded
 A while back, in a Simple-Talk editorial meeting, someone bet Phil that he couldn't come up with a Halloween story.... Read more...



 View the blog
SQL Toolbelt 2008: Predominantly an Engineering Task
 The conversion of the Red-Gate tools to be compatible with SQL Server 2008 might not seem, on first... Read more...

Audit Crosschecks
 In this short article, the second of a 2-part series, William suggests a solution, using SQL Data... Read more...

SQL Response: The dim sum interview
 Richard Morris met David and Nigel of the SQL Response team, in a dim sum Restaurant in Cambridge. They... Read more...

XML Jumpstart Workbench
 In which Robyn and Phil decide that the best way of starting to learn XML is to jump in and take a ride... Read more...

Discovering Security Uses for SQL Compare
 Much of the security of SQL Server is implemented as part of the database schema. This provides some... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

SQL Server Full Text Search Language Features
 SQL Full-text Search (SQL FTS) is an optional component of SQL Server 7 and later, which allows fast... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Executing SSIS Packages
 Nigel Rivett demonstrates how to execute all SSIS packages in a given folder using either an SSIS... Read more...

Over 150,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk