Joe Celko

Avoiding the EAV of Destruction

18 June 2009

A forum posting, from someone who wanted a better solution to the common problem of handling global settings in a database, leads Joe Celko into a fascinating discussion of the ramifications of the various solutions. 

This interesting problem appeared on the SQL Server programming newsgroup in early February 2009. Let me just paste in the quote that started the thread.

Hello, in a database we deploy in our customer's company, we need to use a lot of parameters. For example, how many invoices should we print by default. These parameters are used in a lot of queries, stored procedures, etc. Do you know a better design for this? Now, the parameters are stored in a table with a name, a type and several columns corresponding to each type.

Before talking about the possible solutions, I want to talk about the mindset that leads to each solution. It is not enough to have an answer; you need to know what the process was that led to a solution. Math majors appreciate that it is not enough to just come up with a final answer. You have to show your work and give a proof that this is a good answer.

We have all used some kind of configuration file in a software package. A lot of us have probably written one. The C language has a preprocessor that can look at flags in the command line and do conditional compilations. Since C is a low-level systems language, most of the conditional compilations were aimed at the hardware – what kind of disk drives and how many of them were in the system. The preprocessor code was often more complex than the final C program you compiled when all the macros were expanded. You were writing device drivers and hardware dependent code on the fly!

As we moved to higher level languages for applications, we used libraries to interface to hardware. The configuration parameters became more involved with the application features than the hardware. And we have support for constants in the application languages.

Most procedural languages have something like Pascal’s declaration syntax:

CONST pi: REAL = 3.14;

Or us older guys remember FORTRAN:

      DOUBLE PRECISION pi, c
      PARAMETER (pi = 3.14159265358979d0)
      PARAMETER (c = 2.998d8)

There was also a DATA statement in FORTRAN. It is important to distinguish between the DATA and PARAMETER statements. The DATA statement merely specifies an initial value for a variable or array which may be altered during the course of execution. The PARAMETER statement, which came later, specifies values for constants which cannot be changed without recompiling the program.

The idea is to gather all of the constants needed in one place and tell the optimizer that they will not change, so that it can do its magic when it compiles the program. But let me share a classic quote:

The primary purpose of the DATA statement is to give names to constants; instead of referring to pi as 3.141592653589793 at every appearance, the variable pi can be given that value with a DATA statement and used instead of the longer form of the constant. This also simplifies modifying the program, should the value of pi change.

-- FORTRAN manual for Xerox Computers --

You cannot make up stuff like that.

What the SQL Server programming newsgroup post described is called an EAV ("Entity-Attribute-Value") table, and it is a very common schema design error for programmers who started with an OO or a loosely typed programming language.

The table looks something like this:

CREATE TABLE eav
(parameter_name VARCHAR(100) NOT NULL PRIMARY KEY,
 parameter_datatype VARCHAR(15) NOT NULL
   CHECK (parameter_datatype IN ('INTEGER', 'REAL', ..)),
 parameter_value VARCHAR(200), -- nullable?
 ..);

Did you notice that everything has to be a string? Now we need to do a lot of CAST() expressions to use the parameter_value when it is a numeric or a temporal parameter.
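For example, just to read one numeric setting back out of the table, every query has to cast the string. A minimal sketch against the eav table above:

SELECT CAST (parameter_value AS INTEGER) AS default_invoice_cnt
  FROM eav
 WHERE parameter_name = 'default_invoice_cnt';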

Try to write a single CHECK() constraint that works for all the parameters. It can be done! You need a CASE expression with WHEN clauses that include every parameter_name as part of the test. For example:

 CHECK (CASE WHEN parameter_name = 'default_invoice_cnt'
              AND CAST (parameter_value AS INTEGER) > 0
             THEN 'T'
             ..
             ELSE 'F' END = 'T')

Next, try writing a CHECK() constraint that spans two or more parameters. Keep it simple, something like "print_buffer_cnt >= 3 * default_invoice_cnt", then add more complex configuration rules. That CASE expression is a horrible mess, isn't it? Oh, remember that the WHEN clauses are tested in the order they were written. If you aren't very careful, you might have them in the wrong order and some of them will never execute.

Try to write a single DEFAULT clause for all the values crammed into one parameter_value column. Impossible! DEFAULT takes a scalar value, not a CASE expression. But even if a CASE expression were allowed, it would look as bad as the monolithic CASE expression used in the magical, universal, generic CHECK() constraint.

While not usually required for a parameter list, try to set up DRI (declarative referential integrity) among the entities. Much like not having a magical, universal DEFAULT, you cannot have a magical, universal DRI action that will CASCADE, SET NULL, RESTRICT and/or SET DEFAULT as needed. That work has to be done with a trigger.

If you thought the WHEN clauses in the single CASE expression were unmaintainable, wait until you see the "TRIGGER from Hell" code. Now maintain it. It will not optimize either! Triggers are procedural and not declarative.
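To make that concrete, here is a minimal sketch of what just the single "print_buffer_cnt >= 3 * default_invoice_cnt" rule could look like as a trigger on the EAV table; the trigger name and error text are mine, not a recommendation:

CREATE TRIGGER eav_cross_parameter_check
ON eav
AFTER INSERT, UPDATE
AS
BEGIN
 IF EXISTS
    (SELECT *
       FROM eav AS B, eav AS D
      WHERE B.parameter_name = 'print_buffer_cnt'
        AND D.parameter_name = 'default_invoice_cnt'
        -- note: even this can break if the optimizer evaluates the CASTs
        -- before the name filters; yet another EAV hazard
        AND CAST (B.parameter_value AS INTEGER)
            < 3 * CAST (D.parameter_value AS INTEGER))
 BEGIN
  RAISERROR ('print_buffer_cnt must be >= 3 * default_invoice_cnt', 16, 1);
  ROLLBACK TRANSACTION;
 END;
END;

Now imagine dozens of such rules, all living in one trigger body.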

A really good horror story about EAV disasters, called "Bad CaRMa", is here on Simple-Talk.

Let’s get back to the original poster, whom you probably forgot about by now. Continuing on, he says:

I do not like this so much and I was thinking about creating one very simple scalar function for each parameter. Easy to modify at deployment time or with our interface application, and, more importantly, better for performance, I guess. Something like:

CREATE FUNCTION invoice_print_cnt()
RETURNS INTEGER
AS BEGIN
   -- other code if needed
   RETURN 3;
END;

This has some advantages over the EAV model. Only the required functions will be used, rather than loading the whole EAV table. You can be sure of the proper data types and you can add some logic with a RETURN (CASE WHEN ..) expression.
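As a purely hypothetical illustration of that last point (the function name and the @customer_class parameter are invented for the example), such a function might look like:

CREATE FUNCTION invoice_print_cnt_by_class (@customer_class CHAR(1))
RETURNS INTEGER
AS BEGIN
   -- wholesale customers get more copies than everyone else
   RETURN (CASE WHEN @customer_class = 'W' THEN 5 ELSE 3 END);
END;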

But the downside is that functions do not pass information to the optimizer. If we have a constraint like "print_buffer_cnt >= 3 * default_invoice_cnt", then we need a query or other logic inside the function to catch it. But that means invoice_print_cnt() has to call print_buffer_cnt(), or the other way around, or perhaps the two functions have to call each other like co-routines. Wow! That could get tricky and perhaps lead to endless looping.

The practical solution is to write a program to do all of the inter-parameter constraint checking and then generate simple “return a value” functions like our invoice_print_cnt() example. But if you change a parameter, then you need a re-compile of the running system. Hopefully, installation options will rarely, if ever, change.
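A minimal sketch of what that generation step could look like in T-SQL, assuming the checking program has already validated the value it hard-codes into the function:

DECLARE @invoice_print_cnt INTEGER, @sql NVARCHAR(MAX);
SET @invoice_print_cnt = 3; -- value already validated by the checking program
SET @sql = 'CREATE FUNCTION invoice_print_cnt() RETURNS INTEGER '
         + 'AS BEGIN RETURN '
         + CAST (@invoice_print_cnt AS NVARCHAR(10)) + '; END;';
EXEC (@sql); -- regenerate the simple "return a value" function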

A proprietary method would be to write SQLCMD scripts with lots of IF-THEN-ELSE logic, using $(variable) install-time scripting variables in the script. But the result cannot be modified once it is installed, except by running another script that will ALTER the procedures.
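A minimal sketch of that idea, run in SQLCMD mode; InvoicePrintCnt is a made-up scripting variable that could just as well be passed in with the -v switch at install time:

:setvar InvoicePrintCnt 3
GO
CREATE FUNCTION invoice_print_cnt()
RETURNS INTEGER
AS BEGIN
   RETURN $(InvoicePrintCnt); -- replaced with plain text before the batch runs
END;
GO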

My suggestion is that you could create a table with all the system parameters each in its own column. The skeleton for the table would look like this:

CREATE TABLE ConfigurationParameters
(one_row_lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY
   CHECK (one_row_lock = 'X'),
 invoice_print_cnt INTEGER DEFAULT 1 NOT NULL
   CHECK (invoice_print_cnt > 0),
 print_buffer_cnt INTEGER DEFAULT 3 NOT NULL,
 CONSTRAINT buffer_and_printers_okay
   CHECK (print_buffer_cnt >= 3 * invoice_print_cnt),
 etc.);

This gets the right data types without CAST() expressions. Each column can have one or more CHECK () constraints and its own DEFAULT. You can also have constraints among the columns, as shown in the skeleton code. And finally, you could include references to other tables. I cannot come up with a good example of that in a configuration table, but someone probably has an example.
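For what it is worth, here is one purely hypothetical possibility; the Currencies table and the default_currency_code column are invented for the illustration:

ALTER TABLE ConfigurationParameters
  ADD default_currency_code CHAR(3) DEFAULT 'USD' NOT NULL
      REFERENCES Currencies (currency_code);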

The “one_row_lock” column is a way to assure that the table has only one row in it. I’ll get back to that in a little while.

The really cool trick is to use the little-known INSERT INTO.. DEFAULT VALUES statement to reset the configuration to the defaults.

CREATE PROCEDURE ResetConfiguration
AS
BEGIN
-- ask: is this really what you want to do?
DELETE FROM ConfigurationParameters;
INSERT INTO ConfigurationParameters DEFAULT VALUES;
END;

The DELETE FROM statement will clean out the table; all constraints are TRUE on an empty table. The INSERT INTO.. DEFAULT VALUES statement will do what it says – it inserts a row with all the default values of each column. We will assume that the DEFAULT values do not conflict with any CHECK () or referential constraints. Yes, I know that sounds stupid, but check it out anyway.

There is another advantage to this single-table approach: you can replace the "one_row_lock" column with a configuration number and keep multiple configurations in the same table.

CREATE TABLE ConfigurationParameters
(configuration_nbr INTEGER DEFAULT 0 NOT NULL PRIMARY KEY
   CHECK (configuration_nbr BETWEEN 0 AND 3),
 etc.);

As it turned out, the original application can be used in a multi-company environment. They have clients who use the application to manage three or four companies of their group. The only disadvantage is that if you execute ResetConfiguration(), you will wipe out any configuration except the zero-th, which I would presume is the most general, minimal or most basic version of the application.
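A sketch of a per-configuration reset that avoids that problem (ResetOneConfiguration is a name I made up for the illustration): the key column is supplied explicitly and every other column falls back to its DEFAULT.

CREATE PROCEDURE ResetOneConfiguration @configuration_nbr INTEGER
AS
BEGIN
 DELETE FROM ConfigurationParameters
  WHERE configuration_nbr = @configuration_nbr;
 INSERT INTO ConfigurationParameters (configuration_nbr)
 VALUES (@configuration_nbr); -- all other columns pick up their DEFAULTs
END;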

There is another trick using row constructors, also known as the VALUES() clause. In Standard SQL it is more general than the single row allowed in an old-style INSERT INTO statement, and Microsoft is catching up. Most SQL Server programmers only put constants or simple variables in the <expression list>, but function calls, computations, scalar subquery expressions or anything else that returns a proper scalar value (or an allowed NULL) is fine. It has more power than most programmers think.

In Standard SQL, you have been able to use the VALUES() clause anywhere that a derived table could be used. But, like a derived table, you need to give it a dummy name. This gives us another way to prevent any updates: use a VIEW defined with a table constructor.

CREATE VIEW Constants (pi, e, phi)
AS
SELECT X.*
  FROM (VALUES (CAST (3.14159265 AS FLOAT),
                CAST (2.71828182 AS FLOAT),
                CAST (1.6180339887 AS FLOAT))
       ) AS X (pi, e, phi);

Please notice that you have to use CAST() operators to assure that the data types are correct. This is not a problem with INTEGER values, but the example could have FLOAT, REAL, DOUBLE PRECISION or DECIMAL(p,s) as options. It is also just good documentation.

The full syntax for the VALUES clause is:

VALUES (<expression list>)[, (<expression list>)]

The new (well, to Microsoft) version of VALUES allows a list of <expression list>s. Each of these row constructors becomes a row in a nameless derived table. You then use an AS <table name> [(<column name list>)] alias to reference it.
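A minimal example of the multi-row form; the two row constructors become a two-row derived table that has to be given a name and column names (the names here are invented for the illustration):

SELECT T.region_nbr, T.region_name
  FROM (VALUES (1, 'North'),
               (2, 'South')) AS T (region_nbr, region_name);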

This idea extends to constant tables. A classic example would be look-up tables for functions in statistics and finance, such as the Student-T distribution for small samples. The value of (r) is the size of the sample minus one and the percentages are the confidence intervals. If you don’t remember your college statistics course, then just think of your favorite look up table.

Loosely speaking, Student's t-distribution is the best guess we can make at the population distribution, to a certain level of confidence, when we do not know the population standard deviation. William Gosset created this statistic in 1908. His employer, Guinness Breweries, required him to publish under a pseudonym, so he chose "Student" and that name stuck. Here is a short table:

  r    90%      95%      97.5%    99.5%
=========================================
  1  3.07766  6.31371  12.70620  63.65600
  2  1.88562  2.91999   4.30265   9.92482
  3  1.63774  2.35336   3.18243   5.84089
  4  1.53321  2.13185   2.77644   4.60393
  5  1.47588  2.01505   2.57058   4.03212
 10  1.37218  1.81246   2.22814   3.16922
 30  1.31042  1.69726   2.04227   2.74999
100  1.29007  1.66023   1.98397   2.62589
 ..  1.28156  1.64487   1.95999   2.57584

This becomes the VIEW:

CREATE VIEW Student_T (r, c900, c950, c975, c995)
AS
SELECT X.*
  FROM (VALUES (CAST (1 AS INTEGER),
                CAST (3.07766 AS FLOAT),
                CAST (6.31371 AS FLOAT),
                CAST (12.7062 AS FLOAT),
                CAST (63.65600 AS FLOAT)),
               (2, 1.88562, 2.91999, 4.30265, 9.92482),
               (3, 1.63774, 2.35336, 3.18243, 5.84089),
               etc.)
       AS X (r, c900, c950, c975, c995);

Notice that the first row has the CAST() function on all the columns. This will guarantee that all the columns in the VIEW will have the appropriate data types. You do not need to repeat it for all the rows.

Another version of this trick is to use a CTE (Common Table Expression) instead of a VIEW:

WITH Student_T (r, c900, c950, c975, c995)
AS
(SELECT X.*
   FROM (VALUES (CAST (1 AS INTEGER), -- redundant but safe
                 CAST (3.07766 AS FLOAT),
                 CAST (6.31371 AS FLOAT),
                 CAST (12.7062 AS FLOAT),
                 CAST (63.65600 AS FLOAT)),
                (2, 1.88562, 2.91999, 4.30265, 9.92482),
                (3, 1.63774, 2.35336, 3.18243, 5.84089),
                ..)
        AS X (r, c900, c950, c975, c995))
SELECT (..) AS t, etc.
  FROM <<sample table expression>>
 WHERE ..;

Most look-up tables should be in VIEWs since it is very likely that more than one query is going to use them.
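For example, any query that needs a critical value can simply join to the VIEW; Samples and sample_cnt here are made-up names for whatever holds your raw data:

SELECT S.sample_id, T.c950 AS t_critical_95
  FROM Samples AS S
       INNER JOIN Student_T AS T
       ON T.r = S.sample_cnt - 1;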

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on the ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his dry wit. He is also interested in Science Fiction.



Subject: Good thinking
Posted by: Bruce W Cassidy (view profile)
Posted on: Sunday, June 28, 2009 at 9:35 PM
Message: In general, I agree that what you have described is a better approach.

Of course, there's a limit to the number of columns in a table, but that can be designed around by breaking the parameter tables into subject areas.

Subject: Tickles the brain in a good way
Posted by: Ewald Cress (not signed in)
Posted on: Monday, June 29, 2009 at 12:33 AM
Message: I like this pattern as an alternative to a subset of EAV cases. But then again, EAV all too often carries the stench of deferred design, and any alternative that forces somebody to think too much before implementation might be rejected precisely because they want to keep on deferring. Oh well, there's no helping some people.

Subject: On Maintainability and Performance
Posted by: Charles Kincaid (view profile)
Posted on: Monday, June 29, 2009 at 10:35 AM
Message: So every time you add a new parameter you have to alter the table structure. Plus you then have to modify everything that touches the table. Do I have that right? I just want to air out both sides.

Then when confronted with a bunch of "type" tables (Order Type, Pallet Type, Location Type, User Type, etc.) one of our guys put together a demo showing that the performance went UP when they were all in ONE table. Mind you that this is in the Compact Edition on Mobile devices. Performance was a wash on the full version. Are my teams going to have to use two different mindsets when working on Mobile apps vs NORMAL (humor intended) apps?

I'm getting a lot of pressure to put slowly changing dimensions (like US state names) in a single table. I like the view idea for such data as takes an act of Congress to change. US states and Time Zones come right to mind. I'll have to see if that is supported on the Mobile side.

Subject: Sub-types
Posted by: Anonymous (not signed in)
Posted on: Monday, June 29, 2009 at 11:05 AM
Message: Splitting the EAV name/value table into separate columns for different value types is one way to avoid casting.

This can be taken further by moving the separate columns into sub-type tables.

Metadata is the biggest problem with EAVs, i.e. min value, max value, number precision and scale, string length, allowed values, etc.

Subject: My Favorite Part
Posted by: JJEugene (view profile)
Posted on: Monday, June 29, 2009 at 4:15 PM
Message: My favorite part of the article is the trick for "INSERT INTO.. DEFAULT VALUES". I would not have thought to use INSERT INTO.. DEFAULT VALUES that way (ie, in context of single row table and to re-set a bunch of defaults). Neat.

Subject: You lost me right away...
Posted by: Si (not signed in)
Posted on: Tuesday, June 30, 2009 at 8:28 PM
Message: Why on earth would you stick every 'Value' in the one column?

There are not that many data types, EAV approach just has to define the data type in the 'Attribute' table to know what column in 'Value' table to select from.

And to answer Anonymous' reply (with Sub-types subject), you place constraints in the 'Attribute' table.

I guess if you only operate at data access layer (everything done in SQL) then this wouldn't be too easy, but surely you have (at least) a business logic layer in most applications so that you can automate EAV access?

Subject: Thanks
Posted by: Fred (view profile)
Posted on: Thursday, July 02, 2009 at 10:42 AM
Message: Hello,
I am the original poster, whom you probably forgot about by now !
First I'd like to thank Joe for his advice and the time he spent writing such a detailed article.
Then I must say we use this system now, and we even added some functionality to customize the parameters by company, group and even single user.
We use a "priority" column and a technique inspired by row-level access security.

Subject: EAV have a place
Posted by: Seven24 (view profile)
Posted on: Thursday, October 15, 2009 at 2:39 PM
Message: There is a place for EAV structure. In one of the multi-tenant applications I built, users wanted the ability to store a wad of various data values with a top level entity. My rules were: you can never filter on it, sort by it, place it in a specific place on a report (other than spew it out in columns in Excel output). If it is just a bag of data values with a category then it can go into this EAV structure. However, the moment someone wants to filter, sort or put the value in a specific place on a report, it must become a top level column in a table with data integrity rules. That structure saved enormous amounts of money for the customers in that every little request to store an additional bit of data did not have to go through a developer.

Further, one can create a table schema where you have a Numeric, DateTime, and String column instead of one text column to rule them all. If you so desire, you can enforce data integrity through triggers to ensure that a DateTime attribute does not get stored in the String column.

EAV structures are like drugs: in small quantities and used in the proper circumstances, they can be beneficial. However, too much will kill you.

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

SQL Server XML Questions You Were Too Shy To Ask
 Sometimes, XML seems a bewildering convention that offers solutions to problems that the average... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.