Anith Sen

Five Simple Database Design Errors You Should Avoid

16 October 2009

Anith follows up his highly successful article Facts and Fallacies about First Normal Form with a fascinating discussion of five common database design errors which persist in spite of the fact that their unfortunate consequences are so widely known. It is a much-needed reminder for anyone who has to design databases.

Most of us in the industry are aware of the dangers of poor database design, yet we overlook them in real-world databases. Harmful design flaws often go unnoticed. In some cases, the limitations of the DBMS or the SQL language itself may contribute to the problem. In others, it may be inexperienced database designers who pay more attention to writing fancy code than to building a good data model.

To put it simply, database design is the process of translating facts about part of the real world into a logical model. This model can then be implemented, usually with a relational database. While it is true that relational databases rest on the solid foundation of logic and set-based mathematics, the design process also involves aesthetics, intuition and, of course, the subjective bias of the designer. But how far does this affect the design? In this article, I’ll explain five common design errors people make while modelling tables and suggest some guidelines on how to avoid them.

(1)  Common Lookup Tables


Figure 1

A few years back, Don Peterson wrote an article for SQL Server Central that detailed a common practice of creating a single lookup table for various types of data, usually called a code table or an "allowed value table" (AVT). These tables tend to be massive and hold a pile of unrelated data. Appropriately enough, Don called them Massively Unified Code-Key (MUCK) tables (Peterson, 2006). Though many others have written about the practice over the years, this name seems to capture most effectively the clumsiness associated with such a structure.

In many cases, the data in these tables is stored as VARCHAR(n), even though the real data type of the values can be anything from INTEGER to DATETIME. They are usually represented in three columns, along the lines of the sample table in Figure 1.
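To make the shape concrete, here is a minimal sketch of such a generic table in Transact-SQL. Since Figure 1 is not reproduced here, the column names are hypothetical and chosen only to illustrate the three-column pattern:

CREATE TABLE CommonLookup (                -- a hypothetical MUCK table
    category_name VARCHAR(30)  NOT NULL,   -- 'OrderStatus', 'LocationCode', ...
    code          VARCHAR(20)  NOT NULL,   -- every code forced into a string
    description   VARCHAR(100) NOT NULL,
    CONSTRAINT PK_CommonLookup PRIMARY KEY ( category_name, code )
);

Every code from every category ends up in this one pile, which is exactly where the problems discussed below begin.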

The justification is that each entity in the example has a similar set of attributes, and therefore it is okay to cram them all into a single table. After all, fewer tables keep the database simpler, right?

During the design process, the database designer may come across several small tables (in the example, these are tables that represent distinct types of entities such as ‘status of orders’, ‘priority of financial assets’, ‘location codes’, ‘type of warehouses’ etc.).

Figures 2-5

He then decides to combine them all because of the similarity of their columns. He assumes that he is eliminating redundant tables and simplifying the database: he will have fewer tables, save space, improve efficiency and so on. People also assume that it reduces the complexity of the SQL required, because a single routine or stored procedure can be written to access any type of data.

So what is wrong with it?

  • Firstly, you lose the means to ensure accurate data: constraints. By combining different entities into a single table, you have no declarative means to restrict the values of a certain category. There is no easy way to enforce simple foreign key constraints without also adding the category id to all the referencing keys.
  • Secondly, with this type of generic lookup table you are forced to represent every data type as a string. Such intermingling of types can be a problem, because check constraints cannot be imposed without major code-hacking. In the example we’ve given, if the discount code is CHAR(3) and location_nbr is an INTEGER, what should the data type of the ‘code’ column be in the common lookup table?
  • Thirdly, you commit yourself to rigidity and subsequent complexity. You might be tempted to ask how such an apparently simple and flexible design can be rigid. Well, considering our example of a common lookup table scheme, just imagine that the ‘LocationCode’ entity needs another column, say ‘region’. What about the consequences of adding a status to the ‘DiscountType’ entity? In order to change a single category, you will have to add a column that applies to every row in the table, whether it is relevant to them or not. And what about complexity? The idea of using common lookup tables often comes from the idea of generalizing entities, whereby a single table represents a "thing" – pretty much anything.
    Contrast this with the fundamental rule that a well-designed table represents a set of facts about entities or relationships of the same kind. The problem with generalizing entities is that the table becomes a pile of unrelated rows: you lose precision of meaning, followed by confusion and, often, unwanted complexity.
    The main goal of a DBMS is to enforce the rules that govern how the data is represented and manipulated. Make sure you do not stretch terms like "generalize" and "reuse" in the context of database design to the point where you have no control over what is being designed.

  • Fourthly and finally, you are faced with physical implementation issues. While logical design is considered to be totally separate from physical implementation, in commercial DBMS products like SQL Server the physical implementation can be influenced by the logical design, and vice-versa. In large enterprises, such common lookup tables can grow to hundreds of thousands of rows and require heavy physical database tuning. Locking and concurrency issues with such large tables will also have to be controlled. The internal representation of a particular set of rows in physical storage can be a determining factor in how efficiently the values can be accessed and manipulated by SQL queries.

As a general recommendation, always use separate tables for each logical entity, identifying the appropriate columns with correct types, constraints and references. It is better to write simple routines and procedures to access and manipulate the data in the tables without aiming for "dynamic code".
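As a rough sketch of that recommendation (the tables and columns below are illustrative, loosely based on the ‘status of orders’ and ‘location codes’ examples above), each lookup becomes its own narrow table with its natural data type, and referencing tables use ordinary foreign keys:

CREATE TABLE OrderStatus (
    status_code CHAR(2)     NOT NULL PRIMARY KEY,
    description VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE LocationCodes (
    location_nbr INT         NOT NULL PRIMARY KEY,
    description  VARCHAR(50) NOT NULL UNIQUE
);

-- A referencing table now needs no category column to qualify the reference.
CREATE TABLE Orders (
    order_nbr    INT     NOT NULL PRIMARY KEY,
    status_code  CHAR(2) NOT NULL REFERENCES OrderStatus ( status_code ),
    location_nbr INT     NOT NULL REFERENCES LocationCodes ( location_nbr )
);

Each code now has its natural type, its own constraints, and a foreign key that cannot accidentally point at a value from a different category.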

Common lookup tables have no place in sensible database design, whether used as a short-term makeshift fix or as a long-term viable solution. While application-enforced integrity is sometimes favored by developers, it remains true that the DBMS must still be the centralized enforcer of all integrity. Because the foremost goal in any database design is to preserve data integrity and logical correctness, common lookup tables are one of the worst kinds of mistake that one can make.

(2) Check Constraint conundrum

Check constraints serve several purposes, but designers get into trouble with them in two ways:

  • They fail to declare appropriate check constraints where they are necessary.
  • They do not know when to use a column-level check constraint rather than a separate table with a foreign key constraint.

Constraints in SQL Server can serve many different purposes, including support for domain constraints, column constraints and, to some extent, table constraints. A primary purpose of a database is to preserve data integrity, and well-defined constraints provide an excellent means to control what values are allowed in a column.

So then, should you ever avoid using a check constraint? Well, let’s consider the cases where a referencing table (a table with a foreign key) can be used to restrict the column to a specific set of values.

 

Figure 6

Here the values for ins_code in the PolicyHolders table can be restricted in two ways. One way would involve the use of  a lookup table that holds the allowed values for ins_code. An alternative is to have a check constraint on the PolicyHolders table along the lines of:

CHECK ( ins_code IN ( 'IC', 'FS', 'MC', 'PPO', 'POS', 'HMO' ) )
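For comparison, here is a sketch of the two approaches. Only the PolicyHolders table, the ins_code column and its allowed values come from Figure 6; the other columns and the lookup table name are assumptions, and in practice you would choose one alternative or the other:

-- Alternative A: an in-line check constraint.
CREATE TABLE PolicyHolders (
    policyholder_nbr INT         NOT NULL PRIMARY KEY,
    holder_name      VARCHAR(50) NOT NULL,
    ins_code         CHAR(3)     NOT NULL
        CHECK ( ins_code IN ( 'IC', 'FS', 'MC', 'PPO', 'POS', 'HMO' ) )
);

-- Alternative B: a lookup table enforced by a foreign key.
CREATE TABLE InsuranceCodes (
    ins_code    CHAR(3)     NOT NULL PRIMARY KEY,
    description VARCHAR(50) NOT NULL
);

CREATE TABLE PolicyHolders (
    policyholder_nbr INT         NOT NULL PRIMARY KEY,
    holder_name      VARCHAR(50) NOT NULL,
    ins_code         CHAR(3)     NOT NULL REFERENCES InsuranceCodes ( ins_code )
);

The second alternative also gives you somewhere to keep a description of each code, and the list can be changed with ordinary INSERTs and DELETEs rather than an ALTER TABLE.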

So what is the rule of thumb in choosing the right approach? Old hands in database design look for three specific criteria to govern their choice between a check constraint and a separate table with a foreign key constraint.

  1. If the list of values changes over a period of time, you must use a separate table with a foreign key constraint rather than a check constraint.
  2. If the list of values is larger than 15 or 20, you should consider a separate table.
  3. If the list of values is shared or reusable, that is, used three or more times in the same database, then you have a very strong case for a separate table.

Note that database design is a mix of art and science, and therefore involves trade-offs. An experienced designer can make a sensible trade-off based on an informed judgment of the specific requirements.

 (3) Entity-Attribute-Value Table

Ideally, a table represents a set of entities, each of which has a set of attributes represented as columns. Sometimes, though, designers get caught up in the world of alternative programming "paradigms" and try to implement them. One such model is called Entity-Attribute-Value (or, in some contexts, the object-attribute-value model): a nickname for a table that has three columns, one for the type of entity it is supposed to represent, one for a parameter, attribute or property of that entity, and a third for the actual value of that property.

Consider the following example of a table that records data about employees:

Fig 7

Now, the EAV approach shuffles up the data, in order to  represent the attributes as values in one column and the corresponding values of those attributes in another column:

Fig 8

Taking this to the extreme, there is no need for additional tables -- all data can be crammed into a single table! The credit for this invention goes to the so-called "clinical database" designers, who decided that when various data elements are unknown, partially known or sparse, it is best to use EAV (Nadkarni, 2002). The problem is that many newcomers are seduced into applying this approach in SQL databases, and the result is usually chaos. In fact, many people seem to believe that it is a good thing not to know the nature of their data!
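To make the discussion concrete, here is a minimal sketch of the EAV table queried in Listing 1 below. Only the column names (emp_nbr, emp_property, value) appear in the article; the data types and the key are assumptions:

CREATE TABLE EmployeeValues (
    emp_nbr      INT          NOT NULL,    -- which employee
    emp_property VARCHAR(50)  NOT NULL,    -- metadata stored as data
    value        VARCHAR(100) NULL,        -- every type squeezed into a string
    CONSTRAINT PK_EmployeeValues PRIMARY KEY ( emp_nbr, emp_property )
);

Compare this with the conventional Employees table of Fig 7, where each attribute is a typed column that can carry its own constraints.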

So what are the benefits that are touted for EAV? Well, there are none. Since EAV tables will contain any kind of data, we have to PIVOT the data to a tabular representation, with appropriate columns, in order to make it useful. In many cases, there is middleware or client-side software that does this behind the scenes, thereby providing the illusion to the user that they are dealing with well-designed data.

EAV models have a host of problems.

  • Firstly, the massive amount of data is, in itself, essentially unmanageable.
  • Secondly, there is no possible way to define the necessary constraints -- any potential check constraints will have to include extensive hard-coding for appropriate attribute names. Since a single column holds all possible values, the datatype is usually VARCHAR(n).  
  • Thirdly, don't even think about having any useful foreign keys.
  • Finally, there is the complexity and awkwardness of queries. Some folks consider it a benefit to be able to jam a variety of data into a single table when necessary -- they call it "scalable". In reality, since EAV mixes up data with metadata, it is a lot more difficult to manipulate the data, even for simple requirements. Consider a simple query to retrieve the employees who were born after 1950. In the traditional model, you'd have:

SELECT first_name, last_name
  FROM Employees
 WHERE date_of_birth > '12/31/1950' ;

In an EAV model, here is one way to write a comparable query:

SELECT MAX( CASE emp_property WHEN 'first_name'
                              THEN value
            END ) AS first_name,
       MAX( CASE emp_property WHEN 'last_name'
                              THEN value
            END ) AS last_name
  FROM EmployeeValues
 WHERE emp_nbr IN ( SELECT emp_nbr
                      FROM EmployeeValues
                     WHERE emp_property = 'date_of_birth'
                       AND CAST( value AS DATETIME ) > '12/31/1950' )
   AND emp_property IN ( 'first_name', 'last_name' )
 GROUP BY emp_nbr ;

Listing 1

For those who are handy with Transact-SQL: go ahead, add a few new columns with different data types, try out a few queries, and see how much fun it can be!

The solution to the EAV nightmare is simple: Analyze and research the users' needs and identify the data requirements up-front.  A relational database maintains the integrity and consistency of data.  It is virtually impossible to make a case for designing such a database without well-defined requirements. Period.

(4) Application Encroachments into DB design

There are several ways that an application can trespass on the field of data management. I’ll briefly explain a couple of ways and suggest some guidelines on how to prevent it.

Enforcing Integrity via applications

Proponents of application-based integrity usually argue that constraints negatively impact data access. They also assume that selectively applying rules, based on the needs of the application, is the best route to take.

Let us look at this in detail. Do any good statistical measurements, comparisons or analyses exist to address the performance difference between the same rules enforced by the DBMS and by the application? How effectively can an application enforce data-related rules? If the same rules are required by multiple applications, can the duplication of code be avoided? If there is already an integrity-enforcement mechanism within the DBMS, why reinvent the wheel?

The solution is simple.

Rely on nothing else to provide completeness and correctness except the database itself. By nothing, I mean neither users nor applications external to the database. While it may be true that current DBMS products may not succeed in enforcing all possible constraints, it is not sensible to let the application or user take over that responsibility.

You may ask why it is bad to rely on the application to enforce data integrity. Well, if there is only one application per database, then it is not really an issue. But usually databases act as central repositories of data and serve several applications, so the rules must be enforced across all of those applications; and the rules may change as well.

As a general guideline, databases are more than  mere data repositories; they are the source of rules associated with that data. Declare integrity constraints in the database where possible, for every rule that should be enforced. Use stored procedures and triggers only where declarative integrity enforcement via keys and constraints isn’t possible. Only application-specific rules need to be implemented via the application.
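As an illustration of this guideline, here is a sketch of business rules declared in the database itself. The tables, columns and rules are hypothetical examples, not taken from the article:

CREATE TABLE Accounts (
    account_nbr INT           NOT NULL PRIMARY KEY,
    balance     DECIMAL(12,2) NOT NULL
        CHECK ( balance >= 0 )                        -- domain rule
);

CREATE TABLE Transfers (
    transfer_id  INT           NOT NULL PRIMARY KEY,
    from_account INT           NOT NULL REFERENCES Accounts ( account_nbr ),
    to_account   INT           NOT NULL REFERENCES Accounts ( account_nbr ),
    amount       DECIMAL(12,2) NOT NULL CHECK ( amount > 0 ),
    CHECK ( from_account <> to_account )              -- row-level rule
);

Every application that uses these tables now inherits the same rules; none of them can bypass the checks, and a change to a rule is made in exactly one place.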

Application Tail wagging the Database Dog:

There is a growing trend in the developer community to treat the database as a mere component of the ‘application domain’. Often, tables are added as the application developer needs them, and columns are subsequently slapped in as an afterthought.

This is convenient because it avoids troublesome parts of the design process, such as requirements-gathering. Experience tells us that, in most enterprises, applications come and go, but databases usually stand for a long time. It therefore makes good sense to invest the effort in developing a good design based on the rules that are specific to the business segment in context (Teorey, 1994).

(5) Misusing Data values as Data Elements

Let’s just clarify something before proceeding further: a ‘data value’ here refers to the value of an attribute of an entity; a ‘data element’ refers to a unit of metadata, such as a column name or a table name. By misusing data values as data elements I mean the practice of splitting the attribute values of an entity and representing them across several columns or tables. Joe Celko calls it exactly that -- 'attribute splitting' (Celko, 2005).

Consider a table that represents the sales figures of some salesmen that work for a company. Let’s assume that the following design is adopted so as to make it easier to retrieve the data in order to display it:

Figure 9

You’ll see here that  a single attribute in the business model, the ‘sales amount’, is represented as a series of columns. This  makes life harder for almost everyone using such a scheme.

Now, what would make such a design undesirable?

  • The duplication of constraints is going to cause problems. Any constraints that apply to monthly sales will have to be defined for each individual column.
  • Without altering the table, you cannot add the sales for a new month. One poor alternative is to have the columns for all possible monthly sales and use NULLs for the months with no sales.
  • And finally, there is the difficulty in expressing relatively simple queries, like comparing sales among sales persons or finding the best monthly sales. 

By the way, many people consider this to be a violation of first normal form. This is a misunderstanding, since there are no multi-valued columns here (Pascal, 2005). For a detailed exposition, please refer to the Simple-Talk article Facts and Fallacies about First Normal Form.

The ideal way to design this table would be something along the lines of:

           

Fig 10

Of course, you can have a separate table for the sales persons and then reference it using a foreign key, preferably with a simple surrogate key such as sales_person_id, as shown above.
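A sketch along the lines of Fig 10 might look like this; the names and data types are assumptions, since the figure itself is not reproduced here:

CREATE TABLE SalesPersons (
    sales_person_id INT         NOT NULL PRIMARY KEY,
    sales_person    VARCHAR(50) NOT NULL
);

CREATE TABLE SalesData (
    sales_person_id INT           NOT NULL
        REFERENCES SalesPersons ( sales_person_id ),
    sales_month     CHAR(3)       NOT NULL,             -- 'jan', 'feb', ...
    sales_amount    DECIMAL(12,2) NOT NULL
        CHECK ( sales_amount >= 0 ),
    CONSTRAINT PK_SalesData PRIMARY KEY ( sales_person_id, sales_month )
);

A single sales_amount column now carries one constraint, a new month needs no ALTER TABLE, and comparisons across months become ordinary queries.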

If you are stuck with a table designed as in Fig 9, you can produce a normalized result set in a few different ways:

1. Use a UNION query:

SELECT sales_person, 'jan' AS "month", jan_sales AS "sales"
  FROM salesdata
 UNION ALL
SELECT sales_person, 'feb', feb_sales
  FROM salesdata
 UNION ALL
SELECT sales_person, 'mar', mar_sales
  FROM salesdata
 UNION ALL
SELECT sales_person, 'apr', apr_sales
  FROM salesdata ;

2. Use a JOIN to a derived table with the column names:

SELECT sales_person,
       m AS "month",
       CASE m WHEN 'jan' THEN jan_sales
              WHEN 'feb' THEN feb_sales
              WHEN 'mar' THEN mar_sales
              WHEN 'apr' THEN apr_sales
       END AS sales
  FROM salesdata
 CROSS JOIN ( SELECT 'jan' UNION
              SELECT 'feb' UNION
              SELECT 'mar' UNION
              SELECT 'apr' ) months ( m ) ;

3. Use UNPIVOT:

 

SELECT sales_person, "month", sales
  FROM ( SELECT sales_person,
                jan_sales, feb_sales, mar_sales, apr_sales, may_sales
           FROM salesdata ) s ( sales_person, jan, feb, mar, apr, may )
  UNPIVOT
      ( sales FOR "month" IN ( jan, feb, mar, apr, may ) ) m ;

As usual, you will have to test against the underlying tables, taking into account such things as the volume of data and the existing indexes, to determine which method is the most efficient.

The other variation of this approach is to split the attributes across tables, i.e. using data values as part of the table name itself. This is commonly done by having multiple tables that are similarly structured. Consider the following set of tables:

   

Figure 11

Here, the individual values of the attribute ‘month’ are embedded in the table names. This design shares similar shortcomings, such as the duplication of constraints and the difficulty of expressing simple queries. To be useful, the tables will have to be UNION-ed to form a single result set with an additional column representing the month. It would have been easier to start with a single base table.
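If you are stuck with such a split, one workaround is a view that UNIONs the monthly tables back into a single result set. The table and column names below are assumptions, since Figure 11 is not reproduced here:

CREATE VIEW AllSales ( sales_person, sales_month, sales_amount )
AS
SELECT sales_person, 'jan', sales_amount FROM JanSales
 UNION ALL
SELECT sales_person, 'feb', sales_amount FROM FebSales
 UNION ALL
SELECT sales_person, 'mar', sales_amount FROM MarSales ;

The view only papers over the problem, though: the constraints are still duplicated in every monthly table, so a single base table remains the better design.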

We should be careful not to confuse attribute splitting, which is a logical design problem, with table partitioning, a data reorganization process done at the physical level that creates smaller subsets of data from a large table or index in an attempt to manage and access them efficiently.

Just as a side note, this problem has been discussed heavily by relational theorists, specifically with respect to the limitations it imposes on view updates. Some have identified it as a direct violation of the Information Principle (a relational principle that requires the representation of all data in a database solely as values in tables) and recommended that no two tables in a database should have overlapping meanings. Originally defined as the New Design Principle, this recommendation that each table should have a single meaning or predicate is currently known as the Principle of Orthogonal Design in relational literature (Date & McGoveran, 1995).

Conclusion

It is always worth investing time in modelling a sound database schema. Not only does it give you an easily accessible and maintainable schema, it also saves you from having to patch up the holes periodically. Database designers often look for shortcuts in an attempt to save time and effort. Regardless of how fancy they look, most kludges are short-lived and end up costing more time, effort and expense. A good, clean design may require no more than following a few simple rules, and it is the resolve to follow those rules that characterizes genuine database professionals.

References

  1. Celko, J. (2005). SQL Programming Style. San Francisco, CA: Morgan Kaufman Publishers.
  2. Date, C. J., & McGoveran, D. (1995). The Principle of Orthogonal Design. In C. J. Date, & D. McGoveran, Relational Database Writings 1991-1994. Addison Wesley.
  3. Nadkarni, P. M. (2002). An Introduction to EAV systems. National GCRC Meeting. Baltimore, MD.
  4. Pascal, F. (2005, March). What First Normal Form really means and means not. Retrieved Oct 1, 2009, from Database Debunkings: http://www.dbdebunk.com/page/page/629796.htm
  5. Peterson, D. (2006, March 24). Lookup Table Madness. Retrieved September 26, 2009, from SQL Server Central: http://www.sqlservercentral.com/articles/Advanced/lookuptablemadness/1464/
  6. Teorey, T. J. (1994). Database Modeling & Design: The Fundamental Principles. Morgan Kaufmann.

 

 

Anith Sen

Author profile:

Anith S Larson specializes in data management primarily using SQL Server. From the mid 90s, he has been working on a variety of database design and systems development projects for clients primarily in wealth management and financial services industry. He resides in Lakeland, TN.




Subject: Good list
Posted by: Anonymous (not signed in)
Posted on: Sunday, October 18, 2009 at 3:14 PM
Message: Can this be tagged with another article here?
Ten Common Database Design Mistakes


Subject: There is Intelligent Life on the Planet
Posted by: BuggyFunBunny (view profile)
Posted on: Sunday, October 18, 2009 at 10:06 PM
Message: For what it's worth, I regard this article as evidence that not all of the Younger Generation is addled. What I don't understand is how he manages to be employed whilst voicing such obviously subversive and anti-social views. Does he not know that the java/C# wonks have decided how data should be slave to code?

By the bye, I most recently worked in Financial Services, the industry named in the author's bio bits. If ever there is an industry waddling in COBOL file oriented systems, that is it. Which makes his view ever more heretical. Bravo.

Subject: EAVs
Posted by: Bruce Pierson (not signed in)
Posted on: Tuesday, October 20, 2009 at 12:51 AM
Message: EAVs are absolutely necessary in domains such as product / attribute configuration, where you need to use a common database to store attribute information about disparate products.

Your employee example is somewhat contrived and I would agree it makes no sense in that context, but EAVs do serve a purpose.

Subject: A case for EAVS
Posted by: David Kelly (not signed in)
Posted on: Tuesday, October 20, 2009 at 3:47 AM
Message: Although I agree with the broad strokes of your argument that a database SHOULD be designed based on clearly defined requirements AND that an application programmer should NOT simply add tables and columns ad hoc, there is a case (IMHO) where a light version of an EAV is required.

I built a framework for programming web applications back in 2000 and had reasonable success with clients adopting it for their web applications; however, over time I added new features for new clients but needed to ensure that the framework and underlying database design would remain backwards compatible. It is now 2009 and the database design has moved on somewhat, however I still find that having an EAV table (which was there since day one) has ensured I can very quickly tailor the generic design to customer-specific requirements and STILL support all the legacy clients who have opted NOT to pay for costly upgrades. So my ten cents' worth is that there is definitely a case for EAVs, but that they should be used with extreme caution and only when there is a compelling legacy/installed base/business case for them.

Subject: on the other hand
Posted by: Sam (view profile)
Posted on: Tuesday, October 20, 2009 at 5:05 AM
Message: It would have been nice if he had mentioned unnecessary normalisation:
nothing is more annoying than a single column normalized into a separate table, when the column is not shared between anything and, for logical reasons, will never be shared.

Quite a few "by the book" database designers create these fallacies, which are a PITA and error-prone to work with.

A database needs just the right normalization. Not too little, neither too much. Both is wrong.

Subject: EAV's
Posted by: BSC Solutions (not signed in)
Posted on: Tuesday, October 20, 2009 at 6:53 AM
Message: On the whole, this is a fantastic article and should be read by all upcoming / trainee / graduate developers as a series of initial guidelines.

However, EAV tables can be used in LIMITED CIRCUMSTANCES. The product attributes example is a fantastic real-world example....

...another I have used is in managing control / environment variables used by interfaces that do similar jobs but to different systems, where their configuration values depend on what system is being used. Implementing an interface_type column into the EAV structure allows the (in my case) Payment interfaces to share a configuration table, without having lots of empty / defaulted columns against a single row

Simon

Subject: EAVs
Posted by: Phil Factor (view profile)
Posted on: Tuesday, October 20, 2009 at 7:30 AM
Message: The EAV is a particular solution to a fairly common problem. It isn't the only solution, though.

I have to agree with Anith and the many others who warn about the EAV solution. From my own experience, I'd say it is fine for prototyping systems, but it doesn't scale well and doesn't lend itself to larger applications. Constraints are a nightmare.

See Joe Celko's Avoiding the EAV of destruction which is also well-argued. A telling comment at the end of the article says 'EAV structures are like drugs: in small quantities and used in the proper circumstances, they can be beneficial. However, too much will kill you.'

Just to read about what can happen when this approach is taken to its ultimate extent, read Bad Carma


Subject: GUIDs and beware EAVs
Posted by: Joe Suchy (not signed in)
Posted on: Tuesday, October 20, 2009 at 8:53 AM
Message: Great Article! It would have been good to include using externally created primary keys, like GUIDs, in cases where it's not necessary and just lazy programming.

I have had to use EAVs a few times, but I agree with the comments that they should be avoided if possible. Storing everything generically, or even having different column types and only storing in one of the columns, just sets up problems down the line. I admit that most applications have some level of random data that needs to be stored, and EAVs are one solution, but I would suggest keeping them as "thin" as possible.

Again, great article!

Subject: Constraints
Posted by: Nico van Niekerk (not signed in)
Posted on: Tuesday, October 20, 2009 at 8:56 AM
Message: I never use check constraints but rather put them in a separate table. That may irk some saying it's over normalization, but the reason is more maintenance-driven than rules-driven. It also keeps the application developer's paws off the database and the process standard, which it is not if some constraints are in the database while others are in separate tables. The rationale is not consistent in those cases and if the rationale shifts costs go up and errors creep in.

And, with a separate table, a user interface can be introduced that allows user-admins to maintain their constraints and application developers to maintain theirs. Nobody has to have access to the database to do that.

Narrow tables with only a few rows are cached into memory anyway, so there is hardly any performance penalty, if any at all.

Nice article, Anith.

Subject: Must read for those with their head in a cloud
Posted by: Brian Nalewajek (view profile)
Posted on: Tuesday, October 20, 2009 at 9:07 AM
Message: Excellent article.

I recently participated in a discussion during an INETA developer's group presentation. Though the presenter (highly respected developer/consultant), did a fine job of extolling the virtues of Azure Storage, we amiably disagreed about the appropriateness of using non-relational data structures for all application needs. Several of the objections I raised are reflected in Sen's article.

Perhaps most telling is the common perception by developers that data integrity is best enforced outside of the schema and DBMS - passing the responsibility to the application developers. As Sen points out, there are nearly always many applications (and so app developers) per data structure. That means a lot more people, each often without knowledge of what the other developers are doing, have to protect the integrity of the data - which is often not a high priority.

Sen's article commented on bad relational design practices. In all cases, the use of non-relational structures amounts to the same thing (though it is more insidious).

With the major push to move applications to the cloud, all stakeholders should consider Sen's observations. Azure Storage will have its uses; but it's not the right structure for all purposes.



Subject: On the EAVe of Destruction
Posted by: BuggyFunBunny (view profile)
Posted on: Tuesday, October 20, 2009 at 9:46 AM
Message: Since this has been taken up by many... In that Financial Services position, I was tasked to make an Allowable Values support for a revised system (mostly the COBOL VSAM files moved to DB2 tables with little normalization) with a java servlet front end. The users demanded the ability to define and maintain these pick lists, which was their view of the problem. Data integrity wasn't considered part of the situation, both users and coders having come from the Green Screen world for the previous three decades.

I was able to extract from them the fact that there were a fixed number of attributes which constrained the values for columns, and thus built a 5NF set of tables; transitive closure with fixed depth (Celko covers this; I suppose others do too).

Fast, secure, and user maintainable. More importantly, the users/coders were forced to explain how the pick lists were to be constrained. They weren't allowed to pull stuff from out of their sphincters whenever it pleased them. All in all, they were irritated; and I've left Financial Services behind. What a bunch of Cro Magnons.

What wasn't possible, since these folks didn't know any better and refused to listen, is that constraints and even triggers, with a bit of UI, can be user maintained. After all, these are just text loaded into tables. In most cases, user maintained check constraints would have been better, but they were new to DB2. Sometimes half a loaf is more than one can get.

Subject: Join columns
Posted by: G (not signed in)
Posted on: Tuesday, October 20, 2009 at 5:38 PM
Message: Example 5 would be better if the new salesdata table had salesperson_id, instead of "salesperson". What happens in your table when your top earner changes their name? You have to update your sales data table; updating 1 row in the employee table is far better than updating many rows in the sales data table.

Subject: Your EAV sample is a straw man
Posted by: Anonymous (not signed in)
Posted on: Wednesday, October 21, 2009 at 10:29 AM
Message: There are some cases where EAVs are truly needed, basically when you don't know (or can't know) in advance what attributes you need to keep track of.

And your query is inefficient and unnecessarily complicated. Here you are, I re-wrote it: it's more efficient, simpler, easier to understand and demonstrates querying an EAV table need not be such a monstrosity.

Something else, you should *always* use ANSI formatting for dates in a query. Remember only a world minority use month/day/year format (AFAIK only the USA and some Canadians) so on localized servers you'll soon run into problems by not using it.

SELECT FN.value AS first_name, LN.value AS last_name
FROM Employees A
INNER JOIN (
SELECT emp_nbr
FROM EmployeeValues
WHERE emp_property = 'date_of_birth'
AND CAST( value AS DATETIME ) > '19501231'
) Sub
ON A.emp_nbr = Sub.emp_nbr
AND A.emp_property = 'first_name'
INNER JOIN Employees B
ON B.emp_nbr = Sub.emp_nbr
AND B.emp_property = 'last_name'

Subject: I screwed my example
Posted by: Anonymous (not signed in)
Posted on: Wednesday, October 21, 2009 at 11:12 AM
Message: Sorry for the double post, it was unintentional.

And then I see I screwed my example, first line should read "A.value AS first_name, B.value AS last_name".

Sorry.

Subject: EAV.. Come Now.... Rethink It
Posted by: DS (not signed in)
Posted on: Wednesday, October 21, 2009 at 1:01 PM
Message: Above quote... " basically when you don't know (or can't know) in advance what attributes you need to keep track of." Are you joking? IMO there is a better way.

We have been using the concept of Qualifier and Qualifier values for 7 or 8 years now. Without needing to know what attributes are required by the user, a Qual Table(PK, NameType, Avail) and Qual Value table (PK, FK, Value, Avail) with a join table relating a "known" value, say ProductID, to unknown amounts of property attributes values (Product_FK, Qual_Value_FK).

This allows the user to add any attributes and as many values of that attribute as needed, and relate them to other data in any assortment they see fit, while allowing them control over their business model. They can make any Qualifier available at any time and make any of the Qualifier values available at any time as well. This gives the control to the user, which is where it belongs.

Example: Lighting store. If you have a Qual and Qual Value table schema, you don't need to know anything about lighting fixtures at all. All you need to do is code the UI to display the values selected for a particular product and give the user a way to add Qualifiers and Qual Values and relate them to their products. So, a lighting store adds FINISH as a Qual and then adds Brass, Chrome, Bronze, etc as Qual Values. They also add BULB TYPE as a Qual and add 60w, 40w, 75w, etc to the Qual Value table. So the Qual table now has two IDs with a Description and the Qual Value table has multiple values related to attribute descriptions. On a product form, you allow the user to relate any of these qualifiers and values to their products, filling the join table with the Product ID and the Qual Value ID.

The user now has full control over what attributes he wants to associate with their business model and whether they are still valid for use, without ever losing data integrity. Giving the user a form where they can now relate a product or products (or customer or vendor or or or...) to a Qual and one or more Qual Values is easy, and querying the results and displaying them is just as easy. Very relational and easy to query.

We have gone a step further and created a Qual Type Constraint and added the PK to the Qual Table, thus alleviating the need for separate Qual Tables for disparate data. Thus employees can have Quals, Vendors can have quals, Customers can have quals, products can have quals, etc etc etc. And the user has complete control over those quals and their associated values. You need to know nothing about what the customer wants to track about their own customers.

This essentially gives them the ability to add a field and related data to that field without having to call the DB manager to have it added, and call the programmer to write code to allow them to access that particular field's data. This also means that we don't have to use EAVs and still don't have to know all the attributes.







Subject: Common Table Designs
Posted by: Thomas (view profile)
Posted on: Wednesday, October 21, 2009 at 1:28 PM
Message: Common Table Design

IMO, there are stronger arguments against the common table design (CTD) than were provided. One problem that many theorists get into is that they lose sight of what happens in the real world. It is the equivalent of an argument between a physicist and an engineer. Let's look at your reasons:

1. Constraints. A valid argument but a relatively weak one. If I have two categories "Foo" and "Bar" both stored in the common table, there is no guarantee that a foreign key to "Foo" will only be "Foo" values unless we also add a foreign key for category. You still have a foreign key to the common table mind you, but just cannot guarantee that you will only get references to values of "Foo". In reality, it almost never matters because the only way this could have happened is if someone hacked the data from behind the scenes and even if they did, a simple update will fix the problem. You still have a foreign key to the common table, so there is still a guarantee of the value existing.

2. Forced to represent every data type as string. Incorrect. In every CTD I have seen, they always use a surrogate key (integer or guid). The "better" CTDs will force a unique constraint on the category and name of the item.

3. "Rigidity and complexity". Although not stated clearly at first, this, IMO, is *the* reason for avoiding CTD: additional attributes. The thinking that starts down the road of CTD is that you have a bunch of tables that are all structured with a surrogate key and name. Dozens in some cases. The problem comes when one of the entities requires an attribute that the others do not. The right way to solve this problem (assuming we accept the CTD to begin with) is to pull out that entity into its own table with that additional attribute. However, in reality, this rarely happens and you end up with entities that have completely unrelated attributes.

4. Physical implementation. If anything, this argues *for* a CTD rather than against it. The vast majority of these lookup values are relatively short lists. In all the CTD designs I have witnessed, no single entity had more than two or three hundred values. The largest CT I've seen had only a few thousand records, and that was a very large system. Further, these lists do not change often, and thus there are almost never contention issues.

CTD are to be avoided but IMO the primary reason to do so is that it creates havoc when one or more of your entities requires attributes (or check constraints) that the others do not.

Subject: EAV Design
Posted by: Thomas (view profile)
Posted on: Wednesday, October 21, 2009 at 1:34 PM
Message: As others have said, there is a place in the world for EAV designs. I wrote a multi-tenant application some time ago where the code base and database schema were shared amongst hundreds of clients even though the data for each customer was stored in separate databases. We often had requests from individual customers to store some additional pieces of data with their information. The cost to have a developer spec out the schema, adjust the version update code, add the entry fields to the GUI, test it, and deploy a new version would be extremely expensive in time and cost, given how often customers wanted to store an additional data point. The solution was an EAV structure that let customers store a wad of data with their other information. The rule was that you cannot sort on it, filter on it, nor request it appear in a specific spot in a report (i.e., there can be no SQL where you filter on a given attribute being equal to some specific value). The moment a user wanted to do any of those things, we made it a first-class column, with all that implied, and did a data conversion for that customer. The cost and time savings were enormous.

I had mentioned in a previous post's comments that EAV are like drugs: in small quantities and used in the right circumstances, they can be beneficial. Too much will kill you.

Subject: Great list!
Posted by: nina (view profile)
Posted on: Thursday, October 29, 2009 at 5:15 AM
Message: You article is very good and I will recommend it to everyone in my team to read!

Subject: Excellent read !
Posted by: marc_scheuner (view profile)
Posted on: Sunday, November 15, 2009 at 3:15 PM
Message: Should be a mandatory read for any developer about to touch a database! :-) I've seen all those atrocities - and more - in real life; just glad to see there are at least a few others out there who also believe these things ought to be avoided like the plague! (or swine flu)

Subject: EAV tables
Posted by: vrunda (view profile)
Posted on: Thursday, November 19, 2009 at 12:47 AM
Message: Is the EAV table approach good for user profiles of a social networking site which also supports GEDCOM import/export for a family tree?

 
