Simple-Talk columnist

Who’s afraid of the big bad data type

Published 26 April 2013 1:46 pm

Guest Editorial

Data types seem so innocuous, but mistakes in their use can creep up on you like a flesh-eating zombie and destroy your project just when you least expect it. Dealing with strings seems so trivial; what could go wrong? An awful lot: if I wanted to write horror films for developers, collations and character sets would be in the script, and, to chill the marrow of any experienced developer, dates, times and money would all feature high in the cast list.

It surprises me that JSON is now widely accepted as a means of transferring data. As it stands, it is clever but incomplete. I’ve always wondered if Doug Crockford’s mum summoned him to dinner at the point that he was wondering what to do about handling dates, time-periods and money in JSON. Coleridge, after all, had to answer the doorbell and get a parcel from Mr Postie, a person from Porlock, just as he was transferring his wonderful poem ‘Kubla Khan’ from his head onto paper. When he returned, the rest of the poem had vanished from his memory.

Whatever actually happened with JSON, the current definition is of no use for data interchange. The only way of using it is to adopt the philosophy of the CSV file in letting the recipient figure out how you’ve represented money or dates. XML is fine if you use it properly, since XSD does pretty well with its xs:date and xs:dateTime. Dates in Ruby strings seem a cop-out, as they just aren’t human-readable.

This is part of a wider confusion about data. All data types have, as I see it, three fundamental forms. They have a culturally dependent form into which they are rendered, for the human eye. Taking our example of dates, we can represent them as ‘29th April 2013 9:32:05.733 AM GMT’. We have a form for data interchange, such as 2013-04-29T09:32:05.733+00:00, and we have a numerical storage form, which depends on the database or application and the date datatype that we select. Nowadays, if we’re being sensible, we always select a data interchange format based on ISO 8601 (my favorite flavor is RFC 3339). When we render dates, we use a special-purpose routine that knows how every culture prefers to see dates, in their time zone. That is generally a one-way trip for the data.
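
The three forms can be sketched in a few lines of Python (a sketch only; the formatting choices here are mine, not anything mandated by ISO 8601 or RFC 3339):

```python
from datetime import datetime, timezone

# One instant, in its three forms.
instant = datetime(2013, 4, 29, 9, 32, 5, 733000, tzinfo=timezone.utc)

# 1. Culturally dependent rendering, for the human eye (a one-way trip;
#    real applications would use a locale-aware library for this step).
rendered = instant.strftime("%d %B %Y %I:%M:%S %p GMT")  # e.g. 29 April 2013 09:32:05 AM GMT

# 2. Data interchange form: ISO 8601 / RFC 3339.
interchange = instant.isoformat(timespec="milliseconds")

# 3. Numerical storage form: here, seconds since the Unix epoch.
stored = instant.timestamp()

print(rendered)
print(interchange)  # 2013-04-29T09:32:05.733+00:00
print(stored)
```

Note that only the interchange form survives a round trip between systems; the rendered form is for eyes, and the storage form is private to the database or application.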

Interchange formats aren’t enough, though. Applications need a storage form of data. When ‘Big Data’ has to use a data interchange format, such as JSON or XML, to store data, it is facing an inevitable performance headache. Sure, you can solve it with clever indexing strategies, but you’re always going to be dealing with a nightmare of conversions when reading all the data interchange formats during queries that, for example, aggregate date ranges.

Of course, when using JSON or XML as a data storage format, you also need more real estate to store your date than you would using a proper data storage format; three times more, in the example I’ve chosen. Perhaps that is why the data is always “big”.
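
A quick way to see the size difference (a sketch: I’m assuming a 64-bit millisecond timestamp as the ‘proper’ storage format, before you even count the JSON or XML markup around the string):

```python
import struct

iso = "2013-04-29T09:32:05.733+00:00"      # data interchange form
packed = struct.pack("<q", 1367227925733)  # storage form: 64-bit ms since the Unix epoch

print(len(iso.encode("utf-8")), "bytes as an ISO 8601 string")  # 29 bytes
print(len(packed), "bytes as a 64-bit integer")                 # 8 bytes
```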

Dates are simple compared with money. I could tell you stories about financial calculations gone awry that would bring tears to your eyes, but space constraints prevent me from telling them here. I’m sure one or two readers would like to fill in the details, in the comments.

I’ll end on an optimistic note. James Newton-King of JSON.NET has worked hard to advance JSON to the point where it is, at least, a usable form for data interchange. JSON, as implemented in JSON.NET, now represents the state of the art. Use it with confidence and let’s hope it becomes the basis of a new standard. Even so, both XML and JSON are, like CSV, ways of interchanging data, not storing it.

10 Responses to “Who’s afraid of the big bad data type”

  1. Joe Celko says:

    You forgot about the NULLs that every SQL data type has to support! And then we have all the flavors of NaN in IEEE floating point numbers. Data is more complex than people first think.

  2. BuggyFunBunny says:

    – Even so, both XML and JSON are, like CSV, ways of interchanging data, not storing it.

    Not necessarily XML (JSON, perhaps), even for interchange.

    “The fact is that in order for any data interchange to work, the parties must first agree on what data will be exchanged — semantics — and once they do that, there is no need to repeat the tags in each and every record/document being transmitted. Any agreed-upon delimited format will do, and the criterion here is efficiency, on which XML fares rather poorly…”
    – Fabian Pascal/2005

  3. Phil Factor says:

    I was being charitable, of course. JSON was once actually used by Microsoft within the Backup Package (BacPac) for providing a ‘backup’ for SQL Azure. All the data within the tables was stored as JSON. I’d have used a properly-implemented CSV if there was a good reason for not using ‘native’ format. It would take less than half the space.
    Nulls, sir? I’m a C. J. Date man! (Good to hear from you, Joe)

  4. Pedro says:

    On the money headaches: they start right where the separators depend on the culture. Some cultures use the comma (,) and some the period (.) as the decimal mark and thousands separator.
    E.g. Portugal’s 1.000,00 is, in the UK culture, 1,000.00.
    See ‘decimal mark’; note also the section on the thousands separator.
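
    A sketch of the trap Pedro describes (the function is illustrative, not from any library; real code should use locale-aware parsing rather than string surgery):

```python
def parse_decimal(text, decimal_sep=",", group_sep="."):
    """Parse a culture-formatted number, given KNOWN separators.
    Guessing the separators from the text itself is exactly the bug."""
    return float(text.replace(group_sep, "").replace(decimal_sep, "."))

# Portuguese formatting: '.' groups thousands, ',' marks decimals.
print(parse_decimal("1.000,00"))                                  # 1000.0
# UK formatting: the separators swap roles.
print(parse_decimal("1,000.00", decimal_sep=".", group_sep=","))  # 1000.0
# Feed either string to the wrong culture's parser and you are off
# by a factor of a thousand, silently.
```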


  5. paschott says:

    In regards to money, we were bitten by Switzerland’s format (in some really, really old code) several times, as a single quote is used as their thousands separator. :(

    We use JSON to send some data through our API that wouldn’t work otherwise. We use XML for working with a handful of our vendors. I tend to agree that setting up a format that works for all is usually the best way to handle it, but XML does have the advantage that you can tell if something is off relatively easily as well as some legibility in many cases. It has the disadvantage of being really big for what needs to be sent, though.

    We did find (again, after the fact) that storing multiple XML columns in a table is not a great idea when it comes to index maintenance. Easier to handle for the developers, but now causing some longer-running maintenance jobs due to the XML insert, then an XML update into a second column. The stats and indexes are getting out of control so re-indexing takes longer. (All this on a 2005 machine so no fancy new features for us yet.)

    The “bacpac” idea is very intriguing, but I’d really like to see that expanded into a way to transfer partial data updates/inserts/deletes in order to be more useful. That’s one part that MS seems to be missing in the move to SSDT SQL Projects. We can easily push schema, but have to script all data changes.

  6. edokan says:

    In the end, it always comes down to the not-so-famous Turkey Test.

    Please try to remain calm when you read the articles at the following links. They are not recommended for programmers with heart conditions :)

    Does Your Code Pass The Turkey Test?
    Falsehoods programmers believe about time

    More falsehoods programmers believe about time; “wisdom of the crowd” edition

    Falsehoods Programmers Believe About Names

  7. regbac says:

    Great article. BacPac and DacPac use BCP in the current version, though, not JSON any longer.

  8. Phil Factor says:

    Oops. Didn’t know that. I’ve updated my previous comment to correct it.

  9. Michael Sorens says:

    I have to take you to task for violating a web usability rule here: the reader should be able to tell from a hyperlinked phrase what that hyperlink will reveal before clicking it. Alas, “clever but incomplete” does not reveal anything (directly) about cleverness or completeness.

    That said, nice editorial, Phil; I always find your perspective enlightening!

  10. rupertsland says:

    I am a scientist designing and building for a client a multi-project environmental (biological/ecological) database in SQL Server.

    The biggest headache for me is the data types for date and time. In the field, when data is collected, it is not always possible to get the complete date and/or time. One is then left with, say, the month and year, but no known day. What does one do then? What I’ve done is break up the date into separate columns: one column for the year, another to store the month number, and a third to store the day of month. Yes, it becomes a chore to concatenate these columns back into a readable date in a query.
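
    The separate-columns workaround might look something like this sketch (the class and field names are illustrative only, not from the commenter’s actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartialDate:
    """One nullable column per component: day, or day and month,
    may be unknown at collection time."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def render(self) -> str:
        # Reassemble whatever components were recorded in the field.
        parts = [f"{self.year:04d}"]
        if self.month is not None:
            parts.append(f"{self.month:02d}")
            if self.day is not None:
                parts.append(f"{self.day:02d}")
        return "-".join(parts)

print(PartialDate(2013, 4).render())      # 2013-04
print(PartialDate(2013, 4, 29).render())  # 2013-04-29
```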

    Another headache is that existing GIS software can mess up the dates, especially when using non-enterprise software. The dates are simply omitted when shape files are created. We can display sampling locations on a map, but the date is excluded. Or, fork out 60K for the enterprise software.

    An even nastier problem for us scientists is the fact that no database on earth lets us store variable-precision decimal values that preserve their significant digits. The best example would be water chemistry analysis involving many parameters. For one observation, we may have to store a value of, say, 1.00. The next observation may be 20.2340, and so on. One is forced to store the data as text. Not pretty when you need to do calculations later. Frequently, we CANNOT use fixed decimal places. Does anyone have a workaround, besides using varchar? I wish Microsoft or Oracle would create a new “varDecimal” data type to solve this nasty problem once and for all.
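
    For what it’s worth, an arbitrary-precision decimal type such as Python’s Decimal does keep trailing zeros, which is one workaround at the application layer (a sketch only; it doesn’t solve the storage-column problem inside the database itself):

```python
from decimal import Decimal

# Decimal preserves the trailing zeros that float and fixed-scale
# columns throw away, so 1.00 (three significant digits) stays 1.00.
readings = [Decimal("1.00"), Decimal("20.2340")]

for r in readings:
    print(r, "->", -r.as_tuple().exponent, "decimal places recorded")

# Arithmetic still works, unlike varchar storage.
print(sum(readings))  # 21.2340
```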

    I enjoyed reading Phil’s blog.
