Who’s afraid of the big bad data typePublished 26 April 2013 1:46 pm
Data types seem so innocuous, but mistakes in their use can creep up on you like a flesh-eating zombie and destroy your project just when you least expect it. Dealing with strings seems so trivial, what could go wrong? An awful lot: if I wanted to write horror films for developers, collations and character sets would be there. However, to chill the marrow of any experienced developer, dates, times and money would all feature high in the cast list.
It surprises me that JSON is now widely accepted as a means of transferring data. As it stands, it is clever but incomplete. I’ve always wondered if Doug Crockford’s mum summoned him to dinner at the point that he was wondering what to do about handling dates, time-periods and money in JSON. Coleridge, after all, had to answer the doorbell and get a parcel from Mr Postie, a person from Porlock, just as he was transferring his wonderful poem ‘Kubla Kahn‘ from his head onto paper. When he returned, the rest of the poem had vanished from his memory.
Whatever actually happened with JSON, the current definition is of no use for data interchange. The only way of using it is to adopt the philosophy of the CSV file in letting the recipient figure out how you’ve represented money or dates. XML is fine if you use it properly since XSD does pretty well with its xs:date and xs:timezone. Dates in Ruby strings seem a cop-out, as they just aren’t human-readable.
This is part of a wider confusion about data. All data types have, as I see it, three fundamental forms. They have a culturally dependent form into which they are rendered, for the human eye. Taking our example of dates, we can represent them as ’29th April 2013 9:32 5:733s AM GMT’. We have a form for data interchange such as 2013-04-29T09:32:05.733 +00:00, and we have a numerical storage form, which depends on the database or application and date datatype that we select. Nowadays, if we’re being sensible, we always select a data interchange format based on ISO 8601 (my favorite flavor is RFC 3339). When we render dates, we use a special-purpose routine that knows how every culture prefers to see dates, in their time zone. That is generally a one-way trip for data.
Interchange formats aren’t enough, though. Applications need a storage form of data. When ‘Big Data’ has to use a data interchange format, such as JSON or XML, to store data, it is facing an inevitable performance headache. Sure, you can solve it with clever indexing strategies, but you’re always going to be dealing with a nightmare of conversions when reading all the data interchange formats during queries that, for example, aggregate date ranges.
Of course, when using JSON or XML as a data storage format, you also need more real estate to store your date that you would using a proper data storage format, three times more, in the example I’ve chosen. Perhaps that is why the data is always “big”.
Dates are simple compared with money. I could tell you stories about financial calculations gone awry that would bring tears to your eyes but space constraints constrain me from telling it to you here. I’m sure one or two readers would like to fill in the details, in the comments.
I’ll end on an optimistic note. James Newton-King of JSON.NET has worked hard to advance JSON to the point where it is, at least, a useable form for data interchange. JSON, as implemented in JSON.NET, now represents the state of the art. Use it with confidence and let’s hope it becomes the basis of a new standard. Even so, both XML and JSON are, like CSV, ways of interchanging data, not storing it.