In the first of a series of articles on the tricks of tackling temporal data in SQL, Joe Celko discusses SQL's temporal data types and agonizes over the fact that, although there are ANSI/ISO Standards for temporal operations in SQL, every vendor has something different. He explains the mysteries of such things as time-zones, lawful time, UTC, CUT, GMT, CE, DST, and EST.
Of the major programming languages, SQL is the only one with built-in temporal data types. In fact, I cannot think of anything else other than spreadsheet languages which have temporal data types! Most SQL Products had some kind of temporal data types before the first official versions in the SQL-92 Standards were defined. Considering how important dates are for commercial work, you would have thought that COBOL would have something by now. We even had the "Y2K Crisis" at the start of this decade, with tourist groups of books and articles about the problems of keeping only two-digit years in dates in computer data. I was one of the first people to write about it back in 1981 (INFORMATION SYSTEMS NEWS, Feb 09, "Father Time Software Secrets Allows Updating of Dates") and even got quoted in VANITY FAIR.
The Common Era
The Common Era calendar is irregular and the only standard unit of time is the second; years, months, weeks, hours, minutes and so forth are not part of the metric system, but are mentioned in the ISO standards as conventions.
Note: the old abbreviations A.D. (Anno Domini - Latin for "in the year of Our Lord") and B.C. ("Before Christ") have been replaced by CE for "Common Era" and BCE for "Before Common Era" in ISO Standard to avoid religious references. Strangely, the Church of Jesus Christ of Latter-day Saints ("Mormons") are the only religious group I know who use CE and BCE in their literature.
Today, you want to set up a program to convert your data to conform to ISO-8601 "Data Elements and Interchange Formats - Information Interchange - Representation of Dates and Times" as a corporate standard and EDIFACT for EDI messages. This is the "yyyy-mm-dd" format that is part of Standard SQL and will become part of other standard programming languages as they add temporal data types.
The full ISO-8601 timestamp can be either a local time or UTC time. UTC is the code for "Universal Coordinated Time," which replaced GMT (which has not been a Standard for decades), which was the abbreviation for "Greenwich Mean Time" (if you listen to CNN, you are used to hearing the term UTC, but if you listen to BBC radio, you are used to the term GMT).
In 1970 the Coordinated Universal Time system was devised by an international advisory group of technical experts within the International Telecommunication Union (ITU). The ITU felt it was best to designate a single abbreviation for use in all languages in order to minimize confusion. The two alternative original abbreviation proposals for the "Universal Coordinated Time" were CUT (English: Coordinated Universal Time) and TUC (French: temps universel Coordinne). UTC was selected both as a compromise between the French and English proposals and because the C at the end looks more like an index in UT0, UT1, UT2 and a mathematical-style notation is always the most international approach.
Technically, Universal Coordinated Time is not quite the same thing as the old Greenwich Mean Time. GMT is a 24-hour astronomical time system based on the local time at Greenwich, England. GMT can be considered equivalent to Universal Coordinated Time when fractions of a second are not important. However, by international agreement, the term UTC is recommended for all general time keeping applications and use of the term GMT is discouraged.
A date without a time
zone is ambiguous in
a distributed system.
A transaction created
1995-12-17 in London
may be younger than a
1995-12-16 in Boston
.Another problem is that besides having time zones, we also have "lawful time" to worry about. This is the technical term for time required by law for commerce. Usually, this means whether or not you use daylight saving time. On 2007-12-09 President Hugo Chavez of Venezuela decided to create his own unique time zone, putting the clock back half-an-hour on a permanent basis. This unnecessary and done simply to put his country in a different time zone from the United States. Similar political crap has been done before, but it usually involves sets Daylight Saving Time (DST) changes in a year.
The need for UTC time in the database and lawful time for display and input has not been generally handled yet. EDI and replicated databases must use UTC time to compare timestamps. A date without a time zone is ambiguous in a distributed system. A transaction created 1995-12-17 in London may be younger than a transaction created 1995-12-16 in Boston.
Standard SQL Temporal Data Types
Standard SQL has a very complete description of its temporal data types. There are rules for converting from numeric and character strings into these data types, and there is a schema table for global time-zone information that is used to make sure that temporal data types are synchronized. It is very complete and elaborate, so SQL Server has not implemented it yet, while DB2 is on the mark and Oracle has its own weirdnesses. As an international standard, Standard SQL has to handle time for the whole world and most of us work with only local time.
The common terms and conventions related to time are also confusing. We talk about "an hour" and use the term to mean a particular point within the cycle of a day ("The train arrives at 13:00 Hrs") or to mean an interval of time not connected to another unit of measurement ("The train takes three hours to get there"); the number of days in a month is not uniform; the number of days in a year is not uniform; weeks are not related to months; and so on.
We need to have a DATE data type for commercial work. Having TIME and a TIMESTAMP data types is very handy, too. Standard SQL has a set of date and time (DATE, TIME and TIMESTAMP) and INTERVALs (DAY, HOUR, MINUTE and SECOND with decimal fraction) data types. Both of these groups are temporal data types, but datetimes represent points in the time line, while the interval data types are durations of time. Standard SQL also has a full set of operators for these data types.
Before SQL Server 2008, handling these things was awful. SQL Server was based on the first UNIX system clock, where a floating point number kept "date stuff" in the front of the number and the "clock stuff" in the back. All of this was based on counting physical hardware clock ticks in one machine. Today, you have wristwatches and alarm clocks that adjust themselves with the NIST radio time signal. As a general statement, there are two ways of representing temporal data internally. The "Unix representation" is based on keeping a single long integer or a word of 64 or more bits that counts the computer clock ticks from a base starting date and time. The other representation I will call the "COBOL method", since it uses separate fields for the year, month, day, hours, minutes and seconds.
The Unix method is very good for calculations, but the engine must convert from the external formats and the internal format and vice versa. The COBOL format is the opposite; good for display purposes, but weaker on calculations.
For example, to reduce a TIMESTAMP to just a date with the clock set to 00:00 Hrs in SQL Server, you can take advantage of their internal representation and write:
Likewise, the following day can be found with this expression:
Display and Internal Date Formats
The ISO ordinal date formats are described in ISO-2711-1973. Their format is a four-digit year, followed by a digit day within the year (001-366). The year can be truncated to the year within the century. The ANSI date formats are described in ANSI X3.30-1971. Their formats include the ISO standard, but add a four-digit year, followed by the two-digit month (01-12), followed by the two-digit day within month (01-31). This option is called the calendar date format. Standard SQL only uses the "yyyy-mm-dd" format, period. The ordinal format is better for date arithmetic; the calendar format is better for display purposes. We will get back to that when we discuss the Calendar table.
Many programs still use a year-in-century date format of some kind. This was supposed to save space in the old days when that sort of thing mattered (i.e., when punch cards had only 80 columns). Programmers assumed that they would not need to tell the difference between the years 1900 and 2000 because they were too far apart. Old COBOL programs that did date arithmetic on these formats returned erroneous negative results. If COBOL had a DATE data type, instead of making the programmers write their own routines, this would not have happened. Relational database users and 4GL programmers can gloat over this, since they have DATE data types built into their products.
Most small databases live and work in one time zone. If you have a database that covers more than one time zone, you might consider storing time in UTC and adding a numeric column to hold the local time-zone offset. The time zones start at UTC, which has an offset of zero. This is how the system-level time-zone table in Standard SQL is defined. There are also ISO standard three-letter codes for the time zones of the world, such as EST, for Eastern Standard Time, in the United States. The offset is usually a positive or negative number of hours, but there are some odd zones that differed by 15 minutes from the expected pattern, which were removed in 1998.
Now you have to factor in Daylight Saving Time on top of that to get what is call "lawful time" which it is the basis for legal agreements. The US government uses DST on federal lands located inside of states that do not use DST. If the hardware clock in the computer in which the database resides is the source of the timestamps, you can get a mix of gaps and duplicate times over a year. This is why Standard SQL uses UTC internally.
You should use a "24-hour" time format, which is less prone to errors than 12-hour (AM/PM) time, since it is less likely to be misread or miswritten. This format can be manually sorted more easily and is less prone to computational errors. Americans use a colon as a field separator between hours, minutes and seconds; some Europeans use a period (this is not a problem for them, since they also use a comma for a decimal point).
One of the major problems with time is that there are three kinds:
- fixed events ("He arrives at 13:00 Hrs"),
- durations ("The trip takes three hours")
- Intervals ("The train leaves at 10:00 Hrs and arrives at 13:00 Hrs")-
....which are all interrelated.
Standard SQL introduces an INTERVAL data type that does not explicitly exist in most current implementations (Rdb, from DEC And now Oracle Corporation, is an exception). An INTERVAL is a unit of duration of time rather than a fixed point in time-days, hours, minutes, and seconds (FIPS-127 requires five decimal places, much beyond SQL Server's original ability).
There are two classes of intervals. One class, called year-month intervals, has an express or implied precision that includes no fields other than YEAR and MONTH, though it is not necessary to use both. The other class, called day-time intervals, has an express or implied interval precision that can include any fields other than YEAR or MONTH -- that is, DAY, HOUR, MINUTE and SECOND (with decimal places).
Where to Learn About Temporal SQL
- Richard T. Snodgrass is a professor at the University of Arizona (http://www.cs.arizona.edu/people/rts/) whose whole career is based on temporal databases. You can download a copy of his Morgan-Kaufmann book in (Developing Time-Oriented Database Applications in SQL) in PDF and get the SQL code at the university website.
- Tom Johnston also did a series of about 30 articles on temporal SQL at http://www.dmreview.com which I hope will appear in a book in the near future.
Next Installment: Calendar Table tricks