This months T-SQL Tuesday is hosted by Aaron Bertrand and the topic is Dealer’s Choice. Aaron gives us the choice to blog about a topic of our choice, either about a passion outside of the SQL Server community, or about a T-SQL bad habit. In this blog post, I’ll tackle the second option, but I also have a post for option number one.
The bad habit I chose is about the representation of dates in a data warehouse (or any other database). In the course of my career I have seen many formats:
The worst are by far the string representation, as there is no actual check on the contents. It can literally contain everything. And is ’01/02/2018′ the first of February 2018 (like any sane person would read, because days come before months), or the 2nd of January? So if you have to store dates in your data warehouse, avoid strings at all costs. No excuses.
The integer representation – e.g. 20171208 – is really popular. If I recall Kimball correctly, he said it’s the one exception where you can use smart keys, aka surrogate keys that have a meaning embedded into them. I used them for quite some time, but I believe I have found a better alternative: using the actual date data type. Some of the reasons why:
- storage. An integer is 4 bytes, while a date is 3 bytes. In a 10 million row fact table with 4 dates, this is 40Mb less overhead.
- you can do actual date logic directly on the column. No need to join to the date dimension or do weird tricks with the integer.
- No more CONVERT(INT,CONVERT(CHAR(8),myDate,112)) to transform all the dates from the source system to integers.
- in the date dimension, you need a column with the date data type anyway, if you want to create a date table in SSAS Tabular.
People will resist this change of course, crying resentment using slogans like “but what about the dummies?”. Dummy records are artificial records in a dimensions that hold placeholder values like “Missing”, “Unknown” or “Not Applicable”. Typically they get values like (0,-1,-2) or (-999,-998,-997). Easy to handle with integers. But what about the dummies when using actual dates? Let’s try some DATEADD magic:
Problem solved. You can also use 9999-12-31 as a placeholder for “dates in the future”. Or you can use actual surrogate keys with no meaning, like you do for any other dimension. It makes debugging a bit harder though.
If you build your date table in Tabular, the table needs to have continuous values, meaning no gaps in the sequence, for the dates. So placeholders need to be filtered out, even if you use integers. In that case, just storing NULL for missing dates might be an alternative.
Conclusion: avoid using strings as dates in your database system. I hope with this blog post I have convinced you to use dates as surrogate keys for the date dimension instead of the smart integer.
------------------------------------------------
Do you like this blog post? You can thank me by buying me a beer 🙂
I loved your post especially the statement “like any sane person would read, because days come before months”…I never understood the format month/day/year.
Regarding Data Warehouses:
As far as I recall from my Kimball courses, smart integer dates are permissible as a tool for partitioning data. But you shouldn’t do any querying and conversion on them and rather join to the dimension table. Putting dates as key has some charm…however in my data warehouse I stick with integer keys. However as I am handling datetimes (with separate dimensions time and date) I carry along the actual datetime values in my fact table for reference.