The bad habit I chose is about the representation of dates in a data warehouse (or any other database). In the course of my career I have seen many formats:
The worst are by far the string representation, as there is no actual check on the contents. It can literally contain everything. And is ’01/02/2018′ the first of February 2018 (like any sane person would read, because days come before months), or the 2nd of January? So if you have to store dates in your data warehouse, avoid strings at all costs. No excuses.
The integer representation – e.g. 20171208 – is really popular. If I recall Kimball correctly, he said it’s the one exception where you can use smart keys, aka surrogate keys that have a meaning embedded into them. I used them for quite some time, but I believe I have found a better alternative: using the actual date data type. Some of the reasons why:
People will resist this change of course, crying resentment using slogans like “but what about the dummies?”. Dummy records are artificial records in a dimensions that hold placeholder values like “Missing”, “Unknown” or “Not Applicable”. Typically they get values like (0,-1,-2) or (-999,-998,-997). Easy to handle with integers. But what about the dummies when using actual dates? Let’s try some DATEADD magic:
Problem solved. You can also use 9999-12-31 as a placeholder for “dates in the future”. Or you can use actual surrogate keys with no meaning, like you do for any other dimension. It makes debugging a bit harder though.
If you build your date table in Tabular, the table needs to have continuous values, meaning no gaps in the sequence, for the dates. So placeholders need to be filtered out, even if you use integers. In that case, just storing NULL for missing dates might be an alternative.
Conclusion: avoid using strings as dates in your database system. I hope with this blog post I have convinced you to use dates as surrogate keys for the date dimension instead of the smart integer.
I'm hosting a free webinar at MSSQLTips.com at the 19th of December 2024, 6PM UTC.…
The slides and scripts for my session "Tackling the Gaps & Islands Problem with T-SQL…
Sometimes your Microsoft Entra ID account (formerly known as Azure Active Directory) is added as…
In Azure Data Factory (ADF, but also Synapse Pipelines and Fabric Pipelines), you have a…
I wrote a piece of SQL that had some new T-SQL syntax in it: IS…
I'm very excited to announce I've been selected as a speaker for dataMinds Connect 2024,…
View Comments
I loved your post especially the statement "like any sane person would read, because days come before months"...I never understood the format month/day/year.
Regarding Data Warehouses:
As far as I recall from my Kimball courses, smart integer dates are permissible as a tool for partitioning data. But you shouldn't do any querying and conversion on them and rather join to the dimension table. Putting dates as key has some charm...however in my data warehouse I stick with integer keys. However as I am handling datetimes (with separate dimensions time and date) I carry along the actual datetime values in my fact table for reference.