It’s T-SQL Tuesday time again, and this month the honor of hosting the blog party goes to Brent Ozar. The topic is data types:
Mad about money? Feverish about float? Tell us what you use your favorite data type for, and what people need to know before they get started using it. Data types have so many interesting edge cases, like when Aaron Bertrand taught us to use lower case for our data types. (I still don’t do that, but for the record, I do feel guilty.)
I’m going to write about the data types date and time. When I started working with SQL Server, there was only the datetime data type, which usually covered our needs. Unless you needed precision finer than about 3 milliseconds, that is. In SQL Server 2008, a new set of data types for handling date and time information was introduced. I’m singling out date and time because they are very useful in a data warehousing context. A couple of reasons:
Back in the old days, there was no streaming data. Data was refreshed during a long batch load during the night, which means you had data at the daily level. Whoa, that exactly corresponds with the date data type! What a coincidence. Most of the time, reports and analysis are done at the daily level and then aggregated up to a higher level such as week, month, quarter or year. So you don’t need the time portion.
In some cases, you do need the time portion. For example, if you’re interested in whether people bought your products in the morning or in the afternoon. Typically I model my transaction data using three columns: one with the date (linked to the date dimension), one with the time (linked to a time dimension, which only contains the valid hours/minutes/seconds – whatever your grain is – of the day) and optionally one final column containing the original timestamp. The last column can be used for drawing detailed line charts, for example, or for calculating intervals if you have multiple timestamps. Because date and time are separate and linked to their own dimensions, analysis is easy.
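To make this concrete, here’s a minimal sketch of what such a fact table could look like (the table name, column names and grain are just assumptions for illustration):

```sql
-- Hypothetical fact table: date and time stored separately, each linked
-- to its own dimension, plus the original timestamp for detailed analysis.
CREATE TABLE dbo.FactSales
(
    OrderDate      date          NOT NULL, -- links to the date dimension
    OrderTime      time(0)       NOT NULL, -- links to the time dimension (second grain)
    OrderTimestamp datetime2(3)  NULL,     -- original timestamp, e.g. for line charts or intervals
    CustomerKey    int           NOT NULL,
    SalesAmount    decimal(10,2) NOT NULL
);
```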
Which brings me to my next point: separate dimensions for date and time. Please, do not create one single dimension for both. A date dimension at the hour level, for example, contains 87,600 rows for 10 years, while at the daily level it only contains about 3,650 rows (leap days not included). This doesn’t seem like much, but if you have a large SSAS Tabular model with a lot of role-playing date dimensions (order date, due date, ship date …), this starts to add up. Typically I use the date and time columns as the actual surrogate keys of the dimensions. Yes, I use an actual date instead of the “smart surrogate key” yyyymmdd. A date takes only 3 bytes (one less than an integer) and you can directly do some date logic on the date column in the fact table if you need to (for debugging purposes, for example). In SQL Server itself, you can still add dummy dates like ‘1900-01-01’ or ‘9999-12-31’. However, if you load your date dimension into SSAS Tabular or Power BI, you might want to mark it as a date table. One of the prerequisites is that the table contains contiguous dates from beginning to end, so no dummy dates.
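As an illustration, here’s a minimal sketch of such a date dimension, keyed on the date itself and filled with contiguous dates (the table, columns and date range are hypothetical):

```sql
-- Hypothetical date dimension: the date itself is the surrogate key (3 bytes)
-- and the table contains only contiguous dates, so it can be marked as a
-- date table in SSAS Tabular or Power BI.
CREATE TABLE dbo.DimDate
(
    [Date]      date        NOT NULL PRIMARY KEY,
    [Year]      int         NOT NULL,
    [Month]     int         NOT NULL,
    [MonthName] varchar(20) NOT NULL
);

WITH Dates AS
(
    SELECT CAST('2015-01-01' AS date) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]) FROM Dates WHERE [Date] < '2024-12-31'
)
INSERT INTO dbo.DimDate ([Date], [Year], [Month], [MonthName])
SELECT [Date], YEAR([Date]), MONTH([Date]), DATENAME(MONTH, [Date])
FROM Dates
OPTION (MAXRECURSION 0); -- the recursion goes deeper than the default limit of 100
```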
Another advantage of having dates stored in two columns – date and time – is that they compress much better in the VertiPaq engine, because there are fewer unique values in each column. If you have very large fact tables, this is definitely a performance tuning technique to consider.
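As a small sketch of what that looks like in practice (the staging table and column names are made up), splitting the timestamp during the load could be as simple as:

```sql
-- Hypothetical load query: split one datetime2 column into a date and a
-- time column. Each resulting column has far fewer distinct values
-- (about 3,650 dates and at most 86,400 seconds for 10 years of data),
-- which helps VertiPaq's dictionary compression.
SELECT
    CAST(SaleTimestamp AS date)    AS OrderDate,
    CAST(SaleTimestamp AS time(0)) AS OrderTime,
    SalesAmount
FROM stg.Sales;
```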
Overall, having separate date and time columns expresses your intent better, in my opinion. What do you think? Let me know in the comments!
------------------------------------------------
Koen Verbeeck
Koen Verbeeck is a Microsoft Business Intelligence consultant at AE, helping clients gain insight into their data. Koen has comprehensive knowledge of the SQL Server BI stack, with a particular love for Integration Services. He's also a speaker at various conferences.