It’s the second Tuesday of the month and you know what that means! T-SQL Tuesday time! If you don’t know what it means, T-SQL Tuesday is a monthly blog party of the SQL Server community – started by Adam Machanic (blog|twitter) – where everyone is invited to write about a common topic.
This month’s T-SQL Tuesday is hosted by Robert L Davis (blog | twitter) and its subject is changing data:
Be the Change. More specifically, data changes. How do you track changing data? How do you do your ETL? How do you clean or scrub your data? Anything related to changing data
I’ll keep it quite short in this blog post and just summarize a few of the options you have to detect changes in the data.
- In some very rare cases, you can actually use change data capture or change tracking on the source system. If you get one of those features implemented, you’re golden. But most of the time you’re not, as a lot of administrators don’t like them because of the potential performance impact. (A minimal sketch of enabling them follows after this list.)
- If you’re lucky, the source tables have metadata columns telling you if a row was updated or not, something like modifieddate. All you have to do is keep track of the last date you loaded data from that table and then load all records with a modified date later than that date (see the high-water-mark sketch after this list). If you can trust the date of course, because if someone makes manual changes to the tables the system falls apart. And nobody makes manual changes to the source system, right?
- If you can’t use one of the two previous methods, you’re basically stuck with doing a full load each time the ETL runs. That doesn’t mean you have to actually process every row into your data warehouse. With the incremental load pattern you can detect which rows are updates and which rows are inserts. It’s one of my favorite patterns in SSIS. Most of the time, if feasible, I use a lookup component with full cache to determine if a row is an update or not. Just don’t use the Slowly Changing Dimension wizard. Ever. The ghosts of ETL past will haunt you forever! To save some precious processing time, you can also determine whether an update row has any actual changes or not. There are a few options:
- Write a very complex derived column expression cross-checking every column. I’ve seen some of those in the wild. Approach with caution.
- Write a very long WHERE clause to check if a row has changed. This WHERE clause belongs to the UPDATE statement you use to update the rows in the destination table. Because you don’t use the OLE DB Command to update rows. Not ever. Not even if your life depended on it. Use the magic ALT-trick; it makes writing such a WHERE clause easy and quick. (A sketch follows after this list.)
- Calculate a hash over all the columns (both in source and destination). If it’s different, there is a change. By far the most elegant option. It requires a bit more CPU, but that’s cheap nowadays, right? Or was that storage? I always mix those up. I wrote an article about it a while back. I now realize there are more efficient ways to calculate the hash than hard-coded concatenation of all the different columns. You could use T-SQL (but you’d still have to concatenate columns – see the HASHBYTES sketch after this list) or you could use reflection in .NET to loop over the input columns of the buffer and concatenate them in a more flexible manner. Check out the most excellent book about SSIS scripting by MVP Joost van Rossum on how to do this.
- In SQL Server 2016, you have a new option. Huzzah! If your source system is SQL Server, you can change the tables to temporal tables and you have all the history of every row that you can imagine (a small example follows below). But this is in the same boat as the first option: admins must be willing to implement it. And the entire history is stored on the source system, which might be problematic.
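For the first option, this is roughly what enabling change tracking or change data capture looks like on the source. It’s a minimal sketch; the database, table and retention settings (SourceDB, dbo.Customer, 7 days) are placeholders for your own environment.

```sql
-- Change tracking: enabled per database, then per table.
ALTER DATABASE SourceDB
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Customer
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- Change data capture uses system stored procedures instead.
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;
```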
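The modifieddate approach boils down to keeping a high-water mark somewhere. A sketch, assuming a metadata table etl.LoadControl and a source table dbo.Customer (both made up for the example):

```sql
-- Fetch the last date we loaded from this table.
DECLARE @LastLoadDate datetime2 =
   (SELECT LastLoadDate FROM etl.LoadControl WHERE TableName = N'dbo.Customer');

-- Only pull rows modified since that date.
SELECT CustomerID, CustomerName, City, ModifiedDate
FROM dbo.Customer
WHERE ModifiedDate > @LastLoadDate;

-- After a successful load, store the new high-water mark
-- (the maximum ModifiedDate you actually loaded, not the current time).
UPDATE etl.LoadControl
SET LastLoadDate = (SELECT MAX(ModifiedDate) FROM dbo.Customer)
WHERE TableName = N'dbo.Customer';
```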
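The long WHERE clause from the second sub-option looks something like this, assuming a staging table stg.Customer and a destination table dim.Customer (hypothetical names). Note the ISNULL wrappers: without them a change from NULL to a value (or back) slips through.

```sql
-- Only touch rows where at least one column actually differs.
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.City         = s.City
FROM   dim.Customer AS d
JOIN   stg.Customer AS s
       ON s.CustomerID = d.CustomerID
WHERE  ISNULL(d.CustomerName, N'') <> ISNULL(s.CustomerName, N'')
   OR  ISNULL(d.City, N'')         <> ISNULL(s.City, N'');
```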
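And the hash option in T-SQL, with the hard-coded concatenation I mentioned. A sketch using HASHBYTES over the same made-up staging table; CONCAT treats NULL as an empty string, so keep the separators in place (and be aware that NULL and '' then hash the same).

```sql
SELECT  s.CustomerID,
        HASHBYTES('SHA2_256',
            CONCAT(s.CustomerName, N'|', s.City, N'|', s.Country)) AS RowHash
FROM    stg.Customer AS s;

-- A row counts as an update when the business key matches
-- but the stored hash in the destination differs:
--   WHERE d.CustomerID = s.CustomerID AND d.RowHash <> s.RowHash
```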
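For completeness, this is roughly what turning an existing table into a system-versioned temporal table looks like (SQL Server 2016+). The column and history table names are assumptions.

```sql
-- Add the period columns and switch on system versioning.
ALTER TABLE dbo.Customer
ADD ValidFrom datetime2 GENERATED ALWAYS AS ROW START HIDDEN NOT NULL
        CONSTRAINT DF_Customer_ValidFrom DEFAULT SYSUTCDATETIME(),
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END HIDDEN NOT NULL
        CONSTRAINT DF_Customer_ValidTo   DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo);

ALTER TABLE dbo.Customer
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Customer_History));

-- Everything that changed since the last load can then be read back:
DECLARE @LastLoadDate datetime2 = '2016-01-01';
SELECT CustomerID, CustomerName, City
FROM dbo.Customer
FOR SYSTEM_TIME BETWEEN @LastLoadDate AND SYSUTCDATETIME();
```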
If I forgot an option to check for changing data, let me know in the comments. Happy to hear your thoughts!
Thanks for participating Koen. I must admit that I have not really looked into temporal tables and until now, I didn't really have a good reason to do so. Now I do.