I’m delighted to announce I’ll be giving a webinar for MSSQLTips.com about SQL Server Development Best Practices. Aka writing T-SQL and stuff 🙂
The webcast is on the 13th of June 2019 at 7PM UTC. In a nutshell, I’ll be talking about the things I would tell my 10-years-younger self about T-SQL development (if I could travel back in time, of course). The abstract:
As a database developer I have had many years to learn the right way and the wrong way of doing things with SQL Server. As SQL Server continues to roll out new features I need to continue to refine my database development skills with all of the new tools Microsoft offers. In addition, I need to keep in mind all of the things I have learned over the years to ensure I still follow best practices as well as things I know work best from my own experience.
In this webcast I will share some of the development skills I have honed over the years and things that I wish I knew when I first started working with SQL Server which could have saved a lot of time and headaches.
Join us for this free webcast and learn things about database design, T-SQL development, optimizing for performance, BI development and more.
You can register here. Full disclosure: the webinar is sponsored by Idera.
Hi Koen, I attended the webinar, however it finished a little before I could leave my question, so here it is. I work in BI projects, and once some of my co-workers (including my boss) told me that it is very bad practice to use a DELETE period-of-time-N / INSERT period-of-time-N from stage technique to update fact tables. They instead recommended (or told me "you have to") use a COPY everything-but-period-of-time-N to a text file / TRUNCATE TABLE fact table / INSERT everything-but-period-of-time-N from the text file / INSERT period-of-time-N from stage (the text file can also be a temp table). Another way they recommended was to create a new table, INSERT everything-but-period-of-time-N into the new table / INSERT period-of-time-N from stage / drop the fact table / rename the new table with the dropped fact table's name. I have to add, their experience is with Oracle and not so much with SQL Server. What are your thoughts about this recommendation? Thanks in advance
Hi,
that seems a strange and convoluted way to do things. It is inefficient to copy all of the data of the other periods around when it isn't necessary. Especially copying the data out, truncating the table, and copying it back in is quite absurd.
Is there already data for period-of-time-N present in the fact table? If not, why not just insert the data from the staging table? If yes, you can either DELETE the data (if supported by a clustered index) and re-insert the time period, or you can try partition switching (hard to set up, but probably the fastest method) http://blog.pragmaticworks.com/table-partitioning-in-sql-server-partition-switching
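A minimal sketch of that delete-and-reload pattern (the table and column names here are hypothetical; I'm assuming an integer period column such as 201906, leading the clustered index so the delete is a range operation):

```sql
-- Remove and reload a single period inside one transaction,
-- so readers never see the period half-loaded.
BEGIN TRANSACTION;

DELETE FROM dbo.FactSales
WHERE  Period = 201906;

INSERT INTO dbo.FactSales (Period, CustomerKey, Amount)
SELECT Period, CustomerKey, Amount
FROM   stage.FactSales
WHERE  Period = 201906;

COMMIT TRANSACTION;
```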
You can also use the upsert method, where you insert data that isn't in the table yet and update rows that have changes (you can easily check by comparing a row hash). Rows from staging that are updates but don't have any actual change are discarded. Personally in a data warehouse I'm not too fond of actually deleting data.
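A sketch of that upsert-with-row-hash approach (hypothetical table and column names; I'm assuming the staging load already computed a RowHash column, e.g. with HASHBYTES over the concatenated attributes):

```sql
-- Update only rows whose hash differs: unchanged rows from staging are ignored.
UPDATE f
SET    f.Amount  = s.Amount,
       f.RowHash = s.RowHash
FROM   dbo.FactSales  AS f
JOIN   stage.FactSales AS s
       ON s.SalesKey = f.SalesKey
WHERE  s.RowHash <> f.RowHash;   -- only rows with an actual change

-- Insert rows whose business key is not in the fact table yet.
INSERT INTO dbo.FactSales (SalesKey, Amount, RowHash)
SELECT s.SalesKey, s.Amount, s.RowHash
FROM   stage.FactSales AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.FactSales AS f
                   WHERE  f.SalesKey = s.SalesKey);
```

Comparing one hash column instead of every attribute keeps the change detection cheap and the statement short, and no data is ever deleted.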
Hope this helps,
Koen
Hi,
I checked with a few colleagues (some of whom have Oracle experience) and they say they probably do it like this to avoid slow deletes. Copying data to a text file (or a temp table) can be really fast, especially if you use SELECT INTO.
However, there are some downsides to that approach:
* Inserting with SELECT INTO doesn't create any indexes, so you'll have to re-create them.
* Creating a new table, inserting the data, and renaming means you insert all of your data into an empty table, so any indexes need to be populated again as well.
It might work in some scenarios, but I would rather focus on
* not doing any deletes at all if possible
* if you have to do deletes, maybe speed those up with an index (yes, that's possible)
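The second bullet can be as simple as this (hypothetical names; the point is that an index on the delete predicate lets SQL Server seek to the affected rows instead of scanning the whole fact table):

```sql
-- A nonclustered index on the period column turns the delete
-- into a seek plus range delete rather than a full table scan.
CREATE NONCLUSTERED INDEX IX_FactSales_Period
    ON dbo.FactSales (Period);

DELETE FROM dbo.FactSales
WHERE  Period = 201906;
```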
For very large tables it seems inefficient to copy all that data around, especially if you only have to delete a relatively small portion of it.
For SQL Server, if you really want to avoid slow deletes, partition switching is probably the fastest option. You put all of your different time periods in different partitions. You load your N time period into a staging table with the same structure as your fact table. You switch the partition of the fact table out and switch your freshly loaded partition of the stage table in (this takes milliseconds).
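The switch itself is just two statements (hypothetical names; this assumes the fact and stage tables share the same partition function `pfPeriod`, partition scheme, indexes, and constraints, which is the hard part to set up):

```sql
-- 1. Switch the stale period out of the fact table to an empty archive table.
ALTER TABLE dbo.FactSales
    SWITCH PARTITION $PARTITION.pfPeriod(201906) TO dbo.FactSales_Old;

-- 2. Switch the freshly loaded partition of the stage table in.
ALTER TABLE stage.FactSales
    SWITCH PARTITION $PARTITION.pfPeriod(201906)
    TO dbo.FactSales PARTITION $PARTITION.pfPeriod(201906);

-- Both switches are metadata-only operations; no rows are physically moved.
TRUNCATE TABLE dbo.FactSales_Old;
```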
Regards,
Koen
Thank you very much Koen for your answer. Yes, I guess this solution came about because the customers were small or medium-sized enterprises with SQL Server 2014 Standard Edition (which can't use table partitioning), so my co-workers proposed what I told you about, with the reasoning of better speed and performance and especially better handling of the transaction log. Unfortunately we can't avoid the delete: we have to update, for example, the data of the present month, which can change every day in the source (the OLTP db). My co-workers confirmed they did this in a SQL Server environment and it was faster than the delete-insert solution. In summary, the txt solution is something like this:
-- 'period-of-time-N' = period to update = present month = 201906.
-- Historic data in myFactTable: from 5 years ago onwards. Average rows per year: 50 million.
DECLARE @sql varchar(8000);

-- export everything except the current period to a text file
SELECT @sql = 'bcp "SELECT * FROM myDB.myFactTable WHERE Period NOT IN (201906)" queryout "myPath\myTextFact.txt" -c ... etc';
EXEC master..xp_cmdshell @sql;

TRUNCATE TABLE [myFactTable];
DROP INDEX [IDX_myFactTable_MyClusterIndex] ON [myFactTable];
DROP INDEX [IDX_myFactTable_MyNonClusterIndex_1] ON [myFactTable];

-- bulk load the historic data back in
SELECT @sql = 'bcp myDB.myFactTable in "myPath\myTextFact.txt" -h "TABLOCK" ... etc';
EXEC master..xp_cmdshell @sql;

-- reconfigure commands..

-- reload the current period from staging
INSERT INTO myDB.myFactTable (Column1, Column2, Column3, Column4 ... etc ...)
SELECT stg1.Column1, stg1.Column2, stg2.Column3, stg3.Column4 ... etc ...
FROM StagingTable1 AS stg1
INNER JOIN StagingTable2 AS stg2 ON ...
LEFT JOIN StagingTable3 AS stg3 ON ...
WHERE stg1.[period-of-time-N] = 201906;

CREATE CLUSTERED INDEX [IDX_myFactTable_MyClusterIndex] ON [myFactTable] (Columns);
CREATE INDEX [IDX_myFactTable_MyNonClusterIndex_1] ON [myFactTable] (Columns);