Categories: Azure Data Factory

How you can save up to 80% on Azure Data Factory pricing

… with this one weird little trick 🙂 Sorry for the clickbait, but you can actually save a lot on ADF pricing if you design your pipelines carefully and, as we will see, if you don’t do everything in ADF. Before we proceed, I advise you to watch this short YouTube video by Cathrine Wilhelmsen on ADF cost. It explains the basics very well and demonstrates the pricing concepts with a few examples.

As you might’ve noticed, pricing in ADF doesn’t work the way it did in SSIS, for example. In SSIS, you pay for your SQL Server license and you’re done (well, and you buy a server to run it on). It doesn’t matter what you do with SSIS, the cost is the same. If you run 1 package or 1000 packages, there’s no difference except in your electricity bill. However, in ADF you pay more if you use it more: you pay for each action you do, for each activity you use, and for how long things are running. There are a couple of guidelines you can follow to try to minimize costs:

  • Using a VNET for your integration runtimes is expensive. Really expensive. Cathrine has a good overview in the video, where she shows that costs are a thousand times higher.
  • Try to use ADF as an ELT tool. Or even just as an orchestrator. Use it to ingest data into your sink (data lake, database, whatever) and then use a cheaper type of compute to do the actual calculations. For example, dump the data in a staging table and then use SQL statements to do the rest. External activities (such as executing SQL statements in a database) are the cheapest type of activity in ADF.
  • I may be biased, but I try to avoid dataflows in ADF as well. I can do pretty much the same things in SQL much more easily and, especially, more cheaply. I don’t need to start up a Spark cluster just to add a few columns to my dataset or to do a lookup. If you hate writing code though, dataflows might be just your thing.
  • Minimize the number of activities. You pay for at least one minute for each activity. If you have many pipelines or frequently executed pipelines, having many activities starts adding up.
  • If you really really want to save on costs in ADF, you can run everything on a self-hosted integration runtime. This is the cheapest runtime, but on the other hand means that you need to provision your own server again.
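The one-minute minimum per activity from the list above can be sketched in a few lines. This is only an illustration: it assumes each activity run is rounded up to a whole minute, and the durations used are hypothetical.

```python
import math


def billed_minutes(durations_seconds):
    """Total billed minutes, assuming each activity run is rounded up
    to a whole minute with a minimum of one minute."""
    return sum(max(1, math.ceil(s / 60)) for s in durations_seconds)


# Hypothetical run: 100 small activities that each finish in 12 seconds.
many_small = billed_minutes([12] * 100)

# Hypothetical alternative: the same 1200 seconds of work in one activity.
one_large = billed_minutes([12 * 100])

print(many_small, one_large)
```

Under these assumptions the same amount of work is billed very differently depending on how many activities it is spread over, which is why fewer activities usually means a lower bill.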

You might’ve noticed the most expensive part in ADF is the Copy Activity. Its cost is the copy duration (in hours) multiplied by the number of DIUs (Data Integration Units) used and by the price per DIU-hour.
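As a rough sketch of that calculation, assume the Copy Activity is billed as copy duration in hours, times the number of Data Integration Units (DIUs), times a price per DIU-hour. The $0.25 rate below is an assumed placeholder, and the function name is made up for this example; take the real price points from the video or the Azure pricing page.

```python
def copy_activity_cost(duration_minutes, dius, price_per_diu_hour=0.25):
    """Cost in dollars: copy duration (hours) * DIUs * price per DIU-hour.

    price_per_diu_hour is an assumed rate, not an official figure.
    """
    return (duration_minutes / 60) * dius * price_per_diu_hour


# 100 billed minutes of copy duration at two different DIU settings.
cost_2_dius = copy_activity_cost(100, dius=2)
cost_4_dius = copy_activity_cost(100, dius=4)

print(f"2 DIUs: ${cost_2_dius:.2f}, 4 DIUs: ${cost_4_dius:.2f}")
```

With the assumed rate, 100 minutes at 2 DIUs comes to about $0.83, and doubling the DIUs doubles the cost. Because the cost is linear in every factor, halving any one of them halves the bill.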

If you have a metadata-driven Copy Activity inside a For Each loop, things can get expensive fast. For example, I have a pipeline that extracts data from a REST API. I need to extract 100 endpoints, and since each copy is billed for at least one minute, I get 100 minutes of copy duration. Using the formula and price points from the video, this means an execution cost of approximately $0.83 per pipeline run. This doesn’t seem like much, but it’s $25 for a whole month (one execution per day). For a single pipeline. If you have multiple pipelines like this one, or if you increase the number of executions, you can see why it can get expensive. Running the pipeline every hour is about $600 a month. For a single pipeline. Executing it every 5 minutes is about $7200. Whoops. Do not use ADF as a streaming tool.

Luckily there’s still a part of the formula we can influence: the number of used DIUs (Data Integration Units). Cathrine also mentions this in the video. The default setting is Auto, which scales automatically but starts at 4.

However, you can set it manually to 2.

This means you can already save 50% on your Copy Activity cost! Obviously, this affects performance and you should only do this when working with smaller data sets. But if you just have to load some smaller files or some REST API data, definitely set it to 2.
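To see how the schedule drives the monthly bill, the per-run cost of about $0.83 quoted earlier can be scaled by the number of runs. This is a small sketch that assumes a 30-day month; the function name is made up for the example.

```python
def monthly_cost(cost_per_run, runs_per_day, days_per_month=30):
    """Monthly cost in dollars for a pipeline run on a fixed schedule."""
    return cost_per_run * runs_per_day * days_per_month


COST_PER_RUN = 0.83  # per-run figure quoted in the text

schedules = {"daily": 1, "hourly": 24, "every 5 minutes": 24 * 12}

for name, runs_per_day in schedules.items():
    full = monthly_cost(COST_PER_RUN, runs_per_day)
    # Halving the DIUs halves the per-run cost, and so the monthly cost.
    halved = monthly_cost(COST_PER_RUN / 2, runs_per_day)
    print(f"{name}: ${full:,.2f} per month, ${halved:,.2f} with half the DIUs")
```

The results land close to the rounded monthly figures mentioned above, and the second column shows what the 50% saving means at each frequency.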

How can we cut costs even more? By avoiding the Copy Activity altogether. In my use case of the REST API, I wrote an Azure Function that extracts data from the API and dumps it into a staging table using SqlBulkCopy. This means an external activity is used, which is much cheaper. By using the Azure Function, I reduced the price of my pipeline to $0.05 per month (for a daily run). Even if I ran the pipeline hourly, it would cost me a mere $1.09 in ADF. You have to pay for the Azure Function as well of course, but if you use a consumption plan, the first million (!) executions and 400,000 GB-s (gigabyte-seconds) each month are free (per subscription). So it’s possible you’re staying in the free part of the offering, which means $0 cost.

Even if you go over the free offering, the costs are lower than in ADF. Keep in mind though that the development effort is higher when using Azure Functions, because you’ll have to write the code yourself, while the Copy Activity is configured in a couple of minutes.
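Whether a workload stays inside the free part of the consumption plan can be estimated with a quick calculation. The sketch below uses the free amounts mentioned above and assumes they apply per month. The execution time and memory size are hypothetical, and the real billing rules (such as rounding of memory and duration) are ignored.

```python
FREE_EXECUTIONS = 1_000_000  # free executions (figure from the text)
FREE_GB_SECONDS = 400_000    # free gigabyte-seconds (figure from the text)


def within_free_grant(executions, avg_seconds, memory_gb):
    """Return the estimated GB-s used and whether it fits the free grant.

    Simplified: GB-s = executions * average duration * memory in GB.
    """
    gb_seconds = executions * avg_seconds * memory_gb
    fits = executions <= FREE_EXECUTIONS and gb_seconds <= FREE_GB_SECONDS
    return gb_seconds, fits


# Hypothetical workload: hourly runs for 30 days (720 executions),
# each taking 120 seconds with 0.5 GB of memory.
gb_seconds, fits = within_free_grant(720, avg_seconds=120, memory_gb=0.5)

print(gb_seconds, fits)
```

With these made-up numbers the function uses only a small share of the free gigabyte-seconds, so the Azure Function part of the bill would stay at $0.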


Koen Verbeeck

Koen Verbeeck is a Microsoft Business Intelligence consultant at AE, helping clients to get insight in their data. Koen has a comprehensive knowledge of the SQL Server BI stack, with a particular love for Integration Services. He's also a speaker at various conferences.
