Book Review – Designing Data-Intensive Applications

The book Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann was recommended to me by a colleague. The author has worked at companies such as LinkedIn, where he built large distributed data systems, so I guess he knows what he’s talking about 🙂 (he’s also a researcher at Cambridge University).

It’s quite a big book (around 545 pages), but I enjoyed it. It’s quite technical, and I wouldn’t recommend it to anyone who doesn’t have a basic grasp of databases (relational or NoSQL). There’s some jargon in there, and although Martin makes a great effort to explain concepts thoroughly, some (basic) concepts are left unexplained.

I learned a lot about the challenges of distributed systems (scalability, transactions, consistency etc.), but for a book whose title starts with “designing”, it doesn’t actually talk about designing that much. At the start of the book, there’s a use case of how a distributed system could support a platform like Twitter, where some users have millions of followers. If they post a message, you suddenly need to update millions of timelines. The author then gives an example of how this could be solved. I expected more of these design patterns, but unfortunately that’s not the case (the final chapter has some general design principles, but that’s it). Most of the book is spent on explaining concepts and how they impact large-scale distributed data systems.
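To make the timeline problem a bit more concrete, here’s a minimal sketch in Python of the two obvious ways to build a home timeline: push every new post into each follower’s cached timeline when it’s written, or collect the posts only when someone reads their timeline. This is my own illustration, not code from the book, and everything in it (the TimelineStore class, its method names, the in-memory dictionaries standing in for real storage) is hypothetical.

```python
from collections import defaultdict
from itertools import count


class TimelineStore:
    """Toy in-memory model of home timelines (illustrative names only)."""

    def __init__(self):
        self.followers = defaultdict(set)  # author -> users who follow them
        self.following = defaultdict(set)  # user -> authors they follow
        self.posts = defaultdict(list)     # author -> [(seq, author, text)]
        self.cache = defaultdict(list)     # user -> precomputed timeline entries
        self._seq = count(1)               # global ordering of posts

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def post(self, author, text, fan_out=True):
        entry = (next(self._seq), author, text)
        self.posts[author].append(entry)
        if fan_out:
            # Fan-out on write: one cache update per follower.
            # With millions of followers, this loop is the expensive part.
            for user in self.followers[author]:
                self.cache[user].append(entry)
        return entry

    def timeline_from_cache(self, user, limit=10):
        # Cheap read: the work was already done when the post was written.
        return sorted(self.cache[user], reverse=True)[:limit]

    def timeline_on_read(self, user, limit=10):
        # Fan-out on read: gather posts from everyone the user follows.
        entries = [e for a in self.following[user] for e in self.posts[a]]
        return sorted(entries, reverse=True)[:limit]


store = TimelineStore()
store.follow("alice", "bob")
store.follow("carol", "bob")
store.post("bob", "hello")
store.post("bob", "second post")
print(store.timeline_from_cache("alice"))
print(store.timeline_on_read("alice"))
```

The write path costs one update per follower, while the read path costs one lookup per followed account. Which one is cheaper depends on how many followers the poster has and how often timelines are read, and that trade-off is what makes the accounts with millions of followers painful.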

Part 1 of the book explains basic concepts such as databases, data models, query languages and storage. Part 2 dives deeper into distributed concepts such as transactions (which are obviously harder across multiple systems), replication, partitioning, consistency and consensus. Part 3 talks about batch processing, stream processing and the future of data systems. All very interesting if you want to learn more about the concepts behind distributed systems, but again, I would have liked more design patterns. All of the information is still useful though. If you start a large data project, it’s good to know what leaderless replication is, for example. If you need to select a certain database product, it’s helpful to know what some features actually do.

Another “problem” with the book is that it’s already 5 years old. That doesn’t seem very old, but in today’s world of constantly changing technology, it’s already ancient. Most concepts are of course still valid today, but you can see the book focuses more on the early big data stuff (Hadoop, HDFS) and much less on more modern systems like Spark (especially Databricks) and Snowflake. Each chapter ends with a huge list of references, and many of them are almost a decade old. There aren’t many references from 2016 or 2017.

My conclusion is that this book is a definite recommendation for everyone who wants to learn more about data and distributed systems, but this shouldn’t be your first book about data. Personally, I hope a revised, more up-to-date second edition will be published someday.


Koen Verbeeck

Koen Verbeeck is a Microsoft Business Intelligence consultant at AE, helping clients to get insight in their data. Koen has a comprehensive knowledge of the SQL Server BI stack, with a particular love for Integration Services. He's also a speaker at various conferences.
