The book Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann was recommended to me by a colleague. The author has worked at companies such as LinkedIn, where he built large distributed systems to handle data, so I guess he knows what he’s talking about 🙂 (he’s also a researcher at Cambridge University).
It’s quite a big book (around 545 pages), but I enjoyed it. It’s quite technical, and I wouldn’t recommend it to anyone who doesn’t have a basic grasp of databases (relational or NoSQL). There’s some jargon in there, and although Martin makes a great effort to explain concepts thoroughly, some (basic) concepts are just left as-is.
I learned a lot about the challenges of distributed systems (scalability, transactions, consistency etc.), but for a book whose title starts with “designing”, it doesn’t actually talk about designing that much. At the start of the book, there’s a use case of how a distributed system could support a platform like Twitter, where some users have millions of followers. If they post a message, you suddenly need to update millions of timelines. The author then gives an example of how this could be solved. I expected more design patterns like this one, but unfortunately that’s not the case (in the final chapter there are some general design principles, but that’s it). Most of the book is spent on explaining concepts and how they impact large-scale distributed data systems.
Part 1 of the book explains basic concepts such as databases, data models, query languages and storage. Part 2 dives deeper into distributed concepts such as transactions (which are obviously harder across multiple systems), replication, partitioning, consistency and consensus. Part 3 talks about batch processing and stream processing, and the future of data systems. All very interesting if you want to learn more about the concepts behind distributed systems, but again, I would have liked more design patterns. All of the information is still useful though. If you start a large data project, it’s good to know what leaderless replication is, for example. If you need to select a certain database product, it’s helpful to know what some features actually do.
Another “problem” with the book is that it’s already 5 years old. That doesn’t seem very old, but in today’s world of constantly changing technology, it’s already ancient. Most concepts are of course still true today, but you can see the book focuses more on the early big data stuff (Hadoop, HDFS) and much less on more modern systems like Spark (especially Databricks) and Snowflake, for example. Each chapter ends with a huge list of references, and many of them are almost a decade old. There aren’t many references from 2016 or 2017.
My conclusion is that this book is a definite recommendation for everyone who wants to learn more about data and distributed systems, but this shouldn’t be your first book about data. Personally, I hope a revised, more up-to-date second edition will be published someday.
View Comments
Hi, I liked your review and would like to know what you would suggest as a "first book about data", since at the end you said this one shouldn't be?
Hi,
that's a very good question. Kind of depends on what you want to do with data I guess.
For data warehousing: the Kimball book (The Data Warehouse Toolkit) is a good start for dimensional modelling.
Any introduction book to SQL is a good start (I can recommend any book by Itzik Ben-Gan).
I also liked the Fundamentals of Data Engineering book.
Alright, so for me, I started programming at 42 school with C, and then advanced to some C++ projects, like an HTTP server. Nowadays, I think it would be interesting to learn backend development, and I wanted to start by understanding databases at a low level, and that's why I was interested in reading this book.
This book does go into low-level detail, but more about distributed systems and all the headaches that come with them, not necessarily about databases themselves.
If you're looking for an intro to databases on a more theoretical (or academic) level, the C. J. Date book "An Introduction to Database Systems" is highly regarded. It's quite theoretical, as it goes into the relational algebra behind relational databases (it's also quite old by now; I had to read this book at university). If you're looking for something more practical, the book Introducing SQL Server might be a good option.
Thank you very much for your suggestions!