The book Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann was recommended to me by a colleague. The author has worked at companies such as LinkedIn, where he built large distributed systems to handle data, so I guess he knows what he’s talking about 🙂 (he’s also a researcher at Cambridge University).
It’s quite a big book (around 545 pages), but I enjoyed it. It’s quite technical, and I wouldn’t recommend it to anyone who doesn’t have a basic grasp of databases (relational or NoSQL). There’s some jargon in there, and although Martin goes to great lengths to explain concepts thoroughly, some (basic) concepts are simply left unexplained.
I learned a lot about the challenges of distributed systems (scalability, transactions, consistency etc.), but for a book whose title starts with “designing”, it doesn’t actually talk about designing that much. At the start of the book, there’s a use case of how a distributed system could support a platform like Twitter, where some users have millions of followers. If they post a message, you suddenly need to update millions of timelines. The author then gives an example of how this could be solved. I expected more design patterns like this one, but unfortunately that’s not the case (the final chapter has some general design principles, but that’s it). Most of the book is spent on explaining concepts and how they impact large-scale distributed data systems.
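To make the Twitter timeline problem a bit more concrete, here is a rough Python sketch of one common way to approach it: push each new post into the followers’ precomputed timelines at write time, except for accounts with a huge following, whose posts are merged in only when a timeline is read. This is my own illustration, not code from the book. The class TimelineService, its method names and the tiny CELEBRITY_THRESHOLD are all made up for the example, and a real system would use caches and queues instead of in-memory dictionaries.

```python
from collections import defaultdict

# Hypothetical cutoff, kept tiny for the example. Accounts with more
# followers than this are treated as "celebrities" and not fanned out.
CELEBRITY_THRESHOLD = 3


class TimelineService:
    """Toy in-memory timeline store (illustrative only)."""

    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.posts = defaultdict(list)      # author -> [(seq, author, text)]
        self.timelines = defaultdict(list)  # user -> precomputed entries
        self._seq = 0                       # global ordering of posts

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) > CELEBRITY_THRESHOLD

    def post(self, author, text):
        self._seq += 1
        entry = (self._seq, author, text)
        self.posts[author].append(entry)
        # Fan-out on write: cheap for ordinary accounts, but it would mean
        # millions of writes for a celebrity, so those are skipped here.
        if not self.is_celebrity(author):
            for user in self.followers[author]:
                self.timelines[user].append(entry)

    def timeline(self, user):
        # Start from the precomputed timeline, keyed by sequence number so
        # that a post can never show up twice.
        merged = {entry[0]: entry for entry in self.timelines[user]}
        # Fan-out on read: pull in posts of followed celebrities now.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for entry in self.posts[author]:
                    merged[entry[0]] = entry
        newest_first = sorted(merged.values(), reverse=True)
        return [(author, text) for _, author, text in newest_first]
```

The trade-off is that reading a timeline stays cheap for most users, while a single post by a very popular account no longer triggers millions of writes. The sketch ignores details such as posts made before someone started following an author.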
Part 1 of the book explains basic concepts such as databases, data models, query languages and storage. Part 2 dives deeper into distributed concepts such as transactions (which are obviously harder across multiple systems), replication, partitioning, consistency and consensus. Part 3 talks about batch processing and stream processing, and the future of data systems. All very interesting if you want to learn more about the concepts behind distributed systems, but again, I would have liked more design patterns. All of the information is still useful though. If you start a large data project, it’s good to know what leaderless replication is, for example. If you need to select a certain database product, it’s helpful to know what some features actually do.
Another “problem” with the book is that it’s already 5 years old. That doesn’t seem very old, but in today’s world of constantly changing technology, it’s already ancient. Most concepts are of course still true today, but you can see the book focuses more on the early big data stuff (Hadoop, HDFS) and much less on more modern systems like Spark (especially Databricks) and Snowflake, for example. Each chapter ends with a huge list of references, and many of them are almost a decade old. There aren’t many references from 2016 or 2017.
My conclusion is that this book is a definite recommendation for everyone who wants to learn more about data and distributed systems, but this shouldn’t be your first book about data. Personally, I hope a revised, more up-to-date second edition will be published someday.