Book Review – Learning Pentaho Data Integration 8 CE

Yes, you’re reading that right: a book that is not about the Microsoft BI stack. On a project, Pentaho Data Integration (PDI) was used as the ETL tool. To get to know this tool a little better, I bought the book Learning Pentaho Data Integration 8 CE (Third Edition) by María Carina Roldán. For those of you who are not quite familiar with Pentaho (like me three months ago), it’s an open-source suite of BI tools. There’s also a commercial Enterprise product available. The ETL tool of the suite is Pentaho Data Integration (PDI for short), which is also known in the community as Kettle and Spoon (yes, they have fun code names; there’s also Kitchen, etc.). The tool itself is written in Java.

In short, I very much liked the book. It’s to the point, and the author certainly knows her stuff. There’s an abundance of chapters that explain almost everything you need to know about PDI. The book starts easy with how to install the product, how to do some basic transformations and how to read/write files. Then the more intermediate and advanced stuff is explained, such as how to do data cleansing and validation, and how to load databases and data warehouses. At the end, some best practices are shared on how to create reusable transformations and jobs, and how to design and deploy your projects.

The book is well written, but sometimes the explanation is a bit short. However, the book is already 487 pages, so cramming in more might not have been a good idea. Although the book is about Pentaho 8 (and 8.1 is the current version), some screenshots are already out of date. Maybe they are leftovers from previous editions of the book? Also, there’s a short section on how to load files to AWS S3, and the authentication process is completely different in the current version. One small annoyance I had was when installing the product. The book literally says: “Download the latest version, extract it, run a batch file to start PDI. That’s it, you’re done.” Except that on a lot of desktops the Java home directory hasn’t been added to the environment path, so when you run the batch file it can’t find java.exe and it exits. Just a minor gripe here 🙂
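If you want to check for that Java problem before running the batch file, something like the following could help. This is only a minimal sketch in Python, not anything from the book or from PDI itself: the function name `find_java` is made up for illustration, and I'm assuming the usual layout where the Java executable sits in the `bin` folder under `JAVA_HOME`.

```python
import os
import shutil
from pathlib import Path


def find_java(env=None):
    """Return the path to a Java executable, or None if none is found.

    Hypothetical helper: looks in JAVA_HOME/bin first (assumed layout),
    then falls back to the directories listed in PATH.
    """
    env = os.environ if env is None else env
    exe = "java.exe" if os.name == "nt" else "java"

    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = Path(java_home) / "bin" / exe
        if candidate.is_file():
            return str(candidate)

    # Fall back to searching PATH for a "java" executable.
    return shutil.which("java", path=env.get("PATH", ""))


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("No Java found: set JAVA_HOME or add Java's bin folder to PATH.")
    else:
        print(f"Java found at {java}")
```

If it reports that no Java was found, setting `JAVA_HOME` or adding Java's `bin` folder to the path should be enough to get the batch file past that point.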

Conclusion: this book is certainly a good choice to get you up to speed with PDI. You can also use it as a reference work later on. Definitely recommended.


Koen Verbeeck

Koen Verbeeck is a Microsoft Business Intelligence consultant at AE, helping clients to get insight in their data. Koen has a comprehensive knowledge of the SQL Server BI stack, with a particular love for Integration Services. He's also a speaker at various conferences.
