A couple of weeks back I blogged about my progress in the “Microsoft Professional Program in Data Science” curriculum. I’ve now finished the remainder of the courses, so I only need to finish the capstone project. This project only starts at the beginning of each quarter and runs for a month. The next runs are in July and October. I’ll do the project in October, since I like my summer vacations too much. 🙂
7. Principles of Machine Learning
A very interesting course. It goes a bit deeper into the machine learning algorithms covered in the previous course (classification and regression) but also introduces new algorithms such as decision trees, neural networks and support vector machines. The course also covers how you can optimize your models, for example by using the “tune model hyperparameters” task or by doing some feature engineering. As with the previous course, the labs are just a step-by-step process you need to follow, so it’s quite hard to get them wrong (although I did manage to get one question wrong).
The final lab on the other hand was (finally) more open: you get a data set and you’re asked to do 25 predictions. It’s up to you to do the feature engineering, choose the best model and tune it. The more predictions you get correct within a margin of error, the more points you receive. I have to mention though that it’s the same data set as in the final lab of the previous course, so a big chunk of the test has already been laid out. However, if you just follow the steps, the predictions won’t be accurate enough. The problem with this final lab is that if you want to do it right, your experiment can take a considerable time to run (at my second iteration my experiment took 45 minutes). That doesn’t exactly encourage an incremental approach to the problem. Suppose you have 20 predictions correct within the set margin. Will you try to improve the model and wait another hour to see its effect, just to score a few more points? (I didn’t by the way, I just made some guesses for the last 5 predictions and got 3 more predictions correct.) Unfortunately, it seems edX won’t share the final model that would get all 25 predictions correct, so yet another learning opportunity missed.
I finished this course quite quickly (as usual with videos at twice the speed).
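To make the scoring idea concrete, here is a minimal sketch of counting how many predictions land within a margin of error. The function name `count_hits`, the use of an absolute (rather than relative) margin, and all the numbers are my own assumptions for illustration; the actual grading rule of the lab isn’t published.

```python
def count_hits(predictions, actuals, margin):
    """Count predictions whose absolute error is at most `margin`.

    Assumption: the margin is an absolute difference; the real lab
    may well use a different rule (e.g. a relative error).
    """
    return sum(1 for p, a in zip(predictions, actuals) if abs(p - a) <= margin)


# Hypothetical values, not taken from the course data set
predicted = [10.0, 12.5, 8.0, 15.0]
actual = [11.0, 12.0, 12.0, 15.5]

hits = count_hits(predicted, actual, margin=1.0)
print(f"{hits} of {len(actual)} predictions within the margin")
```

A quick check like this on a held-out part of the training data would at least tell you whether another long experiment run is likely to be worth it.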
A course that delves a bit deeper into the R language. Unlike the basic R course, this one is not made by Datacamp but by a university in Denmark. Unfortunately the course starts from scratch, so there’s quite some overlap with the basic course in the first modules. There are 12 modules, but they are quite short so the course isn’t longer than any other course in the curriculum. Each module consists of some videos accompanied by some ungraded exercises. The exercises try to reinforce the concepts in the video by asking some questions. Under the questions the answers are given and the code is explained. The module ends with some quizzes (graded) and a lab. The labs were the most interesting since you’re writing actual R code at that point.
Although the title mentions “… for Data Science”, this course only touches on data science concepts in the last 3 modules (simulation, linear models and graphics). I also would have liked this course a bit earlier in the curriculum (now it’s the second-to-last course). The machine learning courses have some more advanced R code in them, and doing this course first might help you understand all of that code. You can take courses in any order you wish, but I followed the outline defined by the curriculum.
At the end of the course there’s a final exam, which is provided by Datacamp. The exam is similar in structure to the one in the basic R course. It’s not too hard, but time is limited. Unfortunately you don’t know which questions you got wrong, so as usual you don’t have the opportunity to learn from your mistakes. You don’t need to take the final test btw, it’s only 20% of the final score.
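As a small illustration of what that 20% weight means, the sketch below computes a weighted final score. The function name `final_score` and the example percentages are hypothetical; only the 20% exam weight comes from the course, and I’m assuming the other 80% is spread over the quizzes and labs.

```python
def final_score(coursework, exam, exam_weight=0.20):
    """Weighted final score; `coursework` and `exam` are fractions in [0, 1].

    Assumption: everything that is not the exam (quizzes and labs)
    counts together for the remaining weight.
    """
    return (1.0 - exam_weight) * coursework + exam_weight * exam


# Skipping the exam caps the final score at 80%, even with perfect coursework
skipped = final_score(coursework=1.0, exam=0.0)
# Hypothetical example: 90% on coursework, 70% on the exam
taken = final_score(coursework=0.9, exam=0.7)

print(f"exam skipped: {skipped:.0%}, exam taken: {taken:.0%}")
```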
Again, this course was quickly over. There aren’t too many videos and labs typically don’t take that long to finish.
9. Applied Machine Learning
For the last step in the curriculum – Applied Data Science – you can choose between a couple of optional courses: Applied Machine Learning, Implementing Predictive Solutions with Spark in HDInsight, Developing Intelligent Applications (IoT, will be removed from the curriculum in the future) and Analyzing Big Data with Microsoft R Server (also available in the MPP Big Data curriculum). I chose the first option, as a logical continuation of the previous machine learning courses.
This course delves deeper into some problems you can solve with machine learning: time series forecasting, spatial data analysis, text analysis and image analysis (In my opinion, the time series forecasting module was the most interesting one). Some modules have a lot of videos, but the labs are typically quite short. The lectures are a bit more theoretical than in other courses. If you watch them at double speed, you might miss some information. However, the theory isn’t necessary to do the labs. At the end of each module, there were also some videos by a Microsoft techie, where practical examples of the learned methodologies are given, typically using the Cortana Intelligence suite. It was quite nice to see some practical implementations.
Most parts of the labs were just executing some code in a workbook. In contrast with other courses, some modules only have R code (time series, spatial and text) and one module only has Python code (image). Throughout the curriculum, you could always choose between Python and R, but in this course the choice has already been made for you. It’s a bit of a pity; I followed all the R modules and for the image analysis module I suddenly had to “understand” Python. Not a big deal though, since you don’t need to read the code to finish a lab and answer the questions.
The final test was just a couple of multiple choice questions. If you glanced over the lectures, you shouldn’t have too much trouble getting them all correct. I finished this course in a couple of days.