Microsoft Professional Program in Data Science – The Finish Line

In October 2017, I completed the capstone project of the Microsoft Professional Program for Data Science. I’ve blogged about this program before.

I’m happy to announce I got the certificate.

How did the capstone project go? It was a very interesting experience. The project consists of three parts:

  • Answering some questions on edX about the data set.
  • Creating a predictive model, which is then scored against a test data set.
  • Writing a report with all your findings.

You log into a website hosting a “data science competition”. There you get your assignment; in my case, predicting the average income of students in the USA a couple of years after graduating from an institution of higher education. For the first part, there are some questions on the edX website under the header “data analysis”. These questions force you to take a look at your data, create a couple of charts and histograms, and calculate some statistics in order to find the answers. All in all, the questions were pretty easy, so I got the maximum score.

The second part is creating a predictive model. The competition website gives you a training data set, a test data set and a data glossary. You can choose whatever method you please to create the model. You can use R, Python, Azure ML or any combination of these. It’s totally up to you. You score the test data set against your model and upload the .csv file with the results to the competition website. The website calculates the RMSE (root mean squared error) of your predictions against the actual values. The lower the RMSE, the higher your score. This score is published on a leaderboard, so you can compare yourself against your peers.
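For readers who haven’t met RMSE before, here is a minimal sketch of how it can be calculated in plain Python. The income figures are made up for illustration, and the competition website’s exact scoring code isn’t public as far as I know, so treat this as the textbook formula rather than their implementation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one value")
    # Square each prediction error, average the squares, take the root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up incomes (in thousands of dollars), for illustration only.
actual_income = [30.0, 42.0, 55.0, 28.0]
predicted_income = [32.0, 40.0, 50.0, 31.0]

print(round(rmse(actual_income, predicted_income), 2))
```

Because the errors are squared before averaging, one prediction that is far off hurts the score much more than several small misses.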

Personally I found the challenge to be … ehrm … challenging 🙂 The main problem is the huge gap between the skills needed to finish all the courses (where you can easily score over 90%) and the skills needed to get over 80% in the capstone project. The training set had a lot of features (over 200 columns) and many columns had missing values. The missing values in particular could mess with the predictive abilities of your model. Another problem is that you can only do 3 submissions a day to calculate your score. If you’re like me (and probably most people who do the capstone project), you have a day job and need to do this project in your free time. This means I could only make time for the project on some evenings or on the weekend. If you like to work in many iterations, too bad: you only get 3 scored iterations a day.
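As an illustration of one common way to deal with missing values, here is a small sketch in plain Python that reports the share of missing values in a column and fills the gaps with the median. This is not what I did in Azure ML; the `tuition` column and its values are hypothetical, and median imputation is just one option among several.

```python
from statistics import median


def missing_share(rows, column):
    """Fraction of rows where the column is absent or None."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def impute_median(rows, column):
    """Return new rows with missing values replaced by the column median."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row.get(column) is None else row[column]}
        for row in rows
    ]


# Hypothetical sample rows; None marks a missing value.
rows = [
    {"tuition": 10000.0},
    {"tuition": None},
    {"tuition": 30000.0},
    {"tuition": 20000.0},
]

print(missing_share(rows, "tuition"))
print(impute_median(rows, "tuition"))
```

With over 200 columns, checking the share of missing values per column first helps to decide which columns are worth imputing and which are better dropped.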

I used Azure ML to do this project and it worked fine most of the time. I had a standard workspace, not the free version, so it’s a bit faster. But with such a huge training set (many features, and about 17,000 rows) Azure ML could be really slow sometimes, especially if you are tuning the hyperparameters over the entire grid. By the end, a run of my model easily took over an hour. At one point, I made a change and the model kept running for hours without any obvious reason (I asked a question about this on MSDN and the Azure ML engineers still have to respond). Scoring your test data set against your model in Azure ML wasn’t easy either. Using the Excel add-in and a webservice was problematic in most cases and a lot of students struggled with this. I ended up creating a predictive experiment, but instead of publishing it as a webservice I modified it to use the test data set instead.
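To give an idea of why a sweep over the entire grid gets slow, here is a rough sketch that counts how many models have to be trained. The parameter names, the values and the number of folds are assumptions made up for this example; they are not the settings of my experiment.

```python
from itertools import product


def count_model_fits(grid, folds):
    """Number of models trained when every grid combination is cross-validated."""
    combinations = list(product(*grid.values()))
    return len(combinations) * folds


# Hypothetical hyperparameter grid for a boosted tree model.
grid = {
    "learning_rate": [0.025, 0.05, 0.1, 0.2],
    "number_of_trees": [100, 200, 500],
    "max_leaves": [8, 32, 128],
}

print(count_model_fits(grid, folds=5))
```

Every extra value for a single parameter multiplies the total, which is why a random sweep over a subset of the grid is often the more practical choice on a data set of this size.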

Aside from all of these issues, I definitely learned a lot about creating a model: data cleaning, feature engineering, feature selection and so on. I’m really glad I did the capstone project. Am I a professional data scientist now? Not really, but at least I know the basics now. If there’s a data science project at a client, at least I can chat along 🙂

The last part, creating a report, wasn’t that hard; it’s just time-consuming to create a lot of charts. And because my writing style isn’t really condensed, I ended up with 30 pages (charts included). Whoops. 🙂

All in all, the Microsoft Professional Program for Data Science was a really nice experience. Would I recommend it to other people? Yes. Would I recommend paying for the certificate (about $1000 for all the courses and the project)? Probably not. I wouldn’t pay full price for it (at the beginning of the program we had a discount), unless my employer chipped in.

Koen Verbeeck

Koen Verbeeck is a Microsoft Business Intelligence consultant at AE, helping clients to get insight in their data. Koen has a comprehensive knowledge of the SQL Server BI stack, with a particular love for Integration Services. He’s also a speaker at various conferences.

7 thoughts to “Microsoft Professional Program in Data Science – The Finish Line”

  1. I am working on completing the course work as well and have found that the Excel add-in for ML Services leaves a lot to be desired. Half the time it just doesn’t work at all. I have learned a lot about data science though with the courses I’ve taken. I have to say that I struggled a bit with the Statistics course. My math skills were a little rusty.

  2. I graduated on the cycle just before yours, and I echo a lot of your thoughts. I thought it was a great course, and the capstone project really had me worried. Our capstone was similar to yours, a lot of features that had to be visualized considerably to see which really mattered. At first, my results weren’t very good, so I had to take a deep breath, go back into the studies and use the methodical method to get my model into good shape. I wound up with only about 14 features used, and it really worried me when others were getting very good results with over 100 features. But there was a lot of over-fitting, and when I submitted my model using the test data, I got a 93% mark, as I remember. Whew!
    Also, I took a lot more time than you did; I almost always spent around 20 hours per course, and sometimes more. Of course, previous experience matters a lot. I disagree with your comment about the usability of the SQL training. As I am working on some data science projects with my company’s data, the ability to get the data I need depends on SQL greatly.
    Anyway, great write up and best wishes. Thanks for giving good advice to those who are considering the plunge.

    1. Hi Ron,
      thanks for sharing your experience. My comments about the usability of the T-SQL course were aimed at the use of T-SQL in this program. There isn’t a single other course in the Professional Program for Data Science that actually uses T-SQL. It’s all R and Python. I did say that anybody working with data should know the basics of SQL, so I guess we’re on the same page there. 🙂

  3. Congrats and thanks for sharing your experience. I’m currently working on the program as well–I have 2 classes left and the capstone. Taking a break now for the holidays but am reading a book on R because I’ve picked up that neither R course goes deep enough. I was planning on taking a week off from work to work on the capstone when I get to that point and your comments about the capstone have more than reinforced my decision.

    1. Hi Molly,
      I wish you good luck with the capstone. If I can give you one piece of advice: do your data analysis first and do it thoroughly. I dived directly into modeling in Azure ML and I wasted time because I had made bad assumptions. And if you use Azure ML, run your experiment with the test data in Azure ML itself; don’t waste your time with the Excel plug-in.

  4. Great suggestion about the waste of time with the Excel plug-in. I might suggest that anyone doing the capstone review the comment made by Graeme M. when so many were floundering on getting test data scored. Here is his comment in the Discussion forum during my capstone. https://courses.edx.org/courses/course-v1:Microsoft+DAT102x+3T2017/discussion/forum/course/threads/595fb4a522a8fb07a50023bb
    He specifically mentions these videos:
    https://youtu.be/tOYflGJpwEQ
    https://youtu.be/666m4IYQTdw
    https://youtu.be/CoLKX5lHk3I
    https://youtu.be/guwOa9H7WRU

    Following these discussions and using them to get my model scored with the test data was critical to passing the capstone. I don’t know why they didn’t cover this during the data science course, but it was crucial. The above videos are in the course DAT228x: Developing Big Data Solutions with Azure Machine Learning.
