In October 2017, I completed the capstone project of the Microsoft Professional Program for Data Science. I’ve blogged about this program before:
I’m happy to announce I got the certificate:
How did the capstone project go? It was a very interesting experience. The project consists of three parts:
You log into a website hosting a “data science competition”. There you get your assignment; in my case predicting the average income of students in the USA a couple of years after graduating from an institution of higher education. For the first part, there are some questions on the edX website under the header “data analysis”. These questions force you to take a look at your data, create a couple of charts and histograms, and calculate some statistics in order to find the answers. All in all the questions were pretty easy, so I got the maximum score.
The second part is creating a predictive model. The competition website gives you a training data set, a test data set and a data glossary. You can choose whatever method you please to create the model. You can use R, Python, Azure ML or any combination of the previous. It’s totally up to you. You score the test data set against your model and upload the .csv file with the results to the competition website. It validates your results based on the RMSE versus the actual data. The lower the RMSE, the higher your score. This score is published on a leaderboard, so you can compare yourself against your peers.
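To make the scoring metric concrete, here is a minimal sketch of how an RMSE can be calculated in plain Python. The income values and the two sets of predictions are made up for illustration, and the competition website's exact implementation isn't known to me; this only shows the general formula.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values (in thousands of dollars), not taken from the capstone data.
actual_income = [30.0, 42.0, 55.0]
model_a = [32.0, 40.0, 54.0]  # predictions close to the actual values
model_b = [25.0, 50.0, 60.0]  # predictions further off

print(rmse(actual_income, model_a))
print(rmse(actual_income, model_b))
```

Because the errors are squared before averaging, a few large misses hurt the score more than many small ones, which is why model_a gets the lower (better) RMSE here.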
Personally I found the challenge to be … ehrm … challenging 🙂 The main problem is the huge gap between the skills needed to finish all the courses (where you can easily score over 90%) and the skills needed to get over 80% in the capstone project. The training set had a lot of features (over 200 columns) and many columns had missing values. The missing values especially could mess with the predictive abilities of your model. Another problem is that you can only do 3 submissions a day to calculate your score. If you’re like me (and probably most people who do the capstone project), you have a day job and need to do this project in your free time. This means I could only make time for this project on some evenings or on the weekend. If you like to work with multiple iterations, too bad, you can only do 3 iterations a day.
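With over 200 columns, a quick first check is to see how much of each column is actually missing. The sketch below is only an illustration of that idea using the Python standard library; the column names and the tiny inline data set are hypothetical, and it is not how I did it in Azure ML.

```python
import csv
import io

def missing_fraction(rows):
    """Return, per column, the fraction of rows where the value is empty."""
    rows = list(rows)
    if not rows:
        return {}
    columns = [col for col in rows[0] if col is not None]
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                counts[col] += 1
    return {col: count / len(rows) for col, count in counts.items()}

# Hypothetical sample; the real training set has over 200 columns.
sample_csv = """row_id,tuition,admission_rate
1,12000,
2,,0.45
3,8000,
4,15000,0.80
"""

fractions = missing_fraction(csv.DictReader(io.StringIO(sample_csv)))
for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} missing")
```

Columns that are mostly empty are candidates for dropping, while columns with only a few gaps are usually better handled by filling in a replacement value.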
I used Azure ML to do this project and it worked fine most of the time. I had a standard workspace, not the free version, so it’s a bit faster. But with such a large training set (many features and about 17,000 rows) Azure ML could be really slow sometimes, especially if you are tuning the hyperparameters over the entire grid. At the end, my model easily ran for over 1 hour. At one point, I made a change and the model kept running for hours without any solid reason (I asked a question about this on MSDN and the Azure ML engineers still have to respond). Scoring your test data set against your model in Azure ML wasn’t easy either. Using the Excel add-in and a webservice was problematic in most cases and a lot of students struggled with this. I ended up creating a predictive experiment, but instead of publishing it as a webservice I modified it to use the test data set instead.
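To give an idea of why tuning over the entire grid gets slow, here is a rough sketch that counts how many models a full grid search has to train. The parameter names, the values and the number of cross-validation folds are all assumptions chosen for illustration; they are not the settings I used in Azure ML.

```python
from itertools import product

# Hypothetical hyperparameter grid for a tree-based model.
grid = {
    "number_of_trees": [50, 100, 200, 400],
    "max_depth": [4, 8, 16, 32],
    "learning_rate": [0.025, 0.05, 0.1, 0.2],
    "min_samples_per_leaf": [1, 10, 50],
}

# A full grid search tries every combination of the values above.
combinations = list(product(*grid.values()))

# Assumption: each combination is evaluated with 5-fold cross-validation.
cv_folds = 5
total_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations, {total_fits} model fits in total")
```

Every extra parameter multiplies the number of combinations, so trimming the grid or sampling a random subset of it is usually the first thing to try when a run takes hours.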
Aside from all of these issues, I definitely learned a lot about creating a model: data cleaning, feature engineering, feature selection and so on. I’m really glad I did the capstone project. Am I a professional data scientist now? Not really, but at least I know the basics now. If there’s a data science project at a client, at least I can chat along 🙂
The last part, creating a report, wasn’t that hard; it’s just time consuming to create a lot of charts. And because my writing style isn’t really condensed, I ended up with 30 pages (charts included). Whoops. 🙂
All in all, the Microsoft Professional Program for Data Science was a really nice experience. Would I recommend it to other people? Yes. Would I recommend paying for the certificate (about $1000 for all the courses and the project)? Probably not. I wouldn’t pay full price for it (at the beginning of the program we had a discount), unless my employer chipped in.
View Comments
Hi Koen,
Congrats! I did finish the MPP Data science and I can say that I agree with your findings. Excel was a real pain with AzureML, for sure.
Hennie
I am working on completing the course work as well and have found that the Excel add-in for ML Services leaves a lot to be desired. Half the time it just doesn't work at all. I have learned a lot about data science though with the courses I've taken. I have to say that I struggled a bit with the Statistics course. My math skills were a little rusty.
I graduated on the cycle just before yours, and I echo a lot of your thoughts. I thought it was a great course, and the capstone project really had me worried. Our capstone was similar to yours, a lot of features that had to be visualized considerably to see which really mattered. At first, my results weren't very good, so I had to take a deep breath, go back into the studies and use the methodical method to get my model into good shape. I wound up with only about 14 features used, and it really worried me when others were getting very good results with over 100 features. But there was a lot of over-fitting, and when I submitted my model using the test data, I got a 93% mark, as I remember. Whew!
Also, I took a lot more time than you did. I almost always spent around 20 hours per course, and sometimes more. Of course, previous experience matters a lot. I disagree with your comment about the usability of the SQL training. As I am working on some data science projects with my company's data, the ability to get the data I need depends on SQL greatly.
Anyway, great write up and best wishes. Thanks for giving good advice to those who are considering the plunge.
Hi Ron,
thanks for sharing your experience. My comments about the usability of the T-SQL course were aimed at the use of T-SQL in this program. There isn't a single other course in the Professional Program for Data Science that actually uses T-SQL. It's all R and Python. I did say that anybody working with data should know the basics of SQL, so I guess we're on the same page there. :)
Congrats and thanks for sharing your experience. I'm currently working on the program as well--I have 2 classes left and the capstone. Taking a break now for the holidays but am reading a book on R because I've picked up that neither R course goes deep enough. I was planning on taking a week off from work to work on the capstone when I get to that point and your comments about the capstone have more than reinforced my decision.
Hi Molly,
I wish you good luck with the capstone. If I can give you one piece of advice: do your data analysis first and do it thoroughly. I dived directly into modeling in Azure ML and I wasted time because I had made bad assumptions. And if you use Azure ML, run your experiment with the test data in Azure ML itself; don't waste your time with the Excel plug-in.
Great suggestion about the waste of time with the Excel plug-in. I might suggest that anyone doing the capstone review the comment made by Graeme M. when so many were floundering on getting test data scored. Here is his comment in the Discussion forum during my capstone. https://courses.edx.org/courses/course-v1:Microsoft+DAT102x+3T2017/discussion/forum/course/threads/595fb4a522a8fb07a50023bb
He specifically mentions these videos:
https://youtu.be/tOYflGJpwEQ
https://youtu.be/666m4IYQTdw
https://youtu.be/CoLKX5lHk3I
https://youtu.be/guwOa9H7WRU
Following these discussions and using them to get my model scored with the test data was critical to passing the capstone. I don't know why they didn't cover this during the data science course, but it was crucial. The above videos are in the course DAT228x: Developing Big Data Solutions with Azure Machine Learning.
Hi Koen,
Thanks for your blogs! I finished all the courses and will start with the Capstone project within a month. Your blog really helps with the expectations.
Regards, Suzanne
Hi Suzanne,
thanks for your comment. Did you like the courses? I hope some of the issues I mentioned in my blog posts have gone away.
Good luck with your capstone project! Remember, first exploration of the data set, then building your model :)
Hi Koen,
Thanks for the good feedback! ;-)
Regards, Suzanne
Hi Koen,
Wanted to give you an update! I earned my certificate! :)
Regards, Suzanne
Congrats!
Hi Koen, Suzanne, and Ron.
Congratulations to you all.
Thank you for the good tips and insight. If I am allowed to ask only one question to those who have passed, it would be this: on a scale of 1 to 10, how much do I need to know about T-SQL, R, Python, statistics, ML, big data and Azure?
I will feel much obliged for your candid response so that I can manage my expectations and address any skills gap now before it gets too late in the process.
Hi Sam,
theoretically the program doesn't assume much knowledge. The T-SQL and R/Python courses are aimed at the beginner level. Of course, it helps if you know a bit of SQL and R, since you'll be using them in the exercises (although SQL not so much, only in the course about SQL). Statistics can be useful, because the Statistics in Excel course is not that easy, but again that course is also aimed at the beginner level, so everything is explained in the modules. It's no issue if you don't know anything about Azure or Big Data, those topics aren't covered in the course. Azure ML is used, but I had almost no experience with the tool before the program, so that's okay as well.
Obviously, the more prior knowledge you have, the easier the program will be. But it's not a prerequisite.
If someone gets stuck with the capstone project, is there any help available online, or a mentoring program?
Hi Nat,
the idea is of course that you try to complete the capstone project yourself. If you followed all the courses and did all the exercises, you should be prepared well enough to pass the capstone project.
That being said, you can always ask questions in the forums of edX. People will try to help you (without giving away all the answers of course).
Thank you for sharing information about Data science
I started the data science course this year and I need to finish it by the end of this month... I only have the capstone module left, and I'm having a lot of trouble testing my model with the provided test data, as the Excel ML add-in is always failing.
You used an alternative way to do this, right? Can you help me achieve this?
Hi,
sorry for the late answer, your comment was sent to my spam folder.
As explained in the blog post, I created a predictive experiment but instead of publishing it as a webservice, I just modified the input (to read the csv file) and the output (to write the predictions to a csv file).