Man vs AI

Data Science: My Journey from Doctor to Noob Kaggler.

In the last couple of months, I’ve been pursuing another hobby: data science and artificial intelligence (AI). You might have followed the much-publicised match between Lee Sedol, one of the world’s best Go players, and AlphaGo, an AI system developed by Google that spurred discussions about the future of AI. (Spoiler: AlphaGo won 4–1.)

This has been partly triggered by my work at Holmusk, where our brilliant data science team has been winning multiple competitions and most recently took part in the Second Annual Data Science Bowl on Kaggle. The challenge was to create an algorithm that automatically measures end-systolic and end-diastolic volumes in cardiac MRIs, from a data set of more than 1,000 patients. While I did not have much experience with cardiac MRIs when I was working as a radiologist (the protocols were quite complex), I had several discussions with the data science team on how a doctor would read and interpret MRI scans (for all the help it was worth).

For those of you who are not familiar with Kaggle, it is a community of data scientists who compete to solve complex problems from some of the biggest companies in the world. For most competitions, you’re provided with a ‘training’ data set on which you build a predictive model, and you are ranked on how accurately your model makes predictions on a hidden ‘test’ data set. The prizes can be substantial – for example, first prize for the cardiac MRI problem is a cool $125,000 – which keeps things competitive. But more importantly, it’s a great way to learn: by getting your hands dirty playing with data and building models. Winners often post their solutions online, allowing you to reflect on how you can improve in the future.

Given the rapidly increasing amount of health data available to us today (from genome sequencing to behavioural data from wearables) and the limited capacity of our human brains to mentally process information, it is inevitable that AI will revolutionize healthcare. Imagine its huge potential in providing clinical decision support to physicians, allowing us to make faster and more accurate diagnoses, while freeing up our time to focus on the things that matter most – communicating with our patients & their families, and building trust.

So – not saying that it’s necessarily the best way, but…

Here’s the route I’ve taken so far:

1. Andrew Ng’s Stanford Machine Learning Course

If you want to know what machine learning is about, this 10-week program is one of the best introductory courses on the topic, available for free (whoopee) on Coursera. The course gives you a conceptual view of what machine learning is, covering cost functions and gradient descent, before going into detail about how different techniques actually work. It covers both linear and non-linear techniques, including linear and logistic regression, neural networks and Support Vector Machines.

Andrew is the Chief Scientist at Baidu, an authority on AI, and speaks in a manner that is easy to understand, especially for someone like myself who left mathematics behind after entering med school! The course uses Octave (a programming language) as its framework, which was my only issue since I was already planning to use Python.

2. Learning the basics of Python on Codecademy

It was not difficult for me to pick up the basics of Python, since I had some experience back in school and had dabbled in building websites intermittently. Learning the right syntax was the most important thing for me, and took a couple of hours of practice – in particular functions, loops, lists and dictionaries.
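To give a flavour of what those four basics look like together, here is a tiny sketch (the patient records are made up for illustration):

```python
# A function that loops over a list of records and builds a
# dictionary of counts per diagnosis.
def count_diagnoses(records):
    counts = {}                      # dictionary
    for name, diagnosis in records:  # loop over a list of tuples
        counts[diagnosis] = counts.get(diagnosis, 0) + 1
    return counts

patients = [("Ann", "flu"), ("Ben", "flu"), ("Cal", "fracture")]
print(count_diagnoses(patients))  # {'flu': 2, 'fracture': 1}
```

A few hours on exercises like this was enough to make the syntax feel natural.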

Even if you’ve no programming experience, Codecademy is a good place to start learning to code.

3. Numpy, Matplotlib, Pandas and Sklearn

These are the essential software libraries to be familiar with when using Python for data science projects. Numpy allows you to work with multi-dimensional arrays easily. Matplotlib is a great visualization tool, especially useful when exploring data. Pandas lets you manipulate tabular data stored in the form of data frames.
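A quick sketch of how numpy and pandas fit together – the cardiac volumes below are invented numbers, just to echo the Data Science Bowl theme:

```python
import numpy as np
import pandas as pd

# Numpy: element-wise maths on a 2-D array
# (columns: end-diastolic volume, end-systolic volume)
volumes = np.array([[120.0, 45.0],
                    [150.0, 60.0]])

# Pandas: the same numbers as a labelled data frame,
# with a derived ejection-fraction column
df = pd.DataFrame(volumes, columns=["edv", "esv"])
df["ef"] = (df["edv"] - df["esv"]) / df["edv"] * 100
print(df)
```

Matplotlib would then plot `df["ef"]` in a line or two, which is why the three libraries are usually learnt together.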

And of course, sklearn, which allows you to perform machine learning on your datasets in a few simple commands – .fit and .predict. As with many things, it’s easy to learn but takes a long time to master, as you figure out how to tweak the hyperparameters for the best results.
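The whole .fit/.predict workflow fits in a few lines. A minimal sketch on toy data (the features and labels here are made up):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: two features per sample, binary labels
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X, y)                   # learn from the training data
preds = model.predict([[0.1, 0.0], [0.9, 0.9]])  # label unseen examples
print(preds)
```

The hard part is everything around those two calls: cleaning the data, choosing the model, and tuning hyperparameters like the regularization strength `C`.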

I picked up the basics of these libraries by reading a few tutorials online, but only really learnt them when I started practising on actual data sets. Being able to consult one of our data scientists whenever I got stuck helped a lot, so find a good programmer friend!

4. Harvard Data Science CS109

You can access the full CS109 data science course online – this includes a series of filmed lecture videos, lecture notes and homework. I haven’t attended Harvard, and this is as close as it gets for now.

At a minimum, I highly recommend watching from lecture 10 onwards. It’s lengthier than Andrew Ng’s machine learning course, at approximately 1–1.5 hours per lecture, but the lecturers tell stories and give lots of examples to reinforce key concepts. I don’t follow baseball, so I got lost in some of the analogies – but if you do, you’ll fit right in.

5. Participate in Kaggle competitions!

I’ve started with the most basic practice competition on Kaggle – Titanic: Machine Learning from Disaster. You download a training and a test data set to play around with, and your goal is to predict which passengers survived the sinking. Nothing teaches better than working with real data.

To recap, here are a set of helpful links:

Andrew Ng’s Stanford Machine learning course

Codecademy

Harvard Data Science CS109

Kaggle – Titanic data set

In my next post, I’ll share with you my progress in the competition, and exactly what I’ve done with the Titanic data set.
