Machine Learning: I did it, and so can you.

As promised, here’s the Python notebook I used to generate my first set of predictions using machine learning. It predicts passenger survival in the Titanic data set (Kaggle).

https://www.kaggle.com/tengyan/titanic/tengs-titanic-notebook/notebook

The Titanic training data set contains 891 rows (passengers), each with 12 columns, including Name, Sex, Age, Ticket number, Passenger Class, Cabin, Port of Embarkation, and number of siblings/spouses aboard. It is a relatively small data set, but fun to play with as a novice without feeling overwhelmed!

After some quick exploration of the data, I begin with some basic data munging to prepare the data set for fitting the machine learning models.

  1. There are several missing values of Age. I fill them in based on Sex, using the mean age of males and the mean age of females in the respective data sets.
  2. I use one-hot encoding for Sex (male, female) and Embarked (C, Q, S) to convert these categorical columns into a numerical, sparse matrix.
  3. I drop the columns which are unlikely to be useful, such as Passenger Name and Cabin number.
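
The three munging steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the tiny data frame is made up (it only borrows the Kaggle column names), the `prepare` helper is a hypothetical name, and it uses pandas' `get_dummies` in place of a dedicated one-hot encoder, so the result is a dense data frame rather than a sparse matrix.

```python
import pandas as pd

# Made-up sample rows that reuse the Kaggle Titanic column names.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 26.0, 30.0, None],
    "Embarked": ["S", "C", "Q", "S", "S", "C"],
    "Cabin": [None, "C85", None, None, "E46", None],
    "Survived": [0, 1, 0, 1, 0, 1],
})


def prepare(frame):
    """Hypothetical helper: apply the three munging steps to a copy."""
    out = frame.copy()
    # 1. Fill missing ages with the mean age of passengers of the same sex.
    mean_age_by_sex = out.groupby("Sex")["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(mean_age_by_sex)
    # 2. One-hot encode the categorical columns (dense 0/1 columns here).
    out = pd.get_dummies(out, columns=["Sex", "Embarked"])
    # 3. Drop columns that are unlikely to help the model.
    out = out.drop(columns=["Name", "Cabin"])
    return out


prepared = prepare(df)
print(prepared.head())
```

Running `prepare` separately on the training and test frames fills each one from its own group means, which is one reading of "the respective data sets".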

Next, I build a prediction model by fitting a standard Random Forest classifier to the training data set. Then, I run the fitted model on the test data set to generate predictions.
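
As a rough sketch of that fit-then-predict step with scikit-learn: the arrays below are random stand-ins for the prepared Titanic features, and the settings (`n_estimators=100`, `random_state=0`) are my assumptions for illustration, not necessarily what the notebook uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random stand-in data: 4 numeric features, binary label (assumption,
# not the real Titanic columns).
rng = np.random.default_rng(0)
X_train = rng.random((100, 4))
y_train = (X_train[:, 0] > 0.5).astype(int)
X_test = rng.random((20, 4))

# Fit the classifier on the training data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict a 0/1 label for every row of the test data.
predictions = model.predict(X_test)
print(predictions)
```

For the Kaggle submission, the predictions would then be written out next to each test passenger's ID.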

With these few simple steps, I was able to achieve a score of 0.79904, with a rank of 919 / 3927! Just below the TwoDildoBrothers – I don’t even want to imagine what that means.


Now, there are a few ways I can think of to improve the prediction accuracy:

  1. Split the training data set into ‘train’ and ‘validation’ sets. The validation set would allow me to tune the parameters of the machine learning model.
  2. Normalize / scale the data before fitting the model.
  3. Consider testing other machine learning models, such as Support Vector Machines or K-nearest neighbours, to see which works best.
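
Here is one way the three ideas above could fit together, as a hedged sketch rather than something I have run on the Titanic data. The feature matrix is random stand-in data, the 80/20 split and the model settings are assumptions, and `candidates`, `scores` and `best` are just illustrative names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random stand-in data (assumption, not the real Titanic features).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 1. Hold out 20% of the training data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 2. + 3. Candidate models; SVM and KNN get a scaling step in front.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Fit each candidate on the training part, score it on the validation part.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_val, y_val)

best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Putting the scaler inside a pipeline means it is fitted on the training part only, so nothing from the validation rows leaks into the model. Scaling matters most for distance-based models such as SVM and K-nearest neighbours; tree-based models like Random Forest are largely unaffected by it.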

I’m going to be experimenting further with this when I have time.
