As promised, here’s my Python notebook which I used to generate my first set of predictions using machine learning. This is for passenger survival in the Titanic data set (Kaggle).
The Titanic training data set contains 891 rows (passengers), each with 12 columns, including Name, Sex, Age, Ticket number, Passenger Class, Cabin, Port of Embarkation, and number of siblings/spouses aboard. It is a relatively small data set, but fun to play with as a novice without feeling overwhelmed!
After some quick explorations of the data, I begin with some basic data munging. This is to prepare the data set for fitting into the machine learning models.
- There are several missing values of Age. I fill them in based on Sex, using the mean age of males and the mean age of females in the respective data sets.
- I use one-hot encoding for Sex (male, female) and Embarked (C, Q, S) to convert these categorical columns into a numerical, sparse matrix.
- I drop the columns which are unlikely to be useful, such as passenger Name and Cabin number.
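The three munging steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the tiny DataFrame is made-up data that only borrows the Kaggle column names, `prepare` is a hypothetical helper name, and `pandas.get_dummies` is swapped in for a one-hot encoder (it returns ordinary dense columns, not a sparse matrix).

```python
import pandas as pd

# Toy stand-in for the Kaggle training file: the column names follow the
# data set, but every value here is invented for illustration.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Age": [22.0, 38.0, float("nan"), 26.0, 30.0, float("nan")],
    "Embarked": ["S", "C", "Q", "S", "S", "C"],
    "Cabin": [None, "C85", None, None, "E46", None],
    "Survived": [0, 1, 0, 1, 0, 1],
})


def prepare(frame):
    """Hypothetical helper: impute Age by Sex, one-hot encode, drop columns."""
    out = frame.copy()
    # Replace each missing Age with the mean age of passengers of the same Sex.
    mean_age_by_sex = out.groupby("Sex")["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(mean_age_by_sex)
    # One-hot encode the categorical columns into 0/1 indicator columns.
    out = pd.get_dummies(out, columns=["Sex", "Embarked"], dtype=int)
    # Drop columns that are unlikely to help the model.
    return out.drop(columns=["Name", "Cabin"])


prepared = prepare(df)
print(prepared)
```

The same function would be applied separately to the training and test frames, so that each is filled with its own group means.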
Next, I build a prediction model by fitting a standard Random Forest classifier to the training data set. Then, I run the fitted model on the test data set to generate predictions.
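As a rough sketch of that fit-then-predict step, assuming scikit-learn's `RandomForestClassifier`: the data below is randomly generated with an invented survival rule, and the `n_estimators` and `random_state` values are my own placeholders, not the settings used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features standing in for the prepared Titanic columns.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),      # classes 1, 2, 3
    "Age": rng.uniform(1, 70, size=n),
    "Sex_female": rng.integers(0, 2, size=n),  # 0 = male, 1 = female
})
# Invented label rule, purely so the model has something to learn.
y = X["Sex_female"].copy()

# The first 150 rows play the role of the training set, the rest the test set.
X_train, y_train = X.iloc[:150], y.iloc[:150]
X_test = X.iloc[150:]

# Fit the forest on the training rows, then predict for the unseen rows.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions[:10])
```

For a Kaggle submission, the predictions would then be written out alongside the passenger IDs of the test set.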
With these few simple steps, I was able to achieve a score of 0.79904, with a rank of 919 / 3927! Just below the TwoDildoBrothers – I don’t even want to imagine what that means.
Now, there are a few ways I can think of to improve the prediction accuracy:
- Split the training data set into ‘train’ and ‘validation’ sets. The validation set would allow me to tune the parameters of the machine learning model.
- Normalize / scale the data before fitting the model.
- Consider testing other machine learning models, such as Support Vector Machines or K-nearest neighbours, to see which works best.
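Here is one way those three ideas might fit together with scikit-learn. It is only a sketch under assumptions: the data comes from `make_classification` rather than the Titanic files, the 80/20 split and all model settings are arbitrary choices of mine, and scaling is applied only to the SVM and K-nearest neighbours pipelines, where distances between features matter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the prepared features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold out 20% of the labelled rows as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate models; the scaler is fitted on the training rows only,
# because it sits inside the pipeline.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Fit each model on the training split and score it on the validation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)  # validation accuracy

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Whichever model and parameters score best on the validation set can then be refitted on the full training data before predicting on the test set.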
I’m going to be experimenting further with this when I have time.