The sinking of the Titanic is the most famous shipwreck in history and led to discussions of better safety regulations for ships. One substantial safety issue was that there were not enough lifeboats for everyone on board, which meant that some passengers were prioritized over others for the lifeboats. This is where machine learning comes in. The goal of the competition is to build machine learning models that predict whether a passenger survived from attributes such as age, sex, and cabin class. This is a great project for anyone looking to get started with machine learning and Kaggle competitions: the data is fairly clean and the calculations are relatively simple.

View the project on Kaggle: Titanic: Machine Learning from Disaster (“Start here! Predict survival on the Titanic and get familiar with ML basics”).

View my Jupyter Notebook.


Key Observations

  • Sex of passenger: There is a strong correlation between the sex of the passenger and survival; females were clearly given preference over males.
  • Passenger Class (Pclass): Also correlated with survival; for both males and females, a better class has a higher survival percentage.
  • Age: Nothing too striking, but the age distribution of the survivors features a small peak at low ages (<5 years), suggesting young children were given preference.
  • Fare: It doesn't seem to have any striking correlation with survival.
  • Family Size: Small families (<4 members) did better than both larger families and solo travellers.
  • Fitting a logistic regression model to the training data with the above features resulted in a cross-validation accuracy of ~80%.

Please keep reading if you are interested in a detailed walk-through of the whole project.

Dataset Introduction

The data for the passengers is contained in two files and each row in both data sets represents a passenger on the Titanic.

  • train.csv: Contains data on 891 passengers
  • test.csv: Contains data on 418 passengers

Each column represents one feature.

  • PassengerId – A numerical id assigned to each passenger.
  • Pclass – The class the passenger was in.
  • Name – The name of the passenger.
  • Sex – The gender of the passenger – male or female.
  • Age – The age of the passenger (fractional if less than 1).
  • SibSp – The number of siblings and spouses the passenger had on board.
  • Parch – The number of parents and children the passenger had on board.
  • Ticket – The ticket number of the passenger.
  • Fare – How much the passenger paid for the ticket.
  • Cabin – Which cabin the passenger was in.
  • Embarked – Where the passenger boarded the Titanic.

The training data additionally contains a ‘Survived’ column, which is what we want to predict for the test data.

  • Survived – Whether the passenger survived (1), or didn’t (0).
# import pandas and numpy
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

# Loading data and printing first few rows
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Previewing the first few rows of the test data
test.head(2)
   PassengerId  Pclass                              Name     Sex   Age  SibSp  Parch  Ticket    Fare Cabin Embarked
0          892       3                  Kelly, Mr. James    male  34.5      0      0  330911  7.8292   NaN        Q
1          893       3  Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0  363272  7.0000   NaN        S

Feature Engineering

Feature engineering is the process of using our knowledge of the data to create features that make machine learning algorithms work. It is often considered one of the most challenging parts of machine learning, as choosing the right features can drastically improve the performance of a model. As per Andrew Ng, Coursera founder and machine learning professor at Stanford,

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.

The first part of feature engineering is finding the features that are relevant and discarding the ones that are not. For example, features such as passenger Id, ticket number, and passenger name are relatively less useful and can be dropped from the model. The next task is then to clean the data and dive into the correlation of each feature with ‘Survived’.

# Retain columns that are of interest and discard the rest (Id, Name, Cabin, and Ticket number)
newcols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare','Embarked']
train = train[newcols]
# Check which columns have missing data
print("In training data, columns with missing values:")
for column in train.columns:
    if train[column].isnull().any():
        print(column)
In training data, columns with missing values:
Age
Embarked
# Retain columns that are of interest and discard the rest (Name, Cabin, and Ticket number)
newcols = ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare', 'Embarked']
test = test[newcols]
# Check which columns have missing data
print("In test data, columns with missing values:")
for column in test.columns:
    if test[column].isnull().any():
        print(column)
In test data, columns with missing values:
Age
Fare

Missing Value Imputation

Now that we have broadly peeked into the data, it is time to fix the columns with missing values. The missing data are in the Age and Embarked columns of the training set and the Age and Fare columns of the test set, so let's fix these.

  • Age in training data: it makes sense to simply fill the missing values with the median age.
  • Embarked in training data: there are multiple choices here (I used the simpler option 1, but many notebooks on Kaggle describe option 2; see the sketch after this list):
    1. Fill it with the most frequent value, ‘S’.
    2. Infer it from the fare, as the fare might depend on the port of embarkation.
  • Age in test data: I filled it with the median of the training data.
  • Fare in test data: I filled it with the median of the training data.
  • Compared to the mean, the median is less sensitive to very large or very small values (outliers), and is a more realistic center of the distribution when the distribution is not normal.
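For completeness, here is a minimal sketch of option 2, which is not used in the rest of this walk-through; the helper name guess_embarked is a hypothetical choice of mine. It fills a missing port with the port whose median fare is closest to the passenger's fare:

# Sketch only (not used below): infer a missing port of embarkation
# from the fare, by picking the port whose median fare is closest
fare_by_port = train.groupby('Embarked')['Fare'].median()

def guess_embarked(fare):
    # Port label with the smallest |median fare - fare|
    return (fare_by_port - fare).abs().idxmin()

missing = train['Embarked'].isnull()
print(train.loc[missing, 'Fare'].apply(guess_embarked))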
# Filling missing age data with median values
train["Age"] = train["Age"].fillna(train["Age"].median())

# data cleaning for Embarked
print (train["Embarked"].unique())
print (train.Embarked.value_counts())
train["Embarked"] = train["Embarked"].fillna('S')
['S' 'C' 'Q' nan]
S    644
C    168
Q     77
Name: Embarked, dtype: int64
# Filling missing age data with median values of the training set
test["Age"] = test["Age"].fillna(train["Age"].median())

# filling fare data with median of training set
test["Fare"] = test["Fare"].fillna(train["Fare"].median())

Feature Creation

Feature creation is perhaps more of an art than a science and varies significantly with the person analyzing the data. I created a new feature by combining ‘Parch’ and ‘SibSp’. Parch is an abbreviation of ‘parents/children’ and represents the number of parents and children aboard; SibSp is an abbreviation of ‘siblings/spouses’ and represents the number of brothers, sisters, wives, or husbands aboard. I tried plotting these independently against survival frequency and didn’t see anything interesting. However, the two can be combined into a more meaningful feature, ‘Family Size’, computed as the sum of the two plus one (for the passenger themselves). We will see later in the plots that family size seems to correlate with survival.

for df in [train, test]:
    df['FamilySize'] = df['Parch'] + df['SibSp'] + 1

# Bucket family size: 1 -> Solo, 2-3 -> Small, 4 or more -> Big
def filter_family_size(x):
    if x == 1:
        return 'Solo'
    elif x < 4:
        return 'Small'
    else:
        return 'Big'

for df in [train, test]:
    df['FamilySize'] = df['FamilySize'].apply(filter_family_size)
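
To sanity-check the new feature before plotting, we can look at how the three buckets are distributed:

# Distribution of the FamilySize buckets in the training data
print(train['FamilySize'].value_counts())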

Visualization

For most of my plots, I have used Seaborn which is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

# matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
plt.rcParams.update({'font.size': 22})
# Check with Pclass and Embarked
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,4))
ax1.set_title('Survival rates for Passenger Classes')
sns.barplot(x='Pclass', y='Survived', hue='Sex', data=train, ax=ax1)

ax2.set_title('Survival rates for Port Embarked')
sns.barplot(x='Embarked', y='Survived', hue='Sex', data=train, ax=ax2)
sns.despine()
sns.set(font_scale=1.4)
ax2.legend_.remove()
ax1.legend(loc='upper right')
plt.show()

[Figure: survival rates by passenger class and by port of embarkation, split by sex]

# Check with Age
g = sns.FacetGrid(train, col="Survived", hue='Sex', height=5, aspect=1.2)
g.map(sns.kdeplot, "Age", shade=True).add_legend().fig.subplots_adjust(wspace=.3)
sns.despine()
sns.set(font_scale=2)
plt.show()

[Figure: age distributions for survivors and non-survivors, split by sex]

# Check with Fare
g = sns.FacetGrid(train, col="Survived", hue='Sex', height=5, aspect=1.2)
g.map(sns.kdeplot, "Fare", shade=True).add_legend().fig.subplots_adjust(wspace=.3)
sns.set(font_scale=2)
sns.despine()
plt.show()

[Figure: fare distributions for survivors and non-survivors, split by sex]

# Family Size
sns.barplot(x='FamilySize', y='Survived' , data=train, order = ['Solo', 'Small', 'Big'])
sns.set(font_scale=1.5)
plt.show()

[Figure: survival rate by family size (Solo, Small, Big)]

Key Observations

  • Sex of passenger: Again, there is a strong correlation between sex and survival; females were clearly given preference over males.
  • Passenger Class (Pclass): Also correlated with survival; for both males and females, a better class has a higher survival percentage.
  • Age: Nothing too striking, but if you look at the distribution for the survivors, there is a small peak at low ages (<5 years); it seems children were also given preference.
  • Fare: It doesn't have any striking correlation with survival.
  • Family Size: Small families did better than both larger families and solo travellers.

Model Training

For our Titanic dataset, the target is a binary variable, so a logistic regression model makes more sense than a linear regression model. In this project, I used the Python library scikit-learn to perform logistic regression using the features defined below.
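As a brief refresher, logistic regression passes a linear combination of the features through the sigmoid function, so the predicted probability of survival is

p(survived) = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))

and scikit-learn's predict classifies a passenger as a survivor when this probability exceeds 0.5.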

# Convert to numeric values
train.loc[train["Embarked"] == 'S', "Embarked"] = 0
train.loc[train["Embarked"] == 'C', "Embarked"] = 1
train.loc[train["Embarked"] == 'Q', "Embarked"] = 2

test.loc[test["Embarked"] == 'S', "Embarked"] = 0
test.loc[test["Embarked"] == 'C', "Embarked"] = 1
test.loc[test["Embarked"] == 'Q', "Embarked"] = 2
# convert female/male to numeric values (male=0, female=1)
train.loc[train["Sex"]=="male","Sex"]=0
train.loc[train["Sex"]=="female","Sex"]=1

test.loc[test["Sex"]=="male","Sex"]=0
test.loc[test["Sex"]=="female","Sex"]=1
# convert family size to numeric values

train.loc[train["FamilySize"] == 'Solo', "FamilySize"] = 0
train.loc[train["FamilySize"] == 'Small', "FamilySize"] = 1
train.loc[train["FamilySize"] == 'Big', "FamilySize"] = 2

test.loc[test["FamilySize"] == 'Solo', "FamilySize"] = 0
test.loc[test["FamilySize"] == 'Small', "FamilySize"] = 1
test.loc[test["FamilySize"] == 'Big', "FamilySize"] = 2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# columns we'll use to predict outcome
features = ['Pclass', 'Sex', 'Age', 'FamilySize', 'Fare', 'Embarked']
label = 'Survived'

# instantiate the model
logreg = LogisticRegression()

# perform cross-validation
print(cross_val_score(logreg, train[features], train[label], cv=10, scoring='accuracy').mean())
0.798010157757

We have an accuracy of ~80%: given a passenger's attributes, we can predict with roughly 80% accuracy whether that passenger survived.

Prediction and Kaggle Submission

Now we need to run our prediction on the test data set and submit our predictions to Kaggle. Kaggle specifies the submission format; in this case they ask for the passenger Id and the predicted survival for each test passenger in CSV format.

# Fit the model on the full training data and predict on the test data
logreg.fit(train[features], train[label])
prediction = logreg.predict(test[features])

# Create a new dataframe with only the columns Kaggle wants from the dataset
submission_DF = pd.DataFrame({ 
    "PassengerId" : test["PassengerId"],
    "Survived" : prediction
    })
print(submission_DF.head(2))
   PassengerId  Survived
0          892         0
1          893         0
# prepare file for submission
submission_DF.to_csv("submission.csv", index=False)