In this blog post, I am sharing my experience solving a machine learning project from Kaggle: predicting survivors of the Titanic disaster. I chose to do this project in a Jupyter notebook, as it also allowed me to write a markdown file as I went. This was a great project to start with, as the data is fairly clean and the calculations are relatively simple.

View the project here: Titanic: Machine Learning from Disaster ("Start here! Predict survival on the Titanic and get familiar with ML basics").

Data Dictionary

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard | |
| parch | # of parents / children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

My project has the following parts: loading and understanding the data, feature engineering, fixing the missing data, prediction, and the Kaggle submission.

Load and Understand Data

# Imports

# pandas
import pandas as pd
from pandas import Series, DataFrame

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

I started off with the packages that I needed right away, such as numpy and pandas, and added more as I needed them. Now let's take a look at the data, which I have loaded into a variable called titanic_DF.

I used the following commands to load the data and see the first two rows.

# Loading data and printing first few rows
titanic_DF = pd.read_csv('train.csv')
test_DF = pd.read_csv('test.csv')

titanic_DF.head(2)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
# Similarly look into test data 
test_DF.head(2)
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |  --- |
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |

Calling titanic_DF.info() prints summary information about the data, including how many non-null values each column has. For instance, you can see that the Age column has only 714 non-null entries, as opposed to PassengerId, which has 891.

Similarly, the test data also has missing values in several columns. I have commented out test_DF.info(), but you can uncomment it and check.

# Previewing the structure of the training data
titanic_DF.info()
# test_DF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
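
For a more compact view of just the missing values, pandas can count nulls per column directly. A quick alternative check (my own addition, not in the original notebook):

# count missing values per column in both dataframes
print(titanic_DF.isnull().sum())
print(test_DF.isnull().sum())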
# Data Visualization 
plt.rc('font', size=24)
fig = plt.figure(figsize=(18, 8))
alpha = 0.6

# Plot Pclass distribution
ax1 = plt.subplot2grid((2,3), (0,0))
titanic_DF.Pclass.value_counts().plot(kind='barh', color='blue', label='train', alpha=alpha)
test_DF.Pclass.value_counts().plot(kind='barh', color='magenta', label='test', alpha=alpha)
ax1.set_ylabel('Pclass')
ax1.set_xlabel('Frequency')
ax1.set_title("Distribution of Pclass")
plt.legend(loc='best')

# Plot Sex distribution
ax2 = plt.subplot2grid((2,3), (0,1))
titanic_DF.Sex.value_counts().plot(kind='barh', color='blue', label='train', alpha=alpha)
test_DF.Sex.value_counts().plot(kind='barh', color='magenta', label='test', alpha=alpha)
ax2.set_ylabel('Sex')
ax2.set_xlabel('Frequency')
ax2.set_title("Distribution of Sex")
plt.legend(loc='best')

# Plot Embarked distribution
ax3 = plt.subplot2grid((2,3), (0,2))
titanic_DF.Embarked.fillna('S').value_counts().plot(kind='barh', color='blue', label='train', alpha=alpha)
test_DF.Embarked.fillna('S').value_counts().plot(kind='barh', color='magenta', label='test', alpha=alpha)
ax3.set_ylabel('Embarked')
ax3.set_xlabel('Frequency')
ax3.set_title("Distribution of Embarked")
plt.legend(loc='best')

# Plot Age distribution
ax4 = plt.subplot2grid((2,3), (1,0))
titanic_DF.Age.fillna(titanic_DF.Age.median()).plot(kind='kde', color='blue', label='train', alpha=alpha)
test_DF.Age.fillna(test_DF.Age.median()).plot(kind='kde', color='magenta', label='test', alpha=alpha)
ax4.set_xlabel('Age')
ax4.set_title("Distribution of Age")
plt.legend(loc='best')

# Plot Fare distribution
ax5 = plt.subplot2grid((2,3), (1,1))
titanic_DF.Fare.fillna(titanic_DF.Fare.median()).plot(kind='kde', color='blue', label='train', alpha=alpha)
test_DF.Fare.fillna(test_DF.Fare.median()).plot(kind='kde', color='magenta', label='test', alpha=alpha)
ax5.set_xlabel('Fare')
ax5.set_title("Distribution of Fare")
plt.legend(loc='best')
plt.tight_layout()

[Figure: distributions of Pclass, Sex, and Embarked (bar charts) and Age and Fare (density plots) for the training and test sets]

We've now got a good sense of the 12 variables in our training dataframe, their types, and which ones have missing data. The next section is feature engineering!

Feature Engineering

Feature engineering is the process of using our knowledge of the data to create features that make machine learning algorithms work. As Andrew Ng put it:

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.

So the challenge for me as a beginner was to look at each variable with an open mind as a potential feature. Let's list the columns one more time, alongside the data dictionary, to decide which ones we can use.

# print the names of the columns in the data frame
titanic_DF.columns

# Check which columns have missing data
for column in titanic_DF.columns:
    if titanic_DF[column].isnull().any():
        print(column)
Age
Cabin
Embarked
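
As an aside, the same list can be obtained with a one-liner (a stylistic alternative, not from the original notebook):

# select the column names whose columns contain any nulls
titanic_DF.columns[titanic_DF.isnull().any()]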

Visualizing Features

First, I generated distributions of the various features for both the training and test data to understand which factors are important (credit).

Then I started plotting survival frequencies to understand my features better, and in the process played around with plotting styles (I used Seaborn for most of these plots).

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

# Plot survival by Pclass
fig = plt.figure(figsize=(6, 6))
sns.factorplot('Pclass', 'Survived', order=[1, 2, 3], data=titanic_DF, size=4, color="green")
plt.ylabel('Fraction Survived')
plt.xlabel('Pclass')
plt.title("Survival according to Class")

# Plot survival by Gender
fig = plt.figure(figsize=(6, 6))
sns.factorplot('Sex', 'Survived', data=titanic_DF, size=4, color="green")
plt.ylabel('Fraction Survived')
plt.xlabel('Gender')
plt.title("Survival according to Gender")

[Figures: fraction survived by passenger class and by gender]

# Plot Fare
fig = plt.figure(figsize=(15, 6))
titanic_DF[titanic_DF.Survived==0].Fare.plot(kind='density', color='red', label='Died', alpha=alpha)
titanic_DF[titanic_DF.Survived==1].Fare.plot(kind='density',color='green', label='Survived', alpha=alpha)
plt.ylabel('Density')
plt.xlabel('Fare')
plt.xlim([-100,200])
plt.title("Distribution of Fare for Survived and Did not survive" )

plt.legend(loc='best')
plt.grid()

[Figure: Fare density for survivors vs. non-survivors]

  • Passenger Class (pclass): Let's start with passenger class. Here I have plotted the fraction that survived as a function of passenger class. This seems like a no-brainer: passengers in better classes were certainly evacuated first. There is a near-linear correlation!

  • Sex of passenger: Again, there is a strong correlation between sex and survival.
  • sibsp and parch

sibsp = # of siblings / spouses aboard

parch = # of parents / children aboard

These features are obviously not linearly correlated with survival, but they seem to have some complex dependence.

  • ticket: The ticket data shouldn't be of any use, as it appears to be a unique number per passenger. At this point we might as well drop Ticket from our dataframe (see the sketch after this list).

  • Fare: Let's see how it fares! On its own it doesn't have any striking correlation with survival frequency.
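
Here is a minimal sketch of that cleanup (my own addition; the notebook later selects its features explicitly via a predictors list, so this step is optional):

# drop the Ticket column, which we decided carries no signal
# (optional: the model below picks its features via a `predictors` list anyway)
titanic_DF = titanic_DF.drop('Ticket', axis=1)
test_DF = test_DF.drop('Ticket', axis=1)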

Fix Missing Data

Now that we have broadly looked at the individual features in the columns without missing data, it's time to fix the columns that do have missing data. The missing values are in the Age, Embarked, and Cabin columns, so let's figure out how to fix these. For Age, it makes sense to simply fill in the missing values with the median age.

# Filling missing age data with median values
titanic_DF["Age"] = titanic_DF["Age"].fillna(titanic_DF["Age"].median())

# Plot age
fig = plt.figure(figsize=(15, 6))
titanic_DF[titanic_DF.Survived==0].Age.plot(kind='density', color='red', label='Died', alpha=alpha)
titanic_DF[titanic_DF.Survived==1].Age.plot(kind='density',color='green', label='Survived', alpha=alpha)
plt.ylabel('Density')
plt.xlabel('Age')
plt.xlim([-10,90])
plt.title("Distribution of Age for Survived and Did not survive" )
plt.legend(loc='best')
plt.grid()

[Figure: Age density for survivors vs. non-survivors]

For Embarked there are multiple choices:

  1. Fill it with the most frequent value, ‘S’.
  2. Impute it from the fare, since the fare may depend on the port of embarkation.

Here I have used the simpler option 1, but there are many notebooks on Kaggle that describe option 2. I have also converted ‘S’, ‘C’, and ‘Q’ to 0, 1, and 2 respectively, so that the column is numeric and can be used for training.

# data cleaning for Embarked
print(titanic_DF["Embarked"].unique())
print(titanic_DF.Embarked.value_counts())
['S' 'C' 'Q' nan]
S    644
C    168
Q     77
Name: Embarked, dtype: int64
# filling Embarked data with most frequent 'S'
titanic_DF["Embarked"] = titanic_DF["Embarked"].fillna('S')
titanic_DF.loc[titanic_DF["Embarked"] == 'S', "Embarked"] = 0
titanic_DF.loc[titanic_DF["Embarked"] == 'C', "Embarked"] = 1
titanic_DF.loc[titanic_DF["Embarked"] == 'Q', "Embarked"] = 2
# convert female/male to numeric values (male=0, female=1)
titanic_DF.loc[titanic_DF["Sex"]=="male","Sex"]=0
titanic_DF.loc[titanic_DF["Sex"]=="female","Sex"]=1

Prediction

Train a model: Logistic Regression

For our Titanic dataset, the prediction target is a binary variable, which is discontinuous, so using a logistic regression model makes more sense than using a linear regression model. In the following snippet I have used scikit-learn to perform logistic regression on the features defined in predictors.
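
Under the hood, logistic regression passes a linear combination of the features through the sigmoid function, so the output lands in (0, 1) and can be read as a survival probability. A tiny illustration of the sigmoid itself (my own addition, not from the original notebook):

# the sigmoid squashes any real-valued score into (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5, i.e. the decision boundary
print(sigmoid(3.0))  # ~0.95, a confident "survived"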

from sklearn.linear_model import LogisticRegression
# note: cross_val_score now lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in newer versions
from sklearn.model_selection import cross_val_score

# columns we'll use to predict outcome
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]


# instantiate the model
logreg = LogisticRegression()

# perform cross-validation
print(cross_val_score(logreg, titanic_DF[predictors], titanic_DF['Survived'], cv=10, scoring='accuracy').mean())
0.79354102826
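
The other classifiers imported at the top can be scored the same way. A quick comparison sketch under the same cross-validation setup (my own addition; scores will vary, and SVC in particular benefits from feature scaling, so I leave it out here):

# compare the classifiers imported earlier using the same 10-fold CV
for model in [LogisticRegression(), RandomForestClassifier(n_estimators=100),
              KNeighborsClassifier(), GaussianNB()]:
    score = cross_val_score(model, titanic_DF[predictors], titanic_DF['Survived'],
                            cv=10, scoring='accuracy').mean()
    print(model.__class__.__name__, score)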

Kaggle Submission

Now we need to run our prediction on the test dataset and submit to Kaggle.

# print the names of the columns in the data frame
test_DF.columns

# Check which columns have missing data
for column in test_DF.columns:
    if test_DF[column].isnull().any():
        print(column)
Age
Fare
Cabin
# Filling missing test ages with the training set's median age
test_DF["Age"] = test_DF["Age"].fillna(titanic_DF["Age"].median())

# filling Embarked data with most frequent 'S'
test_DF["Embarked"] = test_DF["Embarked"].fillna('S')
test_DF.loc[test_DF["Embarked"] == 'S', "Embarked"] = 0
test_DF.loc[test_DF["Embarked"] == 'C', "Embarked"] = 1
test_DF.loc[test_DF["Embarked"] == 'Q', "Embarked"] = 2

# convert female/male to numeric values (male=0, female=1)
test_DF.loc[test_DF["Sex"]=="male","Sex"]=0
test_DF.loc[test_DF["Sex"]=="female","Sex"]=1

# The test set also has a missing Fare value
test_DF["Fare"] = test_DF["Fare"].fillna(test_DF["Fare"].median())

# Apply our prediction to test data
logreg.fit(titanic_DF[predictors], titanic_DF["Survived"])
prediction = logreg.predict(test_DF[predictors])
# Create a new dataframe with only the columns Kaggle wants from the dataset
submission_DF = pd.DataFrame({ 
    "PassengerId" : test_DF["PassengerId"],
    "Survived" : prediction
    })
print(submission_DF.head(5))
   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         1
# prepare file for submission
submission_DF.to_csv("submission.csv", index=False)
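
Before uploading, it's worth a quick sanity check that the file matches what Kaggle expects: 418 rows (one per test passenger) and exactly the PassengerId and Survived columns.

# quick sanity check on the generated file
check_DF = pd.read_csv("submission.csv")
print(check_DF.shape)             # expecting (418, 2)
print(check_DF.columns.tolist())  # ['PassengerId', 'Survived']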