Would You Survive the Titanic? – Data Visualisation in Python

James HeslipComputer ScienceLeave a Comment

There are a lot of useful tools out there that can help you see trends and patterns in your data… but how valuable is that in the real world? Recently, I have been looking into Matplotlib and Seaborn - two of Python’s best data visualisation libraries. With the knowledge of these packages (and a few other data science dependencies), you can begin to really see, for example, the likelihood of your survival on the Titanic…

Initialisation

If we want to work with these third-party data visualisation libraries, we first need to first install them from command prompt/the terminal. This can be done using pip, the Package Installer for Python. Pip is bundled with most modern distributions of Python since about 3.4+.

After installing the package manager, individual libraries can be installed from your command line/terminal as so (just substitute {package name} for the name of the library):


pip install {package name}

And imported through into your interpreter of choice (I am using Jupyter Notebooks through the Anaconda distribution of Python):

import numpy as np
 import pandas as pd
 import matplotlib.pylplot as plt
 %matplotlib inline
 import seaborn as sns

A brief intro to the libraries

Seaborn is the main library I will be demonstrating here. It builds off the power of Matplotlib, and while they both have charting capabilities, it is incredible the amount that you can do with a single line in Seaborn.

Pandas gives the ability to create what is known as a Pandas Dataframe. These can be thought of as Excel-like worksheets, with methods to select/filter on rows and columns by criteria. Numpy is a dependency, mainly used to create data arrays - it is not used raw in this blog.

For the sake of demonstration and ease, Seaborn offers various datasets. Provided you have an internet connection, you can query what sets are available like so:


sns.get_dataset_names()

Loading the Data

One of these datasets is about the Titanic. Loading this in, we can also get a snippet of the first few rows to get an idea of what the data looks like:


titanic = sns.load_dataset('titanic')
titanic.head()

SurvivedpclassSexAgesibspparchfareembarkedClasswhoadult_maledeckembark_townalivealone
003M22107.2500SThirdManTrueNaNSouthamptonNFalse
111F381071.2833CFirstWomenFalseCCherbourgYFalse
213F26007.9250SThirdWomenFalseNaNSouthamptonYTrue
311F351053.1000SFirstWomenFalseCSouthamptonYFalse
403M35008.0500SThirdManTrueNaNSouthamptonNTrue

If you are interested in a further explanation of columns headings, or getting a copy of this dataset, please check out this resource.

Simple data visualisation

If we have simple categorical data- sex, embarked, or even sibsp (number of siblings/spouses)- we can use Pandas methods to aggregate these, and perform a count:


# Count values according to Siblings/Spouses
titanic['sibsp'].value_counts()

0    608
 1    209
 2     28
 4     18
 3     16
 8      7
 5      5
 Name: sibsp, dtype: int64

All this is fine if we are only looking to gather values. To really see the data it helps to graph it.


sns.countplot(x='sibsp', data=titanic)

Image

This is a very simple example showing the count of each distinct item. We are not really gaining any insight we do not already know.

We can also do things like scatter plots of bi-variate data, i.e. age vs fare:


sns.scatterplot(x='age', y='fare', data=titanic)

Image

And add a line of regression to see the correlation:


sns.lmplot(x='age', y='fare', data=titanic, line_kws={'color':'red'})

Image

All these graphs shown before can be extended to show multi-dimensional data.


sns.lmplot(x='age',
           y='fare',
           data=titanic,
           line_kws={'color':'red'},
           col='sex'
          )

Image

So… how does this help you survive the Titanic?


Money?

My first thought, is there any correlation between those who paid more for their tickets, and those who survived?


sns.boxplot(x='survived', y='fare', data=titanic)

Image

Interpreting these box plots, it would appear that those who paid a significant amount (500+) did survive, but this could have been coincidence. If we forgive this outlier, and zoom more into the real data:


plt.figure(figsize=(7,7))
sns.boxplot(x='survived',y='fare',data=titanic, showfliers = False)

Image

Zooming in on this, it is immediately a lot easier to see that those who paid more, were better looked after.


Gender?


sns.countplot(x='survived',data=titanic, hue='sex')

Image

A quick look at gender survival rates shows that being a male is not in your favour. It should be noted that there are more males in the data set, but you can see by comparing the heights of the bars that only about 1/6 of the males survived, while 2/3 of the females survived.


Family?

So, we have established that money and gender are good indicators. But what about the children? Plots using Matplotlib are better for advanced visualisations such as these.


died = titanic[titanic['survived']==False]['parch'].value_counts().sort_index()
 totals = titanic['parch'].value_counts().sort_index()
 dead_pct = (100*died/totals).fillna(0)
 fig = plt.figure()
 ax = fig.add_axes([0,0,1,1])
 ax.plot(dead_pct, color='red')
 ax.set_xlabel('Number of parents/children in party')
 ax.set_ylabel('Mortality %')
 ax.set_title('Family size by Mortality% on the Titanic')

Image

In situations where there were larger families (4+), there was a lesser rate of survival, good argument for smaller family units! Or maybe it is even better to travel alone?


Image

Actually, very much the opposite. It would appear that those passangers traveling alone were almost twice as likely not to make it out than those with others.


Conclusion

I think if I had to make a recommendation based on the data I’ve studied thus far, it’d be don’t buy a ticket for Titanic. If you were going to do it anyway, take some family, maybe treat a couple of nieces or nephews, and consider paying that little bit extra for the upgrades. Life may be shorter than you think…

Of course, all of this is based off my interpretation of a couple of graphs. This is a great way to scratch the surface of what we can learn from this data, but to really get to grips with it, and begin to make accurate predictions, we can consider training our systems using a method known as machine learning. More on this next time.


Resources



About the Author

A picture of James Heslip the author of this blog

James Heslip

APL Team Leader


James is an APL Programmer with a keen interest in mathematics. His love for computing was almost an accident. From a young age he always enjoyed using them- playing video games and such- but it was never considered that anything more would come from it. James originally had plans to pursue a career in finance. More about James.


More from James



Other Posts