There are a lot of useful tools out there that can help you see trends and patterns in your data… but how valuable is that in the real world? Recently, I have been looking into Matplotlib and Seaborn - two of Python’s best data visualisation libraries. With the knowledge of these packages (and a few other data science dependencies), you can begin to really see, for example, the likelihood of your survival on the Titanic…

Initialisation

If we want to work with these third-party data visualisation libraries, we first need to install them from the command prompt or terminal. This can be done using pip, the package installer for Python. Pip has been bundled with the standard Python installers since version 3.4.

With pip available, individual libraries can be installed from your command line/terminal like so (just substitute {package name} for the name of the library):

pip install {package name}

They can then be imported into your interpreter of choice (I am using Jupyter Notebooks through the Anaconda distribution of Python):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter-only magic: draw plots inline, directly below the cell
%matplotlib inline
import seaborn as sns

A brief intro to the libraries

Seaborn is the main library I will be demonstrating here. It builds on the power of Matplotlib, and while they both have charting capabilities, it is incredible how much you can do with a single line in Seaborn.

Pandas gives the ability to create what is known as a Pandas DataFrame. These can be thought of as Excel-like worksheets, with methods to select/filter on rows and columns by criteria. NumPy is a dependency of these libraries, mainly used to create data arrays; it is not used directly in this blog.
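As a minimal sketch of that select/filter idea, the snippet below builds a tiny DataFrame by hand and pulls out rows and columns by criteria. The passenger records and the variable names are made up purely for illustration; they are not taken from the Titanic dataset.

```python
import pandas as pd

# Made-up passenger records, purely for illustration (not the real dataset).
passengers = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "age": [29, 41, 8, 35],
    "fare": [71.3, 7.9, 21.1, 8.1],
    "sex": ["female", "male", "female", "male"],
})

# Select a single column (this returns a pandas Series).
ages = passengers["age"]

# Filter rows with a boolean mask: adults who paid more than 8.
adults_paying_more = passengers[(passengers["age"] >= 18) & (passengers["fare"] > 8)]

# Select rows and columns together with .loc.
female_fares = passengers.loc[passengers["sex"] == "female", ["name", "fare"]]

print(adults_paying_more)
print(female_fares)
```

The same patterns apply unchanged to the much larger DataFrame loaded later on.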

For the sake of demonstration and ease, Seaborn offers various datasets. Provided you have an internet connection, you can query what sets are available like so:

sns.get_dataset_names()

Loading the Data

One of these datasets is about the Titanic. Loading this in, we can also get a snippet of the first few rows to get an idea of what the data looks like:

titanic = sns.load_dataset('titanic')
titanic.head()

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	M	22	1	0	7.2500	S	Third	Man	True	NaN	Southampton	N	False
1	1	1	F	38	1	0	71.2833	C	First	Women	False	C	Cherbourg	Y	False
2	1	3	F	26	0	0	7.9250	S	Third	Women	False	NaN	Southampton	Y	True
3	1	1	F	35	1	0	53.1000	S	First	Women	False	C	Southampton	Y	False
4	0	3	M	35	0	0	8.0500	S	Third	Man	True	NaN	Southampton	N	True

If you are interested in a further explanation of the column headings, or in getting a copy of this dataset, please check out this resource.

Simple data visualisation

If we have simple categorical data (sex, embarked, or even sibsp, the number of siblings/spouses), we can use Pandas methods to aggregate these and perform a count:

# Count values according to Siblings/Spouses
titanic['sibsp'].value_counts()

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: sibsp, dtype: int64
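Note that value_counts orders its result by frequency, which is why 4 appears before 3 above. As a small sketch on a made-up column (the values below are invented, standing in for titanic['sibsp']), the result can instead be sorted by the value itself, or turned into proportions:

```python
import pandas as pd

# Toy column standing in for titanic['sibsp']; the values are made up.
sibsp = pd.Series([0, 0, 1, 0, 2, 1, 0, 0], name="sibsp")

counts = sibsp.value_counts()                   # ordered by frequency
by_value = sibsp.value_counts().sort_index()    # ordered by the value itself
shares = sibsp.value_counts(normalize=True)     # proportions instead of counts

print(by_value)
print(shares.round(3))
```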

All this is fine if we are only looking to gather values. To really see the data it helps to graph it.

sns.countplot(x='sibsp', data=titanic)

This is a very simple example showing the count of each distinct item. It does not tell us anything that the counts above did not already.

We can also do things like scatter plots of bivariate data, e.g. age vs fare:

sns.scatterplot(x='age', y='fare', data=titanic)

And add a line of regression to see the correlation:

sns.lmplot(x='age', y='fare', data=titanic, line_kws={'color':'red'})
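The red line gives a visual impression of the trend; to put a number on it, one option is to compute the Pearson correlation coefficient and the slope of a least-squares line directly. The sketch below does this on synthetic data (the age and fare values are randomly generated, not the Titanic figures), and drops missing values first because the real age column contains gaps.

```python
import numpy as np
import pandas as pd

# Synthetic age/fare data with a known underlying slope of 0.5 (illustration only).
rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=200)
fare = 10 + 0.5 * age + rng.normal(0, 20, size=200)
df = pd.DataFrame({"age": age, "fare": fare})

# Drop rows with missing values; np.polyfit cannot handle NaN.
clean = df.dropna(subset=["age", "fare"])

# Pearson correlation coefficient, always between -1 and 1.
r = clean["age"].corr(clean["fare"])

# Slope and intercept of the ordinary least-squares line through the points.
slope, intercept = np.polyfit(clean["age"], clean["fare"], deg=1)

print(f"r = {r:.2f}, fare = {slope:.2f} * age + {intercept:.2f}")
```

The coefficient r describes how tightly the points follow the line, while the slope describes how steep that line is; a steep line through very scattered points can still have a small r.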

All of the graphs shown so far can be extended to show multi-dimensional data, for example by drawing one panel per sex:

sns.lmplot(x='age',
           y='fare',
           data=titanic,
           line_kws={'color':'red'},
           col='sex'
          )

So… how does this help you survive the Titanic?

Money?

My first thought: is there any correlation between those who paid more for their tickets and those who survived?

sns.boxplot(x='survived', y='fare', data=titanic)

Interpreting these box plots, it would appear that those who paid the most (500+) did survive, but this could have been coincidence. If we hide the outliers and zoom in on the bulk of the data:

plt.figure(figsize=(7,7))
sns.boxplot(x='survived', y='fare', data=titanic, showfliers=False)

With the outliers hidden, it is immediately a lot easier to see that survivors tended to have paid more. It would seem that those who paid more were better looked after.
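The numbers behind a box plot can also be computed directly, which is a useful check on what the eye sees. The sketch below uses a small made-up table (the fares are invented, not the real data) to compute the quartiles per group, along with the upper limit beyond which the default box plot settings draw a point as an outlier: 1.5 times the interquartile range above the third quartile.

```python
import pandas as pd

# Made-up fares for illustration only; not taken from the Titanic dataset.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "fare": [7.0, 8.0, 10.0, 26.0, 13.0, 26.0, 57.0, 512.0],
})

grouped = toy.groupby("survived")["fare"]

# The three values that define each box.
summary = pd.DataFrame({
    "q1": grouped.quantile(0.25),
    "median": grouped.median(),
    "q3": grouped.quantile(0.75),
})

# Points above q3 + 1.5 * IQR are the "fliers" that showfliers=False hides.
summary["iqr"] = summary["q3"] - summary["q1"]
summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]

print(summary)
```

Comparing the medians is usually a fairer test than comparing the extremes, since a single very expensive ticket has no effect on the median.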

Gender?

sns.countplot(x='survived',data=titanic, hue='sex')

A quick look at gender survival rates shows that being male is not in your favour. It should be noted that there are more males in the data set, but you can see by comparing the heights of the bars that only about 1/5 of the males survived, while about 3/4 of the females did.
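Reading fractions off bar heights is approximate, so it can help to compute the rates. Because survived is stored as 0 or 1, the mean of that column within a group is exactly the share of survivors. The helper below, survival_rate_by, is a hypothetical name introduced for this sketch, and the data is a made-up toy table rather than the real passenger list.

```python
import pandas as pd

def survival_rate_by(df, column):
    """Share of survivors within each group of `column`.

    Assumes df['survived'] holds 0 (died) or 1 (survived).
    """
    return df.groupby(column)["survived"].mean()

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "sex": ["male"] * 5 + ["female"] * 3,
    "survived": [0, 0, 0, 0, 1, 1, 1, 0],
})

rates = survival_rate_by(toy, "sex")
print(rates)
```

With the real data loaded, the equivalent call would be survival_rate_by(titanic, 'sex').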

Family?

So, we have established that money and gender are good indicators. But what about the children? Plots using Matplotlib are better for advanced visualisations such as these.

# Deaths and passenger totals for each parch value (parents/children aboard)
died = titanic[titanic['survived'] == 0]['parch'].value_counts().sort_index()
totals = titanic['parch'].value_counts().sort_index()
# Groups with no deaths are missing from died, so fill the resulting NaN with 0
dead_pct = (100 * died / totals).fillna(0)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.plot(dead_pct, color='red')
ax.set_xlabel('Number of parents/children in party')
ax.set_ylabel('Mortality %')
ax.set_title('Mortality % by family size on the Titanic')
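The fillna(0) step in that calculation is easy to overlook. If nobody in a group died, that group never appears in the deaths count, so dividing by the totals yields NaN rather than 0. A sketch on a made-up table (not the real data) shows the effect, and checks the result against a shorter groupby-based route:

```python
import pandas as pd

# Made-up rows: the single passenger with parch == 2 survived.
toy = pd.DataFrame({
    "parch": [0, 0, 0, 1, 1, 2],
    "survived": [0, 1, 0, 1, 0, 1],
})

died = toy[toy["survived"] == 0]["parch"].value_counts().sort_index()
totals = toy["parch"].value_counts().sort_index()

raw = 100 * died / totals      # parch == 2 is absent from died, so this is NaN
dead_pct = raw.fillna(0)       # no recorded deaths means 0% mortality

# Alternative route: survived is 0/1, so 1 - mean is the share who died.
alt_pct = 100 * (1 - toy.groupby("parch")["survived"].mean())

print(raw)
print(dead_pct)
```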

In situations where there were larger families (4+), there was a lower rate of survival: a good argument for smaller family units! Or maybe it is even better to travel alone?

Actually, very much the opposite. It would appear that those passengers travelling alone were considerably less likely to make it out than those with others.
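One way to check a comparison like this is a cross-tabulation, normalised so that each row sums to 1. The sketch below runs on a made-up table (the rows are invented for illustration) and assumes a boolean alone column like the one in the dataset; the travelling column is a label introduced here only to make the output easier to read.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "alone": [True, True, True, True, False, False, False, False],
    "survived": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Readable labels for the boolean column (hypothetical helper column).
toy["travelling"] = toy["alone"].map({True: "alone", False: "with others"})

# normalize="index" turns each row into proportions that sum to 1.
table = pd.crosstab(toy["travelling"], toy["survived"], normalize="index")

print(table)
```

Comparing rates rather than raw counts matters here, because the group travelling alone and the group travelling with others are not the same size.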

Conclusion

I think if I had to make a recommendation based on the data I’ve studied thus far, it’d be don’t buy a ticket for the Titanic. If you were going to do it anyway, take some family, maybe treat a couple of nieces or nephews, and consider paying that little bit extra for the upgrades. Life may be shorter than you think…

Of course, all of this is based off my interpretation of a couple of graphs. This is a great way to scratch the surface of what we can learn from this data, but to really get to grips with it, and begin to make accurate predictions, we can consider training our systems using a method known as machine learning. More on this next time.

Resources

Anaconda - www.anaconda.com
Pip - pypi.org/project/pip
Seaborn - seaborn.pydata.org
Matplotlib - matplotlib.org
Pandas - pandas.pydata.org
Numpy - numpy.org
Titanic Dataset - www.kaggle.com/c/titanic/data
Awesome Public Datasets - github.com/awesomedata/awesome-public-datasets

About the Author

James Heslip

APL Team Leader

James is an APL Programmer with a keen interest in mathematics. His love for computing was almost an accident. From a young age he always enjoyed using computers (playing video games and such), but it was never considered that anything more would come from it. James originally had plans to pursue a career in finance.
