A Tour of Machine Learning in Python

How to perform exploratory data analysis and build a machine learning pipeline.


In this tutorial I demonstrate key elements and design approaches that go into building a well-performing machine learning pipeline. The topics I'll cover include:

  1. Exploratory Data Analysis and Feature Engineering.
  2. Data Pre-Processing including cleaning and feature standardization.
  3. Dimensionality Reduction with Principal Component Analysis and Recursive Feature Elimination.
  4. Classifier Optimization via hyperparameter tuning and Validation Curves.
  5. Building a more powerful classifier through Ensemble Voting and Stacking.

Along the way we'll be using several important Python libraries, including scikit-learn and pandas, as well as seaborn for data visualization.

Our task is a binary classification problem inspired by Kaggle's "Getting Started" competition, Titanic: Machine Learning from Disaster. The goal is to accurately predict whether a passenger survived or perished during the Titanic's sinking, based on data such as passenger age, class, and sex. The training and test datasets are provided here.

I have chosen here to focus on the fundamentals that should be a part of every data scientist's toolkit. The topics covered should provide a solid foundation for launching into more advanced machine learning approaches, such as Deep Learning. For an intro to Deep Learning, see my notebook on building a Convolutional Neural Network with Google's TensorFlow API.


Contents

1) Exploratory Data Analysis

2) Data Pre-Processing

3) Feature Selection and Dimensionality Reduction

4) Model Optimization and Selection

Importing Python Libraries

In [1]:
# General Tools:
import math, os, sys  # standard python libraries
import numpy as np
import pandas as pd  # for dataframes
import itertools  # combinatorics toolkit
import time  # for obtaining computation execution times
from scipy import interp  # interpolation function

# Data Pre-Processing:
from sklearn.preprocessing import StandardScaler  # for standardizing data
from collections import Counter  # object class for counting element occurrences

# Machine Learning Classifiers:
from xgboost import XGBClassifier  # xgboost classifier (http://xgboost.readthedocs.io/en/latest/model.html)
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron  # linear classifiers
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier  # decision tree classifiers
from sklearn.svm import SVC  # support-vector machine classifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # LDA classifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, \
                             GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier  # Nearest-Neighbors classifier
        
# Feature and Model Selection:
from sklearn.model_selection import StratifiedKFold  # train/test splitting tool for cross-validation
from sklearn.model_selection import GridSearchCV  # hyperparameter optimization tool via exhaustive search
from sklearn.model_selection import cross_val_score  # automates cross-validated scoring
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, auc  # scoring metrics
from sklearn.feature_selection import RFE  # recursive feature elimination
from sklearn.model_selection import learning_curve  # learning-curve generation for bias-variance tradeoff
from sklearn.model_selection import validation_curve  # for fine-tuning hyperparameters
from sklearn.pipeline import Pipeline

# Plotting:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic

# Manage Warnings: 
import warnings
warnings.filterwarnings('ignore')

# Ensure Jupyter Notebook plots the figures in-line:
%matplotlib inline


1) Exploratory Data Analysis

1.1 - Getting Started

On April 15th 1912, the Titanic sank during her maiden voyage after colliding with an iceberg. Only 722 of the 2224 (32.5%) passengers and crew aboard survived. This loss of life was due in part to an insufficient number of lifeboats.

Though survival certainly involved an element of luck, some groups of people (e.g. women, children, the upper-class, etc.) may have been more likely to survive than others. Our goal is to use machine learning to predict which passengers survived the tragedy, based on factors such as gender, age, and social status.

For the Kaggle competition, we are to create a .csv submission file with two columns (PassengerId, Survived), providing a binary prediction for each passenger, where 1 = survived and 0 = deceased; a minimal illustration of this format follows. The competition instructions and data are found here.
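To make the required layout concrete, here is a minimal placeholder sketch. It assumes test_Ids has been defined (as in the data-import cell of Section 1.1-a) and simply predicts 'perished' for everyone, so it only illustrates the file format, not a real prediction:

# Illustrative placeholder submission (not a real prediction):
# assumes test_Ids = df_test['PassengerId'] as defined in Section 1.1-a
submission = pd.DataFrame({'PassengerId': test_Ids, 'Survived': 0})
submission.to_csv('placeholder_submission.csv', index=False)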

Data Dictionary

A few notes below about the meaning of the features in the raw dataset:

  • Survival: 0 = False (Deceased), 1 = True (Survived).
  • Pclass: Passenger ticket class; 1 = 1st (upper class), 2 = 2nd (middle class), 3 = 3rd (lower class).
  • SibSp: Passenger's total number of siblings (including step-siblings) and spouses (legal) aboard the Titanic.
  • Parch: Passenger's total number of parents or children (including stepchildren) aboard the Titanic.
  • Embarked: Port of Embarkation, where C = Cherbourg, Q = Queenstown, S = Southampton.
  • Age: Ages under 1 are given as fractions; if the age is estimated, it is in the form of xx.5.

a) Importing the Data

We will use pandas DataFrame structures for storing our data. As we explore our data and define new features, it will be useful to combine the training and test data into a single dataset.

In [2]:
df_train = pd.read_csv('./titanic-data/train.csv')
df_test = pd.read_csv('./titanic-data/test.csv')
dataset = pd.concat([df_train, df_test])  # combined dataset
test_Ids = df_test['PassengerId']  # PassengerId values for the test set (these rows have Survived=NaN in the combined dataset)

df_train.head(5)
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

b) Data Completeness

Let us check upfront how complete our dataset is. Here we count the missing entries for each feature:

In [3]:
print('Training Set Dataframe Shape: ', df_train.shape)
print('Test Set Dataframe Shape: ', df_test.shape)
print('\nTotal number of entries in our dataset: ', dataset.shape[0])

print('\nNumber of missing entries in total dataset:')
print(dataset.isnull().sum())
Training Set Dataframe Shape:  (891, 12)
Test Set Dataframe Shape:  (418, 11)

Total number of entries in our dataset:  1309

Number of missing entries in total dataset:
Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

Findings:

  • Cabin data is missing for more than 75% of all passengers. This feature is too incomplete to include in our predictive models. However, we will still explore whether it offers any useful insights.
  • We are missing about 20% of all age entries. We can attempt to impute or infer these missing entries based on average values or correlations with other features.
  • We are missing two entries for Embarked, and one for Fare. We will impute these later (a minimal sketch of what such imputation could look like follows this list).
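As a first look at what such imputation could involve, here is a minimal sketch. It works on a copy of the combined dataset (the actual imputation is deferred to the Data Pre-Processing section), and the grouping choices (Sex/Pclass medians for Age, the mode for Embarked, Pclass medians for Fare) are illustrative assumptions rather than the final strategy:

# Minimal imputation sketch on a copy of the data; see Section 2 for the real treatment
df = dataset.copy()

# Age: fill with the median age of passengers sharing the same Sex and Pclass
df['Age'] = df['Age'].fillna(df.groupby(['Sex', 'Pclass'])['Age'].transform('median'))

# Embarked: only two values are missing, so use the most frequent port
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Fare: a single missing value; use the median fare within that passenger's Pclass
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('median'))

print(df[['Age', 'Embarked', 'Fare']].isnull().sum())  # should now all be zero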

c) Thinking Up Front / Doing Your Homework!

Gaining a bit more understanding of the problem context can provide several clues about the importance of variables, and help us make more sense of some of the relations that will be uncovered during our exploratory data analysis. So it's worth spending some time learning about what happened the night the Titanic sank (we can think of this as gaining some domain expertise). There are countless books, documentaries, and webpages dedicated to this subject. Here we highlight a few interesting facts:

  • The Titanic's officers issued a "women and children first" order for evacuating passengers via lifeboats. However, there was no organized evacuation plan in place.
  • There was in fact no general "abandon ship" order given by the captain, no public address system, and no lifeboat drill. Many passengers did not realize they were in any imminent danger, and some had to be goaded out of their cabins by members of the crew or staff.
  • Lifeboats were segregated into different class sections, and there were more 1st-class passenger lifeboats than for the other two classes.
  • We know it was more difficult for 3rd class passengers to access lifeboats, because the 3rd-class passenger sections were gated off from the 1st and 2nd-class areas of the ship. This was actually due to US immigration laws, which required immigrants (primarily 3rd-class) to be segregated and processed separately from other passengers upon arrival to the US. As a consequence, 3rd-class passengers had to navigate through a maze of staircases and corridors to reach the lifeboat deck.

Given these facts, we can already surmise that Sex, Pclass, and Age are likely to be the most important features. We will see what trends our Exploratory Data Analysis reveals.

Concerning the Fare feature: Fare is given in Pounds Sterling. There were really no 'standard' fares - many factors influenced the price charged, including the cabin size and number of people sharing a cabin, whether it was on the perimeter of the ship (i.e. with a porthole) or further inside, the quality of the furnishings and provisions, etc. Children travelled at reduced rates, as did servants with 1st-class tickets. There also seem to have been some family discount rates, but we lack detailed information on how these were calculated. However, our research does tell us that:

  • Ticket price (Fare) was cumulative, and included the cost for all passengers sharing that ticket.

d) A Quick Glance at the Sorted CSV File

Several useful observations can be made by quickly glancing at the CSV file containing the combined training and test data, and sorting some of the entries. Don't underestimate the usefulness of this rather rudimentary step! (The same sorting can also be reproduced in pandas; see the sketch after the findings below.)

Findings:

  • Sort by Passenger Name: Passengers with matching surnames tend to also have matching entries for several other features: Pclass, Ticket, Fare, and Embarked (in addition to Cabin when available). This tells us we can use matching Ticket and Fare information as a basis for grouping families. If we sort by ticket, we can use surnames to distinguish between 'family' groups and non-related co-travellers.
  • Sort by Age: All entries with the title 'Master' in the name correspond to males under the age of 15. This can be useful in helping us impute missing age data.
  • Sort by Cabin: We find that the cabin number is available for most passengers with Pclass=1, but is generally missing for passengers of Pclass=2 or 3.
  • Sort by Ticket: (I) Tickets not containing letters: four-digit tickets correspond to Pclass 2 or 3; the vast majority of five-digit tickets correspond to Pclass 1 or 2; for six- and seven-digit tickets, the leading digit matches the Pclass. (II) Tickets containing letters: tickets beginning with A/4 correspond to passengers with Embarked=S and Pclass=3; tickets beginning with C.A. or CA also correspond to Embarked=S, with Pclass of 2 or 3; all tickets beginning with PC correspond to Pclass=1. These patterns might be useful for helping us spot inconsistencies in the data.
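The same glance-and-sort checks can be reproduced directly in pandas rather than in a spreadsheet; a brief sketch (the column selection here is our own choice):

# Reproduce the 'sorted CSV' glance in pandas
cols = ['Name', 'Age', 'Pclass', 'Ticket', 'Fare', 'Cabin', 'Embarked']
print(dataset.sort_values('Name')[cols].head(10))    # matching surnames share Ticket/Fare
print(dataset.sort_values('Age')[cols].head(10))     # 'Master' titles are all young boys
print(dataset.sort_values('Ticket')[cols].head(10))  # shared tickets group co-travellers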

1.2 - Feature Engineering

As we explore our data, we will likely think of new features that may help us understand or predict Survival. The definition of new features from the initial feature set is a dynamic process that typically occurs in the midst of feature exploration. However, for organizational purposes, as we add new features we will return here to group their definitions upfront.

a) FamilySize, Surname, Title, and IsChild

  • It makes sense to sum Parch and SibSp together to create a new feature FamilySize.
  • When identifying families, it will also be useful to compare surnames directly, so we split Name into Surname and Title. Our quick scan of the CSV file showed that all male children 15 and under have the title 'Master', hence Title may be useful for helping us estimate missing Age values.
  • We also create a new variable, IsChild, to denote passengers aged 15 and under.
In [4]:
# Create a new column as a sum of listed relatives
dataset['FamilySize'] = dataset['Parch'] + dataset['SibSp'] + 1  # plus one to include the passenger

# Clean and sub-divide the name data into two new columns using Python's str.split() function. 
# A look at the CSV contents shows that we should first split the string at ', ' to isolate 
# the surname, and then split again at '. ' to isolate the title.
dataset['Surname'] = dataset['Name'].str.split(', ', expand=True)[0]
dataset['Title'] =  dataset['Name'].str.split(', ', expand=True)[1].str.split('. ', expand=True)[0]

# Create a new feature identifying children (15 or younger)
dataset['IsChild'] = np.where(dataset['Age'] < 16, 1, 0)

# We can save this for handling or viewing with external software
# dataset.to_csv('./titanic-data/combined_newvars_v1.csv')

# Now let's print part of the dataframe to check our new variable definitions...
dataset[['Name', 'Surname', 'Title', 'SibSp', 'Parch', 'FamilySize', 'Age', 'IsChild']].head(10) 
Out[4]:
Name Surname Title SibSp Parch FamilySize Age IsChild
0 Braund, Mr. Owen Harris Braund Mr 1 0 2 22.0 0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... Cumings Mrs 1 0 2 38.0 0
2 Heikkinen, Miss. Laina Heikkinen Miss 0 0 1 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Futrelle Mrs 1 0 2 35.0 0
4 Allen, Mr. William Henry Allen Mr 0 0 1 35.0 0
5 Moran, Mr. James Moran Mr 0 0 1 NaN 0
6 McCarthy, Mr. Timothy J McCarthy Mr 0 0 1 54.0 0
7 Palsson, Master. Gosta Leonard Palsson Master 3 1 5 2.0 1
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson Mrs 0 2 3 27.0 0
9 Nasser, Mrs. Nicholas (Adele Achem) Nasser Mrs 1 0 2 14.0 1

b) Grouping Families and Travellers

Sorting the data by Ticket, one finds that multiple passengers often share the same ticket number. This can be used as a basis for grouping passengers who travelled together. It will also be useful to distinguish whether these passenger groups are immediate (1st-degree) families, entirely unrelated or non-immediate companions (e.g. friends, cousins), or a mix. We will also identify passengers who are travelling alone. We define:

  • GroupID: an integer label uniquely identifying each group; a surrogate for Ticket.
  • GroupSize: the total number of passengers sharing a ticket.
  • GroupType: categorization of the group into 'Family', 'NonFamily', 'Mixed', or 'IsAlone'.
  • GroupNumSurvived: the number of members of that group known to have survived.
  • GroupNumPerished: the number of members of that group known to have perished.
In [5]:
# Create mappings for assigning GroupID, GroupType, GroupSize, GroupNumSurvived, 
# and GroupNumPerished 
group_id = 1 
ticket_to_group_id = {} 
ticket_to_group_type = {}  
ticket_to_group_size = {}  
ticket_to_group_num_survived = {}
ticket_to_group_num_perished = {}
for (ticket, group) in dataset.groupby('Ticket'):
    
    # Categorize the group type (Family, NonFamily, Mixed, IsAlone)
    num_names = len(set(group['Surname'].values))  # number of unique names in this group
    group_size = len(group['Surname'].values)  # total size of this group
    if group_size > 1:
        if num_names == 1:
            ticket_to_group_type[ticket] = 'Family'
        elif num_names == group_size:
            ticket_to_group_type[ticket] = 'NonFamily'
        else:
            ticket_to_group_type[ticket] = 'Mixed'
    else:
        ticket_to_group_type[ticket] = 'IsAlone'
            
    # Assign the group size, group identifier, and survived/perished counts
    ticket_to_group_size[ticket] = group_size
    ticket_to_group_id[ticket] = group_id
    ticket_to_group_num_survived[ticket] = group[group['Survived'] == 1]['Survived'].count()
    ticket_to_group_num_perished[ticket] = group[group['Survived'] == 0]['Survived'].count()
    group_id += 1
    
# Apply the mappings we've just defined to create the GroupID and GroupType variables
dataset['GroupID'] = dataset['Ticket'].map(ticket_to_group_id)
dataset['GroupSize'] = dataset['Ticket'].map(ticket_to_group_size)    
dataset['GroupType'] = dataset['Ticket'].map(ticket_to_group_type)  
dataset['GroupNumSurvived'] = dataset['Ticket'].map(ticket_to_group_num_survived)
dataset['GroupNumPerished'] = dataset['Ticket'].map(ticket_to_group_num_perished)

# Let's print the first 4 group entries to check that our grouping was successful
counter = 1
break_point = 4
feature_list = ['Surname', 'FamilySize','Ticket','GroupID','GroupType', 'GroupSize']
print('Printing Sample Data Entries to Verify Grouping:\n')
for (ticket, group) in dataset.groupby('Ticket'):
    print('\n', group[feature_list])
    if counter == break_point:
        break
    counter += 1

# Let's also check that GroupNumSurvived and GroupNumPerished were created accurately
feature_list = ['GroupID', 'GroupSize', 'Survived','GroupNumSurvived', 'GroupNumPerished']
dataset[feature_list].sort_values(by=['GroupID']).head(15)
    
Printing Sample Data Entries to Verify Grouping:


     Surname  FamilySize  Ticket  GroupID  GroupType  GroupSize
257  Cherry           1  110152        1  NonFamily          3
504  Maioni           1  110152        1  NonFamily          3
759  Rothes           1  110152        1  NonFamily          3

      Surname  FamilySize  Ticket  GroupID GroupType  GroupSize
262  Taussig           3  110413        2    Family          3
558  Taussig           3  110413        2    Family          3
585  Taussig           3  110413        2    Family          3

       Surname  FamilySize  Ticket  GroupID  GroupType  GroupSize
110    Porter           1  110465        3  NonFamily          2
475  Clifford           1  110465        3  NonFamily          2

      Surname  FamilySize  Ticket  GroupID GroupType  GroupSize
335  Maguire           1  110469        4   IsAlone          1
Out[5]:
GroupID GroupSize Survived GroupNumSurvived GroupNumPerished
504 1 3 1.0 3 0
257 1 3 1.0 3 0
759 1 3 1.0 3 0
585 2 3 1.0 2 1
262 2 3 0.0 2 1
558 2 3 1.0 2 1
110 3 2 0.0 0 2
475 3 2 0.0 0 2
335 4 1 NaN 0 0
158 5 1 NaN 0 0
430 6 1 1.0 1 0
366 7 2 1.0 1 0
236 7 2 NaN 1 0
191 8 1 NaN 0 0
170 9 1 0.0 0 1

Checking For Inconsistencies

In [6]:
# Check for cases where FamilySize = 1 but GroupType = Family
data_reduced = dataset[dataset['FamilySize'] == 1]
data_reduced = data_reduced[data_reduced['GroupType'] == 'Family']

# nri = 'NumRelatives inconsistency'
nri_passenger_ids = data_reduced['PassengerId'].values
nri_unique_surnames = set(data_reduced['Surname'].values)

# How many occurrences?
print('Number of nri Passengers: ', len(nri_passenger_ids))
print('Number of Unique nri Surnames: ',len(nri_unique_surnames))

# We will find that there are only 7 occurences, so let's go ahead and view them here:
data_reduced = data_reduced.sort_values('Name')
data_reduced[['Name', 'Ticket', 'Fare','Pclass', 'Parch', 
              'SibSp', 'GroupID', 'GroupSize','GroupType']].head(int(len(nri_passenger_ids)))
Number of nri Passengers:  7
Number of Unique nri Surnames:  4
Out[6]:
Name Ticket Fare Pclass Parch SibSp GroupID GroupSize GroupType
83 Carrau, Mr. Francisco M 113059 47.10 1 0 0 36 2 Family
403 Carrau, Mr. Jose Pedro 113059 47.10 1 0 0 36 2 Family
538 Risien, Mr. Samuel Beard 364498 14.50 3 0 0 588 2 Family
382 Risien, Mrs. Samuel (Emma) 364498 14.50 3 0 0 588 2 Family
362 Ware, Mrs. John James (Florence Louise Long) CA 31352 21.00 2 0 0 777 2 Family
120 Watt, Miss. Bertha J C.A. 33595 15.75 2 0 0 765 2 Family
161 Watt, Mrs. James (Elizabeth "Bessie" Inglis Mi... C.A. 33595 15.75 2 0 0 765 2 Family
  • With the exception of Mrs. John James Ware, we see that each of these passengers is paired with another having the same surname; we can presume that these are 2nd-degree relations (such as cousins), which is why each still has FamilySize=1 (a count of immediate family only).
In [7]:
#Check for cases where FamilySize > 1 but GroupType = NonFamily
data_reduced = dataset[dataset['FamilySize'] > 1]
data_reduced = data_reduced[data_reduced['GroupType'] == 'NonFamily']

# ngwr = 'not grouped with relatives'
ngwr_passenger_ids = data_reduced['PassengerId'].values
ngwr_unique_surnames = set(data_reduced['Surname'].values)

# How many occurences?
print('Number of ngwr Passengers: ', len(ngwr_passenger_ids))
print('Number of Unique ngwr Surnames: ',len(ngwr_unique_surnames))

feature_list = ['PassengerId', 'Name', 'Ticket', 'Fare','Pclass', 'Parch', 
                'SibSp', 'GroupID', 'GroupSize','GroupType']
data_reduced[feature_list].sort_values('GroupID').head(int(len(ngwr_unique_surnames)))
Number of ngwr Passengers:  17
Number of Unique ngwr Surnames:  17
Out[7]:
PassengerId Name Ticket Fare Pclass Parch SibSp GroupID GroupSize GroupType
166 167 Chibnall, Mrs. (Edith Martha Bowerman) 113505 55.0000 1 1 0 39 2 NonFamily
356 357 Bowerman, Miss. Elsie Edith 113505 55.0000 1 1 0 39 2 NonFamily
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 11767 83.1583 1 1 0 76 3 NonFamily
150 1042 Earnshaw, Mrs. Boulton (Olive Potter) 11767 83.1583 1 1 0 76 3 NonFamily
571 572 Appleton, Mrs. Edward Dale (Charlotte Lamson) 11769 51.4792 1 0 2 77 2 NonFamily
356 1248 Brown, Mrs. John Murray (Caroline Lane Lamson) 11769 51.4792 1 0 2 77 2 NonFamily
34 926 Mock, Mr. Philipp Edmund 13236 57.7500 1 0 1 92 2 NonFamily
122 1014 Schabert, Mrs. Paul (Emma Mock) 13236 57.7500 1 0 1 92 2 NonFamily
275 276 Andrews, Miss. Kornelia Theodosia 13502 77.9583 1 0 1 93 3 NonFamily
765 766 Hogeboom, Mrs. John C (Anna Andrews) 13502 77.9583 1 0 1 93 3 NonFamily
259 260 Parrish, Mrs. (Lutie Davis) 230433 26.0000 2 1 0 148 2 NonFamily
880 881 Shelley, Mrs. William (Imanita Parrish Hall) 230433 26.0000 2 1 0 148 2 NonFamily
779 780 Robert, Mrs. Edward Scott (Elisabeth Walton Mc... 24160 211.3375 1 1 0 188 4 NonFamily
689 690 Madill, Miss. Georgette Alexandra 24160 211.3375 1 1 0 188 4 NonFamily
591 592 Stephenson, Mrs. Walter Bertram (Martha Eustis) 36947 78.2667 1 0 1 628 2 NonFamily
496 497 Eustis, Miss. Elizabeth Mussey 36947 78.2667 1 0 1 628 2 NonFamily
599 600 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") PC 17485 56.9292 1 0 1 799 2 NonFamily

Looking at matching group IDs, some of these inconsistencies may be due to passenger substitutions. However, we first need to better understand the significance of the names in parentheses.

Consider GroupID=628: we have "Miss Elizabeth Eustis" and "Mrs. Walter Stephenson (Martha Eustis)". A quick check of genealogy databases online confirms that a Mrs. Walter Bertram Stephenson did indeed board the Titanic; Martha Eustis is her maiden name, while Mrs. Walter B. Stephenson gives her title in terms of her husband's name (an old-fashioned practice). Another example is GroupID=77, where we have "Brown, Mrs. John Murray (Caroline Lane Lamson)" and "Appleton, Mrs. Edward Dale (Charlotte Lamson)", another case of two related passengers whose names are given in terms of their husbands'.

Since this hunt for inconsistencies turned up only 17 entries, we can manually correct the Group Type in cases (such as these two examples) where it is obvious the passengers are indeed family.

In [8]:
# manually correcting some mislabeled group types
# note: if group size is greater than the number of listed names above, we assign to Mixed
passenger_ids_toFamily = [167, 357, 572, 1248, 926, 1014, 260, 881, 592, 497]
passenger_ids_toMixed = [880, 1042, 276, 766]

dataset.loc[dataset['PassengerId'].isin(passenger_ids_toFamily), 'GroupType'] = 'Family'
dataset.loc[dataset['PassengerId'].isin(passenger_ids_toMixed), 'GroupType'] = 'Mixed'

## for verification:
# feature_list = ['PassengerId', 'Name', 'GroupID', 'GroupSize','GroupType']
# dataset[feature_list][dataset['PassengerId'].isin(
#         passenger_ids_toFamily)].sort_values('GroupID').head(len(passenger_ids_toFamily))

LargeGroup Feature:

Lastly, we'll define a new feature, called LargeGroup, which equals 1 for GroupSize of 5 and up, and is 0 otherwise. For an explanation of what motivated this new feature, see our "Summary of Key Findings".

In [9]:
dataset['LargeGroup'] = np.where(dataset['GroupSize'] > 4, 1, 0)

c) Creating Bins for Age

Here we discretize Age into a set of custom bins, motivated by the trends explored in Section 1.3-b. During feature selection, we will assess whether this is advantageous over the continuous-variable representation.

In [10]:
# creation of Age bins; see Section 1.3-b
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
dataset['AgeBin'] = pd.cut(dataset['Age'], bins=bin_thresholds, labels=bin_labels)

d) Logarithmic and 'Split' (Effective) Fare

Our research found that ticket price was cumulative based on the number of passengers sharing that ticket. We therefore define a new fare variable, 'SplitFare', that subdivides the ticket price based on the number of passengers sharing that ticket. We also create 'log10Fare' and 'log10SplitFare' to map these to a base-ten logarithmic scale.

In [11]:
# split the fare based on GroupSize; express as fare-per-passenger on a shared ticket
dataset['SplitFare'] = dataset.apply(lambda row: row['Fare']/row['GroupSize'], axis=1)

# Verify new feature definition
features_list = ['GroupSize', 'Fare', 'SplitFare']
dataset[features_list].head()

# Map to log10 scale
dataset['log10Fare'] = np.log10(dataset['Fare'].values + 1)
dataset['log10SplitFare'] = np.log10(dataset['SplitFare'].values + 1)

1.3 - Univariate Feature Exploration

a) Data Spread

We can use pandas' built-in describe() method to get a quick first impression of how our numerical features are distributed:

In [12]:
dataset.describe()
Out[12]:
Age Fare Parch PassengerId Pclass SibSp Survived FamilySize IsChild GroupID GroupSize GroupNumSurvived GroupNumPerished LargeGroup SplitFare log10Fare log10SplitFare
count 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 1309.000000 891.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 1308.000000 1308.000000 1308.000000
mean 29.881138 33.295479 0.385027 655.000000 2.294882 0.498854 0.383838 1.883881 0.087853 464.625668 2.101604 0.592819 0.851031 0.092437 14.757627 1.293942 1.089904
std 14.413493 51.758668 0.865560 378.020061 0.837836 1.041658 0.486592 1.583639 0.283190 278.069490 1.779832 0.922026 1.299833 0.289753 13.555638 0.420687 0.294065
min 0.170000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 21.000000 7.895800 0.000000 328.000000 2.000000 0.000000 0.000000 1.000000 0.000000 213.000000 1.000000 0.000000 0.000000 0.000000 7.550000 0.949185 0.931966
50% 28.000000 14.454200 0.000000 655.000000 3.000000 0.000000 0.000000 1.000000 0.000000 460.000000 1.000000 0.000000 1.000000 0.000000 8.050000 1.189047 0.956649
75% 39.000000 31.275000 0.000000 982.000000 3.000000 1.000000 1.000000 2.000000 0.000000 728.000000 3.000000 1.000000 1.000000 0.000000 15.008325 1.508866 1.204346
max 80.000000 512.329200 9.000000 1309.000000 3.000000 8.000000 1.000000 11.000000 1.000000 929.000000 11.000000 5.000000 7.000000 1.000000 128.082300 2.710396 2.110867

Findings:

  • Most of our passengers travelled without any relatives onboard: fewer than half had FamilySize > 1.
  • Fewer than 9% of our passengers were children.
  • While most fares were under 15.00, there are passengers on board whose fare (e.g. 512.00) is more than five standard deviations above this, implying a significant spread in passenger wealth. We will need to examine the Fare feature for outliers.
  • Only 38.4% of the passengers in our training set survived. This gives us a baseline: if we simply predicted Survival=0 for all passengers, we could expect to achieve roughly 62% accuracy (see the quick check below). Obviously we will aim to do much better than this using our machine learning models.
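The quick check referenced above:

# Majority-class baseline: predict 'perished' (Survived=0) for every passenger
survival_rate = df_train['Survived'].mean()
print('Training-set survival rate: %.3f' % survival_rate)             # ~0.384
print('All-perished baseline accuracy: %.3f' % (1 - survival_rate))   # ~0.616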

b) Univariate Plots

Let's begin by looking at how mean survival depends on individual feature values:

In [13]:
def barplots(dataframe, features, cols=2, width=10, height=10, hspace=0.5, wspace=0.25):
    # define style and layout
    sns.set(font_scale=1.5)
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataframe.shape[1]) / cols)
    # define subplots
    for i, column in enumerate(dataframe[features].columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        sns.barplot(column,'Survived', data=dataframe)
        plt.xticks(rotation=0)
        plt.xlabel(column, weight='bold')

    
feature_list = ['Sex','Pclass', 'Embarked', 'SibSp', 'Parch', 'FamilySize']        
barplots(dataset, features=feature_list, cols=3, width=15, height=40, hspace=0.35, wspace=0.4)

We'll also consider the statistics associated with GroupType and GroupSize, which we defined in Section 1.2 (Feature Engineering) while grouping families and co-travellers with shared tickets:

In [14]:
feature_list = ['GroupType','GroupSize']        
barplots(dataset, features=feature_list, cols=2, width=15, height=75, hspace=0.3, wspace=0.3)

Note that the black 'error bars' on our plots represent 95% confidence intervals. For practical purposes, when comparing survival versus feature values, these bars can be thought of as statistical uncertainties given our limited sample size and the spread in the data.

Findings:

  • Sex and Pclass both show a strong statistically significant influence on survival.
  • FamilySize of 2-4 is more advantageous than larger families or passengers without family. Survival drops sharply at FamilySize=5 and beyond.
  • Embarked shows no clear trend; we will later investigate this feature in more detail.
  • For GroupType, we can clearly see that lone passengers have a lower survival probability compared to other groups (note, this also mirrors what we see for FamilySize=1 and GroupSize=1). Between the other three categories of Family, NonFamily, and Mixed, the wide confidence bounds on the latter two make it difficult to assert whether any of these three have a statistically significant advantage relative to each other.
  • For GroupSize, we see a trend similar to the one we observed for FamilySize, where survival increases up to GroupSize=4, and then drops off sharply for group sizes of 5 and above. However, compared to FamilySize, the confidence bounds for this variable are tighter, and the relation between survival and GroupSize up to 4 appears more linear, suggesting that GroupSize may be a better variable for model training than FamilySize.

Given the FamilySize feature, it is not clear whether SibSp and Parch are now gratuitous, or whether they can still offer some valuable insight. This will need further investigation.

Now let's examine Age and Fare:

In [15]:
def histograms(dataframe, features, force_bins=False, cols=2, width=10, height=10, hspace=0.2, wspace=0.25):
    # define style and layout
    sns.set(font_scale=1.5)
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataframe.shape[1]) / cols)
    # define subplots
    for i, column in enumerate(dataframe[features].columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        df_survived = dataframe[dataframe['Survived'] == 1]
        df_perished = dataframe[dataframe['Survived'] == 0]
        if force_bins is False:
            sns.distplot(df_survived[column].dropna().values, kde=False, color='blue')
            sns.distplot(df_perished[column].dropna().values, kde=False, color='red')
        else:
            sns.distplot(df_survived[column].dropna().values, bins=force_bins[i], kde=False, color='blue')
            sns.distplot(df_perished[column].dropna().values, bins=force_bins[i], kde=False, color='red')
        plt.xticks(rotation=25)
        plt.xlabel(column, weight='bold')    


feature_list = ['Age', 'Fare']    
bins = [range(0, 81, 1), range(0, 300, 2)]
histograms(dataset, features=feature_list, force_bins=bins, cols=2, width=15, height=70, hspace=0.3, wspace=0.2)

Blue denotes survivors; red denotes those who perished.

Findings:

  • Age: Survival improves for children. Age-dependent differences in survival are perhaps easier to see in a kernel density estimate (KDE) plot (see below). Elderly passengers, around age 60 and up, tended to perish.
  • Fare: Unsurprisingly, passengers with the lowest fares perished in greater numbers. Fare values are indeed widely spread; we see many passengers with fares exceeding 50.00 and 100.00. It may be sensible to convert Fare to a logarithmic scale to explore its relation to survival.

Below we generate KDE plots and convert Fare to a base-10 log scale. We will also examine the 'SplitFare' variable, which divides the total ticket Fare price among the number of passengers sharing that ticket:

In [16]:
def univariate_kdeplots(dataframe, plot_features, cols=2, width=10, height=10, hspace=0.2, wspace=0.25):
    # define style and layout
    sns.set(font_scale=1.5)
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataframe.shape[1]) / cols)
    # define subplots
    for i, feature in enumerate(plot_features):
        ax = fig.add_subplot(rows, cols, i + 1)
        g = sns.kdeplot(dataframe[plot_features[i]][(dataframe['Survived'] == 0)].dropna(), shade=True, color="red")
        g = sns.kdeplot(dataframe[plot_features[i]][(dataframe['Survived'] == 1)].dropna(), shade=True, color="blue")
        g.set(xlim=(0 , dataframe[plot_features[i]].max()))
        g.legend(['Perished', 'Survived'])
        plt.xticks(rotation=25)
        ax.set_xlabel(plot_features[i], weight='bold')

feature_list = ['Age', 'log10Fare', 'log10SplitFare']
univariate_kdeplots(dataset, feature_list, cols=1, width=15, height=100, hspace=0.4, wspace=0.25)

We can more clearly see:

  • Children have a survival advantage, particularly those 13 and under.
  • Elderly passengers ~60 and up are most likely to perish.
  • Survival is poorest for Fares of 10.0 or less ($\lt$ 1.0 on this log scale).
  • The Fare KDE plot shows 2 clear points of concavity near log(Fare)=0.9 and log(Fare)=1.4, and possibly a third near log(Fare)=1.8. For SplitFare, the variance of the peaks appears smaller, and there seem to be 3 main peaks, possibly corresponding to the means for Pclasses 3, 2, and 1. We should investigate how Fare and SplitFare are distributed across the different Pclasses.
  • We also note that there appear to be several passengers whose Fare is listed as 0.0, virtually all of whom perished (a quick check follows this list).
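For completeness, a quick look at those zero-fare entries could be done as follows:

# Passengers listed with Fare == 0.0
zero_fare = dataset[dataset['Fare'] == 0]
print('Number of zero-fare passengers: ', zero_fare.shape[0])
print(zero_fare[['Name', 'Pclass', 'Embarked', 'Survived']])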

Let's now consider discretized Age Bins and see if this makes any of the Age-related trends clearer. First we'll group age into bins of 5 years:

In [17]:
dataset['AgeBin_v1'] = pd.cut(dataset['Age'], bins=range(0, 90, 5))
sns.set(rc={'figure.figsize':(14,8.27)})
sns.set(font_scale=1.0)
plt.style.use('seaborn-whitegrid')
g = sns.barplot('AgeBin_v1','Survived', data=dataset)

table = pd.crosstab(dataset['AgeBin_v1'], dataset['Survived'])
print('\n', table)
 Survived   0.0  1.0
AgeBin_v1          
(0, 5]      13   31
(5, 10]     13    7
(10, 15]     8   11
(15, 20]    63   33
(20, 25]    80   42
(25, 30]    66   42
(30, 35]    47   41
(35, 40]    39   28
(40, 45]    30   17
(45, 50]    23   16
(50, 55]    14   10
(55, 60]    11    7
(60, 65]    10    4
(65, 70]     3    0
(70, 75]     4    0
(75, 80]     0    1
(80, 85]     0    0

Findings (complementary to KDE plot):

  • Survival tends to be highest in the 0-15 age group.
  • There is a slight increase in survival between 30-40 relative to adjacent bins.
  • Survival generally decreases after age 60; however, we have fewer samples upon which to base our statistics.
  • However, due to the large variances on all bars, these findings need to be taken with some caution.

Let's use these trends to custom-define a new set of bins with coarser granularity (see Section 1.2-c for the 'AgeBin' variable definition):

In [18]:
sns.set(font_scale=1.5)
plt.style.use('seaborn-whitegrid')

g = sns.barplot('AgeBin','Survived', data=dataset)

This makes it even clearer that being a child (0-15) is a statistically significant predictor of higher survival. Let's focus on our 'IsChild' variable, which equals 1 for Ages 15 and under, otherwise 0.

In [19]:
barplots(dataset, features=['IsChild'], cols=1, width=5, height=100, hspace=0.3, wspace=0.2)

Findings:

  • Children do indeed have significantly higher survival than non-children, as we might expect from the "women and children first" evacuation order. However, their survival probability (~60%) is still quite a bit lower than that of Sex='female' (~74%) and Pclass=1 (~63%); the quick computation below gives the training-set means. There are clearly other features that play an important role in the likelihood of a child surviving, which we'll explore in Section 1.4.
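The training-set means quoted above can be computed directly:

# Mean survival for children, females, and 1st-class passengers (training rows only)
train = dataset[dataset['Survived'].notnull()]
print('IsChild = 1  : %.3f' % train[train['IsChild'] == 1]['Survived'].mean())
print('Sex = female : %.3f' % train[train['Sex'] == 'female']['Survived'].mean())
print('Pclass = 1   : %.3f' % train[train['Pclass'] == 1]['Survived'].mean())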

What about discretizing log10 Fare into bins? We do this below.

In [20]:
sns.set(rc={'figure.figsize':(14,8.27)})
sns.set(font_scale=1.5)
plt.style.use('seaborn-whitegrid')
dataset['FareBin'] = pd.cut(dataset['log10Fare'], bins=5)
g = sns.barplot('FareBin','Survived', data=dataset)

Findings:

  • There is an approximately linear increase in survival probability with increasing FareBin. Could this in fact provide better predictive granularity than Pclass? We can test this idea later during Feature Selection.

Does this trend hold true if we look at the log10 SplitFare variable?

In [21]:
sns.set(font_scale=1.0)
plt.style.use('seaborn-whitegrid')
dataset['SplitFareBin'] = pd.cut(dataset['log10SplitFare'], bins=5)
g = sns.barplot('SplitFareBin','Survived', data=dataset)

Unfortunately the nice linear trend doesn't quite hold up. Bins 2-3 are nearly identical; similarly for bins 4-5. This looks more like a replication of the Pclass dependencies. What if we plot again with more bins?

In [22]:
sns.set(font_scale=1.0)
plt.style.use('seaborn-whitegrid')
dataset['SplitFareBin'] = pd.cut(dataset['log10SplitFare'], bins=10)
g = sns.barplot('SplitFareBin','Survived', data=dataset)

The linearity is somewhat better but not great, and we also have relatively low confidences for our highest two bins.

c) The Role of Cabin

Can information about passenger Cabin be useful for predicting survival? Initially, one might think that passengers in cabins located more deeply within the Titanic are more likely to perish. The cabins listed in our passenger data are prefixed with the letters A through G. Some background research tells us that this letter corresponds to the deck on which the cabin was located (with A being closer to the top, G being deeper down). There also appears to be a single cabin entry beginning with the letter 'T', but it's not clear what this means, so we will omit it.

Let's take a look at whether the deck the cabin was located on had any significant influence on survival:

In [23]:
dataset['CabinDeck'] = dataset.apply(lambda row: 
                                     str(row['Cabin'])[0] if str(row['Cabin'])[0] != 'n'
                                     else row['Cabin'], axis=1)

sns.set(font_scale=1.5)
plt.style.use('seaborn-whitegrid')
g = sns.barplot('CabinDeck','Survived', data=dataset[dataset['CabinDeck'] != 'T'].sort_values('CabinDeck'))

Findings:

  • CabinDeck appears to have no significant influence on survival.

A few other comments:

  • The cabin data we have exists predominantly for members of Pclass=1; it is missing for nearly all other passengers.
  • We know that gates in some of the stairwells leading to higher decks were left shut, due to the segregation of 3rd-class passengers from the rest. This of course mainly impacted survival in Pclass=3.
  • The iceberg penetrated the Titanic below deck G. Hence, there wasn't any localized damage/flooding directly to any of the cabins we are considering.
  • Lower-class passengers were not necessarily on lower decks than higher-class passengers. The following schematic of passenger-class cabin distributions shows that most decks contained a mix of classes, although 3rd-class passengers tended to be far towards the rear or front of the ship:

Conclusion:

  • Cabin is not a useful feature for survival prediction. Having a known cabin number basically just tells us that the passenger is Pclass=1 (the quick crosstab below confirms this), and is therefore redundant.
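The quick crosstab referenced above:

# Cabin availability by passenger class
print(pd.crosstab(dataset['Cabin'].notnull(), dataset['Pclass'],
                  rownames=['Cabin known'], colnames=['Pclass']))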

1.4 - Exploring Feature Relations

We now proceed with a multivariate analysis to explore how different features influence each other's impact on passenger survival.

a) Feature Correlations

Let's first look at correlations between survival and some of our existing numerical features:

In [24]:
feature_list = ['Survived', 'Pclass', 'Fare', 'SplitFare', 'Age', 'SibSp', 'Parch', 'FamilySize', 'GroupSize']

plt.figure(figsize=(18,14))

sns.set(font_scale=1.5)
g = sns.heatmap(dataset[np.isfinite(dataset['Survived'])][feature_list].corr(), 
                square=True, annot=True, cmap='coolwarm', fmt='.2f')

Findings:

  • Pclass and SplitFare both show significant correlation with survival, unsurprisingly.
  • There is some negative correlation between Age and FamilySize. This makes sense, as we might expect larger families to consist of more children, thereby lowering the mean age.
  • There is some negative correlation between Age and Pclass. This suggests that upper-class passengers (PClass=1) are generally older than lower-class ones.
  • Unsurprisingly, SibSp, Parch, and FamilySize are all strongly correlated with each other.
  • GroupSize appears to correlate positively with Fare, but has near-zero correlation with SplitFare, further indicative that Fare is cumulative and represents the total paid for all passengers sharing a ticket.

In our univariate plots we saw that FamilySize had a clear impact on survival, yet here the correlation between Survived and FamilySize is near zero. This may be explained by the fact that the positive trend between survival and FamilySize reverses into a negative trend after FamilySize=4 (see the quick check below). A similar reason may be behind the fact that Survived shows little correlation with Age, even though we know an Age of 15 or under (i.e. being a child) is an advantage.
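A quick way to see this non-monotonic behaviour, which a single linear correlation coefficient cannot capture:

# Mean survival by FamilySize (training rows only): rises up to 4, then falls sharply
train = dataset[dataset['Survived'].notnull()]
print(train.groupby('FamilySize')['Survived'].mean())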

At this point it is also common to examine 'pair plots' of variables. However, given that most of our features span only a few discrete values, this turns out to be of limited informative value for us, and is hence omitted.

b) Typical Survivor Attributes

We can easily check which combinations of nominal features give the best and worst survival probabilities. However, this is limited by small sample size for some of our feature combination subsets. Nonetheless it can be useful for benchmarking purposes. We limit this quick exercise to only a few features (if we specify too many, our sample sizes for each combination become too small).

In [25]:
def get_mostProbable(data, feature_list, feature_values):
    
    high_val, low_val = 0, 1
    high_set, low_set = [], []
    high_size, low_size = [], []
    
    for combo in itertools.product(*feature_values):
    
        subset = dataset[dataset[feature_list[0]] == combo[0]]
        for i in range(len(feature_list))[1:]:
            subset = subset[subset[feature_list[i]] == combo[i]]
        mean_survived = subset['Survived'].mean()
    
        if mean_survived > high_val:
            high_set = combo
            high_val = mean_survived
            high_size = subset.shape[0]
        
        if mean_survived < low_val:
            low_set = combo
            low_val = mean_survived
            low_size = subset.shape[0]
        
    print('\n*** Most likely to survive ***')
    for i in range(len(feature_list)):
        print('%s : %s' % (feature_list[i], high_set[i]))
    print('... with survival probability %.2f' % high_val)
    print('and total set size %s' % high_size)
        
    print('\n*** Most likely to perish ***')
    for i in range(len(feature_list)):
        print('%s : %s' % (feature_list[i], low_set[i]))
    print('... with survival probability %.2f' % low_val)
    print('and total set size %s' % low_size)
            

An examination of the best and worst combinations of 'Sex' and 'Pclass':

In [26]:
feature_list = ['Pclass', 'Sex']
feature_values = [[1, 2, 3], ['male', 'female']]  # ['Pclass', 'Sex']
get_mostProbable(dataset, feature_list, feature_values)
    
*** Most likely to survive ***
Pclass : 1
Sex : female
... with survival probability 0.97
and total set size 144

*** Most likely to perish ***
Pclass : 3
Sex : male
... with survival probability 0.14
and total set size 493

Findings:

  • Virtually all 1st-class females survived (97%).
  • 3rd-class males were the most likely to perish, with only 14% survival.

This isn't too surprising, although the associated survival probabilities are useful to note.

What if we also consider whether the passenger was a child?

In [27]:
feature_list = ['Pclass', 'Sex', 'IsChild']
feature_values = [[1, 2, 3], ['male', 'female'], [0, 1]]  # ['Pclass', 'Sex', 'IsChild']
get_mostProbable(dataset, feature_list, feature_values)
          
*** Most likely to survive ***
Pclass : 1
Sex : male
IsChild : 1
... with survival probability 1.00
and total set size 5

*** Most likely to perish ***
Pclass : 2
Sex : male
IsChild : 0
... with survival probability 0.08
and total set size 159

Findings:

  • Male rather than female first-class children are most likely to survive (but note that we have only 5 samples in this subset).
  • Males in Pclass=2 seem to fare worse than Pclass=3 when we delineate between adults and children.

c) Sex, Pclass, and IsChild

Let's take a closer look at how survival differs between male and female across all three passenger classes:

In [28]:
plt.figure(figsize=(10,5))
plt.style.use('seaborn-whitegrid')
g = sns.barplot(x="Pclass", y="Survived", hue="Sex", data=dataset)

Findings:

  • Males in Pclass 2 and 3 have almost the same survival probability. Survival roughly doubles when going to Pclass 1.
  • Females in Pclass 2 are almost as likely to survive as those in Pclass 1; however, survival drops sharply (by ~40%) for Pclass 3.

Now let's see how this subdivides between children and adults:

In [29]:
subset = dataset[dataset['IsChild'] == 0]
table = pd.crosstab(subset[subset['Sex']=='male']['Survived'], subset[subset['Sex']=='male']['Pclass'])
print('\n*** i) Adult Male Survival ***')
print(table)

table = pd.crosstab(subset[subset['Sex']=='female']['Survived'], subset[subset['Sex']=='female']['Pclass'])
print('\n*** ii) Adult Female Survival ***')
print(table)

subset = dataset[dataset['IsChild'] == 1]
table = pd.crosstab(subset[subset['Sex']=='male']['Survived'], subset[subset['Sex']=='male']['Pclass'])
print('\n*** iii) Male Child Survival ***')
print(table)

table = pd.crosstab(subset[subset['Sex']=='female']['Survived'], subset[subset['Sex']=='female']['Pclass'])
print('\n*** iv) Female Child Survival ***')
print(table)

g = sns.factorplot(x="IsChild", y="Survived", hue="Sex", col="Pclass", data=dataset, ci=95.0)
# ci = confidence interval (set to 95%)
*** i) Adult Male Survival ***
Pclass     1   2    3
Survived             
0.0       77  91  281
1.0       42   8   38

*** ii) Adult Female Survival ***
Pclass     1   2   3
Survived            
0.0        2   6  58
1.0       89  60  56

*** iii) Male Child Survival ***
Pclass    1  2   3
Survived          
0.0       0  0  19
1.0       3  9   9

*** iv) Female Child Survival ***
Pclass    1   2   3
Survived           
0.0       1   0  14
1.0       2  10  16

Findings:

Males:

  • Virtually all of the male children in our training dataset survive in Pclasses 1 and 2.
  • The IsChild feature clearly has a significant impact on survival likelihood (well above 50%) for Pclass 1 and 2; for Pclass 3, this difference is a more modest 20%.

Females:

  • All but one of the Pclass 1 and 2 female children survived.
  • Survival of a female child versus an adult is only modestly better (by perhaps 5%).
  • The vast majority of female passengers who perished were in Pclass=3 (see table ii).

d) FamilySize, Sex, and Pclass

We saw earlier that FamilySize improves survival up to FamilySize=4, then sharply drops. Does Sex have any influence on this trend?

In [30]:
g = sns.factorplot(x="FamilySize", y="Survived", hue="Sex", data=dataset, ci=95.0)
# ci = confidence interval (set to 95%)
plt.style.use('seaborn-whitegrid')

Findings:

  • The increase in survival with FamilySize (up to 4) is greater for males than for females.
  • Both sexes follow the same general trend otherwise.

It would be interesting to know the role Pclass plays in FamilySize survival:

In [31]:
# Class distribution for a given FamilySize
table = pd.crosstab(dataset['FamilySize'], dataset['Pclass'])
print('\n', table)
table_fractions = table.div(table.sum(1).astype(float), axis=0)
g = table_fractions.plot(kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.xlabel('FamilySize', weight='bold')
plt.ylabel('Fraction of Total')
plt.tight_layout()
leg = plt.legend(title='Pclass', loc=9, bbox_to_anchor=(1.05, 1.0))

# Comparison of surival
g = sns.factorplot(x="FamilySize", y="Survived", hue="Pclass", data=dataset, ci=95.0)
 Pclass        1    2    3
FamilySize               
1           160  158  472
2           104   52   79
3            39   45   75
4             9   20   14
5             5    1   16
6             6    1   18
7             0    0   16
8             0    0    8
11            0    0   11

Findings:

  • FamilySize of 5 or more was predominantly 3rd class.
  • FamilySize = 4 was predominantly 1st and 2nd class.
  • Pclass 1 breaks from the linear trend early, dipping below Pclass=2 for FamilySize=4.

Finally, let's look at how the various FamilySizes are split between the two sexes:

In [32]:
# Sex distribution for a given FamilySize
table = pd.crosstab(dataset['FamilySize'], dataset['Sex'])
print('\n', table)
table_fractions = table.div(table.sum(1).astype(float), axis=0)
g = table_fractions.plot(kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.xlabel('FamilySize', weight='bold')
plt.ylabel('Fraction of Total')
plt.tight_layout()
leg = plt.legend(title='Sex', loc=9, bbox_to_anchor=(1.05, 1.0))
 Sex         female  male
FamilySize              
1              194   596
2              123   112
3               79    80
4               29    14
5               14     8
6               10    15
7                9     7
8                3     5
11               5     6