In this tutorial I demonstrate key elements and design approaches that go into building a well-performing machine learning pipeline. The topics I’ll cover include:
- Exploratory Data Analysis and Feature Engineering.
- Data Pre-Processing including cleaning and feature standardization.
- Dimensionality Reduction with Principal Component Analysis and Recursive Feature Elimination.
- Classifier Optimization via hyperparameter tuning and Validation Curves.
- Building a more powerful classifier through Ensemble Voting and Stacking.
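To preview how these pieces fit together, here is a minimal sketch (not the tutorial's exact code) that chains standardization, PCA, hyperparameter tuning, and a voting ensemble with scikit-learn. The synthetic dataset below is only a stand-in for the Titanic data we'll use later.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the real dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-processing + dimensionality reduction + classifier in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression()),
])

# Hyperparameter tuning over both the PCA and classifier settings
grid = GridSearchCV(
    pipe,
    {"pca__n_components": [3, 5], "clf__C": [0.1, 1.0]},
    cv=3,
)
grid.fit(X_train, y_train)

# A simple ensemble: majority vote between the tuned pipeline and a forest
vote = VotingClassifier([
    ("tuned", grid.best_estimator_),
    ("forest", RandomForestClassifier(random_state=0)),
])
vote.fit(X_train, y_train)
accuracy = vote.score(X_test, y_test)
```

Each of these steps is unpacked in detail in the sections that follow.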
Along the way we’ll be using several important Python libraries, including scikit-learn and pandas, as well as seaborn for data visualization.
Our task in this tutorial is a binary classification problem inspired by Kaggle’s “Getting Started” competition, Titanic: Machine Learning from Disaster. The goal is to accurately predict whether a passenger survived or perished during the Titanic’s sinking, based on data such as passenger age, class, and sex. The training and test datasets are provided here.
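To make the task framing concrete, here is a hedged sketch using a few illustrative toy rows (not the real passenger records): features such as `Age`, `Pclass`, and `Sex` map to a binary `Survived` target, which is what our classifiers will learn to predict.

```python
import pandas as pd

# Toy rows for illustration only; the real training set has many more
# passengers and additional columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 4.0],
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],  # binary target: 0 = perished, 1 = survived
})

# Separate features from the target, as we'll do with the real training set
X = toy.drop(columns="Survived")
y = toy["Survived"]
```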
I have chosen here to focus on the fundamentals that should be a part of every data scientist’s toolkit. The topics covered should provide a solid foundation for launching into more advanced machine learning approaches, such as Deep Learning. For an intro to Deep Learning, see my notebook on building a Convolutional Neural Network with Google’s TensorFlow API.
Notes:
- This IPython notebook is best viewed using Google Chrome; some images and hyperlinks may not work in Mozilla Firefox.
- To download the source code, which you can edit and execute yourself, save this link (.ipynb file extension).
Check out some of my past projects!
I’ve worked on technical projects in a variety of fields. Here are some highlights: