Kaggle Titanic: Machine Learning from Disaster

In this chapter we will go through the essential steps you need to take before you start building predictive models.

How it works

Get the Data with Pandas

Understanding your data

Rose vs Jack, or Female vs Male

Does age play a role?

First Prediction

Predicting with Decision Trees

Intro to decision trees

Cleaning and Formatting your Data

Creating your first decision tree

Interpreting your decision tree

Predict and submit to Kaggle

Overfitting and how to control it

Feature engineering for our Titanic data set

Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.
While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size.
A valid assumption is that larger families needed more time to get together on a sinking ship, and hence had a lower probability of surviving. Family size is determined by the variables SibSp and Parch, which indicate the number of family members a passenger is traveling with. So when doing feature engineering, you add a new variable family_size, which is the sum of SibSp and Parch plus one (the observation itself), to both the train and test sets.
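As a concrete sketch (assuming the train and test DataFrames were loaded from the Kaggle CSV files as in the "Get the Data with Pandas" section), the new column can be created like this:

    # Minimal sketch: assumes train.csv and test.csv are the Kaggle Titanic files.
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # family_size = SibSp + Parch + 1 (the passenger themselves)
    for df in (train, test):
        df["family_size"] = df["SibSp"] + df["Parch"] + 1

    print(train[["SibSp", "Parch", "family_size"]].head())

Adding the column to both data sets matters: any feature used to train a model must also exist, with the same name, in the test set you predict on.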

Improving your predictions

What techniques can you use to improve your predictions even more? One possible approach is the Random Forest method. As the name suggests, a forest is just a collection of decision trees, and averaging the votes of many slightly different trees tends to generalize better than any single tree.
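The sketch below shows the idea with scikit-learn. It is not the tutorial's exact code: the feature list is illustrative, and it assumes the train DataFrame from earlier plus the family_size column created above.

    # Minimal sketch: assumes `train` and the family_size feature from above.
    from sklearn.ensemble import RandomForestClassifier

    # Reuse the steps from "Cleaning and Formatting your Data":
    # encode Sex numerically and impute missing ages with the median.
    train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
    train["Age"] = train["Age"].fillna(train["Age"].median())

    features = train[["Pclass", "Sex", "Age", "Fare", "family_size"]].values
    target = train["Survived"].values

    # n_estimators sets how many trees the forest grows; max_depth and
    # min_samples_split limit each tree, which helps control overfitting.
    forest = RandomForestClassifier(n_estimators=100, max_depth=10,
                                    min_samples_split=2, random_state=1)
    forest.fit(features, target)
    print(forest.score(features, target))  # accuracy on the training set

Note that the training-set accuracy printed here is optimistic; the Kaggle leaderboard score on the unseen test set is the real measure of improvement.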

A Random Forest analysis in Python

Interpreting and Comparing

Conclude and Submit