203: Examples and Decision Trees

Help Yourself - Sign Up

Video (1080p .mp4) | Video (720p .mp4) | Audio (.m4a)

Length: 9:16

This video shows an example of using segmentation on categorical data and we'll also find that we have just derived our first important machine learning algorithm: a decision tree.

Example: Segmentation via Information Gain

There’s a fairly famous dataset called the “mushroom dataset”.

It describes whether mushrooms are edible or not, depending on an array of features.

The nice thing about this dataset is that the features are all catagorical.

So we can go through and segment the data for each value in a feature.

This is some example data:

poisonouscap-shapecap-surfacecap-colorbruises?
pxsnt
exsyt
ebswt
pxywt
exsgf

etc.


For each feature, we can then plot the information gain visually, which helps. On the x axis we have the proportion of the sample given a feature value, vs the calculated entropy for the label (poisonous, not poisonous).

This is the information gain for the “cap-shape” feature:

mushroom_cap-shape


And this is the information gain for the “odour” feature:

mushroom_odour

We can visually see that the second feature has a much smaller entropy*probability area, hence the information gain is much greater.

We would certainly consider using this feature over the other.


Segmentation through trees

Segmentation and quantitative measures of information are fundamental to data science.

If we select the most informative feature then we have segmented the data. It is then quite simple to use that segmentation rule (e.g. < or > 50) to predict the class of a new observation.

But what if a single split didn’t completely separate the classes?

One simple solution is to stack segmentation rules. I.e. perform the split again until the classes are pure.

This is known as a tree.


A tree has several components:

  • Root node: The starting point
  • Interior node: A decision point
  • Terminal node/Leaf: A pure node.

Trees are often used as predictive models.


We can continue performing the splits and information gain calculations until all the terminal nodes are pure.

Then if we see a new observation, we can re-run the same rules for that new observation and predict it’s class.

When trees are used to make a decision they are called decision trees. Our first classification algorithm!

Pat yourself on your back. Pat your neighbours back. You’ve just derived a very important algorithm!

They are popular because:

  • They are easy to understand
  • Simple and efficient to implement
  • Work well with little tuning
  • Work with both continuous and categorical data (more on this next)

Given some data, we can build an optimal tree structure for our problem. This is called tree induction.

The goal of the tree is to provide supervised segmentation; given some labelled data, find the rules, based upon their features, into subgroups that have similar values.

We performed this task manually when discussing information gain. We can iteratively select the best split that maximises the information gain.


Decision lines and hyperplanes

Once rules have been derived from the data the rule forms a decision boundary.

If there is only one feature, then the boundary is simply a point. If there are two features, then the boundary is a line. If there are three it becomes a surface. Etc.

Generalising the boundary is known as a hyperplane, which simply means a separating surface.


Trees as rules

One of the beauties of tree-based classification algorithms is that they are purely logical.

It is easy to convert simple trees into a set of rules.

For example, our beer example could be summarised as:

if COLOUR < 50 then Class=A else Class=B

These rules can be easily encoded into software or procedures.


Example: Customer churn

This is an interesting dataset describing whether users of a mobile phone service left for another provider. This is known as “customer churn”.

The goal of this dataset is to predict wheter a customer is going to leave or not from some other features.

The features include:

Index(['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE',
       'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION',
       'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL',
       'CONSIDERING_CHANGE_OF_PLAN'],
      dtype='object')

I won’t go into details just yet, because we’re not quite ready. But I pushed this data through an algorithm to create a decision tree, the calculated the information gain given each of the generated rules.

The resulting rules look like…


graphviz


churn_feature_information_gains


Winder Research logo

EMail

web@WinderResearch.com

Registered Address

Winder Research and Development Ltd.,

Adm Accountants Ltd, Windsor House,

Cornwall Road,

Harrogate,

North Yorkshire,

HG1 2PW,

UK

Registration Number

08762077

VAT Number

GB214263735
© Winder Research and Development Ltd. 2016-2018; all rights reserved.