Why Correlating Data is Bad and What to do About it

Help Yourself - Sign Up

Download this notebook

Correlating Data

Welcome! This workshop is from WinderResearch.com. Sign up to receive more free workshops, training and videos.

Correlations between features are bad because you are effectively telling the model that this information is twice more important than everything else. You’re feeding the model the same data twice.

Technically it’s known as multicollinear, which is the generalisation to any number of features that could be correlated.

Generally correlating features will decrease the performance of your model, so we need to find them and remove them.

Again, let’s generate some dummy data for simplicity…

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

X = np.random.randn(100,5)
noise = np.random.randn(100)
X[:,0] = 2*X[:,2] + 3*X[:, 4] + 0.5*noise

The easiest way of spotting correlating features is to generate a scatter matrix. This is an image which plots each feature against each other feature.

# Here, we're plotting a "scatter matrix". I.e. a matrix of scatter plots of each feature.
# Its really useful for spotting dodgy data.

from pandas.plotting import scatter_matrix
df = pd.DataFrame(X)
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='hist')
plt.show()

png

We can see that there is some linearity in the plots. Definitely in the top right.

Again, the simplest thing to do at this stage is manually remove that feature.

del df[4]
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='hist')
plt.show()

png

There’s still a little bit of correlation in the second feature, but it’s not huge. Try scoring your model with and without this feature.

We could perform this process in a more quantitative manner using eigenvectors and eigenvalues to spot the correlation, but that’s a bit too complex to consider at this point.


Winder Research logo

EMail

web@WinderResearch.com

Registered Address

Winder Research and Development Ltd.,

Adm Accountants Ltd, Windsor House,

Cornwall Road,

Harrogate,

North Yorkshire,

HG1 2PW,

UK

Registration Number

08762077

VAT Number

GB214263735
© Winder Research and Development Ltd. 2016-2018; all rights reserved.