603: Nearest Neighbour Tips and Tricks

Feature scaling and the choice of distance metric used in nearest neighbour algorithms have a big effect on performance. This data science training video covers this and other nearest neighbour tips and tricks.

Length: 7:09


Dimensionality and domain knowledge

• Is it right to use the same distance measure for all features?

E.g. height and sex? CPU and Disk space?

• Some features will have more of an effect than others due to their scales.

???

In this version of the algorithm all features are used in the distance calculation.

This treats all features identically, so a measurement of height (in centimetres, say) enters the distance calculation on the same terms as sex encoded as 0/1.

The result is that features with larger scales will tend to influence the result more than others.

The way around this is to make full use of feature selection and/or engineering (see later), i.e. remove or alter features that don't provide any descriptive power.

This means you need to be more of a specialist in the domain than you would for other algorithms (e.g. deep learning).
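As a sketch of the scaling point above (with made-up toy data), standardising features before k-NN stops the large-scale feature from dominating the distance:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Toy data (invented for illustration): height in cm (large scale)
# and sex encoded as 0/1 (small scale).
X = np.array([[180.0, 0], [178.0, 1], [165.0, 0],
              [162.0, 1], [175.0, 0], [160.0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

# StandardScaler gives each feature zero mean and unit variance,
# so both columns contribute comparably to the Euclidean distance.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
pred = model.predict([[170.0, 1]])
print(pred)
```

Without the `StandardScaler` step, the height column would swamp the 0/1 column in every distance calculation; the pipeline applies the same scaling at prediction time automatically.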

Computational efficiency

• k-NN has no training period. Only prediction.
• Prediction time grows with the number of features and observations.

This means that finding the neighbours can be expensive in production.

You can find optimisations of k-NN that overcome this, e.g. KD-trees.

???

k-NN is particularly interesting because most algorithms take longer to train than they do to predict.

k-NN is different because there is no training; the prediction step of measuring the distances is all that is required.

The drawback is that when the number of features or observations is large, prediction times become increasingly long.

For tasks that require fast predictions (e.g. online advertisement recommendations) this might not be the best choice.

Although you could of course limit the dataset to a certain size to guarantee performance. Or use some random sampling to improve performance.
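For example, scikit-learn's `KDTree` builds a space-partitioning index once, so that each neighbour query no longer has to scan every observation (a sketch with random data):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))  # 10k observations, 3 features

# Build the tree once; each query then traverses the tree rather
# than computing the distance to all 10k points (brute force).
tree = KDTree(X)
dist, idx = tree.query(X[:1], k=5)  # 5 nearest neighbours of the first point
print(idx)
```

Note that the first point's nearest neighbour is itself (distance 0). KD-trees work well in low dimensions but degrade towards a brute-force scan as dimensionality grows, which is another reason to remove uninformative features.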

Distance Measures

• There are other distance measures (hundreds!)

???

We’ve already used the Euclidean distance, but there are many others (hundreds actually) which might make more sense depending on your application.

The reason is that distance measures tend to be domain specific.

E.g. there might be a good way to calculate the distance between taxi start and end points in a city (they often travel at right angles due to blocks of buildings).

To recap…

Euclidean (L2 norm)

$$d_{Euclidean}(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}||=\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + …}$$

???

As before, the L2 norm tends to be influenced more by outliers.

And again, it assumes a Gaussian distribution (which is probably not true for complex multi-dimensional data).

This is the most popular distance measure, but probably not the right one.

Manhattan (L1 norm)

$$d_{Manhattan}(\mathbf{x}, \mathbf{y}) = |\mathbf{x} - \mathbf{y}|=|x_1 - y_1| + |x_2 - y_2| + …$$

This is the L1 norm and is named Manhattan since it is the total street distance you would have to travel in a city grid to reach a destination.

???

In contrast, the L1 norm tends to be less influenced by outliers, so it might be a better choice for noisy datasets.
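To make the two norms concrete, a small sketch comparing them on the same pair of points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean (L2): straight-line distance.
l2 = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 16 + 0)
# Manhattan (L1): total per-feature (street-grid) distance.
l1 = np.sum(np.abs(x - y))          # 3 + 4 + 0

print(l2, l1)
```

Because the L2 norm squares each difference before summing, a single large per-feature difference (an outlier) contributes disproportionately, which is why L1 is often more robust.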

Jaccard

$$d_{Jaccard}(\mathbf{x}, \mathbf{y}) = 1 - \frac{ |\mathbf{x} \cap \mathbf{y}| }{ |\mathbf{x} \cup \mathbf{y}| }$$

(Figures: the intersection and the union of two sets.)
• Shared items / all items

E.g. if two whiskeys are peaty, that’s significant. If they are not spicy, that’s probably not.

???

The Jaccard similarity is the intersection (shared area) of the two sets divided by the union (whole area); the distance is one minus that. The Jaccard distance is useful when you have highly categorical data, in situations where shared categories are significant but shared absences are not.
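A minimal sketch of the Jaccard distance on the whiskey example above (the flavour tags are invented for illustration):

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets of categories."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Hypothetical flavour profiles: tags both whiskeys share count
# towards similarity; tags neither has are simply absent and
# contribute nothing, which is exactly the behaviour we want.
whisky_a = {"peaty", "smoky", "medicinal"}
whisky_b = {"peaty", "smoky", "rich"}
d = jaccard_distance(whisky_a, whisky_b)
print(d)  # shares 2 of 4 distinct tags -> distance 0.5
```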

More distance measures are available here:

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

Or over a hundred were collated here:

The Dictionary of Distances by Deza and Deza (Elsevier Science, 2006)

Combination functions

• Majority Vote
• Weighting the class based upon distance
• Dividing each class’s weight by the total over all neighbours to generate a probability estimate
• Using the median, rather than the average
• All of the above but for regression.

See the “weights” parameter in the sklearn k-NN documentation for example implementations: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

???

Remember that after we’ve picked the k nearest neighbours, we still need to turn them into a final class or value.

In other words, we need a function that combines the neighbouring observations into a single result.

The normal k-NN algorithm uses a simple majority vote. But others include:

• Weighting the class based upon distance
• Dividing each class’s weight by the total over all neighbours to generate a probability estimate
• Using the median, rather than the average
• All of the above but for regression.

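The difference between a plain majority vote and distance weighting can be seen with sklearn's `weights` parameter (toy one-feature data, made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three class-0 points far from the query, two class-1 points near it.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([0, 0, 0, 1, 1])

# Majority vote: with k=5, the three class-0 points always outvote class 1.
uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
# Distance weighting: each neighbour's vote is weighted by 1/distance,
# so the two nearby class-1 points dominate.
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

query = [[4.9]]
print(uniform.predict(query), weighted.predict(query))
```

Calling `predict_proba` on the weighted model gives the normalised-weight probability estimate mentioned in the bullets above.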