601: Similarity and Nearest Neighbours


Video (1080p .mp4) | Video (720p .mp4) | Audio (.m4a)

Length: 7:12

The idea of similarity transcends problem boundaries. From recommendations to classification, learn how the nearest neighbour algorithm is not only useful but also very simple.

This section introduces the idea of “similarity”.

Why?

  • Simplicity
  • Many business tasks require a measure of “similarity”
  • Works well

Business reasoning

Why would businesses want to use a measure of similarity? What business problems map well to similarity classifiers?

  • Find similar companies on a CRM
  • Find similar people in an online dating app
  • Find similar configurations of machines in a data centre
  • Find pictures of cats that look like this cat
  • Recommend products to buy from similar customers
  • Find similar wines

Similarity

What is similarity?

  • We can say that two wines are similar if they have the same colour, alcohol content, taste, etc.

  • We can say that configurations of machines are similar if they have the same RAM, CPU, HDD, etc.

In other words, we are comparing the features of the observations.

Observations are similar if they have matching features.
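For instance, here is a minimal sketch (the wines and their values are made up for illustration) of encoding two observations with the same features so they can be compared directly:

# Two wines encoded with the same (made-up) numeric features.
wine_a = {"colour": 1, "alcohol": 13.5, "sweetness": 2}   # 1 = red
wine_b = {"colour": 1, "alcohol": 13.0, "sweetness": 2}

# Features that match exactly between the two observations.
matches = [feature for feature in wine_a if wine_a[feature] == wine_b[feature]]
print(matches)   # ['colour', 'sweetness']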


Distance

What is the best way of reducing the comparison of many features into a single distance measurement?

The simplest conversion would be to use the Euclidean distance (a.k.a. L2 norm, Pythagoras’ Theorem):

$$ d_{Euclidean}(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}|| = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $$

???

I.e. observations separated by small distances are very similar; those separated by large distances are dissimilar.

Note that although we use the word distance, this measurement has no units. We’re potentially comparing multiple types of features, so a real “distance” doesn’t make sense.
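As a quick sketch (the feature values are illustrative, and the use of numpy is an assumption rather than part of the lesson), the Euclidean distance between two feature vectors can be computed directly:

import numpy as np

# Two observations described by the same numeric features (illustrative values),
# e.g. body, sweetness, smoky, medicinal.
x = np.array([2, 2, 2, 0])
y = np.array([4, 1, 4, 4])

# Square root of the summed squared feature differences.
print(np.sqrt(np.sum((x - y) ** 2)))   # 5.0
print(np.linalg.norm(x - y))           # the same value, via the L2 norm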


Nearest Neighbour Algorithm

  1. Calculate the distance from the query observation to all other observations
  2. Find the closest observation(s)
  • Recommendations: List the nearest observations
  • Classification: Predict the same class as the nearest observation(s)
  • Regression: Predict the same value as the nearest observation(s)

???

Now that we have a measure of distance, we can perform the nearest neighbour algorithm!

If we wanted to find the next most similar wine, for example, we'd simply calculate the distance between the current wine and all other wines.

Our next wine would be the one with the smallest distance!

If we wanted to perform classification, then we’d do the same but predict a classification using the nearest neighbour’s class.

If we wanted to perform regression, then we’d do the same but pick the same value as the closest neighbour.

Simple!
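As a sketch of the classification and regression variants (the data, labels, and target values below are made up for illustration), scikit-learn's KNeighborsClassifier and KNeighborsRegressor can make the nearest neighbour prediction for us:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up training data: each row is an observation's features.
X = np.array([[2, 2, 2, 0],
              [3, 3, 1, 0],
              [4, 1, 4, 4]])
y_class = np.array(["light", "light", "smoky"])   # illustrative class labels
y_value = np.array([35.0, 40.0, 55.0])            # illustrative numeric targets

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=1).fit(X, y_value)

query = np.array([[4, 2, 4, 4]])   # a new, unseen observation
print(clf.predict(query))          # class of the nearest neighbour: ['smoky']
print(reg.predict(query))          # value of the nearest neighbour: [55.]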



Example: Whiskey recommendations

Distillery   Body  Sweetness  Smoky  Medicinal
Aberfeldy    2     2          2      0
Aberlour     3     3          1      0
AnCnoc       1     3          2      0
Ardbeg       4     1          4      4
Ardmore      2     2          2      0

Columns: ['Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral', 'Postcode', ' Latitude', ' Longitude']

???

Do you like whiskey? (Say yes!)

If we had a dataset detailing whiskey characteristics, then we could take your favourite whiskey and return the most similar whiskeys as a personalised recommendation!


Algorithm

# Given a favourite whiskey and the full list of whiskies (each a dict of flavour features),
# rank every whiskey by its Euclidean distance to the favourite.
neighbours = []
for whiskey in whiskies:
    dist = 0
    for feature in features:
        dist += (favourite[feature] - whiskey[feature]) ** 2
    neighbours.append((whiskey, dist ** 0.5))  # square root gives the Euclidean distance
neighbours.sort(key=lambda pair: pair[1])      # smallest distance first
print(neighbours[:5])                          # the five nearest whiskies (including itself)
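For context, here is a hedged sketch of how the inputs above (whiskies, features, favourite) might be prepared; the pandas loading step and the whisky.csv filename are assumptions for illustration, not part of the lesson:

import pandas as pd

data = pd.read_csv("whisky.csv")   # hypothetical path to the whiskey dataset
features = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
            "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
whiskies = data.to_dict("records")                       # one dict per distillery
favourite = next(w for w in whiskies if w["Distillery"] == "Laphroig")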

Results

So, let’s go for a super-smoky one: Laphroig. The results:

[
    (array([4, 2, 4, 4, 1, 0, 0, 1, 1, 1, 0, 0, 'Laphroig'], dtype=object), 0.0),
    (array([4, 1, 4, 4, 1, 0, 1, 2, 1, 1, 1, 0, 'Lagavulin'], dtype=object), 2.0),
    (array([4, 1, 4, 4, 0, 0, 2, 0, 1, 2, 1, 0, 'Ardbeg'], dtype=object), 3.0),
    (array([3, 2, 3, 3, 1, 0, 2, 0, 1, 1, 2, 0, 'Clynelish'], dtype=object), 3.4641016151377544),
    (array([3, 1, 4, 2, 1, 0, 2, 0, 2, 1, 1, 1, 'Caol Ila'], dtype=object), 3.7416573867739413)
]

???

We can see that Laphroig matches itself perfectly, as it should.

Next up are Lagavulin and Ardbeg.

According to the smoky classification (the third column), among other features, these are pretty good recommendations.

However, this highlights how important your data is. I’ve definitely had non-smoky Ardbegs before.

