Welcome! This workshop is from WinderResearch.com. Sign up to receive more free workshops, training and videos.
In a previous workshop we investigated how the nearest neighbour algorithm uses the concept of distance as a similarity measure.
We can also use this concept of similarity as a classification metric. I.e. new observations will be classified the same as its neighbours.
This is accomplished by finding the most similar observations and setting the predicted classification as some combination of the k-nearest neighbours. (e.g. the most common)
This time, let’s use a classification dataset like the
iris dataset and
let’s use the inbuilt k-NN implementation in
# Usual imports import os import pandas as pd import matplotlib.pyplot as plt import numpy as np from IPython.display import display from sklearn import datasets from sklearn import neighbors # import some data to play with iris = datasets.load_iris() X = iris.data[:, :2] # we only take the first two features. We could # avoid this ugly slicing by using a two-dim dataset y = iris.target
# Quick plot to see what we're dealing with plt.scatter(X[:,0], X[:,1], c=y) plt.show()
Now let’s write a fancy method to plot the decision boundaries…
from matplotlib.colors import ListedColormap def plot_decision_regions(X, y, classifier, resolution=0.02): # setup marker generator and color map markers = ('s', 'x', 'o', '^', 'v') colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan') cmap = ListedColormap(colors[:len(np.unique(y))]) cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) # plot the decision surface x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution)) Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T) Z = Z.reshape(xx1.shape) plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap) plt.xlim(xx1.min(), xx1.max()) plt.ylim(xx2.min(), xx2.max()) plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
figure = plt.figure(figsize=(12, 6)) n_neighbours = [1, 20] i = 1 for n in n_neighbours: clf = neighbors.KNeighborsClassifier(n, weights='distance') clf.fit(X, y) ax = plt.subplot(1, len(n_neighbours), i) i += 1 plot_decision_regions(X=X, y=y, classifier=clf) ax.set_xticks(()) ax.set_yticks(()) plt.title(str(n) + '-NN') plt.show()
As we can see, when we only pick a single neighbour for classification purposes, we end up with a very complex decision boundary. If we were to perform validation on this classifier we would probably see that it has poor performance; we have overfit.
Generally, as you increase the value of
k, the decision boundary becomes smoother.
But beware. You must pick a value of
k to avoid ties (i.e. odd) and it must be coprime (to allow all classes a chance to vote). See the training videos for more information.
- Try plotting validation curves on this dataset for different values of k