KNN (k-nearest neighbors) is possibly the second most famous classification technique, and it is also very simple to apply.
It does not build an explicit model: whenever we need to classify a new record, it chooses the n most similar (closest) records to the given one, called its neighbors, and classifies the new record as the majority class among those neighbors.
Naturally, the number of neighbors to consider, call it n, is one of the parameters of any implementation of KNN.
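To make the decision rule concrete, here is a minimal from-scratch sketch of it in NumPy; the function name knn_predict and the choice of Euclidean distance are illustrative assumptions, not part of the library code used below.
import numpy as np
from collections import Counter

def knn_predict(trnX, trnY, x, n=5):
    # distance from the new record x to every training record (Euclidean, for illustration)
    dists = np.sqrt(((trnX - x) ** 2).sum(axis=1))
    # indices of the n closest training records
    neighbors = np.argsort(dists)[:n]
    # the majority class among the neighbors wins
    return Counter(trnY[neighbors]).most_common(1)[0][0]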
KNeighborsClassifier receives n through the n_neighbors parameter. Another important parameter is the distance function used to choose the neighbors, set through the metric parameter, which can be manhattan, euclidean or chebyshev, among others.
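To illustrate how these three metrics differ, the distances between two records can be computed directly with NumPy (a quick sketch with made-up values):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])
print(np.abs(a - b).sum())            # manhattan: sum of absolute differences -> 5.5
print(np.sqrt(((a - b) ** 2).sum()))  # euclidean: straight-line distance -> ~3.64
print(np.abs(a - b).max())            # chebyshev: largest per-coordinate difference -> 3.0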
Given the importance of these parameters, we need to choose them carefully, which means trying different values and understanding how they impact the quality of the results.
Next, we can see the results achieved by a set of parameter combinations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.neighbors import KNeighborsClassifier
import ds_charts as ds
file_tag = 'diabetes'
filename = 'data/diabetes'
target = 'class'
train: pd.DataFrame = pd.read_csv(f'{filename}_train.csv')
trnY: np.ndarray = train.pop(target).values
trnX: np.ndarray = train.values
labels = pd.unique(trnY)
test: pd.DataFrame = pd.read_csv(f'{filename}_test.csv')
tstY: np.ndarray = test.pop(target).values
tstX: np.ndarray = test.values
nvalues = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
dist = ['manhattan', 'euclidean', 'chebyshev']
values = {}
best = (0, '')
last_best = 0
for d in dist:
    yvalues = []
    for n in nvalues:
        knn = KNeighborsClassifier(n_neighbors=n, metric=d)
        knn.fit(trnX, trnY)
        prdY = knn.predict(tstX)
        yvalues.append(metrics.accuracy_score(tstY, prdY))
        # keep track of the best (n, metric) combination seen so far
        if yvalues[-1] > last_best:
            best = (n, d)
            last_best = yvalues[-1]
    values[d] = yvalues
plt.figure()
ds.multiple_line_chart(nvalues, values, title='KNN variants', xlabel='n', ylabel='accuracy', percentage=True)
plt.savefig(f'images/{file_tag}_knn_study.png')
plt.show()
print('Best results with %d neighbors and %s'%(best[0], best[1]))
Best results with 11 neighbors and manhattan
The plot above shows the parameter combination for which the best results were achieved. Let us now look at the performance of that configuration in terms of other metrics.
clf = KNeighborsClassifier(n_neighbors=best[0], metric=best[1])
clf.fit(trnX, trnY)
prd_trn = clf.predict(trnX)
prd_tst = clf.predict(tstX)
ds.plot_evaluation_results(labels, trnY, prd_trn, tstY, prd_tst)
plt.savefig(f'images/{file_tag}_knn_best.png')
plt.show()
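If ds_charts is not available, a rough text-only substitute can be sketched with scikit-learn's built-in metrics, reusing the variables defined above; this assumes plot_evaluation_results compares train and test performance, which is an inference from its arguments rather than its documented behavior.
# hedged substitute for ds.plot_evaluation_results: compare train and test performance
print('train accuracy: %.3f' % metrics.accuracy_score(trnY, prd_trn))
print('test accuracy:  %.3f' % metrics.accuracy_score(tstY, prd_tst))
print(metrics.confusion_matrix(tstY, prd_tst, labels=labels))
print(metrics.classification_report(tstY, prd_tst))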