Classification
KNN
KNN (k-nearest neighbors) is possibly the second most famous classification technique, and it is also very simple and easy to apply.
It doesn't build an explicit model. Instead, whenever we need to classify a new record, it selects the k most similar (closest) training records, called its neighbors, and assigns the new record the majority class among them.
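The procedure just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name knn_classify and the toy data are made up here, Euclidean distance is assumed, and a real study would rely on a library implementation instead.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_classify(train_X, train_y, record, k=3):
    """Classify `record` by majority vote among its k nearest training records."""
    # Pair each training record with its label, sort by distance to the
    # new record, and keep only the k closest pairs (the neighbors).
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], record))[:k]
    # The predicted class is the most frequent label among the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data, invented for this illustration: two small, well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_X, train_y, (1.1, 1.0), k=3))  # a
print(knn_classify(train_X, train_y, (4.9, 5.0), k=3))  # b
```

Note that nothing is learned ahead of time: all the work happens at classification time, which is why every prediction needs the full training set.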
Naturally, the number of neighbors to consider, call it K, is one of the parameters of any implementation of KNN. KNeighborsClassifier receives K through the n_neighbors parameter. Another important parameter is the distance function used to choose the neighbors: metric is the parameter to use, and it can be manhattan, euclidean or chebyshev, among others.
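To make the three distance functions concrete, the sketch below computes each of them for the same pair of points, using their standard definitions. The helper functions are written here only for illustration; they are not part of scikit-learn, which computes these distances internally when given the corresponding metric name.

```python
def manhattan(p, q):
    # Sum of the absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    # Square root of the sum of squared differences (straight-line distance).
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def chebyshev(p, q):
    # Largest absolute difference along any single dimension.
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (0.0, 0.0), (3.0, 4.0)
print(manhattan(p, q))  # 7.0
print(euclidean(p, q))  # 5.0
print(chebyshev(p, q))  # 4.0
```

Since the three functions can rank the same records differently, the chosen metric can change which records count as neighbors, and therefore the predicted class.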
Parameters study
Given the importance of these parameters, we need to choose them carefully, which means trying different values and understanding how they affect the quality of the results.
Next, we can see the results achieved by a set of parameter combinations.
from typing import Literal
from numpy import array, ndarray
from sklearn.neighbors import KNeighborsClassifier
from matplotlib.pyplot import figure, savefig, show
from dslabs_functions import CLASS_EVAL_METRICS, DELTA_IMPROVE, plot_multiline_chart
from dslabs_functions import read_train_test_from_files, plot_evaluation_results


def knn_study(
    trnX: ndarray, trnY: array, tstX: ndarray, tstY: array, k_max: int = 19, lag: int = 2, metric='accuracy'
) -> tuple[KNeighborsClassifier | None, dict]:
    dist: list[Literal['manhattan', 'euclidean', 'chebyshev']] = ['manhattan', 'euclidean', 'chebyshev']

    kvalues: list[int] = [i for i in range(1, k_max + 1, lag)]
    best_model: KNeighborsClassifier | None = None
    best_params: dict = {'name': 'KNN', 'metric': metric, 'params': ()}
    best_performance: float = 0.0

    # Test-set scores per distance function, one value for each k
    values: dict[str, list] = {}
    for d in dist:
        y_tst_values: list = []
        for k in kvalues:
            clf = KNeighborsClassifier(n_neighbors=k, metric=d)
            clf.fit(trnX, trnY)
            prdY: array = clf.predict(tstX)
            # 'score' rather than 'eval', to avoid shadowing the built-in
            score: float = CLASS_EVAL_METRICS[metric](tstY, prdY)
            y_tst_values.append(score)
            # Keep the model only if it improves on the best one by more than DELTA_IMPROVE
            if score - best_performance > DELTA_IMPROVE:
                best_performance = score
                best_params['params'] = (k, d)
                best_model = clf
            # print(f'KNN {d} k={k}')
        values[d] = y_tst_values
    # Double quotes outside: nesting the same quote type in an f-string fails before Python 3.12
    print(f"KNN best with k={best_params['params'][0]} and {best_params['params'][1]}")
    plot_multiline_chart(kvalues, values, title=f'KNN Models ({metric})', xlabel='k', ylabel=metric, percentage=True)

    return best_model, best_params


file_tag = 'stroke'
train_filename = 'data/stroke_train_smote.csv'
test_filename = 'data/stroke_test.csv'
target = 'stroke'
eval_metric = 'accuracy'

trnX, tstX, trnY, tstY, labels, vars = read_train_test_from_files(train_filename, test_filename, target)
print(f'Train#={len(trnX)} Test#={len(tstX)}')
print(f'Labels={labels}')

figure()
best_model, params = knn_study(trnX, trnY, tstX, tstY, k_max=25, metric=eval_metric)
savefig(f'images/{file_tag}_knn_{eval_metric}_study.png')
show()
Train#=6806 Test#=1533
Labels=[0, 1]
KNN best with k=1 and manhattan
Best model performance
After the plot, you can see the parameters for which the best results were achieved. So let's look at the performance of that model in terms of other metrics as well.
prd_trn: array = best_model.predict(trnX)
prd_tst: array = best_model.predict(tstX)
figure()
plot_evaluation_results(params, trnY, prd_trn, tstY, prd_tst, labels)
savefig(f'images/{file_tag}_knn_{params["name"]}_best_{params["metric"]}_eval.png')
show()
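To illustrate why looking at other metrics matters, the sketch below computes accuracy, recall, precision and F1 from the confusion counts of a binary problem. The function binary_scores and the labels are invented for this example; they are not the stroke results, and this is not necessarily how plot_evaluation_results computes its values.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Compute common evaluation metrics from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never present or predicted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up, imbalanced example: 8 negative records and 2 positive ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

scores = binary_scores(y_true, y_pred)
print(scores)  # {'accuracy': 0.8, 'recall': 0.5, 'precision': 0.5, 'f1': 0.5}
```

In this toy case the accuracy is 80%, yet only half of the positive records are found, which is the kind of gap that a single metric can hide when the classes are imbalanced.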
Overfitting study
The overfitting study helps identify the situations where a model becomes too closely adjusted to the training data. This occurs when the performance on the training data keeps improving while the performance on the test set deteriorates.
To carry out this study, we need to select the parameter that controls the specialization of the model. In the KNN algorithm this parameter is the number of neighbors, so we vary it over a chosen range, fixing the other parameters at the best values identified in the first study.
from matplotlib.pyplot import figure, savefig
distance: Literal["manhattan", "euclidean", "chebyshev"] = params["params"][1]
K_MAX = 25
kvalues: list[int] = [i for i in range(1, K_MAX, 2)]
y_tst_values: list = []
y_trn_values: list = []
acc_metric: str = "accuracy"
for k in kvalues:
    clf = KNeighborsClassifier(n_neighbors=k, metric=distance)
    clf.fit(trnX, trnY)
    # Predict on both sets so that the train and test curves can be compared
    prd_tst_Y: array = clf.predict(tstX)
    prd_trn_Y: array = clf.predict(trnX)
    y_tst_values.append(CLASS_EVAL_METRICS[acc_metric](tstY, prd_tst_Y))
    y_trn_values.append(CLASS_EVAL_METRICS[acc_metric](trnY, prd_trn_Y))

figure()
plot_multiline_chart(
    kvalues,
    {"Train": y_trn_values, "Test": y_tst_values},
    title=f"KNN overfitting study for {distance}",
    xlabel="K",
    ylabel=acc_metric,  # label the axis with the metric actually plotted
    percentage=True,
)
savefig(f"images/{file_tag}_knn_overfitting.png")
show()
In this case, we don't see any overfitting, since the performance on the train and test sets follows the same trend.
Note that, for KNN, the specialization of a model increases as the number of neighbors decreases.
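The relation between K and specialization can be checked on synthetic data. The sketch below is only an illustration: the dataset is generated with make_classification, not read from the stroke files, and the values of k and the manhattan metric are arbitrary choices. With k=1 every training record is its own nearest neighbor, so the training accuracy is perfect as long as there are no duplicated records with different labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, generated only for this illustration (not the stroke dataset).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, flip_y=0.1, random_state=0
)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.3, random_state=0)

train_acc: dict[int, float] = {}
test_acc: dict[int, float] = {}
for k in (1, 5, 15):
    clf = KNeighborsClassifier(n_neighbors=k, metric="manhattan").fit(X_trn, y_trn)
    # score() returns the mean accuracy on the given data
    train_acc[k] = clf.score(X_trn, y_trn)
    test_acc[k] = clf.score(X_tst, y_tst)
    print(f"k={k:2d} train={train_acc[k]:.3f} test={test_acc[k]:.3f}")
```

The model with k=1 memorizes the training set, so a large gap between its train and test accuracy is a sign of specialization rather than of real quality.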