Classification

Classification is one of the major tasks in data science, and can be performed through the sklearn package and its multiple subpackages. The image below summarizes the major classification techniques and the corresponding implementation packages in sklearn.

[Image: overview of the major classification techniques and their implementation packages in sklearn]

The classification task occurs in three steps:

  • first, we learn several models by training them over a labeled dataset, called the train dataset;
  • second, we evaluate the different models over an independent dataset - the test dataset - predicting the target variable for each of its records and comparing the predictions with the real values;
  • third, we choose the model showing the best performance and use it to predict the target value for unseen records.

Training Strategies

Whenever we are in the presence of a classification problem, the first thing to do is to identify the target or class, which is the variable to predict. The type of the target variable determines the kind of operation to perform: targets with just a few distinct values allow for a classification task, while real-valued targets require a prediction (regression) one.

In the presence of a classification task, assessing the target balance is mandatory, both to choose the most adequate balancing strategy (see Data balancing) and to elect the best metrics to evaluate the results achieved.
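As a minimal sketch of these two checks, using the same stroke dataset loaded later in this section, we can count the distinct values of the target and inspect its class distribution:

In [ ]:
from pandas import DataFrame, Series, read_csv

data: DataFrame = read_csv("data/stroke_mvi_encoded.csv", index_col="id")
target = "stroke"

# A target with only a few distinct values suggests a classification task;
# a real-valued target calls for a prediction (regression) task instead.
print(f"{target} has {data[target].nunique()} distinct values")

# The class distribution tells how unbalanced the problem is, guiding both
# the balancing strategy and the choice of evaluation metrics.
counts: Series = data[target].value_counts()
print(counts / counts.sum())  # relative frequency of each class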

After applying balancing techniques, if required, the next step before training is to choose the training strategy to apply. This strategy concerns the way to obtain the train and test datasets, and is chosen according to the dataset size:

  • k-fold cross validation (StratifiedKFold): used for small datasets, with up to a few thousand records;

  • hold-out (train_test_split): used for datasets with several thousands of records;

  • sample hold-out: used for very large datasets, with hundreds of thousands of records or more.

Remark: in each one of these strategies it is important to note that the split can't be completely random: it should keep the original distribution of the target variable. Moreover, the distribution of every variable should be kept in each data subset, which is usually achieved through a stratify parameter.
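For the k-fold strategy, a minimal sketch with StratifiedKFold might look like the following; the five folds and the GaussianNB classifier are arbitrary illustrative choices, and the stroke dataset is the one used throughout this section:

In [ ]:
from numpy import ndarray
from pandas import DataFrame, read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

data: DataFrame = read_csv("data/stroke_mvi_encoded.csv", index_col="id")
y: ndarray = data.pop("stroke").values
X: ndarray = data.values

# StratifiedKFold keeps the target distribution in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores: list[float] = []
for trn_idx, tst_idx in skf.split(X, y):
    clf = GaussianNB()
    clf.fit(X[trn_idx], y[trn_idx])
    scores.append(clf.score(X[tst_idx], y[tst_idx]))
print(f"mean accuracy over {len(scores)} folds: {sum(scores) / len(scores):.3f}")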

train_test_split function

As noted above, the training of classification models is done through the sklearn package. Since it is built on top of the numpy package, we need to pass numpy arrays (ndarray) as parameters to the different methods, like train_test_split.

In mathematical terms, classification aims to map the data X to values in the domain of the target variable, call it y.

After loading the data into the data dataframe, we need to separate the target variable from the rest of the data, since it plays a different role in the training procedure. Through the pop method, we get the class variable and simultaneously remove it from the dataframe. So, y will keep the target value for each record and X the ndarray containing the records themselves.

In [ ]:
from numpy import array, ndarray
from pandas import read_csv, DataFrame

file_tag = "stroke"
index_col = "id"
target = "stroke"
data: DataFrame = read_csv("data/stroke_mvi_encoded.csv", index_col=index_col)
labels: list = list(data[target].unique())
labels.sort()
print(f"Labels={labels}")

positive: int = 1
negative: int = 0
values: dict[str, list[int]] = {
    "Original": [
        len(data[data[target] == negative]),
        len(data[data[target] == positive]),
    ]
}

y: array = data.pop(target).to_list()
X: ndarray = data.values
Labels=[0, 1]

Be careful: when the class values are not transformed into a numeric format, sklearn treats the first value encountered in the data as 0. That correspondence may change from dataset to dataset, resulting in a very poor mapping, and the evaluation metrics become inconsistent.

For this reason, we should encode the class variable into a numeric format, avoiding all these inconsistencies.
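When the class comes as text (for instance "no"/"yes"), a minimal sketch of this encoding with sklearn's LabelEncoder could be the following; the labels shown are hypothetical, since in the stroke dataset the target is already numeric:

In [ ]:
from numpy import ndarray
from sklearn.preprocessing import LabelEncoder

raw_labels: list[str] = ["no", "yes", "no", "no", "yes"]  # hypothetical textual classes

# LabelEncoder fixes the value-to-integer correspondence (in sorted order),
# making it independent of the order in which values appear in the data.
encoder = LabelEncoder()
encoded: ndarray = encoder.fit_transform(raw_labels)
print(encoded)                 # [0 1 0 0 1]
print(list(encoder.classes_))  # ['no', 'yes']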

From the chart plotted below, we realize that the number of people with a stroke (1) is around five percent of the number of people without a stroke (0), making the dataset deeply unbalanced. Note that this distribution is kept unchanged in our train and test datasets; in order to make training easier, we should balance the data.

In [ ]:
from pandas import concat
from matplotlib.pyplot import figure, show
from sklearn.model_selection import train_test_split
from dslabs_functions import plot_multibar_chart


trnX, tstX, trnY, tstY = train_test_split(X, y, train_size=0.7, stratify=y)

train: DataFrame = concat(
    [DataFrame(trnX, columns=data.columns), DataFrame(trnY, columns=[target])], axis=1
)
train.to_csv(f"data/{file_tag}_train.csv", index=False)

test: DataFrame = concat(
    [DataFrame(tstX, columns=data.columns), DataFrame(tstY, columns=[target])], axis=1
)
test.to_csv(f"data/{file_tag}_test.csv", index=False)

values["Train"] = [
    len(train[train[target] == negative]),
    len(train[train[target] == positive]),
]
values["Test"] = [
    len(test[test[target] == negative]),
    len(test[test[target] == positive]),
]

figure(figsize=(6, 4))
plot_multibar_chart(labels, values, title="Data distribution per dataset")
show()
[Figure: Data distribution per dataset: class counts for the Original, Train and Test sets]

The code above applies a hold-out split through train_test_split, which receives both X and y as the data to split and returns each of them split in two: trnX will contain train_size (70%) of X and tstX the remaining 30%, and the same for y.
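The sample hold-out strategy mentioned earlier is not illustrated in this section; a minimal sketch, reusing the X and y arrays prepared above and keeping an arbitrary stratified 20% sample before the usual split, could be:

In [ ]:
from sklearn.model_selection import train_test_split

# Sample hold-out: first keep a stratified sample of the records, then apply
# the usual 70/30 hold-out split over that sample.
Xs, _, ys, _ = train_test_split(X, y, train_size=0.2, stratify=y, random_state=42)
trnXs, tstXs, trnYs, tstYs = train_test_split(
    Xs, ys, train_size=0.7, stratify=ys, random_state=42
)
print(f"sampled={len(Xs)} train={len(trnXs)} test={len(tstXs)}")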

Reading Train and Test datasets

Another useful function is one that reads both the train and test datasets from files, which we implement through the read_train_test_from_files function below.

In [ ]:
from pandas import read_csv


def read_train_test_from_files(
    train_fn: str, test_fn: str, target: str = "class"
) -> tuple[ndarray, ndarray, array, array, list, list]:
    train: DataFrame = read_csv(train_fn, index_col=None)
    labels: list = list(train[target].unique())
    labels.sort()
    trnY: array = train.pop(target).to_list()
    trnX: ndarray = train.values

    test: DataFrame = read_csv(test_fn, index_col=None)
    tstY: array = test.pop(target).to_list()
    tstX: ndarray = test.values
    return trnX, tstX, trnY, tstY, labels, train.columns.to_list()


file_tag = "stroke"
train_filename = "data/stroke_train_smote.csv"
test_filename = "data/stroke_test.csv"
target = "stroke"
eval_metric = "accuracy"

trnX: ndarray
tstX: ndarray
trnY: array
tstY: array
labels: list
vars: list
trnX, tstX, trnY, tstY, labels, vars = read_train_test_from_files(
    train_filename, test_filename, target
)
print(f"Train#={len(trnX)} Test#={len(tstX)}")
print(f"Labels={labels}")
Train#=6806 Test#=1533
Labels=[0, 1]

Estimators and Models

With the data split, we proceed to create the prediction model. However, there is a plethora of techniques and extensions, with an almost endless number of different parametrisations, and the choice of the best one to apply can only be made by comparing their results on our data. Additionally, each technique works better for data with some specific characteristics, which demands the application of some data preparation transformations.

In the sklearn package, an estimator is an object of a class extending BaseEstimator, which implements the fit and predict methods. Besides these, it also implements the score method. Estimator parametrization is done by passing the different choices as parameters to the constructor.

Note that in sklearn there is no abstract class for representing the models learnt; their effects are reachable through the estimator object. Indeed, an estimator is the result of parametrising a learning technique and training it over a particular dataset, creating a classification model.
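As a small illustration of parametrizing an estimator through its constructor (the techniques and values below are arbitrary choices, not recommendations):

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Each estimator is parametrised through its constructor; fit, predict and
# score then share the same interface, whatever the underlying technique.
knn = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
tree = DecisionTreeClassifier(max_depth=5, criterion="entropy")
print(knn.get_params()["n_neighbors"], tree.get_params()["max_depth"])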

Estimators Methods

fit(trnX: np.ndarray, trnY: np.ndarray)

trains the classifier over the data trnX labeled according to trnY, creating an internal model

predict(trnX: np.ndarray) -> np.ndarray

applies the learnt model to the records in trnX and returns their predicted labels

score(tstX: np.ndarray, tstY: np.ndarray) -> float

applies the model to tstX and compares the predicted labels to the labels in tstY, computing the model's mean accuracy on the given data

Next, we illustrate the use of these methods with the simplest estimator provided - the Gaussian Naive Bayes.

In [ ]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(trnX, trnY)
pred_trnY: array = clf.predict(trnX)
print(f"Score over Train: {clf.score(trnX, trnY):.3f}")
print(f"Score over Test: {clf.score(tstX, tstY):.3f}")
Score over Train: 0.772
Score over Test: 0.702

Evaluation

The evaluation of the results of each learnt model, in the classification paradigm, is objective and straightforward. We just need to assess whether the predicted labels are correct, which is done by counting the records where the predicted label is equal to the known one.

Accuracy, Recall and Precision

The simplest measure is accuracy, which reports the percentage of correct predictions. It is just the complement of the error rate, and it is what the score method of each classifier returns, after its training, when measured over a particular dataset and its known labels.

As we saw, by training the estimator we are able to see its mean accuracy through the score method. However, it is not practical to re-predict the labels every time we need to analyse a model's performance. To address this, sklearn provides an additional package - sklearn.metrics - that makes available a set of evaluation measures working directly over previously computed predictions.

In this package, accuracy is reported through the accuracy_score function; recall and precision are available through recall_score and precision_score, respectively, allowing for the analysis of the impact of the different types of errors.

In [ ]:
from sklearn.metrics import accuracy_score, recall_score, precision_score

pred_tstY: array = clf.predict(tstX)

acc: float = accuracy_score(tstY, pred_tstY)
recall: float = recall_score(tstY, pred_tstY)
prec: float = precision_score(tstY, pred_tstY)
print(f"accuracy={acc:.3f} recall={recall:.3f} precision={prec:.3f}")
accuracy=0.702 recall=0.827 precision=0.123

In our example, the results are not extraordinary, reaching around 70% accuracy and 83% recall, but only 12% precision.

Besides these three, there are several other measures that try to reflect the quality of the model, also available in the sklearn.metrics package. Next, we summarize the most used.

Classification metrics

recall_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

also called sensitivity and TP rate, reveals the model's ability to recognize the positive records, and is given by TP/(TP+FN); receives the known labels in tstY and the predicted ones in prdY

precision_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

reveals the model's ability to not misclassify negative records as positive, and is given by TP/(TP+FP); receives the known labels in tstY and the predicted ones in prdY

f1_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

computes the harmonic mean of precision and recall, given by 2 * (precision * recall) / (precision + recall); receives the known labels in tstY and the predicted ones in prdY

balanced_accuracy_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

reveals the average of the recall scores obtained for each class; receives the known labels in tstY and the predicted ones in prdY
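As a quick illustration of the last two measures, reusing the test labels and the Naive Bayes predictions pred_tstY computed above:

In [ ]:
from sklearn.metrics import f1_score, balanced_accuracy_score

f1: float = f1_score(tstY, pred_tstY)
bal_acc: float = balanced_accuracy_score(tstY, pred_tstY)
print(f"f1={f1:.3f} balanced accuracy={bal_acc:.3f}")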

Confusion Matrix

Despite the usefulness of those metrics, they all derive from the number of errors made by our classifier, which can be of two different kinds. In binary classification problems (there are only two possible outcomes for the target variable), this distinction is easy. Usually, we call the most common target value negative and the other one positive.

From this, we have:

  • true positives (TP): the number of positive records rightly predicted as positive;
  • true negatives (TN): the number of negative records rightly predicted as negative;
  • false positives (FP): the number of negative records wrongly predicted as positive;
  • false negatives (FN): the number of positive records wrongly predicted as negative.
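These four counts can be obtained directly from the known and predicted labels; a minimal sketch, reusing tstY, the Naive Bayes predictions pred_tstY, and the positive/negative values defined earlier:

In [ ]:
# Count each outcome by comparing known and predicted labels pairwise.
tp: int = sum(1 for real, pred in zip(tstY, pred_tstY) if real == positive and pred == positive)
tn: int = sum(1 for real, pred in zip(tstY, pred_tstY) if real == negative and pred == negative)
fp: int = sum(1 for real, pred in zip(tstY, pred_tstY) if real == negative and pred == positive)
fn: int = sum(1 for real, pred in zip(tstY, pred_tstY) if real == positive and pred == negative)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")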

The confusion matrix is the standard way to present these numbers, and is computed through the confusion_matrix function in the sklearn.metrics package.

In [ ]:
from pandas import unique
from sklearn.metrics import confusion_matrix

labels: list = list(unique(tstY))
labels.sort()

prdY: array = clf.predict(tstX)
cnf_mtx_tst: ndarray = confusion_matrix(tstY, prdY, labels=labels)
print(cnf_mtx_tst)
[[1014  444]
 [  13   62]]

Unfortunately, the printout does not show the correspondence between the matrix elements and the quantities enumerated above, since there is no universal convention for the order of rows and columns in the matrix. So the best way is to plot it in a specialized chart, like the one below.

In [ ]:
from itertools import product
from numpy import ndarray, set_printoptions, arange
from matplotlib.pyplot import gca, cm
from matplotlib.axes import Axes


def plot_confusion_matrix(cnf_matrix: ndarray, classes_names: ndarray, ax: Axes = None) -> Axes:  # type: ignore
    if ax is None:
        ax = gca()
    title = "Confusion matrix"
    set_printoptions(precision=2)
    tick_marks: ndarray = arange(0, len(classes_names), 1)
    ax.set_title(title)
    ax.set_ylabel("True label")
    ax.set_xlabel("Predicted label")
    ax.set_xticks(tick_marks)
    ax.set_yticks(tick_marks)
    ax.set_xticklabels(classes_names)
    ax.set_yticklabels(classes_names)
    ax.imshow(cnf_matrix, interpolation="nearest", cmap=cm.Blues)

    for i, j in product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        ax.text(
            j, i, format(cnf_matrix[i, j], "d"), color="y", horizontalalignment="center"
        )
    return ax


figure()
plot_confusion_matrix(cnf_mtx_tst, labels)
show()
[Figure: confusion matrix for GaussianNB over the test set]

ROC Charts

ROC charts are another means to understand a model's performance, in particular in the presence of binary unbalanced datasets. They present the balance between the True Positive rate (recall) and the False Positive rate in a graphical way, and are available through the RocCurveDisplay class in sklearn.metrics.

In [ ]:
from sklearn.metrics import RocCurveDisplay
from config import ACTIVE_COLORS


def plot_roc_chart(tstY: ndarray, predictions: dict, ax: Axes = None, target: str = "class") -> Axes:  # type: ignore
    if ax is None:
        ax = gca()
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.0)
    ax.set_xlabel("FP rate")
    ax.set_ylabel("TP rate")
    ax.set_title("ROC chart for %s" % target)

    ax.plot(
        [0, 1],
        [0, 1],
        color="navy",
        label="random",
        linewidth=1,
        linestyle="--",
        marker="",
    )
    models = list(predictions.keys())
    for i in range(len(models)):
        RocCurveDisplay.from_predictions(
            y_true=tstY,
            y_pred=predictions[models[i]],
            name=models[i],
            ax=ax,
            color=ACTIVE_COLORS[i],
            linewidth=1,
        )
    ax.legend(loc="lower right", fontsize="xx-small")
    return ax


figure()
plot_roc_chart(tstY, {"GaussianNB": prdY}, target=target)
show()
[Figure: ROC chart for stroke, comparing GaussianNB against the random baseline]

In addition to the chart, the area under the ROC curve, AUC for short, is another important measure, mostly for unbalanced datasets. It is available as roc_auc_score, in the sklearn.metrics package, and receives the known labels as its first parameter and the predicted labels as the second one.
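A quick illustration, reusing the test labels and the Naive Bayes predictions; note that passing predicted class labels, as here, evaluates a single operating point, while passing class probabilities (from predict_proba) yields the area under the full curve:

In [ ]:
from sklearn.metrics import roc_auc_score

auc: float = roc_auc_score(tstY, pred_tstY)
print(f"auc={auc:.3f}")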

In order to make it easier to analyse the performance of a given model, we can make use of a function that combines all those elements. We implement it through the plot_evaluation_results function, which receives as its first parameter a description of the model to evaluate. This description is a dictionary where the keys name, metric and params have to be present: name identifies the classifier trained, metric is the metric used to select the model, and params stores the parameters used to train it.

In [ ]:
from typing import Callable
from matplotlib.figure import Figure
from matplotlib.pyplot import subplots, savefig, figure
from sklearn.metrics import roc_auc_score, f1_score
from dslabs_functions import plot_multibar_chart, HEIGHT

CLASS_EVAL_METRICS: dict[str, Callable] = {
    "accuracy": accuracy_score,
    "recall": recall_score,
    "precision": precision_score,
    "auc": roc_auc_score,
    "f1": f1_score,
}


def plot_evaluation_results(
    model, trn_y, prd_trn, tst_y, prd_tst, labels: ndarray
) -> ndarray:
    evaluation: dict = {}
    for key in CLASS_EVAL_METRICS:
        evaluation[key] = [
            CLASS_EVAL_METRICS[key](trn_y, prd_trn),
            CLASS_EVAL_METRICS[key](tst_y, prd_tst),
        ]

    params_st: str = "" if () == model["params"] else str(model["params"])
    fig: Figure
    axs: ndarray
    fig, axs = subplots(1, 2, figsize=(2 * HEIGHT, HEIGHT))
    fig.suptitle(f'Best {model["metric"]} for {model["name"]} {params_st}')
    plot_multibar_chart(["Train", "Test"], evaluation, ax=axs[0], percentage=True)

    cnf_mtx_tst: ndarray = confusion_matrix(tst_y, prd_tst, labels=labels)
    plot_confusion_matrix(cnf_mtx_tst, labels, ax=axs[1])
    return axs


model_description: dict = {"name": "GaussianNB", "metric": eval_metric, "params": ()}

prd_trn: array = clf.predict(trnX)
prd_tst: array = clf.predict(tstX)
figure()
plot_evaluation_results(model_description, trnY, prd_trn, tstY, prd_tst, labels)
savefig(
    f'images/{file_tag}_{model_description["name"]}_best_{model_description["metric"]}_eval.png'
)

show()
[Figure: best accuracy for GaussianNB: performance metrics over Train and Test, and confusion matrix over the test set]

Among the techniques that we are going to use are GaussianNB, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier and MultiLayerPerceptron.

The rest of this module is organized in a similar way for each of the classification techniques: it first succinctly describes the technique and its main parameters; then we train different models through different parametrisations of the technique, using a 70% train - 30% test split strategy, and evaluate the accuracy of each model as explained, comparing the different results.
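As a minimal sketch of that workflow, reusing the train and test arrays prepared above (the KNeighborsClassifier technique, the values of k and the use of accuracy are arbitrary illustrative choices):

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train different parametrisations over the train set and compare them over
# the test set, keeping the best model according to the chosen metric.
best_acc: float = 0.0
best_model = None
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(trnX, trnY)
    acc: float = accuracy_score(tstY, clf.predict(tstX))
    print(f"k={k}: accuracy={acc:.3f}")
    if acc > best_acc:
        best_acc, best_model = acc, clf
print(f"best model: {best_model}")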