Lab 3: Classification

Classification is one of the major tasks in data science, and can be performed through the sklearn package and its multiple subpackages. The image below summarizes the major classification techniques and the corresponding implementation packages in sklearn.

The classification task occurs in three steps, addressed in the sections below.

Training Models

Whenever we are in the presence of a classification problem, the first thing to do is to identify the target or class, which is the variable to predict. The type of the target variable determines the kind of task to perform: targets with just a few values allow for a classification task, while real-valued targets require a prediction (regression) one.

In the presence of a classification task, identifying the class balance of the target is mandatory, in order to choose the most adequate balancing strategy (see Data balancing) and to select the best metrics to evaluate the results achieved.

After applying balancing techniques, if required, the next step in the training procedure is to choose the best training strategy to apply. This strategy concerns the way to get the train and test datasets, and is chosen in accordance with the dataset size: a hold-out split is usually adequate for larger datasets, while cross-validation makes better use of smaller ones.

Remark: in each of these strategies, it is important to note that the split can't be completely random; it should keep the original distribution of the target variable. Moreover, the distribution of every variable should be kept in each data subset, which is usually achieved through the stratify parameter.

train_test_split function

As noted above, the training of classification models is achieved through the sklearn package. Since it is built on top of the numpy package, we need to pass numpy arrays (ndarray) as parameters to the different methods, like train_test_split.

In mathematical terms, classification aims to map the data X to values in the domain of the target variable, which we call y.

After loading the data into the data dataframe, we need to separate the target variable from the rest of the data, since it plays a different role in the training procedure. Through the application of the pop method, we get the class variable, simultaneously removing it from the dataframe. So y will keep the ndarray with the target value for each record, and X the ndarray containing the records themselves.
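A minimal sketch of these steps (the file name and the target column name, class, are assumptions for illustration):

```python
import pandas as pd

# load the dataset (file name is an assumption)
data = pd.read_csv('diabetes.csv')

# pop returns the target column and removes it from the dataframe
y = data.pop('class').values  # ndarray with the target value of each record
X = data.values               # ndarray with the remaining variables
```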

From the chart plotted, we realize that the number of people with diabetes (the P records) is around half of the number of people without diabetes (the N records), making the dataset slightly unbalanced. This distribution was kept unchanged in our train and test datasets; and, in order to make the training easier, we could balance the data.
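A minimal sketch of how these train and test datasets are obtained, stratified on y as discussed in the remark above:

```python
from sklearn.model_selection import train_test_split

# 70% of the records for training, the remaining 30% for testing;
# stratify=y keeps the class distribution in both subsets
trnX, tstX, trnY, tstY = train_test_split(X, y, train_size=0.7, stratify=y)
```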

The code above applies a hold-out split, through train_test_split, which receives both X and y as the data to split, and returns each of them split in two: trnX will contain train_size (70%) of X and tstX will contain the remaining 30%, and the same for y (trnY and tstY).

Evaluation

The evaluation of the results of each learnt model, in the classification paradigm, is objective and straightforward. We just need to assess if the predicted labels are correct, which is done by counting the number of records where the predicted label is equal to the known one.

Accuracy

The simplest measure is accuracy, which reports the percentage of correct predictions. It is just the complement of the error rate. In sklearn, accuracy is reported through the score method of each classifier, after its training, and is measured over a particular dataset and its known labels.
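As a sketch, training the naive Bayes classifier used in this lab and scoring it over the test set:

```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(trnX, trnY)

# mean accuracy over the test set
print(clf.score(tstX, tstY))
```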

In our example, naive Bayes presents a not so good accuracy, only just above 70%. But by itself, this number doesn't allow for understanding which errors are being made - where naive Bayes is struggling.

Confusion Matrix

Indeed, we first need to distinguish among the errors. In the presence of binary classification problems (where there are only two possible outcomes for the target variable), this distinction is easy. Usually, we call the most common target value negative and the other one positive.

From this, we have:

- True Positives (TP): positive records correctly classified as positive;
- True Negatives (TN): negative records correctly classified as negative;
- False Positives (FP): negative records wrongly classified as positive;
- False Negatives (FN): positive records wrongly classified as negative.

The confusion matrix is the standard way to present these numbers, and is computed through the confusion_matrix function in the sklearn.metrics package.

Unfortunately, the matrix alone doesn't show the correspondence between its elements and the numbers enumerated above, since there is no universal standard for how rows and columns are assigned. So the best way is to plot it in a specialized chart, like the one below.
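A sketch of both computations; ConfusionMatrixDisplay is one way to make the label correspondence explicit (in sklearn's own convention, rows are the known labels and columns the predicted ones):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# predicted labels for the test set
prdY = clf.predict(tstX)

cnf_mtx = confusion_matrix(tstY, prdY)
print(cnf_mtx)

# plot the matrix with the class labels on both axes
ConfusionMatrixDisplay(cnf_mtx, display_labels=clf.classes_).plot()
plt.show()
```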

Whenever in the presence of a non-binary classification problem, we adapt those notions to each possible combination of classes. For example, in the iris dataset we have 3 different classes: iris-setosa, iris-versicolor and iris-virginica.
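A minimal sketch using sklearn's built-in copy of the iris dataset, producing a 3x3 matrix:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

X_i, y_i = load_iris(return_X_y=True)
trnX_i, tstX_i, trnY_i, tstY_i = train_test_split(
    X_i, y_i, train_size=0.7, stratify=y_i)

clf_i = GaussianNB().fit(trnX_i, trnY_i)
# one row and one column per iris class
print(confusion_matrix(tstY_i, clf_i.predict(tstX_i)))
```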

Besides the confusion matrix, there are several other measures that try to reflect the quality of the model, also available in the sklearn.metrics package. Next, we summarize the most used.

Classification metrics

recall_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

also called sensitivity and TP rate, reveals the model's ability to recognize the positive records, and is given by TP/(TP+FN); receives the known labels in tstY and the predicted ones in prdY

precision_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

reveals the model's ability to not misclassify negative records, and is given by TP/(TP+FP); receives the known labels in tstY and the predicted ones in prdY

f1_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

computes the harmonic mean of precision and recall, given by 2 * (precision * recall) / (precision + recall); receives the known labels in tstY and the predicted ones in prdY

balanced_accuracy_score(tstY: np.ndarray, prdY: np.ndarray) -> [0..1]

computes the average of the recall scores over all the classes; receives the known labels in tstY and the predicted ones in prdY
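Using the predictions computed above, a sketch of these calls (pos_label names the positive class; 'P' follows the diabetes example and is an assumption):

```python
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

# 'P' is assumed to be the positive label in the diabetes example
print('recall:           ', recall_score(tstY, prdY, pos_label='P'))
print('precision:        ', precision_score(tstY, prdY, pos_label='P'))
print('f1:               ', f1_score(tstY, prdY, pos_label='P'))
print('balanced accuracy:', balanced_accuracy_score(tstY, prdY))
```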

ROC Charts

ROC charts are another means to understand models' performance, in particular in the presence of binary unbalanced datasets. They present the balance between the True Positive rate (recall) and the False Positive rate in a graphical way, and are available through the roc_curve function in the sklearn.metrics package.

However, roc charts require two parameters: tstY and scores. While the first one is just the labels for the test set, the second one reflects a kind of probability of each record in the test set being positive. These scores are not trivial to get from some classification techniques, but can be computed through the predict_proba method of their estimators. It works like the predict method, but instead of returning the predictions themselves, it returns those scores.

In addition to the chart, the area under the roc curve, AUC for short, is another important measure, mostly for unbalanced datasets. It is available as roc_auc_score, in the sklearn.metrics package, and receives the known labels in its first parameter and the previously computed scores as the second one, just like the roc_curve function.
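A sketch of both, reusing the estimator trained above (again, 'P' as the positive label is an assumption; classes_ tells which predict_proba column corresponds to it):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# probability of each test record being positive: pick the predict_proba
# column corresponding to the assumed positive label 'P'
scores = clf.predict_proba(tstX)[:, list(clf.classes_).index('P')]

fpr, tpr, _ = roc_curve(tstY, scores, pos_label='P')
# for binary string labels, roc_auc_score treats the lexicographically
# larger label ('P' here) as the positive one
print('auc:', roc_auc_score(tstY, scores))

plt.plot(fpr, tpr)
plt.xlabel('FP rate')
plt.ylabel('TP rate (recall)')
plt.show()
```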

Estimators and Models

With the data split, we proceed to create the prediction model. However, there is a plethora of techniques and extensions, with an infinite number of different parametrisations, and the choice of the best one to apply can only be made by comparing their results on our data. Additionally, each technique works better for data with some specific characteristics, which demands the application of some data preparation transformations.

In sklearn, an estimator is an object of an extension of the BaseEstimator class, which implements the fit and predict methods. Besides these, it also implements the score method. Estimator parametrisation is done by passing the different choices as parameters to their constructor methods.

Note that in sklearn there is no class for representing the learnt models, but their effects are reachable through the estimator object. Indeed, an estimator is the result of parametrising a learning technique and training it over a particular dataset, thus creating a classification model.

Estimators Methods

fit(trnX: np.ndarray, trnY: np.ndarray)

trains the classifier over the data trnX labeled according to trnY, creating an internal model

predict(X: np.ndarray) -> np.ndarray

applies the learnt model to the data in X and returns the predicted label for each record

score(tstX: np.ndarray, tstY: np.ndarray) -> float

applies the model to tstX and compares the predicted labels to the labels in tstY, computing the model's mean accuracy on the given data
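A minimal sketch of this protocol, using one of the estimators listed below (the parametrisation shown is just sklearn's default made explicit):

```python
from sklearn.neighbors import KNeighborsClassifier

# parametrisation is passed to the constructor
knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(trnX, trnY)           # learn a model from the training data
prdY = knn.predict(tstX)      # predicted labels for the test records
print(knn.score(tstX, tstY))  # mean accuracy over the test set
```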

Among the techniques that we are going to use are: GaussianNB, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier and GradientBoostingClassifier.

The rest of this module is organized in a similar way for each of the classification techniques: first, it succinctly describes the technique and its main parameters; then we train different models through different parametrisations of the technique, using a 70% train - 30% test split strategy, and evaluate the accuracy of each model as explained, comparing the different results.