Lab 3: Classification (cont.)

Decision Trees

Decision Trees are a friendly kind of model for classification, since they are interpretable and easy to apply. They are implemented in the DecisionTreeClassifier class of the sklearn.tree package.

In general, algorithms for training decision trees choose the best variable to split the dataset, so that each resulting branch contains a smaller mixture of classes. The algorithm then repeats the process on each branch, until it reaches a pure leaf (a node where all records belong to the same class) or there are no more variables left to split the data on.
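
A minimal sketch of training and applying such a tree (the Iris data and the 70/30 split below are placeholders, standing in for whatever dataset the lab uses):

    from sklearn.datasets import load_iris                 # placeholder data, not the lab's dataset
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y, random_state=0)

    tree = DecisionTreeClassifier()       # defaults: gini criterion, no depth limit
    tree.fit(X_train, y_train)            # grow the tree on the training split
    y_pred = tree.predict(X_test)         # classify the held-out records
    print(tree.score(X_test, y_test))     # accuracy on the test split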

The choice of the best variable is made according to a criterion: entropy, which implements information gain, or gini, which implements Gini impurity.
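
For illustration, the criterion is passed by name when the classifier is created:

    from sklearn.tree import DecisionTreeClassifier

    tree_gini = DecisionTreeClassifier(criterion='gini')        # Gini impurity (the default)
    tree_entropy = DecisionTreeClassifier(criterion='entropy')  # information gain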

Among the several parameters, max_depth determines the maximum depth the tree may reach, implementing a pre-pruning strategy. Other parameters with a similar effect are the min_samples_leaf, min_samples_split and min_impurity_decrease thresholds, which stop the tree from growing further.

The min_impurity_decrease parameter only allows a node to be split if the split decreases the impurity by at least the given value. Strictly speaking, this is still a pre-pruning strategy, since it stops the tree from growing rather than cutting it back; post-pruning of an already grown tree is available through the ccp_alpha parameter (minimal cost-complexity pruning).
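
A sketch of both strategies (the threshold values below are illustrative, not the lab's choices):

    from sklearn.tree import DecisionTreeClassifier

    # Pre-pruning: stop the growth early
    pre_pruned = DecisionTreeClassifier(max_depth=5,
                                        min_samples_leaf=10,          # each leaf keeps at least 10 records
                                        min_impurity_decrease=0.01)   # split only if impurity drops by at least 0.01

    # Post-pruning: grow the full tree, then cut back the weakest branches
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)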

In order to show the learned tree, we can use the graphviz package.
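
A possible way to do it, assuming the fitted tree from the first sketch (the file name and styling are arbitrary):

    from sklearn.tree import export_graphviz
    import graphviz

    # feature_names= and class_names= can also be passed to make the nodes more readable
    dot_data = export_graphviz(tree, out_file=None, filled=True)   # DOT description of the fitted tree
    graph = graphviz.Source(dot_data)
    graph.render('decision_tree', format='png')                    # writes decision_tree.png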

However, this results in a very heavy image, so it is useful to be able to print a simpler version of the tree.
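
One option (a sketch, not necessarily the exact call used in the lab, again assuming the fitted tree from the first sketch) is export_text, which prints the learned rules as plain text; plot_tree is a lighter matplotlib alternative:

    import matplotlib.pyplot as plt
    from sklearn.tree import export_text, plot_tree

    # Plain-text rendering of the learned rules
    print(export_text(tree))

    # Or a lighter drawing, showing only the top levels of the tree
    plot_tree(tree, filled=True, max_depth=2)
    plt.show()

The next step is to compare different parameter settings and plot the resulting scores; the lab's own code is not reproduced here, but a minimal sketch of such a study, assuming accuracy as the comparison metric and the split from the first sketch, could look like:

    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score
    from sklearn.tree import DecisionTreeClassifier

    depths = [2, 5, 10, 15, 20]
    for criterion in ('gini', 'entropy'):
        scores = []
        for d in depths:
            clf = DecisionTreeClassifier(criterion=criterion, max_depth=d)
            clf.fit(X_train, y_train)
            scores.append(accuracy_score(y_test, clf.predict(X_test)))
        plt.plot(depths, scores, label=criterion)   # one accuracy curve per criterion
    plt.xlabel('max_depth')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()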

From the plot we can identify the parameters for which the best results were achieved. Let us now look at the performance of the model trained with those parameters in terms of other metrics.
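
A sketch of that evaluation, assuming the split from the first sketch and placeholder values for the best parameters:

    from sklearn.metrics import confusion_matrix, classification_report
    from sklearn.tree import DecisionTreeClassifier

    best_tree = DecisionTreeClassifier(criterion='entropy', max_depth=5)   # placeholder "best" parameters
    best_tree.fit(X_train, y_train)
    y_pred = best_tree.predict(X_test)

    print(confusion_matrix(y_test, y_pred))        # per-class hits and misses
    print(classification_report(y_test, y_pred))   # precision, recall and f1 per class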