Decision trees are a friendly kind of model for classification, since they are interpretable and easy to apply. In sklearn they are implemented by the DecisionTreeClassifier class of the sklearn.tree package.
In general, algorithms for training decision trees choose the best variable to split the dataset, so that each branch contains a smaller mixture of classes. The algorithm then repeats the same procedure on each branch, until it reaches a pure leaf (a node where all records belong to the same class) or there are no more variables to split on.
The choice of the best variable is made according to a criterion: entropy and gini, which implement the information gain and Gini impurity functions, respectively.
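To make both criteria and the recursive procedure concrete, the toy sketch below (our own code and names, not sklearn's implementation) computes the two impurity measures and grows a tree over categorical features; sklearn's CART implementation instead uses binary threshold splits.
import numpy as np

def entropy(y):
    # Entropy of the class distribution; information gain is its drop after a split.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions (0 for a pure node).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def grow(X, y, features, impurity=gini):
    # Pure leaf, or no variables left to split on: predict the majority class.
    if impurity(y) == 0.0 or not features:
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]
    # Choose the variable whose split leaves the smallest mixture of classes,
    # measured as the weighted impurity of the resulting branches.
    def split_impurity(f):
        col = X[:, f]
        return sum((col == v).mean() * impurity(y[col == v]) for v in np.unique(col))
    best = min(features, key=split_impurity)
    rest = [f for f in features if f != best]
    # Repeat the procedure on each branch of the chosen variable.
    return {best: {v: grow(X[X[:, best] == v], y[X[:, best] == v], rest, impurity)
                   for v in np.unique(X[:, best])}}

# Tiny usage example: the first column separates the classes perfectly.
X = np.array([['sunny', 'hot'], ['sunny', 'mild'], ['rain', 'mild'], ['rain', 'hot']])
y = np.array(['no', 'no', 'yes', 'yes'])
print(grow(X, y, [0, 1]))  # splits on feature 0: rain -> 'yes', sunny -> 'no'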
Among the several parameters, max_depth determines the maximum depth the tree may reach, implementing a pre-pruning strategy. Other parameters with similar effects are the min_samples_leaf, min_samples_split and min_impurity_decrease thresholds, which prevent the tree from growing beyond the given limits.
In particular, min_impurity_decrease is also a pre-pruning mechanism: a node is only split if the split decreases the impurity by at least the given threshold. Post-pruning, where a fully grown tree is simplified afterwards, is available in sklearn through cost-complexity pruning (the ccp_alpha parameter).
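For example, these thresholds are passed directly to the classifier's constructor; the values below are illustrative only, not recommendations.
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing branches early (illustrative values).
pruned = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,                 # never grow deeper than 5 levels
    min_samples_split=10,        # only split nodes holding at least 10 records
    min_samples_leaf=5,          # every leaf keeps at least 5 records
    min_impurity_decrease=0.01,  # only split if impurity drops by at least 0.01
)

# Post-pruning: grow the full tree, then prune it back by cost-complexity.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.005)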
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.tree import DecisionTreeClassifier
import ds_charts as ds  # course-provided charting helpers
file_tag = 'diabetes'
filename = 'data/diabetes'
target = 'class'

# Load the train and test splits, separating the target from the features.
train: pd.DataFrame = pd.read_csv(f'{filename}_train.csv')
trnY: np.ndarray = train.pop(target).values
trnX: np.ndarray = train.values
labels = pd.unique(trnY)

test: pd.DataFrame = pd.read_csv(f'{filename}_test.csv')
tstY: np.ndarray = test.pop(target).values
tstX: np.ndarray = test.values
min_impurity_decrease = [0.025, 0.01, 0.005, 0.0025, 0.001]
max_depths = [2, 5, 10, 15, 20, 25]
criteria = ['entropy', 'gini']
best = ('', 0, 0.0)  # (criterion, max_depth, min_impurity_decrease)
last_best = 0
best_tree = None

fig, axs = plt.subplots(1, 2, figsize=(16, 4), squeeze=False)
for k in range(len(criteria)):
    f = criteria[k]
    values = {}
    for d in max_depths:
        yvalues = []
        for imp in min_impurity_decrease:
            tree = DecisionTreeClassifier(max_depth=d, criterion=f, min_impurity_decrease=imp)
            tree.fit(trnX, trnY)
            prdY = tree.predict(tstX)
            yvalues.append(metrics.accuracy_score(tstY, prdY))
            if yvalues[-1] > last_best:
                best = (f, d, imp)
                last_best = yvalues[-1]
                best_tree = tree
        values[d] = yvalues
    ds.multiple_line_chart(min_impurity_decrease, values, ax=axs[0, k], title=f'Decision Trees with {f} criteria',
                           xlabel='min_impurity_decrease', ylabel='accuracy', percentage=True)
plt.savefig(f'images/{file_tag}_dt_study.png')
plt.show()
print('Best results achieved with %s criteria, depth=%d and min_impurity_decrease=%1.2f ==> accuracy=%1.2f' % (best[0], best[1], best[2], last_best))
Best results achieved with gini criteria, depth=5 and min_impurity_decrease=0.00 ==> accuracy=0.79
To display the learned tree, we can use the graphviz package.
from sklearn.tree import export_graphviz

# Export the tree in dot format; with out_file set, export_graphviz writes
# the file and returns None, so there is no need to keep its result.
file_tree = 'best_tree.png'
export_graphviz(best_tree, out_file='best_tree.dot', filled=True, rounded=True, special_characters=True)

# Convert the dot file to png with the graphviz command-line tool.
from subprocess import call
call(['dot', '-Tpng', 'best_tree.dot', '-o', file_tree, '-Gdpi=600'])

plt.figure(figsize=(14, 18))
plt.imshow(plt.imread(file_tree))
plt.axis('off')
plt.show()
However, this produces a very heavy image. To print a simpler version of the tree, we can just use:
from sklearn import tree

# class_names must be strings, ordered as in best_tree.classes_.
tree.plot_tree(best_tree, feature_names=list(train.columns), class_names=[str(c) for c in best_tree.classes_])
plt.savefig(f'images/{file_tag}_dt_best_tree.png')
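Alternatively, sklearn offers a purely textual view of the rules through export_text (available since scikit-learn 0.21):
from sklearn.tree import export_text

# Print the learned tree as indented decision rules.
print(export_text(best_tree, feature_names=list(train.columns)))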
The plots above show the parameters for which the best results were achieved. Let's now look at the performance of that model in terms of other metrics.
prd_trn = best_tree.predict(trnX)
prd_tst = best_tree.predict(tstX)
ds.plot_evaluation_results(labels, trnY, prd_trn, tstY, prd_tst)
plt.savefig(f'images/{file_tag}_dt_best.png')
plt.show()
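If the ds_charts helper is not available, the standard sklearn metrics already imported above give a comparable summary:
# Per-class precision, recall and F1 on both splits, plus the test confusion matrix.
print(metrics.classification_report(trnY, prd_trn))
print(metrics.classification_report(tstY, prd_tst))
print(metrics.confusion_matrix(tstY, prd_tst))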