Classification

Decision Trees

Decision Trees are one of the friendliest kinds of models for classification, since they are interpretable and easy to apply. They are implemented in the DecisionTreeClassifier class of the sklearn.tree package.

In general, algorithms for training decision trees choose the best variable to split the dataset, so that each resulting branch contains a smaller mixture of classes. The algorithm then repeats the same procedure on each branch, until it reaches a pure leaf (a node where all records belong to the same class) or there are no more variables to split the data on.
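As a minimal illustration of this recursive procedure (using sklearn's iris toy dataset rather than the stroke data studied below), a tree grown without any limits keeps splitting until every leaf is pure:

In [ ]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# with no pre-pruning, the tree keeps splitting until every leaf is pure
toy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_tree.fit(X, y)
print(f"depth={toy_tree.get_depth()} leaves={toy_tree.get_n_leaves()}")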

Parameters study

The choice of the best variable is done according to a criterion: entropy or gini, implementing the information gain and Gini impurity functions, respectively.
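For reference, here is a small sketch of both impurity measures, written from their standard definitions (this is not sklearn's internal code, just an illustration of what each criterion computes at a node):

In [ ]:
from numpy import array, log2, ndarray

def node_entropy(p: ndarray) -> float:
    # H(p) = -sum_i p_i * log2(p_i), ignoring empty classes
    p = p[p > 0]
    return float(-(p * log2(p)).sum())

def node_gini(p: ndarray) -> float:
    # G(p) = 1 - sum_i p_i^2
    return float(1.0 - (p**2).sum())

mixed = array([0.5, 0.5])  # perfectly mixed node
pure = array([1.0, 0.0])   # pure node
print(node_entropy(mixed), node_gini(mixed))  # 1.0 0.5
print(node_entropy(pure), node_gini(pure))    # 0.0 0.0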

Among the several parameters, max_depth determines the maximum depth the tree may reach, implementing a pre-pruning strategy. Other parameters with similar effects are the min_samples_leaf, min_samples_split and min_impurity_decrease thresholds, which stop the tree from growing further.
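For instance, a classifier combining these thresholds could be configured as below; the concrete values are illustrative only and are not tuned for the stroke dataset:

In [ ]:
from sklearn.tree import DecisionTreeClassifier

# pre-pruning: stop growing when nodes become too small or splits stop paying off
pruned_tree = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=10,                # never grow deeper than 10 levels
    min_samples_split=20,        # do not split nodes with fewer than 20 records
    min_samples_leaf=10,         # every leaf must keep at least 10 records
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least 0.01
)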

In [ ]:
from typing import Literal
from numpy import array, ndarray
from matplotlib.pyplot import figure, savefig, show
from sklearn.tree import DecisionTreeClassifier
from dslabs_functions import CLASS_EVAL_METRICS, DELTA_IMPROVE, read_train_test_from_files
from dslabs_functions import plot_evaluation_results, plot_multiline_chart


def trees_study(
        trnX: ndarray, trnY: array, tstX: ndarray, tstY: array, d_max: int = 10, lag: int = 2, metric: str = 'accuracy'
        ) -> tuple:
    criteria: list[Literal['entropy', 'gini']] = ['entropy', 'gini']
    depths: list[int] = [i for i in range(2, d_max+1, lag)]

    best_model: DecisionTreeClassifier | None = None
    best_params: dict = {'name': 'DT', 'metric': metric, 'params': ()}
    best_performance: float = 0.0

    values: dict = {}
    for c in criteria:
        y_tst_values: list[float] = []
        for d in depths:
            clf = DecisionTreeClassifier(max_depth=d, criterion=c, min_impurity_decrease=0)
            clf.fit(trnX, trnY)
            prdY: array = clf.predict(tstX)
            eval: float = CLASS_EVAL_METRICS[metric](tstY, prdY)
            y_tst_values.append(eval)
            if eval - best_performance > DELTA_IMPROVE:
                best_performance = eval
                best_params['params'] = (c, d)
                best_model = clf
            # print(f'DT {c} and d={d}')
        values[c] = y_tst_values
    print(f"DT best with {best_params['params'][0]} and d={best_params['params'][1]}")
    plot_multiline_chart(depths, values, title=f'DT Models ({metric})', xlabel='d', ylabel=metric, percentage=True)

    return best_model, best_params

file_tag = 'stroke'
train_filename = 'data/stroke_train_smote.csv'
test_filename = 'data/stroke_test.csv'
target = 'stroke'
eval_metric = 'accuracy'

trnX, tstX, trnY, tstY, labels, vars = read_train_test_from_files(train_filename, test_filename, target)
print(f'Train#={len(trnX)} Test#={len(tstX)}')
print(f'Labels={labels}')

figure()
best_model, params = trees_study(trnX, trnY, tstX, tstY, d_max=25, metric=eval_metric)
savefig(f'images/{file_tag}_dt_{eval_metric}_study.png')
show()
Train#=6806 Test#=1533
Labels=[0, 1]
DT best with entropy and d=20
[figure: DT Models (accuracy) study over max_depth for the entropy and gini criteria]

Best model performance

As with the models learnt with other techniques, we may study the performance of the best tree discovered.

In [ ]:
prd_trn: array = best_model.predict(trnX)
prd_tst: array = best_model.predict(tstX)
figure()
plot_evaluation_results(params, trnY, prd_trn, tstY, prd_tst, labels)
savefig(f'images/{file_tag}_dt_{params["name"]}_best_{params["metric"]}_eval.png')
show()
[figure: evaluation results for the best tree over the train and test sets]

Variables importance

In order to show the learnt tree, we can use Graphviz, a graph visualization software, through the export_graphviz function from the sklearn.tree package. For this to work, we need to install that software; unfortunately, running pip is not enough.

Setting max_depth avoids presenting too many branches, which would make the tree unreadable. If we do not want to cut it, we can set max_depth to the best one found (params['params'][1]) or to None.

In [ ]:
from sklearn.tree import export_graphviz
from matplotlib.pyplot import imread, imshow, axis
from subprocess import call

tree_filename: str = f"images/{file_tag}_dt_{eval_metric}_best_tree"
max_depth2show = 3
st_labels: list[str] = [str(value) for value in labels]

# when out_file is given, export_graphviz writes the .dot file and returns None
export_graphviz(
    best_model,
    out_file=tree_filename + ".dot",
    max_depth=max_depth2show,
    feature_names=vars,
    class_names=st_labels,
    filled=True,
    rounded=True,
    impurity=False,
    special_characters=True,
    precision=2,
)
# Convert to png
call(
    ["dot", "-Tpng", tree_filename + ".dot", "-o", tree_filename + ".png", "-Gdpi=600"]
)

figure(figsize=(14, 6))
imshow(imread(tree_filename + ".png"))
axis("off")
show()
[figure: Graphviz rendering of the best tree, truncated at depth 3]

However, this is a very heavy image. In order to print a simpler version of the tree, we can just use the plot_tree function from the sklearn.tree package:

In [ ]:
from sklearn.tree import plot_tree

figure(figsize=(14, 6))
plot_tree(
    best_model,
    max_depth=max_depth2show,
    feature_names=vars,
    class_names=st_labels,
    filled=True,
    rounded=True,
    impurity=False,
    precision=2,
)
savefig(tree_filename + ".png")
[figure: plot_tree rendering of the best tree, truncated at depth 3]

Note that, looking at the tree, we are able to identify the most relevant variables to discriminate between the classes. Beside that, decision trees provide the feature_importances_ attribute, revealing the importance of each variable in the discrimination.

In [ ]:
from numpy import argsort
from dslabs_functions import plot_horizontal_bar_chart

importances = best_model.feature_importances_
indices: list[int] = argsort(importances)[::-1]
elems: list[str] = []
imp_values: list[float] = []
for f in range(len(vars)):
    elems += [vars[indices[f]]]
    imp_values += [importances[indices[f]]]
    print(f"{f+1}. {elems[f]} ({importances[indices[f]]})")

figure()
plot_horizontal_bar_chart(
    elems,
    imp_values,
    title="Decision Tree variables importance",
    xlabel="importance",
    ylabel="variables",
    percentage=True,
)
savefig(f"images/{file_tag}_dt_{eval_metric}_vars_ranking.png")
1. age (0.33753352214527566)
2. smoking_status (0.16339424761359403)
3. ever_married (0.0906455884128783)
4. avg_glucose_level (0.08807485604975059)
5. Residence_type (0.08189601049327727)
6. bmi (0.07812614656138524)
7. gender (0.0581436772126566)
8. heart_disease (0.04735401754120233)
9. work_type (0.02892692140541717)
10. hypertension (0.025905012564562825)
[figure: Decision Tree variables importance]

Overfitting study

For Decision Trees, one of the simplest parameters for creating specializations is the maximum depth allowed, with deeper trees having higher complexity.

In [ ]:
crit: Literal["entropy", "gini"] = params["params"][0]
d_max = 25
depths: list[int] = [i for i in range(2, d_max + 1, 1)]
y_tst_values: list[float] = []
y_trn_values: list[float] = []
acc_metric = "accuracy"
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, criterion=crit, min_impurity_decrease=0)
    clf.fit(trnX, trnY)
    prd_tst_Y: array = clf.predict(tstX)
    prd_trn_Y: array = clf.predict(trnX)
    y_tst_values.append(CLASS_EVAL_METRICS[acc_metric](tstY, prd_tst_Y))
    y_trn_values.append(CLASS_EVAL_METRICS[acc_metric](trnY, prd_trn_Y))

figure()
plot_multiline_chart(
    depths,
    {"Train": y_trn_values, "Test": y_tst_values},
    title=f"DT overfitting study for {crit}",
    xlabel="max_depth",
    ylabel=str(eval_metric),
    percentage=True,
)
savefig(f"images/{file_tag}_dt_{eval_metric}_overfitting.png")
[figure: DT overfitting study (train vs test accuracy over max_depth)]

In this case, there is no overfitting, since the performance over both the train and test datasets continues to improve.