Classification
Decision Trees
Decision Trees are a friendly kind of model for classification, since they are interpretable and easy to apply. They are implemented in the DecisionTreeClassifier class of the sklearn.tree package.
In general, algorithms for training decision trees choose the variable that best splits the dataset, so that each resulting branch contains a smaller mixture of classes. The algorithm then repeats the process on each branch, until it reaches a pure leaf (a node where all records belong to the same class) or there are no more variables left to split the data.
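As a toy illustration of this idea (not part of the original notebook; all names here are hypothetical), the sketch below computes the entropy of a small binary node before and after a candidate split; the variable yielding the largest information gain would be chosen first.

from numpy import array, log2, ndarray

def entropy(y: ndarray) -> float:
    # entropy of a class vector: -sum(p * log2(p)) over the classes present in the node
    probs = array([(y == c).mean() for c in set(y.tolist())])
    return float(-(probs * log2(probs)).sum())

def information_gain(parent: ndarray, left: ndarray, right: ndarray) -> float:
    # entropy of the parent node minus the weighted entropy of the two branches
    w_left, w_right = len(left) / len(parent), len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

parent = array([0, 0, 0, 1, 1, 1, 1, 1])                # node with 3 class-0 and 5 class-1 records
left, right = array([0, 0, 0, 1]), array([1, 1, 1, 1])  # candidate split into two branches
print(f"information gain: {information_gain(parent, left, right):.3f}")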
Parameters study
The choice of the best variable is done according to a criterion: entropy and gini implement the information gain and Gini impurity functions, respectively.
Among the several parameters, max_depth determines the maximum depth the tree is allowed to reach, implementing a pre-pruning strategy. Other parameters with similar effects are the min_samples_leaf, min_samples_split and min_impurity_decrease thresholds, which stop the tree from growing further.
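Before the full study below, here is a minimal, hypothetical sketch of how these pre-pruning thresholds plug into DecisionTreeClassifier; it uses synthetic data (make_classification is just a stand-in for the actual dataset), and the constrained tree is expected to come out shallower than an unconstrained one.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

full = DecisionTreeClassifier(criterion="entropy").fit(X, y)
pruned = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=5,                 # stop growing below depth 5
    min_samples_leaf=10,         # every leaf must keep at least 10 records
    min_samples_split=20,        # only split nodes holding at least 20 records
    min_impurity_decrease=0.01,  # require a minimum impurity reduction per split
).fit(X, y)

print(full.get_depth(), pruned.get_depth())  # the pre-pruned tree is shallower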
from typing import Literal
from numpy import array, ndarray
from matplotlib.pyplot import figure, savefig, show
from sklearn.tree import DecisionTreeClassifier
from dslabs_functions import CLASS_EVAL_METRICS, DELTA_IMPROVE, read_train_test_from_files
from dslabs_functions import plot_evaluation_results, plot_multiline_chart
def trees_study(
    trnX: ndarray, trnY: array, tstX: ndarray, tstY: array, d_max: int = 10, lag: int = 2, metric: str = 'accuracy'
) -> tuple:
    criteria: list[Literal['entropy', 'gini']] = ['entropy', 'gini']
    depths: list[int] = [i for i in range(2, d_max + 1, lag)]
    best_model: DecisionTreeClassifier | None = None
    best_params: dict = {'name': 'DT', 'metric': metric, 'params': ()}
    best_performance: float = 0.0
    values: dict = {}
    for c in criteria:
        y_tst_values: list[float] = []
        for d in depths:
            clf = DecisionTreeClassifier(max_depth=d, criterion=c, min_impurity_decrease=0)
            clf.fit(trnX, trnY)
            prdY: array = clf.predict(tstX)
            eval: float = CLASS_EVAL_METRICS[metric](tstY, prdY)
            y_tst_values.append(eval)
            if eval - best_performance > DELTA_IMPROVE:
                best_performance = eval
                best_params['params'] = (c, d)
                best_model = clf
                # print(f'DT {c} and d={d}')
        values[c] = y_tst_values
    print(f"DT best with {best_params['params'][0]} and d={best_params['params'][1]}")
    plot_multiline_chart(depths, values, title=f'DT Models ({metric})', xlabel='d', ylabel=metric, percentage=True)
    return best_model, best_params
file_tag = 'stroke'
train_filename = 'data/stroke_train_smote.csv'
test_filename = 'data/stroke_test.csv'
target = 'stroke'
eval_metric = 'accuracy'
trnX, tstX, trnY, tstY, labels, vars = read_train_test_from_files(train_filename, test_filename, target)
print(f'Train#={len(trnX)} Test#={len(tstX)}')
print(f'Labels={labels}')
figure()
best_model, params = trees_study(trnX, trnY, tstX, tstY, d_max=25, metric=eval_metric)
savefig(f'images/{file_tag}_dt_{eval_metric}_study.png')
show()
Train#=6806 Test#=1533
Labels=[0, 1]
DT best with entropy and d=20
Best model performance
As with the models learnt with other techniques, we can study the performance of the best tree found.
prd_trn: array = best_model.predict(trnX)
prd_tst: array = best_model.predict(tstX)
figure()
plot_evaluation_results(params, trnY, prd_trn, tstY, prd_tst, labels)
savefig(f'images/{file_tag}_dt_{params["name"]}_best_{params["metric"]}_eval.png')
show()
Variables importance
In order to show the learnt tree, we can use graphviz, a graph visualization software, through the export_graphviz function from the sklearn.tree package. For this to work, we need to install that software, and unfortunately running pip is not enough.
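Since sklearn only produces a .dot file and the conversion to an image relies on the system-level dot executable, a small optional check (not part of the original code) can confirm Graphviz is available before exporting:

from shutil import which

# Graphviz must be installed at the system level (e.g. via the OS package manager
# or conda); the pip package alone does not provide the `dot` executable
if which("dot") is None:
    print("Graphviz 'dot' executable not found -- install Graphviz before exporting the tree")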
Setting max_depth avoids presenting too many branches, which would make the tree imperceptible. If we do not want to cut it, we can set max_depth to the best depth found (params['params'][1]) or to None.
from sklearn.tree import export_graphviz
from matplotlib.pyplot import imread, imshow, axis
from subprocess import call
tree_filename: str = f"images/{file_tag}_dt_{eval_metric}_best_tree"
max_depth2show = 3
st_labels: list[str] = [str(value) for value in labels]
export_graphviz(
    best_model,
    out_file=tree_filename + ".dot",
    max_depth=max_depth2show,
    feature_names=vars,
    class_names=st_labels,
    filled=True,
    rounded=True,
    impurity=False,
    special_characters=True,
    precision=2,
)
# Convert the dot file to png (requires the graphviz `dot` executable)
call(
    ["dot", "-Tpng", tree_filename + ".dot", "-o", tree_filename + ".png", "-Gdpi=600"]
)
figure(figsize=(14, 6))
imshow(imread(tree_filename + ".png"))
axis("off")
show()
However, this is a very heavy image. In order to print a simpler version of the tree, we can just use the plot_tree function from the sklearn.tree package:
from sklearn.tree import plot_tree
figure(figsize=(14, 6))
plot_tree(
    best_model,
    max_depth=max_depth2show,
    feature_names=vars,
    class_names=st_labels,
    filled=True,
    rounded=True,
    impurity=False,
    precision=2,
)
savefig(tree_filename + ".png")
Note that, looking at the tree, we are able to identify the most relevant variables for discriminating between the classes. Besides that, decision trees provide the feature_importances_ attribute, revealing the importance of each variable in the discrimination.
from numpy import argsort
from dslabs_functions import plot_horizontal_bar_chart
importances = best_model.feature_importances_
indices: list[int] = argsort(importances)[::-1]
elems: list[str] = []
imp_values: list[float] = []
for f in range(len(vars)):
    elems += [vars[indices[f]]]
    imp_values += [importances[indices[f]]]
    print(f"{f+1}. {elems[f]} ({importances[indices[f]]})")
figure()
plot_horizontal_bar_chart(
    elems,
    imp_values,
    title="Decision Tree variables importance",
    xlabel="importance",
    ylabel="variables",
    percentage=True,
)
savefig(f"images/{file_tag}_dt_{eval_metric}_vars_ranking.png")
1. age (0.33753352214527566)
2. smoking_status (0.16339424761359403)
3. ever_married (0.0906455884128783)
4. avg_glucose_level (0.08807485604975059)
5. Residence_type (0.08189601049327727)
6. bmi (0.07812614656138524)
7. gender (0.0581436772126566)
8. heart_disease (0.04735401754120233)
9. work_type (0.02892692140541717)
10. hypertension (0.025905012564562825)
Overfitting study
For Decision Trees, one of the simplest parameters for creating increasingly specialized models is the maximum depth allowed: the larger the tree, the higher its complexity.
crit: Literal["entropy", "gini"] = params["params"][0]
d_max = 25
depths: list[int] = [i for i in range(2, d_max + 1, 1)]
y_tst_values: list[float] = []
y_trn_values: list[float] = []
acc_metric = "accuracy"
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, criterion=crit, min_impurity_decrease=0)
    clf.fit(trnX, trnY)
    prd_tst_Y: array = clf.predict(tstX)
    prd_trn_Y: array = clf.predict(trnX)
    y_tst_values.append(CLASS_EVAL_METRICS[acc_metric](tstY, prd_tst_Y))
    y_trn_values.append(CLASS_EVAL_METRICS[acc_metric](trnY, prd_trn_Y))
figure()
plot_multiline_chart(
    depths,
    {"Train": y_trn_values, "Test": y_tst_values},
    title=f"DT overfitting study for {crit}",
    xlabel="max_depth",
    ylabel=str(eval_metric),
    percentage=True,
)
savefig(f"images/{file_tag}_dt_{eval_metric}_overfitting.png")
In this case, there is no overfitting, since the performance over both the train and test datasets keeps improving.
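As an optional complement (not in the original notebook), that visual reading could be quantified from the lists collected above, for example by looking at the largest train-test gap and at the depth where the test score peaks:

# reusing the depths, y_trn_values and y_tst_values lists computed above
gaps = [trn - tst for trn, tst in zip(y_trn_values, y_tst_values)]
best_d = depths[y_tst_values.index(max(y_tst_values))]
print(f"largest train-test gap: {max(gaps):.3f}")
print(f"depth with best test {acc_metric}: {best_d}")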