Random Forests, implemented through the RandomForestClassifier in the sklearn.ensemble package, are one of the most powerful classification techniques, while remaining simple and easy to apply. The classifier trains an ensemble of decision trees, whose number is set by the n_estimators parameter. Each tree, however, is trained over a different subset of the original training data, considering only a subset of k variables describing the data at each split, with k determined by the max_features parameter. Among many other parameters, we can also choose the maximum size of each tree, through the max_depth parameter.
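As a minimal illustration (the parameter values here are arbitrary choices, not recommendations), a single forest combining these three parameters could be instantiated as follows:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each limited to depth 10, considering 50% of the variables at each split
rf = RandomForestClassifier(n_estimators=100, max_depth=10, max_features=0.5)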
Next, we can see the results achieved by a set of parameter combinations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.ensemble import RandomForestClassifier
import ds_charts as ds  # course-specific helper module (provides HEIGHT, multiple_line_chart, etc.)
file_tag = 'diabetes'
filename = 'data/diabetes'
target = 'class'
train: pd.DataFrame = pd.read_csv(f'{filename}_train.csv')
trnY: np.ndarray = train.pop(target).values
trnX: np.ndarray = train.values
labels = pd.unique(trnY)
test: pd.DataFrame = pd.read_csv(f'{filename}_test.csv')
tstY: np.ndarray = test.pop(target).values
tstX: np.ndarray = test.values
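# optional sanity check (not part of the original flow): confirm shapes and class balance
print(trnX.shape, tstX.shape)
print(pd.Series(trnY).value_counts())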
n_estimators = [5, 10, 25, 50, 75, 100, 150, 200, 250, 300]
max_depths = [5, 10, 25]
max_features = [.1, .3, .5, .7, .9, 1.0]  # fractions of the variables (1.0 uses them all)
best = ('', 0, 0)
last_best = 0
best_model = None

cols = len(max_depths)
fig, axs = plt.subplots(1, cols, figsize=(cols*ds.HEIGHT, ds.HEIGHT), squeeze=False)
for k in range(len(max_depths)):
    d = max_depths[k]
    values = {}
    for f in max_features:
        yvalues = []
        for n in n_estimators:
            rf = RandomForestClassifier(n_estimators=n, max_depth=d, max_features=f)
            rf.fit(trnX, trnY)
            prdY = rf.predict(tstX)
            yvalues.append(metrics.accuracy_score(tstY, prdY))
            if yvalues[-1] > last_best:
                best = (d, f, n)
                last_best = yvalues[-1]
                best_model = rf
        values[f] = yvalues
    ds.multiple_line_chart(n_estimators, values, ax=axs[0, k], title=f'Random Forests with max_depth={d}',
                           xlabel='nr estimators', ylabel='accuracy', percentage=True)
plt.savefig(f'images/{file_tag}_rf_study.png')
plt.show()
print('Best results with depth=%d, %1.2f features and %d estimators, with accuracy=%1.2f' % (best[0], best[1], best[2], last_best))
Best results with depth=25, 0.30 features and 25 estimators, with accuracy=0.80
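The triple loop above picks the best parameters by their accuracy on the test set. As an alternative sketch (reusing the grids defined above), sklearn's GridSearchCV explores the same combinations while cross-validating on the training data instead:

from sklearn.model_selection import GridSearchCV

# same grid as above, selected by 5-fold cross-validation on the training set
param_grid = {'n_estimators': n_estimators, 'max_depth': max_depths, 'max_features': max_features}
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='accuracy', cv=5)
search.fit(trnX, trnY)
print(search.best_params_, search.best_score_)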
After the plot, you can see the parameters for which the best results were achieved. Let's now look at the performance of that best model in terms of other metrics.
prd_trn = best_model.predict(trnX)
prd_tst = best_model.predict(tstX)
ds.plot_evaluation_results(labels, trnY, prd_trn, tstY, prd_tst)
plt.savefig(f'images/{file_tag}_rf_best.png')
plt.show()
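plot_evaluation_results is a helper from the ds_charts module; if it is not available, a minimal sketch of a similar evaluation using only sklearn.metrics (already imported above) could be:

# train vs test accuracy, plus a per-class breakdown on the test set
print('train accuracy:', metrics.accuracy_score(trnY, prd_trn))
print('test accuracy: ', metrics.accuracy_score(tstY, prd_tst))
print(metrics.confusion_matrix(tstY, prd_tst, labels=labels))
print(metrics.classification_report(tstY, prd_tst))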
Random forests have the particularity of providing the importance of each variable in the global model. In order to access those importances, we just need to collect the feature_importances_ attribute from the learnt model, as below.
variables = train.columns
importances = best_model.feature_importances_
std = np.std([tree.feature_importances_ for tree in best_model.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
elems = []
for f in range(trnX.shape[1]):
    elems += [variables[indices[f]]]
    print(f'{f+1}. feature {elems[f]} ({importances[indices[f]]})')

plt.figure()
ds.horizontal_bar_chart(elems, importances[indices], std[indices], title='Random Forest Features importance', xlabel='importance', ylabel='variables')
plt.savefig(f'images/{file_tag}_rf_ranking.png')
1. feature A (0.24935903422068356)
2. feature B (0.14536351609612758)
3. feature C (0.11824891419238961)
4. feature G (0.10904563291325395)
5. feature id (0.10451448997428683)
6. feature F (0.08037813727710363)
7. feature H (0.07549310436071138)
8. feature E (0.06539842935759095)
9. feature D (0.052198741607852456)
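Impurity-based importances like these can be biased towards variables with many distinct values (note the id column ranking fifth above). As a complementary sketch, sklearn's permutation_importance measures each variable's importance as the drop in score when that variable is shuffled on held-out data:

from sklearn.inspection import permutation_importance

# importance measured as the decrease in test accuracy when each variable is permuted
result = permutation_importance(best_model, tstX, tstY, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f'{variables[i]}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}')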