|
Naive Bayes is one of the most famous classification techniques, and also one of the simplest and easiest to apply.
Like other Bayesian techniques, it simply chooses the most probable class for each record, based on an estimate of the probability of each class given the record whose label we want to predict. The trick behind the simplicity of Naive Bayes lies in the assumption of conditional independence among the variables, which simplifies that estimation and makes Naive Bayes the standard baseline for classification.
Indeed, we can evaluate the performance of any classifier over a given dataset simply by comparing its results with those of the other classifiers, and in particular with the results of Naive Bayes over that dataset.
The nicest property of Naive Bayes is that it has no parameters to tune, and so its performance serves as a comparison baseline: any model is only interesting if it outperforms the one learnt through Naive Bayes.
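To make the conditional independence assumption concrete, the following sketch (with made-up numbers, not part of the diabetes case study) classifies a single record by multiplying the class prior by the per-variable Gaussian likelihoods, which is essentially what the GaussianNB estimator used below does internally.
import numpy as np
from scipy.stats import norm

# toy training data: two numeric variables, binary class (illustrative values only)
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.2, 3.9]])
y = np.array([0, 0, 1, 1])
record = np.array([1.1, 2.1])   # record whose label we want to predict

scores = {}
for c in np.unique(y):
    Xc = X[y == c]
    prior = len(Xc) / len(X)
    # conditional independence: the joint likelihood is the product of the per-variable likelihoods
    likelihood = np.prod(norm.pdf(record, loc=Xc.mean(axis=0), scale=Xc.std(axis=0)))
    scores[c] = prior * likelihood

print(max(scores, key=scores.get))   # the most probable class for the record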
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
import ds_charts as ds
file_tag = 'diabetes'
filename = 'data/diabetes'
target = 'class'
train: pd.DataFrame = pd.read_csv(f'{filename}_train.csv')
trnY: np.ndarray = train.pop(target).values
trnX: np.ndarray = train.values
labels = pd.unique(trnY)
test: pd.DataFrame = pd.read_csv(f'{filename}_test.csv')
tstY: np.ndarray = test.pop(target).values
tstX: np.ndarray = test.values
clf = GaussianNB()
clf.fit(trnX, trnY)
prd_trn = clf.predict(trnX)
prd_tst = clf.predict(tstX)
ds.plot_evaluation_results(labels, trnY, prd_trn, tstY, prd_tst)
plt.savefig(f'images/{file_tag}_nb_best.png')
plt.show()
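Since plot_evaluation_results comes from the course-specific ds_charts module, it may also be useful to print the raw scores with the standard sklearn.metrics functions; a minimal sketch, reusing the predictions computed above:
# accuracy on both partitions, to spot overfitting at a glance
print('train accuracy:', metrics.accuracy_score(trnY, prd_trn))
print('test accuracy:', metrics.accuracy_score(tstY, prd_tst))
# confusion matrix over the test set: rows are true labels, columns are predicted labels
print(metrics.confusion_matrix(tstY, prd_tst, labels=labels))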
If we inspect the classes available in the sklearn.naive_bayes package, we see there are more than the GaussianNB estimator. Indeed, there are also the MultinomialNB and the BernoulliNB estimators, which are adequate to use when the data distribution is close to a multinomial or a Bernoulli distribution, respectively.
estimators = {'GaussianNB': GaussianNB(),
              'MultinomialNB': MultinomialNB(),
              'BernoulliNB': BernoulliNB()}
xvalues = []
yvalues = []
for clf in estimators:
    xvalues.append(clf)
    estimators[clf].fit(trnX, trnY)
    prdY = estimators[clf].predict(tstX)
    yvalues.append(metrics.accuracy_score(tstY, prdY))
plt.figure()
ds.bar_chart(xvalues, yvalues, title='Comparison of Naive Bayes Models', ylabel='accuracy', percentage=True)
plt.savefig(f'images/{file_tag}_nb_study.png')
plt.show()
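One caveat when running this comparison on other datasets: MultinomialNB only accepts non-negative feature values (it raises an error on standardized data, for instance), and BernoulliNB internally binarizes each variable using its binarize threshold (0.0 by default). A possible workaround, sketched below under the assumption that a simple MinMaxScaler rescaling is acceptable for the data at hand, is to rescale the variables before fitting those two estimators.
from sklearn.preprocessing import MinMaxScaler

# rescale variables to [0, 1] so that MultinomialNB receives non-negative inputs
scaler = MinMaxScaler().fit(trnX)
trnX_scaled = scaler.transform(trnX)
tstX_scaled = scaler.transform(tstX)

clf = MultinomialNB()
clf.fit(trnX_scaled, trnY)
print('MultinomialNB accuracy on rescaled data:', metrics.accuracy_score(tstY, clf.predict(tstX_scaled)))

# BernoulliNB treats each variable as binary: values above the binarize threshold become 1, the rest 0
clf = BernoulliNB(binarize=0.5)
clf.fit(trnX_scaled, trnY)
print('BernoulliNB accuracy on rescaled data:', metrics.accuracy_score(tstY, clf.predict(tstX_scaled)))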