The granularity at which we analyze each variable also plays a fundamental role. Indeed, when
we plot the histogram for a variable, we have to choose the number of bins used to discretize the
data. Until now, we left that choice to the automatic setting, through nbins='auto'.
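To get a feeling for what an automatic choice does, the sketch below asks NumPy how many bins its 'auto' rule would pick. It assumes that the 'auto' setting mentioned above is eventually handed to matplotlib's hist, which delegates string values of bins to numpy.histogram_bin_edges. The data is a synthetic sample, used only for illustration.

```python
import numpy as np

# Synthetic stand-in for a numeric variable (an assumption, not the tutorial's data).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000)

# Bin edges selected by NumPy's 'auto' rule; there is one more edge than bins.
auto_edges = np.histogram_bin_edges(sample, bins='auto')
n_auto_bins = len(auto_edges) - 1

# np.histogram accepts the same string and returns the record count per bin.
counts, edges = np.histogram(sample, bins='auto')

print(f"'auto' selected {n_auto_bins} bins for {sample.size} records")
```

The number of bins chosen this way depends on the sample size and spread, which is why fixing it by hand gives more control over the granularity.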
To see this in more detail, let's consider another dataset.
import pandas as pd
import matplotlib.pyplot as plt
import ds_charts as ds
filename = 'data/electrical_grid_stability.csv'
data = pd.read_csv(filename)
values = {'nr records': data.shape[0], 'nr variables': data.shape[1]}
The data under analysis is in the file electrical_grid_stability.csv, and has 10 000 records and 13 variables, a size that raises no problems per se. (More information about the dataset can be found in the UCI Machine Learning Repository.)
Let's start by plotting a histogram with 100 bins for each numeric variable:
import ds_charts as ds
variable_types = ds.get_variable_types(data)
variables = variable_types['numeric']
rows, cols = ds.choose_grid(len(variables))
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
i, j = 0, 0
for n in range(len(variables)):
    axs[i, j].set_title('Histogram for %s'%variables[n])
    axs[i, j].set_xlabel(variables[n])
    axs[i, j].set_ylabel('nr records')
    axs[i, j].hist(data[variables[n]].values, bins=100)
    # move to the next cell, starting a new row once the current one is full
    i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.savefig('images/granularity_single.png')
plt.show()
With these charts, we are able to recognize that p1 shows an approximately normal distribution.
We have set the number of bins to 100, but this number has a direct impact on the shape presented, since it determines
the level of granularity considered.
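The link between the number of bins and the granularity is simply the bin width: the range of the variable divided by the number of bins. The following is a minimal sketch of that arithmetic. The helper bin_width and the range [0, 1] are illustrative assumptions, not part of ds_charts or of the dataset.

```python
def bin_width(low: float, high: float, n_bins: int) -> float:
    """Width of each bin when the interval [low, high] is split into n_bins equal bins."""
    return (high - low) / n_bins


# Hypothetical variable ranging from 0 to 1.
low, high = 0.0, 1.0
widths = {n_bins: bin_width(low, high, n_bins) for n_bins in (10, 100, 1000, 10000)}

for n_bins, width in widths.items():
    print(f'{n_bins:>6} bins -> bin width {width:g}')
```

Each tenfold increase in the number of bins adds one decimal place to the level of detail at which the variable is inspected.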
Consider, for example, the variable g1, whose range is from 0 to 1. Setting the number of bins to 100
means that we are analysing the variable at a centesimal granularity, that is, in steps of 0.01.
However, the precision of each variable can only be known by looking at the data, and in our case it is 6 decimal places.
We may therefore consider other levels of detail, and to do so we try setting the number of bins to different values.
variable = 'tau1'
bins = (10, 100, 1000, 10000)
fig, axs = plt.subplots(1, len(bins), figsize=(len(bins)*ds.HEIGHT, ds.HEIGHT))
for j in range(len(bins)):
    axs[j].set_title('Histogram for %s %d bins'%(variable, bins[j]))
    axs[j].set_xlabel(variable)
    axs[j].set_ylabel('Nr records')
    axs[j].hist(data[variable].values, bins=bins[j])
plt.savefig(f'images/granularity_study_{variable}.png')
plt.show()
Now look at each of the histograms for the tau1 variable.
In the first one, we set the number of bins to 10 and recognize a perfectly uniform distribution, with 1000 records falling in each of the ten intervals (1000 records with tau1 between 0 and 1, another 1000 between 1 and 2, and so on).
In the second and third ones, we set the number of bins to 100 and 1000, respectively. We see small fluctuations, with each bin holding around 100 and 10 records, respectively, but the uniform distribution remains.
In the last one, we set the number of bins to 10000, the same as the number of records. The result shows most of the records having unique values, while the remaining values appear just twice. Ignoring the ragged shape, we are still able to confirm the uniform distribution.
Usually, we don't go so deep, since we want to have several records in each bin. While we cannot guarantee that this happens for every bin, setting the number of bins to be smaller than the number of records achieves it for the majority of them.
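One way to check how many records end up in each bin, without inspecting the charts, is to count them with numpy.histogram. The sketch below does so for a synthetic uniform sample with as many records as the dataset. The sample and its range are assumptions for illustration, so the exact counts will differ from those of tau1.

```python
import numpy as np

# Synthetic uniform stand-in (an assumption, not the real tau1 values).
rng = np.random.default_rng(seed=42)
n_records = 10_000
sample = rng.uniform(0.0, 10.0, size=n_records)

occupancy = {}
for n_bins in (10, 100, 1000, 10_000):
    # counts[k] is the number of records falling in the k-th bin
    counts, _ = np.histogram(sample, bins=n_bins)
    occupancy[n_bins] = {
        'mean': float(counts.mean()),
        'min': int(counts.min()),
        'empty': int((counts == 0).sum()),
    }
    print(f"{n_bins:>6} bins: mean {occupancy[n_bins]['mean']:.1f}, "
          f"min {occupancy[n_bins]['min']}, empty bins {occupancy[n_bins]['empty']}")
```

The mean number of records per bin is always the number of records divided by the number of bins, while the minimum and the number of empty bins show how far individual bins stray from that mean.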
So, to analyse the rest of the variables, we will try histograms with 10, 100 and 1000 bins.
variable_types = ds.get_variable_types(data)
columns = variable_types['numeric']
rows = len(columns)
bins = (10, 100, 1000)
cols = len(bins)
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
for i in range(rows):
    for j in range(cols):
        axs[i, j].set_title('Histogram for %s %d bins'%(columns[i], bins[j]))
        axs[i, j].set_xlabel(columns[i])
        axs[i, j].set_ylabel('Nr records')
        axs[i, j].hist(data[columns[i]].values, bins=bins[j])
plt.savefig('images/granularity_study.png')
plt.show()