After loading the data and understanding the variables available to describe the records, we proceed with the analysis of the distribution of each variable on its own, without relating them to each other.
Depending on whether they are numeric or symbolic, we can apply different sets of functions to explore their nature. Let's start with the numeric ones.
The simplest way to describe each numeric variable is through its five-number summary, inspecting its range through its minimum and maximum values, along with other estimators such as the mean, the mode, the standard deviation and different percentiles.
In dataframes, we can access it through the describe function. Its include parameter defaults to covering only numeric variables, meaning that, without setting it explicitly, only the numeric variables are shown.
import pandas as pd
import matplotlib.pyplot as plt
import ds_charts as ds
from pandas.plotting import register_matplotlib_converters

# enable matplotlib to handle pandas datetime values
register_matplotlib_converters()

# load the dataset, using the date column as index and parsing it as a datetime
filename = 'data/algae.csv'
data = pd.read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
data.describe()
| | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll |
|---|---|---|---|---|---|---|---|---|
| count | 199.000000 | 198.000000 | 190.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 188.000000 |
| mean | 8.011759 | 9.117778 | 43.636842 | 3.282828 | 154.447475 | 83.325909 | 111.550808 | 13.541011 |
| std | 0.598302 | 2.391253 | 46.830993 | 3.776458 | 180.011207 | 116.783165 | 102.214088 | 20.448323 |
| min | 5.600000 | 1.500000 | 0.220000 | 0.050000 | 5.000000 | 1.000000 | 0.900000 | 0.000000 |
| 25% | 7.700000 | 7.725000 | 10.985000 | 1.297500 | 35.625000 | 16.000000 | 19.395000 | 2.000000 |
| 50% | 8.060000 | 9.800000 | 32.730000 | 2.675000 | 99.665000 | 41.400000 | 84.500000 | 5.200000 |
| 75% | 8.400000 | 10.800000 | 57.825000 | 4.447500 | 203.730000 | 102.247500 | 182.160000 | 18.300000 |
| max | 9.700000 | 13.400000 | 391.500000 | 45.650000 | 931.830000 | 771.600000 | 558.750000 | 110.460000 |
Take a moment to analyse the count row, which shows the number of non-missing values for each variable.
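As mentioned above, describe only reports the numeric variables by default. To also get a summary of the symbolic ones, the include parameter can be set explicitly; a minimal example, assuming those variables are stored with the object dtype, would be:

# summary of the non-numeric variables only (assuming they have the object dtype)
data.describe(include='object')
# or a single table covering every variable, regardless of type
data.describe(include='all')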
The five-number summary alone gives all the required information, but it is not easy to interpret. A better way to understand the impact of these values is to analyse boxplots for each variable.
Again, the DataFrame object provides several functions to explore the data, in particular the boxplot function, which plots all numeric variables in the same chart.
data.boxplot(rot=45)  # rotate the variable labels by 45 degrees for readability
plt.savefig('images/global_boxplot.png')
plt.show()
Despite being able to compare the ranges of all variables, it is difficult to analyse each one in particular, due to their different scales.
In order to address this difference, we can plot an individual boxplot for each variable using the boxplot method from matplotlib.
In order to produce the clearest plots, we can use our get_variable_types function to select the variables of each type, and then explore each type separately.
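The ds_charts module used throughout these snippets is a helper library whose implementation is not shown here. Purely as an assumption, to make the code self-contained, get_variable_types, choose_grid and HEIGHT might look roughly like the following sketch:

from math import ceil
import pandas as pd

HEIGHT = 4  # assumed default height (in inches) of each chart in a grid

def get_variable_types(df: pd.DataFrame) -> dict:
    # assumed behaviour: split the columns into numeric, date and symbolic
    # variables according to their dtypes
    variable_types = {'numeric': [], 'date': [], 'symbolic': []}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            variable_types['numeric'].append(col)
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            variable_types['date'].append(col)
        else:
            variable_types['symbolic'].append(col)
    return variable_types

def choose_grid(nr: int, max_cols: int = 4) -> tuple:
    # assumed behaviour: pick a grid with at most max_cols charts per row
    cols = nr if nr < max_cols else max_cols
    return ceil(nr / cols), cols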
variable_types = ds.get_variable_types(data)
numeric_vars = variable_types['numeric']

# choose a grid large enough to hold one chart per numeric variable
rows, cols = ds.choose_grid(len(numeric_vars))
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
i, j = 0, 0
for n in range(len(numeric_vars)):
    axs[i, j].set_title('Boxplot for %s'%numeric_vars[n])
    axs[i, j].boxplot(data[numeric_vars[n]].dropna().values)
    # move to the next cell in the grid, wrapping to the next row when needed
    i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.savefig('images/single_boxplots.png')
plt.show()
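Besides the visual inspection, the whisker rule behind the boxplots can also be used to quantify outliers. The following sketch, which is only an illustration and not part of the original flow, counts the values falling outside 1.5 times the interquartile range:

# illustration only: count potential outliers per variable with the 1.5*IQR rule,
# mirroring what the boxplot whiskers show
for var in numeric_vars:
    values = data[var].dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outliers = values[(values < q1 - 1.5*iqr) | (values > q3 + 1.5*iqr)]
    print('%s: %d potential outliers' % (var, len(outliers)))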
From the boxplots, it is clear that our variables have different ranges and scales and that there are several outliers, but we cannot learn anything else about the variables' distributions. In order to see their distributions, the best option is to plot a histogram for each numeric variable, through the hist method.
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
i, j = 0, 0
for n in range(len(numeric_vars)):
axs[i, j].set_title('Histogram for %s'%numeric_vars[n])
axs[i, j].set_xlabel(numeric_vars[n])
axs[i, j].set_ylabel("nr records")
axs[i, j].hist(data[numeric_vars[n]].dropna().values, 'auto')
i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.savefig('images/single_histograms_numeric.png')
plt.show()
Indeed, histograms give us an insight into the distribution of each variable, but recognising the distribution that best fits the data may be hard. Seaborn provides the distplot method to display the histogram together with a smoothed trend for each variable.
import seaborn as sns
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
i, j = 0, 0
for n in range(len(numeric_vars)):
axs[i, j].set_title('Histogram with trend for %s'%numeric_vars[n])
sns.distplot(data[numeric_vars[n]].dropna().values, norm_hist=True, ax=axs[i, j], axlabel=numeric_vars[n])
i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.savefig('images/histograms_trend_numeric.png')
plt.show()
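Note that distplot has been deprecated in recent Seaborn releases. Assuming Seaborn 0.11 or later, a roughly equivalent chart can be obtained with histplot, as in the sketch below:

# assuming Seaborn >= 0.11: histplot with a KDE overlay replaces the deprecated distplot
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
i, j = 0, 0
for n in range(len(numeric_vars)):
    axs[i, j].set_title('Histogram with trend for %s'%numeric_vars[n])
    sns.histplot(data[numeric_vars[n]].dropna(), stat='density', kde=True, ax=axs[i, j])
    i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.show()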
Despite the simplicity of this approach, we are not able to verify how well standard distributions fit the data. In order to do that, we can try to fit different known distributions to it, using the scipy.stats package to compute them (norm, expon, skewnorm, etc.).
Let's look at the histograms for the pH and Ammonium variables, and some possible distributions...
import scipy.stats as _stats
import numpy as np
def compute_known_distributions(x_values: list) -> dict:
distributions = dict()
# Gaussian
mean, sigma = _stats.norm.fit(x_values)
distributions['Normal(%.1f,%.2f)'%(mean,sigma)] = _stats.norm.pdf(x_values, mean, sigma)
# Exponential
loc, scale = _stats.expon.fit(x_values)
distributions['Exp(%.2f)'%(1/scale)] = _stats.expon.pdf(x_values, loc, scale)
# LogNorm
sigma, loc, scale = _stats.lognorm.fit(x_values)
distributions['LogNor(%.1f,%.2f)'%(np.log(scale),sigma)] = _stats.lognorm.pdf(x_values, sigma, loc, scale)
return distributions
def histogram_with_distributions(ax: plt.Axes, series: pd.Series, var: str):
    # plot the empirical histogram as a density and overlay the fitted distributions
    values = series.sort_values().values
    ax.hist(values, 20, density=True)
    distributions = compute_known_distributions(values)
    ds.multiple_line_chart(values, distributions, ax=ax, title='Best fit for %s'%var, xlabel=var, ylabel='')
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT))
i, j = 0, 0
for n in range(len(numeric_vars)):
histogram_with_distributions(axs[i, j], data[numeric_vars[n]].dropna(), numeric_vars[n])
i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.savefig('images/histogram_numeric_distribution.png')
plt.show()
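The visual comparison above can be complemented with a quantitative check. As an illustration beyond the charts, scipy.stats also provides the one-sample Kolmogorov-Smirnov test, which measures how far the empirical distribution is from a fitted one:

# illustration only: quantify the fit of the Normal distribution for each variable
# with a Kolmogorov-Smirnov test (smaller statistic means a closer fit)
for var in numeric_vars:
    values = data[var].dropna().values
    mean, sigma = _stats.norm.fit(values)
    statistic, p_value = _stats.kstest(values, 'norm', args=(mean, sigma))
    print('%s: KS statistic=%.3f, p-value=%.3f' % (var, statistic, p_value))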
The exploration of symbolic variables is similar, but we cannot apply the same functions used before. Indeed, boxplots are not applicable to non-numeric variables.
In order to explore the distribution of our 3 symbolic variables, we have to create their histograms manually, since the hist function doesn't work with non-numeric variables.
Histograms for symbolic variables can be produced with bar charts, by counting the frequency of each value of each variable.
symbolic_vars = variable_types['symbolic']
rows, cols = ds.choose_grid(len(symbolic_vars))
fig, axs = plt.subplots(rows, cols, figsize=(cols*ds.HEIGHT, rows*ds.HEIGHT), squeeze=False)
i, j = 0, 0
for n in range(len(symbolic_vars)):
    # count the occurrences of each value and plot them as a bar chart
    counts = data[symbolic_vars[n]].value_counts()
    ds.bar_chart(counts.index.to_list(), counts.values, ax=axs[i, j], title='Histogram for %s'%symbolic_vars[n],
                 xlabel=symbolic_vars[n], ylabel='nr records', percentage=False)
    i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.savefig('images/histograms_symbolic.png')
plt.show()
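For readers without the ds_charts module, a similar chart can be drawn with pandas and matplotlib alone. The following is just a sketch of an alternative, not the chart produced above:

# sketch without ds_charts: plot the value counts of each symbolic variable directly
fig, axs = plt.subplots(rows, cols, figsize=(cols*4, rows*4), squeeze=False)
i, j = 0, 0
for n in range(len(symbolic_vars)):
    data[symbolic_vars[n]].value_counts().plot.bar(ax=axs[i, j])
    axs[i, j].set_title('Histogram for %s'%symbolic_vars[n])
    axs[i, j].set_xlabel(symbolic_vars[n])
    axs[i, j].set_ylabel('nr records')
    i, j = (i + 1, 0) if (n+1) % cols == 0 else (i, j + 1)
plt.show()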