|
The last of the analysis perspectives is sparsity analysis, which explores how well the records cover the domain from a multivariate point of view. It works by cross-analyzing the records projected onto just a few of their variables (usually only two). In practice, these projections are drawn as two-dimensional charts that plot one variable against another.
A dataset is said to be sparse when most of the space defined by its variables is not covered by its records. One way to gain insight into the sparsity of the data is to use scatter plots that project the data along two of its variables. To do so, we can use the scatter method.
Note that a scatter plot of any pair of boolean variables shows at most 4 distinct points, so it is difficult to extract any information about boolean variables from such plots.
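The limit of 4 points can be checked on a small example. The sketch below uses a hypothetical toy frame with two boolean columns (the names a and b and the values are assumptions, not taken from the algae data) and counts the distinct pairs that a scatter plot would draw.

```python
import pandas as pd

# Hypothetical toy data: two boolean columns, not taken from the algae dataset.
toy = pd.DataFrame({
    'a': [True, False, True, True, False, False, True, False],
    'b': [True, True, False, True, False, True, False, False],
})

# Every record falls on one of the 2 x 2 possible (a, b) combinations,
# so a scatter plot of a against b can never show more than 4 distinct points.
distinct_points = toy.drop_duplicates()
n_points = len(distinct_points)
print(f'{len(toy)} records, {n_points} distinct points')
```

However many records are added, the number of distinct points stays at 4 or below, and overlapping markers hide how many records sit on each point.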
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
import ds_charts as ds

register_matplotlib_converters()
filename = 'data/algae.csv'
# parse_dates=True converts the 'date' index to datetimes
data = pd.read_csv(filename, index_col='date', parse_dates=True)

variable_types = ds.get_variable_types(data)
numeric_vars = variable_types['numeric']

# Panel (i, j-1) plots variable i against variable j, for every pair with i < j
rows, cols = len(numeric_vars)-1, len(numeric_vars)-1
# plt.subplots creates the figure itself, so no separate plt.figure() call is needed
fig, axs = plt.subplots(rows, cols, figsize=(cols*4, rows*4), squeeze=False)
for i in range(len(numeric_vars)):
    var1 = numeric_vars[i]
    for j in range(i+1, len(numeric_vars)):
        var2 = numeric_vars[j]
        axs[i, j-1].set_title(f'{var1} x {var2}')
        axs[i, j-1].set_xlabel(var1)
        axs[i, j-1].set_ylabel(var2)
        axs[i, j-1].scatter(data[var1], data[var2])
plt.savefig('images/sparsity_study_numeric.png')
plt.show()
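Reading sparsity from a scatter plot is a visual judgment. As a rough complement, one could split the plane of two numeric variables into a grid and measure the fraction of cells that hold at least one record. The sketch below is only an illustration under stated assumptions: the function name grid_coverage, the 10 x 10 grid, and the toy data are all hypothetical and not part of the original analysis.

```python
import numpy as np

def grid_coverage(x, y, bins=10):
    """Fraction of cells in a bins x bins grid that contain at least one record."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Drop records with a missing value in either variable.
    mask = ~(np.isnan(x) | np.isnan(y))
    counts, _, _ = np.histogram2d(x[mask], y[mask], bins=bins)
    return np.count_nonzero(counts) / counts.size

# Hypothetical toy data: points on a diagonal cover few cells,
# while uniformly scattered points cover most of them.
rng = np.random.default_rng(0)
diag = np.linspace(0, 1, 200)
uniform_x, uniform_y = rng.random(2000), rng.random(2000)

diag_cov = grid_coverage(diag, diag)
unif_cov = grid_coverage(uniform_x, uniform_y)
print(f'diagonal coverage: {diag_cov:.2f}')
print(f'uniform coverage:  {unif_cov:.2f}')
```

With the data loaded above, a call such as grid_coverage(data[var1], data[var2]) for a pair of numeric variables would give a comparable number. The result depends on the chosen grid size, so it is better suited to comparing pairs of variables than to being read as an absolute measure.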
For symbolic variables, however, scatter plots can provide some information.
symbolic_vars = variable_types['symbolic']

# Same layout as before: panel (i, j-1) plots variable i against variable j
rows, cols = len(symbolic_vars)-1, len(symbolic_vars)-1
# plt.subplots creates the figure itself, so no separate plt.figure() call is needed
fig, axs = plt.subplots(rows, cols, figsize=(cols*4, rows*4), squeeze=False)
for i in range(len(symbolic_vars)):
    var1 = symbolic_vars[i]
    for j in range(i+1, len(symbolic_vars)):
        var2 = symbolic_vars[j]
        axs[i, j-1].set_title(f'{var1} x {var2}')
        axs[i, j-1].set_xlabel(var1)
        axs[i, j-1].set_ylabel(var2)
        axs[i, j-1].scatter(data[var1], data[var2])
plt.savefig('images/sparsity_study_symbolic.png')
plt.show()
In this case, however, every combination of values occurs for each pair of variables, so the plots reveal nothing useful.
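Whether every combination of values occurs can also be checked numerically with a contingency table. The following is a minimal sketch using pd.crosstab on hypothetical toy data: the column names season and size and their values are assumptions for illustration, not the contents of the algae file.

```python
import pandas as pd

# Hypothetical toy data standing in for two symbolic variables.
toy = pd.DataFrame({
    'season': ['winter', 'winter', 'spring', 'summer', 'summer', 'autumn'],
    'size':   ['small',  'large',  'small',  'small',  'large',  'large'],
})

# Contingency table: rows are values of the first variable, columns of the second.
counts = pd.crosstab(toy['season'], toy['size'])

# Share of value combinations that occur at least once in the records.
coverage = (counts > 0).to_numpy().mean()
print(counts)
print(f'coverage: {coverage:.2f}')
```

A coverage of 1.0 corresponds to the situation described above, where all combinations are present. Unlike the scatter plot, the table also shows how many records fall on each combination.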
Besides showing data sparsity, scatter plots are also useful for revealing the correlation among variables.
However, when the number of variables is large, a heatmap is easier to analyze.
To draw one, we can use the heatmap function from the seaborn package.
import seaborn as sns

fig = plt.figure(figsize=[12, 12])
# Correlation is only defined for numeric columns, so select them explicitly
corr_mtx = data.select_dtypes(include='number').corr()
sns.heatmap(corr_mtx, xticklabels=corr_mtx.columns, yticklabels=corr_mtx.columns, annot=True, cmap='Blues')
plt.title('Correlation analysis')
plt.savefig('images/correlation_analysis.png')
plt.show()
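When there are many variables, even the heatmap becomes crowded, and it may help to list only the strongly correlated pairs. The sketch below is one possible way to do that, not part of the original analysis: the 0.8 threshold and the toy columns x, y and z are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y is built from x, z is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'x': base,
    'y': 2 * base + rng.normal(scale=0.1, size=200),
    'z': rng.normal(size=200),
})

corr_mtx = toy.corr()

# Keep only the cells above the diagonal so each pair is listed once
# and the trivial self-correlations are left out.
above_diagonal = np.triu(np.ones(corr_mtx.shape, dtype=bool), k=1)
pairs = corr_mtx.where(above_diagonal).stack()

# Assumed threshold: report pairs whose absolute correlation exceeds 0.8.
strong = pairs[pairs.abs() > 0.8].sort_values(key=lambda s: s.abs(), ascending=False)
print(strong)
```

Applied to the correlation matrix computed above instead of the toy one, the same filtering would return the pairs that stand out as the darkest cells in the heatmap.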