Let's reload the algae data being explored. Note that the parameters parse_dates and infer_datetime_format are needed only if the data has some datetime variables. In pandas 2.0 and later, infer_datetime_format is deprecated and can be omitted.
import pandas as pd
import matplotlib.pyplot as plt
import ds_charts as ds
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
filename = 'data/algae.csv'
data = pd.read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
data.shape
(200, 11)
In the presence of missing values, it is important to declare the symbol used to represent them in the data file, using the na_values parameter.
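As a minimal sketch of what na_values does, the snippet below reads a tiny in-memory file instead of the algae data. The marker 'XXXXXXX' and the three rows are made up for illustration; the real file may use a different symbol.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row file in which missing readings are written as 'XXXXXXX'.
raw = StringIO(
    "date,pH,Oxygen\n"
    "2018-09-30,8.1,9.8\n"
    "2018-10-05,XXXXXXX,8.0\n"
    "2018-10-07,7.9,XXXXXXX\n"
)

# na_values tells read_csv which extra strings to treat as missing values.
sample = pd.read_csv(raw, index_col='date', na_values='XXXXXXX', parse_dates=True)

print(sample.isna().sum())  # number of missing values per column
print(sample.dtypes)        # the columns stay numeric because the marker became NaN
```

Without na_values, the marker would be read as text and both columns would end up with the object type.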
The next thing we need to do is to understand the ratio between the number of records and variables.
plt.figure(figsize=(4,2))
values = {'nr records': data.shape[0], 'nr variables': data.shape[1]}
ds.bar_chart(list(values.keys()), list(values.values()), title='Nr of records vs nr variables')
plt.savefig('images/records_variables.png')
plt.show()
Note that the function savefig saves the figure in a png file named records_variables in the folder images, which has to exist in the working directory.
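Since savefig fails when the target folder is missing, one option is to create the folder from the script itself. The sketch below assumes a temporary directory as a stand-in for the working directory and uses plain matplotlib instead of ds_charts; the two bar heights are the shape values shown above.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# A temporary directory stands in for the working directory (assumption for this sketch).
workdir = Path(tempfile.mkdtemp())
images_dir = workdir / 'images'

# Create the folder first; exist_ok=True makes the call safe to repeat.
images_dir.mkdir(parents=True, exist_ok=True)

plt.figure(figsize=(4, 2))
plt.bar(['nr records', 'nr variables'], [200, 11])
plt.title('Nr of records vs nr variables')
target = images_dir / 'records_variables.png'
plt.savefig(target)
plt.close()
```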
As you know, different types of variables require different treatments. The dtypes attribute returns the types of all variables in the dataframe.
data.dtypes
pH                float64
Oxygen            float64
Chloride          float64
Nitrates          float64
Ammonium          float64
Orthophosphate    float64
Phosphate         float64
Chlorophyll       float64
fluid_velocity     object
river_depth        object
season             object
dtype: object
If we need to apply any function that only deals with symbolic variables, we need to transform object variables into category ones (the name for symbolic in pandas).
cat_vars = data.select_dtypes(include='object')
data[cat_vars.columns] = cat_vars.astype('category')
data.dtypes
pH                 float64
Oxygen             float64
Chloride           float64
Nitrates           float64
Ammonium           float64
Orthophosphate     float64
Phosphate          float64
Chlorophyll        float64
fluid_velocity    category
river_depth       category
season            category
dtype: object
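The same conversion can be tried on a small frame to see what a category variable holds. This is a sketch with made-up values; only the column names echo the algae data, and the text columns are created explicitly with the object type so that the selection behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame with hypothetical values; text columns are forced to the object dtype.
toy = pd.DataFrame({
    'pH': [8.0, 8.35, 8.1],
    'season': pd.Series(['winter', 'spring', 'winter'], dtype='object'),
    'river_depth': pd.Series(['small', 'medium', 'small'], dtype='object'),
})

# Convert every object column to category in a single astype call.
object_cols = toy.select_dtypes(include='object').columns
toy = toy.astype({col: 'category' for col in object_cols})

print(toy.dtypes)
print(list(toy['season'].cat.categories))  # the distinct symbols of the variable
```

A category variable stores each distinct symbol once and keeps integer codes for the rows, which is why functions for symbolic data can list its values through the cat accessor.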
We can go further and write a function that collects the names of the variables of each type. It receives the dataframe to analyse and returns a dictionary with a list of column names for each of the types numeric, binary, symbolic and date.
from datetime import datetime
from pandas import isna
from pandas.api.types import is_datetime64_any_dtype

def get_variable_types(df):
    # Columns with fewer distinct values than this are treated as symbolic
    NR_SYMBOLS = 10
    variable_types = {'binary': [], 'numeric': [], 'date': [], 'symbolic': []}
    for c in df.columns:
        mv = df[c].isna().sum()
        uniques = df[c].unique()
        if mv == 0:
            if len(uniques) == 2:
                variable_types['binary'].append(c)
            elif is_datetime64_any_dtype(df[c]):
                # matches datetime columns of any resolution, e.g. datetime64[ns]
                variable_types['date'].append(c)
            elif len(uniques) < NR_SYMBOLS:
                variable_types['symbolic'].append(c)
            else:
                variable_types['numeric'].append(c)
        else:
            # isna, unlike numpy's isnan, also accepts strings and dates
            uniques = [v for v in uniques if not isna(v)]
            values = [v for v in uniques if isinstance(v, str)]
            if len(uniques) == 2:
                variable_types['binary'].append(c)
            elif len(values) == len(uniques):
                variable_types['symbolic'].append(c)
            else:
                values = [v for v in uniques if isinstance(v, datetime)]
                if len(values) == len(uniques):
                    variable_types['date'].append(c)
                else:
                    variable_types['numeric'].append(c)
    return variable_types
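For comparison, a shorter variant can rely on the column dtypes alone. The sketch below is not the function above: the name variable_types_by_dtype is hypothetical, it ignores the NR_SYMBOLS threshold, and it treats every non-numeric, non-date column as symbolic. The toy values are made up.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def variable_types_by_dtype(df):
    """Hypothetical simplified variant: classify columns by dtype and cardinality."""
    types = {'binary': [], 'numeric': [], 'date': [], 'symbolic': []}
    for col in df.columns:
        series = df[col]
        if series.dropna().nunique() == 2:
            # exactly two distinct non-missing values
            types['binary'].append(col)
        elif is_datetime64_any_dtype(series):
            types['date'].append(col)
        elif is_numeric_dtype(series):
            types['numeric'].append(col)
        else:
            types['symbolic'].append(col)
    return types


# Toy frame with made-up values, one column per expected type.
toy = pd.DataFrame({
    'flag': [0, 1, 0, 1],
    'when': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']),
    'value': [1.5, 2.5, 3.5, 4.5],
    'colour': ['red', 'green', None, 'blue'],
})

result = variable_types_by_dtype(toy)
print(result)
```

The trade-off is that a numeric column with only a handful of distinct values stays numeric here, whereas the function above would report it as symbolic.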
Showing and saving the chart is just a plus.
variable_types = ds.get_variable_types(data)
counts = {}
for tp in variable_types.keys():
    counts[tp] = len(variable_types[tp])
plt.figure(figsize=(4,2))
ds.bar_chart(list(counts.keys()), list(counts.values()), title='Nr of variables per type')
plt.savefig('images/variable_types.png')
plt.show()
It is then easy to see how many missing values there are for each variable.
mv = {}
plt.figure()
for var in data:
    nr = data[var].isna().sum()
    if nr > 0:
        mv[var] = nr
ds.bar_chart(list(mv.keys()), list(mv.values()), title='Nr of missing values per variable',
             xlabel='variables', ylabel='nr missing values', rotation=True)
plt.savefig('images/mv.png')
plt.show()
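The counting loop above can also be written without an explicit loop, since isna().sum() already returns one count per column. A minimal sketch on made-up values (the toy frame is an assumption, not the algae data):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical readings; NaN marks a missing value.
toy = pd.DataFrame({
    'pH': [8.0, np.nan, 8.1, 7.9],
    'Oxygen': [9.8, 8.0, np.nan, np.nan],
    'Chloride': [60.8, 57.8, 40.0, 77.4],
})

# One count per column, then keep only the columns that have missing values.
missing = toy.isna().sum()
mv = missing[missing > 0].to_dict()
print(mv)
```

The resulting dictionary has the same shape as mv in the loop version, so it can be passed to the same bar chart call.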