We start by loading the data, converting object variables to symbolic ones as seen in the previous lab, and keeping the numeric variables separated from the symbolic ones, so that each can be dealt with using the right tools.
import pandas as pd
import numpy as np
from pandas.plotting import register_matplotlib_converters
import matplotlib.pyplot as plt
import ds_charts as ds
register_matplotlib_converters()
file = 'algae'
filename = 'data/algae.csv'
data = pd.read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
variable_types = ds.get_variable_types(data)
numeric_vars = variable_types['numeric']
symbolic_vars = variable_types['symbolic']
boolean_vars = variable_types['binary']
Missing values are a kind of plague in data science, in particular when using scikit-learn, since its estimators are not able to deal with them. A missing value corresponds to a variable without any value for a given record. Let's recover the procedure to find the variables with missing values from the data dimensionality lab:
mv = {}
plt.figure()
for var in data:
    nr = data[var].isna().sum()
    if nr > 0:
        mv[var] = nr
ds.bar_chart(list(mv.keys()), list(mv.values()), title='Nr of missing values per variable',
             xlabel='variables', ylabel='nr missing values', rotation=True)
plt.savefig(f'images/{file}_missing_values.png')
mv
{'pH': 1, 'Oxygen': 2, 'Chloride': 10, 'Nitrates': 2, 'Ammonium': 2, 'Orthophosphate': 2, 'Phosphate': 2, 'Chlorophyll': 12}
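Before deciding what to do with these variables, it can help to look at missing values as a fraction of the dataset rather than as absolute counts, since the thresholds used below are defined as fractions. A minimal sketch on a toy DataFrame (the toy data is illustrative, not the algae dataset):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the algae data
toy = pd.DataFrame({
    'pH': [7.0, np.nan, 8.1, 7.9],
    'Chloride': [np.nan, np.nan, 31.0, 56.9],
    'season': ['winter', 'spring', None, 'summer'],
})

# isna() gives booleans; their mean is the fraction of missing values per variable
mv_ratio = toy.isna().mean()
print(mv_ratio)
```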
There are two main situations to consider. The first one is when a column has a significant number of missing values. It's difficult to establish a threshold, since the number of records remaining plays an important part: if the remaining records are enough to characterize the data, we can keep the column; otherwise we discard the entire column.
Since our dataset has just 200 records we will discard the columns that have more than 90% of missing values, as follows.
# defines the number of records to discard entire columns
threshold = data.shape[0] * 0.90
missings = [c for c in mv.keys() if mv[c]>threshold]
df = data.drop(columns=missings, inplace=False)
df.to_csv(f'data/{file}_drop_mv.csv', index=False)
print('Dropped variables', missings)
Dropped variables []
Note that we made a copy of the original data, setting the inplace parameter to False, so as not to impact the following approaches.
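The same column-level filter can also be written directly with pandas' dropna, using axis=1 and the thresh parameter, which counts the minimum number of non-missing values a column must have to survive. A hedged sketch on toy data (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_full': [1.0, 2.0, np.nan, 4.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 4.0],
})

# keep only columns with at least 50% non-missing values
min_non_missing = int(toy.shape[0] * 0.5)
kept = toy.dropna(axis=1, thresh=min_non_missing)
print(list(kept.columns))
```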
The second situation arises when individual records have a majority of their variables without values. In this case, we prefer to discard those records instead of dropping all the columns involved. For this we use the dropna method.
# defines the number of variables to discard entire records
threshold = data.shape[1] * 0.50
df = data.dropna(thresh=threshold, inplace=False)
df.to_csv(f'data/{file}_dropna_mv.csv', index=False)
print(df.shape)
(198, 11)
As we can see, we didn't discard any variable and only dropped two records.
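Keep in mind that thresh counts the non-missing values a record must have in order to survive, not the missing ones. A small sketch on toy data makes the semantics concrete:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 2.0],
    'c': [5.0, np.nan, 1.0],
})

# keep rows with at least 2 non-missing values (out of 3 columns);
# the all-missing middle row is dropped
kept = toy.dropna(thresh=2)
print(kept.index.tolist())
```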
The simplest imputer provided by sklearn.impute is the SimpleImputer. First, it is created, defining the strategy to follow; then it is fitted to the data (fit method). After that, it is possible to apply it to the data through the transform method. Using the fit_transform method, we are able to apply both steps in just one call; keeping fit and transform separate, however, lets us reuse the same fitted imputer on other data.
It uses a simple strategy to fill any missing value with a new value, which we need to define through the strategy parameter. We can choose among:
- constant: the constant value chosen depends on the type of variable (usually NaN, -1 or 0 for numeric, 'NA' for symbolic and False for boolean)
- mean: only applicable for numeric variables
- median: only applicable for numeric variables
- most_frequent: mostly applicable for symbolic variables

There is also an IterativeImputer, which considers all the variables to estimate missing values, but it is out of the scope of this tutorial.
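The fit/transform split matters when the same imputation has to be reused, for instance on a test set: the statistics are learned once from the training data and then applied everywhere. A minimal sketch with made-up train/test frames:

```python
import numpy as np
from pandas import DataFrame
from sklearn.impute import SimpleImputer

train = DataFrame({'x': [1.0, 3.0, np.nan]})
test = DataFrame({'x': [np.nan, 5.0]})

imp = SimpleImputer(strategy='mean', missing_values=np.nan)
imp.fit(train)                  # learns the mean (2.0) from the training data
train_f = imp.transform(train)  # fills the training gap with 2.0
test_f = imp.transform(test)    # reuses the training mean, not the test mean
print(test_f[:, 0])
```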
So let's impute a constant value for each distinct type of variable and join the results.
from sklearn.impute import SimpleImputer
from pandas import concat, DataFrame
tmp_nr, tmp_sb, tmp_bool = None, None, None
if len(numeric_vars) > 0:
    imp = SimpleImputer(strategy='constant', fill_value=0, missing_values=np.nan, copy=True)
    tmp_nr = DataFrame(imp.fit_transform(data[numeric_vars]), columns=numeric_vars)
if len(symbolic_vars) > 0:
    imp = SimpleImputer(strategy='constant', fill_value='NA', missing_values=np.nan, copy=True)
    tmp_sb = DataFrame(imp.fit_transform(data[symbolic_vars]), columns=symbolic_vars)
if len(boolean_vars) > 0:
    imp = SimpleImputer(strategy='constant', fill_value=False, missing_values=np.nan, copy=True)
    tmp_bool = DataFrame(imp.fit_transform(data[boolean_vars]), columns=boolean_vars)
data = concat([tmp_nr, tmp_sb, tmp_bool], axis=1)
data.to_csv(f'data/{file}_mv_constant.csv', index=False)
data.describe(include='all')
| pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 200.000000 | 200.000000 | 200.00000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200 | 200 | 200 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 3 | 4 |
top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | high | medium | winter |
freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 84 | 84 | 62 |
mean | 7.971700 | 9.026600 | 41.45500 | 3.250000 | 152.903000 | 82.492650 | 110.435300 | 12.728550 | NaN | NaN | NaN |
std | 0.822866 | 2.547113 | 46.62442 | 3.771674 | 179.765702 | 116.491727 | 102.306051 | 20.082659 | NaN | NaN | NaN |
min | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN |
25% | 7.700000 | 7.675000 | 9.04500 | 1.272500 | 34.247500 | 15.367500 | 18.822500 | 1.300000 | NaN | NaN | NaN |
50% | 8.055000 | 9.800000 | 31.04500 | 2.655000 | 98.255000 | 40.150000 | 83.220000 | 4.400000 | NaN | NaN | NaN |
75% | 8.400000 | 10.800000 | 56.97750 | 4.422500 | 199.850000 | 102.082500 | 179.140000 | 17.200000 | NaN | NaN | NaN |
max | 9.700000 | 13.400000 | 391.50000 | 45.650000 | 931.830000 | 771.600000 | 558.750000 | 110.460000 | NaN | NaN | NaN |
Be aware that filling missing values with already existing values, such as 0, -1 or False, changes the data distribution. For this reason, it is usual to apply the mean and mode instead.
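The distortion is easy to see on a toy series: filling with 0 drags the mean down, while filling with the mean keeps it unchanged (sketch with made-up numbers):

```python
import numpy as np
import pandas as pd

s = pd.Series([8.0, 9.0, np.nan, 10.0])  # mean of the observed values is 9.0

filled_zero = s.fillna(0)        # constant fill distorts the distribution
filled_mean = s.fillna(s.mean()) # mean fill preserves the mean

print(filled_zero.mean())  # pulled towards 0
print(filled_mean.mean())  # unchanged
```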
# reload the original data, since the previous step replaced every missing value with a constant
data = pd.read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
tmp_nr, tmp_sb, tmp_bool = None, None, None
if len(numeric_vars) > 0:
    imp = SimpleImputer(strategy='mean', missing_values=np.nan, copy=True)
    tmp_nr = DataFrame(imp.fit_transform(data[numeric_vars]), columns=numeric_vars)
if len(symbolic_vars) > 0:
    imp = SimpleImputer(strategy='most_frequent', missing_values=np.nan, copy=True)
    tmp_sb = DataFrame(imp.fit_transform(data[symbolic_vars]), columns=symbolic_vars)
if len(boolean_vars) > 0:
    imp = SimpleImputer(strategy='most_frequent', missing_values=np.nan, copy=True)
    tmp_bool = DataFrame(imp.fit_transform(data[boolean_vars]), columns=boolean_vars)
data = concat([tmp_nr, tmp_sb, tmp_bool], axis=1)
data.to_csv(f'data/{file}_mv_most_frequent.csv', index=False)
data.describe(include='all')
| pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 200.000000 | 200.000000 | 200.00000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200 | 200 | 200 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 3 | 4 |
top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | high | medium | winter |
freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 84 | 84 | 62 |
mean | 7.971700 | 9.026600 | 41.45500 | 3.250000 | 152.903000 | 82.492650 | 110.435300 | 12.728550 | NaN | NaN | NaN |
std | 0.822866 | 2.547113 | 46.62442 | 3.771674 | 179.765702 | 116.491727 | 102.306051 | 20.082659 | NaN | NaN | NaN |
min | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN |
25% | 7.700000 | 7.675000 | 9.04500 | 1.272500 | 34.247500 | 15.367500 | 18.822500 | 1.300000 | NaN | NaN | NaN |
50% | 8.055000 | 9.800000 | 31.04500 | 2.655000 | 98.255000 | 40.150000 | 83.220000 | 4.400000 | NaN | NaN | NaN |
75% | 8.400000 | 10.800000 | 56.97750 | 4.422500 | 199.850000 | 102.082500 | 179.140000 | 17.200000 | NaN | NaN | NaN |
max | 9.700000 | 13.400000 | 391.50000 | 45.650000 | 931.830000 | 771.600000 | 558.750000 | 110.460000 | NaN | NaN | NaN |
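Whatever strategy is chosen, it is worth checking afterwards that no missing values survived the imputation, since a single leftover NaN is enough to break a scikit-learn estimator. A one-line sanity check, sketched here on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan], 'y': ['a', None]})

# fill each column with a value appropriate to its type
filled = toy.fillna({'x': toy['x'].mean(), 'y': 'NA'})

# total count of remaining missing values; should be zero after imputation
remaining = int(filled.isna().sum().sum())
print(remaining)
```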