import pandas as pd
from pandas.plotting import register_matplotlib_converters
import ds_charts as ds
register_matplotlib_converters()
file = 'algae'
filename = 'data/algae.csv'
data = pd.read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
variable_types = ds.get_variable_types(data)
numeric_vars = variable_types['numeric']
symbolic_vars = variable_types['symbolic']
boolean_vars = variable_types['binary']
Scaling transformations may be accomplished using both the StandardScaler and MinMaxScaler classes from the sklearn.preprocessing package.
However, they only make sense for numeric variables (symbolic and boolean ones are left unchanged here), so we need to apply the transformation to the numeric variables alone and then rejoin the data in order to end up with a single dataframe. Be careful with missing values as well: recent scikit-learn versions disregard them when fitting and keep them in the transformed output, so they still have to be handled at some point. To do this, we split our dataframe into three dataframes, one for each of the data types identified above, discarding date variables, since the majority of techniques are not able to deal with them.
df_nr = data[numeric_vars]
df_sb = data[symbolic_vars]
df_bool = data[boolean_vars]
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from pandas import concat
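# Fit the z-score scaler on the numeric variables only
transf = StandardScaler(with_mean=True, with_std=True, copy=True).fit(df_nr)
# Reuse the original index so that concat aligns the scaled rows with the other dataframes
tmp = pd.DataFrame(transf.transform(df_nr), index=data.index, columns=numeric_vars)
norm_data_zscore = concat([tmp, df_sb, df_bool], axis=1)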
norm_data_zscore.to_csv(f'data/{file}_scaled_zscore.csv', index=False)
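To make the z-score transformation concrete, here is a minimal sketch on a toy dataframe (the values and the name `toy` are made up for illustration; this is not the algae data). It checks that StandardScaler amounts to subtracting each column's mean and dividing by its population standard deviation (`ddof=0`), which differs from the sample standard deviation that pandas computes by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only (hypothetical values, not the algae dataset)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 100.0]})

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: (x - mean) / population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

zscore_matches = np.allclose(scaled, manual.to_numpy())
print(zscore_matches)
```

After scaling, every column has mean 0 and standard deviation 1, regardless of its original units.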
And then we do the same with the MinMaxScaler. Note the use of the copy parameter on both scalers, in order to keep the original data untouched.
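# Fit the min-max scaler on the numeric variables only
transf = MinMaxScaler(feature_range=(0, 1), copy=True).fit(df_nr)
# Reuse the original index so that concat aligns the scaled rows with the other dataframes
tmp = pd.DataFrame(transf.transform(df_nr), index=data.index, columns=numeric_vars)
norm_data_minmax = concat([tmp, df_sb, df_bool], axis=1)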
norm_data_minmax.to_csv(f'data/{file}_scaled_minmax.csv', index=False)
print(norm_data_minmax.describe())
               pH      Oxygen    Chloride    Nitrates    Ammonium  \
count  199.000000  198.000000  190.000000  198.000000  198.000000
mean     0.588234    0.640149    0.110961    0.070895    0.161246
std      0.145927    0.200946    0.119687    0.082817    0.194222
min      0.000000    0.000000    0.000000    0.000000    0.000000
25%      0.512195    0.523109    0.027512    0.027357    0.033043
50%      0.600000    0.697479    0.083086    0.057566    0.102138
75%      0.682927    0.781513    0.147222    0.096436    0.214419
max      1.000000    1.000000    1.000000    1.000000    1.000000

       Orthophosphate   Phosphate  Chlorophyll
count      198.000000  198.000000   188.000000
mean         0.106834    0.198352     0.122587
std          0.151548    0.183229     0.185120
min          0.000000    0.000000     0.000000
25%          0.019465    0.033154     0.018106
50%          0.052427    0.149861     0.047076
75%          0.131388    0.324926     0.165671
max          1.000000    1.000000     1.000000
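The summary above shows every column spanning exactly 0 to 1. As a minimal sketch of why, the snippet below reproduces the min-max computation by hand on a toy dataframe (hypothetical values, not the algae data) and also checks that, with `copy=True`, the input dataframe is left unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only (hypothetical values, not the algae dataset)
toy = pd.DataFrame({'a': [2.0, 4.0, 6.0, 10.0], 'b': [0.5, 1.0, 1.5, 2.5]})
original = toy.copy()

# Scale with scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1), copy=True).fit_transform(toy)

# Same computation by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

minmax_matches = np.allclose(scaled, manual.to_numpy())
input_untouched = toy.equals(original)
print(minmax_matches, input_untouched)
```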
Now we can see the result of the transformations by plotting the original and the scaled data as boxplots, side by side.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 3, figsize=(20,10),squeeze=False)
axs[0, 0].set_title('Original data')
data.boxplot(ax=axs[0, 0])
axs[0, 1].set_title('Z-score normalization')
norm_data_zscore.boxplot(ax=axs[0, 1])
axs[0, 2].set_title('MinMax normalization')
norm_data_minmax.boxplot(ax=axs[0, 2])
plt.show()
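Note the difference between the results obtained with the two scalers. MinMax does not do a great job in the presence of outliers: since the minimum and maximum alone define the scale, a single extreme value pushes the bulk of the data into a narrow band of the [0, 1] interval.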
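The effect of an outlier on min-max scaling can be illustrated with a small sketch. The arrays below are made up for the purpose (they are not taken from the algae data): the same five values are scaled once on their own and once together with a single extreme value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column: five ordinary values, then the same values plus one outlier
regular = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
with_outlier = np.vstack([regular, [[100.0]]])

scaled_regular = MinMaxScaler().fit_transform(regular).ravel()
scaled_outlier = MinMaxScaler().fit_transform(with_outlier).ravel()

print(scaled_regular)            # spread evenly over [0, 1]
print(scaled_outlier.round(3))   # the five ordinary values are squeezed close to 0
```

Without the outlier the five values cover the whole interval. With it, they all fall below 0.05 and the outlier alone sits at 1.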
After preparing the data, it is often useful to write the transformed dataset to a new data file, as follows:
norm_data_zscore.to_csv('data/algae_scaled_zscore.csv', index=False)
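As a quick sanity check, the saved file can be read back and compared with the dataframe in memory. The sketch below does this with a small made-up dataframe and a temporary directory (the name `toy_scaled.csv` and the values are assumptions for illustration), so that it runs without the algae files.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for a scaled dataframe, including one missing value
scaled = pd.DataFrame({'pH': [0.0, 0.5, 1.0], 'Oxygen': [0.25, np.nan, 0.75]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_scaled.csv')
    scaled.to_csv(path, index=False)   # the index is not written to the file
    restored = pd.read_csv(path)

same_columns = list(restored.columns) == list(scaled.columns)
# Compare with a tolerance, treating missing values in the same position as equal
same_values = np.allclose(restored.to_numpy(), scaled.to_numpy(), equal_nan=True)
print(same_columns, same_values)
```

Because `index=False` is used, the index (the dates, in the case of the algae data) is not stored in the file, and the reloaded dataframe gets a default integer index.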