import pandas as pd
from pandas.plotting import register_matplotlib_converters
import ds_charts as ds
register_matplotlib_converters()
file = 'algae'
filename = 'data/algae.csv'
data = pd.read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
variable_types = ds.get_variable_types(data)
numeric_vars = variable_types['numeric']
symbolic_vars = variable_types['symbolic']
boolean_vars = variable_types['binary']
Scaling transformations can be performed with the StandardScaler and MinMaxScaler classes
from the sklearn.preprocessing package.
They only apply to numeric and boolean variables, so we need to apply the transformation to those variables and then rejoin the data in order to keep a single dataframe. Be careful: the scalers can only be applied to numerical data, ideally without missing values (recent scikit-learn versions ignore NaNs when fitting and leave them in place when transforming). To do that, we split our dataframe into three dataframes, one for each of the data types identified above, discarding date variables, since the majority of techniques are not able to deal with them.
df_nr = data[numeric_vars]
df_sb = data[symbolic_vars]
df_bool = data[boolean_vars]
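The split above relies on ds.get_variable_types, whose implementation is not shown here. As a rough sketch of how such a split could be done with pandas alone, the hypothetical helper below (split_by_type, not part of ds_charts) treats any column with exactly two distinct values as binary, any other numeric column as numeric, and everything else as symbolic. The toy dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; these columns are illustrative, not taken from algae.csv
toy = pd.DataFrame({
    'pH': [8.0, 8.3, 7.9],
    'season': ['winter', 'spring', 'autumn'],
    'is_large': [True, False, True],
})

def split_by_type(df: pd.DataFrame) -> dict:
    """Group column names into numeric, binary and symbolic lists (assumed rules)."""
    types = {'numeric': [], 'binary': [], 'symbolic': []}
    for col in df.columns:
        n_unique = df[col].dropna().nunique()
        if n_unique == 2:
            # exactly two distinct values: treat as binary, whatever the dtype
            types['binary'].append(col)
        elif pd.api.types.is_numeric_dtype(df[col]):
            types['numeric'].append(col)
        else:
            types['symbolic'].append(col)
    return types

toy_types = split_by_type(toy)
print(toy_types)
```

The resulting lists can then be used to index the dataframe, in the same way as numeric_vars, symbolic_vars and boolean_vars above.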
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from pandas import concat
transf = StandardScaler(with_mean=True, with_std=True, copy=True).fit(df_nr)
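# keep the original (date) index so that concat aligns rows with df_sb and df_bool
tmp = pd.DataFrame(transf.transform(df_nr), index=df_nr.index, columns=numeric_vars)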
norm_data_zscore = concat([tmp, df_sb, df_bool], axis=1)
norm_data_zscore.to_csv(f'data/{file}_scaled_zscore.csv', index=False)
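To make explicit what StandardScaler computes, the sketch below compares it with the z-score formula (x - mean) / std on a small made-up array. The values are illustrative only and do not come from the algae dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data, for illustration only
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# z-score by hand: subtract the column mean, divide by the column standard deviation
# (np.std uses the population standard deviation, ddof=0, by default)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)

zscore_matches = np.allclose(manual, scaled)
print(zscore_matches)
```

Note that pandas computes the sample standard deviation (ddof=1) by default, so a manual z-score written with DataFrame.std() will differ slightly from the scaler's output unless ddof=0 is passed.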
We then do the same with MinMaxScaler. Note the copy parameter on both scalers,
which keeps the original data untouched.
transf = MinMaxScaler(feature_range=(0, 1), copy=True).fit(df_nr)
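# keep the original (date) index so that concat aligns rows with df_sb and df_bool
tmp = pd.DataFrame(transf.transform(df_nr), index=df_nr.index, columns=numeric_vars)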
norm_data_minmax = concat([tmp, df_sb, df_bool], axis=1)
norm_data_minmax.to_csv(f'data/{file}_scaled_minmax.csv', index=False)
print(norm_data_minmax.describe())
               pH      Oxygen    Chloride    Nitrates    Ammonium  \
count  199.000000  198.000000  190.000000  198.000000  198.000000
mean     0.588234    0.640149    0.110961    0.070895    0.161246
std      0.145927    0.200946    0.119687    0.082817    0.194222
min      0.000000    0.000000    0.000000    0.000000    0.000000
25%      0.512195    0.523109    0.027512    0.027357    0.033043
50%      0.600000    0.697479    0.083086    0.057566    0.102138
75%      0.682927    0.781513    0.147222    0.096436    0.214419
max      1.000000    1.000000    1.000000    1.000000    1.000000

       Orthophosphate   Phosphate  Chlorophyll
count      198.000000  198.000000   188.000000
mean         0.106834    0.198352     0.122587
std          0.151548    0.183229     0.185120
min          0.000000    0.000000     0.000000
25%          0.019465    0.033154     0.018106
50%          0.052427    0.149861     0.047076
75%          0.131388    0.324926     0.165671
max          1.000000    1.000000     1.000000
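The summary above shows every variable with a minimum of 0 and a maximum of 1, which follows directly from the min-max formula (x - min) / (max - min). As a small check on made-up values (not taken from the dataset), the sketch below compares the formula with MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data, for illustration only
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# min-max by hand: shift by the column minimum, divide by the column range
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

minmax_matches = np.allclose(manual, scaled)
print(minmax_matches)
```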
Now we can see the result of the transformations by plotting the original and the transformed data side by side with boxplots.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 3, figsize=(20,10),squeeze=False)
axs[0, 0].set_title('Original data')
data.boxplot(ax=axs[0, 0])
axs[0, 1].set_title('Z-score normalization')
norm_data_zscore.boxplot(ax=axs[0, 1])
axs[0, 2].set_title('MinMax normalization')
norm_data_minmax.boxplot(ax=axs[0, 2])
plt.show()
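Note the difference between the results obtained with the two scalers: MinMax does not do a great job in the presence of outliers, since a single extreme value defines the range and squeezes the remaining values into a narrow band.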
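The effect of an outlier on min-max scaling can be seen on a tiny made-up example. In the sketch below, four ordinary values and one extreme value are scaled by hand with the min-max formula; the numbers are illustrative only.

```python
import numpy as np

# Made-up values: four ordinary observations and one outlier (100.0)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# min-max scaling by hand: the outlier alone determines the maximum
minmax = (values - values.min()) / (values.max() - values.min())

# share of the [0, 1] interval occupied by the four ordinary values
ordinary_span = minmax[:4].max() - minmax[:4].min()

print(minmax.round(3))
print(round(float(ordinary_span), 3))
```

Here the outlier is mapped to 1 while the four ordinary values end up within roughly 3% of the interval, which is the kind of compression visible in the MinMax boxplot for variables with extreme values.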
After preparing the data, it is often useful to write the transformed data to a new file, as follows:
norm_data_zscore.to_csv('data/algae_scaled_zscore.csv', index=False)