Lab 2: Data Preparation (cont.)

Scaling

Scaling transformations can be performed with either the StandardScaler or the MinMaxScaler class from the sklearn.preprocessing package.

Both scalers should only be applied to numeric data without any missing values. To end up with a single dataframe, we therefore need to scale the numeric variables separately and then rejoin them with the remaining ones. In order to do that, we split our dataframe into three dataframes, one for each of the data types identified above, and discard the date variables, since the majority of techniques are not able to deal with them.
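The split described above can be sketched as follows. This is a minimal illustration, not the lab's own code: the toy dataframe, its column names and the variable names df_nr, df_bool and df_sb are hypothetical, and it assumes the three types are numeric, boolean and categorical.

```python
import pandas as pd

# Hypothetical toy data standing in for the lab's dataset.
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "temp": [20.5, 22.0, 19.0],
    "count": [3, 5, 4],
    "raining": [True, False, True],
    "city": pd.Series(["Lisbon", "Porto", "Lisbon"], dtype="category"),
})

# One dataframe per variable type; the date column ends up in none of them.
df_nr = data.select_dtypes(include="number")     # numeric (bool is not included)
df_bool = data.select_dtypes(include="bool")     # boolean
df_sb = data.select_dtypes(include="category")   # categorical (assumed third type)

print(list(df_nr.columns), list(df_bool.columns), list(df_sb.columns))
```

Only df_nr would be passed to the scalers; the other two dataframes are kept aside so they can be rejoined afterwards.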

Standard Scaler

The Standard Scaler implements the z-score transformation, which subtracts the mean of each variable and divides by its standard deviation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).
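A minimal sketch of this step, assuming a small hypothetical numeric dataframe df_nr and a boolean dataframe df_bool (both names and values are illustrative only). The scaler is fitted on the numeric columns, and the result is rejoined with the boolean ones.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric and boolean parts of a dataset, sharing the same index.
df_nr = pd.DataFrame({"temp": [20.0, 22.0, 18.0, 40.0], "count": [3.0, 5.0, 4.0, 4.0]})
df_bool = pd.DataFrame({"raining": [True, False, True, False]})

# copy=True keeps df_nr untouched; transform returns a new NumPy array.
transf = StandardScaler(with_mean=True, with_std=True, copy=True).fit(df_nr)
norm_zscore = pd.DataFrame(
    transf.transform(df_nr), index=df_nr.index, columns=df_nr.columns
)

# Rejoin the scaled numeric columns with the untouched boolean ones.
norm_data_zscore = pd.concat([norm_zscore, df_bool], axis=1)
print(norm_data_zscore.describe())
```

After the transformation each numeric column has mean 0 and unit standard deviation (computed as the population standard deviation, which is what StandardScaler uses).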

And then we do the same with the MinMaxScaler. Note the use of the parameter copy on both scalers, in order to keep the original data untouched.
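The same pattern with MinMaxScaler might look like the sketch below. The dataframe df_nr is again a hypothetical stand-in, and the feature_range of (0, 1) is simply the class default written out explicitly.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data; 40.0 plays the role of an outlier in "temp".
df_nr = pd.DataFrame({"temp": [20.0, 22.0, 18.0, 40.0], "count": [3.0, 5.0, 4.0, 4.0]})

# copy=True leaves df_nr unchanged; each column is mapped onto [0, 1].
transf = MinMaxScaler(feature_range=(0, 1), copy=True).fit(df_nr)
norm_minmax = pd.DataFrame(
    transf.transform(df_nr), index=df_nr.index, columns=df_nr.columns
)
print(norm_minmax)
```

Each column now has its minimum at 0 and its maximum at 1, while df_nr still holds the original values.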

Now we can see the result of the transformed data with a single boxplot, again.
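One way to draw such a comparison is sketched below, assuming matplotlib is available. The toy data, the figure layout with one panel per version of the data and the panel titles are all illustrative choices, not the lab's original plotting code.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric data; 40.0 is an outlier in "temp".
df_nr = pd.DataFrame({"temp": [20.0, 22.0, 18.0, 40.0], "count": [3.0, 5.0, 4.0, 4.0]})

norm_zscore = pd.DataFrame(
    StandardScaler().fit_transform(df_nr), index=df_nr.index, columns=df_nr.columns
)
norm_minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(df_nr), index=df_nr.index, columns=df_nr.columns
)

# One panel per version of the data, side by side in a single figure.
panels = [
    ("Original data", df_nr),
    ("Z-score normalization", norm_zscore),
    ("MinMax normalization", norm_minmax),
]
fig, axs = plt.subplots(1, 3, figsize=(12, 4), squeeze=False)
for ax, (title, frame) in zip(axs[0], panels):
    ax.set_title(title)
    frame.boxplot(ax=ax)
fig.tight_layout()
# Call plt.show() in a script, or let the notebook render the figure.
```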

Note the difference between the results obtained with the two scalers: MinMax does not cope well with outliers, since a single extreme value defines the range and squeezes the remaining values into a narrow band.

Writing to a file

After preparing the data, it is often useful to write the transformed data to a new data file, for example with the to_csv method of the dataframe.

Note that the parameter index is set to False, since we are no longer using the date as the index. If that were not the case, we would keep index set to True and name the index column 'date' through the index_label parameter, matching the index_col parameter used on the read_csv call.
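A minimal sketch of the write step, assuming a small hypothetical dataframe norm_data_zscore. The file name is invented for illustration, and a temporary directory is used so the example runs anywhere; in the lab the file would go next to the original data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical scaled data with a default integer index (no date index).
norm_data_zscore = pd.DataFrame({"temp": [-0.5, 0.5], "count": [1.0, -1.0]})

# Hypothetical output location and file name.
out_file = Path(tempfile.mkdtemp()) / "data_scaled_zscore.csv"

# index=False: the integer index carries no information, so it is not written.
norm_data_zscore.to_csv(out_file, index=False)

# Reading the file back yields only the data columns.
reloaded = pd.read_csv(out_file)
print(reloaded.columns.tolist())
```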