Dealing with nominal variables demands additional transformations for some mining techniques, in particular the ones that depend on similarity measures, where the distance between distinct values is of major importance. The easiest transformation of such variables is called dummification: it creates a new binary variable for each possible value of the original variable and then removes the original variable from the dataset. Note, however, that this shouldn't be applied to the class variable, since it would transform a simple multiclass classification problem into a multi-label one.
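To make the transformation concrete, here is a minimal sketch in plain Python. The helper `dummify_values` and the toy `seasons` list are illustrative assumptions, not part of the original tutorial. It shows that, once dummified, any two distinct values are equally far apart, which is what similarity-based techniques need from a nominal variable.

```python
from math import dist

def dummify_values(values):
    """Hypothetical helper: one 0/1 indicator per distinct value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Toy data, made up for illustration
seasons = ["winter", "spring", "summer", "winter"]
categories, encoded = dummify_values(seasons)

# Euclidean distances between the encoded records
d_winter_spring = dist(encoded[0], encoded[1])
d_winter_summer = dist(encoded[0], encoded[2])
d_winter_winter = dist(encoded[0], encoded[3])

print(categories)
print(encoded)
print(d_winter_spring, d_winter_summer, d_winter_winter)
```

Both distances between different seasons are the same, and the distance between two equal seasons is zero, so no artificial ordering is imposed on the values.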
from pandas import read_csv, DataFrame, concat
from pandas.plotting import register_matplotlib_converters
from ds_charts import get_variable_types
register_matplotlib_converters()
file = 'algae'
filename = 'data/algae.csv'
data = read_csv(filename, index_col='date', na_values='', parse_dates=True, infer_datetime_format=True)
# Drop out all records with missing values
data.dropna(inplace=True)
variable_types = get_variable_types(data)
numeric_vars = variable_types['numeric']
symbolic_vars = variable_types['symbolic']
boolean_vars = variable_types['binary']
We can use the OneHotEncoder from the sklearn.preprocessing package to apply dummification. The pandas.get_dummies function is much less interesting, since it isn't able to apply the same encoder to different parts of a dataset, while OneHotEncoder is.
Be careful with missing values, since dummification only works if there are no missing values in the variables to dummify.
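One way to guard against this is to check the variables before encoding them. The helper below, `check_no_missing`, is a hypothetical name used only for this sketch, and the toy dataframe is made up for illustration.

```python
import pandas as pd

def check_no_missing(df, variables):
    """Hypothetical guard: fail early if any variable has missing values."""
    missing = df[variables].isna().sum()
    offending = missing[missing > 0]
    if not offending.empty:
        raise ValueError(f"missing values in: {list(offending.index)}")

# Toy data with one missing value in 'season'
toy = pd.DataFrame({
    "season": ["winter", None, "summer"],
    "river_depth": ["low", "high", "medium"],
})

try:
    check_no_missing(toy, ["season", "river_depth"])
    failed = False
except ValueError as err:
    failed = True
    message = str(err)

# After dropping incomplete records the check passes silently
check_no_missing(toy.dropna(), ["season", "river_depth"])
```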
For example, after dummifying the algae dataframe, we get a new one with 18 variables instead of the 11 original ones: two of the three symbolic variables (fluid_velocity and river_depth) had three different values each, and the third (season) had four.
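The count can be checked with a little arithmetic. The numbers of distinct values below are read from the column names in the output shown further down (three for fluid_velocity, three for river_depth, four for season); the variable names are only illustrative.

```python
# 8 numeric variables + 3 symbolic ones
n_original = 11
distinct_values = {"fluid_velocity": 3, "river_depth": 3, "season": 4}

# Each symbolic variable is removed and replaced by one column per value
n_after = n_original - len(distinct_values) + sum(distinct_values.values())
print(n_after)
```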
from sklearn.preprocessing import OneHotEncoder

def dummify(df, vars_to_dummify):
    # Variables that are kept unchanged
    other_vars = [c for c in df.columns if c not in vars_to_dummify]
    # scikit-learn >= 1.2 uses sparse_output; older releases call this parameter sparse
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    X = df[vars_to_dummify]
    encoder.fit(X)
    # scikit-learn >= 1.0; older releases call this method get_feature_names
    new_vars = encoder.get_feature_names_out(vars_to_dummify)
    trans_X = encoder.transform(X)
    dummy = DataFrame(trans_X, columns=new_vars, index=X.index)
    # Join the untouched variables with the new dummy variables
    final_df = concat([df[other_vars], dummy], axis=1)
    return final_df

df = dummify(data, symbolic_vars)
df.to_csv(f'data/{file}_dummified.csv', index=False)
df.describe(include='all')
| | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity_high | fluid_velocity_low | fluid_velocity_medium | river_depth_high | river_depth_low | river_depth_medium | season_autumn | season_spring | season_summer | season_winter |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 | 184.000000 |
mean | 8.078315 | 9.018587 | 44.881467 | 3.384457 | 164.432609 | 88.745543 | 118.242880 | 13.443261 | 0.413043 | 0.168478 | 0.418478 | 0.228261 | 0.320652 | 0.451087 | 0.195652 | 0.260870 | 0.233696 | 0.309783 |
std | 0.471697 | 2.407158 | 47.066649 | 3.874723 | 182.693434 | 119.217311 | 102.185903 | 20.213400 | 0.493724 | 0.375312 | 0.494655 | 0.420857 | 0.468001 | 0.498959 | 0.397784 | 0.440307 | 0.424335 | 0.463666 |
min | 7.000000 | 1.500000 | 0.800000 | 0.050000 | 5.800000 | 1.250000 | 0.900000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 7.777500 | 7.675000 | 11.857500 | 1.362500 | 43.937500 | 19.000000 | 27.125000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 8.100000 | 9.750000 | 35.080000 | 2.820000 | 103.165000 | 47.500000 | 90.370000 | 5.215000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 8.400000 | 10.700000 | 58.515000 | 4.545000 | 211.285000 | 104.557500 | 187.462500 | 18.300000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 |
max | 9.500000 | 13.400000 | 391.500000 | 45.650000 | 931.830000 | 771.600000 | 558.750000 | 110.460000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |