Lab23_dummification

Variable Dummification

Dealing with nominal variables demands additional transformations for some mining techniques, in particular the ones that depend on similarity measures, where the distance between distinct values is of major importance. The simplest transformation of such variables is called dummification, and consists of creating a new binary variable for each possible value of the original variable, which is then removed from the dataset. Note, however, that this shouldn't be applied to the class variable, since it would transform a simple multiclass classification problem into a multi-label one.
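
To make the transformation concrete, the sketch below dummifies a single variable by hand. The toy dataframe and its values are assumptions made up for illustration, not records from the algae dataset.

```python
from pandas import DataFrame

# Hypothetical toy data, only for illustration
toy = DataFrame({'pH': [8.0, 8.3, 7.9], 'season': ['winter', 'spring', 'winter']})

# Remove the original variable and add one 0/1 variable per distinct value
dummified = toy.drop(columns=['season'])
for value in sorted(toy['season'].unique()):
    dummified[f'season_{value}'] = (toy['season'] == value).astype(int)

print(dummified)
```

Each record ends up with exactly one of the new variables set to 1, so the distance between any two distinct values becomes the same.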

In [1]:

from pandas import read_csv, DataFrame, concat
from pandas.plotting import register_matplotlib_converters
from ds_charts import get_variable_types

register_matplotlib_converters()
file = 'algae'
filename = 'data/algae.csv'
# infer_datetime_format is omitted: it is deprecated since pandas 2.0, where format inference is the default
data = read_csv(filename, index_col='date', na_values='', parse_dates=True)

# Drop all records with missing values
data.dropna(inplace=True)

variable_types = get_variable_types(data)
numeric_vars = variable_types['numeric']
symbolic_vars = variable_types['symbolic']
boolean_vars = variable_types['binary']

We can use the OneHotEncoder class, from the sklearn.preprocessing package, to apply dummification. The pandas.get_dummies function is much less useful here, since it keeps no fitted state and so cannot apply the same encoding to different parts of a dataset (for example, to the train and test sets), whereas a fitted OneHotEncoder can.

Be careful with missing values, since dummification only works if there are no missing values in the variables to dummify.
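
One possible way to check this before dummifying is sketched below. The toy dataframe is an assumption, and dropping the incomplete records is just one option (the same one used in this lab).

```python
from pandas import DataFrame

# Hypothetical toy data with one missing value, only for illustration
toy = DataFrame({'season': ['winter', None, 'spring'],
                 'river_depth': ['low', 'high', 'low']})
vars_to_dummify = ['season', 'river_depth']

# Count the missing values in each variable to dummify
missing_counts = toy[vars_to_dummify].isna().sum()
vars_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(vars_with_missing)

# Keep only the records that are complete on those variables
clean = toy.dropna(subset=vars_to_dummify)
```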

For example, after dummifying the algae dataframe, we get a new one with 18 variables instead of the original 11: the three symbolic variables are replaced by ten dummy variables, since fluid_velocity and river_depth each have three distinct values and season has four.
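
The number of variables to expect can be computed in advance by counting the distinct values of each symbolic variable. The sketch below does so on a hypothetical toy dataframe, not on the algae data.

```python
from pandas import DataFrame

# Hypothetical toy data, only for illustration
toy = DataFrame({'pH': [8.0, 8.3, 7.9, 8.1],
                 'season': ['autumn', 'spring', 'summer', 'winter'],
                 'river_depth': ['low', 'medium', 'high', 'low']})
symbolic = ['season', 'river_depth']

# Each symbolic variable is replaced by one variable per distinct value
n_dummies = int(toy[symbolic].nunique().sum())
expected = toy.shape[1] - len(symbolic) + n_dummies
print(n_dummies, expected)
```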

In [2]:

from sklearn.preprocessing import OneHotEncoder

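def dummify(df, vars_to_dummify):
    # Variables that are kept untouched
    other_vars = [c for c in df.columns if c not in vars_to_dummify]
    # sparse_output replaces the sparse parameter (scikit-learn >= 1.2)
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    X = df[vars_to_dummify]
    encoder.fit(X)
    # get_feature_names_out replaces get_feature_names (scikit-learn >= 1.0)
    new_vars = encoder.get_feature_names_out(vars_to_dummify)
    trans_X = encoder.transform(X)
    dummy = DataFrame(trans_X, columns=new_vars, index=X.index)
    # Join the untouched variables with the new dummy variables
    final_df = concat([df[other_vars], dummy], axis=1)
    return final_df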

df = dummify(data, symbolic_vars)
df.to_csv(f'data/{file}_dummified.csv', index=False)

df.describe(include='all')

Out[2]:

	pH	Oxygen	Chloride	Nitrates	Ammonium	Orthophosphate	Phosphate	Chlorophyll	fluid_velocity_high	fluid_velocity_low	fluid_velocity_medium	river_depth_high	river_depth_low	river_depth_medium	season_autumn	season_spring	season_summer	season_winter
count	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000	184.000000
mean	8.078315	9.018587	44.881467	3.384457	164.432609	88.745543	118.242880	13.443261	0.413043	0.168478	0.418478	0.228261	0.320652	0.451087	0.195652	0.260870	0.233696	0.309783
std	0.471697	2.407158	47.066649	3.874723	182.693434	119.217311	102.185903	20.213400	0.493724	0.375312	0.494655	0.420857	0.468001	0.498959	0.397784	0.440307	0.424335	0.463666
min	7.000000	1.500000	0.800000	0.050000	5.800000	1.250000	0.900000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	7.777500	7.675000	11.857500	1.362500	43.937500	19.000000	27.125000	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	8.100000	9.750000	35.080000	2.820000	103.165000	47.500000	90.370000	5.215000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	8.400000	10.700000	58.515000	4.545000	211.285000	104.557500	187.462500	18.300000	1.000000	0.000000	1.000000	0.000000	1.000000	1.000000	0.000000	1.000000	0.000000	1.000000
max	9.500000	13.400000	391.500000	45.650000	931.830000	771.600000	558.750000	110.460000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
