Data balancing techniques are needed in the presence of unbalanced datasets, i.e. when the target variable does not have a uniform distribution and the classes are not equiprobable. In binary classification, we usually call the minority class positive and the majority class negative.
Let's consider an unbalanced dataset whose target is the Outcome variable, with two possible values: Active, the minority class, and Inactive, the majority class. The chart below shows the original target distribution; a later chart shows the resulting distribution after applying each strategy.
import pandas as pd
import matplotlib.pyplot as plt
import ds_charts as ds

filename = 'data/unbalanced.csv'
file = "unbalanced"
original = pd.read_csv(filename, sep=',', decimal='.')

class_var = 'Outcome'
target_count = original[class_var].value_counts()
positive_class = target_count.idxmin()  # minority class
negative_class = target_count.idxmax()  # majority class
print('Minority class=', positive_class, ':', target_count[positive_class])
print('Majority class=', negative_class, ':', target_count[negative_class])
print('Proportion:', round(target_count[positive_class] / target_count[negative_class], 2), ': 1')

values = {'Original': [target_count[positive_class], target_count[negative_class]]}
plt.figure()
ds.bar_chart(target_count.index, target_count.values, title='Class balance')
plt.savefig(f'images/{file}_balance.png')
plt.show()
Minority class= Active : 12
Majority class= Inactive : 844
Proportion: 0.01 : 1
Before proceeding, let's split the dataset into two subsets, one per class. Then we can sample the required one and join it with the other, as we did in the other preparation techniques. In the end, we write the resulting dataset to a new file, to explore later.
from pandas import concat, DataFrame
df_positives = original[original[class_var] == positive_class]
df_negatives = original[original[class_var] == negative_class]
We can follow two different strategies: undersampling and oversampling. The choice between them depends on the size of the dataset, i.e., on the number of records available for training:
df_neg_sample = DataFrame(df_negatives.sample(len(df_positives)))
df_under = concat([df_positives, df_neg_sample], axis=0)
df_under.to_csv(f'data/{file}_under.csv', index=False)
values['UnderSample'] = [len(df_positives), len(df_neg_sample)]
print('Minority class=', positive_class, ':', len(df_positives))
print('Majority class=', negative_class, ':', len(df_neg_sample))
print('Proportion:', round(len(df_positives) / len(df_neg_sample), 2), ': 1')
Minority class= Active : 12
Majority class= Inactive : 12
Proportion: 1.0 : 1
This implements undersampling; in a similar way, we get oversampling by replication. Note the replace parameter in the sample method: it means we are taking a sample with replacement, so the same record may be picked more than once.
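As a minimal illustration on a toy frame (values are arbitrary), sampling with replacement even lets us draw more rows than the frame contains, which necessarily repeats records:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# replace=True allows a sample larger than the frame itself;
# with 6 draws from 3 rows, at least one row must repeat
big = df.sample(6, replace=True, random_state=0)
print(len(big))             # 6
print(big.index.is_unique)  # False: some record was picked more than once
```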
df_pos_sample = DataFrame(df_positives.sample(len(df_negatives), replace=True))
df_over = concat([df_pos_sample, df_negatives], axis=0)
df_over.to_csv(f'data/{file}_over.csv', index=False)
values['OverSample'] = [len(df_pos_sample), len(df_negatives)]
print('Minority class=', positive_class, ':', len(df_pos_sample))
print('Majority class=', negative_class, ':', len(df_negatives))
print('Proportion:', round(len(df_pos_sample) / len(df_negatives), 2), ': 1')
plt.figure()
ds.multiple_bar_chart([positive_class, negative_class], values,
title='Target', xlabel='frequency', ylabel='Class balance')
plt.show()
Minority class= Active : 844
Majority class= Inactive : 844
Proportion: 1.0 : 1
Among the different oversampling strategies, one of the most interesting is SMOTE. In this case, the oversample is created from the minority class by artificially generating new records in the neighborhood of the positive ones.
It is usual to adopt a hybrid approach: choose a number of records between the number of positives and negatives, say N, take a sample of N records from the negatives, and generate new positive records until that class also reaches N records.
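The hybrid approach just described can be sketched with pandas and numpy alone on toy data. This is a simplified version: real SMOTE interpolates towards one of the k nearest neighbours chosen at random, while here we always use the single nearest one, and the frames, column names and N are all illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy unbalanced data: 6 positives, 20 negatives (values are arbitrary).
pos = pd.DataFrame(rng.normal(0, 1, (6, 3)), columns=['a', 'b', 'c'])
neg = pd.DataFrame(rng.normal(3, 1, (20, 3)), columns=['a', 'b', 'c'])

N = 12  # hybrid target size: between len(pos) = 6 and len(neg) = 20

# 1) undersample the negatives down to N records
neg_sample = neg.sample(N, random_state=42)

# 2) generate synthetic positives until that class also reaches N,
#    interpolating between a record and its nearest positive neighbour
X = pos.values
synthetic = []
while len(pos) + len(synthetic) < N:
    i = rng.integers(len(X))
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                 # exclude the record itself
    j = int(np.argmin(dist))         # nearest neighbour
    t = rng.random()                 # interpolation factor in [0, 1)
    synthetic.append(X[i] + t * (X[j] - X[i]))

pos_over = pd.concat([pos, pd.DataFrame(synthetic, columns=pos.columns)],
                     ignore_index=True)
df_hybrid = pd.concat([pos_over, neg_sample], ignore_index=True)
print(len(pos_over), len(neg_sample))  # 12 12
```

Both classes end up with N records: the negatives shrink by sampling and the positives grow with synthetic neighbours.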
from imblearn.over_sampling import SMOTE
RANDOM_STATE = 42
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
y = original.pop(class_var).values
X = original.values
smote_X, smote_y = smote.fit_resample(X, y)
df_smote = concat([DataFrame(smote_X), DataFrame(smote_y)], axis=1)
df_smote.columns = list(original.columns) + [class_var]
df_smote.to_csv(f'data/{file}_smote.csv', index=False)
smote_target_count = pd.Series(smote_y).value_counts()
values['SMOTE'] = [smote_target_count[positive_class], smote_target_count[negative_class]]
print('Minority class=', positive_class, ':', smote_target_count[positive_class])
print('Majority class=', negative_class, ':', smote_target_count[negative_class])
print('Proportion:', round(smote_target_count[positive_class] / smote_target_count[negative_class], 2), ': 1')
print(df_smote.describe())
Minority class= Active : 844
Majority class= Inactive : 844
Proportion: 1.0 : 1
(df_smote.describe() output omitted: summary statistics for the 32 variables over the 1688 resulting records, with count = 1688 for every column.)
Note that for the SMOTE method we have to split the original data in two: one part with just the class variable, call it y, and another with all the remaining variables, call it X (see the Classification lab for more details about this).
Then the SMOTE technique generates the positive records itself, so there is no need to join the positives and negatives ourselves. Indeed, all we have to do is rejoin the data (smote_X) with the corresponding, already updated, class (smote_y).
plt.figure()
ds.multiple_bar_chart([positive_class, negative_class], values,
title='Target', xlabel='frequency', ylabel='Class balance')
plt.show()