Data balancing techniques are needed in the presence of unbalanced datasets, i.e. when the target variable does not have a uniform distribution and the classes are not equiprobable. In binary classification, we usually call the minority class positive and the majority class negative.
Let's consider an unbalanced dataset whose target is the Outcome variable, with two possible values: Active, the minority class, and Inactive, the majority class. The following chart shows the original target distribution; the subsequent ones show the resulting distribution after applying each strategy.
import pandas as pd
import matplotlib.pyplot as plt
import ds_charts as ds

filename = 'data/unbalanced.csv'
file = 'unbalanced'
original = pd.read_csv(filename, sep=',', decimal='.')

class_var = 'Outcome'
target_count = original[class_var].value_counts()
positive_class = target_count.idxmin()  # minority class
negative_class = target_count.idxmax()  # majority class

print('Minority class=', positive_class, ':', target_count[positive_class])
print('Majority class=', negative_class, ':', target_count[negative_class])
print('Proportion:', round(target_count[positive_class] / target_count[negative_class], 2), ': 1')

values = {'Original': [target_count[positive_class], target_count[negative_class]]}

plt.figure()
ds.bar_chart(target_count.index, target_count.values, title='Class balance')
plt.savefig(f'images/{file}_balance.png')
plt.show()
Minority class= Active : 12
Majority class= Inactive : 844
Proportion: 0.01 : 1
Before proceeding, let's split the dataset into two subsets, one for each class. Then we can sample the required one and join it to the other, as we did with the other preparation techniques. In the end, we write the resulting dataset to a new data file to explore later.
from pandas import concat, DataFrame
df_positives = original[original[class_var] == positive_class]
df_negatives = original[original[class_var] == negative_class]
We can follow two different strategies: undersampling and oversampling. The choice between them depends on the size of the dataset, i.e., the number of records available for training:
df_neg_sample = DataFrame(df_negatives.sample(len(df_positives)))
df_under = concat([df_positives, df_neg_sample], axis=0)
df_under.to_csv(f'data/{file}_under.csv', index=False)
values['UnderSample'] = [len(df_positives), len(df_neg_sample)]
print('Minority class=', positive_class, ':', len(df_positives))
print('Majority class=', negative_class, ':', len(df_neg_sample))
print('Proportion:', round(len(df_positives) / len(df_neg_sample), 2), ': 1')
Minority class= Active : 12
Majority class= Inactive : 12
Proportion: 1.0 : 1
This implements undersampling; in a similar way, we get oversampling by replication. Note the replace parameter in the sample method: it specifies sampling with replacement, meaning the same record may be picked more than once.
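The effect of replace can be seen on a small toy frame (hypothetical data, not part of our dataset): with replacement we can draw more rows than exist, while without it pandas raises an error.

```python
import pandas as pd

# Toy minority set with only 3 records (hypothetical data).
df = pd.DataFrame({'x': [1, 2, 3]})

# Sampling 6 rows with replacement works: records repeat.
over = df.sample(6, replace=True, random_state=42)
print(len(over))  # 6

# Without replacement, asking for more rows than exist fails.
try:
    df.sample(6, replace=False)
except ValueError:
    print('cannot oversample without replacement')
```

This is why replace=True is required when replicating the positives up to the size of the negatives.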
df_pos_sample = DataFrame(df_positives.sample(len(df_negatives), replace=True))
df_over = concat([df_pos_sample, df_negatives], axis=0)
df_over.to_csv(f'data/{file}_over.csv', index=False)
values['OverSample'] = [len(df_pos_sample), len(df_negatives)]
print('Minority class=', positive_class, ':', len(df_pos_sample))
print('Majority class=', negative_class, ':', len(df_negatives))
print('Proportion:', round(len(df_pos_sample) / len(df_negatives), 2), ': 1')
plt.figure()
ds.multiple_bar_chart([positive_class, negative_class], values,
title='Target', xlabel='frequency', ylabel='Class balance')
plt.show()
Minority class= Active : 844
Majority class= Inactive : 844
Proportion: 1.0 : 1
Among the different oversampling strategies, SMOTE is one of the most interesting. In this case, the oversample is created from the minority class by artificially generating new records in the neighborhood of the existing positive records.
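The core idea can be sketched in a few lines of NumPy (a simplified illustration, not the imblearn implementation): each synthetic record is placed at a random point on the segment between a positive record and one of its k nearest positive neighbors.

```python
import numpy as np

def smote_sketch(X_pos, n_new, k=3, rng=np.random.default_rng(42)):
    """Generate n_new synthetic records by interpolating between
    minority-class records and their k nearest minority neighbors."""
    n = len(X_pos)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from record i to all other positive records
        dist = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]  # skip the record itself
        j = rng.choice(neighbors)
        gap = rng.random()                     # position along the segment
        synthetic.append(X_pos[i] + gap * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)

# Toy 2D minority class (hypothetical data).
X_pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
new_records = smote_sketch(X_pos, n_new=4)
print(new_records.shape)  # (4, 2)
```

Because each new record is a convex combination of two existing positives, the synthetic points always stay within the region occupied by the minority class.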
It is usual to adopt a hybrid approach: choose a number of records N between the number of positives and the number of negatives, take a sample of N records from the negatives, and generate new positive records until reaching that same number.
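The hybrid approach can be sketched with plain pandas. The class sizes below mirror our dataset, but df_positives and df_negatives are toy stand-ins; for simplicity the positives are grown by replication here, whereas SMOTE would generate them in the neighborhood instead.

```python
import pandas as pd

# Toy stand-ins for df_positives / df_negatives (hypothetical data).
df_positives = pd.DataFrame({'x': range(12)})
df_negatives = pd.DataFrame({'x': range(844)})

# Pick N between the two class sizes, here the midpoint.
N = (len(df_positives) + len(df_negatives)) // 2  # 428

# Undersample the negatives down to N records...
df_neg_sample = df_negatives.sample(N, random_state=42)

# ...and grow the positives up to N records (by replication here;
# SMOTE would synthesize them instead).
df_pos_sample = df_positives.sample(N, replace=True, random_state=42)

df_hybrid = pd.concat([df_pos_sample, df_neg_sample], axis=0)
print(len(df_hybrid), len(df_pos_sample) / len(df_neg_sample))  # 856 1.0
```

This lands between the 24 records of pure undersampling and the 1688 of pure oversampling, trading some information loss against some replication.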
from imblearn.over_sampling import SMOTE

RANDOM_STATE = 42
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
y = original.pop(class_var).values  # pop removes the class column from original
X = original.values
smote_X, smote_y = smote.fit_resample(X, y)

# rejoin the resampled data with the updated class
df_smote = concat([DataFrame(smote_X), DataFrame(smote_y)], axis=1)
df_smote.columns = list(original.columns) + [class_var]
df_smote.to_csv(f'data/{file}_smote.csv', index=False)
smote_target_count = pd.Series(smote_y).value_counts()
values['SMOTE'] = [smote_target_count[positive_class], smote_target_count[negative_class]]
print('Minority class=', positive_class, ':', smote_target_count[positive_class])
print('Majority class=', negative_class, ':', smote_target_count[negative_class])
print('Proportion:', round(smote_target_count[positive_class] / smote_target_count[negative_class], 2), ': 1')
print(df_smote.describe())
Minority class= Active : 844
Majority class= Inactive : 844
Proportion: 1.0 : 1
WBN_GC_L_0.25 WBN_GC_H_0.25 WBN_GC_L_0.50 WBN_GC_H_0.50 \
count 1688.000000 1688.000000 1688.000000 1688.000000
mean -2.063864 1.874356 -2.285678 2.381064
std 0.356506 0.293333 0.264463 0.210189
min -2.785000 1.104900 -3.036800 1.765300
25% -2.381225 1.679700 -2.523400 2.303650
50% -2.251460 1.889814 -2.382600 2.389783
75% -1.740141 2.079091 -2.059948 2.495525
max -1.167400 2.819200 -1.629500 3.347800
WBN_GC_L_0.75 WBN_GC_H_0.75 WBN_GC_L_1.00 WBN_GC_H_1.00 \
count 1688.000000 1688.000000 1688.000000 1688.000000
mean -2.708940 3.100907 -3.382650 3.931627
std 0.181816 0.189875 0.220031 0.214837
min -3.425300 2.510300 -3.909600 3.247300
25% -2.798850 3.000645 -3.560050 3.800475
50% -2.720843 3.093746 -3.327946 3.921609
75% -2.586449 3.216496 -3.197432 4.076678
max -2.241700 3.992300 -2.882900 4.703700
WBN_EN_L_0.25 WBN_EN_H_0.25 ... WBN_LP_L_1.00 WBN_LP_H_1.00 \
count 1688.000000 1688.000000 ... 1688.000000 1688.000000
mean -0.795246 1.489139 ... -3.328857 3.738963
std 0.045627 0.355777 ... 0.208995 0.188153
min -0.888000 1.026200 ... -3.782700 2.700500
25% -0.821925 1.245914 ... -3.527200 3.613400
50% -0.793031 1.363244 ... -3.306447 3.732522
75% -0.776800 1.535200 ... -3.169700 3.863878
max -0.437500 2.411900 ... -2.718200 4.271300
XLogP PSA NumRot NumHBA NumHBD \
count 1688.000000 1688.000000 1688.000000 1688.000000 1688.000000
mean 3.954976 88.224130 5.912849 5.299995 1.243740
std 1.325194 26.716484 2.555263 1.536745 0.755493
min -2.892000 26.190000 0.000000 0.000000 0.000000
25% 3.070937 66.852503 4.000000 4.000000 1.000000
50% 4.225436 84.097114 6.000000 5.078183 1.000000
75% 4.945742 107.610000 7.173255 6.000000 1.998320
max 8.407000 223.370000 24.000000 13.000000 6.000000
MW BBB BadGroup
count 1688.000000 1688.000000 1688.000000
mean 406.373955 0.339935 0.493365
std 64.453893 0.432909 0.607238
min 205.217000 0.000000 0.000000
25% 357.450925 0.000000 0.000000
50% 409.485485 0.000000 0.088846
75% 447.539500 0.912286 1.000000
max 681.786000 1.000000 3.000000
[8 rows x 32 columns]
Note that for the SMOTE method we have to split the original data in two: one part with just the class variable, called y, and another with all the remaining variables, called X (see the Classification lab for more details).
The SMOTE technique then generates the positive records itself, so there is no need to join the positives and negatives manually. All we have to do is rejoin the data (smote_X) with the corresponding class (smote_y), already updated.
plt.figure()
ds.multiple_bar_chart([positive_class, negative_class], values,
title='Target', xlabel='frequency', ylabel='Class balance')
plt.show()