Lab 2: Data Preparation

Data balancing

Data balancing techniques are needed in the presence of unbalanced datasets - when the target variable does not have a uniform distribution, i.e. the classes are not equiprobable. In the case of binary classification, we usually call the minority class positive, and the majority one negative.

Let's consider an unbalanced dataset whose target is the Outcome variable, with two possible values: Active, the minority class, and Inactive, the majority class. The first chart shows the original target distribution, and the subsequent ones the resulting distributions after applying each strategy.

Before proceeding, let's split the dataset into two subsets: one for each class. Then we can sample the required one and join it with the other, as we did for the other preparation techniques. In the end, we can write the dataset to a new data file to explore later.

We can follow two different strategies: undersampling and oversampling. The choice between them depends on the size of the dataset, i.e., the number of records available to use for training:
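As a minimal sketch, both strategies can be written with pandas' sample and concat. The synthetic dataset, the var1 variable, and the class sizes are assumptions for illustration; only Outcome, Active and Inactive come from the lab:

```python
import pandas as pd

# Synthetic stand-in for the lab's dataset (var1 is a hypothetical
# variable; in the lab the data would be read from its data file).
data = pd.DataFrame({
    'var1': range(10),
    'Outcome': ['Active'] * 2 + ['Inactive'] * 8,
})
positives = data[data['Outcome'] == 'Active']    # minority class
negatives = data[data['Outcome'] == 'Inactive']  # majority class

# Undersampling: shrink the majority class to the minority's size.
under = pd.concat([positives, negatives.sample(len(positives))])
print(under['Outcome'].value_counts())  # 2 records of each class

# Oversampling by replication: grow the minority class to the
# majority's size, sampling with replacement (replace=True).
over = pd.concat([positives.sample(len(negatives), replace=True), negatives])
print(over['Outcome'].value_counts())  # 8 records of each class
```

The balanced dataset can then be written to a new data file with to_csv, as described above.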

Sampling the negatives down to the size of the positives implements undersampling; in a similar way, we get oversampling by replicating the positives. Note the replace parameter in the sample method: it means we are taking a sample with replacement, so the same record may be picked more than once.

SMOTE

Among the different oversampling strategies, SMOTE is one of the most interesting. It oversamples the minority class by artificially creating new records in the neighborhood of the existing positive records.

It is usual to adopt a hybrid approach, choosing a number of records N between the number of positives and the number of negatives. This implies taking a sample of N records from the negatives, and generating new positives until they reach the same number of records.

Note that for the SMOTE method we have to split the original data in two: one part with just one variable - the class variable, call it y - and another with all the other variables, call it X (see the Classification lab for more details about this). The SMOTE technique then generates the positive records itself, so there is no need to join the positives and negatives by hand. Indeed, all we have to do is rejoin the data (smote_X) with the corresponding class (smote_y), already updated.