Lab 2: Data Preparation

Let's consider the algae dataset again. Recall the kinds of variables that describe the data: transform object variables into symbolic ones, as seen in the previous lab, and keep the numeric variables separate from the symbolic ones, so that each group can be handled with the right tools.
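As a reminder, the separation of variables could be sketched as below. The small data frame and its column names are hypothetical stand-ins for the algae data, and using the pandas category type for symbolic variables is an assumption about what the previous lab did.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the algae data (invented values and column names).
data = pd.DataFrame({
    "season": pd.Series(["winter", "spring", "autumn"], dtype=object),
    "pH": [8.0, 8.3, 7.9],
    "Cl": [60.8, np.nan, 57.8],
})

# Transform object variables into symbolic (category) ones.
object_vars = data.select_dtypes(include="object").columns
data = data.astype({var: "category" for var in object_vars})

# Keep the numeric variables separated from the symbolic ones.
numeric_vars = data.select_dtypes(include="number").columns.tolist()
symbolic_vars = data.select_dtypes(include="category").columns.tolist()
print(numeric_vars, symbolic_vars)
```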

Missing Values Imputation

Missing values are a recurring problem in data science, and in particular when using scikit-learn, since most of its estimators are not able to deal with them. A missing value corresponds to a variable without any value for a given record. Let's recover the procedure to find the variables with missing values from the data dimensionality lab:
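One way to do it is sketched below; it may differ from the code used in the data dimensionality lab. The data frame is a hypothetical stand-in for the algae data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the algae data (invented values and column names).
data = pd.DataFrame({
    "season": ["winter", "spring", None, "autumn"],
    "pH": [8.0, np.nan, 8.1, 7.9],
    "Cl": [60.8, 57.8, np.nan, np.nan],
    "NO3": [6.2, 1.3, 5.3, 2.3],
})

# Count the missing values per variable and keep only the variables that have some.
mv = {var: int(n) for var, n in data.isna().sum().items() if n > 0}
print(mv)
```

Variables without missing values, such as NO3 in this toy example, do not appear in the resulting dictionary.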

Dropping Missing Values

The easiest way to deal with this situation is to drop the records with missing values. There are two situations to distinguish.

The first one is when a column has a significant number of missing values. It is difficult to establish a threshold, since the number of remaining records plays an important part. If the records that do have a value are enough to characterize the variable, we keep the column; otherwise, we can discard the entire column.

Since our dataset has just 200 records, we will discard the columns that have more than 90% missing values, as follows.
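A minimal sketch of this step, on a hypothetical 10-record data frame instead of the 200-record algae data, could be:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the algae data: "empty" has no values at all.
data = pd.DataFrame({
    "pH": [8.0, 8.3, 8.1, 7.9, 8.1, 8.0, 7.8, 8.2, 8.4, 8.1],
    "Cl": [60.8, np.nan, 57.8, np.nan, 40.0, 55.3, np.nan, 62.1, 58.0, 59.9],
    "empty": [np.nan] * 10,
})

# Columns with more than 90% of missing values are candidates for removal.
threshold = 0.9 * data.shape[0]
missing = data.isna().sum()
to_drop = [var for var, n in missing.items() if n > threshold]

# inplace=False returns a new data frame and leaves the original untouched.
df = data.drop(columns=to_drop, inplace=False)
print("Dropped variables:", to_drop)
```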

Note that we made a copy of the original data, by setting the inplace parameter to False, so that the following approaches are not affected.

The second situation is the presence of single records that have a majority of variables without values. In this case, we prefer to discard those records instead of dropping columns. For this, we use the dropna method.
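A possible sketch uses the thresh parameter of dropna, which states the minimum number of non-missing values a record needs in order to be kept. The data frame is hypothetical, and requiring at least half of the variables to be filled is an assumption about where to put the cut.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-in for the algae data (invented values and column names).
data = pd.DataFrame({
    "pH": [8.0, np.nan, 8.1, np.nan],
    "Cl": [60.8, np.nan, np.nan, np.nan],
    "NO3": [6.2, np.nan, 5.3, np.nan],
    "PO4": [170.0, 55.0, np.nan, np.nan],
})

# Keep only the records with at least half of their variables filled in.
min_filled = math.ceil(data.shape[1] / 2)
df = data.dropna(axis=0, thresh=min_filled, inplace=False)
print(data.shape, "->", df.shape)
```

In this toy example the second and fourth records are dropped, while every column is preserved.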

As we can see, we didn't discard any variable and only dropped two records.

Filling Missing Values

The simplest missing values imputer in sklearn.impute is the SimpleImputer. First, it is created, defining the strategy to follow, and then it is fitted to the data (fit method). Then, it is possible to apply it to the data through the transform method. Using the fit_transform method, we are able to do both in just one call, but unless we keep a reference to the fitted imputer, we are not able to reuse it on any other dataset.
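The create, fit and transform sequence could look like the sketch below, where the two small arrays are hypothetical data. Keeping the fitted imputer in a variable is what allows it to be applied to a second dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one array to fit on and another one to fill afterwards.
train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
other = np.array([[np.nan, np.nan]])

# Create the imputer, defining the strategy to follow.
imp = SimpleImputer(strategy="mean", missing_values=np.nan)

# Fit it: the mean of each column is learned from train only.
imp.fit(train)

# Apply it to the data it was fitted on, and reuse it on the other data.
train_filled = imp.transform(train)
other_filled = imp.transform(other)
print(other_filled)
```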

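It uses a simple strategy to fill any missing value with a new value, which we need to define through the strategy parameter. We can choose among mean, median, most_frequent and constant; with constant, the value to use is given through the fill_value parameter.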

There is also an IterativeImputer, which considers all the variables to estimate missing values, but it is out of the scope of this tutorial. So let's impute a constant value for each distinct type of variable and join the results.
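A minimal sketch of the constant strategy follows. The data frame is a hypothetical stand-in for the algae data, and the fill values 0 and "NA" are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the algae data (invented values and column names).
data = pd.DataFrame({
    "season": ["winter", np.nan, "autumn"],
    "pH": [8.0, np.nan, 7.9],
    "Cl": [60.8, 57.8, np.nan],
})
numeric_vars = data.select_dtypes(include="number").columns.tolist()
symbolic_vars = data.select_dtypes(exclude="number").columns.tolist()

# One imputer per type of variable, each with its own constant.
imp_num = SimpleImputer(strategy="constant", fill_value=0, missing_values=np.nan)
imp_sym = SimpleImputer(strategy="constant", fill_value="NA", missing_values=np.nan)

tmp_num = pd.DataFrame(
    imp_num.fit_transform(data[numeric_vars]),
    columns=numeric_vars, index=data.index,
)
tmp_sym = pd.DataFrame(
    imp_sym.fit_transform(data[symbolic_vars].to_numpy(dtype=object)),
    columns=symbolic_vars, index=data.index,
)

# Join the two results back into a single data frame.
df = pd.concat([tmp_num, tmp_sym], axis=1)
print(df)
```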

Be aware that filling missing values with already existing values, such as 0, -1 or False, changes the data distribution. For this reason, it is usual to apply the mean to numeric variables and the mode to symbolic ones instead.
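The sketch below applies the mean strategy to the numeric variables and the most_frequent strategy (the mode) to the symbolic ones. Again, the data frame is a hypothetical stand-in for the algae data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the algae data (invented values and column names).
data = pd.DataFrame({
    "season": ["winter", "winter", np.nan, "autumn"],
    "pH": [8.0, np.nan, 7.0, 9.0],
})
numeric_vars = data.select_dtypes(include="number").columns.tolist()
symbolic_vars = data.select_dtypes(exclude="number").columns.tolist()

# Mean for the numeric variables, most frequent value for the symbolic ones.
imp_num = SimpleImputer(strategy="mean", missing_values=np.nan)
imp_sym = SimpleImputer(strategy="most_frequent", missing_values=np.nan)

tmp_num = pd.DataFrame(
    imp_num.fit_transform(data[numeric_vars]),
    columns=numeric_vars, index=data.index,
)
tmp_sym = pd.DataFrame(
    imp_sym.fit_transform(data[symbolic_vars].to_numpy(dtype=object)),
    columns=symbolic_vars, index=data.index,
)
df = pd.concat([tmp_num, tmp_sym], axis=1)

# The values learned by each imputer are kept in its statistics_ attribute.
print(imp_num.statistics_, imp_sym.statistics_)
```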

Pay attention to the differences between the mean and mode values obtained with both strategies for the different variables.

Don't forget to save the resulting data to a datafile, to be used for training models and discovering other kinds of information.
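Saving could be as simple as the sketch below. The file name algae_mv_imputed.csv and the temporary folder are hypothetical choices, and df stands for whichever imputed data frame we decide to keep.

```python
import os
import tempfile

import pandas as pd

# Hypothetical imputed data frame (invented values).
df = pd.DataFrame({"season": ["winter", "spring"], "pH": [8.0, 7.5]})

# Hypothetical destination; in the lab this would be the project's data folder.
out_path = os.path.join(tempfile.gettempdir(), "algae_mv_imputed.csv")

# index=False avoids writing the row index as an extra column.
df.to_csv(out_path, index=False)

# Read the file back to confirm that it can be loaded later.
restored = pd.read_csv(out_path)
print(restored.shape)
```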