Data Preparation

Variable Encoding

Despite the widespread popularity of sklearn, it unfortunately deals only with numeric variables. For this reason, variable encoding has to be one of the first transformations to apply.

Ordinal Encoding

The easiest encoding is the ordinal one, but it should only be applied to binary and ordinal variables.

The usual way to do it is through the OrdinalEncoder and LabelEncoder classes, both from the sklearn.preprocessing package. However, their usage is difficult to manage, in particular when imposing the desired order.
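For reference, imposing an order with OrdinalEncoder goes through its categories parameter, which takes one list of values per encoded column. The following is only a minimal sketch on a toy column invented for illustration (the toy DataFrame and the smoking_status_code name are assumptions, not part of the dataset used here):

```python
from pandas import DataFrame
from sklearn.preprocessing import OrdinalEncoder

# Toy data, invented for illustration only
toy = DataFrame(
    {"smoking_status": ["smokes", "never smoked", "formerly smoked", "never smoked"]}
)

# One list per encoded column, written in the order we want to impose
order = [["never smoked", "formerly smoked", "smokes"]]
enc = OrdinalEncoder(categories=order)

# fit_transform returns a 2D float array with one column per encoded variable
codes = enc.fit_transform(toy[["smoking_status"]])
toy["smoking_status_code"] = codes[:, 0].astype(int)
print(toy)
```

Each value is coded by its position in the given list, so the order has to be spelled out for every variable when the encoder is created.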

A simple alternative is to use the replace method from the DataFrame class, where the coding is made explicit through an encoding dictionary.

In [ ]:
from pandas import read_csv, DataFrame
from dslabs_functions import get_variable_types, encode_cyclic_variables, dummify

data: DataFrame = read_csv("data/stroke_mvi.csv", index_col="id", na_values="")
vars: dict[str, list] = get_variable_types(data)

yes_no: dict[str, int] = {"no": 0, "No": 0, "yes": 1, "Yes": 1}
residence_type_values: dict[str, int] = {"Rural": 0, "Urban": 1}

encoding: dict[str, dict[str, int]] = {
    "Residence_type": residence_type_values,
    "hypertension": yes_no,
    "heart_disease": yes_no,
    "ever_married": yes_no,
    "stroke": yes_no,
}
df: DataFrame = data.replace(encoding, inplace=False)
df.head()
Out[ ]:
id     age   avg_glucose_level  bmi        gender  work_type      smoking_status   hypertension  heart_disease  ever_married  Residence_type  stroke
9046   67.0  228.69             36.600000  Male    Private        formerly smoked  0             1              1             1               1
51676  61.0  202.21             28.893237  Female  Self-employed  never smoked     0             0              1             0               1
31112  80.0  105.92             32.500000  Male    Private        never smoked     0             1              1             0               1
60182  49.0  171.23             34.400000  Female  Private        smokes           0             0              1             1               1
1665   79.0  174.12             24.000000  Female  Self-employed  never smoked     1             0              1             0               1

In the code above, we encoded all the binary variables, for which the order among the values is irrelevant. For the other variables we could also have chosen any order among the values, but in that case we would lose some information, which would consequently bias the training of models.

To choose the order to consider for each variable, we may start by collecting the distinct values of each symbolic variable.

In [ ]:
for v in vars["symbolic"]:
    print(v, data[v].unique())
gender ['Male' 'Female' 'Other']
work_type ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
smoking_status ['formerly smoked' 'never smoked' 'smokes']

Unexpectedly, the gender variable presents three different values, which makes it non-binary. We opt to code Female as 0, Male as 2, and Other as 1, considering that it represents someone in between, meaning that its value shouldn't be closer to either of the traditional values.

In [ ]:
gender_values: dict[str, int] = {"Female": 0, "Other": 1, "Male": 2}
work_values: dict[str, int] = {
    "children": 0,
    "Never_worked": 1,
    "Self-employed": 2,
    "Private": 3,
    "Govt_job": 4,
}
status_values: dict[str, int] = {"never smoked": 0, "formerly smoked": 1, "smokes": 2}

encoding: dict[str, dict[str, int]] = {
    "gender": gender_values,
    "work_type": work_values,
    "smoking_status": status_values,
}

df: DataFrame = df.replace(encoding, inplace=False)
df.head()
Out[ ]:
id     age   avg_glucose_level  bmi        gender  work_type  smoking_status  hypertension  heart_disease  ever_married  Residence_type  stroke
9046   67.0  228.69             36.600000  2       3          1               0             1              1             1               1
51676  61.0  202.21             28.893237  0       2          0               0             0              1             0               1
31112  80.0  105.92             32.500000  2       3          0               0             1              1             0               1
60182  49.0  171.23             34.400000  0       3          2               0             0              1             1               1
1665   79.0  174.12             24.000000  0       2          0               1             0              1             0               1

The logic for the rest of the variables is similar. Of course, if we have domain knowledge, the choice of the order is natural, and there shouldn't be any doubt about it. Otherwise, we need to pick an order that seems to make sense in helping to discriminate among the values of the class variable.

The smoking_status variable is an example of a situation where common sense is all we need. Never having smoked (never smoked) aligns more closely with having quit smoking (formerly smoked) than with actively smoking (smokes).
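Since replace leaves untouched any value that is absent from the encoding dictionary, it may be worth checking that every observed value is covered before encoding. A possible sketch follows; the unmapped_values helper and the toy data are hypothetical and only serve as illustration:

```python
from pandas import DataFrame


def unmapped_values(df: DataFrame, encoding: dict[str, dict]) -> dict[str, set]:
    """Return, per variable, the observed values missing from its mapping."""
    missing: dict[str, set] = {}
    for var, mapping in encoding.items():
        # Missing values are ignored, since they are not meant to be encoded
        observed = set(df[var].dropna().unique())
        gap = observed - set(mapping)
        if gap:
            missing[var] = gap
    return missing


# Toy data, invented for illustration only
toy = DataFrame({"smoking_status": ["smokes", "never smoked", "Unknown", None]})
toy_encoding = {
    "smoking_status": {"never smoked": 0, "formerly smoked": 1, "smokes": 2}
}
print(unmapped_values(toy, toy_encoding))
```

An empty result means that the dictionary covers every value present in the data, so no symbolic value would survive the call to replace.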

Cyclic Variables

Among the ordinal variables, there are some that, instead of having a sequential order, show a cyclic one. Examples of these are the season and the day of the week.

In these cases, there is no right choice for the first or the last value, and so we need a different strategy to encode them.

The common method applied nowadays is to create two variables for each one of them, using trigonometric functions to simulate an angle. Say, for a variable var, we create two new variables to encode it: var_sin and var_cos.

In this manner, if var assumes a value x between 0 and x_max, where x_max corresponds to a full cycle, then var_sin becomes x_sin and var_cos becomes x_cos, given by:

x_sin = sin(2π · x / x_max)
x_cos = cos(2π · x / x_max)

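In order to do so, we just need to map the original values, from 0 to x_max, to angles between 0 and 2π, given by 2π · x / x_max. For the four seasons, this yields 0, π/2, π and 3π/2, the last of which is equivalent to -π/2.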

In [ ]:
from math import pi, sin, cos

data: DataFrame = read_csv(
    "data/algae.csv",
    index_col="date",
    na_values="",
    parse_dates=True,
)

season_val: dict[str, float] = {
    "spring": 0,
    "summer": pi / 2,
    "autumn": pi,
    "winter": -pi / 2,
}
lov: dict[str, int] = {"low": 0, "medium": 1, "high": 2}
encoding: dict[str, dict] = {
    "river_depth": lov,
    "fluid_velocity": lov,
    "season": season_val,
}

data = data.replace(encoding)
data.head()
Out[ ]:
pH Oxygen Chloride Nitrates Ammonium Orthophosphate Phosphate Chlorophyll fluid_velocity river_depth season
date
2018-09-30 8.10 11.4 40.02 5.33 346.67 125.67 187.06 15.6 1 0 3.141593
2018-10-05 8.06 9.0 55.35 10.42 233.70 58.22 97.58 10.5 1 0 3.141593
2018-10-07 8.05 10.6 59.07 4.99 205.67 44.67 77.43 6.9 2 0 3.141593
2018-10-09 7.55 11.5 4.70 1.32 14.75 4.25 98.25 1.1 2 0 3.141593
2018-10-11 7.75 10.3 32.92 2.94 42.00 16.00 40.00 7.6 2 0 3.141593

We then create the two new variables from the angle.

In [ ]:
def encode_cyclic_variables(data: DataFrame, vars: list[str]) -> DataFrame:
    # Each variable in vars is expected to already hold an angle in radians,
    # as produced by the encoding dictionary above.
    for v in vars:
        data[v + "_sin"] = data[v].apply(lambda x: round(sin(x), 3))
        data[v + "_cos"] = data[v].apply(lambda x: round(cos(x), 3))
    return data


data: DataFrame = encode_cyclic_variables(data, ["season"])
data.head()
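To see why two variables are used instead of one, the sketch below places the four seasons on the unit circle and measures the distance between consecutive ones. It relies only on the standard library; the seasons list, the period name and the points dictionary are introduced here for illustration.

```python
from math import cos, dist, pi, sin

seasons = ["spring", "summer", "autumn", "winter"]
period = len(seasons)  # length of the cycle

# Position k in the cycle becomes the point (sin, cos) at angle 2*pi*k/period
points = {
    name: (sin(2 * pi * k / period), cos(2 * pi * k / period))
    for k, name in enumerate(seasons)
}

# Distance between consecutive seasons, including winter -> spring
for a, b in zip(seasons, seasons[1:] + seasons[:1]):
    print(f"{a:>6} -> {b:<6} {dist(points[a], points[b]):.3f}")
```

Every pair of consecutive seasons ends up at the same distance (the square root of 2, about 1.414), including winter and spring. With plain ordinal codes from 0 to 3, winter and spring would be 3 units apart, as if they were the most distant seasons.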

Dummification or One-hot Encoding

Dealing with nominal variables is another story. Indeed, by definition, there is no order among the values assumed by the variable. When, after exploring all possible perspectives, we are not able to specify an acceptable order among those values, the only solution is dummification.

This consists of creating a new binary variable for each possible value of the original variable, which is then removed from the dataset.

Note, however, that even a small number of values leads to the creation of several new variables, making the dataset much sparser. And so, we should avoid it as much as possible.

Additionally, do not dummify the class variable, since it would transform a simple multi-class classification problem into a multi-label one.
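One way to observe the effect of dummifying the class variable is to ask sklearn how it interprets each form of the target, through the type_of_target function from sklearn.utils.multiclass. The labels below are a toy example invented for illustration:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Toy class variable with three classes (invented for illustration)
y = np.array(["low", "medium", "high", "low"])

# Dummified version of the same labels: one 0/1 column per class
classes = np.unique(y)
y_dummified = (y[:, None] == classes).astype(int)

print(type_of_target(y))            # single column of labels
print(type_of_target(y_dummified))  # one indicator column per class
```

The single column is reported as "multiclass", while its dummified version is reported as "multilabel-indicator", which is the form in which each record may belong to several classes at once.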

In order to apply dummification, we can make use of the OneHotEncoder from the sklearn.preprocessing package. The pandas.get_dummies function is much less interesting, since it isn't able to apply the same encoding to different parts of a dataset (for example, the train and test sets), while the first one is.

For example, after dummifying the algae dataframe, we get a new one with 18 variables instead of the 11 original ones, since river_depth and fluid_velocity have three different values each and season has four.

As we saw, we could have considered them as ordinal or cyclic and avoided the increase in dimensionality.

In [ ]:
from numpy import ndarray
from pandas import DataFrame, read_csv, concat
from sklearn.preprocessing import OneHotEncoder


def dummify(df: DataFrame, vars_to_dummify: list[str]) -> DataFrame:
    # Variables that are kept untouched
    other_vars: list[str] = [c for c in df.columns if c not in vars_to_dummify]

    # drop="if_binary" keeps a single column for variables with only two values
    enc = OneHotEncoder(
        handle_unknown="ignore", sparse_output=False, dtype="bool", drop="if_binary"
    )
    trans: ndarray = enc.fit_transform(df[vars_to_dummify])

    # Name the new columns as <variable>_<value> and keep the original index
    new_vars: ndarray = enc.get_feature_names_out(vars_to_dummify)
    dummy = DataFrame(trans, columns=new_vars, index=df.index)

    final_df: DataFrame = concat([df[other_vars], dummy], axis=1)
    return final_df


data: DataFrame = read_csv(
    "data/algae.csv", index_col="date", na_values="", parse_dates=True, dayfirst=True
)
vars: list[str] = ["river_depth", "fluid_velocity", "season"]
df: DataFrame = dummify(data, vars)
df.head(5)
Out[ ]:
pH Oxygen Chloride Nitrates Ammonium Orthophosphate Phosphate Chlorophyll river_depth_high river_depth_low river_depth_medium fluid_velocity_high fluid_velocity_low fluid_velocity_medium season_autumn season_spring season_summer season_winter
date
2018-09-30 8.10 11.4 40.02 5.33 346.67 125.67 187.06 15.6 False True False False False True True False False False
2018-10-05 8.06 9.0 55.35 10.42 233.70 58.22 97.58 10.5 False True False False False True True False False False
2018-10-07 8.05 10.6 59.07 4.99 205.67 44.67 77.43 6.9 False True False True False False True False False False
2018-10-09 7.55 11.5 4.70 1.32 14.75 4.25 98.25 1.1 False True False True False False True False False False
2018-10-11 7.75 10.3 32.92 2.94 42.00 16.00 40.00 7.6 False True False True False False True False False False