Data Preparation
Variable Encoding
Despite sklearn's heavy popularity, it unfortunately does not deal with anything but numeric variables. For this reason, variable encoding has to be one of the first transformations to apply.
Ordinal Encoding
The easiest encoding is the ordinal one, but it shall only be applied to binary and ordinal variables.
The usual way to do it is through the OrdinalEncoder and LabelEncoder classes, both from the sklearn.preprocessing package. However, their usage is difficult to manage, in particular when it comes to imposing the desired order.
A simple alternative is to use the replace method from the DataFrame class, where the coding is explicitly made through the encoding dictionary.
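For reference, a minimal sketch of how OrdinalEncoder can impose an explicit order through its categories parameter (the toy variable and its values are hypothetical):

```python
from pandas import DataFrame
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordinal variable whose order we impose explicitly
toy: DataFrame = DataFrame({"size": ["medium", "small", "large", "small"]})
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
toy["size_encoded"] = enc.fit_transform(toy[["size"]])
print(toy)
```

This works, but one list of categories must be kept in sync per encoded column, which is what makes the approach harder to manage than an explicit replace dictionary.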
from pandas import read_csv, DataFrame
from dslabs_functions import get_variable_types, encode_cyclic_variables, dummify
data: DataFrame = read_csv("data/stroke_mvi.csv", index_col="id", na_values="")
vars: dict[str, list] = get_variable_types(data)
yes_no: dict[str, int] = {"no": 0, "No": 0, "yes": 1, "Yes": 1}
residence_type_values: dict[str, int] = {"Rural": 0, "Urban": 1}
encoding: dict[str, dict[str, int]] = {
    "Residence_type": residence_type_values,
    "hypertension": yes_no,
    "heart_disease": yes_no,
    "ever_married": yes_no,
    "stroke": yes_no,
}
df: DataFrame = data.replace(encoding, inplace=False)
df.head()
| id | age | avg_glucose_level | bmi | gender | work_type | smoking_status | hypertension | heart_disease | ever_married | Residence_type | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9046 | 67.0 | 228.69 | 36.600000 | Male | Private | formerly smoked | 0 | 1 | 1 | 1 | 1 |
| 51676 | 61.0 | 202.21 | 28.893237 | Female | Self-employed | never smoked | 0 | 0 | 1 | 0 | 1 |
| 31112 | 80.0 | 105.92 | 32.500000 | Male | Private | never smoked | 0 | 1 | 1 | 0 | 1 |
| 60182 | 49.0 | 171.23 | 34.400000 | Female | Private | smokes | 0 | 0 | 1 | 1 | 1 |
| 1665 | 79.0 | 174.12 | 24.000000 | Female | Self-employed | never smoked | 1 | 0 | 1 | 0 | 1 |
In the code above, we encoded only the binary variables, since for those the order among the values is irrelevant. Naturally, we could have chosen an arbitrary order for the remaining variables as well, but in that case we would lose some information, which consequently would bias the training of models.
In order to choose the order to consider for each variable, we may start by collecting the individual values for each symbolic variable.
for v in vars["symbolic"]:
    print(v, data[v].unique())
gender ['Male' 'Female' 'Other']
work_type ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
smoking_status ['formerly smoked' 'never smoked' 'smokes']
Unexpectedly, the gender variable presents 3 different values, which makes it non-binary. We opt to code Female as 0, Male as 2, and Other as 1, so that Other, representing someone in between, is not closer to either of the traditional values.
gender_values: dict[str, int] = {"Female": 0, "Other": 1, "Male": 2}
work_values: dict[str, int] = {
    "children": 0,
    "Never_worked": 1,
    "Self-employed": 2,
    "Private": 3,
    "Govt_job": 4,
}
status_values: dict[str, int] = {"never smoked": 0, "formerly smoked": 1, "smokes": 2}
encoding: dict[str, dict[str, int]] = {
    "gender": gender_values,
    "work_type": work_values,
    "smoking_status": status_values,
}
df: DataFrame = df.replace(encoding, inplace=False)
df.head()
| id | age | avg_glucose_level | bmi | gender | work_type | smoking_status | hypertension | heart_disease | ever_married | Residence_type | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9046 | 67.0 | 228.69 | 36.600000 | 2 | 3 | 1 | 0 | 1 | 1 | 1 | 1 |
| 51676 | 61.0 | 202.21 | 28.893237 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 31112 | 80.0 | 105.92 | 32.500000 | 2 | 3 | 0 | 0 | 1 | 1 | 0 | 1 |
| 60182 | 49.0 | 171.23 | 34.400000 | 0 | 3 | 2 | 0 | 0 | 1 | 1 | 1 |
| 1665 | 79.0 | 174.12 | 24.000000 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
The logic for the rest of the variables shall be similar. Of course, if we have domain knowledge, the choice of order is natural and there shouldn't be any doubt about it. Otherwise, we need to pick an order that seems to make sense in helping to discriminate among the class values.
The smoking_status variable is an example of a situation where common sense is everything we need. Never having smoked (never smoked) aligns more closely with having quit smoking (formerly smoked) than actively smoking (smokes).
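When in doubt, a quick way to sanity-check a candidate order is to inspect how each value distributes over the class. A minimal sketch with hypothetical rows (on the real data, sample would be the stroke dataframe itself):

```python
from pandas import DataFrame, crosstab

# hypothetical miniature of the stroke data: how does each smoking_status
# value distribute over the class variable?
sample: DataFrame = DataFrame(
    {
        "smoking_status": [
            "never smoked", "smokes", "formerly smoked",
            "smokes", "never smoked", "formerly smoked",
        ],
        "stroke": [0, 1, 1, 1, 0, 0],
    }
)
# normalize="index": each row shows the class proportions for one value
rates: DataFrame = crosstab(
    sample["smoking_status"], sample["stroke"], normalize="index"
)
print(rates)
```

An order along which the class proportions change monotonically is a reasonable candidate.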
Cyclic variables
Among the ordinal variables there are some that instead of having a sequential order, show a cyclic one. Examples of these are season and day of the week.
In these cases, there is no right choice to use as the first or the last one, and so we need a different strategy to encode them.
The common method applied nowadays is to create two new variables per each cyclic variable, using trigonometric functions to simulate an angle. Say, for a variable var, we create two new variables to encode it: var_sin and var_cos.
In this manner, if var assumes a value x between 0 and x_max, then

var_sin = sin(2π x / x_max)
var_cos = cos(2π x / x_max)

In order to do so, we just need to map each original value x, from 0 to x_max, onto the angle 2π x / x_max, between 0 and 2π.
from math import pi, sin, cos
data: DataFrame = read_csv(
    "data/algae.csv",
    index_col="date",
    na_values="",
    parse_dates=True,
)
season_val: dict[str, float] = {
    "spring": 0,
    "summer": pi / 2,
    "autumn": pi,
    "winter": -pi / 2,
}
lov: dict[str, int] = {"low": 0, "medium": 1, "high": 2}
encoding: dict[str, dict] = {
    "river_depth": lov,
    "fluid_velocity": lov,
    "season": season_val,
}
data = data.replace(encoding)
data.head()
| date | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-09-30 | 8.10 | 11.4 | 40.02 | 5.33 | 346.67 | 125.67 | 187.06 | 15.6 | 1 | 0 | 3.141593 |
| 2018-10-05 | 8.06 | 9.0 | 55.35 | 10.42 | 233.70 | 58.22 | 97.58 | 10.5 | 1 | 0 | 3.141593 |
| 2018-10-07 | 8.05 | 10.6 | 59.07 | 4.99 | 205.67 | 44.67 | 77.43 | 6.9 | 2 | 0 | 3.141593 |
| 2018-10-09 | 7.55 | 11.5 | 4.70 | 1.32 | 14.75 | 4.25 | 98.25 | 1.1 | 2 | 0 | 3.141593 |
| 2018-10-11 | 7.75 | 10.3 | 32.92 | 2.94 | 42.00 | 16.00 | 40.00 | 7.6 | 2 | 0 | 3.141593 |
and then create the two new variables from the angle.
def encode_cyclic_variables(data: DataFrame, vars: list[str]) -> None:
    for v in vars:
        x_max: float | int = max(data[v])
        # add the two new columns in place; the function returns None
        data[v + "_sin"] = data[v].apply(lambda x: round(sin(2 * pi * x / x_max), 3))
        data[v + "_cos"] = data[v].apply(lambda x: round(cos(2 * pi * x / x_max), 3))
    return
Since the function changes the dataframe in place and returns None, we must not assign its result back to data; we just call it and inspect the result afterwards.
encode_cyclic_variables(data, ["season"])
data.head()
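To see why the trigonometric mapping helps, consider a hypothetical day-of-week variable (0 = Monday .. 6 = Sunday). Here is a sketch using the number of distinct values as the period, so that the last day wraps around next to the first:

```python
from math import pi, sin, cos

# sketch: encode day-of-week cyclically, with the number of distinct
# values (7) as the period so Sunday lands next to Monday on the circle
period = 7

def cyc(x: int) -> tuple[float, float]:
    angle = 2 * pi * x / period
    return sin(angle), cos(angle)

mon, sun = cyc(0), cyc(6)
# Euclidean distance in (sin, cos) space: Sunday ends up close to
# Monday, unlike the raw codes 0 and 6
dist = ((mon[0] - sun[0]) ** 2 + (mon[1] - sun[1]) ** 2) ** 0.5
print(round(dist, 3))
```

Note the period choice matters: using max(values) as the period, as in the function above, maps the largest value onto the same angle as the smallest one, which is only correct when the values are already angles.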
Dummification or One-hot Encoding
Dealing with nominal variables is another story: by definition, there is no order among the values assumed by the variable. When, after exploring all possible perspectives, we are not able to specify an acceptable order among those values, the only solution is dummification.
This consists of creating a new binary variable for each possible value of the original one, which is then removed from the dataset.
Note, however, that a large number of values leads to the creation of several new variables, creating a much sparser dataset, and so we shall avoid it as much as possible.
Additionally, do not dummify the class variable, since that would transform a simple multiclass classification problem into a multilabel one.
In order to apply dummification, we can make use of the OneHotEncoder from the sklearn.preprocessing package. The pandas.get_dummies function is much less interesting, since it isn't able to apply the same fitted encoder to different parts of a dataset, while OneHotEncoder is.
For example, after dummifying the algae dataframe, we get a new one with 18 variables instead of the original 11, since the three symbolic variables had three, three, and four different values, respectively.
As we saw, we could have considered them as ordinal or cyclic and avoided the increase in dimensionality.
from numpy import ndarray
from pandas import DataFrame, read_csv, concat
from sklearn.preprocessing import OneHotEncoder
def dummify(df: DataFrame, vars_to_dummify: list[str]) -> DataFrame:
    other_vars: list[str] = [c for c in df.columns if c not in vars_to_dummify]
    enc = OneHotEncoder(
        handle_unknown="ignore", sparse_output=False, dtype="bool", drop="if_binary"
    )
    trans: ndarray = enc.fit_transform(df[vars_to_dummify])
    new_vars: ndarray = enc.get_feature_names_out(vars_to_dummify)
    dummy = DataFrame(trans, columns=new_vars, index=df.index)
    final_df: DataFrame = concat([df[other_vars], dummy], axis=1)
    return final_df
data: DataFrame = read_csv(
    "data/algae.csv", index_col="date", na_values="", parse_dates=True, dayfirst=True
)
vars: list[str] = ["river_depth", "fluid_velocity", "season"]
df: DataFrame = dummify(data, vars)
df.head(5)
| date | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | river_depth_high | river_depth_low | river_depth_medium | fluid_velocity_high | fluid_velocity_low | fluid_velocity_medium | season_autumn | season_spring | season_summer | season_winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-09-30 | 8.10 | 11.4 | 40.02 | 5.33 | 346.67 | 125.67 | 187.06 | 15.6 | False | True | False | False | False | True | True | False | False | False |
| 2018-10-05 | 8.06 | 9.0 | 55.35 | 10.42 | 233.70 | 58.22 | 97.58 | 10.5 | False | True | False | False | False | True | True | False | False | False |
| 2018-10-07 | 8.05 | 10.6 | 59.07 | 4.99 | 205.67 | 44.67 | 77.43 | 6.9 | False | True | False | True | False | False | True | False | False | False |
| 2018-10-09 | 7.55 | 11.5 | 4.70 | 1.32 | 14.75 | 4.25 | 98.25 | 1.1 | False | True | False | True | False | False | True | False | False | False |
| 2018-10-11 | 7.75 | 10.3 | 32.92 | 2.94 | 42.00 | 16.00 | 40.00 | 7.6 | False | True | False | True | False | False | True | False | False | False |