Feature Engineering
Feature Selection
The high dimensionality of data has been shown to have a strong impact on modeling results. It arises when data is described by a large number of variables, which requires a larger training dataset than is sometimes available. The task is even harder when the number of variables exceeds the number of records available for training.
The usual way to go is to drop some variables, preferably the least useful for the modeling task. Like any other preparation task, feature selection should be tested in order to pick the most promising alternative, as proposed before.
Dropping Low Variance Variables
A variable is said to be relevant when it helps to discriminate among the classes. Since a variable may be relevant for one classification task but irrelevant for another, identifying relevant variables can be seen as a supervised task.
However, there are situations where this doesn't apply, and the variables are simply irrelevant by nature. This happens for variables with very low variance (when the variable almost always presents the same value) or with the highest possible variance (when the variable is an identifier, presenting a different value for each record). Note, however, that having a different value for each record doesn't imply that a variable is an identifier, so in this last situation it is only safe to drop the variable if there is enough domain knowledge to recognize it as an identifier. Remember that we can't discard the target variable!
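To make the two extremes concrete, the following sketch computes the variance and the share of distinct values per column on a small made-up table. The column names, the values and the variance cut-off of 1 are assumptions for illustration only; they are not taken from the stroke dataset.

```python
from pandas import DataFrame

# Hypothetical toy data: an identifier, a near-constant flag and a regular variable
toy = DataFrame(
    {
        "record_id": [101, 102, 103, 104, 105, 106],
        "flag": [0, 0, 0, 0, 0, 1],
        "age": [23, 45, 31, 67, 52, 38],
    }
)

# Sample variance of each column
variances = toy.var()
# Share of distinct values: 1.0 means every record has a different value
distinct_ratio = toy.nunique() / len(toy)

low_variance = list(variances[variances < 1].index)
id_candidates = list(distinct_ratio[distinct_ratio == 1.0].index)

print("Low variance:", low_variance)
print("All-distinct columns:", id_candidates)
```

Here only flag falls under the variance cut-off, while both record_id and age have a distinct value in every record. The all-distinct check alone therefore cannot tell an identifier from a legitimate variable, and that decision still needs domain knowledge.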
In order to discard low variance variables, we first need to identify them, which can be done as in the select_low_variance_variables function. It makes use of the describe method of the DataFrame data object, squaring the reported standard deviation to obtain the variance of each variable.
from pandas import DataFrame, Index, read_csv
from dslabs_functions import (
    select_low_variance_variables,
    study_variance_for_feature_selection,
    apply_feature_selection,
    select_redundant_variables,
    study_redundancy_for_feature_selection,
)

def select_low_variance_variables(
    data: DataFrame, max_threshold: float, target: str = "class"
) -> list:
    # By default, describe() covers only the numeric columns
    summary5: DataFrame = data.describe()
    # Variance = std * std; select the columns whose variance is below the threshold
    vars2drop: Index[str] = summary5.columns[
        summary5.loc["std"] * summary5.loc["std"] < max_threshold
    ]
    # Never drop the target variable
    vars2drop = vars2drop.drop(target) if target in vars2drop else vars2drop
    return list(vars2drop.values)
target = "stroke"
file_tag = "stroke"
train: DataFrame = read_csv("data/stroke_train.csv")
print("Original variables", train.columns.to_list())
vars2drop: list[str] = select_low_variance_variables(train, 3, target=target)
print("Variables to drop", vars2drop)
Original variables ['age', 'avg_glucose_level', 'bmi', 'gender', 'work_type', 'smoking_status', 'hypertension', 'heart_disease', 'ever_married', 'Residence_type', 'stroke']
Variables to drop ['gender', 'work_type', 'smoking_status', 'hypertension', 'heart_disease', 'ever_married', 'Residence_type']
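The variance in select_low_variance_variables is obtained by squaring the std row returned by describe. As a small check on assumed toy data (the column names and values below are made up), this sketch confirms that the squared std matches DataFrame.var, and that non-numeric columns are left out of the summary by default, so they can never be selected this way.

```python
from pandas import DataFrame

# Hypothetical toy data mixing numeric and non-numeric columns
toy = DataFrame(
    {
        "age": [20.0, 40.0, 60.0, 80.0],
        "hypertension": [0, 1, 0, 1],
        "city": ["a", "b", "a", "b"],
    }
)

summary = toy.describe()
# Variance as computed in the function: the square of the reported std
variance_from_describe = summary.loc["std"] ** 2
# Variance computed directly, restricted to numeric columns
variance_direct = toy.var(numeric_only=True)
max_gap = float((variance_from_describe - variance_direct).abs().max())

print("Summarized columns:", list(summary.columns))
print("Variances:", variance_from_describe.round(3).to_dict())
print("Largest difference between both computations:", max_gap)
```

In this toy table the 0/1 column has a variance of about 0.33, far below the variance of the age-like column, which illustrates that a single variance threshold depends on the scale of each variable.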
Rather than picking that threshold by magic, we should study the impact of different thresholds on model performance, using the study_variance_for_feature_selection function below. It receives the train and test datasets, used to train and to test the classifiers, respectively; the target variable; the max_threshold, which is the largest variance threshold to consider; the lag, which sets the step between consecutive threshold values; the metric to optimize; and a file_tag to facilitate managing the charts.
from math import ceil
from matplotlib.pyplot import savefig, show, figure
from dslabs_functions import HEIGHT, evaluate_approach, plot_multiline_chart

def study_variance_for_feature_selection(
    train: DataFrame,
    test: DataFrame,
    target: str = "class",
    max_threshold: float = 1,
    lag: float = 0.05,
    metric: str = "accuracy",
    file_tag: str = "",
) -> dict:
    # Thresholds to test: multiples of lag, starting at lag and going up to roughly max_threshold
    options: list[float] = [
        round(i * lag, 3) for i in range(1, ceil(max_threshold / lag + lag))
    ]
    results: dict[str, list] = {"NB": [], "KNN": []}
    summary5: DataFrame = train.describe()
    for thresh in options:
        # Variables whose variance (std * std) is below the current threshold
        vars2drop: Index[str] = summary5.columns[
            summary5.loc["std"] * summary5.loc["std"] < thresh
        ]
        vars2drop = vars2drop.drop(target) if target in vars2drop else vars2drop
        # Drop the same variables from both datasets before evaluating
        train_copy: DataFrame = train.drop(vars2drop, axis=1, inplace=False)
        test_copy: DataFrame = test.drop(vars2drop, axis=1, inplace=False)
        eval: dict[str, list] | None = evaluate_approach(
            train_copy, test_copy, target=target, metric=metric
        )
        if eval is not None:
            results["NB"].append(eval[metric][0])
            results["KNN"].append(eval[metric][1])
    plot_multiline_chart(
        options,
        results,
        title=f"{file_tag} variance study ({metric})",
        xlabel="variance threshold",
        ylabel=metric,
        percentage=True,
    )
    savefig(f"images/{file_tag}_fs_low_var_{metric}_study.png")
    return results
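The list of thresholds explored by the function is derived from max_threshold and lag. The sketch below isolates that expression and, on assumed toy data, lists which variables would be dropped at each threshold. It leaves out the model evaluation, since the body of evaluate_approach is not shown here. The helper name threshold_grid and the toy columns are hypothetical.

```python
from math import ceil
from pandas import DataFrame


def threshold_grid(max_threshold: float, lag: float) -> list[float]:
    # Same expression used inside study_variance_for_feature_selection
    return [round(i * lag, 3) for i in range(1, ceil(max_threshold / lag + lag))]


# Hypothetical data with variances of about 0.33 (a), 1.67 (b) and 0.0 (target)
toy = DataFrame({"a": [0, 1, 0, 1], "b": [0, 1, 2, 3], "target": [0, 0, 0, 0]})
variances = toy.describe().loc["std"] ** 2

dropped_per_threshold: dict[float, list[str]] = {}
for thresh in threshold_grid(max_threshold=1, lag=0.25):
    candidates = variances[variances < thresh].index
    # The target is never removed, whatever its variance
    dropped_per_threshold[thresh] = [c for c in candidates if c != "target"]

print(dropped_per_threshold)
```

A smaller lag gives a finer grid at the cost of one extra round of training and evaluation per threshold.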
eval_metric = "recall"
test: DataFrame = read_csv("data/stroke_test.csv")
figure(figsize=(2 * HEIGHT, HEIGHT))
study_variance_for_feature_selection(
    train,
    test,
    target=target,
    max_threshold=3,
    lag=0.1,
    metric=eval_metric,
    file_tag=file_tag,
)
show()
As we can see, the difference in performance is not strong, but it is slightly better for NB when variables with a variance below 1.3 are removed. Note that in this case naive Bayes doesn't show any change due to the feature selection, but it can happen.
From this verification, we can now save both datasets resulting from dropping those variables. First, we identify the variables to drop for the selected threshold with select_low_variance_variables, just as is done inside the previous function, and then we drop those variables from both the train and test data files, saving them permanently.
def apply_feature_selection(
    train: DataFrame,
    test: DataFrame,
    vars2drop: list,
    filename: str = "",
    tag: str = "",
) -> tuple[DataFrame, DataFrame]:
    train_copy: DataFrame = train.drop(vars2drop, axis=1, inplace=False)
    train_copy.to_csv(f"{filename}_train_{tag}.csv", index=True)
    test_copy: DataFrame = test.drop(vars2drop, axis=1, inplace=False)
    test_copy.to_csv(f"{filename}_test_{tag}.csv", index=True)
    return train_copy, test_copy
vars2drop: list[str] = select_low_variance_variables(
    train, max_threshold=1.2, target=target
)
train_cp, test_cp = apply_feature_selection(
    train, test, vars2drop, filename=f"data/{file_tag}", tag="lowvar"
)
print(f"Original data: train={train.shape}, test={test.shape}")
print(f"After low variance FS: train_cp={train_cp.shape}, test_cp={test_cp.shape}")
Original data: train=(3577, 11), test=(1533, 11)
After low variance FS: train_cp=(3577, 5), test_cp=(1533, 5)
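One detail worth checking when these files are read back: apply_feature_selection writes them with index=True, so the row index is stored as an extra, unnamed first column. The sketch below uses an in-memory buffer and made-up data instead of the real files. It shows the effect and one possible way to handle it, through the index_col parameter of read_csv.

```python
from io import StringIO
from pandas import DataFrame, read_csv

# Hypothetical two-record dataset standing in for the saved train file
df = DataFrame({"age": [67, 45], "stroke": [1, 0]})

buffer = StringIO()
df.to_csv(buffer, index=True)  # same option used by apply_feature_selection

# Reading the file back without any option keeps the index as a regular column
buffer.seek(0)
naive = read_csv(buffer)

# Declaring the first column as the index restores the original variables
buffer.seek(0)
restored = read_csv(buffer, index_col=0)

print("Plain read_csv:", list(naive.columns))
print("With index_col=0:", list(restored.columns))
```

If the extra column is not handled, it ends up in the data as one more variable, with a different value for each record.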
Dropping Redundant Variables
A second possibility is to discard redundant variables. Two variables are said to be redundant if they express the same information, so from the modeling perspective they both have the same impact on the result. One of the ways to avoid redundancy is to find the set of pairs of correlated variables and drop one variable from each pair.
from pandas import Series

def select_redundant_variables(
    data: DataFrame, min_threshold: float = 0.90, target: str = "class"
) -> list:
    df: DataFrame = data.drop(target, axis=1, inplace=False)
    corr_matrix: DataFrame = abs(df.corr())
    variables: Index[str] = corr_matrix.columns
    vars2drop: list = []
    for v1 in variables:
        vars_corr: Series = (corr_matrix[v1]).loc[corr_matrix[v1] >= min_threshold]
        vars_corr.drop(v1, inplace=True)
        if len(vars_corr) > 1:
            lst_corr = list(vars_corr.index)
            for v2 in lst_corr:
                if v2 not in vars2drop:
                    vars2drop.append(v2)
    return vars2drop
print("Original variables", train.columns.values)
vars2drop: list[str] = select_redundant_variables(
    train, target=target, min_threshold=0.5
)
print("Variables to drop", vars2drop)
Original variables ['age' 'avg_glucose_level' 'bmi' 'gender' 'work_type' 'smoking_status' 'hypertension' 'heart_disease' 'ever_married' 'Residence_type' 'stroke']
Variables to drop ['work_type', 'ever_married']
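The idea of dropping one variable from each correlated pair can also be written directly over the upper triangle of the correlation matrix. The sketch below is an alternative formulation, not the dslabs function: select_redundant_variables only collects the partners of a variable that has more than one other variable above the threshold, so the two may return different lists. The function name drop_one_of_each_pair and the toy data are hypothetical.

```python
import numpy as np
from pandas import DataFrame


def drop_one_of_each_pair(data: DataFrame, threshold: float) -> list[str]:
    corr = data.corr().abs()
    # Keep only the cells above the diagonal, so each pair is seen exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped when it is correlated with any earlier column
    return [col for col in upper.columns if (upper[col] >= threshold).any()]


# Hypothetical data: x_scaled duplicates x, noise is only weakly related to both
toy = DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "x_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],
        "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    }
)

redundant = drop_one_of_each_pair(toy, threshold=0.9)
print("Variables to drop", redundant)
```

With this formulation the first variable of each pair, in column order, is the one that is kept, so the column order decides which of two redundant variables survives.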
After being able to select the redundant variables to drop, it is then possible to study the impact of their removal from the training dataset, as done in the study_redundancy_for_feature_selection function.
def study_redundancy_for_feature_selection(
    train: DataFrame,
    test: DataFrame,
    target: str = "class",
    min_threshold: float = 0.90,
    lag: float = 0.05,
    metric: str = "accuracy",
    file_tag: str = "",
) -> dict:
    # Thresholds to test: min_threshold, min_threshold + lag, ..., up to about 1
    options: list[float] = [
        round(min_threshold + i * lag, 3)
        for i in range(ceil((1 - min_threshold) / lag) + 1)
    ]
    df: DataFrame = train.drop(target, axis=1, inplace=False)
    corr_matrix: DataFrame = abs(df.corr())
    variables: Index[str] = corr_matrix.columns
    results: dict[str, list] = {"NB": [], "KNN": []}
    for thresh in options:
        # Same selection as in select_redundant_variables, for the current threshold
        vars2drop: list = []
        for v1 in variables:
            vars_corr: Series = (corr_matrix[v1]).loc[corr_matrix[v1] >= thresh]
            vars_corr.drop(v1, inplace=True)
            if len(vars_corr) > 1:
                lst_corr = list(vars_corr.index)
                for v2 in lst_corr:
                    if v2 not in vars2drop:
                        vars2drop.append(v2)
        # Drop the same variables from both datasets before evaluating
        train_copy: DataFrame = train.drop(vars2drop, axis=1, inplace=False)
        test_copy: DataFrame = test.drop(vars2drop, axis=1, inplace=False)
        eval: dict | None = evaluate_approach(
            train_copy, test_copy, target=target, metric=metric
        )
        if eval is not None:
            results["NB"].append(eval[metric][0])
            results["KNN"].append(eval[metric][1])
    plot_multiline_chart(
        options,
        results,
        title=f"{file_tag} redundancy study ({metric})",
        xlabel="correlation threshold",
        ylabel=metric,
        percentage=True,
    )
    savefig(f"images/{file_tag}_fs_redundancy_{metric}_study.png")
    return results
eval_metric = "recall"
test: DataFrame = read_csv("data/stroke_test.csv")
figure(figsize=(2 * HEIGHT, HEIGHT))
study_redundancy_for_feature_selection(
    train,
    test,
    target=target,
    min_threshold=0.25,
    lag=0.05,
    metric=eval_metric,
    file_tag=file_tag,
)
show()
From these results, it is clear that removing weakly correlated variables has a negative impact on the selected modeling techniques, much stronger for naive Bayes than for KNN in this case study.
The best results for naive Bayes are obtained when we discard variables with a correlation higher than 0.5, while for KNN it is preferable to drop variables with correlations above 0.4. Since the improvement for naive Bayes is much higher, we choose a threshold of 0.5.
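The same comparison can be made from the dictionary returned by the study function instead of reading it off the chart. The sketch below assumes there is exactly one score per threshold, which holds only if evaluate_approach never returns None. The helper best_threshold and all the scores are hypothetical placeholders, not the results of this case study.

```python
def best_threshold(options: list[float], scores: list[float]) -> tuple[float, float]:
    # Position of the highest score; ties resolve to the lowest threshold
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return options[best_index], scores[best_index]


# Placeholder thresholds and scores, made up for illustration only
options = [0.3, 0.4, 0.5, 0.6]
results = {
    "NB": [0.50, 0.55, 0.70, 0.60],
    "KNN": [0.40, 0.45, 0.44, 0.43],
}

best = {model: best_threshold(options, scores) for model, scores in results.items()}
for model, (threshold, score) in best.items():
    print(f"{model}: best threshold={threshold}, score={score}")
```

Since the study function returns only the scores, the list of thresholds has to be rebuilt from the same min_threshold and lag that were passed to it.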
We can then run the apply_feature_selection function, as before, but now after applying select_redundant_variables with min_threshold = 0.5.
vars2drop: list[str] = select_redundant_variables(
    train, min_threshold=0.5, target=target
)
train_cp, test_cp = apply_feature_selection(
    train, test, vars2drop, filename=f"data/{file_tag}", tag="redundant"
)
print(f"Original data: train={train.shape}, test={test.shape}")
print(f"After redundant FS: train_cp={train_cp.shape}, test_cp={test_cp.shape}")
Original data: train=(3577, 11), test=(1533, 11)
After redundant FS: train_cp=(3577, 9), test_cp=(1533, 9)