Classification
Logistic Regression
Logistic Regression is a simple classifier implemented through the LogisticRegression class in the sklearn.linear_model package.
Parameters study
Its training may be done through different solvers. The study below uses liblinear, one of the solvers able to use both L1 and L2 regularization, which is selected through the penalty parameter. Besides the penalty, the maximum number of iterations (max_iter parameter) is another relevant one.
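To make the effect of the penalty parameter concrete, here is a minimal sketch that fits both penalties with liblinear and counts how many coefficients end up exactly zero. The synthetic dataset and the value C=0.05 are assumptions chosen only for illustration; they are not part of the stroke study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the stroke dataset)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, random_state=0
)

zero_counts = {}
for penalty in ("l1", "l2"):
    # C=0.05 is a hypothetical, fairly strong regularization setting
    clf = LogisticRegression(
        penalty=penalty, solver="liblinear", C=0.05, max_iter=500
    )
    clf.fit(X, y)
    # Count coefficients that the penalty drove exactly to zero
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(f"penalty={penalty}: {zero_counts[penalty]} of {clf.coef_.size} coefficients are zero")
```

L1 regularization tends to zero out uninformative coefficients, while L2 only shrinks them, so the two penalties can lead to quite different models on the same data.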
By setting the verbose parameter, we are able to follow the evolution of the error along the iterations.
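In order to see the evolution of training along the number of iterations, we make use of the warm_start parameter, which reuses the previous solution as the starting point and so reduces the amount of time needed for training. Note that the scikit-learn documentation describes warm_start as having no effect with the liblinear solver.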
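As a sketch of how warm_start is meant to work, the example below reuses a single estimator across several calls to fit, each adding a small iteration budget. It swaps in the lbfgs solver, because warm_start is documented as useless for liblinear; the synthetic data and the step and rounds values are hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the stroke dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

step = 5    # iterations allowed per call to fit (hypothetical value)
rounds = 6  # number of successive calls to fit (hypothetical value)

# The SAME estimator is reused, so each fit starts from the previous coef_
clf = LogisticRegression(solver="lbfgs", max_iter=step, warm_start=True)

train_accuracy = []
with warnings.catch_warnings():
    # Stopping early on purpose triggers ConvergenceWarning; silence it here
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    for r in range(1, rounds + 1):
        clf.fit(X, y)
        train_accuracy.append(clf.score(X, y))
        print(f"after at most {r * step} iterations: accuracy={train_accuracy[-1]:.3f}")
```

The key point is that warm_start only helps when the same estimator object is fitted repeatedly; a freshly constructed estimator has no previous solution to continue from.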
from numpy import array, ndarray
from matplotlib.pyplot import figure, savefig, show
from sklearn.linear_model import LogisticRegression
from dslabs_functions import (
    CLASS_EVAL_METRICS,
    DELTA_IMPROVE,
    read_train_test_from_files,
)
from dslabs_functions import plot_evaluation_results, plot_multiline_chart

def logistic_regression_study(
    trnX: ndarray,
    trnY: array,
    tstX: ndarray,
    tstY: array,
    nr_max_iterations: int = 2500,
    lag: int = 500,
    metric: str = "accuracy",
) -> tuple[LogisticRegression | None, dict]:
    # Iteration budgets to evaluate: lag, 2*lag, ..., nr_max_iterations
    nr_iterations: list[int] = [lag] + [
        i for i in range(2 * lag, nr_max_iterations + 1, lag)
    ]

    penalty_types: list[str] = ["l1", "l2"]  # both are supported by solver="liblinear"

    best_model = None
    best_params: dict = {"name": "LR", "metric": metric, "params": ()}
    best_performance: float = 0.0

    values: dict = {}
    for type in penalty_types:
        warm_start = False
        y_tst_values: list[float] = []
        for j in range(len(nr_iterations)):
            # A new estimator is built at every step and liblinear ignores
            # warm_start, so max_iter must be the cumulative budget for this
            # step rather than the increment (lag).
            clf = LogisticRegression(
                penalty=type,
                max_iter=nr_iterations[j],
                warm_start=warm_start,
                solver="liblinear",
                verbose=False,
            )
            clf.fit(trnX, trnY)
            prdY: array = clf.predict(tstX)
            eval: float = CLASS_EVAL_METRICS[metric](tstY, prdY)
            y_tst_values.append(eval)
            warm_start = True
            # Keep the model only if it improves on the best one by a margin
            if eval - best_performance > DELTA_IMPROVE:
                best_performance = eval
                best_params["params"] = (type, nr_iterations[j])
                best_model = clf
            # print(f'LR penalty={type} n={nr_iterations[j]}')
        values[type] = y_tst_values
    plot_multiline_chart(
        nr_iterations,
        values,
        title=f"LR models ({metric})",
        xlabel="nr iterations",
        ylabel=metric,
        percentage=True,
    )
    print(
        f'LR best for {best_params["params"][1]} iterations (penalty={best_params["params"][0]})'
    )

    return best_model, best_params

file_tag = "stroke"
train_filename = "data/stroke_train_smote.csv"
test_filename = "data/stroke_test.csv"
target = "stroke"
eval_metric = "accuracy"

trnX, tstX, trnY, tstY, labels, vars = read_train_test_from_files(
    train_filename, test_filename, target
)
print(f"Train#={len(trnX)} Test#={len(tstX)}")
print(f"Labels={labels}")

figure()
best_model, params = logistic_regression_study(
    trnX,
    trnY,
    tstX,
    tstY,
    nr_max_iterations=5000,
    lag=500,
    metric=eval_metric,
)
savefig(f"images/{file_tag}_lr_{eval_metric}_study.png")
show()
Train#=6806 Test#=1533
Labels=[0, 1]
LR best for 500 iterations (penalty=l1)
Best model performance
The message printed after the plot shows the parameters for which the best results were achieved. Let's now look at the performance of that model in terms of other metrics.
prd_trn: array = best_model.predict(trnX)
prd_tst: array = best_model.predict(tstX)
figure()
plot_evaluation_results(params, trnY, prd_trn, tstY, prd_tst, labels)
savefig(f'images/{file_tag}_lr_{params["name"]}_best_{params["metric"]}_eval.png')
show()
From this information, it is clear that the learned model is not useful, since it has zero recall and zero precision. Indeed, it simply classifies every record as negative, meaning it doesn't serve its purpose.
This means that accuracy is clearly not the right measure to optimize in this situation.
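The following minimal sketch illustrates why accuracy can mislead on imbalanced data. The labels are hypothetical (95 negatives and 5 positives, not the stroke test set), and the "classifier" is simply one that always predicts the negative class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 95 negatives and 5 positives (not the stroke data)
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the negative class
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 (without a warning) when nothing is predicted positive
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f}")
# prints: accuracy=0.95 recall=0.00 precision=0.00
```

A model can therefore score highly on accuracy while never identifying a single positive case, which is why metrics such as recall or precision may be more informative targets here.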
Overfitting study
For Logistic Regression, the simplest parameter for controlling how specialized the model becomes is the number of iterations allowed: again, the larger the number of iterations, the higher the complexity of the model.
type: str = params["params"][0]
nr_iterations: list[int] = [i for i in range(100, 1001, 100)]

y_tst_values: list[float] = []
y_trn_values: list[float] = []
acc_metric = "accuracy"

warm_start = False
for n in nr_iterations:
    clf = LogisticRegression(
        warm_start=warm_start,
        penalty=type,
        max_iter=n,
        solver="liblinear",
        verbose=False,
    )
    clf.fit(trnX, trnY)
    prd_tst_Y: array = clf.predict(tstX)
    prd_trn_Y: array = clf.predict(trnX)
    y_tst_values.append(CLASS_EVAL_METRICS[acc_metric](tstY, prd_tst_Y))
    y_trn_values.append(CLASS_EVAL_METRICS[acc_metric](trnY, prd_trn_Y))
    warm_start = True

figure()
plot_multiline_chart(
    nr_iterations,
    {"Train": y_trn_values, "Test": y_tst_values},
    title=f"LR overfitting study for penalty={type}",
    xlabel="nr_iterations",
    ylabel=str(eval_metric),
    percentage=True,
)
savefig(f"images/{file_tag}_lr_{eval_metric}_overfitting.png")
In terms of overfitting, there is none, since we don't observe any change along the number of iterations, neither in the train nor in the test dataset.
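One possible explanation for flat curves like these is that the solver converges well before the iteration budget is reached, so raising max_iter changes nothing. A fitted LogisticRegression exposes the number of iterations actually used in its n_iter_ attribute, which can be checked as in the sketch below. The synthetic data is an assumption for illustration, not the stroke dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the stroke dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

max_iter = 1000  # iteration budget given to the solver
clf = LogisticRegression(penalty="l2", solver="liblinear", max_iter=max_iter)
clf.fit(X, y)

# n_iter_ holds the number of iterations the solver actually performed
iterations_used = int(clf.n_iter_[0])
print(f"iterations used: {iterations_used} (budget: {max_iter})")
```

If the value reported is far below the budget, every setting of max_iter above it yields the same model, and the train and test curves stay constant.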