Forecasting
ARIMA
ARIMA receives three mandatory parameters p, d, q, but it works a little different from the other estimators we have been using from sklearn, since it is implemented in the statsmodels.tsa package.
Like before we create the ARIMA object, but now we pass the data to model and the parameters to use. After this we fit the model, which is returned as a ARIMAResults object.
from pandas import read_csv, DataFrame, Series
from statsmodels.tsa.arima.model import ARIMA
from dslabs_functions import series_train_test_split, HEIGHT
filename: str = "data/time_series/ashrae.csv"
file_tag: str = "ASHRAE"
target: str = "meter_reading"
timecol: str = "timestamp"
measure: str = "R2"
data: DataFrame = read_csv(filename, index_col=timecol, sep=",", decimal=".", parse_dates=True)
series: Series = data[target]
train, test = series_train_test_split(data, trn_pct=0.90)
predictor = ARIMA(train, order=(3, 1, 2))
model = predictor.fit()
print(model.summary())
SARIMAX Results
==============================================================================
Dep. Variable: meter_reading No. Observations: 864
Model: ARIMA(3, 1, 2) Log Likelihood -3397.092
Date: Sun, 10 Dec 2023 AIC 6806.184
Time: 19:22:22 BIC 6834.746
Sample: 01-15-2016 HQIC 6817.117
- 02-19-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 2.1373 0.050 42.898 0.000 2.040 2.235
ar.L2 -1.4363 0.089 -16.085 0.000 -1.611 -1.261
ar.L3 0.2482 0.044 5.629 0.000 0.162 0.335
ma.L1 -1.7648 0.038 -46.409 0.000 -1.839 -1.690
ma.L2 0.7993 0.037 21.368 0.000 0.726 0.873
sigma2 129.4178 4.911 26.354 0.000 119.793 139.043
===================================================================================
Ljung-Box (L1) (Q): 3.60 Jarque-Bera (JB): 127.28
Prob(Q): 0.06 Prob(JB): 0.00
Heteroskedasticity (H): 0.80 Skew: 0.73
Prob(H) (two-sided): 0.06 Kurtosis: 4.19
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
These results can be summarized and plotted in a fashionable way through the plot_diagnostics method, which shows the following plots (ordered clockwise from top left):
- Standardized residuals over time
- Histogram plus estimated density of standardized residuals, along with a Normal(0,1) density plotted for reference.
- Normal Q-Q plot, with Normal reference line.
- Correlogram
model.plot_diagnostics(figsize=(2 * HEIGHT, 1.5 * HEIGHT))
So, as usual, in order to find the best model, we need to look for the best (p, d, q) parameters...
from matplotlib.pyplot import figure, savefig, subplots
from dslabs_functions import FORECAST_MEASURES, DELTA_IMPROVE, plot_multiline_chart
def arima_study(train: Series, test: Series, measure: str = "R2"):
d_values = (0, 1, 2)
p_params = (1, 2, 3, 5, 7, 10)
q_params = (1, 3, 5, 7)
flag = measure == "R2" or measure == "MAPE"
best_model = None
best_params: dict = {"name": "ARIMA", "metric": measure, "params": ()}
best_performance: float = -100000
fig, axs = subplots(1, len(d_values), figsize=(len(d_values) * HEIGHT, HEIGHT))
for i in range(len(d_values)):
d: int = d_values[i]
values = {}
for q in q_params:
yvalues = []
for p in p_params:
arima = ARIMA(train, order=(p, d, q))
model = arima.fit()
prd_tst = model.forecast(steps=len(test), signal_only=False)
eval: float = FORECAST_MEASURES[measure](test, prd_tst)
# print(f"ARIMA ({p}, {d}, {q})", eval)
if eval > best_performance and abs(eval - best_performance) > DELTA_IMPROVE:
best_performance: float = eval
best_params["params"] = (p, d, q)
best_model = model
yvalues.append(eval)
values[q] = yvalues
plot_multiline_chart(
p_params, values, ax=axs[i], title=f"ARIMA d={d} ({measure})", xlabel="p", ylabel=measure, percentage=flag
)
print(
f"ARIMA best results achieved with (p,d,q)=({best_params['params'][0]:.0f}, {best_params['params'][1]:.0f}, {best_params['params'][2]:.0f}) ==> measure={best_performance:.2f}"
)
return best_model, best_params
which when applied to our dataset finds the best model for the specific case, plotting the performace for each set of parameters:
from matplotlib.pyplot import savefig
best_model, best_params = arima_study(train, test, measure=measure)
savefig(f"images/{file_tag}_arima_{measure}_study.png")
ARIMA best results achieved with (p,d,q)=(3, 0, 5) ==> measure=0.16
And now we can see the best model and its performance.
from dslabs_functions import plot_forecasting_eval
params = best_params["params"]
prd_trn = best_model.predict(start=0, end=len(train) - 1)
prd_tst = best_model.forecast(steps=len(test))
plot_forecasting_eval(
train, test, prd_trn, prd_tst, title=f"{file_tag} - ARIMA (p={params[0]}, d={params[1]}, q={params[2]})"
)
savefig(f"images/{file_tag}_arima_{measure}_eval.png")
and its visualization...
from dslabs_functions import plot_forecasting_series
plot_forecasting_series(
train,
test,
prd_tst,
title=f"{file_tag} - ARIMA ",
xlabel=timecol,
ylabel=target,
)
savefig(f"images/{file_tag}_arima_{measure}_forecast.png")