Forecasting

ARIMA

ARIMA receives three mandatory parameters p, d, q, but it works a little different from the other estimators we have been using from sklearn, since it is implemented in the statsmodels.tsa package.

Like before we create the ARIMA object, but now we pass the data to model and the parameters to use. After this we fit the model, which is returned as a ARIMAResults object.

In [ ]:
from pandas import read_csv, DataFrame, Series
from statsmodels.tsa.arima.model import ARIMA
from dslabs_functions import series_train_test_split, HEIGHT

filename: str = "data/time_series/ashrae.csv"
file_tag: str = "ASHRAE"
target: str = "meter_reading"
timecol: str = "timestamp"
measure: str = "R2"

data: DataFrame = read_csv(filename, index_col=timecol, sep=",", decimal=".", parse_dates=True)
series: Series = data[target]
train, test = series_train_test_split(data, trn_pct=0.90)

predictor = ARIMA(train, order=(3, 1, 2))
model = predictor.fit()
print(model.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:          meter_reading   No. Observations:                  864
Model:                 ARIMA(3, 1, 2)   Log Likelihood               -3397.092
Date:                Sun, 10 Dec 2023   AIC                           6806.184
Time:                        19:22:22   BIC                           6834.746
Sample:                    01-15-2016   HQIC                          6817.117
                         - 02-19-2016                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          2.1373      0.050     42.898      0.000       2.040       2.235
ar.L2         -1.4363      0.089    -16.085      0.000      -1.611      -1.261
ar.L3          0.2482      0.044      5.629      0.000       0.162       0.335
ma.L1         -1.7648      0.038    -46.409      0.000      -1.839      -1.690
ma.L2          0.7993      0.037     21.368      0.000       0.726       0.873
sigma2       129.4178      4.911     26.354      0.000     119.793     139.043
===================================================================================
Ljung-Box (L1) (Q):                   3.60   Jarque-Bera (JB):               127.28
Prob(Q):                              0.06   Prob(JB):                         0.00
Heteroskedasticity (H):               0.80   Skew:                             0.73
Prob(H) (two-sided):                  0.06   Kurtosis:                         4.19
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

These results can be summarized and plotted in a fashionable way through the plot_diagnostics method, which shows the following plots (ordered clockwise from top left):

  • Standardized residuals over time
  • Histogram plus estimated density of standardized residuals, along with a Normal(0,1) density plotted for reference.
  • Normal Q-Q plot, with Normal reference line.
  • Correlogram
In [ ]:
model.plot_diagnostics(figsize=(2 * HEIGHT, 1.5 * HEIGHT))
Out[ ]:
No description has been provided for this image
No description has been provided for this image

So, as usual, in order to find the best model, we need to look for the best (p, d, q) parameters...

In [ ]:
from matplotlib.pyplot import figure, savefig, subplots
from dslabs_functions import FORECAST_MEASURES, DELTA_IMPROVE, plot_multiline_chart


def arima_study(train: Series, test: Series, measure: str = "R2"):
    d_values = (0, 1, 2)
    p_params = (1, 2, 3, 5, 7, 10)
    q_params = (1, 3, 5, 7)

    flag = measure == "R2" or measure == "MAPE"
    best_model = None
    best_params: dict = {"name": "ARIMA", "metric": measure, "params": ()}
    best_performance: float = -100000

    fig, axs = subplots(1, len(d_values), figsize=(len(d_values) * HEIGHT, HEIGHT))
    for i in range(len(d_values)):
        d: int = d_values[i]
        values = {}
        for q in q_params:
            yvalues = []
            for p in p_params:
                arima = ARIMA(train, order=(p, d, q))
                model = arima.fit()
                prd_tst = model.forecast(steps=len(test), signal_only=False)
                eval: float = FORECAST_MEASURES[measure](test, prd_tst)
                # print(f"ARIMA ({p}, {d}, {q})", eval)
                if eval > best_performance and abs(eval - best_performance) > DELTA_IMPROVE:
                    best_performance: float = eval
                    best_params["params"] = (p, d, q)
                    best_model = model
                yvalues.append(eval)
            values[q] = yvalues
        plot_multiline_chart(
            p_params, values, ax=axs[i], title=f"ARIMA d={d} ({measure})", xlabel="p", ylabel=measure, percentage=flag
        )
    print(
        f"ARIMA best results achieved with (p,d,q)=({best_params['params'][0]:.0f}, {best_params['params'][1]:.0f}, {best_params['params'][2]:.0f}) ==> measure={best_performance:.2f}"
    )

    return best_model, best_params

which when applied to our dataset finds the best model for the specific case, plotting the performace for each set of parameters:

In [ ]:
from matplotlib.pyplot import savefig

best_model, best_params = arima_study(train, test, measure=measure)
savefig(f"images/{file_tag}_arima_{measure}_study.png")
ARIMA best results achieved with (p,d,q)=(3, 0, 5) ==> measure=0.16
No description has been provided for this image

And now we can see the best model and its performance.

In [ ]:
from dslabs_functions import plot_forecasting_eval

params = best_params["params"]
prd_trn = best_model.predict(start=0, end=len(train) - 1)
prd_tst = best_model.forecast(steps=len(test))

plot_forecasting_eval(
    train, test, prd_trn, prd_tst, title=f"{file_tag} - ARIMA (p={params[0]}, d={params[1]}, q={params[2]})"
)
savefig(f"images/{file_tag}_arima_{measure}_eval.png")
No description has been provided for this image

and its visualization...

In [ ]:
from dslabs_functions import plot_forecasting_series

plot_forecasting_series(
    train,
    test,
    prd_tst,
    title=f"{file_tag} - ARIMA ",
    xlabel=timecol,
    ylabel=target,
)
savefig(f"images/{file_tag}_arima_{measure}_forecast.png")
No description has been provided for this image