Lab 9: Time Series Profiling

The data under analysis in this lab is a single-variable time series, collected from the ASHRAE - Great Energy Predictor III challenge available on Kaggle.

Data Dimensionality

As with tabular data, the first thing to understand is the data dimensionality. In the case of a single time series this is simple: there is just one variable, hence a single dimension. In this context, however, dimensionality usually refers to the number of observations taken, which corresponds to the length of the series.

So, after loading the data, the usual procedure is to plot it at the most atomic granularity available, looking for regularities (repetitions).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import ts_functions as ts

data = pd.read_csv('data/ashrae_single.csv', index_col='timestamp', sep=',', decimal='.',
                   parse_dates=True, infer_datetime_format=True)
print("Nr. Records = ", data.shape[0])
print("First timestamp", data.index[0])
print("Last timestamp", data.index[-1])
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(data, x_label='timestamp', y_label='consumption', title='ASHRAE')
plt.xticks(rotation = 45)
plt.show()
Nr. Records =  8784
First timestamp 2016-01-01 00:00:00
Last timestamp 2016-12-31 23:00:00

The plot shows the electrical consumption of a specific building during 2016, measured hourly, totaling 8784 observations.

There are some interesting things we can see from the chart:

  • first, there are some time intervals where the consumption is constant, at around 425 kWh;
  • second, there is a pattern that repeats approximately weekly;
  • and third, the series seems to show a small reduction in consumption over the year.

This tells us that the series should contain some components (such as trend and seasonality), and analysing it at different granularities is one of the ways to expose them, as sketched below.
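One way to make those components explicit is a classical decomposition into trend, seasonal and residual parts. A minimal sketch with statsmodels (an assumption: the library is installed, and the period of 24*7 hourly observations reflects the weekly pattern suspected above):

from statsmodels.tsa.seasonal import seasonal_decompose

# classical additive decomposition; period=24*7 (one week of hourly
# observations) is an assumption based on the pattern seen in the chart
decomposition = seasonal_decompose(data['meter_reading'], model='additive', period=24*7)
decomposition.plot()
plt.show()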

Data Granularity

To address the last two claims, we take the data granularity perspective. We have already seen that the data is recorded hourly, but we can try other aggregations...

In [2]:
day_df = data.copy().groupby(data.index.date).mean()
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(day_df, title='Daily consumptions', x_label='timestamp', y_label='consumption')
plt.xticks(rotation = 45)
plt.show()

Aggregating by day, we perform a kind of smoothing, since we are using the mean as the aggregation function. As a result, we get a smoother version of the original time series, with less noise.
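A related way to smooth the hourly series without changing its index is a moving average. A minimal sketch, assuming a centred 24-observation window (one day of hourly readings):

# 24-hour centred rolling mean; the window size is an assumption
# matching one day of hourly observations
smooth = data['meter_reading'].rolling(window=24, center=True).mean()
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
smooth.plot(title='24-hour rolling mean')
plt.show()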

In this daily version, we continue to identify a cyclic behavior, which seems to repeat weekly.
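One way to check this suspicion is the autocorrelation function: with hourly data, a weekly cycle should produce peaks at lags that are multiples of 168 (24*7). A minimal sketch with statsmodels (assuming it is installed):

from statsmodels.graphics.tsaplots import plot_acf

# autocorrelation up to two weeks of hourly lags; peaks around lag 168
# would support the weekly-cycle hypothesis
plot_acf(data['meter_reading'], lags=24*7*2)
plt.show()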

In [3]:
index = data.index.to_period('W')
week_df = data.copy().groupby(index).mean()
week_df['timestamp'] = index.drop_duplicates().to_timestamp()
week_df.set_index('timestamp', drop=True, inplace=True)
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(week_df, title='Weekly consumptions', x_label='timestamp', y_label='consumption')
plt.xticks(rotation = 45)
plt.show()

The chart for weekly consumption is quite different from the previous ones: it does not show any cyclic behavior as before! Indeed, despite the downward trend in the second half of the year, the weekly consumptions are almost constant in the first quarter.

In [4]:
index = data.index.to_period('M')
month_df = data.copy().groupby(index).mean()
month_df['timestamp'] = index.drop_duplicates().to_timestamp()
month_df.set_index('timestamp', drop=True, inplace=True)
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(month_df, title='Monthly consumptions', x_label='timestamp', y_label='consumption')
plt.show()

The chart for monthly consumptions confirms those identified trends…

In [5]:
index = data.index.to_period('Q')
quarter_df = data.copy().groupby(index).mean()
quarter_df['timestamp'] = index.drop_duplicates().to_timestamp()
quarter_df.set_index('timestamp', drop=True, inplace=True)
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(quarter_df, title='Quarterly consumptions', x_label='timestamp', y_label='consumption')
plt.show()

and, jointly with the previous chart, it confirms our suspicion about the lack of stationarity in the time series. Indeed, its mean is not constant over time; in particular, we identify very different consumption values per quarter.
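Those per-quarter differences can also be checked numerically; a short sketch printing the mean and standard deviation of the hourly readings per quarter:

# clearly different quarterly means indicate a non-constant mean
print(data.groupby(data.index.to_period('Q')).agg(['mean', 'std']))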

Data Distribution

As with tabular data, one of the analysis perspectives to consider is the distribution of the data, in particular the centrality and spread of the variable. But be aware that different aggregations may show different distributions.

5-Number Summary

The simplest way to analyze our variable is through the 5-number summary, which can be visualized through boxplots. Let's analyze both the hourly and the weekly data, but now considering the sum instead of the mean, so that the weekly values represent true total consumptions.

In [6]:
index = data.index.to_period('W')
week_df = data.copy().groupby(index).sum()
week_df['timestamp'] = index.drop_duplicates().to_timestamp()
week_df.set_index('timestamp', drop=True, inplace=True)
_, axs = plt.subplots(1, 2, figsize=(2*ts.HEIGHT, ts.HEIGHT/2))
axs[0].grid(False)
axs[0].set_axis_off()
axs[0].set_title('HOURLY', fontweight="bold")
axs[0].text(0, 0, str(data.describe()))
axs[1].grid(False)
axs[1].set_axis_off()
axs[1].set_title('WEEKLY', fontweight="bold")
axs[1].text(0, 0, str(week_df.describe()))
plt.show()

_, axs = plt.subplots(1, 2, figsize=(2*ts.HEIGHT, ts.HEIGHT))
data.boxplot(ax=axs[0])
week_df.boxplot(ax=axs[1])
plt.show()
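The same 5-number information supports a simple numeric outlier check through the usual 1.5*IQR fences; a minimal sketch over the hourly readings:

# compute the 1.5*IQR fences from the hourly quartiles and count
# how many readings fall outside them
q1 = data['meter_reading'].quantile(0.25)
q3 = data['meter_reading'].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5*iqr, q3 + 1.5*iqr
outliers = data[(data['meter_reading'] < low) | (data['meter_reading'] > high)]
print('Nr. outliers =', outliers.shape[0])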

Variable Distribution

But from these charts it is not possible to completely understand the variable's distribution. To do so, we use histograms.

In [7]:
bins = (10, 25, 50)
_, axs = plt.subplots(1, len(bins), figsize=(len(bins)*ts.HEIGHT, ts.HEIGHT))
for j in range(len(bins)):
    axs[j].set_title('Histogram for hourly meter_reading %d bins'%bins[j])
    axs[j].set_xlabel('consumption')
    axs[j].set_ylabel('Nr records')
    axs[j].hist(data.values, bins=bins[j])
plt.show()

In the histograms we recognize that our data does not follow a normal distribution, even though the first histogram might suggest it. The following ones clarify that the distribution is multimodal, with at least two distinct modes.

Note the importance of creating the histogram at the most adequate data granularity for the task at hand.

In [8]:
_, axs = plt.subplots(1, len(bins), figsize=(len(bins)*ts.HEIGHT, ts.HEIGHT))
for j in range(len(bins)):
    axs[j].set_title('Histogram for weekly meter_reading %d bins'%bins[j])
    axs[j].set_xlabel('consumption')
    axs[j].set_ylabel('Nr records')
    axs[j].hist(week_df.values, bins=bins[j])
plt.show()
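The visual impression of non-normality can be complemented with a formal test; a minimal sketch with scipy (assuming it is installed), where a small p-value supports rejecting the normality hypothesis:

from scipy import stats

# D'Agostino-Pearson normality test on the hourly readings
stat, p = stats.normaltest(data['meter_reading'])
print('statistic = %.2f, p-value = %.4f' % (stat, p))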

Data Stationarity

A first visual check for stationarity is to plot the series against its global mean (with a standard deviation band): in a stationary series, the observations should fluctuate around a constant level.

In [9]:
import numpy as np
dt_series = pd.Series(data['meter_reading'])

# constant line at the global mean, to compare against the series
mean_line = pd.Series(np.ones(len(dt_series.values)) * dt_series.mean(), index=dt_series.index)
series = {'ashrae': dt_series, 'mean': mean_line}
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(series, x_label='timestamp', y_label='consumption', title='Stationary study', show_std=True)
plt.show()
A complementary check compares the mean across consecutive segments of the series: if the series were stationary, the per-segment means would stay at roughly the same level.

In [10]:
BINS = 10
line = []
n = len(dt_series)
# compute the mean of each of the BINS consecutive segments and
# repeat it over the segment, building a step-wise mean line
for i in range(BINS):
    segment = dt_series[i*n//BINS:(i+1)*n//BINS]
    line += [segment.mean()] * (n//BINS)
# pad the tail so the line matches the length of the series
line += [line[-1]] * (n - len(line))
mean_line = pd.Series(line, index=dt_series.index)
series = {'ashrae': dt_series, 'mean': mean_line}
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(series, x_label='time', y_label='consumptions', title='Stationary study', show_std=True)
plt.show()
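These visual checks can be complemented with the Augmented Dickey-Fuller test; a minimal sketch with statsmodels (assuming it is installed), where the null hypothesis is that the series has a unit root, i.e. is non-stationary:

from statsmodels.tsa.stattools import adfuller

# ADF test: a p-value above 0.05 fails to reject the unit-root null,
# which is consistent with a non-stationary series
result = adfuller(dt_series.values)
print('ADF statistic = %.4f' % result[0])
print('p-value = %.4f' % result[1])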