After loading the data, the usual procedure is to plot it at the most atomic granularity available, looking for regularities (repetitions) in the series.
import pandas as pd
import matplotlib.pyplot as plt
import ts_functions as ts

# read the hourly series, parsing the timestamp column into a DatetimeIndex
data = pd.read_csv('data/ashrae_single.csv', index_col='timestamp', sep=',', decimal='.',
                   parse_dates=True, infer_datetime_format=True)
print("Nr. Records = ", data.shape[0])
print("First timestamp", data.index[0])
print("Last timestamp", data.index[-1])

plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(data, x_label='timestamp', y_label='consumption', title='ASHRAE')
plt.xticks(rotation=45)
plt.show()
The plot shows the electrical consumption of a specific building during 2016, measured hourly, for a total of 8784 observations.
Even a quick look at the chart reveals some interesting regularities. This suggests that the series is composed of several components, and analysing it at different granularities is one of the ways to uncover them.
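One way to make those components explicit, although not done in this analysis, is a classical seasonal decomposition. A minimal sketch, assuming statsmodels is available and a daily cycle (period=24) in the hourly data, using the meter_reading column shown later:
from statsmodels.tsa.seasonal import seasonal_decompose

# split the hourly series into trend, seasonal and residual components
decomposition = seasonal_decompose(data['meter_reading'].dropna(), model='additive', period=24)
decomposition.plot()
plt.show()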
# daily granularity: average the hourly values over each calendar day
day_df = data.copy().groupby(data.index.date).mean()
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(day_df, title='Daily consumptions', x_label='timestamp', y_label='consumption')
plt.xticks(rotation=45)
plt.show()
Aggregating by day performs a kind of smoothing, since we use the mean as the aggregation function. As a result, we obtain a smoother version of the original time series, with less noise.
In this new version we can still identify a cyclic behaviour, which now appears to repeat weekly.
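As a side note (not in the original code), the same daily aggregation could also be obtained with pandas' resample, since the index is a DatetimeIndex; day_df_alt is just an illustrative name:
# alternative daily aggregation using resample on the DatetimeIndex
day_df_alt = data.resample('D').mean()
print(day_df_alt.head())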
# weekly granularity: average the hourly values over each week
index = data.index.to_period('W')
week_df = data.copy().groupby(index).mean()
week_df['timestamp'] = index.drop_duplicates().to_timestamp()
week_df.set_index('timestamp', drop=True, inplace=True)
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(week_df, title='Weekly consumptions', x_label='timestamp', y_label='consumption')
plt.xticks(rotation=45)
plt.show()
The chart for weekly consumption is quite different from the previous ones: it no longer shows any cyclic behaviour. Indeed, despite the downward trend in the second half of the year, the weekly consumption is almost constant during the first quarter.
# monthly granularity: average the hourly values over each month
index = data.index.to_period('M')
month_df = data.copy().groupby(index).mean()
month_df['timestamp'] = index.drop_duplicates().to_timestamp()
month_df.set_index('timestamp', drop=True, inplace=True)
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(month_df, title='Monthly consumptions', x_label='timestamp', y_label='consumption')
plt.show()
The chart for monthly consumption confirms the trends identified above…
# quarterly granularity: average the hourly values over each quarter
index = data.index.to_period('Q')
quarter_df = data.copy().groupby(index).mean()
quarter_df['timestamp'] = index.drop_duplicates().to_timestamp()
quarter_df.set_index('timestamp', drop=True, inplace=True)
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(quarter_df, title='Quarterly consumptions', x_label='timestamp', y_label='consumption')
plt.show()
and, together with the previous chart, it confirms the suspicion about the lack of stationarity in the time series: its mean is not constant over time, and in particular the consumption levels differ considerably from quarter to quarter.
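This visual suspicion could also be checked with a formal test, such as the Augmented Dickey-Fuller test. A minimal sketch, assuming statsmodels is available (this test is not part of the original analysis):
from statsmodels.tsa.stattools import adfuller

# null hypothesis: the series has a unit root (i.e. is non-stationary)
stat, p_value = adfuller(data['meter_reading'].dropna())[:2]
print('ADF statistic = %.3f, p-value = %.3f' % (stat, p_value))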
As for tabular data, one of the perspectives of analysis to consider is the distribution of the data, in particular the centrality and spread of the variable. But be aware that different aggregations may show different distributions.
The simplest way to analyse our variable is through the 5-number summary, which can be shown with boxplots. Let's analyse both the hourly and the weekly data, but now aggregating with the sum instead of the mean, so that the weekly values represent total consumption and no distortion is introduced.
# weekly totals: aggregate the hourly values with the sum over each week
index = data.index.to_period('W')
week_df = data.copy().groupby(index).sum()
week_df['timestamp'] = index.drop_duplicates().to_timestamp()
week_df.set_index('timestamp', drop=True, inplace=True)

# print the 5-number summaries side by side
_, axs = plt.subplots(1, 2, figsize=(2*ts.HEIGHT, ts.HEIGHT/2))
axs[0].grid(False)
axs[0].set_axis_off()
axs[0].set_title('HOURLY', fontweight="bold")
axs[0].text(0, 0, str(data.describe()))
axs[1].grid(False)
axs[1].set_axis_off()
axs[1].set_title('WEEKLY', fontweight="bold")
axs[1].text(0, 0, str(week_df.describe()))
plt.show()

# boxplots for the hourly and weekly data
_, axs = plt.subplots(1, 2, figsize=(2*ts.HEIGHT, ts.HEIGHT))
data.boxplot(ax=axs[0])
week_df.boxplot(ax=axs[1])
plt.show()
But from these charts it is not possible to fully understand the variable's distribution. To do so, we use histograms.
# hourly histograms with an increasing number of bins
bins = (10, 25, 50)
_, axs = plt.subplots(1, len(bins), figsize=(len(bins)*ts.HEIGHT, ts.HEIGHT))
for j in range(len(bins)):
    axs[j].set_title('Histogram for hourly meter_reading %d bins'%bins[j])
    axs[j].set_xlabel('consumption')
    axs[j].set_ylabel('Nr records')
    axs[j].hist(data.values, bins=bins[j])
plt.show()
From the histograms we see that our data do not follow a normal distribution, even though the first histogram might suggest so. The histograms with more bins reveal a multimodal distribution, with at least two distinct modes.
Note the importance of building the histogram at the most adequate data granularity for the task at hand.
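To make the modes visible independently of the number of bins, a kernel density estimate can complement the histograms. A minimal sketch using pandas' built-in plotting (it requires scipy, and is not part of the original analysis):
# kernel density estimate of the hourly consumption (requires scipy)
data['meter_reading'].dropna().plot(kind='kde', figsize=(3*ts.HEIGHT, ts.HEIGHT), title='KDE for hourly meter_reading')
plt.xlabel('consumption')
plt.show()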
# weekly histograms with the same numbers of bins
_, axs = plt.subplots(1, len(bins), figsize=(len(bins)*ts.HEIGHT, ts.HEIGHT))
for j in range(len(bins)):
    axs[j].set_title('Histogram for weekly meter_reading %d bins'%bins[j])
    axs[j].set_xlabel('consumption')
    axs[j].set_ylabel('Nr records')
    axs[j].hist(week_df.values, bins=bins[j])
plt.show()
To study stationarity visually, we start by plotting the series against its global mean (and standard deviation): if the series were stationary, the observations would fluctuate around this constant line.
import numpy as np
dt_series = pd.Series(data['meter_reading'])
# constant line with the global mean of the series
mean_line = pd.Series(np.ones(len(dt_series.values)) * dt_series.mean(), index=dt_series.index)
series = {'ashrae': dt_series, 'mean': mean_line}
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(series, x_label='timestamp', y_label='consumption', title='Stationary study', show_std=True)
plt.show()
A sharper view splits the series into a fixed number of consecutive intervals and draws the mean of each one: if these means differ considerably, the series cannot be stationary.
BINS = 10
line = []
n = len(dt_series)
for i in range(BINS):
    # mean of the i-th consecutive chunk of the series
    b = dt_series[i*n//BINS:(i+1)*n//BINS]
    mean = [b.mean()] * (n//BINS)
    line += mean
# pad the last chunk so the mean line matches the series length
line += [line[-1]] * (n - len(line))
mean_line = pd.Series(line, index=dt_series.index)
series = {'ashrae': dt_series, 'mean': mean_line}
plt.figure(figsize=(3*ts.HEIGHT, ts.HEIGHT))
ts.plot_series(series, x_label='time', y_label='consumptions', title='Stationary study', show_std=True)
plt.show()
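The same binned means can also be computed directly with numpy, which is convenient for inspecting the values numerically; a small equivalent sketch (chunk_means is just an illustrative name):
# mean consumption per consecutive chunk, computed with numpy (handles uneven splits)
chunk_means = [np.nanmean(chunk) for chunk in np.array_split(dt_series.values, BINS)]
print(['%.2f' % m for m in chunk_means])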