Time Series Analysis

Introduction
The objective of time series analysis is to uncover a pattern in a time series and then extrapolate the pattern into the future. Being able to forecast the future is the essence of time series analysis.

The forecast is based solely on past values of the variable and/or on past forecast errors.

Why forecast?
Forecasting applies to many business situations: forecasting demand with a view to making capacity build-out decisions, staff scheduling in a call center, understanding the demand for credit, determining the inventory to order in anticipation of demand, etc. Forecast timescales may differ based on needs: some situations require forecasting years ahead, while others may require forecasts for the next day, or even the next minute.

What is a time series?
A time series is a sequence of observations on a variable measured at successive points in time or over successive periods of time.

The measurements may be taken every hour, day, week, month, year, or any other regular interval. The pattern of the data is important in understanding the series’ past behavior.

If the behavior of the time series data of the past is expected to continue in the future, it can be used as a guide in selecting an appropriate forecasting method.

Let us look at an example.

As usual, some library imports first

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (20, 9)

Loading the data

Let us load some data: https://data.seattle.gov/Transportation/Fremont-Bridge-Bicycle-Counter/65db-xm6k.

[Image: the Fremont Bridge]

This is a picture of the bridge. The second picture shows the bicycle counter.

[Image: photo retrieved from a Google search, credit: Jason H, 2020]

[Image: photo retrieved from http://www.sheridestoday.com/blog/2015/12/21/fremont-bridge-bike-counter]

# Load the data
# You can get more information on this dataset at 
# https://data.seattle.gov/Transportation/Fremont-Bridge-Bicycle-Counter/65db-xm6k

df = pd.read_csv('https://data.seattle.gov/api/views/65db-xm6k/rows.csv')
# Review the column names
df.columns
Index(['Date', 'Fremont Bridge Sidewalks, south of N 34th St',
       'Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk',
       'Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk'],
      dtype='object')
df
Date Fremont Bridge Sidewalks, south of N 34th St Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk
0 08/01/2022 12:00:00 AM 23.0 7.0 16.0
1 08/01/2022 01:00:00 AM 12.0 5.0 7.0
2 08/01/2022 02:00:00 AM 3.0 0.0 3.0
3 08/01/2022 03:00:00 AM 5.0 2.0 3.0
4 08/01/2022 04:00:00 AM 10.0 2.0 8.0
... ... ... ... ...
95635 08/31/2023 07:00:00 PM 224.0 72.0 152.0
95636 08/31/2023 08:00:00 PM 142.0 59.0 83.0
95637 08/31/2023 09:00:00 PM 67.0 35.0 32.0
95638 08/31/2023 10:00:00 PM 43.0 18.0 25.0
95639 08/31/2023 11:00:00 PM 12.0 8.0 4.0

95640 rows × 4 columns

# df.to_excel('Bridge_crossing_data_07Nov2023.xlsx')

We have hourly data on bicycle crossings with three count columns, where Total = East + West sidewalk counts. Our data runs from October 2012 through August 2023, totaling 95k+ rows.

For doing time series analysis with Pandas, the data frame's index should be set to the datetime of each row.

For convenience, we also rename the columns to ['Total', 'East', 'West'].

# Set the index of the time series
df.index = pd.DatetimeIndex(df.Date)
# Now drop the Date column as it is a part of the index
df.drop(columns='Date', inplace=True)
df.head()
Fremont Bridge Sidewalks, south of N 34th St Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk
Date
2022-08-01 00:00:00 23.0 7.0 16.0
2022-08-01 01:00:00 12.0 5.0 7.0
2022-08-01 02:00:00 3.0 0.0 3.0
2022-08-01 03:00:00 5.0 2.0 3.0
2022-08-01 04:00:00 10.0 2.0 8.0
# Rename the columns to make them simpler to use
df.columns = ['Total', 'East', 'West']

Data Exploration

df.shape
(95640, 3)
# Check the maximum and the minimum dates in our data
print(df.index.max())
print(df.index.min())
2023-08-31 23:00:00
2012-10-03 00:00:00
# Let us drop NaN values
df.dropna(inplace=True)
df.shape
(95614, 3)
# Let us look at some sample rows
df.head(10)
Total East West
Date
2022-08-01 00:00:00 23.0 7.0 16.0
2022-08-01 01:00:00 12.0 5.0 7.0
2022-08-01 02:00:00 3.0 0.0 3.0
2022-08-01 03:00:00 5.0 2.0 3.0
2022-08-01 04:00:00 10.0 2.0 8.0
2022-08-01 05:00:00 27.0 5.0 22.0
2022-08-01 06:00:00 100.0 43.0 57.0
2022-08-01 07:00:00 219.0 90.0 129.0
2022-08-01 08:00:00 335.0 143.0 192.0
2022-08-01 09:00:00 212.0 85.0 127.0
# We plot the data 
# Pandas knows that this is a time-series, and creates the right plot

df.plot(kind = 'line',figsize=(12,6));

png

# Let us look at just the first 200 data points

title='Bicycle Crossings'
ylabel='Count'
xlabel='Date'


ax = df.iloc[:200,:].plot(figsize=(18,6),title=title)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

png

Resampling

resample() is a time-based groupby. Pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

The resample function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation.

Many functions are available as a method of the returned object, including sum, mean, std, sem, max, min, median, first, last, ohlc.

Alias Description
B business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
MS month start frequency
Q quarter end frequency
A, Y year end frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
N nanoseconds
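For instance (a quick sketch reusing the hourly df loaded above), we can combine a couple of these aliases with different aggregation methods:

# A couple of quick resample() examples on the hourly bicycle data:
# weekly average hourly count, and total crossings per day
weekly_mean = df.resample('W').mean()
daily_total = df['Total'].resample('D').sum()
daily_total.head()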
# Let us resample the data to be monthly

df.resample(rule='M').sum().plot(figsize = (18,6));

png

# Let us examine monthly data
# We create a new monthly dataframe

df_monthly = df.resample(rule='M').sum()
# Just to keep our analysis clean and be able to understand concepts,
# we will limit ourselves to pre-Covid data

df_precovid = df_monthly[df_monthly.index < pd.to_datetime('2019-12-31')]
df_precovid.plot(figsize = (18,6));

png

# We suppress some warnings pandas produces, more for 
# visual cleanliness than any other reason

pd.options.mode.chained_assignment = None 
df_monthly
Total East West
Date
2012-10-31 65695.0 33764.0 31931.0
2012-11-30 50647.0 26062.0 24585.0
2012-12-31 36369.0 18608.0 17761.0
2013-01-31 44884.0 22910.0 21974.0
2013-02-28 50027.0 25898.0 24129.0
... ... ... ...
2023-04-30 60494.0 23784.0 36710.0
2023-05-31 105039.0 40303.0 64736.0
2023-06-30 102158.0 38076.0 64082.0
2023-07-31 112791.0 43064.0 69727.0
2023-08-31 108541.0 38967.0 69574.0

131 rows × 3 columns

Filtering time series

Source: https://pandas.pydata.org/docs/user_guide/timeseries.html#indexing

Using the index for time series provides us the advantage of being able to filter easily by year or month.

For example, you can do df.loc['2017'] to list all observations for 2017, df.loc['2017-02'] for a single month, or df.loc['2017-02-15'] for a single day.

df.loc['2017-02-15']
Total East West
Date
2017-02-15 00:00:00 4.0 3.0 1.0
2017-02-15 01:00:00 3.0 1.0 2.0
2017-02-15 02:00:00 0.0 0.0 0.0
2017-02-15 03:00:00 2.0 0.0 2.0
2017-02-15 04:00:00 2.0 1.0 1.0
2017-02-15 05:00:00 18.0 8.0 10.0
2017-02-15 06:00:00 60.0 45.0 15.0
2017-02-15 07:00:00 188.0 117.0 71.0
2017-02-15 08:00:00 262.0 152.0 110.0
2017-02-15 09:00:00 147.0 68.0 79.0
2017-02-15 10:00:00 49.0 25.0 24.0
2017-02-15 11:00:00 23.0 13.0 10.0
2017-02-15 12:00:00 12.0 7.0 5.0
2017-02-15 13:00:00 22.0 9.0 13.0
2017-02-15 14:00:00 17.0 2.0 15.0
2017-02-15 15:00:00 47.0 22.0 25.0
2017-02-15 16:00:00 99.0 29.0 70.0
2017-02-15 17:00:00 272.0 54.0 218.0
2017-02-15 18:00:00 181.0 48.0 133.0
2017-02-15 19:00:00 76.0 16.0 60.0
2017-02-15 20:00:00 43.0 14.0 29.0
2017-02-15 21:00:00 15.0 5.0 10.0
2017-02-15 22:00:00 16.0 6.0 10.0
2017-02-15 23:00:00 3.0 1.0 2.0
df.loc['2018']
Total East West
Date
2018-01-01 00:00:00 28.0 14.0 14.0
2018-01-01 01:00:00 16.0 2.0 14.0
2018-01-01 02:00:00 8.0 4.0 4.0
2018-01-01 03:00:00 2.0 2.0 0.0
2018-01-01 04:00:00 0.0 0.0 0.0
... ... ... ...
2018-12-31 19:00:00 14.0 9.0 5.0
2018-12-31 20:00:00 26.0 12.0 14.0
2018-12-31 21:00:00 14.0 7.0 7.0
2018-12-31 22:00:00 7.0 3.0 4.0
2018-12-31 23:00:00 13.0 7.0 6.0

8759 rows × 3 columns

df.loc['2018-02']
Total East West
Date
2018-02-01 00:00:00 8.0 2.0 6.0
2018-02-01 01:00:00 3.0 2.0 1.0
2018-02-01 02:00:00 0.0 0.0 0.0
2018-02-01 03:00:00 6.0 3.0 3.0
2018-02-01 04:00:00 8.0 5.0 3.0
... ... ... ...
2018-02-28 19:00:00 77.0 17.0 60.0
2018-02-28 20:00:00 35.0 7.0 28.0
2018-02-28 21:00:00 32.0 14.0 18.0
2018-02-28 22:00:00 13.0 2.0 11.0
2018-02-28 23:00:00 9.0 3.0 6.0

672 rows × 3 columns

You can also use regular boolean comparisons to filter date ranges:

df[(df.index > pd.to_datetime('1/31/2020')) & (df.index < pd.to_datetime('1/1/2022'))]
Total East West
Date
2020-01-31 01:00:00 1.0 0.0 1.0
2020-01-31 02:00:00 0.0 0.0 0.0
2020-01-31 03:00:00 0.0 0.0 0.0
2020-01-31 04:00:00 8.0 6.0 2.0
2020-01-31 05:00:00 14.0 7.0 7.0
... ... ... ...
2021-12-31 19:00:00 0.0 0.0 0.0
2021-12-31 20:00:00 0.0 0.0 0.0
2021-12-31 21:00:00 0.0 0.0 0.0
2021-12-31 22:00:00 0.0 0.0 0.0
2021-12-31 23:00:00 0.0 0.0 0.0

16821 rows × 3 columns

Sometimes, the date may be contained in a column.

In such cases, we filter as follows:

# We create a temporary dataframe to illustrate
temporary_df = df.loc['2017-01'].copy()
temporary_df.reset_index(inplace = True)
temporary_df
Date Total East West
0 2017-01-01 00:00:00 5.0 0.0 5.0
1 2017-01-01 01:00:00 19.0 5.0 14.0
2 2017-01-01 02:00:00 1.0 1.0 0.0
3 2017-01-01 03:00:00 2.0 0.0 2.0
4 2017-01-01 04:00:00 1.0 0.0 1.0
... ... ... ... ...
739 2017-01-31 19:00:00 116.0 27.0 89.0
740 2017-01-31 20:00:00 64.0 25.0 39.0
741 2017-01-31 21:00:00 32.0 19.0 13.0
742 2017-01-31 22:00:00 19.0 4.0 15.0
743 2017-01-31 23:00:00 15.0 6.0 9.0

744 rows × 4 columns

temporary_df['Date'].dt.day==2
0      False
1      False
2      False
3      False
4      False
       ...  
739    False
740    False
741    False
742    False
743    False
Name: Date, Length: 744, dtype: bool
temporary_df[temporary_df['Date'].dt.month == 1]

# or use temporary_df['Date'].dt.day and year as well
Date Total East West
0 2017-01-01 00:00:00 5.0 0.0 5.0
1 2017-01-01 01:00:00 19.0 5.0 14.0
2 2017-01-01 02:00:00 1.0 1.0 0.0
3 2017-01-01 03:00:00 2.0 0.0 2.0
4 2017-01-01 04:00:00 1.0 0.0 1.0
... ... ... ... ...
739 2017-01-31 19:00:00 116.0 27.0 89.0
740 2017-01-31 20:00:00 64.0 25.0 39.0
741 2017-01-31 21:00:00 32.0 19.0 13.0
742 2017-01-31 22:00:00 19.0 4.0 15.0
743 2017-01-31 23:00:00 15.0 6.0 9.0

744 rows × 4 columns

Plot by month and quarter

from statsmodels.graphics.tsaplots import month_plot, quarter_plot
# Plot the months to see trends over months
month_plot(df_precovid.Total);

png

# Plot the quarter to see trends over quarters
quarter_plot(df_precovid.resample(rule='Q').Total.sum());

png

ETS Decomposition

When we decompose a time series, we are essentially expressing a belief that our data has several discrete components to it, each of which can be isolated and studied separately.

Generally, time series data is split into 3 components: error, trend and seasonality (hence ‘ETS Decomposition’):
1. Seasonal component
2. Trend/cycle component
3. Residual, or error component, which is not explained by the above two.

Multiplicative vs Additive Decomposition
If we assume an additive decomposition, then we can write:

$$y_t = S_t + T_t + R_t$$

where $y_t$ is the data,
- $S_t$ is the seasonal component,
- $T_t$ is the trend-cycle component, and
- $R_t$ is the remainder component at time period $t$.

A multiplicative decomposition would be similarly written as:

$$y_t = S_t \times T_t \times R_t$$
The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series.

When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.

  • Components are additive when the size of the seasonal swings and other variations does not change with the level of the series, and
  • Components are multiplicative when those variations scale with the level of the series.


Multiplicative

The Statsmodels library gives us the functionality to decompose time series. Below, we decompose the time series using multiplicative decomposition. Let us spend a couple of moments looking at the chart below. Note that since this is a multiplicative decomposition, the first panel, ‘Total’, is the product of the other three panels, i.e., Trend × Seasonal × Resid.

# Now we decompose our time series

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# We use the multiplicative model

result = seasonal_decompose(df_precovid['Total'], model = 'multiplicative') 
plt.rcParams['figure.figsize'] = (20, 9)
result.plot();

png

# Each of the above components are contained in our `result`
# object as trend, seasonal and error.
# Let us put them in a dataframe
ets = pd.DataFrame({'Total': df_precovid['Total'], 
              'trend': result.trend, 
              'seasonality': result.seasonal, 
              'error': result.resid}).head(20)
# ets.to_excel('ets_mul.xlsx')
ets
Total trend seasonality error
Date
2012-10-31 65695.0 NaN 1.001160 NaN
2012-11-30 50647.0 NaN 0.731912 NaN
2012-12-31 36369.0 NaN 0.536130 NaN
2013-01-31 44884.0 NaN 0.702895 NaN
2013-02-28 50027.0 NaN 0.588604 NaN
2013-03-31 66089.0 NaN 0.837828 NaN
2013-04-30 71998.0 75386.958333 0.980757 0.973785
2013-05-31 108574.0 76398.625000 1.392398 1.020650
2013-06-30 99280.0 77057.250000 1.327268 0.970710
2013-07-31 117974.0 77981.125000 1.427836 1.059543
2013-08-31 104549.0 78480.583333 1.347373 0.988712
2013-09-30 80729.0 78247.375000 1.125840 0.916396
2013-10-31 81352.0 78758.291667 1.001160 1.031736
2013-11-30 59270.0 79796.916667 0.731912 1.014822
2013-12-31 43553.0 80700.958333 0.536130 1.006629
2014-01-31 59873.0 81297.708333 0.702895 1.047762
2014-02-28 47025.0 81740.875000 0.588604 0.977386
2014-03-31 63494.0 82772.958333 0.837828 0.915565
2014-04-30 86855.0 83550.500000 0.980757 1.059948
2014-05-31 118644.0 83531.833333 1.392398 1.020071
# Check if things work in the multiplicative model

print('Total = ', 71998.0)
print('Trend * Factor for Seasonality * Factor for Error =',75386.958333 * 0.980757 * 0.973785)

Total =  71998.0
Trend * Factor for Seasonality * Factor for Error = 71998.04732763417

Additive

We do the same thing as before, except that we change the model to be additive.

result = seasonal_decompose(df_precovid['Total'], model = 'additive') 
plt.rcParams['figure.figsize'] = (20, 9)
result.plot();

png

Here, an additive model seems to make sense. This is because the residuals seem to be better centered around zero.

Obtaining the components numerically
While this is great from a visual or graphical perspective, sometimes we may need to get the actual numbers for the three decomposed components. We can do so easily - the code below provides us this data in a dataframe.

# Each of the above components are contained in our `result`
# object as trend, seasonal and error.
# Let us put them in a dataframe
ets = pd.DataFrame({'Total': df_precovid['Total'], 
              'trend': result.trend, 
              'seasonality': result.seasonal, 
              'error': result.resid}).head(20)
# ets.to_excel('ets_add.xlsx')
ets
Total trend seasonality error
Date
2012-10-31 65695.0 NaN 315.920635 NaN
2012-11-30 50647.0 NaN -22171.933532 NaN
2012-12-31 36369.0 NaN -38436.746032 NaN
2013-01-31 44884.0 NaN -24517.440476 NaN
2013-02-28 50027.0 NaN -34696.308532 NaN
2013-03-31 66089.0 NaN -13329.627976 NaN
2013-04-30 71998.0 75386.958333 -1536.224206 -1852.734127
2013-05-31 108574.0 76398.625000 32985.115079 -809.740079
2013-06-30 99280.0 77057.250000 26951.087302 -4728.337302
2013-07-31 117974.0 77981.125000 35276.733135 4716.141865
2013-08-31 104549.0 78480.583333 28635.517857 -2567.101190
2013-09-30 80729.0 78247.375000 10523.906746 -8042.281746
2013-10-31 81352.0 78758.291667 315.920635 2277.787698
2013-11-30 59270.0 79796.916667 -22171.933532 1645.016865
2013-12-31 43553.0 80700.958333 -38436.746032 1288.787698
2014-01-31 59873.0 81297.708333 -24517.440476 3092.732143
2014-02-28 47025.0 81740.875000 -34696.308532 -19.566468
2014-03-31 63494.0 82772.958333 -13329.627976 -5949.330357
2014-04-30 86855.0 83550.500000 -1536.224206 4840.724206
2014-05-31 118644.0 83531.833333 32985.115079 2127.051587
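As with the multiplicative model, we can check that the pieces fit together, this time by adding them (using the 2013-04-30 row from the table above):

# Check if things work in the additive model: Total = Trend + Seasonality + Error
print('Total = ', 71998.0)
print('Trend + Seasonality + Error =', 75386.958333 + (-1536.224206) + (-1852.734127))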
ets.describe()
Total trend seasonality error
count 20.000000 14.000000 20.000000 14.000000
mean 72844.050000 79692.997024 -5069.362252 -284.346372
std 25785.957067 2640.000168 25620.440647 3935.011207
min 36369.000000 75386.958333 -38436.746032 -8042.281746
25% 50492.000000 78047.687500 -24517.440476 -2388.509425
50% 65892.000000 79277.604167 -7432.926091 634.610615
75% 89961.250000 81630.083333 14630.701885 2240.103671
max 118644.000000 83550.500000 35276.733135 4840.724206

What is this useful for?
- Time series decomposition is primarily useful for studying time series data, and exploring historical trends over time.
- It is also useful for calculating 'seasonally adjusted' numbers, i.e., the series with the seasonal component removed (leaving the trend-cycle and remainder): seasonally adjusted number = $y_t - S_t$ for an additive decomposition, or $y_t / S_t$ for a multiplicative one.
- Note that seasons are different from cycles. Cycles have no fixed length, and we can never be sure of when they begin, peak and end. The timing of cycles is unpredictable.
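For instance (a quick sketch using the additive result object fitted above), the seasonally adjusted series is simply the data with the seasonal component subtracted out:

# Seasonally adjusted series from the additive decomposition above
seasonally_adjusted = df_precovid['Total'] - result.seasonal
seasonally_adjusted.plot(figsize=(18, 6), title='Seasonally adjusted monthly crossings');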

Moving Average and Exponentially Weighted Moving Average

Moving averages are an easy way to understand and describe time series.

By using a sliding window along which observations are averaged, they can suppress seasonality and noise, and expose the trend.

Moving averages are not generally used for forecasting, and don’t inform us about the future behavior of our time series. Their huge advantage is that they are simple to understand and explain, and they give us a high-level view of what is in the data.

Simple Moving Averages
Simple moving averages (SMA) tend to even out seasonality, and offer an easy way to examine the trend. Consider the 6 month and 12 month moving averages in the graphic below. SMAs are difficult to use for forecasting, and will lag by the window size.

Exponentially Weighted Moving Average (EWMA)
EWMA is a more advanced method than SMA, and puts more weight on values that occurred more recently. Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. The more recent the observation, the higher the associated weight.

This framework generates reliable forecasts quickly and for a wide range of time series, which is a great advantage and of major importance to applications in industry.

EWMA can be easily calculated using the ewm() function in pandas. The parameter adjust controls how the EWMA term is calculated.

  • When adjust=True (default), the EW function is calculated using weights $w_i = (1-\alpha)^i$. For example, the EW moving average of the series $[x_0, x_1, \ldots, x_t]$ would be:

$$y_t = \frac{x_t + (1-\alpha)\,x_{t-1} + (1-\alpha)^2 x_{t-2} + \cdots + (1-\alpha)^t x_0}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots + (1-\alpha)^t}$$

  • When adjust=False, the exponentially weighted function is calculated recursively:

$$y_0 = x_0, \qquad y_t = (1-\alpha)\, y_{t-1} + \alpha\, x_t$$

(Source: Pandas documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html)

The smoothing factor alpha can be specified in one of four ways:
- alpha: specify the smoothing factor $\alpha$ directly, $0 < \alpha \le 1$.
- span: corresponds to what is commonly called an "N-day Exponentially Weighted Moving Average"; $\alpha = 2/(\mathrm{span} + 1)$, for $\mathrm{span} \ge 1$.
- com (center of mass): $\alpha = 1/(1 + \mathrm{com})$, for $\mathrm{com} \ge 0$.
- halflife: the period of time for the exponential weight to reduce to one half; $\alpha = 1 - \exp(-\ln 2 / \mathrm{halflife})$, for $\mathrm{halflife} > 0$. If times is specified, halflife is the time unit (str or timedelta) over which an observation decays to half its value; this applies only to mean(), and the halflife value will not apply to the other functions.
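As a quick sanity check of the span parameterization (a small sketch on a toy series, not part of the bicycle data), span=12 and alpha=2/(12+1) give the same result:

# Verify that span=12 is equivalent to alpha = 2/(12+1)
s = pd.Series([10.0, 12.0, 13.0, 12.0, 15.0])
by_span  = s.ewm(span=12, adjust=False).mean()
by_alpha = s.ewm(alpha=2/(12+1), adjust=False).mean()
print(np.allclose(by_span, by_alpha))  # True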

# let us look at rolling averages & EWM together

new_df = df_precovid[['Total']]
new_df['Moving_Avg_6m'] = new_df['Total'].rolling(window=6).mean()
new_df['Moving_Avg_12m'] = new_df['Total'].rolling(window=12).mean()
new_df['EWMA12'] = new_df['Total'].ewm(span=12,adjust=False).mean() 

# Note that available EW functions include mean(), var(), std(), corr(), cov().

new_df.plot();

png

# new_df.to_excel('temp.xlsx')
2/(12+1) 
0.15384615384615385

In the graph above, the red line is the EWMA with span=12, or alpha=2/(12+1) ≈ 0.15.

EWMA has a single smoothing parameter, alpha (α), and does not account for seasonality or trend. It is only suitable for data with no clear trend or seasonal pattern.

Note that we haven’t talked about forecasting yet – that comes next. We have so far only ‘fitted’ the EWMA model to a given time series. Keep in mind that EWMA is just a weighted average, with more (or less, depending on alpha) weight given to recent observations.


Stationarity

What is Stationarity?
A stationary series has a constant mean and variance over time, which means there is no trend and no seasonality either. Stationarity is important for forecasting time series because if the mean and variance are changing with the passage of time, any estimates using a regression model will start to drift very quickly as we forecast into the future.

If a time series is not stationary, we need to ‘difference’ it with itself so it becomes stationary. Differencing means you subtract the previous observation from the current observation.

How do we know if a series is stationary?
- We can examine stationarity by visually inspecting the time series.
- Or, we can run a statistical test (The Augmented Dickey-Fuller test) to check for stationarity.

Fortunately, an ARIMA model takes care of most issues with non-stationarity for us, and we do not need to adjust for it ourselves. However, if we are using ARMA (which has no differencing step), we do need to ensure that our series is stationary.

Let us look at two real time series to get a sense of stationarity. We import some stock price data, and also look at stock price returns. We just pick the S&P500 index, though we could have picked any listed company.

# Let us get some data.  We download the daily time series for the S&P500 for 30 months
import yfinance as yf
SPY = yf.download('SPY', start = '2013-01-01', end = '2015-06-30')
[*********************100%%**********************]  1 of 1 completed
# Clean up
SPY.index = pd.DatetimeIndex(SPY.index) # Set index
SPY = SPY.asfreq('B') # This creates rows for any missing dates
SPY.fillna(method = 'bfill', inplace=True) # Fills missing dates with the next available observation (backfill)
SPY.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 649 entries, 2013-01-02 to 2015-06-29
Freq: B
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       649 non-null    float64
 1   High       649 non-null    float64
 2   Low        649 non-null    float64
 3   Close      649 non-null    float64
 4   Adj Close  649 non-null    float64
 5   Volume     649 non-null    float64
dtypes: float64(6)
memory usage: 35.5 KB

Example of stationary vs non-stationary time series
The top panel shows stock returns, which appear to have a mean close to zero. The bottom shows stock prices, which appear to have a trend.

SPY['Returns'] = (SPY['Close'].shift(1) / SPY['Close']) - 1
SPY[['Returns']].plot(figsize = (22,6));

png

SPY[['Close']].plot(figsize = (22,6));

png

Making a series stationary
If data is not stationary, ‘differencing’ can make it stationary. Differencing is subtracting the prior observation from the current one.

If the differenced time series is not stationary either, we can continue differencing till we get to a stationary time series.

The number of times we have to difference a time series to get to stationarity is the ‘order’ of differencing. This is the d parameter in ARIMA.

We can difference a series using the Pandas Series.diff() method; however, there are libraries available that will automatically determine an appropriate value for the parameter d.
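As a quick sketch using the monthly bicycle totals from earlier (new_df), first-order differencing looks like this:

# First-order differencing: subtract the previous observation from each one.
# diff() does exactly this; dropna() removes the resulting leading NaN.
total_diff = new_df['Total'].diff().dropna()
total_diff.plot(figsize=(18, 6), title='First difference of monthly crossings');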

Dickey Fuller Test for Stationarity

We can run the Augmented Dickey-Fuller test for stationarity: if the p-value is > 0.05, we conclude that the series is not stationary. Let us run this test against our stock price time series.

When we run this test in Python, we get a cryptic output in the form of a tuple. The help text for this function shows the complete explanation for how to interpret the results:

Returns
-------
adf : float
    The test statistic.
pvalue : float
    MacKinnon's approximate p-value based on MacKinnon (1994, 2010).
usedlag : int
    The number of lags used.
nobs : int
    The number of observations used for the ADF regression and calculation
    of the critical values.
critical values : dict
    Critical values for the test statistic at the 1 %, 5 %, and 10 %
    levels. Based on MacKinnon (2010).
icbest : float
    The maximized information criterion if autolag is not None.
resstore : ResultStore, optional
    A dummy class with results attached as attributes.

For us, the second value is the p-value that we are interested in. If this is > 0.05, we decide the series is not stationary.

# Test the stock price data

from statsmodels.tsa.stattools import adfuller
adfuller(SPY['Close'])
(-1.6928673813673563,
 0.4347911128784576,
 0,
 648,
 {'1%': -3.4404817800778034,
  '5%': -2.866010569916275,
  '10%': -2.569150763698369},
 2126.1002309138994)
# Test the stock returns data  

adfuller(SPY['Returns'].dropna())
(-26.546757517762995,
 0.0,
 0,
 647,
 {'1%': -3.4404975024933813,
  '5%': -2.8660174956716795,
  '10%': -2.569154453750397},
 -4424.286299515888)
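To avoid reading the raw tuple each time, a small wrapper (our own helper, not part of statsmodels) can label the pieces we care about:

# A small convenience wrapper around adfuller() that labels the test
# statistic and p-value and states the conclusion
def adf_report(series, alpha=0.05):
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(series.dropna())
    verdict = 'stationary' if pvalue < alpha else 'not stationary'
    print(f'ADF statistic = {stat:.3f}, p-value = {pvalue:.4f} -> {verdict}')

adf_report(SPY['Close'])     # prices: not stationary
adf_report(SPY['Returns'])   # returns: stationary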

Auto-Correlation and Partial Auto-Correlation (ACF and PACF plots)

Autocorrelation in a time series is the correlation of an observation to the observations that precede it. Autocorrelation is the basis for being able to use auto regression to forecast a time series.

To calculate autocorrelation for a series, we shift the series by one step, and calculate the correlation between the two. We keep increasing the number of steps to see correlations with past periods.
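For example (a small sketch using the monthly totals from earlier), the lag-1 autocorrelation is just the correlation between the series and a one-step-shifted copy of itself:

# Lag-1 autocorrelation computed by hand: correlate the series with a
# shifted copy of itself. statsmodels' acf() uses a slightly different
# normalization, so its value will be close but not identical.
lag1 = new_df['Total'].corr(new_df['Total'].shift(1))
print(lag1)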

Fortunately, libraries exist that allow us to do these tedious calculations and present a tidy graph.

Next, we will look at Autocorrelation plots for both our stock price series, and also the total number of bicycle crossings in Seattle.

# Autocorrelation and partial autocorrelation plots for stock prices  

plt.rc("figure", figsize=(18,4))
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plot_acf(SPY['Close']);
plot_pacf(SPY['Close']);

png

png

The shaded area represents the 95% confidence interval.

Using ACF and PACF plots for ARIMA:
- The PACF plot can be used to identify the value of p, the AR order.
- The ACF plot can be used to identify the value of q, the MA order.
- The interpretation of ACF and PACF plots to determine values of p & q for ARIMA can be complex.

Below, we see the ACF and PACF plots for the bicycle crossings. Their seasonality is quite visible.

# Autocorrelation and partial autocorrelation plots for monthly bicycle crossings

plot_acf(new_df.Total);
plot_pacf(new_df.Total);

png

png

# Get raw values for auto-correlations

from statsmodels.tsa.stattools import acf
acf(new_df.Total)
array([ 1.        ,  0.77571582,  0.52756191,  0.09556274, -0.30509232,
       -0.60545323, -0.73226582, -0.64087121, -0.37044611, -0.01811217,
        0.35763185,  0.59437737,  0.7515538 ,  0.61992343,  0.40344956,
        0.06119329, -0.28745883, -0.53888819, -0.65309667, -0.57442506])
# Slightly nicer output making it easy to read lag and correlation
[(n,x ) for n, x in enumerate(acf(new_df.Total))]
[(0, 1.0),
 (1, 0.7757158156417427),
 (2, 0.5275619080263295),
 (3, 0.09556274009387859),
 (4, -0.30509232288765453),
 (5, -0.6054532313442101),
 (6, -0.732265815426328),
 (7, -0.6408712113652556),
 (8, -0.3704461111442763),
 (9, -0.018112170681472205),
 (10, 0.35763184544832965),
 (11, 0.5943773727598759),
 (12, 0.7515538007556243),
 (13, 0.6199234263452077),
 (14, 0.4034495609681654),
 (15, 0.061193291548871764),
 (16, -0.28745882651811744),
 (17, -0.5388881889155354),
 (18, -0.6530966725971313),
 (19, -0.5744250570228712)]


Granger Causality Tests

The Granger Causality tests are used to check whether two time series are related to each other; specifically, given two time series, whether the series in the second column can be used to predict the series in the first column.

The ‘maxlag’ parameter needs to be specified, and the code will report the p-values at each lag up to the maxlag value. If the p-value < 0.05 at any lag, the second series may be a valid predictor (or causal factor) at that lag.


Example
This test is quite easy for us to run using statsmodels. As an example, we apply it to two separate time series, one showing the average daily temperature, and the other showing the average daily household power consumption. This data was adapted from a Kaggle dataset to create this illustration.

from statsmodels.tsa.stattools import grangercausalitytests
# Data adapted from: 
# https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries

df_elec = pd.read_csv('pwr_usage.csv')
df_elec.index = pd.DatetimeIndex(df_elec.Date, freq='W-SUN')
df_elec.drop(['Date'], axis = 1, inplace = True)
df_elec
Temp_avg kwh
Date
2017-01-08 75.542857 106.549
2017-01-15 71.014286 129.096
2017-01-22 64.414286 68.770
2017-01-29 56.728571 71.378
2017-02-05 66.128571 107.829
... ... ...
2019-12-08 74.371429 167.481
2019-12-15 61.242857 86.248
2019-12-22 50.000000 73.206
2019-12-29 60.128571 35.655
2020-01-05 7.185714 4.947

157 rows × 2 columns

# Check if Average Temperature can be used to predict kwh

grangercausalitytests(df_elec[["kwh", "Temp_avg"]], maxlag = 6);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.1022  , p=0.7496  , df_denom=153, df_num=1
ssr based chi2 test:   chi2=0.1042  , p=0.7468  , df=1
likelihood ratio test: chi2=0.1042  , p=0.7469  , df=1
parameter F test:         F=0.1022  , p=0.7496  , df_denom=153, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.9543  , p=0.3874  , df_denom=150, df_num=2
ssr based chi2 test:   chi2=1.9722  , p=0.3730  , df=2
likelihood ratio test: chi2=1.9597  , p=0.3754  , df=2
parameter F test:         F=0.9543  , p=0.3874  , df_denom=150, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=1.6013  , p=0.1916  , df_denom=147, df_num=3
ssr based chi2 test:   chi2=5.0327  , p=0.1694  , df=3
likelihood ratio test: chi2=4.9523  , p=0.1753  , df=3
parameter F test:         F=1.6013  , p=0.1916  , df_denom=147, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.5901  , p=0.1801  , df_denom=144, df_num=4
ssr based chi2 test:   chi2=6.7579  , p=0.1492  , df=4
likelihood ratio test: chi2=6.6129  , p=0.1578  , df=4
parameter F test:         F=1.5901  , p=0.1801  , df_denom=144, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=1.2414  , p=0.2930  , df_denom=141, df_num=5
ssr based chi2 test:   chi2=6.6913  , p=0.2446  , df=5
likelihood ratio test: chi2=6.5482  , p=0.2565  , df=5
parameter F test:         F=1.2414  , p=0.2930  , df_denom=141, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=1.0930  , p=0.3697  , df_denom=138, df_num=6
ssr based chi2 test:   chi2=7.1759  , p=0.3049  , df=6
likelihood ratio test: chi2=7.0106  , p=0.3199  , df=6
parameter F test:         F=1.0930  , p=0.3697  , df_denom=138, df_num=6
# Check if kwh can be used to predict Average Temperature 
# While we get p<0.05 at lag 3, the result is obviously absurd
grangercausalitytests(df_elec[["Temp_avg", "kwh"]], maxlag = 6);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=3.1953  , p=0.0758  , df_denom=153, df_num=1
ssr based chi2 test:   chi2=3.2580  , p=0.0711  , df=1
likelihood ratio test: chi2=3.2244  , p=0.0725  , df=1
parameter F test:         F=3.1953  , p=0.0758  , df_denom=153, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=2.8694  , p=0.0599  , df_denom=150, df_num=2
ssr based chi2 test:   chi2=5.9301  , p=0.0516  , df=2
likelihood ratio test: chi2=5.8194  , p=0.0545  , df=2
parameter F test:         F=2.8694  , p=0.0599  , df_denom=150, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=3.0044  , p=0.0324  , df_denom=147, df_num=3
ssr based chi2 test:   chi2=9.4423  , p=0.0240  , df=3
likelihood ratio test: chi2=9.1642  , p=0.0272  , df=3
parameter F test:         F=3.0044  , p=0.0324  , df_denom=147, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.3019  , p=0.2722  , df_denom=144, df_num=4
ssr based chi2 test:   chi2=5.5329  , p=0.2369  , df=4
likelihood ratio test: chi2=5.4352  , p=0.2455  , df=4
parameter F test:         F=1.3019  , p=0.2722  , df_denom=144, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=0.8068  , p=0.5466  , df_denom=141, df_num=5
ssr based chi2 test:   chi2=4.3488  , p=0.5004  , df=5
likelihood ratio test: chi2=4.2877  , p=0.5088  , df=5
parameter F test:         F=0.8068  , p=0.5466  , df_denom=141, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=0.5381  , p=0.7785  , df_denom=138, df_num=6
ssr based chi2 test:   chi2=3.5328  , p=0.7396  , df=6
likelihood ratio test: chi2=3.4921  , p=0.7450  , df=6
parameter F test:         F=0.5381  , p=0.7785  , df_denom=138, df_num=6

In short, we don't find any causality above, though our intuition would have told us that something should exist. Perhaps there are variables other than temperature that impact power consumption that we have not thought of.

That is the power of data - commonly held conceptions can be challenged.


Forecasting with Simple Exponential Smoothing, Holt and Holt-Winters Method

(a) Simple Exponential Smoothing
In simple exponential smoothing, we reduce the time series to a single smoothed ‘level’, which serves as the forecast.

Simple exponential smoothing is suitable for forecasting data that has no clear trend or seasonal component. Obviously, this sort of data will have no pattern, so how do we forecast it? Consider two extreme approaches:
- Every future value will be equal to the average of all prior values,
- Every future value is the same as the last one.

The difference between the two extreme situations above is that in the first one, we weigh all past observations as equally important, and in the second, we give all the weight to the last observation and none to the ones prior to that.

The Simple Exponential Smoothing method takes an approach in between - it gives the most weight to the last observation, and gradually reduces the weight as we go further back in the past. It does so using a single parameter called alpha.

(Source: https://otexts.com/fpp2/ses.html)
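In equation form, the one-step-ahead forecast is a weighted average of all past observations, with weights that decay at the rate $(1-\alpha)$:

$$\hat{y}_{t+1|t} = \alpha y_t + \alpha(1-\alpha)\, y_{t-1} + \alpha(1-\alpha)^2\, y_{t-2} + \cdots, \qquad 0 \le \alpha \le 1$$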

(b) Holt's Method - Double Exponential Smoothing
Holt extended the simple exponential smoothing described above to account for a trend. This is captured in a parameter called beta (β).

This method involves calculating a ‘level’, with the smoothing parameter α, as well as the trend using a smoothing parameter β. These parameters are used in a way similar to what we saw with EWMA.

Because we are using two parameters, it is called ‘double exponential smoothing’.

We can specify additive or multiplicative trends. Additive trends are preferred when the level of change over time is constant. Multiplicative trends make sense when the trend varies proportional to the current values of the series.
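For reference, Holt's method with an additive trend maintains a level $\ell_t$ and a trend $b_t$, and forecasts by projecting the trend forward:

$$
\begin{aligned}
\ell_t &= \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1-\beta)\, b_{t-1} \\
\hat{y}_{t+h|t} &= \ell_t + h\, b_t
\end{aligned}
$$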

(c) Holt-Winters' Method - Triple Exponential Smoothing
The Holt-Winters’ seasonal method accounts for the level, as well as the trend and seasonality, with corresponding smoothing parameters α, β and γ respectively:
- Level: α
- Trend: β
- Seasonality: γ

We also specify the frequency of the seasonality, i.e., the number of periods that comprise a season. For example, for quarterly data the frequency would be 4, and for monthly data, it would be 12.

Like for trend, we can specify whether the seasonality is additive or multiplicative.

The equations for Triple Exponential Smoothing (in the additive form) look as follows, where $\ell_t$ is the level, $b_t$ the trend, $s_t$ the seasonal component, $m$ the seasonal frequency, and $k$ the integer part of $(h-1)/m$:

$$
\begin{aligned}
\ell_t &= \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1-\beta)\, b_{t-1} \\
s_t &= \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\, s_{t-m} \\
\hat{y}_{t+h|t} &= \ell_t + h\, b_t + s_{t+h-m(k+1)}
\end{aligned}
$$

We will not cover these equations in detail as the code does everything for us. As practitioners, we need to think about the problems we can solve with this, while being aware of the underlying logic. The code will estimate the values of α, β and γ and use them in the equations above to make predictions.

# Let us look at the index of our data frame that has the bicycle crossing data

new_df.index
DatetimeIndex(['2012-10-31', '2012-11-30', '2012-12-31', '2013-01-31',
               '2013-02-28', '2013-03-31', '2013-04-30', '2013-05-31',
               '2013-06-30', '2013-07-31', '2013-08-31', '2013-09-30',
               '2013-10-31', '2013-11-30', '2013-12-31', '2014-01-31',
               '2014-02-28', '2014-03-31', '2014-04-30', '2014-05-31',
               '2014-06-30', '2014-07-31', '2014-08-31', '2014-09-30',
               '2014-10-31', '2014-11-30', '2014-12-31', '2015-01-31',
               '2015-02-28', '2015-03-31', '2015-04-30', '2015-05-31',
               '2015-06-30', '2015-07-31', '2015-08-31', '2015-09-30',
               '2015-10-31', '2015-11-30', '2015-12-31', '2016-01-31',
               '2016-02-29', '2016-03-31', '2016-04-30', '2016-05-31',
               '2016-06-30', '2016-07-31', '2016-08-31', '2016-09-30',
               '2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31',
               '2017-02-28', '2017-03-31', '2017-04-30', '2017-05-31',
               '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30',
               '2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31',
               '2018-02-28', '2018-03-31', '2018-04-30', '2018-05-31',
               '2018-06-30', '2018-07-31', '2018-08-31', '2018-09-30',
               '2018-10-31', '2018-11-30', '2018-12-31', '2019-01-31',
               '2019-02-28', '2019-03-31', '2019-04-30', '2019-05-31',
               '2019-06-30', '2019-07-31', '2019-08-31', '2019-09-30',
               '2019-10-31', '2019-11-30'],
              dtype='datetime64[ns]', name='Date', freq='M')
# Clean up the data frame, set index frequency explicitly to Monthly
# This is needed as Holt-Winters will not work otherwise

new_df.index.freq = 'M'
# Let us drop any NaN entries, just in case

new_df.dropna(inplace=True)
new_df
Total Moving_Avg_6m Moving_Avg_12m EWMA12
Date
2013-09-30 80729.0 97184.000000 74734.583333 82750.448591
2013-10-31 81352.0 98743.000000 76039.333333 82535.302654
2013-11-30 59270.0 90525.666667 76757.916667 78956.025322
2013-12-31 43553.0 81237.833333 77356.583333 73509.406042
2014-01-31 59873.0 71554.333333 78605.666667 71411.497420
... ... ... ... ...
2019-07-31 137714.0 101472.833333 91343.750000 100389.020732
2019-08-31 142414.0 119192.000000 93894.166667 106854.402158
2019-09-30 112174.0 123644.833333 95221.833333 107672.801826
2019-10-31 104498.0 126405.833333 96348.166667 107184.370776
2019-11-30 84963.0 119045.833333 97725.833333 103765.698349

75 rows × 4 columns

# Set warnings to ignore so we don't get the ugly orange boxes

import warnings
warnings.filterwarnings('ignore')
# Some library imports 

from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error as mse

Simple Exponential Smoothing (same as EWMA)

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Fit model using Simple Exponential Smoothing
model = SimpleExpSmoothing(train_set['Total']).fit()
predictions = model.forecast(15)
# let us plot the predictions and the training values
train_set['Total'].plot(legend=True,label='Training Set')
predictions.plot(legend=True,label='Model prediction');

png

# Now we plot test (observed) values as well
train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(16,10))
predictions.plot(legend=True,label='Model prediction');

png

# Test vs observed - a closeup of the predictions
# train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(14,10))
predictions.plot(legend=True,label='Model prediction');

png

model.params
{'smoothing_level': 0.995,
 'smoothing_trend': nan,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 80729.0,
 'initial_trend': nan,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 111889.675227 -15647.675227
2018-10-31 90982.0 111889.675227 -20907.675227
2018-11-30 68431.0 111889.675227 -43458.675227
2018-12-31 46941.0 111889.675227 -64948.675227
2019-01-31 72883.0 111889.675227 -39006.675227
2019-02-28 36099.0 111889.675227 -75790.675227
2019-03-31 85457.0 111889.675227 -26432.675227
2019-04-30 87932.0 111889.675227 -23957.675227
2019-05-31 129123.0 111889.675227 17233.324773
2019-06-30 132512.0 111889.675227 20622.324773
2019-07-31 137714.0 111889.675227 25824.324773
2019-08-31 142414.0 111889.675227 30524.324773
2019-09-30 112174.0 111889.675227 284.324773
2019-10-31 104498.0 111889.675227 -7391.675227
2019-11-30 84963.0 111889.675227 -26926.675227
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  1228535508.79919
RMSE =  35050.47087842316
MAE =  29263.82507577501

Double Exponential Smoothing

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Fit model using Double Exponential Smoothing (Holt's method)
# For comparison, Simple Exponential Smoothing was:
# model = SimpleExpSmoothing(train_set['Total']).fit()
model = ExponentialSmoothing(train_set['Total'], trend='mul').fit()
predictions = model.forecast(15)
# let us plot the predictions and the training values
train_set['Total'].plot(legend=True,label='Training Set')
predictions.plot(legend=True,label='Model prediction');

png

# Now we plot test (observed) values as well
train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(16,10))
predictions.plot(legend=True,label='Model prediction');

png

# Test vs observed - a closeup of the predictions
# train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(14,10))
predictions.plot(legend=True,label='Model prediction');

png

model.params
{'smoothing_level': 0.995,
 'smoothing_trend': 0.04738095238095238,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 51251.999999999985,
 'initial_trend': 1.084850613368525,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 117626.951449 -21384.951449
2018-10-31 90982.0 123616.042220 -32634.042220
2018-11-30 68431.0 129910.073380 -61479.073380
2018-12-31 46941.0 136524.571266 -89583.571266
2019-01-31 72883.0 143475.852752 -70592.852752
2019-02-28 36099.0 150781.065503 -114682.065503
2019-03-31 85457.0 158458.230275 -73001.230275
2019-04-30 87932.0 166526.285367 -78594.285367
2019-05-31 129123.0 175005.133341 -45882.133341
2019-06-30 132512.0 183915.690115 -51403.690115
2019-07-31 137714.0 193279.936564 -55565.936564
2019-08-31 142414.0 203120.972739 -60706.972739
2019-09-30 112174.0 213463.074854 -101289.074854
2019-10-31 104498.0 224331.755168 -119833.755168
2019-11-30 84963.0 235753.824924 -150790.824924
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  6789777046.197141
RMSE =  82400.10343559734
MAE =  75161.63066121914

Triple Exponential Smoothing

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Let us use the triple exponential smoothing model
model = ExponentialSmoothing(train_set['Total'],trend='add', \
                             seasonal='add',seasonal_periods=12).fit()
predictions = model.forecast(15)
# let us plot the predictions and the training values
train_set['Total'].plot(legend=True,label='Training Set')
predictions.plot(legend=True,label='Model prediction');

png

# Now we plot test (observed) values as well
train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(16,10))
predictions.plot(legend=True,label='Model prediction');

png

# Test vs observed - closeup of the predictions
# train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(14,10))
predictions.plot(legend=True,label='Model prediction');

png

model.params
{'smoothing_level': 0.2525,
 'smoothing_trend': 0.0001,
 'smoothing_seasonal': 0.0001,
 'damping_trend': nan,
 'initial_level': 82880.52777777775,
 'initial_trend': 233.32297979798386,
 'initial_seasons': array([ 12706.58333333,  -1150.10416667, -23387.98958333, -38062.89583333,
        -27297.51041667, -29627.21875   , -15820.59375   ,   1298.15625   ,
         30509.6875    ,  28095.90625   ,  32583.70833333,  30152.27083333]),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 100914.043265 -4672.043265
2018-10-31 90982.0 87291.600234 3690.399766
2018-11-30 68431.0 65286.060329 3144.939671
2018-12-31 46941.0 50843.440823 -3902.440823
2019-01-31 72883.0 61841.794519 11041.205481
2019-02-28 36099.0 59743.301291 -23644.301291
2019-03-31 85457.0 73783.743291 11673.256709
2019-04-30 87932.0 91133.340351 -3201.340351
2019-05-31 129123.0 120579.559129 8543.440871
2019-06-30 132512.0 118396.387406 14115.612594
2019-07-31 137714.0 123117.725595 14596.274405
2019-08-31 142414.0 120917.981585 21496.018415
2019-09-30 112174.0 103703.284649 8470.715351
2019-10-31 104498.0 90080.841618 14417.158382
2019-11-30 84963.0 68075.301713 16887.698287
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  160014279.71049073
RMSE =  12649.675083198412
MAE =  10899.78971069559
# Finally, fit Holt's method (double exponential smoothing) on the full dataset
model = ExponentialSmoothing(new_df['Total'], trend='mul').fit()
# Fit values
pd.DataFrame({'fitted':model.fittedvalues.shift(-1), 'actual':new_df['Total']})
fitted actual
Date
2013-09-30 88374.142596 80729.0
2013-10-31 89066.318586 81352.0
2013-11-30 64512.628727 59270.0
2013-12-31 47037.321040 43553.0
2014-01-31 64853.079989 59873.0
... ... ...
2019-07-31 147204.031288 137714.0
2019-08-31 152114.454701 142414.0
2019-09-30 119265.035267 112174.0
2019-10-31 110660.795745 104498.0
2019-11-30 NaN 84963.0

75 rows × 2 columns

# Examine model parameters
model.params
{'smoothing_level': 0.995,
 'smoothing_trend': 0.02369047619047619,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 51251.999999999985,
 'initial_trend': 1.084850613368525,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# RMSE calculation
x = pd.DataFrame({'fitted':model.fittedvalues.shift(-1), 'actual':new_df['Total']}).dropna()
rmse = np.sqrt(mse(x.fitted, x.actual))
rmse
6017.795365317929
# predict the next 15 values
predictions = model.forecast(15)
predictions
2019-12-31     89553.250900
2020-01-31     94248.964768
2020-02-29     99190.897823
2020-03-31    104391.960540
2020-04-30    109865.740351
2020-05-31    115626.537143
2020-06-30    121689.400618
2020-07-31    128070.169604
2020-08-31    134785.513440
2020-09-30    141852.975517
2020-10-31    149291.019112
2020-11-30    157119.075623
2020-12-31    165357.595328
2021-01-31    174028.100817
2021-02-28    183153.243211
Freq: M, dtype: float64
model.fittedvalues
Date
2013-09-30     55600.763636
2013-10-31     88374.142596
2013-11-30     89066.318586
2013-12-31     64512.628727
2014-01-31     47037.321040
                  ...      
2019-07-31    141747.388757
2019-08-31    147204.031288
2019-09-30    152114.454701
2019-10-31    119265.035267
2019-11-30    110660.795745
Freq: M, Length: 75, dtype: float64

ARIMA - Auto Regressive Integrated Moving Average

ARIMA stands for Auto Regressive Integrated Moving Average. It is a general method for understanding and predicting time series data.

ARIMA models come in several different flavors, for example:
- Non-seasonal ARIMA
- Seasonal ARIMA (called SARIMA)
- SARIMA with external variables, called SARIMAX

Which one should we use? ARIMA or Holt-Winters?

Try both. Whatever works better for your use case is the one to use.

ARIMA models have three non-negative integer parameters – p, d and q.
- p represents the Auto-Regression component, AR. This is the part of the model that leverages the linear regression between an observation and past observations.
- d represents differencing, the I component. This is the number of times the series has to be differenced to make it stationary.
- q represents the MA component, the number of lagged forecast errors in the prediction. This considers the relationship between an observation and the residual error from a moving average model.

A correct choice of the ‘order’ of your ARIMA model, i.e., deciding the values of p, d and q, is essential to building a good ARIMA model.

Deciding the values of p, d and q
- Values of p and q can be determined manually by examining auto-correlation and partial-autocorrelation plots.
- The value of d can be determined by repeatedly differencing a series till we get to a stationary series.

The manual methods are time consuming, and less precise. An overview of these is provided in the Appendix to this slide deck.
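As a rough sketch of the automated approach to d (our own helper built on the ADF test from earlier, not part of any library), we can difference repeatedly until the series tests as stationary:

# A rough helper: difference until the ADF test rejects non-stationarity,
# and report how many differences (d) it took
from statsmodels.tsa.stattools import adfuller

def estimate_d(series, max_d=3, alpha=0.05):
    s = series.dropna()
    for d in range(max_d + 1):
        if adfuller(s)[1] < alpha:
            return d
        s = s.diff().dropna()
    return max_d

print('Suggested d for the monthly crossings:', estimate_d(new_df['Total']))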

In reality, we let the computer do a grid search (a brute force test of a set of permutations for p, d and q) to determine the order of our ARIMA model. The pmdarima library (formerly Pyramid ARIMA) in Python allows searching through multiple combinations of p, d and q to identify the best model.

ARIMA in Action
1. Split dataset into train and test.
2. Test set should be the last n entries. Can’t use random selection for train-test split.
3. Pyramid ARIMA is a Python package that can identify the values of p, d and q to use. Use Auto ARIMA from Pyramid ARIMA to find good values of p, d and q based on the training data set.
4. Fit a model on the training data set.
5. Predict the test set, and evaluate using MSE, MAE or RMSE.

# Library imports

from pmdarima import auto_arima
# Train-test split

train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Clean up

train_set.dropna(inplace=True)

Use Auto ARIMA to find the order

# Build a model using auto_arima

model = auto_arima(train_set['Total'],seasonal=False)
order = model.get_params()['order']
seasonal_order = model.get_params()['seasonal_order']
print('Order = ', order)
print('Seasonal Order = ', seasonal_order)
Order =  (5, 0, 1)
Seasonal Order =  (0, 0, 0, 0)
# Create and fit model

from statsmodels.tsa.arima.model import ARIMA

model_ARIMA = ARIMA(train_set['Total'], order = order)
model_ARIMA = model_ARIMA.fit()
# Predict with ARIMA 

start=len(train_set)
end=len(train_set)+len(test_set)-1
ARIMApredictions = model_ARIMA.predict(start=start, end=end, dynamic=False, typ='levels').rename('ARIMA Predictions')
# model_ARIMA.summary()
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = ARIMApredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 97178.782841 -936.782841
2018-10-31 90982.0 68163.166776 22818.833224
2018-11-30 68431.0 56095.942557 12335.057443
2018-12-31 46941.0 41005.329654 5935.670346
2019-01-31 72883.0 42591.251732 30291.748268
2019-02-28 36099.0 51555.855575 -15456.855575
2019-03-31 85457.0 71734.836673 13722.163327
2019-04-30 87932.0 91198.670052 -3266.670052
2019-05-31 129123.0 110818.433668 18304.566332
2019-06-30 132512.0 121161.132303 11350.867697
2019-07-31 137714.0 122198.078813 15515.921187
2019-08-31 142414.0 111847.671126 30566.328874
2019-09-30 112174.0 94763.626480 17410.373520
2019-10-31 104498.0 73934.456312 30563.543688
2019-11-30 84963.0 56060.536900 28902.463100
# Calculate evaluation metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  385065542.1818603
RMSE =  19623.086968717747
MAE =  17158.523031611756
# Plot results
train_set['Total'].rename('Training Set').plot(legend=True)
test_set['Total'].rename('Test Set').plot(legend=True)
ARIMApredictions.plot(legend=True)
plt.show()

png


Seasonal ARIMA - SARIMA

SARIMA, or Seasonal ARIMA, accounts for seasonality. In order to account for seasonality, we need three more parameters – P, D and Q – to take care of seasonal variations.

Auto ARIMA takes care of seasonality as well, and provides us the values for P, D and Q just like it does for p, d and q.

To use SARIMA, we now need 7 parameters for our function:
- p
- d
- q
- P
- D
- Q
- m (frequency of our seasons, eg, 12)

SARIMA in Action
1. Split dataset into train and test.
2. Test set should be the last n entries. Can’t use random selection for train-test split.
3. Pyramid ARIMA is a Python package that can identify the values of p, d, q, P, D and Q to use. Use Auto ARIMA from Pyramid ARIMA to find good values of p, d and q based on the training data set.
4. Fit a model on the training data set.
5. Predict the test set, and evaluate using MSE, MAE or RMSE.

# Create a model with auto_arima

model = auto_arima(train_set['Total'],seasonal=True,m=12)
# Get values of p, d, q, P, D and Q

order = model.get_params()['order']
seasonal_order = model.get_params()['seasonal_order']

print('Order = ', order)
print('Seasonal Order = ', seasonal_order)
Order =  (1, 0, 0)
Seasonal Order =  (0, 1, 1, 12)
# Create and fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model_SARIMA = SARIMAX(train_set['Total'], order=order, seasonal_order=seasonal_order)
model_SARIMA = model_SARIMA.fit()
model_SARIMA.params
ar.L1       2.438118e-01
ma.S.L12   -6.668071e-02
sigma2      7.885998e+07
dtype: float64
# Create SARIMA predictions  

start=len(train_set)
end=len(train_set)+len(test_set)-1
SARIMApredictions = model_SARIMA.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA Predictions')
# Calculate Evaluation Metrics

y_test = test_set['Total']
y_pred = SARIMApredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 94423.659176 1818.340824
2018-10-31 90982.0 86518.880961 4463.119039
2018-11-30 68431.0 57965.336973 10465.663027
2018-12-31 46941.0 45396.267772 1544.732228
2019-01-31 72883.0 58009.524643 14873.475357
2019-02-28 36099.0 50177.757219 -14078.757219
2019-03-31 85457.0 76096.865509 9360.134491
2019-04-30 87932.0 79286.971453 8645.028547
2019-05-31 129123.0 128451.795674 671.204326
2019-06-30 132512.0 112789.442695 19722.557305
2019-07-31 137714.0 127353.588372 10360.411628
2019-08-31 142414.0 112330.356235 30083.643765
2019-09-30 112174.0 94550.771990 17623.228010
2019-10-31 104498.0 86549.872568 17948.127432
2019-11-30 84963.0 57972.893093 26990.106907
# Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  231993016.88445473
RMSE =  15231.316978004716
MAE =  12576.56867360885
# Plot results

train_set['Total'].rename('Training Set').plot(legend=True)
test_set['Total'].rename('Test Set').plot(legend=True)
SARIMApredictions.plot(legend = True)
ARIMApredictions.plot(legend=True)
plt.show()

png

# Models compared to each other - calculate MAE, RMSE

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
print('SARIMA:')
print('  RMSE = ' ,mse(SARIMApredictions, test_set['Total'], squared = False))
print('  MAE = ', mae(SARIMApredictions, test_set['Total']))
print('\nARIMA:')
print('  RMSE = ' ,mse(ARIMApredictions, test_set['Total'], squared = False))
print('  MAE = ', mae(ARIMApredictions, test_set['Total']))
print('\n')
print('  Mean of the data = ', new_df.Total.mean())
print('  St Dev of the data = ', new_df.Total.std())
SARIMA:
  RMSE =  15231.316978004716
  MAE =  12576.56867360885

ARIMA:
  RMSE =  19623.086968717747
  MAE =  17158.523031611756


  Mean of the data =  85078.8
  St Dev of the data =  28012.270090473237

SARIMAX

SARIMAX = Seasonal ARIMA with eXogenous variable

SARIMAX is the same as SARIMA, except that it includes one or more predictor (exogenous) variables alongside the time series itself.

Let us load some data showing weekly power consumption along with the average temperature for each week. We will model kwh as a time series and use Temp_avg as an exogenous variable.

To forecast the future, you need not only the past values of the series but also values of the exogenous variable over the forecast horizon. This can be a problem in practice: if the exogenous variable is not known (or cannot itself be forecast) in advance, you may build a model that evaluates well on a test set yet cannot actually be used for forecasting.
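
To make this concrete, here is a minimal sketch on synthetic data (not the dataset we load below): forecasting beyond the observed period requires you to supply future values of the exogenous variable yourself, for example from a weather forecast.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Two years of synthetic weekly data where temperature drives consumption
idx = pd.date_range('2017-01-08', periods=104, freq='W-SUN')
temp = pd.Series(70 + 10 * np.sin(np.arange(104) * 2 * np.pi / 52), index=idx)
kwh = pd.Series(2.0 * temp + np.random.normal(0, 5, size=104), index=idx)

fit = SARIMAX(kwh, exog=temp, order=(1, 0, 0)).fit(disp=False)

# To forecast 4 future weeks, we must assume (or separately forecast) the temperatures
future_temp = pd.Series([72.0, 68.5, 65.0, 61.0],
                        index=pd.date_range(idx[-1] + pd.Timedelta(weeks=1),
                                            periods=4, freq='W-SUN'))
print(fit.forecast(steps=4, exog=future_temp))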

# Data adapted from: 
# https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries  
# The data shows weekly electricity usage and the average temperature of the week.  
# Our hypothesis is that the power consumed can be predicted using the average 
# temperature, and the pattern found in the time series.

df_elec = pd.read_csv('pwr_usage.csv')
df_elec
Date Temp_avg kwh
0 1/8/2017 75.542857 106.549
1 1/15/2017 71.014286 129.096
2 1/22/2017 64.414286 68.770
3 1/29/2017 56.728571 71.378
4 2/5/2017 66.128571 107.829
... ... ... ...
152 12/8/2019 74.371429 167.481
153 12/15/2019 61.242857 86.248
154 12/22/2019 50.000000 73.206
155 12/29/2019 60.128571 35.655
156 1/5/2020 7.185714 4.947

157 rows × 3 columns

Data exploration

df_elec.index = pd.DatetimeIndex(df_elec.Date, freq='W-SUN')
df_elec.drop(['Date'], axis = 1, inplace = True)
df_elec
Temp_avg kwh
Date
2017-01-08 75.542857 106.549
2017-01-15 71.014286 129.096
2017-01-22 64.414286 68.770
2017-01-29 56.728571 71.378
2017-02-05 66.128571 107.829
... ... ...
2019-12-08 74.371429 167.481
2019-12-15 61.242857 86.248
2019-12-22 50.000000 73.206
2019-12-29 60.128571 35.655
2020-01-05 7.185714 4.947

157 rows × 2 columns

df_elec.index
DatetimeIndex(['2017-01-08', '2017-01-15', '2017-01-22', '2017-01-29',
               '2017-02-05', '2017-02-12', '2017-02-19', '2017-02-26',
               '2017-03-05', '2017-03-12',
               ...
               '2019-11-03', '2019-11-10', '2019-11-17', '2019-11-24',
               '2019-12-01', '2019-12-08', '2019-12-15', '2019-12-22',
               '2019-12-29', '2020-01-05'],
              dtype='datetime64[ns]', name='Date', length=157, freq='W-SUN')
df_elec['kwh'][:60].plot()
<Axes: xlabel='Date'>

png


plt.rc("figure", figsize=(18,4))
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plot_acf(df_elec['kwh']);
plot_pacf(df_elec['kwh']);

png

png

result = seasonal_decompose(df_elec['kwh'], model = 'additive') 
plt.rcParams['figure.figsize'] = (20, 9)
result.plot();

png

# Plot the months to see trends over months
month_plot(df_elec[['kwh']].resample(rule='M').kwh.sum());

png

# Plot the quarter to see trends over quarters
quarter_plot(df_elec[['kwh']].resample(rule='Q').kwh.sum());

png

Train-test split

# Train-test split
test_samples = 12

train_set = df_elec.iloc[:-test_samples]
test_set = df_elec.iloc[-test_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  145
Test set:  12
train_set
Temp_avg kwh
Date
2017-01-08 75.542857 106.5490
2017-01-15 71.014286 129.0960
2017-01-22 64.414286 68.7700
2017-01-29 56.728571 71.3780
2017-02-05 66.128571 107.8290
... ... ...
2019-09-15 76.800000 168.3470
2019-09-22 79.385714 157.7260
2019-09-29 80.928571 191.8260
2019-10-06 70.771429 137.1698
2019-10-13 74.285714 147.5200

145 rows × 2 columns

test_set
Temp_avg kwh
Date
2019-10-20 73.471429 137.847
2019-10-27 64.557143 84.999
2019-11-03 62.928571 76.744
2019-11-10 77.985714 175.708
2019-11-17 49.042857 77.927
2019-11-24 62.785714 62.805
2019-12-01 69.557143 74.079
2019-12-08 74.371429 167.481
2019-12-15 61.242857 86.248
2019-12-22 50.000000 73.206
2019-12-29 60.128571 35.655
2020-01-05 7.185714 4.947

Determine parameters for SARIMAX using Auto-ARIMA

Uncomment and run the cell below to perform the auto_arima search yourself (it is very time consuming with m=52); the resulting order and seasonal order are hard-coded in the cell that follows.

model = auto_arima(train_set['kwh'],seasonal=True,m=52)  
order = model.get_params()['order']  
seasonal_order = model.get_params()['seasonal_order']  

print('Order = ', order)  
print('Seasonal Order = ', seasonal_order)  

# Set the order, i.e., the values of p, d, q and P, D, Q, m.

order =  (1, 0, 1)
seasonal_order =  (1, 0, 1, 52)

First, let us try SARIMA, ignoring temperature

# Create and fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model_SARIMA = SARIMAX(train_set['kwh'],order=order,seasonal_order=seasonal_order)
model_SARIMA = model_SARIMA.fit()
# model_SARIMA.summary()
model_SARIMA.params
ar.L1         0.983883
ma.L1        -0.703757
ar.S.L52      0.994796
ma.S.L52     -0.842830
sigma2      781.564879
dtype: float64
# Create SARIMA predictions

start=len(train_set)
end=len(train_set)+len(test_set)-1
SARIMApredictions = model_SARIMA.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA Predictions')
# Calculate Evaluation Metrics

y_test = test_set['kwh']
y_pred = SARIMApredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2019-10-20 137.847 108.432110 29.414890
2019-10-27 84.999 84.006054 0.992946
2019-11-03 76.744 99.242296 -22.498296
2019-11-10 175.708 164.413570 11.294430
2019-11-17 77.927 92.077521 -14.150521
2019-11-24 62.805 74.425502 -11.620502
2019-12-01 74.079 81.382111 -7.303111
2019-12-08 167.481 164.427999 3.053001
2019-12-15 86.248 95.116618 -8.868618
2019-12-22 73.206 65.763267 7.442733
2019-12-29 35.655 81.148805 -45.493805
2020-01-05 4.947 113.466587 -108.519587
# Model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  1323.1768701198491
RMSE =  36.375498211293944
MAE =  22.55437004082181
# Plot results

train_set['kwh'].rename('Training Set').plot(legend=True)
test_set['kwh'].rename('Test Set').plot(legend=True)
SARIMApredictions.plot(legend = True)
plt.show()

png

y_test
Date
2019-10-20    137.847
2019-10-27     84.999
2019-11-03     76.744
2019-11-10    175.708
2019-11-17     77.927
2019-11-24     62.805
2019-12-01     74.079
2019-12-08    167.481
2019-12-15     86.248
2019-12-22     73.206
2019-12-29     35.655
2020-01-05      4.947
Freq: W-SUN, Name: kwh, dtype: float64

Let us use SARIMAX - Seasonal ARIMA with eXogenous Variable

model = SARIMAX(endog=train_set['kwh'],exog=train_set['Temp_avg'],order=(1,0,0),seasonal_order=(2,0,0,7),enforce_invertibility=False)
results = model.fit()
results.summary()
SARIMAX Results
Dep. Variable: kwh No. Observations: 145
Model: SARIMAX(1, 0, 0)x(2, 0, 0, 7) Log Likelihood -732.174
Date: Fri, 10 Nov 2023 AIC 1474.348
Time: 22:42:32 BIC 1489.232
Sample: 01-08-2017 HQIC 1480.396
- 10-13-2019
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
Temp_avg 2.1916 0.096 22.781 0.000 2.003 2.380
ar.L1 0.6524 0.064 10.202 0.000 0.527 0.778
ar.S.L7 -0.2007 0.089 -2.261 0.024 -0.375 -0.027
ar.S.L14 -0.0945 0.090 -1.048 0.294 -0.271 0.082
sigma2 1414.9974 193.794 7.302 0.000 1035.169 1794.826
Ljung-Box (L1) (Q): 0.20 Jarque-Bera (JB): 2.42
Prob(Q): 0.65 Prob(JB): 0.30
Heteroskedasticity (H): 1.42 Skew: 0.30
Prob(H) (two-sided): 0.23 Kurtosis: 2.80

Warnings:[1] Covariance matrix calculated using the outer product of gradients (complex-step).

# Create and fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model_SARIMAX = SARIMAX(train_set['kwh'], exog = train_set['Temp_avg'], order=order,seasonal_order=seasonal_order)
model_SARIMAX = model_SARIMAX.fit()
# model_SARIMAX.summary()
model_SARIMAX.params
Temp_avg      4.494951
ar.L1         0.994501
ma.L1        -0.672744
ar.S.L52      0.996258
ma.S.L52     -0.936537
sigma2      680.303411
dtype: float64

# Create SARIMAX predictions

exog = test_set[['Temp_avg']]
start=len(train_set)
end=len(train_set)+len(test_set)-1
SARIMAXpredictions = model_SARIMAX.predict(start=start, end=end, exog = exog, dynamic=False, typ='levels').rename('SARIMAX Predictions')
# Calculate Evaluation Metrics 

y_test = test_set['kwh']
y_pred = SARIMAXpredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})




y_test y_pred diff
2019-10-20 137.847 135.762630 2.084370
2019-10-27 84.999 90.884021 -5.885021
2019-11-03 76.744 81.577953 -4.833953
2019-11-10 175.708 172.860135 2.847865
2019-11-17 77.927 35.830430 42.096570
2019-11-24 62.805 89.152707 -26.347707
2019-12-01 74.079 120.706976 -46.627976
2019-12-08 167.481 150.329857 17.151143
2019-12-15 86.248 100.167290 -13.919290
2019-12-22 73.206 22.178695 51.027305
2019-12-29 35.655 96.057753 -60.402753
2020-01-05 4.947 -156.713632 161.660632

# Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  3132.1077834193397
RMSE =  55.96523727653926
MAE =  36.240382161713455
# Plot results

train_set['kwh'].rename('Training Set').plot(legend=True)
test_set['kwh'].rename('Test Set').plot(legend=True)
SARIMAXpredictions.plot(legend = True)
plt.show()

png

y_test
Date
2019-10-20    137.847
2019-10-27     84.999
2019-11-03     76.744
2019-11-10    175.708
2019-11-17     77.927
2019-11-24     62.805
2019-12-01     74.079
2019-12-08    167.481
2019-12-15     86.248
2019-12-22     73.206
2019-12-29     35.655
2020-01-05      4.947
Freq: W-SUN, Name: kwh, dtype: float64

FB Prophet

https://facebook.github.io/prophet/

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.

It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

An advantage of Prophet over many other methods is that it can detect and account for multiple layers of seasonality in the same dataset. (What does this mean? Data may have hourly, weekly and monthly seasonality, all distinct from each other. Many other methods estimate seasonality only at the granularity of the data, i.e., to get monthly seasonality you need to provide monthly data by resampling.)
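
As a minimal sketch of what this looks like in code (df_hourly here is an assumed dataframe with Prophet's required 'ds' and 'y' columns, like the df2 we build later in this notebook), Prophet fits its standard components automatically and lets you add further seasonal layers explicitly:

from prophet import Prophet

m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True)
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)  # an extra, explicit layer
# m.fit(df_hourly)  # df_hourly is assumed; all layers are estimated from the same data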

# Python
import pandas as pd
from prophet import Prophet

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
train_set.head()
Total Moving_Avg_6m Moving_Avg_12m EWMA12
Date
2013-09-30 80729.0 97184.000000 74734.583333 82750.448591
2013-10-31 81352.0 98743.000000 76039.333333 82535.302654
2013-11-30 59270.0 90525.666667 76757.916667 78956.025322
2013-12-31 43553.0 81237.833333 77356.583333 73509.406042
2014-01-31 59873.0 71554.333333 78605.666667 71411.497420
# Create the ds and y columns for Prophet
train_set_prophet = train_set.reset_index()
train_set_prophet = train_set_prophet[['Date', 'Total']]
train_set_prophet.columns = ['ds', 'y']
train_set_prophet.head()
ds y
0 2013-09-30 80729.0
1 2013-10-31 81352.0
2 2013-11-30 59270.0
3 2013-12-31 43553.0
4 2014-01-31 59873.0
model = Prophet()
model.fit(train_set_prophet)
22:42:37 - cmdstanpy - INFO - Chain [1] start processing
22:42:37 - cmdstanpy - INFO - Chain [1] done processing





<prophet.forecaster.Prophet at 0x14408a54c10>
future = model.make_future_dataframe(periods=15,freq = 'm')
future.tail()
ds
70 2019-07-31
71 2019-08-31
72 2019-09-30
73 2019-10-31
74 2019-11-30
future.shape
(75, 1)
# Python
forecast = model.predict(future)
forecast.columns
Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'yearly', 'yearly_lower', 'yearly_upper', 'multiplicative_terms',
       'multiplicative_terms_lower', 'multiplicative_terms_upper', 'yhat'],
      dtype='object')
forecast.head()
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2013-09-30 83323.589562 85819.953093 99008.026474 83323.589562 83323.589562 9120.391098 9120.391098 9120.391098 9120.391098 9120.391098 9120.391098 0.0 0.0 0.0 92443.980660
1 2013-10-31 83286.808660 73361.637478 86911.006692 83286.808660 83286.808660 -2989.685567 -2989.685567 -2989.685567 -2989.685567 -2989.685567 -2989.685567 0.0 0.0 0.0 80297.123093
2 2013-11-30 83251.214239 53784.220687 66821.426791 83251.214239 83251.214239 -23011.327489 -23011.327489 -23011.327489 -23011.327489 -23011.327489 -23011.327489 0.0 0.0 0.0 60239.886750
3 2013-12-31 83214.433337 36997.798701 50727.126628 83214.433337 83214.433337 -39468.043632 -39468.043632 -39468.043632 -39468.043632 -39468.043632 -39468.043632 0.0 0.0 0.0 43746.389705
4 2014-01-31 83177.652434 49717.718616 62756.802402 83177.652434 83177.652434 -26943.934360 -26943.934360 -26943.934360 -26943.934360 -26943.934360 -26943.934360 0.0 0.0 0.0 56233.718075
preds = pd.DataFrame({'Prediction': forecast.yhat[-15:]})
preds.index = pd.to_datetime(forecast.ds[-15:])
preds.index.names = ['Date']
preds
Prediction
Date
2018-09-30 96887.285693
2018-10-31 88140.126499
2018-11-30 62980.805485
2018-12-31 50953.542730
2019-01-31 62367.352742
2019-02-28 56888.480539
2019-03-31 76208.358529
2019-04-30 86669.010005
2019-05-31 122794.851343
2019-06-30 120372.674873
2019-07-31 128458.704108
2019-08-31 114207.558640
2019-09-30 101186.010180
2019-10-31 95354.181624
2019-11-30 64717.364105
# Calculate Evaluation Metrics

y_test = test_set['Total'] 
y_pred = preds['Prediction']
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
Date
2018-09-30 96242.0 96887.285693 -645.285693
2018-10-31 90982.0 88140.126499 2841.873501
2018-11-30 68431.0 62980.805485 5450.194515
2018-12-31 46941.0 50953.542730 -4012.542730
2019-01-31 72883.0 62367.352742 10515.647258
2019-02-28 36099.0 56888.480539 -20789.480539
2019-03-31 85457.0 76208.358529 9248.641471
2019-04-30 87932.0 86669.010005 1262.989995
2019-05-31 129123.0 122794.851343 6328.148657
2019-06-30 132512.0 120372.674873 12139.325127
2019-07-31 137714.0 128458.704108 9255.295892
2019-08-31 142414.0 114207.558640 28206.441360
2019-09-30 112174.0 101186.010180 10987.989820
2019-10-31 104498.0 95354.181624 9143.818376
2019-11-30 84963.0 64717.364105 20245.635895
# Model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  157807682.19504696
RMSE =  12562.15276913344
MAE =  10071.55405527476
# Plot results

train_set['Total'].rename('Training Set').plot(legend=True)
test_set['Total'].rename('Test Set').plot(legend=True)
preds.Prediction.plot(legend = True)
plt.show()

png

model.plot_components(forecast);

png

Modeling Hourly Data with Prophet

df = pd.read_csv('https://data.seattle.gov/api/views/65db-xm6k/rows.csv')
df
Date Fremont Bridge Sidewalks, south of N 34th St Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk
0 08/01/2022 12:00:00 AM 23.0 7.0 16.0
1 08/01/2022 01:00:00 AM 12.0 5.0 7.0
2 08/01/2022 02:00:00 AM 3.0 0.0 3.0
3 08/01/2022 03:00:00 AM 5.0 2.0 3.0
4 08/01/2022 04:00:00 AM 10.0 2.0 8.0
... ... ... ... ...
95635 08/31/2023 07:00:00 PM 224.0 72.0 152.0
95636 08/31/2023 08:00:00 PM 142.0 59.0 83.0
95637 08/31/2023 09:00:00 PM 67.0 35.0 32.0
95638 08/31/2023 10:00:00 PM 43.0 18.0 25.0
95639 08/31/2023 11:00:00 PM 12.0 8.0 4.0

95640 rows × 4 columns

# Rename the columns to make them simpler to use
df.columns = ['Date', 'Total', 'East', 'West']
# Convert the Date column to datetime
df['Date'] = pd.DatetimeIndex(df.Date)
df.head()
Date Total East West
0 2022-08-01 00:00:00 23.0 7.0 16.0
1 2022-08-01 01:00:00 12.0 5.0 7.0
2 2022-08-01 02:00:00 3.0 0.0 3.0
3 2022-08-01 03:00:00 5.0 2.0 3.0
4 2022-08-01 04:00:00 10.0 2.0 8.0
# Create the ds and y columns for Prophet

df2 = df[['Date', 'Total']]
df2.columns = ['ds', 'y']
df2
ds y
0 2022-08-01 00:00:00 23.0
1 2022-08-01 01:00:00 12.0
2 2022-08-01 02:00:00 3.0
3 2022-08-01 03:00:00 5.0
4 2022-08-01 04:00:00 10.0
... ... ...
95635 2023-08-31 19:00:00 224.0
95636 2023-08-31 20:00:00 142.0
95637 2023-08-31 21:00:00 67.0
95638 2023-08-31 22:00:00 43.0
95639 2023-08-31 23:00:00 12.0

95640 rows × 2 columns

# Keep only pre-Covid data (drop everything from 2019-12-31 onward)

df_precovid = df2[df2.ds < pd.to_datetime('2019-12-31')]
%%time
model = Prophet()
model.fit(df_precovid)
22:42:53 - cmdstanpy - INFO - Chain [1] start processing
22:43:07 - cmdstanpy - INFO - Chain [1] done processing


CPU times: total: 1.7 s
Wall time: 19.3 s





<prophet.forecaster.Prophet at 0x144193be390>
future = model.make_future_dataframe(periods=365)
future.tail()
ds
63829 2020-12-25 23:00:00
63830 2020-12-26 23:00:00
63831 2020-12-27 23:00:00
63832 2020-12-28 23:00:00
63833 2020-12-29 23:00:00
type(future)
pandas.core.frame.DataFrame
future.shape
(63834, 1)
%%time
forecast = model.predict(future)
CPU times: total: 2.3 s
Wall time: 8.74 s
type(forecast)
pandas.core.frame.DataFrame
forecast.shape
(63834, 22)
forecast.columns
Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'daily', 'daily_lower', 'daily_upper', 'weekly', 'weekly_lower',
       'weekly_upper', 'yearly', 'yearly_lower', 'yearly_upper',
       'multiplicative_terms', 'multiplicative_terms_lower',
       'multiplicative_terms_upper', 'yhat'],
      dtype='object')
# forecast.to_excel('forecast.xlsx')
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
ds yhat yhat_lower yhat_upper
63829 2020-12-25 23:00:00 -16.020172 -142.659821 105.518723
63830 2020-12-26 23:00:00 -67.037196 -191.451418 42.315426
63831 2020-12-27 23:00:00 -22.832529 -145.837458 87.747590
63832 2020-12-28 23:00:00 27.544487 -85.548906 145.626637
63833 2020-12-29 23:00:00 26.487504 -87.689924 157.621258
# Python
fig1 = model.plot(forecast, include_legend=True)

png

# Python
fig2 = model.plot_components(forecast)

png


Deep Learning - Timeseries prediction using RNNs

Prediction with a deep neural network:
1. Split the data into train and test.
2. Decide how many time periods (the lag) to use for each prediction, and divide the data into input sequences and targets using TimeseriesGenerator from TensorFlow (see the small sketch after this list).
3. Create a model and predict the first observation.
4. Then predict subsequent observations, feeding each earlier prediction back into the input sequence.
5. Repeat to get as many predictions as you need (typically equal to the test set size). Evaluate the model.
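
Here is a small sketch, on a toy series independent of our data, of what step 2 produces: sliding windows of length lag, each paired with the value that follows it. TimeseriesGenerator (used below) builds exactly these pairs, batch by batch.

import numpy as np

series = np.arange(10, dtype=float)   # 0, 1, ..., 9
lag = 3
X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
y = series[lag:]
print(X[0], '->', y[0])   # [0. 1. 2.] -> 3.0
print(len(X))             # 10 - 3 = 7 windows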

# We use the same train and test sets that were created in the previous example
# Let us look at the dates in the train_set.  

train_set.index
DatetimeIndex(['2013-09-30', '2013-10-31', '2013-11-30', '2013-12-31',
               '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
               '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31',
               '2014-09-30', '2014-10-31', '2014-11-30', '2014-12-31',
               '2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30',
               '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-31', '2015-11-30', '2015-12-31',
               '2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31'],
              dtype='datetime64[ns]', name='Date', freq='M')
train_set = train_set[['Total']]
test_set = test_set[['Total']]
# We will do standard scaling as we are using a neural net

from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
# Ignore any warning here; it is just converting the values to floats.
# We fit the scaler only on the training data; fitting on the test set would leak information about it.

std_scaler.fit(train_set)
StandardScaler()
# Now we transform using the standard scaler instantiated above

std_train = std_scaler.transform(train_set)
std_test = std_scaler.transform(test_set)
std_train.shape
(60, 1)
std_test.shape
(15, 1)

Create Sequences

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# define generator
lag = 12
batch_size = 1
sequences = TimeseriesGenerator(std_train, std_train, length=lag, batch_size=batch_size)
# Let us see what our standard scaled training data looks like.

std_train
array([[-0.06864305],
       [-0.0450607 ],
       [-0.88092804],
       [-1.47586179],
       [-0.85810276],
       [-1.34443658],
       [-0.72103747],
       [ 0.16324371],
       [ 1.36654898],
       [ 1.07368123],
       [ 1.44320106],
       [ 1.13360234],
       [ 0.56838311],
       [ 0.02428578],
       [-0.96723261],
       [-1.28833861],
       [-0.82944812],
       [-0.90405615],
       [-0.43146292],
       [ 0.04370431],
       [ 0.955126  ],
       [ 1.18004783],
       [ 1.14457968],
       [ 0.78766485],
       [ 0.32544331],
       [ 0.01743441],
       [-0.97942124],
       [-1.45924438],
       [-1.16622522],
       [-0.83887349],
       [-0.48218578],
       [ 0.42003766],
       [ 1.1967788 ],
       [ 0.94914525],
       [ 0.87593777],
       [ 1.12943852],
       [ 0.43964545],
       [-0.47919541],
       [-0.69821218],
       [-1.6505907 ],
       [-1.23920557],
       [-1.53460946],
       [-0.9007251 ],
       [-0.53483914],
       [ 1.00486469],
       [ 0.95611018],
       [ 1.37639073],
       [ 1.42499383],
       [ 0.52825905],
       [ 0.21199822],
       [-0.94096271],
       [-1.38845949],
       [-0.90663015],
       [-1.20619786],
       [-0.19904623],
       [-0.098244  ],
       [ 1.78932782],
       [ 1.15839598],
       [ 1.72138189],
       [ 1.10782453]])
len(std_train)
60
len(sequences)  # 60 training points minus a lag of 12 leaves 48 sequences
48
# What does the first batch look like?
X,y = sequences[0]
X
array([[[-0.06864305],
        [-0.0450607 ],
        [-0.88092804],
        [-1.47586179],
        [-0.85810276],
        [-1.34443658],
        [-0.72103747],
        [ 0.16324371],
        [ 1.36654898],
        [ 1.07368123],
        [ 1.44320106],
        [ 1.13360234]]])
y
array([[0.56838311]])
# What does the second batch look like?
X,y = sequences[1]
X
array([[[-0.0450607 ],
        [-0.88092804],
        [-1.47586179],
        [-0.85810276],
        [-1.34443658],
        [-0.72103747],
        [ 0.16324371],
        [ 1.36654898],
        [ 1.07368123],
        [ 1.44320106],
        [ 1.13360234],
        [ 0.56838311]]])
y
array([[0.02428578]])

Create the Model

# First, some library imports

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import tensorflow as tf
# We will go with a very simple architecture, a sequential model 
# with just one hidden LSTM layer. Define model:  

model = Sequential()
model.add(LSTM(100, input_shape=(lag, 1)))  # (timesteps, features): 12 lags, 1 feature per step
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 100)               40800

 dense (Dense)               (None, 1)                 101

=================================================================
Total params: 40901 (159.77 KB)
Trainable params: 40901 (159.77 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
# fit model

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=4)
history = model.fit(sequences, epochs=50, callbacks=[callback])  # fit() accepts the generator directly; fit_generator is deprecated
Epoch 1/50
48/48 [==============================] - 2s 7ms/step - loss: 0.5581
Epoch 2/50
48/48 [==============================] - 0s 7ms/step - loss: 0.2086
Epoch 3/50
48/48 [==============================] - 0s 5ms/step - loss: 0.2686
Epoch 4/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1574
Epoch 5/50
48/48 [==============================] - 0s 4ms/step - loss: 0.1408
Epoch 6/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1744
Epoch 7/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1421
Epoch 8/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1696
Epoch 9/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1318
Epoch 10/50
48/48 [==============================] - 0s 4ms/step - loss: 0.1321
Epoch 11/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1318
Epoch 12/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1423
Epoch 13/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1208
Epoch 14/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1472
Epoch 15/50
48/48 [==============================] - 0s 4ms/step - loss: 0.1156
Epoch 16/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1262
Epoch 17/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1403
Epoch 18/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1193
Epoch 19/50
48/48 [==============================] - 0s 7ms/step - loss: 0.1363
loss_per_epoch = model.history.history['loss']
plt.plot(range(len(loss_per_epoch)),loss_per_epoch)
[<matplotlib.lines.Line2D at 0x14490d21310>]

png

Evaluate on Test Data

As part of our training, we used the preceding 12 values to predict the next one. Now we will use the last 12 values of the training data to predict the first value of the test data. Then we will add this predicted value to our sequence, and predict the next one, and so on.

lag
12
# Pick the last 12 values in the training data

predictors = std_train[-lag:]
# See what they look like...

predictors
array([[ 0.52825905],
       [ 0.21199822],
       [-0.94096271],
       [-1.38845949],
       [-0.90663015],
       [-1.20619786],
       [-0.19904623],
       [-0.098244  ],
       [ 1.78932782],
       [ 1.15839598],
       [ 1.72138189],
       [ 1.10782453]])

We need to reshape this data for the LSTM layer, which expects a 3-dimensional input of shape (batch, timesteps, features). We do that next.

predictors = predictors.reshape((1, lag, 1))

Next, we perform the prediction to get the first item after the training data has ended. This prediction will come in as standard-scaled, so we will need to reverse out the scaling to get the true prediction.

# Predict 

x = model.predict(predictors)
x
1/1 [==============================] - 0s 414ms/step





array([[0.45122966]], dtype=float32)
# Do an inverse transform to get the actual prediction

std_scaler.inverse_transform(x)
array([[94463.03]], dtype=float32)
# Let us see what the actual value in the test set was.

test_set.iloc[0]
Total    96242.0
Name: 2018-09-30 00:00:00, dtype: float64

Now we can predict the next data point. However, doing this manually is going to be very tedious, so we write some code to loop through this and do it for as many times as we need the predictions for.

predictors = std_train[-lag:].reshape((1, lag, 1))
predictions = []
for i in range(len(std_test)):
    next_pred = model.predict(predictors)[0]
    predictions.append(next_pred) 
    predictors = np.append(predictors[:,1:,:],[[next_pred]],axis=1)

predictions
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 27ms/step
1/1 [==============================] - 0s 30ms/step
1/1 [==============================] - 0s 26ms/step
1/1 [==============================] - 0s 35ms/step
1/1 [==============================] - 0s 27ms/step
1/1 [==============================] - 0s 32ms/step
1/1 [==============================] - 0s 29ms/step
1/1 [==============================] - 0s 30ms/step
1/1 [==============================] - 0s 28ms/step
1/1 [==============================] - 0s 33ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 30ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 25ms/step





[array([0.45122966], dtype=float32),
 array([-0.3290154], dtype=float32),
 array([-1.0449202], dtype=float32),
 array([-1.4312159], dtype=float32),
 array([-1.4465744], dtype=float32),
 array([-1.1398427], dtype=float32),
 array([-0.48234788], dtype=float32),
 array([0.2781088], dtype=float32),
 array([1.047395], dtype=float32),
 array([1.4374696], dtype=float32),
 array([1.5440987], dtype=float32),
 array([1.1877004], dtype=float32),
 array([0.53592247], dtype=float32),
 array([-0.24500774], dtype=float32),
 array([-0.94227564], dtype=float32)]
# We inverse transform these scaled predictions

final_predictions = std_scaler.inverse_transform(predictions)
final_predictions
array([[ 94463.03240163],
       [ 73850.4654616 ],
       [ 54937.64395903],
       [ 44732.45864597],
       [ 44326.71497745],
       [ 52429.97380651],
       [ 69799.71784522],
       [ 89889.51390621],
       [110212.56840795],
       [120517.58586778],
       [123334.52152427],
       [113919.16595602],
       [ 96700.45268135],
       [ 76069.78565349],
       [ 57649.31496236]])
# Now let us plot the actuals versus the predicted values

pd.DataFrame({'actual': test_set.Total, 'forecast': final_predictions.flatten()}).plot()
<Axes: xlabel='Date'>

png

# Next, we evaluate our model using some metrics

y_test = test_set.Total
y_pred = final_predictions.flatten()

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  341904917.8754553
RMSE =  18490.671103977144
MAE =  16140.603957905736

Which method performed best?

Here is a summary of the model performance above. The exact numbers will vary from run to run, but they give a reasonable picture of how the methods compare (a small sketch collecting them into one table follows the list).

EWMA
MSE = 1228535508.79919
RMSE = 35050.47087842316
MAE = 29263.82507577501

DES
MSE = 6789777046.197141
RMSE = 82400.10343559734
MAE = 75161.63066121914

TES
MSE = 160014279.71049073
RMSE = 12649.675083198412
MAE = 10899.78971069559

ARIMA no seasonality
MSE = 385065540.80112064
RMSE = 19623.086933536237
MAE = 17158.523051864664

SARIMA
MSE = 231993016.8844547
RMSE = 15231.316978004716
MAE = 12576.568673608848

Prophet
MSE = 157807679.90652037
RMSE = 12562.152678045286
MAE = 10071.553953152606

Deep Learning
MSE = 449731703.1802
RMSE = 21206.87867603811
MAE = 18007.781001223593
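
For convenience, here is a small sketch (using the rounded figures recorded above) that collects the RMSE and MAE numbers into a single table:

import pandas as pd

summary = pd.DataFrame({
    'RMSE': [35050.47, 82400.10, 12649.68, 19623.09, 15231.32, 12562.15, 21206.88],
    'MAE':  [29263.83, 75161.63, 10899.79, 17158.52, 12576.57, 10071.55, 18007.78],
}, index=['EWMA', 'DES', 'TES', 'ARIMA', 'SARIMA', 'Prophet', 'Deep Learning'])
print(summary.sort_values('RMSE'))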



Experiment - Can we predict the S&P 500?

# Let us get some data.  We download the daily time series for the S&P500 for 30 months

import yfinance as yf
SPY = yf.download('SPY', start = '2013-01-01', end = '2015-06-30')
[*********************100%%**********************]  1 of 1 completed
# Clean up

SPY.index = pd.DatetimeIndex(SPY.index) # Set index
SPY = SPY.asfreq('B') # This creates rows for any missing dates
SPY.fillna(method = 'bfill', inplace=True) # Backfills rows created for missing dates with the next available observation
SPY.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 649 entries, 2013-01-02 to 2015-06-29
Freq: B
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       649 non-null    float64
 1   High       649 non-null    float64
 2   Low        649 non-null    float64
 3   Close      649 non-null    float64
 4   Adj Close  649 non-null    float64
 5   Volume     649 non-null    float64
dtypes: float64(6)
memory usage: 35.5 KB
SPY['Returns'] = (SPY['Close'] / SPY['Close'].shift(1)) - 1  # Daily return: today's close over yesterday's close, minus 1
SPY[['Returns']].plot(figsize = (22,6));

png

SPY[['Close']].plot(figsize = (22,6));

png

# Train-test split
train_samples = int(SPY.shape[0] * 0.8)

train_set = SPY.iloc[:train_samples]
test_set = SPY.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  519
Test set:  130
model = auto_arima(train_set['Close'], seasonal=True , m = 12)
order = model.get_params()['order']
seasonal_order = model.get_params()['seasonal_order']

print('Order = ', order)
print('Seasonal Order = ', seasonal_order)
Order =  (0, 1, 0)
Seasonal Order =  (0, 0, 0, 12)
model = ARIMA(train_set['Close'],order = order)
results = model.fit()
results.summary()

start=len(train_set)
end=len(train_set)+len(test_set)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('ARIMA Predictions')

train_set['Close'].rename('Training Set').plot(legend=True)
test_set['Close'].rename('Test Set').plot(legend=True)
predictions.plot(legend = True)
plt.show()

png

# Calculate Evaluation Metrics
y_test = test_set['Close']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2014-12-30 207.600006 208.720001 -1.119995
2014-12-31 205.539993 208.720001 -3.180008
2015-01-01 205.429993 208.720001 -3.290009
2015-01-02 205.429993 208.720001 -3.290009
2015-01-05 201.720001 208.720001 -7.000000
... ... ... ...
2015-06-23 212.039993 208.720001 3.319992
2015-06-24 210.500000 208.720001 1.779999
2015-06-25 209.860001 208.720001 1.139999
2015-06-26 209.820007 208.720001 1.100006
2015-06-29 205.419998 208.720001 -3.300003

130 rows × 3 columns

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  11.971797943699102
RMSE =  3.460028604462556
MAE =  2.769538996769832
plt.rcParams['figure.figsize'] = (20, 9)
# Create and fit model
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train_set['Close'],order=order,seasonal_order=seasonal_order)
results = model.fit()


start=len(train_set)
end=len(train_set)+len(test_set)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA Predictions')

train_set['Close'].rename('Training Set').plot(legend=True)
test_set['Close'].rename('Test Set').plot(legend=True)
predictions.plot(legend = True)
plt.show()

png

# Calculate Evaluation Metrics
y_test = test_set['Close']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2014-12-30 207.600006 208.720001 -1.119995
2014-12-31 205.539993 208.720001 -3.180008
2015-01-01 205.429993 208.720001 -3.290009
2015-01-02 205.429993 208.720001 -3.290009
2015-01-05 201.720001 208.720001 -7.000000
... ... ... ...
2015-06-23 212.039993 208.720001 3.319992
2015-06-24 210.500000 208.720001 1.779999
2015-06-25 209.860001 208.720001 1.139999
2015-06-26 209.820007 208.720001 1.100006
2015-06-29 205.419998 208.720001 -3.300003

130 rows × 3 columns

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  11.971797943699102
RMSE =  3.460028604462556
MAE =  2.769538996769832